PEP for the Python package index JSON API?

pelson · November 11, 2020, 4:50pm

PEP 503 defines the “Simple API” exposed by Warehouse. That same Warehouse documentation describes the “JSON API”, but I can’t find the associated PEP.

I’m interested because I’m filing a bug against some package index proxying software (Nexus) about its JSON API implementation (honestly, they just proxy through to PyPI and should re-write a few URLs).

Assuming that the PEP doesn’t exist, and that generally the JSON API is preferred, does it make sense either to have a new PEP for the JSON API or simply to clarify in PEP 503 that there is another scheme, and point to the canonical implementation in Warehouse?

Cheers,

dustin · November 11, 2020, 5:09pm

Hey Phil, thanks for starting this discussion and welcome to the forum.

The PEP doesn’t exist, and thus the current API probably shouldn’t be considered “standardized”.

PyPI has been planning for a while to replace this with a standardized API (see Determine new API URL structure for warehouse (starting with new JSON API) · Issue #284 · pypi/warehouse · GitHub) but this is a huge endeavor: PyPI has so many users that as soon as a new API is released, we immediately have users depending on it which we need to support indefinitely.

I think it probably makes the most sense to just point to the most widely accepted implementation (PyPI) but I’m curious what specifically the discrepancy/bug is here.

pelson · November 11, 2020, 6:57pm

Thanks for the warm -and quick- welcome!

I think it probably makes the most sense to just point to the most widely accepted implementation

I agree in principle, but from an outsider’s perspective the PEP for the simple API gives it more standing than the non-standardized JSON API. In essence, the existence of PEP 503 makes it harder to argue in favour of implementing both the simple and the JSON APIs.

I’m curious what specifically the discrepancy/bug is here

This is a very pertinent question. As I recall, it was pip search (which I know has a colourful history) that was being used as part of the API of ensurepip-upgrade (which I no longer use; I prefer a deterministic approach to upgrading ensurepip, but that is a whole other thread).

honestly, they just proxy through to PyPI and should re-write a few URLs

I was (unintentionally) being very disingenuous here as the software also implements a repository service as well as a proxy. So really, to support the JSON API properly they’d need to implement the JSON API for their stored artefacts too.

thus the current API probably shouldn’t be considered “standardized”

To what extent is this a problem? In the thread I linked you said that “pretty much nothing uses /simple/”, which at the time of reading I read as “Simple API”, but upon reflection I guess you literally meant that endpoint (which lists all packages on PyPI in a non-paginated form, and is extremely slow).
Given pretty-much everything else “just works” with pip with a Nexus-based index, perhaps pip is only using the JSON API for search, and everything else is a specific endpoint of the simple API (just not the top level one /simple/)?
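
For anyone comparing the two APIs, here is a rough sketch of how the per-project endpoints differ. It assumes the URL layout pypi.org serves (`/simple/<project>/` for the Simple API and `/pypi/<project>/json` for the JSON API); the helper names are made up for illustration and are not part of pip or Warehouse.

```python
import re
from html.parser import HTMLParser


def normalize(name):
    """Normalize a project name the way PEP 503 describes."""
    return re.sub(r"[-_.]+", "-", name).lower()


def simple_project_url(index, project):
    # Per-project page of the Simple API (PEP 503): HTML with one anchor per file.
    return f"{index.rstrip('/')}/simple/{normalize(project)}/"


def json_project_url(index, project):
    # Per-project JSON document as served by pypi.org (no PEP behind it).
    return f"{index.rstrip('/')}/pypi/{project}/json"


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs from the anchors of a Simple API page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # The first non-blank text after an <a> tag is taken as the file name.
        if self._href is not None and data.strip():
            self.links.append((self._href, data.strip()))
            self._href = None


def parse_simple_page(html):
    collector = AnchorCollector()
    collector.feed(html)
    return collector.links
```

An index that only implements the first URL shape can still serve installs; anything that relies on the second shape needs the index to build that JSON document itself.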

kpfleming · November 11, 2020, 9:05pm

Our company uses Artifactory, not Nexus Repository Manager, but this is exactly how it behaves. On the upstream side (talking to pypi.org) it only uses the Simple API. On the downstream side (internal users using pip) it implements the Simple API and only that one element of the JSON API required for pip search.

dustin · November 11, 2020, 9:31pm

It’s a problem mostly because there isn’t a clear standard to point to for implementers, i.e. it’s hard to know if you got it right or not.

Yep, literally meant that endpoint

There is no JSON search API – pip uses the deprecated XML-RPC API for this.

Furthermore, I think the pip maintainers are planning to remove it entirely in the future (Remove the pip search command · Issue #5216 · pypa/pip · GitHub) and the PyPI admins are definitely keen on shutting down the API sooner rather than later, so if the issue is really about pip search, I’m not sure it’s worth the effort.

pf_moore · November 11, 2020, 9:34pm

pip search uses the XML-RPC API, not the JSON API. The XML-RPC API is even less standardised than the JSON one, and I doubt there’s any intention to standardise it in the foreseeable future. It’s more likely that the important features, like mirroring and maybe search¹, will be migrated into the JSON API at some point, and then the XML-RPC API will be dumped.

I speak purely as a consumer of the XML-RPC API, so don’t take any of the above as anything but speculation…

¹ Although the XML-RPC search API gives notoriously bad results, so maybe there will be a complete redesign instead.

kpfleming · January 25, 2021, 3:21pm

This week we’re deploying an internal tool which needs to monitor the ‘updates’ feed from pypi.org, and I almost used the XML-RPC API because it’s much more efficient and less likely to miss updates, but the docs telling me not to do that scared me off.

As a result we’re polling the top-level RSS feed for ‘updates’ every 15 seconds, and that should work except in cases where there are more than ~40 new packages uploaded in a 15 second period.
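
A minimal sketch of that polling approach, assuming the feed is RSS 2.0 and that each item's `<link>` identifies an entry uniquely (an assumption on my part, not something Warehouse promises). `FeedTracker` is a made-up name. It also flags the failure mode mentioned above: if a poll shares no entries with the previous one, some updates may have scrolled out of the feed in between.

```python
import xml.etree.ElementTree as ET


def item_links(rss_xml):
    """Return the <link> of every <item> in an RSS 2.0 document, in feed order."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("link", default="") for item in root.iter("item")]


class FeedTracker:
    """Remember the previous poll so new entries and possible gaps can be found."""

    def __init__(self):
        self._previous = []

    def update(self, rss_xml):
        """Return (new_links, may_have_missed) for one poll result."""
        current = item_links(rss_xml)
        seen = set(self._previous)
        new = [link for link in current if link not in seen]
        # No overlap with the previous window means entries could have been
        # pushed out of the feed between polls. This is only a hint: it also
        # fires when the window was replaced exactly, with nothing lost.
        may_have_missed = bool(self._previous) and bool(current) and len(new) == len(current)
        self._previous = current
        return new, may_have_missed
```

In a real tool the feed body would come from an HTTP GET of the updates feed inside a loop that sleeps 15 seconds between polls; that part is left out so the logic can be tried offline.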

westurner · January 25, 2021, 5:49pm

Every 15 seconds is 5,760 HTTP requests per day.

How many updates per day are there?

Could the BigQuery API be less expensive for you?

kpfleming · January 25, 2021, 6:26pm

Unfortunately the problem we’re solving is the invalidation of some internal caches so that users can see the newly-released versions on pypi.org, so delaying the cache invalidation defeats the purpose.

westurner · January 25, 2021, 7:41pm

That seems like quite a few requests for, again, how many actual data updates a day?

I’m not sure what default polling interval Pulp’s Python support uses.

https://pulp-python.readthedocs.io/en/latest/

https://github.com/pulp/pulp_python/blob/master/pulp_python/tests/functional/api/test_download_content.py#L198

https://github.com/pulp/pulp_python/blob/master/pulp_python/app/tasks/sync.py

pf_moore · January 25, 2021, 8:50pm

The problem is that the RSS feed only returns a set number of changes, and if more than that come in between polls, you’ve lost them. Whether losing some updates is an issue depends on the application - it sounds like it is for @kpfleming and it certainly was for my application that uses the XML-RPC changelog API.

I personally couldn’t host something that continually polled, so RSS wasn’t an option for me.

PyPI could make a decision to stop supporting use cases that want to know about every change (although that’s pretty key to mirroring, so I’d be surprised if they do). But assuming they don’t, then the only API that covers that requirement is the XML-RPC changelog API, at the moment.
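
For reference, a rough sketch of consuming that changelog without losing events. It assumes the `changelog_since_serial` and `changelog_last_serial` methods documented for Warehouse's XML-RPC API, with events shaped as `(name, version, timestamp, action, serial)`; the function names here are made up, and the client is passed in so the logic can be exercised without touching the network.

```python
import xmlrpc.client

PYPI_XMLRPC_URL = "https://pypi.org/pypi"


def make_client(url=PYPI_XMLRPC_URL):
    # Creating the proxy does not open a connection; calls on it do.
    return xmlrpc.client.ServerProxy(url)


def fetch_changes(client, last_seen_serial):
    """Return (events, newest_serial) for everything after last_seen_serial."""
    events = client.changelog_since_serial(last_seen_serial)
    # The serial is the fifth field of each event; keep the old value if
    # nothing new came back.
    newest = max((event[4] for event in events), default=last_seen_serial)
    return events, newest


def changed_projects(events):
    """Distinct project names touched by the events, in first-seen order."""
    return list(dict.fromkeys(event[0] for event in events))
```

Because the server is asked for everything after a stored serial, the poll interval only affects latency, not completeness, which is the property the RSS feed lacks. A tool would persist `newest_serial` between runs, starting from `client.changelog_last_serial()` the first time.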

kpfleming · January 25, 2021, 10:31pm

It’s not critical for us, if we miss an update here and there we’ll live with it (there are only a few packages for which we really want to get every update, and for those I’ll modify our tool to watch the feeds specific to those packages). If Warehouse had some ability to push all update notifications to some sort of message queuing platform, to which anyone could then subscribe (at their cost, if necessary), that might be a useful way to attack this part of the problem space, but it’s off-topic here I think.

westurner · January 26, 2021, 9:49am

Yeah, you’d need to name some message queues; maybe with https://pypi.org/project/*/ URIs.

Would a per-release message need to be sent if there are no subscribers to that particular project’s channel/queue? (Push)
And then how do we retrieve the messages that were missed due to e.g. a channel error? (Pull)

Does it make sense to have a more general notification service that supports e.g. Web Notifications, App Notifications, and email? Getting the page or app to update without waiting for a full page refresh is basically the same problem?

Apprise allows you to send a notification to almost all of the most popular notification services available to us today such as: Telegram, Discord, Slack, Amazon SNS, Gotify, etc.

Looks like apprise also supports SMTPS, D-Bus, etc.

westurner · January 27, 2021, 3:20am

These are the RSS feed views:
https://github.com/pypa/warehouse/blob/main/warehouse/rss/views.py

Is there an ETag HTTP header on the views?

It should be easy enough to generate JSON with a window spec within the document range: [seq_start_id, seq_end_id] (and an ETag HTTP header)?
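
Whether the RSS views actually send an ETag is the open question here. Assuming they do, a client-side sketch of a conditional GET could look like this; the function names are made up, and the `opener` parameter exists only so the logic can be tried without network access.

```python
import urllib.error
import urllib.request


def conditional_request(url, etag=None):
    """Build a GET request that lets the server skip an unchanged body."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    return request


def fetch_if_changed(url, etag=None, opener=urllib.request.urlopen):
    """Return (body, etag); body is None when the server answers 304."""
    try:
        with opener(conditional_request(url, etag)) as response:
            return response.read(), response.headers.get("ETag")
    except urllib.error.HTTPError as error:
        if error.code == 304:  # Not Modified: keep using the cached copy
            return None, etag
        raise
```

With this, frequent polling mostly costs a header exchange when nothing has changed, though it does not help with entries that fall out of the feed window between polls.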

sumanah · February 5, 2021, 9:10pm

cc @cooperlees and @woodruffw in case you’d like to weigh in.

sumanah · February 5, 2021, 9:12pm

In Request: PEP to describe current Warehouse JSON API · Issue #367 · pypa/packaging-problems · GitHub @cooperlees has been working on peps/pep-9999.rst at warehouse_json_api · cooperlees/peps · GitHub to try to describe the current Warehouse JSON API, to:

lock in the existing standard as a guarantee for consumers (client applications like pip, pipenv, and more)
help other indexes (such as devpi and pypiserver and Artifactory) to implement the standard and be assured of interoperability

wim · June 15, 2022, 6:06pm

The links in the previous post are dead now, but PEP RFC: Python Package Index (Warehouse) JSON API v1 has updated links.