Backwards Incompatible Change to PyPI JSON API

This is just a heads up to let everyone know we merged a backwards incompatible change to PyPI’s legacy JSON API today. This change means that the JSON responses from /pypi/<project>/<version>/json no longer contain the releases key, which listed the files for every single version of that project on PyPI.

We’ve been debugging intermittent crashing on PyPI (technically it’s not a crash: the health check API stops responding and the memory usage of the containers grows until k8s kills the pod and restarts it), which caused a flurry of 503 errors several times a day.

We believe we’ve traced this back to the legacy JSON API, and how it had several large (MBs to tens of MBs) responses that all got removed from the CDN at the same time during a release. Every single release of a project increases the size of the legacy JSON API and adds an additional URL that can return this data, compounding the problem.

To remedy this, we’ve made the /pypi/<project>/<version>/json URLs drop the releases key and only contain the data specific to <version>. The urls key still exists and contains all of the files for that release.
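For concreteness, here is a minimal sketch of what a client of the versioned endpoint sees after this change, using only the standard library; the project name and version below are placeholder values, not anything specific:

```python
# Minimal sketch: fetch the versioned legacy JSON API and read the "urls"
# key, which lists the files for exactly that version. The "releases" key
# is no longer present in this response. Project/version are placeholders.
import json
from urllib.request import urlopen

def files_for_version(project: str, version: str) -> list[dict]:
    with urlopen(f"https://pypi.org/pypi/{project}/{version}/json") as resp:
        data = json.load(resp)
    return data["urls"]  # sdists and wheels for this one version only

for file_info in files_for_version("example-project", "1.0.0"):
    print(file_info["filename"], file_info["url"])
```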

The /pypi/<project>/json URL still has the releases key that contains all of the files. While we would like to eventually remove the releases key from this URL too, we’ve identified popular software that currently relies on it. Ideally that software will be updated to use the PEP 691 JSON API instead, and in any case should treat the releases key as deprecated.
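For anyone weighing that migration, here is a rough sketch of a PEP 691 request; the Accept header is what selects the JSON form of the Simple API, and the project name is a placeholder:

```python
# Rough sketch of fetching the PEP 691 JSON Simple API rather than relying
# on the legacy "releases" key. Note (as discussed below) this response has
# no upload times. Project name is a placeholder.
import json
from urllib.request import Request, urlopen

def simple_files(project: str) -> list[dict]:
    req = Request(
        f"https://pypi.org/simple/{project}/",
        headers={"Accept": "application/vnd.pypi.simple.v1+json"},
    )
    with urlopen(req) as resp:
        data = json.load(resp)
    return data["files"]  # each entry has "filename", "url", "hashes", ...
```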

Due to the caching CDN, you may continue to get the releases key on the version-specific URLs over the next few days, as cached responses are slowly replaced with ones that omit it.


This will break virtualenv’s background wheel update feature - virtualenv/periodic_update.py at main · pypa/virtualenv · GitHub :thinking:

That should be fine currently, as you’re using f"https://pypi.org/pypi/{distribution}/json", which still has the releases key, since that isn’t the versioned URL.

If/when we remove the releases key from the unversioned URL, that would break, but you should be able to fix it by switching to the versioned URL and using the urls key instead.

Ah, I see. Should we migrate the virtualenv logic to PEP 691, or is it fine to keep using the unversioned link?

PEP 691 doesn’t have the upload time, so you won’t be able to use it without a PEP to add that.

For now it’s OK to keep using the unversioned link, but it appears that you know the version of the wheel you’re trying to get the upload time for, so you’d actually be best off using the versioned URL, whose urls key has the same data the releases key had, except scoped to just that version (something like the sketch below).
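Purely for illustration, here is a hedged sketch of that versioned-URL approach for the upload-time lookup; the function and names are hypothetical, not virtualenv’s actual code:

```python
# Hypothetical sketch of the suggested migration: given a known distribution
# and version, read per-file upload times from the versioned endpoint's
# "urls" key instead of the unversioned "releases" key.
import json
from urllib.request import urlopen

def wheel_upload_times(distribution: str, version: str) -> dict[str, str]:
    url = f"https://pypi.org/pypi/{distribution}/{version}/json"
    with urlopen(url) as resp:
        data = json.load(resp)
    return {
        f["filename"]: f["upload_time_iso_8601"]
        for f in data["urls"]
        if f["filename"].endswith(".whl")
    }
```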

That will have you using less bandwidth (since the versioned URLs have smaller responses) and will protect you against a future in which we drop the releases key from the unversioned URL too.

That would be a longer-term thing though, so there’s no great rush to do so.

It appears that the expectation in the longer term is that people will be expected to make multiple requests to the JSON API, each returning smaller payloads. Is that a fair expectation? I’m not particularly familiar with the trade-offs in web APIs, and I’ve generally assumed that it’s better (as a client) to minimise round trips to the server - is that not actually the case?

In particular, I have a script that grabs the metadata for a project and uses the unversioned API because that gives me everything in one go. I simply assumed that was better than using multiple calls to the versioned API, one for each version. (There’s also the fact that I have to use the unversioned API to get the list of all valid versions, so it’s not as if I can avoid using the unversioned API anyway…)
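For concreteness, the pattern being described is roughly the sketch below (not my actual script), where the version list comes from the keys of releases on the unversioned endpoint; the project name is only an example:

```python
# Sketch of the "everything in one go" pattern described above: one request
# to the unversioned endpoint, with the version list taken from the keys of
# the "releases" mapping. Project name is only an example.
import json
from urllib.request import urlopen

def project_metadata(project: str) -> tuple[dict, list[str]]:
    with urlopen(f"https://pypi.org/pypi/{project}/json") as resp:
        data = json.load(resp)
    versions = list(data["releases"])  # one key per released version
    return data["info"], versions

info, versions = project_metadata("pip")
print(info["name"], len(versions), "versions")
```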

It depends!

Technically speaking, if you need all the data, then one request that returns all of the data is the best from the client’s POV. Every round trip has overhead both in time and bandwidth, so by getting it all into one request you minimize that overhead.

The same is true on the server side if all you’re trying to minimize is the total time/bandwidth spent.

However, the flip side is that when you have requests that take a long time (some of these big responses took 2+ seconds), you’re tying up resources on the server, and the web server can only handle so many concurrent connections before it starts to queue new ones. That ends up causing latency to go up for everyone, as people end up sitting in the queue even for requests that would otherwise be fast.

So ultimately, best practice is for your web requests to return quickly, ideally sub-100ms, to provide a better overall experience for everyone: the service will just be more responsive overall, even if it’s ultimately a little slower for someone trying to pull tons of data.

Computing the large JSON response for a project like ccxt, which has 4632 releases, was taking upwards of 2500ms, and since there were 4632 total URLs that could return that large response, each slightly different so they couldn’t be cached the same, that had the potential to tie up all of our web workers for an extended period of time if people were requesting a lot of different URLs that all took ~2s each.

So we needed to reduce the time that most of those URLs took.

We profiled those handlers: roughly 25% of the time was spent parsing the UUIDs that we use as primary keys into Python’s uuid.UUID objects, and another 25% was buried deep inside SQLAlchemy in a function that individually had sub-ms run time, but was being called thousands of times to handle all of the rows.
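As a rough, self-contained illustration of why that adds up (a standalone micro-benchmark, not Warehouse’s actual code): parsing one UUID string is cheap, but doing it once per row over thousands of rows is not free.

```python
# Standalone micro-benchmark illustrating the per-row overhead: constructing
# uuid.UUID from a string is cheap individually, but a response touching
# thousands of rows pays that cost thousands of times per request.
import timeit
import uuid

row_keys = [str(uuid.uuid4()) for _ in range(10_000)]

per_pass = timeit.timeit(lambda: [uuid.UUID(k) for k in row_keys], number=10) / 10
print(f"parsing {len(row_keys):,} UUIDs took {per_pass * 1000:.1f} ms per pass")
```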

I spent a few days trying to find some way to speed up the code there; while I could disable turning the primary keys into uuid.UUID objects, that still left the responses much too slow (and it would have required wide-scale changes throughout the code base, because it’s a global option, not a per-query option).

That left us with two real options:

  • Reduce the scope of the response so that it has to do less work, making it faster.
  • Move that work out of the request/response cycle by precomputing some or all of the response.

For APIs like the Simple Repository API, we’re doing the second option (well, we aren’t yet, but that’s the plan), as that API is very well suited to being generated once, statically, and served as a precomputed response. Particularly since PEP 458 will require that the response be stable, so precomputing it makes it less likely we accidentally invalidate signatures.

That was harder to do for the legacy JSON API, because we would need to precompute a slightly different response for every single release of every single project, so for a project like ccxt that would mean precomputing 4600+ different responses (and we would need to decide what to do in the interim while we’re precomputing them).

So we instead did the first option and reduced the scope, which brought those responses down from ~900ms to ~70ms when running locally on my desktop.

This also has the added benefit that we don’t have to load as many huge responses into memory, so the overall memory usage of the web workers is more stable than before.

Ultimately, this is what our backends looked like before and after:

[graph of backend response times before and after the deploy]

(can you guess when the deploy happened on the graph? :smiley: )

So basically, for the web, the most important thing is that your responses return quickly OR that you have enough web servers to serve your highest concurrent load (which we don’t, because our request patterns are “bursty”: most things are served by Fastly out of the CDN, but a release causes a burst of traffic related to that project, and maintaining capacity for all of those bursts would be a waste of resources).


Just wanted to say: thanks for making this public announcement! Thanks to it, we were able to quickly diagnose our dependency on the releases key in pip-audit and get a new patch version out the door.


Out of curiosity, would you mind sharing this list?