Backwards Incompatible Change to PyPI JSON API

This is just a heads up to let everyone know we merged a backwards incompatible change to PyPI’s legacy JSON API today. This change means that the JSON responses from /pypi/<project>/<version>/json no longer contain the releases key, which listed the files for every single version of that project on PyPI.

We’ve been debugging intermittent crashing on PyPI (technically it’s not a crash: the health check API stops responding and the memory usage of the containers grows until k8s kills the pod and restarts it), which caused a flurry of 503 errors several times a day.

We believe we’ve traced this back to the legacy JSON API, and how it had several large (MBs to tens of MBs) responses that all got removed from the CDN at the same time during a release. Every single release of a project increases the size of the legacy JSON API and adds an additional URL that can return this data, compounding the problem.

To remedy this, we’ve made the /pypi/<project>/<version>/json URLs drop the releases key and only contain the data specific to <version>. The urls key still exists and contains all of the files for that release.
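For concreteness, here is a minimal sketch of what a client of the versioned endpoint sees after this change, using only the standard library; the project name and version below are placeholder values, not anything specific:

```python
# Minimal sketch: fetch the versioned legacy JSON API and read the "urls"
# key, which lists the files for exactly that version. The "releases" key
# is no longer present in this response. Project/version are placeholders.
import json
from urllib.request import urlopen

def files_for_version(project: str, version: str) -> list[dict]:
    with urlopen(f"https://pypi.org/pypi/{project}/{version}/json") as resp:
        data = json.load(resp)
    return data["urls"]  # sdists and wheels for this one version only

for file_info in files_for_version("example-project", "1.0.0"):
    print(file_info["filename"], file_info["url"])
```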

The /pypi/<project>/json URL still has the releases key that contains all of the files. While we would like to eventually remove the releases key from this URL too, we’ve identified popular software that currently relies on it. Ideally that software will be updated to use the PEP 691 JSON API instead, and in any case should treat the releases key as deprecated.
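For anyone weighing that migration, here is a rough sketch of a PEP 691 request; the Accept header is what selects the JSON form of the Simple API, and the project name is a placeholder:

```python
# Rough sketch of fetching the PEP 691 JSON Simple API rather than relying
# on the legacy "releases" key. Note (as discussed below) this response has
# no upload times. Project name is a placeholder.
import json
from urllib.request import Request, urlopen

def simple_files(project: str) -> list[dict]:
    req = Request(
        f"https://pypi.org/simple/{project}/",
        headers={"Accept": "application/vnd.pypi.simple.v1+json"},
    )
    with urlopen(req) as resp:
        data = json.load(resp)
    return data["files"]  # each entry has "filename", "url", "hashes", ...
```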

Due to the caching CDN, you may continue to get the releases key on the version-specific URLs over the next few days, as cached responses are slowly replaced with ones that omit it.


This will break virtualenv’s background wheel update feature - virtualenv/periodic_update.py at main · pypa/virtualenv · GitHub :thinking:

That should be fine currently, as you’re using f"https://pypi.org/pypi/{distribution}/json", which still has the releases key, since that isn’t the versioned URL.

If/when we remove the releases key from the unversioned URL, that would break, but you should be able to fix it by switching to the versioned URL and using the urls key instead.

Ah, I see. Should we migrate the virtualenv logic to PEP 691, or is it fine to keep using the unversioned link?

PEP 691 doesn’t have the upload time, so you won’t be able to use it without a PEP to add that.

For now it’s OK to keep using the unversioned link, but it appears that you know the version of the wheel you’re trying to get the upload time for, so you’d actually be best off using the versioned URL, whose urls key has the same data the releases key had, except scoped to just that version (something like the sketch below).
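Purely for illustration, here is a hedged sketch of that versioned-URL approach for the upload-time lookup; the function and names are hypothetical, not virtualenv’s actual code:

```python
# Hypothetical sketch of the suggested migration: given a known distribution
# and version, read per-file upload times from the versioned endpoint's
# "urls" key instead of the unversioned "releases" key.
import json
from urllib.request import urlopen

def wheel_upload_times(distribution: str, version: str) -> dict[str, str]:
    url = f"https://pypi.org/pypi/{distribution}/{version}/json"
    with urlopen(url) as resp:
        data = json.load(resp)
    return {
        f["filename"]: f["upload_time_iso_8601"]
        for f in data["urls"]
        if f["filename"].endswith(".whl")
    }
```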

That will have you using less bandwidth (since the versioned URLs have smaller responses) and will protect you against a future in which we drop the releases key from the unversioned URL too.

That would be a longer-term thing though, so there’s no great rush to do so.

It appears that the expectation in the longer term is that people will be expected to make multiple requests to the JSON API, each returning smaller payloads. Is that a fair expectation? I’m not particularly familiar with the trade-offs in web APIs, and I’ve generally assumed that it’s better (as a client) to minimise round trips to the server - is that not actually the case?

In particular, I have a script that grabs the metadata for a project and uses the unversioned API because that gives me everything in one go. I simply assumed that was better than using multiple calls to the versioned API, one for each version. (There’s also the fact that I have to use the unversioned API to get the list of all valid versions, so it’s not as if I can avoid using the unversioned API anyway…)
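For concreteness, the pattern being described is roughly the sketch below (not my actual script), where the version list comes from the keys of releases on the unversioned endpoint; the project name is only an example:

```python
# Sketch of the "everything in one go" pattern described above: one request
# to the unversioned endpoint, with the version list taken from the keys of
# the "releases" mapping. Project name is only an example.
import json
from urllib.request import urlopen

def project_metadata(project: str) -> tuple[dict, list[str]]:
    with urlopen(f"https://pypi.org/pypi/{project}/json") as resp:
        data = json.load(resp)
    versions = list(data["releases"])  # one key per released version
    return data["info"], versions

info, versions = project_metadata("pip")
print(info["name"], len(versions), "versions")
```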

It depends!

Technically speaking, if you need all the data, then one request that returns all of the data is the best from the client’s POV. Every round trip has overhead both in time and bandwidth, so by getting it all into one request you minimize that overhead.

The same is true on the server side if all you’re trying to minimize is the total time/bandwidth spent.

However, the flip side is that when you have requests that take a long time (some of these big responses took 2+ seconds), you’re tying up resources on the server, and the web server can only handle so many concurrent connections before it starts to queue new ones. That ends up causing latency to go up for everyone, as people end up sitting in the queue even for requests that would otherwise be fast.

So ultimately, best practice is for your web requests to return quickly, ideally sub-100ms, to provide a better overall experience for everyone: the service will just be more responsive overall, even if it’s ultimately a little slower for someone trying to pull tons of data.

Computing the large JSON response for a project like ccxt, which has 4632 releases, was taking upwards of 2500ms, and since there were 4632 total URLs that could return that large response, each slightly different so they couldn’t be cached the same, that had the potential to tie up all of our web workers for an extended period of time if people were requesting a lot of different URLs that all took ~2s each.

So we needed to reduce the time that most of those URLs took.

We profiled those handlers: roughly 25% of the time was spent parsing the UUIDs that we use as primary keys into Python’s uuid.UUID objects, and another 25% was buried deep inside SQLAlchemy in a function that individually had sub-ms run time, but was being called thousands of times to handle all of the rows.
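As a rough, self-contained illustration of why that adds up (a standalone micro-benchmark, not Warehouse’s actual code): parsing one UUID string is cheap, but doing it once per row over thousands of rows is not free.

```python
# Standalone micro-benchmark illustrating the per-row overhead: constructing
# uuid.UUID from a string is cheap individually, but a response touching
# thousands of rows pays that cost thousands of times per request.
import timeit
import uuid

row_keys = [str(uuid.uuid4()) for _ in range(10_000)]

per_pass = timeit.timeit(lambda: [uuid.UUID(k) for k in row_keys], number=10) / 10
print(f"parsing {len(row_keys):,} UUIDs took {per_pass * 1000:.1f} ms per pass")
```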

I spent a few days trying to find some way to speed up the code there; while I could disable turning the primary keys into uuid.UUID objects, that still left the responses much too slow (and it would have required wide-scale changes throughout the code base, because it’s a global option, not a per-query option).

That left us with two real options:

  • Reduce the scope of the response so that it has to do less work, making it faster.
  • Move that work out of the request/response cycle by precomputing some or all of the response.

For APIs like the Simple Repository API, we’re doing the second option (well, we aren’t yet, but that’s the plan), as that API is very well suited to being generated once, statically, and served as a precomputed response. Particularly since PEP 458 will require that the response be stable, so precomputing it makes it less likely we accidentally invalidate signatures.

That was harder to do for the legacy JSON API, because we would need to precompute a slightly different response for every single release of every single project, so for a project like ccxt that would mean precomputing 4600+ different responses (and we would need to decide what to do in the interim while we’re precomputing them).

So we instead did the first option and reduced the scope, which brought those responses down from ~900ms to ~70ms when running locally on my desktop.

This also has the added benefit that we don’t have to load as many huge responses into memory, so the overall memory usage of the web workers is more stable than before.

Ultimately, this is what our backends looked like before and after:

[graph of backend response times before and after the deploy]

(can you guess when the deploy happened on the graph? :smiley: )

So basically, for the web, the most important thing is that your responses return quickly OR that you have enough web servers to serve your highest concurrent load (which we don’t, because our request patterns are “bursty”: most things are served by Fastly out of the CDN, but a release causes a burst of traffic related to that project, and maintaining capacity for all of those bursts would be a waste of resources).


Just wanted to say: thanks for making this public announcement! Thanks to it, we were able to quickly diagnose our dependency on the releases key in pip-audit and get a new patch version out the door.


Out of curiosity, would you mind sharing this list?