Offering a static PyPI data dump

orf · July 19, 2022, 10:33pm

In order to resolve project dependencies, tools like Poetry and pip make repeated requests to a PyPI-compatible API (either the simple endpoint or the newer JSON endpoint). This can sometimes result in long waits to resolve dependencies: at my work, when using a PyPI mirror without the JSON endpoint, it’s not uncommon to wait 30 minutes to an hour for a successful dependency resolution with Poetry for a non-trivial project.

Rust, in contrast, manages to lock complex dependency trees instantly. I believe part of the reason is that the packaging index is distributed as a git repository, with all packages and their dependencies expressed as JSON files. Check out the Rayon crate for example.

Would it be possible, or even a good idea, to publish periodic dumps of the package data stored in PyPI in a way that doesn’t require multiple network requests to fetch? There are many ways to approach this, from automating commits to a git repository, as crates.io does, to publishing a periodic SQLite database with package and dependency information at a known URL.

It seems that being able to easily access this data, offline and in bulk, could reduce requests to PyPI whilst improving the UX for end-users.
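
To make the SQLite option a bit more concrete, here is a minimal sketch of what such a dump could look like. The table names, column names and the `build_dump` / `requirements_for` helpers are all hypothetical, made up for illustration; nothing like this is published by PyPI today.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE releases (
    name    TEXT NOT NULL,
    version TEXT NOT NULL,
    PRIMARY KEY (name, version)
);
CREATE TABLE requirements (
    name        TEXT NOT NULL,
    version     TEXT NOT NULL,
    requirement TEXT NOT NULL,
    FOREIGN KEY (name, version) REFERENCES releases (name, version)
);
"""


def build_dump(path, releases):
    """Write {(name, version): [requirement, ...]} into a SQLite file."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    for (name, version), reqs in releases.items():
        conn.execute("INSERT INTO releases VALUES (?, ?)", (name, version))
        conn.executemany(
            "INSERT INTO requirements VALUES (?, ?, ?)",
            [(name, version, req) for req in reqs],
        )
    conn.commit()
    return conn


def requirements_for(conn, name, version):
    """Look up one release's requirements locally, with no network access."""
    rows = conn.execute(
        "SELECT requirement FROM requirements "
        "WHERE name = ? AND version = ? ORDER BY requirement",
        (name, version),
    )
    return [row[0] for row in rows]
```

A consumer would download the file once and then answer every metadata lookup with a local query instead of one HTTP request per release.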

brettcannon · July 20, 2022, 9:30pm

It also has to do with the fact that Rust allows each dependency to have its own set of dependencies, and so the resolver doesn’t have to do as much work to calculate what to eventually install.

I believe distlib has some pre-processed data that it uses from a web site. There’s also GitHub - sethmlarson/pypi-data: Data about packages and maintainers on PyPI if you’re just after PyPI data as-is.

The problem with any of this is that if the service goes down, you’re in trouble. There’s also the issue of the information going stale. I have had both happen to me with distlib.

njs · July 21, 2022, 12:12am

Fetching metadata generally isn’t the slow part of resolving package constraints. Even for hundreds of packages you’re talking seconds, not minutes.

The slow parts are:

Building binaries, if there are packages you want that don’t have prebuilt wheels
Figuring out which combination of versions you want for all those packages. Resolving packages in Python is NP-hard, i.e. in the worst case it can take time exponential (!) in the number of available packages and versions. Practical resolvers use heuristics to try to avoid this, but all heuristics have failure modes. It sounds like you’re hitting one of the failure modes in Poetry’s heuristics.
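
To illustrate where the exponential worst case comes from, here is a toy backtracking resolver. It is only a sketch: the `index` layout (integer versions, explicit sets of allowed versions) is invented for the example, and this is not the algorithm pip or Poetry actually use.

```python
def resolve(index, root_requirements):
    """Pick one version per package so that every requirement is satisfied.

    index: {package: {version: {dependency: set_of_allowed_versions}}}
    root_requirements: {package: set_of_allowed_versions}
    Returns {package: version}, or None if no combination works.
    """

    def backtrack(pending, chosen):
        if not pending:
            return chosen
        (pkg, allowed), rest = pending[0], pending[1:]
        if pkg in chosen:
            # Already pinned: the pin must satisfy this requirement as well.
            return backtrack(rest, chosen) if chosen[pkg] in allowed else None
        # Heuristic: try the newest version first.
        for version in sorted(index.get(pkg, {}), reverse=True):
            if version not in allowed:
                continue
            deps = list(index[pkg][version].items())
            result = backtrack(rest + deps, {**chosen, pkg: version})
            if result is not None:
                return result
        # Every candidate failed, so the caller has to revisit its own choice.
        return None

    return backtrack(list(root_requirements.items()), {})


# Hypothetical index: the newest "a" needs c==2, but every "b" needs c==1.
index = {
    "app": {1: {"a": {1, 2}, "b": {1, 2}}},
    "a": {1: {"c": {1}}, 2: {"c": {2}}},
    "b": {1: {"c": {1}}, 2: {"c": {1}}},
    "c": {1: {}, 2: {}},
}
print(resolve(index, {"app": {1}}))
```

Here the newest-first heuristic picks `a` version 2, explores every version of `b` before discovering the conflict on `c`, and only then falls back to `a` version 1. With more packages and versions, dead ends like this multiply.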

BTW, Cargo says that those database dumps they use are becoming a bottleneck as the crates.io registry grows, and they’re working on switching to the PyPI style of fetching metadata: Call for testing: Cargo sparse-registry | Rust Blog

orf · July 21, 2022, 2:39pm

Fetching metadata generally isn’t the slow part of resolving package constraints. Even for hundreds of packages you’re talking seconds, not minutes.

Yeah, this is true. I did some benchmarking today and most of the time is indeed spent building binaries and extracting dependency information from some releases.

However I still feel like there are analysis or exploratory use cases that are better served by a static, updatable data dump.

I believe distlib has some pre-processed data that it uses from a web site. There’s also GitHub - sethmlarson/pypi-data: Data about packages and maintainers on PyPI if you’re just after PyPI data as-is.

This approach seems to rely on manual GitHub releases, and I feel like there’s a better way to surface this data. I’ve created a proof of concept that incrementally dumps the API data to a Git repo on a 15 minute cadence: GitHub - pypi-data/pypi-json-data: Automatically updated pypi API data, available in bulk via git or sqlite

Maybe it isn’t as useful as I think, especially not for dependency resolution, but there have been a few times that I’ve been interested to analyze a large number of packages (specifically requires_dist) and this would have made it a fair bit easier.
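
As a rough sketch of the kind of analysis I mean: given a pile of JSON API documents already loaded as dicts, count how many projects depend on each package via `info.requires_dist`. The `count_dependencies` helper is hypothetical, and the regex only extracts the leading project name; a real tool would more likely use a proper requirement parser.

```python
import re
from collections import Counter

# Leading project name of a requirement string (PEP 508 name grammar).
_NAME = re.compile(r"^\s*([A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?)")


def normalize(name):
    """Normalize a project name: runs of '-', '_' and '.' become one '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def count_dependencies(documents):
    """Count, per dependency, how many documents list it in requires_dist."""
    counts = Counter()
    for doc in documents:
        # requires_dist may be missing or null, so fall back to an empty list.
        requirements = doc.get("info", {}).get("requires_dist") or []
        names = set()
        for req in requirements:
            match = _NAME.match(req)
            if match:
                names.add(normalize(match.group(1)))
        # A set is used so each dependency counts once per document.
        counts.update(names)
    return counts
```

With a local dump this is a single pass over files on disk, rather than one API request per project.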

pf_moore · July 21, 2022, 2:46pm

Note that the JSON API data for package metadata (specifically dependency data) is not reliable, as it isn’t necessarily extracted from the wheel (and even wheel data may be inaccurate if it’s from a wheel that isn’t compatible with your platform).

But yes, for exploratory applications, where accuracy is not critical, this may be useful. I have something similar that I created myself. But I wouldn’t advocate it for anything that isn’t doing “summary” processing.

pradyunsg · July 22, 2022, 12:33pm

There are also data dumps on BigQuery:

https://warehouse.pypa.io/api-reference/bigquery-datasets.html

Namely, I think the project metadata table is what you’re looking for. PyPI is at a big-enough scale that putting things in a single git clone / JSON blob does not work.

orf · July 22, 2022, 12:56pm

I do agree that it’s way too heavy to clone for anything other than analytics or exploration, but the entire PyPI releases dataset combined with the complete changelog is only about 2.5 GB gzipped, which makes BigQuery seem like a fairly big overhead for exploration.

hugovk · July 22, 2022, 5:33pm

Note the API’s policies on caching (respect the ETag header) and rate limiting (set a User-Agent):

https://warehouse.pypa.io/api-reference/
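
For anyone scripting against the JSON API, a minimal sketch of both points using only the standard library might look like this. The `USER_AGENT` string and the helper names are placeholders made up for the example; check the linked policies for the exact expectations.

```python
import urllib.error
import urllib.request

# Placeholder value: use something that identifies your tool and a contact.
USER_AGENT = "example-pypi-dump/0.1 (contact: you@example.com)"


def build_request(project, etag=None):
    """Build a JSON API request with a User-Agent and optional cache validator."""
    req = urllib.request.Request(
        f"https://pypi.org/pypi/{project}/json",
        headers={"User-Agent": USER_AGENT},
    )
    if etag:
        # Ask the server to skip the body if our cached copy is still current.
        req.add_header("If-None-Match", etag)
    return req


def fetch_if_changed(project, etag=None):
    """Return (body, etag), or (None, etag) when the cached copy is unchanged."""
    try:
        with urllib.request.urlopen(build_request(project, etag)) as resp:
            return resp.read(), resp.headers.get("ETag")
    except urllib.error.HTTPError as exc:
        if exc.code == 304:  # Not Modified: keep using the cached body
            return None, etag
        raise
```

Store the returned ETag next to the cached body and pass it back on the next call, so unchanged projects cost only a small 304 response.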