Dataset for efficiently querying files and metadata within Python distributions on PyPI

Wanted to share this group of datasets curated by Tom Forbes and my own accompanying article on how to get started querying quickly.

To my knowledge, there isn’t an easier way to access this type of information. I figure this dataset would be useful for folks looking to answer ecosystem-scale questions, such as the use or adoption of packaging features, metadata fields and values, build backends, etc. It lets PEP authors and packaging maintainers make more data-driven decisions based on the state of Python packages, both historically and today.

I hope everyone who is interested finds the information and guide useful. Happy to answer questions.


Thanks! Tom (@orf) posted some details about this dataset a little while ago in the thread "You can now download PyPI locally".

There’s some extra information that might be useful to people in that thread as well.


How does one query info about that? It looks like there is just core metadata exposed (build backend name is in the WHEEL file).

You can find the content of individual files in a distribution, so look for all sdists with a pyproject.toml and parse that for the backend information. Any sdist without a pyproject.toml is setuptools based.

Does the dataset store the deserialized TOML data in a table or is there postprocessing one must do?

@ofek You would need to grab every pyproject.toml and WHEEL and do some post-processing, but you can do that by downloading those individual files instead of whole archives.
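For the WHEEL side, the file is a set of RFC 822 style `Key: value` headers, so the standard library email parser can read it. A minimal sketch, assuming you already have the file's text; `wheel_generator` is a hypothetical helper name:

```python
from email.parser import Parser
from typing import Optional


def wheel_generator(wheel_text: str) -> Optional[str]:
    """Return the tool name from the Generator field of a WHEEL file, if any."""
    msg = Parser().parsestr(wheel_text)
    generator = msg.get("Generator")
    if not generator:
        return None
    # Values look like "hatchling 1.18.0" or "bdist_wheel (0.41.2)";
    # keep only the first whitespace-separated token.
    return generator.split()[0]


sample = (
    "Wheel-Version: 1.0\n"
    "Generator: bdist_wheel (0.41.2)\n"
    "Root-Is-Purelib: true\n"
    "Tag: py3-none-any\n"
)
print(wheel_generator(sample))  # prints "bdist_wheel"
```

Keep in mind that `Generator` names the tool that wrote the wheel, which does not always match the `build-backend` string in pyproject.toml, so some mapping is still needed afterwards.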

I was actually going to ask about this separately (it also relates to the debate around Python version upper caps), but I'll ask here.

Would it be possible for dependency resolvers (especially installers that have to resolve dependencies) to query a dataset like this, such that they don’t have to repeatedly download entire wheels just to check metadata?

Would it be possible to keep that data up to date automatically when wheels are published?

Does the (official) metadata really have to live exclusively inside wheels?


Isn’t that more or less what PEP 658 has already done? See the thread "PEP 658 & 714 are now live on PyPI".
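For anyone unfamiliar with it: under PEP 658 the index serves a distribution's core metadata as a separate file at the distribution's URL with `.metadata` appended, so a resolver can fetch a small text file instead of the whole wheel. A minimal sketch of the two pieces involved, with no network access; the helper names and the example URL are made up for illustration:

```python
from email.parser import Parser
from urllib.parse import urldefrag


def metadata_url(file_url: str) -> str:
    """Build the PEP 658 metadata URL for a distribution file URL."""
    # Index links often carry a "#sha256=..." fragment; drop it first.
    return urldefrag(file_url).url + ".metadata"


def requires_dist(metadata_text: str) -> list[str]:
    """Extract the Requires-Dist entries from core metadata text."""
    msg = Parser().parsestr(metadata_text)
    return msg.get_all("Requires-Dist") or []


url = "https://files.example.invalid/packages/demo-1.0-py3-none-any.whl#sha256=abc"
print(metadata_url(url))

metadata = (
    "Metadata-Version: 2.1\n"
    "Name: demo\n"
    "Version: 1.0\n"
    "Requires-Dist: requests>=2\n"
    "Requires-Dist: attrs\n"
)
print(requires_dist(metadata))
```

A real client should first check that the index actually advertises the metadata file for that link (the `data-dist-info-metadata` / `data-core-metadata` attributes from PEP 658 and PEP 714) before requesting it.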


It’s entirely possible that I haven’t had a serious test of this since May, so thanks for the heads-up.

PEP 643 would allow similar optimisations for some sdists (ones that don’t compute their dependency data dynamically) - note that PEP 658 isn’t limited to wheel metadata (although the PyPI implementation might be, currently).

Edit: Backfilling wouldn’t be possible, though, as sdists have to choose to publish metadata 2.2, and older sdists won’t have done so.
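To make the PEP 643 rule concrete, here is a rough sketch of the check a resolver could do on an sdist's PKG-INFO: trust a field only if the metadata version is at least 2.2 and the field is not listed under `Dynamic`. The function name `is_static_field` is hypothetical.

```python
from email.parser import Parser


def is_static_field(pkg_info_text: str, field: str) -> bool:
    """Return True if an sdist's PKG-INFO guarantees `field` is static."""
    msg = Parser().parsestr(pkg_info_text)
    version = msg.get("Metadata-Version", "1.0")
    major, minor = (int(part) for part in version.split(".")[:2])
    if (major, minor) < (2, 2):
        # Older metadata versions make no promise about sdist fields.
        return False
    # Field names are case-insensitive; Dynamic may appear several times.
    dynamic = {value.strip().lower() for value in msg.get_all("Dynamic") or []}
    return field.lower() not in dynamic


static_sdist = "Metadata-Version: 2.2\nName: demo\nVersion: 1.0\nRequires-Dist: attrs\n"
dynamic_sdist = "Metadata-Version: 2.2\nName: demo\nVersion: 1.0\nDynamic: Requires-Dist\n"
print(is_static_field(static_sdist, "Requires-Dist"))   # prints True
print(is_static_field(dynamic_sdist, "Requires-Dist"))  # prints False
```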


Does PyPI support that yet?

Not yet. See "Support Metadata Version 2.2" (Issue #9660 in pypi/warehouse on GitHub).

I think it’s basically waiting on someone to have the time to take the work to completion. Once PyPI supports metadata 2.2, then build backends like setuptools, hatch, etc., can start generating it.
