Dataset for efficiently querying files and metadata within Python distributions on PyPI

sethmlarson · November 15, 2023, 4:35pm

Wanted to share this group of datasets curated by Tom Forbes and my own accompanying article on how to get started querying quickly.

To my knowledge, there isn’t an easier way to access this type of information. I figure this dataset would be useful for folks looking to answer ecosystem-scale questions, such as the use or adoption of packaging features, metadata fields and values, build backends, etc. This lets PEP authors and packaging maintainers make more data-driven decisions about the state of Python packages, both historically and today.

I hope everyone who is interested finds the information and guide useful. Happy to answer questions.

pf_moore · November 15, 2023, 6:40pm

Thanks! Tom (@orf) posted some details about this dataset here a little while ago, in the thread "You can now download PyPI locally".

There’s some extra information that might be useful to people in that thread as well.

ofek · November 15, 2023, 11:09pm

How does one query info about build backends? It looks like only core metadata is exposed (the build backend name is in the WHEEL file).

pf_moore · November 15, 2023, 11:38pm

You can find the content of individual files in a distribution, so look for all sdists with a pyproject.toml and parse that for the backend information. Any sdist without a pyproject.toml is setuptools based.

ofek · November 16, 2023, 12:03am

Does the dataset store the deserialized TOML data in a table or is there postprocessing one must do?

sethmlarson · November 16, 2023, 4:30am

@ofek You would need to grab every pyproject.toml and WHEEL and do some post-processing, but you can do that by downloading those individual files instead of whole archives.

kknechtel · November 16, 2023, 5:28pm

I was actually going to ask about this separately (it is also related to the debate around Python version upper caps), but it fits here.

Would it be possible for dependency resolvers (especially installers that have to resolve dependencies) to query a dataset like this, such that they don’t have to repeatedly download entire wheels just to check metadata?

Would it be possible to keep that data up to date automatically when wheels are published?

Does the (official) metadata really have to live exclusively inside wheels?

jeanas · November 16, 2023, 5:30pm

Isn’t that more or less what PEP 658 has already done? See "PEP 658 & 714 are now live on PyPI".
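For anyone who hasn't looked at it: under PEP 658 the metadata file is served at the distribution file's URL with `.metadata` appended, so a resolver can read dependencies without fetching the whole wheel. A minimal sketch, with the actual HTTP request left out; the function names and the example URL are hypothetical.

```python
from email.parser import HeaderParser
from urllib.parse import urldefrag


def metadata_url(file_url: str) -> str:
    """Build the PEP 658 metadata URL for a distribution file URL."""
    # Index links often carry a "#sha256=..." fragment; drop it first.
    return urldefrag(file_url).url + ".metadata"


def requires_dist(metadata_text: str) -> list[str]:
    """Extract the Requires-Dist entries from core metadata text."""
    # Core metadata uses an email-header-like format, so the stdlib
    # header parser is enough for a simple read of repeated fields.
    message = HeaderParser().parsestr(metadata_text)
    return message.get_all("Requires-Dist") or []
```

An installer would fetch `metadata_url(...)` only when the index advertises that the metadata file exists, then feed the response body to `requires_dist`.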

kknechtel · November 16, 2023, 5:32pm

It’s entirely possible that I haven’t had a serious test of this since May, so thanks for the heads-up.

pf_moore · November 16, 2023, 5:57pm

PEP 643 would allow similar optimisations for some sdists (ones that don’t compute their dependency data dynamically) - note that PEP 658 isn’t limited to wheel metadata (although the PyPI implementation might be, currently).

Edit: Backfilling wouldn’t be possible, though, as sdists have to choose to publish metadata 2.2, and older sdists won’t have done so.
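To make the PEP 643 rule concrete, here is a hedged sketch of the check a tool could apply to an sdist's PKG-INFO: a field can only be trusted if the file declares metadata version 2.2 or later and does not list that field under `Dynamic`. The function name `field_is_static` is invented for illustration.

```python
from email.parser import HeaderParser


def field_is_static(pkg_info_text: str, field: str = "Requires-Dist") -> bool:
    """Return True if PEP 643 lets a tool trust `field` in this PKG-INFO."""
    message = HeaderParser().parsestr(pkg_info_text)
    version = message.get("Metadata-Version", "1.0")
    try:
        major_minor = tuple(int(part) for part in version.split(".")[:2])
    except ValueError:
        # Unparseable version: be conservative and trust nothing.
        return False
    if major_minor < (2, 2):
        # Older sdists make no promise that their metadata is reliable.
        return False
    # Fields listed as Dynamic may differ in wheels built from the sdist.
    dynamic = {value.strip().lower() for value in message.get_all("Dynamic") or []}
    return field.lower() not in dynamic
```

When this returns False, the tool still has to build the sdist (or at least ask the backend for metadata) to learn the real value.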

ofek · November 16, 2023, 8:21pm

Does PyPI support that yet?

pf_moore · November 16, 2023, 8:38pm

Not yet. See "Support Metadata Version 2.2" (Issue #9660 in pypi/warehouse on GitHub).

I think it’s basically waiting on someone to have the time to take the work to completion. Once PyPI supports metadata 2.2, then build backends like setuptools, hatch, etc., can start generating it.