A tool for aggregating Python package metadata

Hi everyone,

We would like to share a tool with the Python packaging community. The tool is called “thoth-solver” [1] and we use it to aggregate package metadata. As PyPI does not provide additional metadata about packages, we developed this tool to obtain metadata from published packages. The tool has aggregated metadata about Python packages hosted on PyPI for us (roughly 20% of all package releases on PyPI were analyzed - the most popular ones). Part of the dataset produced can be found at [2,3].

In short, the tool installs specified packages from any PEP-503 compliant package index and extracts the metadata available for the given package using standard library functions. Most valuable for us was the dependency information (hence the name), but the tool can extract additional info as well (see the project README file for more details [1]).
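As a rough illustration of that flow (this is a sketch, not thoth-solver's actual code, and the helper names are mine), one can create an isolated virtual environment and install a single package into it without its dependencies:

```python
import subprocess
import sys
import venv
from pathlib import Path


def create_solver_env(env_dir):
    """Create an isolated venv for analyzing one package; return its pip path."""
    venv.create(env_dir, with_pip=True)
    bin_dir = "Scripts" if sys.platform == "win32" else "bin"
    return Path(env_dir) / bin_dir / "pip"


def install_without_deps(pip_path, package, index_url="https://pypi.org/simple"):
    """Install only the target package (--no-deps) from a PEP 503 index,
    so the venv ends up with exactly one distribution to inspect."""
    subprocess.run(
        [str(pip_path), "install", "--no-deps", "--index-url", index_url, package],
        check=True,
    )
```

With `--no-deps`, a failed install is itself a data point rather than a fatal error, which matches the observation-collecting approach described later in the thread.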

Feel free to use the tool - we would be happy to hear any input or suggestions from the community.


[1] GitHub - thoth-station/solver: Dependency solver for the Thoth project
[2] datasets/notebooks/thoth-solver-dataset at master · thoth-station/datasets · GitHub
[3] Thoth Solver Dataset v1.0 | Kaggle


Cool! This sounds really interesting to me, I’ve been working on a similar dataset for some time now.

Can I ask how you extracted dependency data? The dependency information from the JSON API is extremely unreliable. I’m working on extracting metadata from the published wheel files, but that’s a major task as there’s 7TB of wheels on PyPI! I also have code that will extract metadata from sdists (by downloading and building them) but obviously (a) that’s only possible if I can build the package locally, and (b) it means running potentially-untrusted remote code.
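For the wheel-based approach mentioned above, the core metadata can be read straight out of a `.whl` without installing anything - the `METADATA` file inside the `.dist-info` directory is in email-header format, so the standard library suffices. A minimal sketch (the function name is mine, not from either project):

```python
import email.parser
import zipfile


def wheel_metadata(wheel_path):
    """Read the METADATA file from a wheel without installing it."""
    with zipfile.ZipFile(wheel_path) as whl:
        # Every wheel contains exactly one <name>-<version>.dist-info/METADATA
        meta_name = next(
            n for n in whl.namelist() if n.endswith(".dist-info/METADATA")
        )
        raw = whl.read(meta_name).decode("utf-8")
    # METADATA uses RFC 822-style headers (Name, Version, Requires-Dist, ...)
    return email.parser.Parser().parsestr(raw)
```

`msg.get_all("Requires-Dist")` on the returned message then yields the declared dependencies, which is the unreliable part of the JSON API this approach sidesteps.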

I’d be interested to know how you approached and solved this issue, in case you had any ideas I could use :slightly_smiling_face:


The specified package is downloaded and installed into a separate virtual environment without dependencies. Then, the metadata is extracted by using importlib-metadata. The approach with sdist you shared might be similar, but:

a) We do not care if the build fails - thoth-solver captures the error log, which we subsequently classify. The fact that an sdist build fails is an observation for us, so we know the given package was not installable into whatever container environment thoth-solver ran in (e.g. fedora:34 running Python 3.9).

b) We run thoth-solver in an isolated+controlled cluster environment, so untrusted code should be fine for us.
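The extraction step described above could look roughly like this - a sketch using `importlib.metadata`, where `search_path` can point at the separate venv's site-packages (the function name and return shape are illustrative, not thoth-solver's API):

```python
import importlib.metadata


def extract_metadata(package_name, search_path=None):
    """Extract core metadata for an installed package via importlib.metadata.

    search_path may list another environment's site-packages directories
    (e.g. the venv the package was installed into with --no-deps);
    by default the current environment is searched.
    """
    if search_path is None:
        dist = importlib.metadata.distribution(package_name)
    else:
        dist = next(iter(
            importlib.metadata.distributions(name=package_name, path=search_path)
        ))
    return {
        "name": dist.metadata["Name"],
        "version": dist.version,
        # Requires-Dist entries: the dependency info the tool is named for
        "requires": dist.requires or [],
    }
```

Because the package was installed with `--no-deps`, the venv contains only the target distribution, so there is no ambiguity about whose metadata is being read.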

Ah, OK. That seems reasonable, I guess you have a bigger environment than I do (I’m running on my personal laptop). How did you select the packages you analyzed? You say “the most popular ones” - how did you identify those? Did you simply look at “most downloaded” figures and work from those, or did you have a more specific metric?

The tool is part of a bigger system that decides which packages should be analyzed with thoth-solver, based on how they are used within the company (i.e. by our users). We bootstrap the database with a set of selected packages [1]; transitive dependencies are then automatically identified and analyzed as well.

[1] init-job/hundredsDatasciencePackages.yaml at master · thoth-station/init-job · GitHub
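The transitive-identification step could be sketched as a breadth-first walk over the Requires-Dist entries of already-solved packages (the crude name parsing and the function itself are illustrative only; here it walks the current environment rather than a solver database):

```python
import importlib.metadata
import re
from collections import deque


def transitive_closure(seed_packages):
    """Collect seed packages plus everything they transitively require."""
    seen = set()
    queue = deque(seed_packages)
    while queue:
        name = queue.popleft().lower()
        if name in seen:
            continue
        seen.add(name)
        try:
            requires = importlib.metadata.requires(name) or []
        except importlib.metadata.PackageNotFoundError:
            # Not installed here; in the real system this would be
            # queued for its own solver run instead of skipped.
            continue
        for req in requires:
            # Take the bare project name off each Requires-Dist entry,
            # ignoring version specifiers, extras, and markers.
            match = re.match(r"[A-Za-z0-9._-]+", req)
            if match:
                queue.append(match.group(0))
    return seen
```

Starting from a curated seed list like the one linked above, repeating this per release is what grows the analyzed set beyond the initial selection.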


Very cool!

I’m immediately reminded of Expose the METADATA file of wheels in the simple API · Issue #8254 · pypa/warehouse · GitHub when I read this.


FYI Frido forgot to mention that it’s a project born at Red Hat. And there’s much more to it.

There’s a GitHub App, for example, functionally equivalent to what dependabot does but with machine learning-driven dependency resolution:
GitHub Apps - Qeb-Hwt · GitHub (GitHub - thoth-station/qeb-hwt: I'm Kebechet bot, goddess of freshness - I will keep your dependencies fresh and up-to-date).

My understanding is that the wider Python community should be able to take advantage of such a service, which can produce lockfiles for pip or pipenv based on more factors than just version numbers and inter-package dependency restrictions.
