Enabling large-scale exploration of published PyPI code

Over the last few months I’ve been working on a tool that I believe would be useful to people following core-development-related topics. I’d like to get some feedback on it before I continue development.

The aim of the tool is to enable anyone to scan, parse, read or otherwise access the entire set of published Python code on PyPI from their local machine. By leveraging the deduplication and compression features of Git, we are able to shrink all 14 TB of PyPI releases (and the 3 TB of Python code within them) down to just 120 GB. We can then distribute it via GitHub and use various git libraries to access the compressed code without extracting it. This looks like so (sped up 5x):

[Screen recording, 2023-03-07 at 18.13.20]

This ran the RustPython parser over 5.6 million Python files on PyPI to look for risky usages of asyncio.create_task(). It found 12,550 OK usages and 4,604 risky ones (where the task reference does not appear to be persisted), such as this usage in the aionostr package or this one in the aioradio package.
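To illustrate what “risky” means here, a minimal sketch (not taken from either of the packages mentioned): the event loop only holds a weak reference to a Task, so if nothing else keeps a reference, the task can be garbage collected before it finishes.

```python
import asyncio

background_tasks = set()

async def do_work():
    await asyncio.sleep(1)

async def handler():
    # Risky: the returned Task is discarded, so only the event loop's weak
    # reference keeps it alive and it may disappear mid-execution.
    asyncio.create_task(do_work())

    # OK: persist a reference until the task completes (the pattern
    # recommended in the asyncio documentation).
    task = asyncio.create_task(do_work())
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
```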

The use cases I think this project could help with are:

  • Quantitative analysis of how the standard library is used, including investigating improvements/warts like the one above
  • Investigating how the usage of syntax features evolves over time (see the sketch after this list)
  • Creating tooling that involves parsing code, such as linters (e.g. ruff)
  • Alternative implementations of Python looking for repositories of Python code to test with
  • Detailed analysis of how third-party packages are used across the ecosystem
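As a small example of the syntax-evolution bullet, here is a sketch (assuming the relevant files have already been checked out into a hypothetical `checkout/` directory) that counts how many files use the walrus operator:

```python
import ast
import pathlib

def uses_walrus(source: str) -> bool:
    """Return True if the source contains a NamedExpr (walrus) node."""
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return any(isinstance(node, ast.NamedExpr) for node in ast.walk(tree))

files = pathlib.Path("checkout").rglob("*.py")   # hypothetical extracted tree
count = sum(uses_walrus(p.read_text(errors="replace")) for p in files)
print(f"{count} files use the walrus operator")
```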

And this could all be done interactively on a local laptop or desktop in a couple of hours, whereas previously the storage and bandwidth requirements severely limited accessibility.

Is this project useful to you, as an audience of people heavily involved with the development of Python or related projects? If so, what kinds of things would you use it for, and if not, why not?


When thinking about deprecating parts of the language, we sometimes search the most popular X projects on PyPI to get an idea of how much something is used.

Having the full set of PyPI would of course allow even wider searches.

Some downsides:

  • the full set will include a lot of projects which are seldom used or updated, which could skew the results
  • more data means longer downloads and longer searches
  • 120 GB is way above GitHub’s repo size limits of 2–5 GB

You’re right, but there’s a long tail of Python packages from smaller communities that makes up the bulk of the code published to PyPI. Looking only at the most popular packages introduces a definite bias towards certain communities, types of libraries and levels of project maturity, and won’t be generally representative. I agree about recency, though: you generally care about the most recent packages rather than something that was uploaded in 2008.

This pushes the “time taken” from downloading to searching. If you’re just running a regular expression over the input, this becomes very quick: searching all the Python code on PyPI for LegacyInterpolation with ripgrep takes about 5 minutes on my laptop (402 matches).
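The same kind of scan can also be done without extracting anything, by iterating over blobs in a bare clone directly. A minimal sketch using dulwich (one of several git libraries that can read the object database); the repository path here is hypothetical and this isn’t my actual tooling:

```python
import re
from dulwich.repo import Repo

pattern = re.compile(rb"LegacyInterpolation")

repo = Repo("pypi-shard-0.git")        # hypothetical path to one bare shard
store = repo.object_store

matches = 0
for sha in store:                       # walks both loose and packed objects
    obj = store[sha]
    if obj.type_name != b"blob":
        continue
    if pattern.search(obj.data):        # blob contents, decompressed on read
        matches += 1

print(f"{matches} blobs match")
```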

In my example above I was parsing all the code to an AST, then walking through all the statements and expressions to find relevant asyncio.create_task() calls. This is much more expensive and slower, but that’s a different problem. With a repository, and each package as a commit, it becomes fairly simple to include or exclude packages by author, version, type or date.
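To give a flavour of the check (my real analysis used the RustPython parser, but the standard-library ast module shows the same idea), a call whose result lands in a bare expression statement is one where the task reference isn’t persisted:

```python
import ast

SOURCE = """\
import asyncio

async def main():
    asyncio.create_task(work())          # risky: Task reference discarded
    task = asyncio.create_task(work())   # OK: reference kept
    await task
"""

def is_create_task(call: ast.Call) -> bool:
    func = call.func
    return (
        isinstance(func, ast.Attribute)
        and func.attr == "create_task"
        and isinstance(func.value, ast.Name)
        and func.value.id == "asyncio"
    )

for node in ast.walk(ast.parse(SOURCE)):
    # An Expr statement wrapping the call means the returned Task is not
    # assigned, awaited or otherwise stored anywhere.
    if (
        isinstance(node, ast.Expr)
        and isinstance(node.value, ast.Call)
        and is_create_task(node.value)
    ):
        print(f"risky create_task() call on line {node.lineno}")
```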

Indeed. These repositories are also a special case: the compact size is much smaller than the logical size due to the large amount of duplicated code between releases, and even between projects. Initially I took the size limit to mean just the size of the git object database, in which case even a 5 GB limit would be enough to store ~500k release files. After speaking to GitHub, it turns out they run some processes that scale with the logical size of the repo, so that was a no-go.

As such, we just need to shard the data across many smaller repositories. I’ve done this already (with an index), but it clearly needs some tooling to manage the downloads for you and to automate reading from git.
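To make that concrete, one possible shape for the client side (entirely hypothetical; none of these names, URLs or index formats exist yet) would be an index mapping package names to shard repositories, plus a helper that fetches the right shard on demand:

```python
import json
import pathlib
import subprocess

from dulwich.repo import Repo

# Hypothetical index: {"aionostr": "pypi-shard-17", "aioradio": "pypi-shard-03", ...}
INDEX = json.loads(pathlib.Path("index.json").read_text())
CACHE = pathlib.Path.home() / ".cache" / "pypi-code"

def shard_for(package: str) -> Repo:
    """Clone (once) and open the bare shard repository containing `package`."""
    shard = INDEX[package]
    path = CACHE / f"{shard}.git"
    if not path.exists():
        # Illustrative URL only; the real hosting layout is still undecided.
        subprocess.run(
            ["git", "clone", "--bare", f"https://example.org/{shard}.git", str(path)],
            check=True,
        )
    return Repo(str(path))
```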

Which is why I’m asking for feedback: I’m not sure what form this tooling should take, or how useful people would find it if I were to build it.
