Over the last few months I’ve been working on a tool that I believe would be useful to readers interested in core-development-related topics. I’d like to get some feedback on it before I continue development.
The aim of the tool is to enable anyone to scan, parse, read or otherwise have access to the entire set of published Python code on PyPI from their local machine. By leveraging the deduplication and compression features of Git, we are able to shrink all 14 TB of PyPI releases (and the 3 TB of Python code within) into just 120 GB. We can then distribute it with GitHub and leverage various Git libraries to access the compressed code without extracting it. It looks like this (sped up 5x):
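To see why Git deduplicates so well here, note that Git stores files as content-addressed blobs: the object ID is a hash of the content alone, so a file that is byte-identical across two releases (or two packages) is stored exactly once. A minimal sketch of that addressing scheme, using only the standard library:

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    """Compute the object ID Git assigns to a blob: SHA-1 over a
    'blob <size>\\0' header followed by the raw content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# Identical file contents hash to the same object ID, so a file that is
# unchanged between two releases (or shared between two packages) is
# stored in the object database only once.
release_1 = b"def main():\n    print('hello')\n"
release_2 = b"def main():\n    print('hello')\n"  # unchanged in the next release

assert git_blob_id(release_1) == git_blob_id(release_2)
```

On top of this, packfiles delta-compress the blobs that are merely similar rather than identical, which is where the rest of the 14 TB → 120 GB reduction comes from.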
You’re right, but there’s a long tail of Python packages from smaller communities that makes up the bulk of the code published to PyPI. Looking only at the most popular packages introduces a definite bias towards certain communities, types of libraries and maturity levels of projects, and won’t be generally representative. I agree about recency though: you generally care about the most recent packages rather than something that was uploaded in 2008.
This pushes the “time taken” from downloading to searching. If you’re just running a regular expression over the input, this becomes very quick: searching all the Python code on PyPI for LegacyInterpolation using ripgrep takes about 5 minutes on my laptop (402 matches).
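For anyone who wants the shape of that search without ripgrep, a rough Python equivalent is below. It is only a sketch (ripgrep is far faster, and `grep_tree` is a name I’ve made up here): walk a checkout, apply a compiled pattern line by line, and yield the hits.

```python
import re
from pathlib import Path
from typing import Iterator

def grep_tree(root: Path, pattern: str) -> Iterator[tuple[Path, int, str]]:
    """Yield (file, line number, line) for every match of `pattern`
    in the .py files under `root`."""
    regex = re.compile(pattern)
    for path in root.rglob("*.py"):
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file: skip it rather than abort the scan
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                yield path, lineno, line.strip()
```

The same loop works over blobs streamed straight out of the Git object database, which avoids ever materialising the 14 TB of extracted releases on disk.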
In my example above I was parsing all the code to an AST, then walking through all the statements and expressions to find relevant asyncio.create_task() calls. This is much more expensive and slower, but that’s a different problem. With a repository, and each package as a commit, it becomes fairly simple to exclude or include packages by author, version, type or date.
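The AST pass looks roughly like this (a simplified sketch, not my exact code): parse each file with the standard-library `ast` module, walk every node, and keep the calls whose function is the attribute `create_task` on the name `asyncio`.

```python
import ast

def find_create_task_calls(source: str) -> list[int]:
    """Return the line numbers of every asyncio.create_task(...) call
    in `source`, found by walking the parsed AST."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "create_task"
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "asyncio"
        ):
            lines.append(node.lineno)
    return lines

example = """\
import asyncio

async def main():
    task = asyncio.create_task(worker())
    await task
"""
```

Running this over millions of files is what makes the AST approach so much slower than a plain regex, even though the per-file logic is simple.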
Indeed. These repositories are also a special case: the compact size is much smaller than the logical size due to the large amount of duplicated code between releases, and even between projects. Initially I took the size limit to mean just the size of the Git object database, in which case even a 5 GB limit would be enough to store ~500k release files. After speaking to GitHub, it turns out they run some processes that scale with the logical size of the repo, so that was a no-go.
As such, we just need to shard them across many smaller repositories. I’ve done this already (with an index), but it clearly needs some tooling to manage the downloads for you and automate reading from Git.
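One simple way to do the sharding, sketched below under assumptions of my own (the shard count and the `pypi-code-NNN` repository naming are both hypothetical): hash each project name with a stable hash, so a project always lands in the same shard, and let the index map projects to the repository that holds them.

```python
import hashlib

NUM_SHARDS = 64  # assumption: tuned so each repo stays well under GitHub's limits

def shard_for(project: str) -> str:
    """Map a PyPI project name to a shard repository name using a stable
    hash, so the assignment never changes between runs or machines."""
    digest = hashlib.sha256(project.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % NUM_SHARDS
    return f"pypi-code-{shard:03d}"  # hypothetical repo naming scheme

# The index is then just a mapping from project to shard repository,
# so the tooling knows which repos to clone for a given query.
index = {name: shard_for(name) for name in ["requests", "numpy", "flask"]}
```

The nice property of a content-stable hash (rather than Python’s builtin `hash`, which is salted per process) is that the index can be rebuilt or verified by anyone and always agrees with the published shards.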
Which is why I was asking for feedback: I’m not sure what form this tooling should take, or how useful people would find it if I were to build it.