You can now download PyPI locally

Hello :wave:,
I’ve been working on a side project over the last few months with the aim of making the corpus of Python code uploaded to PyPI more accessible to people. The project is essentially done, and I wanted to share it here because one of the use cases I had in mind was helping with the development of Python itself.

With this project, anyone can download nearly the entire corpus of code uploaded to PyPI to their machine and parse and examine the AST of every Python file within it, completely locally, in only ~6 hours.

Details: I’ve put all the information I can on a project website: https://py-code.org/. It has live statistics on the contents of PyPI, a searchable index of projects and instructions on how to use it.

The project has two main components. First, the code from PyPI is mirrored to GitHub, where anyone can download it. Git is particularly great at compressing this kind of code, so the total download size is less than 370 GB.

Second, there is a series of Parquet datasets indexing file metadata (size, lines, hash, etc.), which lets you run some analytical queries without needing all the data.
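As a small taste of what those datasets enable, here’s a minimal sketch using pandas (the URL is one of the published index shards, reused from an example further down this thread; the path and lines column names are assumptions based on those examples):

import pandas as pd

# Read one shard of the file-metadata index directly over HTTPS.
df = pd.read_parquet(
    "https://github.com/pypi-data/data/releases/download/2023-08-31-03-12/index-12.parquet"
)

# Example analytical query: how many Python files, and how many lines in total?
py_files = df[df.path.str.endswith(".py")]
print(len(py_files), "Python files,", py_files.lines.sum(), "lines")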

Is this useful? I’m not really sure - I built it to see if I could, but I’d really like some feedback on whether it is actually useful in any way beyond a curiosity. If it’s not, any feedback on how it could be would also be fantastic.

I’m not a core developer, but some ideas I had in my head for ways this could be used are:

  1. Seeing how new language features are being adopted, and by which segments of the community
  2. Quantifying the impact of changes to the language (adding new keywords, for example)
  3. Seeing how standard library usage evolves over time and spotting improvements
  4. Testing parser changes against all the code

For example, I looked at the use of various language features when parsing. 50% of project releases uploaded to PyPI now contain type annotations and nearly 60% contain f-strings, making them the most popular and most quickly adopted Python language features.
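For the curious, a check like this can be sketched with the ast module; the snippet below is just an illustration, not the project’s actual analysis code:

import ast

def feature_flags(source):
    # Report which language features appear in a single source file.
    # Illustration only - not the project's real analysis code.
    tree = ast.parse(source)
    flags = {"f-strings": False, "annotations": False}
    for node in ast.walk(tree):
        if isinstance(node, ast.JoinedStr):  # f"..." literal
            flags["f-strings"] = True
        elif isinstance(node, ast.AnnAssign):  # x: int = ...
            flags["annotations"] = True
        elif isinstance(node, ast.arg) and node.annotation is not None:
            flags["annotations"] = True
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.returns:
            flags["annotations"] = True
    return flags

print(feature_flags("def f(x: int) -> str:\n    return f'{x}'"))
# {'f-strings': True, 'annotations': True}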

We can also see that pyproject.toml usage has taken off and looks like it will soon overtake setup.py usage:

Some more random fun facts

The longest Python file ever uploaded to PyPI is within this project: EvenOrOdd · PyPI

It’s 20,010,001 lines long, and is… just this:

The most complex Python file ever uploaded can be found here: The most complex Python file on PyPI · GitHub

It contains an expression that has 54,188 components and predictably stack overflows anything that parses it, Python included.

You can find more stats and information on the project website.

31 Likes

This looks really interesting. I’ve done very limited versions of the same sort of thing (mostly just metadata rather than code) and one thing that makes it frustrating is the issue of keeping things up to date - the rate of change of data on PyPI is pretty big, so you can download a snapshot and it’s out of date within a few days, and even incremental downloads are a big job. This is particularly frustrating if you want to focus on newer data (on the assumption that older stuff is out of date and hence not as informative).

Do you keep the data up to date? If so, how much time do incremental updates consume? I ask because I’m not entirely sure how much use I’d get from this data, and I don’t want to spend 6 hours (and a lot of someone’s bandwidth) downloading it only to find out that if I have a question every few weeks, I need to re-download every time to get useful results (adoption of new packaging features is the sort of thing I’d be interested in).

2 Likes

Thanks for the reply and the kind words!

Yes, the data is kept up to date: new packages are added twice a day. The code is automatically sharded across over 200 individual repositories (listed here: PyPI Data), ordered by date.

So if you only want code newer than a specific date, you only need to fetch a subset of the repositories. Similarly, incremental updates are simple and small thanks to git: you’re just fetching newer commits. Once a repository hits a specific number of packages (40k) it is “closed” and no longer updated, and a new one is created. These repositories are themselves discoverable via this auto-updated file.

What’s fantastic about using git for this is de-duplication: most of the code on PyPI is not unique - code across successive releases of a project is largely identical, for example. So fetching updates does not require a huge amount of bandwidth. Each repo is a ~1.4 GB download and contains roughly 5 days of releases.

This plays really nicely with large-scale parsing: we only need to parse the unique files within PyPI, identified by their git OID (hash). From there we can extrapolate which projects, versions and releases contain those files. This means a full parse “only” requires handling 30 million files, but from that we can build out the data as if we had parsed the full set. So we can count “releases that contain files that contain f-strings, over time” without parsing every individual file.
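As a toy sketch of that extrapolation - release_to_oids and looks_interesting below are placeholders for data you’d derive from the indexes and your own parser:

from functools import lru_cache

# Placeholder mapping: in practice this comes from the Parquet indexes.
release_to_oids = {
    ("example-project", "1.0"): ["aaa111", "bbb222"],
    ("example-project", "1.1"): ["aaa111", "ccc333"],  # shares a blob with 1.0
}

@lru_cache(maxsize=None)
def looks_interesting(oid):
    # Parse the blob exactly once per unique OID; every release containing
    # the same file reuses the cached result. Placeholder logic here -
    # in practice you would read the blob and ast.parse() it.
    return oid.startswith("a")

releases_with_feature = sum(
    any(looks_interesting(oid) for oid in oids)
    for oids in release_to_oids.values()
)
print(releases_with_feature)  # 2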

Another thing that I didn’t mention, but figure I might as well: all PyPI uploads are now automatically scanned for accidentally published credentials, thanks to GitHub secret scanning and services like GitGuardian. My analysis isn’t finished yet, but there are a lot of credentials lurking in PyPI.

8 Likes

This looks great!

This number seems to be diminishing rapidly

1 Like

Indeed, but we can adjust the number of packages per partition. The GitHub soft limit is about ~2.5 GB per repo, and the size scales non-linearly with the number of releases due to compression and deduplication.

But yeah, as uploads increase, so too will the rate at which new repos are created. What I’m banking on is that the volume of non-unique code uploaded is not completely correlated with the number of releases themselves.

I do wonder how fragile this is, given that you are relying on Github acting as free cloud storage…

I’ve been in contact with them, and they suggested a split-repository approach. GitHub hosts a lot of code, and as long as an individual repository doesn’t cause significant impact to their service or breach their TOS, I don’t believe it’s an issue.

That being said: if the actual concept proves valuable and GitHub objects to this, you can host ~360 GB of data on S3 for less than $10 per month.

5 Likes

I ran this overnight last night (well, technically it’s still running, in the “writing objects” phase). I got a lot of messages of the form

warning: unable to access ‘packages/Pootle/Pootle-2.7.0.tar.bz2/Pootle-2.7.0/pootle/static/js/node_modules/webpack/node_modules/watchpack/node_modules/chokidar/node_modules/anymatch/node_modules/micromatch/node_modules/regex-cache/node_modules/is-equal-shallow/.gitattributes’: Filename too long

This is on Windows. Having done some research, it looks like you not only need the “Enable long paths” setting switched on (which is documented in the Python setup docs and I believe can be done in the Python installer) but you also need to set core.longpaths in git’s configuration.

I haven’t tested core.longpaths yet, as the initial download is still running (and I don’t know if I’ll need to do it again!) but the linked StackOverflow question mentions “some limitations”, so it may have problems…

1 Like

Hey @pf_moore, thanks for giving it a go. Documenting these kinds of projects has never been my forte, and while the website touches on what I’m going to write below, it’s really implicit and not well described at all. So I’m going to write something here to get you and anyone else started by explaining things a bit more clearly, then transpose that onto the website later.

I’m also aiming to provide some tooling to make this a bit easier and more fluid, but right now you need to do something like what is described below:

tl;dr: don’t check out the code, just clone the repo.

Let’s say you want to parse all the code from 2023-07-23 to 2023-07-28. You clone the repository like so:

git clone https://github.com/pypi-data/pypi-mirror-221.git

What this is doing is just cloning the main branch, which doesn’t check out the code to your filesystem. But all the data within the code branch is retained within the packfile:

❯ ls -la pypi-mirror-221/.git/objects/pack/ 
.r--r--r--@  41M tom  1 Sep 09:58 pack-6b7a152d6928490ed4c3538726520875ec6f76d8.idx
.r--r--r--@ 1.7G tom  1 Sep 09:58 pack-6b7a152d6928490ed4c3538726520875ec6f76d8.pack
.r--r--r--@ 5.9M tom  1 Sep 09:58 pack-6b7a152d6928490ed4c3538726520875ec6f76d8.rev

The intuition here is that this packfile can be treated as essentially just a tarball. It’s too big and complex to unpack (as you’re finding!) but we can still access the contents without unpacking it. Below we’re listing all files (blobs) that end with .py and taking the first one:

❯ git rev-list --all --objects --filter=object:type=blob -- 'packages/' \
    | grep ".py\$" \
    | head -n1
46f8cb309e77d660d222b4554ea6b277b2b4b84c packages/AHItoDICOMInterface/AHItoDICOMInterface-0.1.3.1.tar.gz/AHItoDICOMInterface-0.1.3.1/AHItoDICOMInterface/AHIClientFactory.py

With the object ID (46f8cb309e77d660d222b4554ea6b277b2b4b84c) we can read it:

❯ git cat-file blob 46f8cb309e77d660d222b4554ea6b277b2b4b84c
"""
AHItoDICOM Module : This class contains the logic to create the AHI boto3 client.
...

This isn’t as easy to work with as plain old files on a filesystem, but we can string things together quite nicely. Let’s parse all unique Python files within this repository.

❯ git rev-list --all --objects --filter=object:type=blob -- 'packages/' \
    | grep ".py\$" \
    | wc -l
  595286

~600k files. There is a command, git cat-file --batch, which we can feed a stream of OIDs and have it output their contents to stdout. The following Python program reads data in the format git cat-file --batch produces and passes the contents to ast.parse(), printing whether each OID succeeded or failed:

import ast, sys

fd = sys.stdin.buffer

while line := fd.readline().decode():
    # Input line example:
    # 46f8cb309e77d660d222b4554ea6b277b2b4b84c blob 819
    # Split this, then read 819 bytes from stdin.
    oid, oid_type, oid_len = line.split(' ')
    data = fd.read(int(oid_len))

    try:
        ast.parse(data)
    except Exception:
        print(oid, "failed")
    else:
        print(oid, "success")
    # Output is always followed by a trailing newline. Consume it.
    fd.read(1)

And we can chain it together like so:

❯ git rev-list --all --objects --filter=object:type=blob -- 'packages/' \
       | grep ".py\$" \
       | cut -d' ' -f1 \
       | git cat-file --batch \
       | python parser.py
48387012e4cd6229b1b02d9afb4a0e85c8189102 success
4827e4901e91395a2601c94dac01ff7252d1b720 success
...

This is pretty slow and single-threaded, but you can parallelize it in a number of different ways. You can even use pygit2 and do it all directly within Python. For anyone brave enough, here is my undocumented, WIP Rust tool for parsing everything using libcst. I aim to expand it to automate most of the steps here, letting you write your parsing code in Python and have it handle the rest.
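If you’d rather skip git cat-file entirely, a pygit2 version of the same loop might look like the sketch below (the repository path and the stdin piping are assumptions; feed it the first column of the rev-list output, i.e. drop the git cat-file --batch stage):

import ast
import sys

import pygit2

# Sketch: read blob OIDs on stdin and parse each blob via pygit2 instead of
# piping through `git cat-file --batch`.
repo = pygit2.Repository(pygit2.discover_repository("pypi-mirror-221"))

for line in sys.stdin:
    oid = line.strip()
    blob = repo[oid]  # look the object up in the packfile by its hex OID
    try:
        ast.parse(blob.data)
    except Exception:
        print(oid, "failed")
    else:
        print(oid, "success")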

The various datasets I publish can be used to find specific needles, or whittle down the number of files you need to read/parse. For example, you could look for specific large Python files:

In [1]: import pandas as pd

In [2]: from pygit2 import Oid

In [3]: df = pd.read_parquet("https://github.com/pypi-data/data/releases/download/2023-08-31-03-12/index-12.parquet")

In [9]: needles = df[(df.repository == 221) & (df.path.str.endswith('.py')) & (df.lines > 10000)]

In [14]: print(Oid(needles.hash.values[0]))
911cf51d642721cb930f449808e13cca23d29324

Then just read it via pygit2 or some other method:

❯ git cat-file blob 911cf51d642721cb930f449808e13cca23d29324 | wc -l
   31606

❯ git cat-file blob 911cf51d642721cb930f449808e13cca23d29324 | head -n 5
import FWCore.ParameterSet.Config as cms

process = cms.Process("HLT")

process.source = cms.Source("PoolSource",
1 Like

Right, I think I had a misunderstanding about what’s going on here.

What I did was basically follow the download.sh file. I couldn’t use it directly as I’m on Windows, not Unix, but I wrote a Python equivalent:

from urllib.request import urlopen
import subprocess
import sys
import os

REPOS = "https://raw.githubusercontent.com/pypi-data/data/main/links/repositories.txt"

def repo_list():
    with urlopen(REPOS) as f:
        for repo in f:
            yield repo.decode("utf-8").strip()

if __name__ == "__main__":
    base_repo = sys.argv[1]
    subprocess.run(["git", "init", base_repo])
    for url in repo_list():
        remote_name = os.path.basename(url)
        subprocess.run([
            "git", "-C", base_repo,
            "remote", "add", remote_name, url,
        ])
    #subprocess.run([
    #    "git", "-C", base_repo,
    #    "fetch", "--multiple", "--jobs=4",
    #    "--depth=1", "--progress", "--all"
    #])

The last git fetch is commented out as I ran it manually.

That gave me (after about 12 hours) a directory all_of_pypi with no visible files in it. There’s a .git directory, of course, but that’s set as a hidden file on Windows. I got the aforementioned long filename errors during the git fetch, but I’ve ignored those for now and set core.longpaths in .git\config.

Now things get weird. git log says “fatal: your current branch ‘main’ does not have any commits yet” even though the progress report said “Writing out commit graph in 4 passes: 100% (23787348/23787348), done.”.

Running git rev-list --objects --all takes 30 minutes, and when I add --filter=object:type=blob -- $n, where $n is one of the “long filenames” from the error message (to check if it’s now readable), it hadn’t completed after 45 minutes. Unfortunately, that’s basically unusably slow for exploratory work.

As you can probably tell, I don’t know anything about commands like git rev-list. My normal usage is little more than git checkout, git add and git commit.

When I tried git rev-list --all --objects --filter=object:type=blob -- 'packages/' from your message, I got a list of OIDs, but the first ones had no filename, so I didn’t know if the grep was going to work. Picking an OID at random got me “fatal: git cat-file 152ba17a43062170fac44bd9f4ce414f0d22cb62: bad file”. I don’t know if OIDs are portable, or machine-specific, by the way…

I tried grepping for .py. But it looks like it might take ages to run (the 30 minutes I quoted above was while the download was still running, probably only 25% of the way through, so I could be looking at 2 hours for a full run). So I probably won’t do that…

Hmm, the fact that you publish datasets containing OIDs suggests that they are usable across machines. But if I run git cat-file blob 911cf51d642721cb930f449808e13cca23d29324 (one of the examples you give) I get

fatal: git cat-file 911cf51d642721cb930f449808e13cca23d29324: bad file

I suspect this means my download is somehow corrupt? Whether that’s the long file issue or something else, I’ve no idea how to tell. But I’m not really willing to do another download “just in case” - I want to know what’s wrong with this one first.

Hmm, one thought I had is that I could do git fetch --multiple --jobs=4 --depth=1 --progress --all again and see what happens. (It ran quickly to mirror-153 and then started taking its time for some reason, so I’ll post this and report further when it’s complete).

This is actually expected: what we are doing here is creating an empty repository and then adding a large number of git remotes to it. Each of those remotes has a main branch and a code branch, but your local main branch is (and should be) empty.

We do this to keep things all in one place, and let git handle things for us. Conceptually it’s similar to just running git clone for each repository and storing them in a different directory - perhaps that might be a better/simpler path to suggest in the future :thinking:.

Interesting - printing the data to the console is quite expensive, and there are a lot of files. Are you printing the output to the console, or redirecting it somewhere, e.g. git rev-list --objects --all --filter=object:type=blob > NUL? I’d hope the latter is noticeably faster.

Yeah, the UX of these git commands is not great and they are not common ones. It’s clear I need to make this more fluid and automated.

The idea I currently have in my head is a single tool that lets you specify some kind of pre-defined filters (e.g. file extension), and it outputs the contents and information of matching files to stdout in a JSON format. It would handle all the “git stuff” under the hood, and let you just write some Python code to “do stuff with the contents”, e.g. just:

./py-data download-repos
./pypi-data --extension="*.py" | python parse_everything.py
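And the Python side could then be as simple as the sketch below (a sketch only: the tool and its output format don’t exist yet, so the one-JSON-object-per-line shape with oid/path/content fields is purely an assumption):

import ast
import json
import sys

# parse_everything.py - sketch of the consumer side, assuming the tool emits
# one JSON object per line with "oid", "path" and "content" fields.
for line in sys.stdin:
    record = json.loads(line)
    if not record["path"].endswith(".py"):
        continue
    try:
        ast.parse(record["content"])
    except SyntaxError:
        print(record["oid"], "failed")
    else:
        print(record["oid"], "success")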

Would this be a better exploratory interface? Or do you have any other suggestions/ideas?

They are indeed - the OIDs are just the SHA-1 hash of the file contents (plus a short git header), which is core to how git de-duplicates files.
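You can even recompute one locally to convince yourself they’re stable across machines:

import hashlib

# A git blob OID is the SHA-1 of "blob <size>\0" followed by the contents
# (for SHA-1 repositories, which is git's default object format).
def blob_oid(data: bytes) -> str:
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

print(blob_oid(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a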

The error you’re getting likely means the download is indeed corrupt somehow. I don’t have a Windows machine to test on, and I’m really not sure of the performance characteristics of git on Windows. I expect it’s more optimized for Linux/macOS :frowning_face:.

I would suggest running git fetch to see if this fixes things, but it might re-download a lot. However, potentially the issue is the overall idea of combining all the upstream repositories into a single git directory - this perhaps doesn’t work very well on Windows.

As a proof of concept, you could try cloning just the https://github.com/pypi-data/pypi-mirror-221.git repository into a single directory and working with that? If it works fine with a single clone on Windows, I expect that this is the way we should suggest cloning them.

Thanks so much for trying this by the way, this is exactly the kind of feedback I was looking to gather. I had an inkling that it wouldn’t work perfectly for everyone first time :joy:

I’ve heard that Windows has worse performance for git, both because Unixes are better optimised for many small files and because Git is optimised for the metadata (stat etc.) interfaces of Unix. I don’t have a source, though.

A

1 Like

@orf would scalar [1] be useful in this scenario? It claims to be targeted at large repositories etc.


  1. https://git-scm.com/docs/scalar


I didn’t get any output at all in the time I let it run, so console IO isn’t the issue. But also, the 30 minute run was piped into wc -l, so that definitely wasn’t IO related.

Git on Windows is known to be slow for some of the script-based commands, because it creates a lot of subprocesses and subprocesses are expensive in Windows.

Things I’d want to do would be, for example:

  • Find all pyproject.toml files and get the content, plus the project, version and file (presumably sdist) they are contained in.
  • Same for METADATA and PKG-INFO files.
  • Look at an individual project (yes, I can do this by downloading a sdist, but if I already have it in the repo, why do it again?)
  • Having found something interesting (like the fact that pootle seems to include a weirdly nested set of node_modules folders, which is presumably a node.js thing) look for other projects that do similar.

So filtering by project is useful. If I assume that the internal structure is packages/<projectname>/<filename> then I can construct lists of prefixes, so “get me stuff that matches one of the prefixes in this file” would be a good start. The killer, though, is “do this fast, as I may want to do it 20 times while I work out what I’m looking for”…
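Something like the sketch below is roughly the level of interface I mean for the prefix idea (prefixes.txt here is a hypothetical one-prefix-per-line file, with git rev-list --objects output piped in):

import sys

# Keep only "<oid> <path>" lines whose path starts with one of the prefixes
# listed in prefixes.txt (hypothetical file, one prefix per line).
with open("prefixes.txt") as f:
    prefixes = tuple(line.strip() for line in f if line.strip())

for line in sys.stdin:
    oid, _, path = line.rstrip("\n").partition(" ")
    if path.startswith(prefixes):
        print(oid, path)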

:frowning: That sucks. I’m re-fetching now, but I don’t know how good git is at error-correcting, so whether this will fix things I don’t know.

That took a not-unreasonable 20 minutes, and git lg looks good. git rev-list --objects --all | wc -l took 36 seconds and returned a value of 1473759. git cat-file worked fine, too.

What does the “lots of remotes” approach gain? Is compression better because everything is in the one repo? Obviously having to work out which repo contains which file(s) is a PITA, so usability is, let’s say, “worse” (although given my git knowledge, the bar is pretty low to start with :slight_smile:). But it addresses the main goal, which is to have everything locally.

One other thought - corruption issues aside, is there a git command to clone just one remote from the “big” repo into its own separate repo? That might be useful to save time and bandwidth re-downloading everything.

Now you tell me :slight_smile:

I expected to have problems, if only because you clearly built it on Unix, and will therefore have made a bunch of non-portable assumptions (because everyone does, whatever platform they build on). I’m a bit disappointed if “git is unusably slow on a repo of this size” is the main one. Although there are libraries for working with git repos (which I’m guessing don’t just shell out to git) so that might actually be perfectly solvable. As I said above, the key point here is that I now have a local copy of everything - accessing it may be a pain for now, but it’s still much better than downloading everything over and over again (which is something I’ve done in the past…).

One other thing I noticed: in the mirror-221 data, there are no .whl.metadata files (which are specified in PEP 658). Is this just because I’m unlucky with what’s in mirror 221, or is it a result of how you get the data? The files aren’t explicitly reported in the simple index, they are implied by the “core-metadata” attribute, so (like signatures) they are easy to not pick up by default. Although as they are just the extracted METADATA file from the corresponding wheel, there’s no particular reason to collect them separately, so omitting them is not an unreasonable choice.

1 Like

The main thing it gives you is not having to work out which repo contains which individual files - you can just ask git to give you an object and it just works :tm:. The compression isn’t actually better - the objects are only compressed within their own repositories. I thought it would just be easier to manage as a single repo.

I think there is - you can git clone file://foo/bar, and then specify the specific remote you want. But my git knowledge is lacking there unfortunately :sweat:.

There is this sub project which might be useful to you for this case? GitHub - pypi-data/pypi-json-data: Automatically updated pypi API data, available in bulk via git or sqlite. It’s a static dump of all the PyPI JSON API metadata available via git or a sqlite file. I’m not sure if this contains all the same details that .whl.metadata contains, but it’s just a literal dump of whatever the PyPI JSON api returns.

Right now all the git repositories contain purely the raw contents of the packages, so the METADATA files are present in the repositories but nothing from the API is. Looking at the index of all the contents the project has parsed:

select count(*)
from 'pypi-dataset/*.parquet'
where archive_path LIKE '%/METADATA'
AND skip_reason='';

4577459

Some files within releases are not suitable for inclusion: binary files, gigantic CSVs/text files, etc. These would cause the repositories to be a lot larger as these unique files can’t be effectively compressed. But basically all the METADATA files are present in the repositories.

Thanks for this! This gives me a good feel for what would be useful. I’m going to work on handling these issues better with some improved tooling. I’ll add a comment here when I’ve got something worth testing, but I think I have a good idea of how to handle all these use cases for you and give a nice, fast, exploratory interface.

Overall I think there are a few different uses we can cover:

  • I want to “grep” through package contents to find interesting things, filtering by various attributes and explore
  • I’ve found interesting things, I want to read all the interesting things and do something with them (parsing, etc)
  • I want to extract a subset of the code (by package, date, version, file) to disk, or otherwise move it out of git

You can do all of this via git, but as you’ve found the interface is not great and it’s confusing. So I’ll wrap this in a tool to manage the various details of finding objects and interacting with git.

1 Like

It’s not, but that’s not a problem, as I have my own version of that project/data.

I think what you have here is right, I just happened to notice the .whl.metadata files weren’t included when looking at something else, so I thought I’d ask.

One thought I had was that it wouldn’t be that hard (he said, optimistically) to put the file listing in something fast-access like a database, along with the OIDs, and then layer a pyfilesystem virtual filesystem on top of that, using one of the git libraries to fetch blobs by OID. That would allow the PyPI data to be accessible as a straight filesystem by Python code at least.

Once I have the data downloaded properly, I’ll look at doing something like that, I think.
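As a first cut of the database half, I’m imagining something like this sketch (the table layout, file names and the pygit2 lookup are all assumptions at this point):

import sqlite3
import sys

import pygit2

# Sketch: load `git rev-list --objects --filter=object:type=blob --all` output
# ("<oid> <path>" lines on stdin) into sqlite, then resolve a blob via pygit2.
conn = sqlite3.connect("pypi-files.db")
conn.execute("CREATE TABLE IF NOT EXISTS files (oid TEXT, path TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS files_path ON files (path)")
rows = (line.rstrip("\n").split(" ", 1) for line in sys.stdin if " " in line)
conn.executemany("INSERT INTO files VALUES (?, ?)", rows)
conn.commit()

# Look a file up by path and read its contents straight from the packfile.
repo = pygit2.Repository(pygit2.discover_repository("pypi-mirror-221"))
row = conn.execute(
    "SELECT oid FROM files WHERE path LIKE '%/pyproject.toml' LIMIT 1"
).fetchone()
if row:
    print(repo[row[0]].data.decode()[:200])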

1 Like

One thing I spotted on the re-fetch was a few 408 (timeout) responses. So maybe that’s why I had some of the original corruptions. Re-fetching a few times might fix this, I’ll see how it goes.

This would be quite interesting to do, for sure! I’d love to see if that’s workable. @AA-Turner suggested scalar, which might do what we need. I’ve never worked with anything like that before though.

In the meantime I’ve whipped up this that I hope might solve some of your initial problems:

If you download the Windows release from the releases page, you can use it to extract files based on their path or contents. This assumes you’re cloning each repository to a separate directory:

./pypi-data.exe extract ~/pypi_repositories/ ~/output_dir/ "*/pyproject.toml" --contents="django"

Unique files are written out to a directory, with their OIDs in the path, suffixed by the file name:

$ head 799/c7/799c71e601402e5678dbe050ef35bfaae9574411.pyproject.toml
[tool.poetry]
name = "django-restit"
version = "4.1.29"
description = "A Rest Framework for DJANGO"
authors = ["Ian Starnes <ians@311labs.com>"]

The tricky part of all of this is duplicates and file paths. It’s not really feasible to extract the contents to the filesystem using the same directory/file name as the project, and extracting duplicates is redundant.

It has some initial work on simplifying parsing as well, by writing objects to JSON via stdout for you to process with Python, or grep/whatever.

I’ll think about how to improve this later: maybe a virtual filesystem is the way to go for exploratory stuff, but for parsing “lots of things” I feel there must be a way with some lower overhead.

But for now I hope this helps!

After the second fetch completed, it decided to auto-pack. That’s now been running for hours, and it’s got to 119M objects and still going. I’ve no idea how long it’ll take, but at this point I’ve had it with waiting. I think the only realistic option is to give up on the multi-remote repo, and clone each repo individually.

I can copy the data from the multi-remote repo to a local repo, using

git -C <old repo> push <new-repo> \
    remotes/pypi-mirror-N.git/main \
    remotes/pypi-mirror-N.git/code

but it leaves the new repo with a slightly different structure than a new clone, and I’m not sure refreshing would work properly. So I’m going to just go with starting from scratch.

One thought I did have - am I right in thinking that because the mirrors are time-based, I’ll only ever need to fetch new data for the latest mirror, as once a new mirror is created, all the older ones become static? If so, that makes the updating process for a multi-repo setup a lot easier.

OK, so the following works for Windows.

  1. I created a directory pypi-mirror, and a subdirectory repos under that.
  2. Now, git clone each repository from this list into repos/pypi-mirror-N. This takes a long time but can be done bit by bit. I used a Python script (shown below) to do them in batches. I just edited the slice of the full list each time to grab a new batch. The script could be made better, but it did the job for me.
  3. In Powershell, do dir .\repos\pypi-mirro* | % { git -C "$_" config --local core.longpaths true } to set the longpaths option for each repo. This only takes a couple of minutes.
  4. Now create a subdirectory objects and run dir .\repos\pypi-mirro* | Foreach-Object -Parallel { git -C "$_" rev-list --objects --all | Out-File -Encoding UTF8 (Join-Path objects $_.name)}. This creates an object list for each repo. This takes about an hour to run, but once done, you can find the OID for any file you want just by running grep on the contents of the objects directory (or with the small Python helper sketched just after this list).
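The helper I have in mind is nothing clever - just a substring search over the listings (utf-8-sig because Windows PowerShell’s Out-File -Encoding UTF8 can add a BOM):

from pathlib import Path
import sys

# Search every listing in objects/ for paths containing a substring and
# print "<repo>: <oid> <path>".
needle = sys.argv[1]  # e.g. "pyproject.toml"

for listing in Path("objects").iterdir():
    with open(listing, encoding="utf-8-sig") as f:
        for line in f:
            oid, _, path = line.rstrip("\n").partition(" ")
            if needle in path:
                print(f"{listing.name}: {oid} {path}")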

Fetching new data is just dir .\repos\pypi-mirro* | Foreach-Object -Parallel { git -C "$_" fetch }. This takes about 30 seconds if there’s nothing to fetch.

I’m assuming that if you get any new data on a fetch, you should rebuild the appropriate objects file.

The repos and objects directories together take up about 375 GB right now (just under 40 GB for the objects files, which you could probably omit if you were comfortable with git rev-list and didn’t mind slow searches).

And that seems to be it! Now that I have this, all I need to do is work out how to extract some interesting data from it :slightly_smiling_face: Thanks so much for doing this, @orf - it should be a super valuable resource!

Edit: I forgot to add the script. I’ve made it a gist, at PyPI downloader for py-code.org · GitHub