You can now download PyPI locally

I’ve been in contact with them, and they suggested a split repository approach. GitHub hosts a lot of code, and as long as an individual repository doesn’t cause significant impact to their service or is in breach of their TOS I don’t believe it’s an issue.

That being said: if the actual concept proves valuable and GitHub objects to this, you can host ~360 GB of data on S3 for less than $10 per month.

I ran this overnight last night (well, technically it’s still running, in the “writing objects” phase). I got a lot of messages of the form

warning: unable to access 'packages/Pootle/Pootle-2.7.0.tar.bz2/Pootle-2.7.0/pootle/static/js/node_modules/webpack/node_modules/watchpack/node_modules/chokidar/node_modules/anymatch/node_modules/micromatch/node_modules/regex-cache/node_modules/is-equal-shallow/.gitattributes': Filename too long

This is on Windows. Having done some research, it looks like you not only need the “Enable long paths” setting switched on (which is documented in the Python setup docs and I believe can be done in the Python installer) but you also need to set core.longpaths in git’s configuration.

I haven’t tested core.longpaths yet, as the initial download is still running (and I don’t know if I’ll need to do it again!) but the linked StackOverflow question mentions “some limitations”, so it may have problems…
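
To see in advance which entries are likely to hit this limit, a rough check along these lines might help. It is only a sketch: it assumes the classic Windows MAX_PATH limit of 260 characters (which includes the terminating NUL), and the function name is made up for illustration.

```python
import ntpath

MAX_PATH = 260  # classic Windows limit, counting the terminating NUL


def paths_over_limit(root, relative_paths, limit=MAX_PATH):
    """Return the relative paths whose full Windows path would not fit in `limit`."""
    too_long = []
    for rel in relative_paths:
        full = ntpath.join(root, rel.replace("/", "\\"))
        # A path of `limit` characters or more leaves no room for the NUL.
        if len(full) >= limit:
            too_long.append(rel)
    return too_long
```

Feeding it the paths from an object listing, with the checkout directory as `root`, gives a list of files that would need long path support.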

Hey @pf_moore, thanks for giving it a go. Documenting these kinds of projects has never been my forte, and while the website touches on what I’m about to write below, it’s only implicit there and not well described. So I’m going to explain things a bit more clearly here to get you and anyone else started, then move that onto the website later.

I’m also aiming to provide some tooling to make this a bit easier and more fluid, but right now you need to do something like what is described below:

tl;dr: don’t check out the code, just clone the repo.

Let’s say you want to parse all the code from 2023-07-23 to 2023-07-28. You clone the repository like so:

git clone https://github.com/pypi-data/pypi-mirror-221.git

This clones the repository but only checks out the main branch, so the package code is never written to your filesystem. All the data in the code branch is still downloaded and kept in the packfile:

❯ ls -la pypi-mirror-221/.git/objects/pack/ 
.r--r--r--@  41M tom  1 Sep 09:58 pack-6b7a152d6928490ed4c3538726520875ec6f76d8.idx
.r--r--r--@ 1.7G tom  1 Sep 09:58 pack-6b7a152d6928490ed4c3538726520875ec6f76d8.pack
.r--r--r--@ 5.9M tom  1 Sep 09:58 pack-6b7a152d6928490ed4c3538726520875ec6f76d8.rev

The intuition here is that this packfile can be treated as essentially just a tarball. It’s too big and complex to unpack (as you’re finding!) but we can still access the contents without unpacking it. Below we’re listing all files (blobs) that end with .py and taking the first one:

❯ git rev-list --all --objects --filter=object:type=blob --all -- 'packages/' \
    | grep ".py\$" \
    | head -n1
46f8cb309e77d660d222b4554ea6b277b2b4b84c packages/AHItoDICOMInterface/AHItoDICOMInterface-0.1.3.1.tar.gz/AHItoDICOMInterface-0.1.3.1/AHItoDICOMInterface/AHIClientFactory.py

With the object ID (46f8cb309e77d660d222b4554ea6b277b2b4b84c) we can read it:

❯ git cat-file blob 46f8cb309e77d660d222b4554ea6b277b2b4b84c
"""
AHItoDICOM Module : This class contains the logic to create the AHI boto3 client.
...

This isn’t as easy to work with as plain old files on a filesystem, but we can string things together quite nicely. Let’s parse all unique Python files within this repository.

❯ git rev-list --all --objects --filter=object:type=blob --all -- 'packages/' \
    | grep ".py\$" \
    | wc -l
  595286

~600k files. git cat-file --batch reads a stream of OIDs from stdin and writes each object to stdout as a header line followed by its contents. The following Python program reads that format and passes the contents to ast.parse(), printing whether each OID succeeded or failed:

import ast, sys

fd = sys.stdin.buffer

while line := fd.readline().decode():
    # Input line example:
    # 46f8cb309e77d660d222b4554ea6b277b2b4b84c blob 819
    # Split this, then read 819 bytes from stdin.
    oid, oid_type, oid_len = line.split(' ')
    data = fd.read(int(oid_len))

    try:
        ast.parse(data)
    except Exception:
        print(oid, "failed")
    else:
        print(oid, "success")
    # Output is always followed by a trailing newline. Consume it.
    fd.read(1)

And we can chain it together like so:

❯ git rev-list --all --objects --filter=object:type=blob --all -- 'packages/' \
       | grep ".py\$" \
       | cut -d' ' -f1 \
       | git cat-file --batch \
       | python parser.py
48387012e4cd6229b1b02d9afb4a0e85c8189102 success
4827e4901e91395a2601c94dac01ff7252d1b720 success
...
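
One thing the parser above does not handle: when git cat-file --batch is given an OID it cannot find, it prints a two-field line of the form "<oid> missing" instead of a three-field header. A slightly more defensive reader might look like the sketch below. The function names are illustrative, not part of any tool mentioned in this thread.

```python
import ast
import io


def iter_batch(stream):
    """Yield (oid, type, data) tuples from `git cat-file --batch` output.

    Objects git cannot find are reported as "<oid> missing" and are
    yielded as (oid, "missing", None).
    """
    while header := stream.readline():
        fields = header.decode().split()
        if len(fields) == 2 and fields[1] == "missing":
            yield fields[0], "missing", None
            continue
        oid, obj_type, size = fields
        data = stream.read(int(size))
        stream.read(1)  # consume the newline that follows the contents
        yield oid, obj_type, data


def check_syntax(stream):
    """Return {oid: "success" | "failed" | "missing"} for every object."""
    results = {}
    for oid, _obj_type, data in iter_batch(stream):
        if data is None:
            results[oid] = "missing"
            continue
        try:
            ast.parse(data)
        except Exception:
            results[oid] = "failed"
        else:
            results[oid] = "success"
    return results


# Try it on in-memory data instead of a real git process.
sample = io.BytesIO(b"abc123 blob 6\nx = 1\n\nabc456 missing\n")
print(check_syntax(sample))
```

In the pipeline above, `check_syntax(sys.stdin.buffer)` would take the place of the original loop.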

This is pretty slow and single-threaded, but you can parallelize this in a number of different ways. You can even use pygit2 and do it all within Python directly. For anyone brave enough, here is my undocumented, WIP rust tool for parsing everything using libcst. I aim to expand this to automate most of the steps here, letting you write your parsing code in Python and have it handle the rest.
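
As one illustration of the parallel route, the parsing step can be spread over several processes with the standard library. This is a minimal sketch, not the Rust tool mentioned above: it assumes the blobs have already been read into (oid, bytes) pairs, and the function names and defaults are made up.

```python
import ast
from concurrent.futures import ProcessPoolExecutor


def parse_blob(item):
    """Parse one (oid, source_bytes) pair and report whether it is valid Python."""
    oid, data = item
    try:
        ast.parse(data)
    except Exception:
        return oid, "failed"
    return oid, "success"


def parse_all(items, workers=4, chunksize=256):
    """Parse many blobs across `workers` processes, preserving input order."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_blob, items, chunksize=chunksize))
```

On Windows and macOS, where worker processes are spawned rather than forked, `parse_all` needs to be called from under an `if __name__ == "__main__":` guard in the calling script.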

The various datasets I publish can be used to find specific needles, or whittle down the number of files you need to read/parse. For example, you could look for specific large Python files:

In [1]: import pandas as pd

In [2]: from pygit2 import Oid

In [3]: df = pd.read_parquet("https://github.com/pypi-data/data/releases/download/2023-08-31-03-12/index-12.parquet")

In [9]: needles = df[(df.repository == 221) & (df.path.str.endswith('.py')) & (df.lines > 10000)]

In [14]: print(Oid(needles.hash.values[0]))
911cf51d642721cb930f449808e13cca23d29324

Then just read it via pygit2 or some other method:

❯ git cat-file blob 911cf51d642721cb930f449808e13cca23d29324 | wc -l
   31606

❯ git cat-file blob 911cf51d642721cb930f449808e13cca23d29324 | head -n 5
import FWCore.ParameterSet.Config as cms

process = cms.Process("HLT")

process.source = cms.Source("PoolSource",

Right, I think I had a misunderstanding about what’s going on here.

What I did was basically follow the download.sh file. I couldn’t use it directly as I’m on Windows, not Unix, but I wrote a Python equivalent:

from urllib.request import urlopen
import subprocess
import sys
import os

REPOS = "https://raw.githubusercontent.com/pypi-data/data/main/links/repositories.txt"

def repo_list():
    with urlopen(REPOS) as f:
        for repo in f:
            yield repo.decode("utf-8").strip()

if __name__ == "__main__":
    base_repo = sys.argv[1]
    subprocess.run(["git", "init", base_repo])
    for url in repo_list():
        remote_name = os.path.basename(url)
        subprocess.run([
            "git", "-C", base_repo,
            "remote", "add", remote_name, url,
        ])
    #subprocess.run([
    #    "git", "-C", base_repo,
    #    "fetch", "--multiple", "--jobs=4",
    #    "--depth=1", "--progress", "--all"
    #])

The last git fetch is commented out as I ran it manually.

That gave me (after about 12 hours) a directory all_of_pypi with no visible files in it. There’s a .git directory, of course, but that’s set as a hidden file on Windows. I got the aforementioned long filename errors during the git fetch, but I’ve ignored those for now and set core.longpaths in .git\config.

Now things get weird. git log says “fatal: your current branch ‘main’ does not have any commits yet” even though the progress report said “Writing out commit graph in 4 passes: 100% (23787348/23787348), done.”.

Running git rev-list --objects --all takes 30 minutes, and when I added --filter=object:type=blob -- $n, where $n is one of the “long filenames” from the error message (to check if it’s now readable), it hadn’t completed after 45 minutes. Unfortunately, that’s basically unusably slow for exploratory work.

As you can probably tell, I don’t know anything about commands like git rev-list. My normal usage is little more than git checkout, git add and git commit.

When I tried git rev-list --all --objects --filter=object:type=blob --all -- 'packages/' from your message, I got a list of OIDs, but the first ones had no filename, so I didn’t know if the grep was going to work. Picking an OID at random got me “fatal: git cat-file 152ba17a43062170fac44bd9f4ce414f0d22cb62: bad file”. I don’t know if OIDs are portable, or machine-specific, by the way…

I tried grepping for .py. But it looks like it might take ages to run (the 30 minutes I quoted above was while the download was still running, probably only 25% of the way through, so I could be looking at 2 hours for a full run). So I probably won’t do that…

Hmm, the fact that you publish datasets containing OIDs suggests that they are usable across machines. But if I run git cat-file blob 911cf51d642721cb930f449808e13cca23d29324 (one of the examples you give) I get

fatal: git cat-file 911cf51d642721cb930f449808e13cca23d29324: bad file

I suspect this means my download is somehow corrupt? Whether that’s the long file issue or something else, I’ve no idea how to tell. But I’m not really willing to do another download “just in case” - I want to know what’s wrong with this one first.

Hmm, one thought I had is that I could do git fetch --multiple --jobs=4 --depth=1 --progress --all again and see what happens. (It ran quickly to mirror-153 and then started taking its time for some reason, so I’ll post this and report further when it’s complete).

This is actually expected: What we are doing here is creating an empty repository, then adding a large number of git remotes to it. Each of those remotes has a main branch and a code branch, but your main branch is and should be empty.

We do this to keep things all in one place, and let git handle things for us. Conceptually it’s similar to just running git clone for each repository and storing them in a different directory - perhaps that might be a better/simpler path to suggest in the future :thinking:.

Interesting - printing the data to the console is quite expensive, but there are a lot of files. Are you printing the data to the console, or are you running something like git rev-list --objects --all --filter=object:type=blob > NUL? I’d hope this is faster for sure.

Yeah, the UX of these git commands is not great and they are not common ones. It’s clear I need to make this more fluid and automated.

The idea I currently have in my head is a single tool that lets you specify some kind of pre-defined filters (e.g. file extension), and it outputs the contents and information of matching files to stdout in a JSON format. It would handle all the “git stuff” under the hood, and let you just write some Python code to “do stuff with the contents”, e.g. just:

./pypi-data download-repos
./pypi-data --extension="*.py" | python parse_everything.py

Would this be a better exploratory interface? Or do you have any other suggestions/ideas?

They are indeed - the OIDs are just the SHA-1 hash of the file contents (prefixed with a short header containing the object type and size), which is core to how git de-duplicates files.
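
To make that concrete, here is a small sketch of how a blob’s OID can be computed without git. It assumes a repository using git’s default SHA-1 object format, and the function name is made up for illustration.

```python
import hashlib


def git_blob_oid(data: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with these contents."""
    # git hashes a short header ("blob <size>" plus a NUL byte) followed
    # by the raw file contents.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()
```

Because the hash depends only on the contents, identical files in different packages (or on different machines) end up with the same OID. This should agree with what `git hash-object` reports for the same bytes.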

The error you’re getting is likely because somehow the download is indeed corrupt. I don’t have a Windows machine to test on, and I’m really not sure of the performance characteristics of git on Windows. I expect it’s more optimized for Linux/macOS :frowning_face:.

I would suggest running git fetch to see if this fixes things, but it might re-download a lot. However, potentially the issue is the overall idea of combining all the upstream repositories into a single git directory - this perhaps doesn’t work very well on Windows.

As a proof of concept, you could try cloning just the https://github.com/pypi-data/pypi-mirror-221.git repository into a single directory and working with that? If it works fine with a single clone on Windows, I expect that this is the way we should suggest cloning them.

Thanks so much for trying this by the way, this is exactly the kind of feedback I was looking to gather. I had an inkling that it wouldn’t work perfectly for everyone first time :joy:

I’ve heard that Windows has worse performance for git, both because Unix filesystems are better optimised for many small files and because Git is optimised for the metadata interfaces (stat etc.) of Unix. I don’t have a source, though.

A

@orf would scalar [1] be useful in this scenario? It claims to be targeted at large repositories etc.

  1. https://git-scm.com/docs/scalar

I didn’t get any output at all in the time I let it run, so console IO isn’t the issue. But also, the 30 minute run was piped into wc -l, so that definitely wasn’t IO related.

Git on Windows is known to be slow for some of the script-based commands, because it creates a lot of subprocesses and subprocesses are expensive in Windows.

Things I’d want to do would be, for example:

  • Find all pyproject.toml files and get the content, plus the project, version and file (presumably sdist) they are contained in.
  • Same for METADATA and PKG-INFO files.
  • Look at an individual project (yes, I can do this by downloading a sdist, but if I already have it in the repo, why do it again?)
  • Having found something interesting (like the fact that pootle seems to include a weirdly nested set of node_modules folders, which is presumably a node.js thing) look for other projects that do similar.

So filtering by project is useful. If I assume that the internal structure is packages/<projectname>/<filename> then I can construct lists of prefixes, so “get me stuff that matches one of the prefixes in this file” would be a good start. The killer, though is “do this fast, as I may want to do it 20 times while I work out what I’m looking for”…
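
The prefix idea could be sketched like this, working on lines of the form "<oid> <path>" as printed by git rev-list --objects. It assumes the packages/<projectname>/<filename> layout guessed at above, and both function names are hypothetical.

```python
def project_prefixes(project_names):
    """Turn project names into path prefixes, assuming packages/<projectname>/..."""
    return tuple(f"packages/{name}/" for name in project_names)


def filter_objects(lines, prefixes):
    """Yield (oid, path) for object-list lines whose path starts with a prefix.

    Lines without a path (commits and root trees) are skipped.
    """
    for line in lines:
        oid, _, path = line.rstrip("\n").partition(" ")
        if path and path.startswith(prefixes):
            yield oid, path
```

Saving the rev-list output to a file once and running this filter over the file avoids asking git to walk the history on every query.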

:frowning: That sucks. I’m re-fetching now, but I don’t know how good git is at error-correcting, so whether this will fix things I don’t know.

That took a not-unreasonable 20 minutes, and git lg looks good. git rev-list --objects --all | wc -l took 36 seconds and returned a value of 1473759. git cat-file worked fine, too.

What does the “lots of remotes” approach gain? Is compression better because everything is in the one repo? Obviously having to work out which repo contains which file(s) is a PITA, so usability is, let’s say “worse” (although given my git knowledge, the bar is pretty low to start with :slight_smile:) But it addresses the main goal which is to have everything locally.

One other thought - corruption issues aside, is there a git command to clone just one remote from the “big” repo into its own separate repo? That might be useful to save time and bandwidth re-downloading everything.

Now you tell me :slight_smile:

I expected to have problems, if only because you clearly built it on Unix, and will therefore have made a bunch of non-portable assumptions (because everyone does, whatever platform they build on). I’m a bit disappointed if “git is unusably slow on a repo of this size” is the main one. Although there are libraries for working with git repos (which I’m guessing don’t just shell out to git) so that might actually be perfectly solvable. As I said above, the key point here is that I now have a local copy of everything - accessing it may be a pain for now, but it’s still much better than downloading everything over and over again (which is something I’ve done in the past…).

One other thing I noticed: in the mirror-221 data, there are no .whl.metadata files (which are specified in PEP 658). Is this just because I’m unlucky over what’s in mirror 221, or is it a result of how you get the data? The files aren’t explicitly reported in the simple index, they are implied by the “core-metadata” attribute, so (like signatures) they are easy to not pick up by default. Although as they are just the extracted METADATA file from the corresponding wheel, there’s no particular reason to collect them separately, so omitting them is not an unreasonable choice.

The main thing it gives you is not having to work out which repo contains which individual files - you can just ask git to give you an object and it just works :tm:. The compression isn’t actually better - the objects are only compressed within their own repositories. I thought it would just be easier to manage as a single repo.

I think there is - you can git clone file://foo/bar, and then specify the specific remote you want. But my git knowledge is lacking there unfortunately :sweat:.

There is this sub-project which might be useful to you for this case: pypi-data/pypi-json-data on GitHub. It’s a static dump of all the PyPI JSON API metadata, automatically updated and available in bulk via git or a sqlite file. I’m not sure if this contains all the same details that .whl.metadata contains, but it’s just a literal dump of whatever the PyPI JSON API returns.

Right now all the git repositories contain purely the raw contents of the packages, so the METADATA files are present in the repositories but nothing from the API is. Looking at the index of all the contents the project has parsed:

select count(*)
from 'pypi-dataset/*.parquet'
where archive_path LIKE '%/METADATA'
AND skip_reason='';

4577459

Some files within releases are not suitable for inclusion: binary files, gigantic CSVs/text files, etc. These would cause the repositories to be a lot larger as these unique files can’t be effectively compressed. But basically all the METADATA files are present in the repositories.

Thanks for this! This gives me a good feel for what would be useful. I’m going to work on handling these issues better with some improved tooling. I’ll add a comment here when I’ve got something worth testing, but I think I have a good idea on how to handle all these use cases for you and give a nice exploratory, fast interface.

Overall I think there are a few different uses we can cover:

  • I want to “grep” through package contents to find interesting things, filtering by various attributes and explore
  • I’ve found interesting things, I want to read all the interesting things and do something with them (parsing, etc)
  • I want to extract a subset of the code (by package, date, version, file) to disk, or otherwise move it out of git

You can do all of this via git, but as you’ve found the interface is not great and it’s confusing. So I’ll wrap this in a tool to manage the various details of finding objects and interacting with git.

It’s not, but that’s not a problem, as I have my own version of that project/data.

I think what you have here is right, I just happened to notice the .whl.metadata files weren’t included when looking at something else, so I thought I’d ask.

One thought I had was that it wouldn’t be that hard (he said, optimistically) to put the file listing in something fast-access like a database, along with the OIDs, and then layer a pyfilesystem virtual filesystem on top of that, using one of the git libraries to fetch blobs by OID. That would allow the PyPI data to be accessible as a straight filesystem by Python code at least.

Once I have the data downloaded properly, I’ll look at doing something like that, I think.
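
The database half of that idea could start out as small as the sketch below, which loads "<oid> <path>" lines into SQLite and looks files up by name. The schema and function names are assumptions made for illustration. The virtual filesystem layer and the blob fetching are not shown.

```python
import sqlite3


def build_index(conn, repo, lines):
    """Store the "<oid> <path>" lines from one repository's object list."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS objects "
        "(repo TEXT, oid TEXT, path TEXT, name TEXT)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS objects_name ON objects (name)")
    rows = []
    for line in lines:
        oid, _, path = line.rstrip("\n").partition(" ")
        if path:  # commits and root trees have no path; skip them
            name = path.rsplit("/", 1)[-1]
            rows.append((repo, oid, path, name))
    conn.executemany("INSERT INTO objects VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def find_by_name(conn, name):
    """Return (repo, oid, path) rows for entries whose final component is `name`."""
    cur = conn.execute(
        "SELECT repo, oid, path FROM objects WHERE name = ? ORDER BY path",
        (name,),
    )
    return cur.fetchall()
```

Storing the final path component in its own indexed column makes a lookup such as "every pyproject.toml" an index search rather than a scan over all paths.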

One thing I spotted on the re-fetch was a few 408 (timeout) responses. So maybe that’s why I had some of the original corruptions. Re-fetching a few times might fix this, I’ll see how it goes.

This would be quite interesting to do, for sure! I’d love to see if that’s workable. @AA-Turner suggested scalar, which might do what we need. I’ve never worked with anything like that before though.

In the meantime I’ve whipped up a small tool that I hope might solve some of your initial problems:

If you download the Windows release from the releases page you can use it to extract files based on their path or contents. This assumes you’re cloning each repository to a separate directory:

./pypi-data.exe extract ~/pypi_repositories/ ~/output_dir/ "*/pyproject.toml" --contents="django"

Unique files are written out to a directory, with their OIDs in the path, suffixed by the file name:

$ head 799/c7/799c71e601402e5678dbe050ef35bfaae9574411.pyproject.toml
[tool.poetry]
name = "django-restit"
version = "4.1.29"
description = "A Rest Framework for DJANGO"
authors = ["Ian Starnes <ians@311labs.com>"]
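
Judging only from the single example path above, the layout appears to be the first three characters of the OID, then the next two, then the full OID followed by the file name. The sketch below encodes that guess. It is inferred from this one example rather than from the tool’s source, so treat the rule and the function name as assumptions.

```python
from pathlib import PurePosixPath


def extracted_path(oid: str, filename: str) -> PurePosixPath:
    """Guess the output path for a blob: <oid[:3]>/<oid[3:5]>/<oid>.<filename>."""
    return PurePosixPath(oid[:3], oid[3:5], f"{oid}.{filename}")
```

Given an OID taken from one of the published datasets, this would let a script locate the extracted copy without searching the output directory.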

The tricky part of all of this is duplicates and file paths. It’s not really feasible to extract the contents to the filesystem using the same directory/file name as the project, and extracting duplicates is redundant.

It has some initial work on simplifying parsing as well, by writing objects to JSON via stdout for you to process with Python, or grep/whatever.

I’ll think about how to improve this later: maybe a virtual filesystem is the way to go for exploratory stuff, but for parsing “lots of things” I feel there must be a way with some lower overhead.

But for now I hope this helps!

After the second fetch completed, it decided to auto-pack. That’s now been running for hours, and it’s got to 119M objects and still going. I’ve no idea how long it’ll take, but at this point I’ve had it with waiting. I think the only realistic option is to give up on the multi-remote repo, and clone each repo individually.

I can copy the data from the multi-remote repo to a local repo, using

git -C <old repo> push <new-repo> \
    remotes/pypi-mirror-N.git/main \
    remotes/pypi-mirror-N.git/code

but it leaves the new repo with a slightly different structure than a new clone, and I’m not sure refreshing would work properly. So I’m going to just go with starting from scratch.

One thought I did have - am I right in thinking that because the mirrors are time-based, I’ll only ever need to fetch new data for the latest mirror, as once a new mirror is created, all the older ones become static? If so, that makes the updating process for a multi-repo setup a lot easier.

OK, so the following works for Windows.

  1. I created a directory pypi-mirror, and a subdirectory repos under that.
  2. Now, git clone each repository from this list into repos/pypi-mirror-N. This takes a long time but can be done bit by bit. I used a Python script (shown below) to do them in batches. I just edited the slice of the full list each time to grab a new batch. The script could be made better, but it did the job for me.
  3. In Powershell, do dir .\repos\pypi-mirro* | % { git -C "$_" config --local core.longpaths true } to set the longpaths option for each repo. This only takes a couple of minutes.
  4. Now create a subdirectory objects and run dir .\repos\pypi-mirro* | Foreach-Object -Parallel { git -C "$_" rev-list --objects --all | Out-File -Encoding UTF8 (Join-Path objects $_.name)}. This creates an object list for each repo. This takes about an hour to run, but once done, you can find the OID for any file you want just by running grep on the contents of the objects directory.

Fetching new data is just dir .\repos\pypi-mirro* | Foreach-Object -Parallel { git -C "$_" fetch }. This takes about 30 seconds if there’s nothing to fetch.

I’m assuming that if you get any new data on a fetch, you should rebuild the appropriate objects file.
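
For step 2, a batch cloning helper might look roughly like the following. This is an illustration only, not the script actually used here: the function names are made up, `urls` is assumed to be the list of repository URLs (each ending in .git), and core.longpaths is set at clone time via git clone --config rather than in a separate step.

```python
import subprocess
from pathlib import Path


def clone_plan(urls, repos_dir, start=0, stop=None):
    """Return (url, target) pairs for one batch, skipping targets that exist."""
    plan = []
    for url in urls[start:stop]:
        name = url.rstrip("/").rsplit("/", 1)[-1].removesuffix(".git")
        target = Path(repos_dir) / name
        if not target.exists():
            plan.append((url, target))
    return plan


def clone_batch(urls, repos_dir, start=0, stop=None):
    """Clone one batch of repositories, enabling long paths in each clone."""
    for url, target in clone_plan(urls, repos_dir, start, stop):
        subprocess.run(
            ["git", "clone", "--config", "core.longpaths=true", url, str(target)],
            check=True,
        )
```

Because existing directories are skipped, the same call can be repeated after an interruption. A clone that failed part-way may leave a directory behind, which would need deleting before a retry.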

The repos and objects directories together take up about 375GB right now (just under 40G for the objects files, which you could probably omit if you were comfortable with git rev-list and didn’t mind slow searches).

And that seems to be it! Now I have this, all I need to do is work out how to extract some interesting data from it :slightly_smiling_face: Thanks so much for doing this, @orf - it should be a super valuable resource!

Edit: I forgot to add the script. I’ve made it a gist, at PyPI downloader for py-code.org · GitHub

Thank you @pf_moore! I really appreciate you giving it a go. Please let me know when you do find something interesting.

Yep - it becomes static. I’ve built everything to be pretty reproducible, so if some serious issue is discovered with the data all the repositories can be deleted and re-created in a day or so. Barring that they won’t change.
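
If the numbering follows creation order (an assumption, not something confirmed in this thread), picking the repository to refresh could be sketched as below. The function name is hypothetical.

```python
import re


def latest_mirror(names):
    """Return the highest-numbered pypi-mirror-N name, or None if there is none."""
    numbered = []
    for name in names:
        match = re.fullmatch(r"pypi-mirror-(\d+)(?:\.git)?", name)
        if match:
            numbered.append((int(match.group(1)), name))
    return max(numbered)[1] if numbered else None
```

Any mirror created since the last update would still need an initial clone, and the previously newest one may deserve a final fetch in case it received commits before becoming static.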

I’m experimenting with a few things based on your feedback and some other things that have been suggested. Could you tell me if downloading https://b.py-code.org/packfiles/pypi-mirror-221.pack is noticeably faster or slower than cloning a repository?

Doing a simple wget on that single file took 3min 15s. I had a disconnect but wget restarted from where it left off with very little delay, so “just over 3 minutes” seems about right.
Cloning mirror-221 took 4 min 25s.

Both were reporting about 8MB/s, so the difference in time is probably git “admin”.

One benefit (for me, at least) of the git clone approach is that git (I assume) automatically handles retrying on errors. It can fail (I had 2 out of the 228 repos fail and I had to redo them) but it was pretty reliable. Straight downloads, I’d tend to do with something like requests (because I’m more comfortable in Python than with scripting CLI tools) and so I’d probably not have any error handling or restarting.
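
For what it’s worth, restartable downloads are possible with just the standard library, by asking the server for the remaining bytes with an HTTP Range header. The sketch below assumes the server supports range requests (the wget restart suggests it does). The function names, retry count and timeout are arbitrary choices.

```python
import http.client
import os
import time
import urllib.error
import urllib.request


def range_header(bytes_on_disk):
    """Return the headers needed to continue after `bytes_on_disk` bytes."""
    return {"Range": f"bytes={bytes_on_disk}-"} if bytes_on_disk else {}


def download(url, path, attempts=5, chunk_size=1 << 20):
    """Download `url` to `path`, resuming the partial file after each failure.

    `attempts` must be at least 1.
    """
    for attempt in range(attempts):
        size = os.path.getsize(path) if os.path.exists(path) else 0
        request = urllib.request.Request(url, headers=range_header(size))
        try:
            with urllib.request.urlopen(request, timeout=60) as response:
                # 206 means the server honoured the Range header. Any other
                # success status carries the whole file, so overwrite.
                mode = "ab" if response.status == 206 else "wb"
                with open(path, mode) as out:
                    while chunk := response.read(chunk_size):
                        out.write(chunk)
            return
        except urllib.error.HTTPError as error:
            if error.code == 416:
                return  # range starts past the end: file is already complete
            failure = error
        except (OSError, http.client.HTTPException) as error:
            failure = error  # timeouts, dropped connections, short reads
        if attempt < attempts - 1:
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
    raise failure
```

This does not verify the result, so comparing the final size or a checksum against a known value would still be worthwhile.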

Faster downloads is nice, but ultimately the initial download is going to be at least an overnight job, so it’s not that critical to shave a bit of time off (IMO at least). Keeping the “get updates” step simple, which is what git fetch does, is the most important thing IMO.

Yeah, perhaps I’m prematurely optimising. I was doing some experiments and found that repacking several repositories together can significantly reduce the size (by about 60%). I was thinking that if older repositories become immutable, perhaps I could repack them together and serve the data another way, which would bring the total size down to about 120GB, at the cost of extra complexity.

I might do this anyway, as it does reduce the dependency on GitHub and might be kinder to them.

Space optimisations, I’m very much in favour of! I misunderstood and thought you were simply looking at this to speed up downloads of the data as it stands. But having a smaller repository would be great - as you say, even if GitHub has said it’s OK, it still seems nicer not to use more than we have to. Plus, it’s kinder to my disk space, which I’d appreciate :slight_smile:

Just an update to this: repacking the repositories by chunks of 5 reduces the total size to ~160GB, which is pretty good IMO. I need to work out some kinks and document it, but the packfiles are accessible via:

https://b.py-code.org/packfiles/[start]-[end].{idx,pack,rev,objects.gz}

i.e:

This increments in chunks of 5, so the next one would be https://b.py-code.org/packfiles/5-10.idx.
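
Following that pattern, the URLs for a given chunk can be generated with a few lines of Python. This is a sketch based only on the pattern and the 5-10 example above. The function name is made up, and which start values actually exist on the server is not something it checks.

```python
BASE_URL = "https://b.py-code.org/packfiles"
SUFFIXES = ("idx", "pack", "rev", "objects.gz")


def packfile_urls(start, step=5):
    """Return the URLs for the chunk covering `start` to `start + step`."""
    end = start + step
    return [f"{BASE_URL}/{start}-{end}.{suffix}" for suffix in SUFFIXES]
```

For example, `packfile_urls(5)` yields the four 5-10 files, beginning with the .idx URL given above.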

Someone at Clickhouse also added the dataset to their public instance, so you can use it to run queries in your browser:

https://play.clickhouse.com/play?user=play#c2VsZWN0IHNwbGl0QnlDaGFyKCcuJywgc3BsaXRCeUNoYXIoJy8nLCBwYXRoKVstMV0pWy0xXSBhcyBleHRlbnNpb24sCiAgICAgICBjb3VudCgqKSBhcyB0b3RhbF9maWxlcywKICAgICAgIGZvcm1hdFJlYWRhYmxlU2l6ZShzdW0oc2l6ZSkpIGFzIHRvdGFsX3NpemUKIGZyb20gcHlwaQp3aGVyZSBza2lwX3JlYXNvbj0nJwpncm91cCBieSAxCm9yZGVyIGJ5IHN1bShzaXplKSBkZXNjCmxpbWl0IDEwOw==

I also added some example queries to the dataset page

Cool. I’m not sufficiently familiar with the “innards” of git, so I’ll wait until you’ve documented how to make these into an actual repository before I do anything with them.

FWIW, I’ve been doing some work analyzing projects with a pyproject.toml in their sdist. Not a huge amount to report so far, but depressingly there are 208 sdists where pyproject.toml isn’t even valid TOML. It took me a surprisingly long time to work out why my queries were aborting part way through… :slightly_frowning_face:

That’s out of 590,526 sdists with a pyproject.toml, which is a vanishingly small percentage, but it does act as a reminder of just how defensive tools have to be when processing PyPI data.
