You can now download PyPI locally

orf · September 2, 2023, 3:08pm

Thank you @pf_moore! I really appreciate you giving it a go. Please let me know when you do find something interesting.

Yep - it becomes static. I’ve built everything to be pretty reproducible, so if some serious issue is discovered with the data all the repositories can be deleted and re-created in a day or so. Barring that they won’t change.

I’m experimenting with a few things based on your feedback and some other things that have been suggested. Could you tell me if downloading https://b.py-code.org/packfiles/pypi-mirror-221.pack is noticably faster or slower than cloning a repository?

pf_moore · September 2, 2023, 4:04pm

Doing a simple wget on that single file took 3min 15s. I had a disconnect but wget restarted from where it left off with very little delay, so “just over 3 minutes” seems about right.
Cloning mirror-221 took 4 min 25s.

Both were reporting about 8MB/s, so the difference in time is probably git “admin”.

One benefit (for me, at least) of the git clone approach is that git (I assume) automatically handles retrying on errors. It can fail (I had 2 out of the 228 repos fail and I had to redo them) but it was pretty reliable. Straight downloads, I’d tend to do with something like requests (because I’m more comfortable in Python than with scripting CLI tools) and so I’d probably not have any error handling or restarting.

Faster downloads is nice, but ultimately the initial download is going to be at least an overnight job, so it’s not that critical to shave a bit of time off (IMO at least). Keeping the “get updates” step simple, which is what git fetch does, is the most important thing IMO.

orf · September 2, 2023, 5:44pm

Yeah, perhaps I’m prematurely optimising. I was doing some experiments and found that repacking several repositories together can significantly reduce the size (by about 60%). I was thinking that if older repositories become immutable perhaps I could repack them together and serve the data another way, which would bring the total size down to about ~120GB but at the cost of extra complexity.

I might do this anyway, as it does reduce the dependency on Github and might be kinder to them.

pf_moore · September 2, 2023, 5:51pm

Space optimisations, I’m very much in favour of! I misunderstood and thought you were simply looking at this to speed up downloads of the data as it stands. But having a smaller repository would be great - as you say, even if Github has said it’s OK, it still seems nicer not to use more than we have to. Plus, it’s kinder to my disk space, which I’d appreciate

orf · September 3, 2023, 1:40pm

Just an update to this: repacking the repositories by chunks of 5 reduces the total size to ~160GB, which is pretty good IMO. I need to work out some kinks and document it, but the packfiles are accessible via:

https://b.py-code.org/packfiles/[start]-[end].{idx,pack,rev,objects.gz}

i.e:

this increments in chunks of 5, so the next one would be `https://b.py-code.org/packfiles/5-10.idx.

Someone at Clickhouse also added the dataset to their public instance, so you can use it to run queries in your browser:

https://play.clickhouse.com/play?user=play#c2VsZWN0IHNwbGl0QnlDaGFyKCcuJywgc3BsaXRCeUNoYXIoJy8nLCBwYXRoKVstMV0pWy0xXSBhcyBleHRlbnNpb24sCiAgICAgICBjb3VudCgqKSBhcyB0b3RhbF9maWxlcywKICAgICAgIGZvcm1hdFJlYWRhYmxlU2l6ZShzdW0oc2l6ZSkpIGFzIHRvdGFsX3NpemUKIGZyb20gcHlwaQp3aGVyZSBza2lwX3JlYXNvbj0nJwpncm91cCBieSAxCm9yZGVyIGJ5IHN1bShzaXplKSBkZXNjCmxpbWl0IDEwOw==

I also added some example queries to the dataset page

pf_moore · September 3, 2023, 1:55pm

Cool. I’m not sufficiently familiar with the “innards” of git, so I’ll wait until you’ve documented how to make these into an actual repository before I do anything with them.

FWIW, I’ve been doing some work analyzing projects with a pyproject.toml in their sdist. Not a huge amount to report so far, but depressingly there are 208 sdists where pyproject.toml isn’t even valid TOML. It took me a surprisingly long time to work out why my queries were aborting part way through…

That’s out of 590,526 sdists with a pyproject.toml, which is a vanishingly small percentage, but it does act as a reminder of just how defensive tools have to be when processing PyPI data.