Torrent client in PyPI for package/binary distribution

Inspired by a discussion on GitHub about why CadQuery (a Python package) isn’t available through PyPI.

I’ll just quote the explanation they gave:

We’d all love to see PyPI-hosted packages for CadQuery. No one sane is opposed to that. Unfortunately, the core issues here are all on PyPI’s side: e.g.,

  • Distributing large binary payloads as separate downloads
  • API to get dependencies without full download

Without significant movement from the official PyPI community on those issues (which seems unlikely, because there’s no consensus as to what the high-level solutions to those issues even are), it’s unclear whether CadQuery can meaningfully do anything here.

Full comment here.

Well, that’s unfortunate. Apparently one big reason why large binaries aren’t allowed to be distributed through PyPI is that hosting is expensive: PyPI spends about 1.5 million USD per year on hosting alone (according to the comment linked above).

So I feel like two of PyPI’s challenges could be solved at once. Those challenges are:

  1. Spending 1.5 million USD on hosting per year (and growing), which is not sustainable regardless of the large binary distribution discussion.
  2. Not allowing large binaries to be distributed with Python packages, because bandwidth is limited and expensive.

So…

Why not build a torrent client into pip and offer torrents as one of PyPI’s download channels? So: you keep PyPI the way it is (including the hosting, the hard-coded repo URLs and the “normal” way of downloading), but you add the option to find packages through torrent indexes and download them from within pip. Instead of (just) hosting packages, PyPI could also host magnet links, torrents and/or links to other indexes/repositories of torrents that contain the package(s).

This would allow:
– Users of pip to experience zero difference in how pip is used (except that old, obscure or large data packages are more likely to still be seeded by someone somewhere, and can be downloaded through a torrent with no extra user interaction).
– PyPI to start saving bandwidth costs, because little by little more packages will be downloaded through the torrent network instead of 100% from PyPI servers.
– Volunteers who know how to seed a torrent to start hosting Python packages, just because they’re nice like that. Anybody could do that from any device. Unless I’m missing something, the barrier to entry for hosting Python packages right now is far higher than for seeding a torrent.
– Creators/maintainers of packages to seed torrents for their own packages, including packages with huge binaries such as CadQuery.
– Businesses with spare bandwidth to support Python by simply seeding package torrents.
– conda to die?
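
To make the "extra download channel" idea a bit more concrete, here is a minimal sketch of how a client might try a torrent channel first and fall back to the normal download. Everything in it is hypothetical: `fetch_verified`, `torrent_channel` and `http_channel` are made-up names, the channels are stubs, and this is not how pip works today. The one assumption it leans on is that the index keeps publishing a SHA-256 digest for each file, so a copy obtained from an untrusted peer can still be checked.

```python
import hashlib
from typing import Callable, Iterable, Optional

# A channel takes a filename and returns the file's bytes, or None if it
# could not deliver the file (e.g. nobody is seeding it).
Fetcher = Callable[[str], Optional[bytes]]


def fetch_verified(filename: str, expected_sha256: str,
                   channels: Iterable[Fetcher]) -> bytes:
    """Return the first payload whose SHA-256 matches the expected digest."""
    for fetch in channels:
        data = fetch(filename)
        if data is None:
            continue  # channel had nothing, try the next one
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data
        # Digest mismatch: treat the copy as corrupt and keep looking.
    raise LookupError(f"no channel delivered a valid copy of {filename}")


# Stub channels standing in for a real torrent client and a real HTTP download.
payload = b"fake wheel contents"
digest = hashlib.sha256(payload).hexdigest()


def torrent_channel(filename: str) -> Optional[bytes]:
    return None  # pretend there are no seeders


def http_channel(filename: str) -> Optional[bytes]:
    return payload  # pretend this came from the regular index


data = fetch_verified("example-1.0-py3-none-any.whl", digest,
                      [torrent_channel, http_channel])
```

With this ordering, the regular servers are only hit when the torrent channel fails or returns a bad copy, which is where the bandwidth saving would have to come from.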

Disclaimer: I have almost no knowledge of how package management/distribution or PyPI works. Just firing off ideas here and trying to learn why this could (not) work.

One of the issues is the time required to investigate an idea, which members of the packaging community constantly complain they have little of. If you were to look into existing torrent packages (e.g. gallexis/pytorrent), see if you could extend pip’s downloader to handle magnet links, and set up a local package index with your magnet URLs (e.g. with packages not on PyPI), then you could give the community the experience to better investigate the idea.

I think it’s a great idea if the PSF could save so much money. I suggest using DHT for fast connections. We could use the excellent implementation by nitmir.

In preparation for a POC, I will set up a small package index website with a magnet link to ansible: magnet:?xt=urn:btih:cc7b5da37a3531f848e4019e95c5d8bd09d22a94&dn=ansible-4.5.0.tar.gz. For the first seed, I have added the x.pe parameter with my host and port. Meanwhile, the metadata is available in Mainline DHT; you can check with dht.get_peers(binascii.a2b_hex('cc7b5da37a3531f848e4019e95c5d8bd09d22a94')).
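
For anyone who wants to experiment with the magnet link above, here is a small standard-library sketch that splits it into the display name and the 20-byte info hash that the `get_peers` call expects. `parse_magnet` is a hypothetical helper written for this post, and it only handles hex-encoded `btih` hashes like the one above.

```python
import binascii
from urllib.parse import parse_qs, urlsplit

MAGNET = ("magnet:?xt=urn:btih:cc7b5da37a3531f848e4019e95c5d8bd09d22a94"
          "&dn=ansible-4.5.0.tar.gz")


def parse_magnet(uri: str) -> tuple[str, bytes]:
    """Return (display name, raw info hash) from a hex-encoded btih magnet URI."""
    parts = urlsplit(uri)
    if parts.scheme != "magnet":
        raise ValueError("not a magnet URI")
    params = parse_qs(parts.query)

    prefix = "urn:btih:"
    # 'xt' (exact topic) carries the info hash, 'dn' the display name.
    topics = [v for v in params.get("xt", []) if v.startswith(prefix)]
    if not topics:
        raise ValueError("magnet URI has no BitTorrent info hash")
    hex_hash = topics[0][len(prefix):]
    if len(hex_hash) != 40:
        raise ValueError("only 40-character hex info hashes are handled here")

    name = params.get("dn", [""])[0]
    return name, binascii.a2b_hex(hex_hash)


name, info_hash = parse_magnet(MAGNET)
```

The resulting `info_hash` is the same value as `binascii.a2b_hex('cc7b5da37a3531f848e4019e95c5d8bd09d22a94')`, so it can be passed to the DHT lookup directly.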

Now what’s missing is the extension of pip to handle the download. As I can’t do that, it would be great if somebody could work on that!

Do note that all bandwidth for PyPI is graciously donated by PyPI sponsors, so PyPI’s cost to the PSF comes from having one person on staff for infrastructure at the PSF which includes PyPI as part of that work (among other things).

I am also working on a distributed repository called floating cheeses, although the scope is a bit different: it only accepts wheels, and dependencies are resolved to one single version. Instead of BitTorrent, we use IPFS, which has HTTP gateways built in for use as a drop-in replacement for PyPI, e.g. https://bafybeiddji3vl36znz6cfts6sk67h7at3jx6uznwqiv7ahchywbfkir2qi.ipfs.dweb.link (it’s preferable to use a local gateway for security reasons, but I suppose most readers of this comment don’t have one installed).
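
As a rough illustration of what "drop-in replacement" could look like in practice, the sketch below builds a gateway URL from the content ID in the link above and the pip arguments that would point at it. `gateway_index_url` and `pip_install_args` are hypothetical helpers; the local gateway address `localhost:8080` is an assumption about a default IPFS node setup, and I am assuming the gateway root can be used directly as pip's index URL.

```python
CID = "bafybeiddji3vl36znz6cfts6sk67h7at3jx6uznwqiv7ahchywbfkir2qi"


def gateway_index_url(cid: str, gateway_host: str = "dweb.link",
                      scheme: str = "https") -> str:
    """Build a subdomain-style gateway URL of the form <cid>.ipfs.<host>."""
    return f"{scheme}://{cid}.ipfs.{gateway_host}"


def pip_install_args(package: str, index_url: str) -> list[str]:
    """Arguments for installing a package from an alternative index."""
    return ["pip", "install", "--index-url", index_url, package]


# Public gateway, matching the URL quoted above.
public_url = gateway_index_url(CID)

# Local gateway (assumed address), which avoids trusting a third-party host.
local_url = gateway_index_url(CID, gateway_host="localhost:8080", scheme="http")

args = pip_install_args("some-package", local_url)
```

The argument list can be handed to `subprocess.run(args)`; `some-package` is a placeholder for whatever the repository actually contains.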

The downside of a P2P network is that latency depends on how popular a piece of content is, so organizations using our repository will need to pin the contents locally. In other words, it’s a chicken-and-egg issue: it’s only fast when multiple people are using it, and it’s unlikely to be used if it’s slow. We do use the repo to self-host our own dependencies in CI, though, and it can take anywhere from a few seconds to a few minutes for a freshly spawned IPFS node to find the packages it needs.