Introducing IPWHL: an alternative Python package repository

What is IPWHL?

The interplanetary wheels are platform-unique, singly-versioned Python
built distributions backed by IPFS. IPWHL aims to be a downstream wheel
supplier in a similar fashion to GNU/Linux distributions, while taking
advantage of a content-addressing peer-to-peer network to provide a
reproducible, easy-to-mirror source of packages.

On IPWHL, for each platform (architecture, operating system, and Python
implementation and version), there exists only a single built distribution.
The collection of these distribution packages is given as a single IPFS CID.
An installer can use solely this content ID and the package names to reproduce
the exact same environment on every platform.

The official IPWHL repository will provide exclusively free software. However,
deriving the repository should be trivial and is a supported use case.

Why?

IPWHL is created as a curated and decentralized Python package repository.

The PyPI repository is uncurated: anyone can publish a package there, which
enables typosquatting and other exploits. In contrast, by controlling which
packages can go into IPWHL, we significantly reduce the risk of distributing
malware. Decentralizing the repository with IPFS makes mirroring more useful
and less costly. Additionally, by making the wheels singly-versioned,
IPWHL is expected to save time during dependency resolution.

How to use IPWHL?

Setting up IPFS

IPFS has a well-documented installation guide.
It is worth noting that several GNU/Linux distributions and BSD-based OSes may
have already included it in their repositories. Afterwards, please follow the
IPFS quick-start guide. Some downstream go-ipfs packages may also contain an
init-system service to automatically manage the IPFS daemon. By default, the
daemon opens a local IPFS-to-HTTP gateway on port 8080.

Use it

To use the IPWHL repository, simply replace the PyPI URL with the repository's
URL through an IPFS gateway. For pip, you can do this by changing index-url:

pip config --site set global.index-url "http://localhost:8080/ipfs/$IPWHL_CID"

Mirroring a release is also as simple as pinning its CID:

ipfs pin add $IPWHL_CID

Feedback

IPWHL is in its early stages, so we would appreciate it if you let us know how
you feel about it.


Hi! I’m David Aronchick and I work at Protocol Labs. I’d love to help with this in any way!

david.aronchick@protocol.ai :slight_smile:

Is there a list of exactly what packages are present in this repository somewhere? I can’t seem to find one, other than going through all the folders in ~cnx/ipwhl-data: pkgs/ - sourcehut git.

There are not many packages for the moment, but the repo follows the simple repository API, so for instance the latest snapshot is at ipfs://bafybeia4earmozedviyibpkybhkei4d3qxwmrelflzt7lhzb64mjpsfuw4 (please only use gateways you can trust, e.g. self-hosted, for production though).

For someone who doesn’t understand a lot of the terms being used here, what is this exactly? It’s a “curated” index - who is doing the curation, and what are the criteria? Is that documented anywhere? A curated index presumably works on the basis that users trust the curation - how is that trust established here? I see no immediate reason why I should trust this index. I understand that by only allowing “approved” packages, certain exploits are avoided, but unless the approvers can be trusted, other exploits are possible (for example, a malware-infected copy of requests could be placed in the index, if the checks applied to ensure that only the requests authors can provide the code for requests are inadequate).

Also what does “singly-versioned” mean? Will it not be possible to get requests 2.27.0 from the index? What will happen to requests 2.28.0 when 2.29.0 is released? Will I be unable to pin my dependencies, or use packages that don’t support the new version of requests for some reason?

And finally, why the weird protocol? It appears that I need to have software installed on my PC to talk to the repository, rather than just using the standard https protocol. If this repository is only intended for people who have already bought into whatever IPFS is, it would be useful to make that clear up front.

Sorry to sound so negative - I’m genuinely confused as to what this is trying to offer.


Sorry to sound so negative - I’m genuinely confused
as to what this is trying to offer.

No worries, it didn’t come across as negative at all; you brought up
points that we should have addressed, but writing is hard since we don’t
know what readers expect to know, hence this topic here.

It’s a “curated” index - who is doing the curation,
and what are the criteria?

Someone you trust, and the criteria you deem fit. To make it
less confusing, let’s refer to the tool chain as IPWHL and the sample
index (e.g. git.sr.ht/~cnx/ipwhl-data) as floating cheeses.

Floating cheeses’ policies are not as strict as we wish to be,
for the moment they include:

  1. The project is valid, e.g. not a typosquatting attempt.
  2. The built distribution is either built by us or verifiably built
    from the version-controlled source (if only reproducible wheels
    were common in the wild!).
  3. The difference from the previous version does not contain
    any suspicious change.

A curated index presumably works on the basis that users trust
the curation - how is that trust established here?
I see no immediate reason why I should trust this index.

Trust is inherently social and should be established accordingly.
For example, you and I have some level of mutual trust because we have
interacted at length before (although the level should not be very high
since we have not communicated regularly for almost two years),
Huy is an IRL friend of mine so you two may (or may not) trust
each other, and the web of trust can also expand to people who
trust either of you.

I hear you, this is way too naïve, but so is choosing to trust
(an uploader of) an upstream library. Moreover, we humans
cannot keep track of that many others. Distributions, like Debian,
FreeBSD or floating cheeses, narrow the number of identities to trust
from tens or hundreds of thousands to one or a few.

I understand that by only allowing “approved” packages,
certain exploits are avoided, but unless the approvers can be trusted,
other exploits are possible (for example, a malware-infected copy
of requests could be placed in the index, if the checks applied
to ensure that only the requests authors can provide the code
for requests are inadequate).

To put it briefly, right now users should not trust
the floating cheeses: we maintainers are not security experts
(we hardly know what we are doing for rule 3) and rule 2 is only
recently applied, i.e. most of the wheels were not vetted.
We have tried to get publicity for months hoping for more experienced
folks to chime in.

Our ultimate goal is the adoption of IPWHL so that if trust
is established, it is trivial to securely and efficiently distribute.

why the weird protocol? It appears that I need to have software
installed on my PC to talk to the repository, rather than just
using the standard https protocol. If this repository is only intended
for people who have already bought into whatever IPFS is, it would be
useful to make that clear up front.

Our security measures in no way try to be perfect,
but a content-addressable delivery mechanism like IPFS is important
in a few ways:

  1. It should be easy to verify and modify an index, e.g. this wheel
    is really from here and I can replace it with my patched version there,
    while sharing the same CDN for the other ones. Like BitTorrent,
    more people using the same thing should make things faster,
    not demand more infrastructure.

  2. From one hash, e.g. QmQESYddXAEFiLUofuiNqFs7KdmNWY67NJwmme51y4pmux,
    one can derive the hash of every wheel in that index version
    (like the hashes of everything reachable from a Git commit).
    This is why the IPFS node should be run by someone you trust,
    ideally locally, like how TLS is handled in a browser or how
    pip computes hashes client-side. An organization can share a
    single node, but a compromised public one shows no signs.

    Since Git also uses a Merkle DAG, this analogy might help:
    with HTTPS you can make sure the repository you clone is the same
    as the remote one, but only the commit hash can verify it’s
    the one you want if the remote is compromised. (Hash collisions
    exist in theory, but with SHA-256 they are not yet an attack vector.)

Similar efforts toward content-addressable distribution are also being
experimented with for Nix and Guix (although not with IPFS).
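To make the "one hash pins down every wheel" point concrete, here is a toy sketch in Python. The names, layout and hashing scheme below are purely illustrative, not the real IPFS or IPWHL format; it only shows why a single root hash commits to the exact bytes of every file under it, so tampering with any wheel is detectable.

```python
import hashlib


def digest(data: bytes) -> str:
    """Hex SHA-256 of some bytes."""
    return hashlib.sha256(data).hexdigest()


# "Wheels" are leaves of the tree; the index root hashes the sorted
# leaf hashes, so the root commits to the content of every wheel.
wheels = {
    "foo-1.0-py3-none-any.whl": b"foo contents",
    "bar-2.0-py3-none-any.whl": b"bar contents",
}


def index_root(files: dict) -> str:
    leaf_hashes = sorted(
        digest(name.encode() + data) for name, data in files.items()
    )
    return digest("".join(leaf_hashes).encode())


root = index_root(wheels)

# Replacing any wheel's bytes changes the root hash, so a mirror
# serving a modified wheel can't reproduce the published root.
tampered = dict(wheels)
tampered["foo-1.0-py3-none-any.whl"] = b"malicious"
assert index_root(tampered) != root
```

Real CIDs are built from a Merkle DAG of chunked blocks with multihash encoding rather than this flat scheme, but the verification property is the same.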

Also what does “singly-versioned” mean?

Since it’s not possible to import multiple versions of a package at once,
which version of a package is included alongside a given collection of
other packages should be determined in advance.

Will it not be possible to get requests 2.27.0 from the index?
What will happen to requests 2.28.0 when 2.29.0 is released?

Each version of requests, if needed, will be in a different
index release. People usually don’t pin a version because they
like the number, but because it’s the first one they know to work.
Optimally, they should be provided with the best (often meaning
the most up-to-date) one that works.

Will I be unable to pin my dependencies, or use packages
that don’t support the new version of requests for some reason?

As mentioned earlier, instead of pinning some or every single package,
you pin the whole index. Back to the issue of trust, it’s a lot easier
to bump (or roll back) an index instead of each package individually.
From here, we wish to encourage upstream to not pin dependencies
in package definitions and facilitate reusability among packages.

While upstream can have an index version for development,
downstream can test across a wider range of releases. Say bar
depends on foo: maintainers of bar may pin bar 4.20 against
foo 6.9, but when foo 6.10 comes out, floating cheeses can push
the update to a testing index for users who like to live on the edge.
More importantly, if baz also depends on foo, and foo 6.10 is
an important (e.g. security) fix, downstream can push the update,
simultaneously effective for both bar and baz.

Such collective testing is not even possible with a warehouse like PyPI.
When foo 6.10 comes out, PyPI users have to find out for themselves
whether there is any incompatibility, in which case bar’s support
channels will be flooded with reports.

In short, IPWHL is a set of tools to, from a collection of bdists,
generate a single ID from which an index can be collaboratively
distributed and modified in a (hopefully more) secure and efficient way.


Just to add on here - IPFS just offers the CID and distributed storage mechanisms. There are many HTTP gateways from all your favorite folks, and CDNs (e.g. Cloudflare) that would cause no change for the end user.

I am very interested in how this works out. I believe this is a very suitable approach for centrally-managed teams that are using Python (e.g. I’m setting up something similar for my teams at work), and would love to see you succeed with this technique.

FWIW, you will definitely get surprised or concerned responses from people who don’t immediately see the tradeoffs being made here, so since you’re taking it public you’ll want to get those nice and clear. Things like “we have 17 teams all deploying to the same server and so they need identical dependencies” are very well served by having a single index with exactly the right dependencies and nothing more.

But you’ll get pushback from people who have struggled with centrally managed packages in the past, or have never worked in an environment where that level of control is taken seriously. Your proposal is great for these though, so don’t give up!


I’m still confused. Why are we referring to “the sample index” by a made up name? Is it not a real index that we can talk about, like PyPI? Is this announcement simply about a set of tools that will let people create a curated index, rather than being about an actual index?

Hmm, this sounds very similar to the Debian (or was it GPG?) “trust network” ideas that were around some time ago. I never really thought much of them (probably because I had no social contacts and no wish to attend “key signing parties”, so I felt left out :slightly_smiling_face:) If that is the sort of curation/trust model you’re talking about then OK, I guess, but it’s not for me, really.

Thanks, that gives me a better sense (I think!) of what IPFS is. Forgive me if I’m getting this totally wrong, but it feels like a BitTorrent style sharing mechanism, combined with blockchain-style (or git-style) provenance assurance. Again, not something I’m particularly a fan of.

I wish I knew what “content-addressable” means. I need to do some research, because at the moment my brain tends to treat it as “buzzword - do not trust”. But that’s my problem, not yours, and I’ll read up before trying to dig into how it applies here.

But in my mental model of an index, I don’t install everything, just bits that I want. If I’m installing just requests, I don’t want to get an old version just because some other package in the index says it needs an older version of requests.

What’s an index release? Does it equate to a different URL passed to pip’s --index-url option? That sounds like it would need quite careful management if it’s not to end up a mess.

For information, I don’t even know what a “CID” is (or what the acronym stands for). But see below, I’m not asking you to explain to me :slightly_smiling_face:

Ah, thanks Steve! Your post gave me a chunk of context that I was missing.

I think this is the main point here. My impression based on the responses so far is that I’m very definitely not in the target audience for this announcement. But that certainly wasn’t clear to me initially, and I’ve still got no idea who would be the sort of group who might be interested (so I’m not even going to be able to act as someone who can push potentially interested people in your direction).

I’ll back off now, as I think it’s clear that I can safely ignore this thread - at least for now :smile:


You’re referring to the PGP “Web of Trust” model. And no, Debian
doesn’t really rely on that. For an official package maintainer to
have their OpenPGP key added to the Debian Keyring (which is how
they authenticate package uploads), their key does need to have been
signed by another person whose key is already in the keyring as a
safeguard, since the Debian Developers are a globally distributed
community of volunteers. That’s the reason for their key signing
“parties” (fashionable types call them key “ceremonies” these days).

If your definition of trust is that you pay someone who pays someone
else for whom they’ll vouch in a contract witnessed by lawyers and
enforced by the courts of a government that gives you warm fuzzies
and is totally not out to exploit you, then yes a transitive trust
model is probably not for you. But ultimately, the entire idea of
“trust” is something which information security professionals debate
endlessly, and we have our own mailing lists dedicated to such
exciting topics.


Sorry about that! For anyone browsing through who wants definitions:

  • content addressed = a hash of the underlying content (usually built up from a merkle tree, similar to git) provides a unique identifier. The “content” now has an “address” and you just need a system to index and discover it (IPFS is one of many)
  • CID = content identifier = the summary hash
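For readers who prefer code to prose, here is a minimal illustration of the idea (a toy: real CIDs layer multihash and multibase encoding on top of this, and hash a Merkle DAG rather than raw bytes). The address is derived from the content itself, so the same bytes always get the same address and any change yields a different one.

```python
import hashlib


def toy_cid(content: bytes) -> str:
    """A stand-in 'content identifier': just the SHA-256 of the bytes."""
    return hashlib.sha256(content).hexdigest()


a = toy_cid(b"hello world")
b = toy_cid(b"hello world")
c = toy_cid(b"hello world!")

assert a == b  # identical content, identical address
assert a != c  # any change in content changes the address
```

This is why a mirror can't silently serve you altered content: fetching by address lets the client re-hash what it received and check it matches.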

I agree with Steve, this is a very interesting idea. The dependency solver cannot do the wrong thing if it has no choices!

We have also started to do something similar (but with versioned conda-pack tarballs/read-only envs rather than a pypi index) and it has been the most successful approach yet to centrally managing Python environments for (many) internal teams.

I would definitely sell this by starting with all the benefits of a curated + stripped to single version per project repository/index and then explain why IPFS is a good way to implement / manage / distribute this.


Not to mention trusting trust

The issue, though, is if your package needs different choices than what the index made—and how should the index make them in the first place?

This seems very similar in effect to the anaconda metapackage (as I’m sure you’re familiar with), in that it is a single solve tested and integrated together that makes these choices for you. This works well for a constrained set of packages within a specific ecosystem, and for writing scripts and data analysis workflows (as we’ve seen in practice), as well as with libraries and applications that are core parts of the solved set (such as your Matplotlib or our Spyder). However, this doesn’t always work so well for substantial libraries and applications with many dependencies or specific requirements (those for which this can be most beneficial), unless they are considered during the solve.

For example, let’s say you rely on a public Sphinx plugin that (for the moment at least) doesn’t yet work with Sphinx 4, but a critical bug fix was released in an update. How do you get that version of the package? Does it even get pushed to the index at all if it comes out after Sphinx 4 was first released? (Same deal with any other Sphinx plugin, library, tool or application dependent upon Sphinx: if it doesn’t have Sphinx compatibility from day 1, does it get immediately bumped off the index until it has it, or do you delay the release of Sphinx 4 on the index until some percentage of plugins are compatible? But who decides that, and how?)

So, this seems perfect for corporations, organizations and ecosystems wanting a consistent, centralized, integrated and known good set of tested packages for their projects, but perhaps not so much for individual developers or open source communities.

And, equally importantly, specify the benefits to whom, i.e. describe who this is for and whom it would most benefit, as it is a much better fit for some projects than others.


PEP 691: JSON-based Simple API for Python Package Indexes is a proposal to create a JSON-based simple API. Please give that a read and leave a comment on that topic if it will work for this use-case.

I’m assuming this is using something like Address IPFS on the Web | IPFS Docs to control the URL structure? Otherwise my understanding of IPFS and its CIDs would make structured URL formats impossible, thus not work with the simple API since there wouldn’t be the concept of a subdirectory.

You make the index by resolving the packages that are needed by the other packages you make available. This is what pip-like installers try to do on every single install, so all this does is perform that resolution once and then offer up only that set of packages.

The only way to get inconsistencies is for them to exist when the index is created, but in that case someone has already reviewed and okayed the inconsistency, so the user can ignore it. Under “normal” circumstances, an inconsistency shows up randomly when an update is published, and now your entire build is broken and you have to solve it yourself. (One of our team’s channels I follow at work hits this a few times a month; luckily they have a dedicated person to solve it.)

As a user, getting to say “I’m happy to avoid dealing with that but in exchange I only get updates once-a-<whatever my schedule is>” can be really attractive.


Sure, but the crucial element here is that either you need to be in control of the index, or at the very least your package needs to be part of the initial index solve. If not, there is no guarantee that any version of the index will satisfy the constraints of any particular package.

As you mention, that’s great for centrally-managed teams, as well as for a number of other use cases, and as I mention in the anaconda metapackage example, great for end-users of the packages included in the solve (particularly those who don’t want to deal with a myriad of package versions, e.g. scientists; it’s what we would always recommend people using Spyder through Anaconda do).

However, I wanted to highlight one class of use cases for which it doesn’t really work, to point out that this doesn’t seem to be trying to be a replacement for a public, general-purpose index like PyPI, which I think is the impression some people got at first. Being clear about that up front would avoid that kind of misplaced expectation.


Currently I am listing them for the latest release on my site and comparing the versions to the latest ones on PyPI.

You can derive your own repository, though if there are a lot of differences, it might be a lot of work.

As for how we make the choices: currently, we are selecting popular packages and ones we need for development, and we pick the latest versions possible.

This is a dilemma we often face, and one we resolve differently from case to case. Generally, we would prioritize the package that affects more people, in this case Sphinx. At the same time, we might consider alternative approaches, such as helping backport the bug fix to Sphinx 3 or getting the plugin to work with Sphinx 4.


So indeed IPWHL is a curated Python distribution. I really like this.

Some years back I wrote on the mailing list that I think there is a need for a larger Python distribution covering the needs of other (Linux) distributions, which could then reuse/package that Python distribution. For Haskell there is, for example, Stackage. In Nixpkgs we basically repackage the whole of Stackage. I wish we could do the same with Python.

