Another idea for additional package repositories

I had a quick thought while skimming through the PEP 759 thread.

Would it be possible to do something like the additional apt repositories in Debian/Ubuntu (and I am quite sure it is similar on other systems)? Typically, if I want to install a package that is not in the main/default/system package repository, I add an additional package repository via some copy/paste commands, and the next time I use apt I can install from the newly added repository.
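For illustration, the flow I have in mind looks roughly like this (the key URL, repository URL, and package name are all placeholders, not a real repository):

```sh
# Placeholder example of the usual "add an apt repository" recipe;
# every URL and name here is made up for illustration.
curl -fsSL https://example.org/apt/key.gpg \
  | sudo gpg --dearmor -o /usr/share/keyrings/example.gpg
echo "deb [signed-by=/usr/share/keyrings/example.gpg] https://example.org/apt stable main" \
  | sudo tee /etc/apt/sources.list.d/example.list
sudo apt update
sudo apt install some-package
```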

Maybe something similar can be done for Python repositories. The difference would be that I would introduce some simpleindex-like features in the mix, so that the user experience would look something like this:

```sh
pip add-repo --name=pytorch --url=https://download.pytorch.org/whl/cu118
pip add-rule --packages=torch,torchaudio,torchvision --repo=pytorch
pip install torch torchvision torchaudio
```

With this, pip would use https://download.pytorch.org/whl/cu118 exclusively for torch, torchaudio, and torchvision, but PyPI (or any other pre-existing rule) for everything else.

No idea if something similar has been discussed before or if it adds anything useful to the general discussion on this topic. Also, obviously, I did not think much about it; I wanted to write it down quickly before I forget.

I guess it does nothing about the fact that all package repositories are supposed to form one single namespace.

2 Likes

Is this different from using an alternative package index? Those exist, but there are some open questions with cross-index dependencies (see e.g. PEP 708, which is not fully implemented).

Yes, I know that it is not much more than --extra-index-url. But doesn’t the simpleindex-like functionality help mitigate some of the dependency confusion attacks? The idea is that once this is set up, the installer tool (pip) should never consult any index but the PyTorch index for torch and friends.
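To make the difference concrete, here is the behaviour today that opens the door to confusion attacks, versus what the rules would do:

```sh
# Today: with --extra-index-url, BOTH PyPI and the extra index are
# candidates for EVERY package name, including torch's dependencies,
# and pip has no notion of per-package routing or priority.
pip install torch --extra-index-url https://download.pytorch.org/whl/nightly/cu118

# Under the proposed rules, torch and friends would resolve only
# against the configured PyTorch index, and everything else only
# against PyPI (or whatever other rules are configured).
```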

I guess the installer tool’s cache would need to be cleaned up after each new configuration change.

I guess I should try to figure out whether something like this would have helped in a case like torchtriton.

I am probably misunderstanding the distinction, as I don’t encounter non-PyPI indexes in my day-to-day[1].

I guess what I’m wondering is: is this a change in the specification for packaging, or a change in UX for installers, so that they deal with extra indexes in a way that’s easier to configure? Or maybe some combination of both?


  1. other than using conda channels, which are a different flavor of the same idea ↩︎

FYI, uv is looking to add features very similar to this: https://github.com/astral-sh/uv/pull/7481
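The rough shape of it (pinning packages to a named index in pyproject.toml) looks something like this; treat it as a sketch, since the syntax in that PR may not be what ships:

```sh
# Sketch of the configuration proposed in uv PR #7481; section and
# key names may differ in the released version.
cat >> pyproject.toml <<'EOF'
[[tool.uv.index]]
name = "pytorch"
url = "https://download.pytorch.org/whl/cu118"
explicit = true  # only consulted for packages explicitly pinned to it

[tool.uv.sources]
torch = { index = "pytorch" }
EOF
```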

There are a lot of design choices to make here. This seems like something that needs to be solved and iterated on by front-end tools, not something to build a packaging standard around?

2 Likes

My gut feeling is that it would be better as a standard. If I configure my indexes and rules, I want them to apply to all Python package installers (pip, hatch, PDM, Poetry, uv, and so on) and to all my projects. And of course there could be some local overrides with additional repos and rules (a config file at the root of the project directory), but ideally still installer-independent.
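Purely as a strawman, such an installer-independent file might look like this (the file name, sections, and keys are all invented; nothing like this is standardized today):

```sh
# Strawman only: a hypothetical, installer-independent rules file at
# the project root; every name in it is invented for illustration.
cat > index-rules.toml <<'EOF'
[[repo]]
name = "pytorch"
url = "https://download.pytorch.org/whl/cu118"

[[rule]]
packages = ["torch", "torchaudio", "torchvision"]
repo = "pytorch"
EOF
```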

If I understand the torchtriton case correctly, the issue is that if you forgot even once to specify the PyTorch nightly package index server, you installed PyPI’s torchtriton instead, i.e. the malicious one. So if I configure my repos and rules to use the PyTorch nightlies server with uv, I do not want to get the malicious package when I install with pip; that would defeat the purpose of the security measure.

1 Like

My impression was that PEP 708 was intended to standardize the solution for that situation. I’m not sure what other behavior needs to be standardized for this to work more generally. This isn’t me saying there’s nothing to do; I just want to get the details spelled out.

Yeah, my understanding is PEP 708 lets the underlying project protect their users from dependency confusion attacks.

In terms of dependency confusion attacks, relying on users to pin projects to specific indexes (even if that config can be standardized, shared, and put in repos) means thousands of times more configuration needs to be set up: once for every user that has a transitive dependency on the project. Putting this onus on thousands of users is not ideal for preventing errors.

Right, but this idea is not just valuable for protecting users from dependency confusion, but for enabling them to selectively go to a specific index for specific packages.

How this suggestion differs from --extra-index-url is that an extra index URL is inspected for all packages, whereas this proposal uses an extra index only for specific packages. The references to simpleindex (worth checking out if you have any concerns around this area) reinforce that: it’s primarily a tool that provides a single virtual local index which redirects, based on name patterns, to whichever index you like. So you could easily configure torch* to come from one index and everything else from another.
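For anyone who hasn’t tried it, a simpleindex setup is a small TOML file plus a local server, along these lines (written from memory of the simpleindex README, so double-check the route syntax there):

```sh
# Illustrative simpleindex configuration and usage; route syntax is
# from memory of the simpleindex README and may need adjusting.
cat > simpleindex.toml <<'EOF'
# torch redirects to the PyTorch index (one route per name for
# brevity; torchaudio/torchvision would get similar routes).
"torch" = {source = "http", to = "https://download.pytorch.org/whl/cu118/torch/"}

# Everything else falls through to PyPI.
"{project}" = {source = "pypi", to = "https://pypi.org/simple/{project}/"}

[server]
host = "127.0.0.1"
port = 8000
EOF

simpleindex simpleindex.toml &

# pip only ever sees the single local index; the routing happens there.
pip install --index-url http://127.0.0.1:8000/ torch numpy
```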

The reason this is important is that (from the first referenced discussion) many of us are interested in making multiple indexes more approachable, so that packages may have their own index entirely. PyTorch is a perfect example of this, where virtually every release is going to exceed PyPI’s size limits. I could also imagine Linux distros having their own indexes for specific pre-built packages, and happily letting users fall back to PyPI for anything else.

The problem is that the UX for this does not exist. We have “all indexes are equal” or “set up your own index”. This is a proposal for a middle ground.

That said, I’m not a huge fan of the proposed design as it is. I don’t especially have a better one in mind, and I’m also not an expert on how the resolvers work (relevant because for UX purposes we’d probably want packages to be able to prefer dependencies from their own index, even if the user didn’t explicitly specify that, and then you get more potential conflicts). But the motivation is sound, and I do think it’s worth exploring this space.

3 Likes

Ah that explanation makes a lot of sense, thanks for spelling out the distinction here.

I don’t know if this is viable from the resolving perspective, but a currently-possible UX for this is simply “this index doesn’t provide anything else”, i.e. if an extra index only provides torch packages, you have to fall back to something else for everything it lacks.

I guess if there’s no priority on indexes, that still doesn’t work. Is that really the case?

In the current ecosystem, yes. uv provides an index priority system (I don’t know much about it beyond the fact that it exists) but pip treats all indexes equally[1].

Pip’s resolver works in terms of “candidates”, which are identified by project name and version[2]. You don’t have two candidates with the same name/version, and the resolver doesn’t know which index a candidate came from. So there’s no way to say “dependencies for a package should come from the same index”.

The question isn’t even well formed, in the general case. If package A came from index A_idx, and B came from B_idx, and both A and B depend on C, which index should you get C from?


  1. And because that’s the historical model, there’s a risk that it’s an unstated assumption behind existing standards. That’s why “just adding index priority” is hard, because we’d need to make sure we didn’t accidentally break something ↩︎

  2. Extras add a quirk to this, but not one that matters here ↩︎

3 Likes

Yeah, it really doesn’t generalise. And yet, it seems like a reasonable expectation that if I need spam in order for eggs to work, getting eggs should get the spam it expects and not something newer or incompatibly built from another index.

I guess the only real option is ordered prioritisation, which clearly doesn’t fit pip’s resolver right now, or package-by-package constraints, which I believe are not practical for users. (For the record, I believe most alternate indexes like Azure Artifacts and Artifactory allow specifying fallback/upstream indexes that are handled in order, so pip gets a single index view that encompasses a prioritised merge of the available packages. This is a good thing, as it’s doing something that pip can’t, but it’s a fairly heavy abstraction layer for something that feels like it could be handled in the client.)

I’m genuinely unable to understand how to interpret that expectation. If eggs needs spam, it should depend on what it needs, so that “something newer or incompatibly built” doesn’t satisfy eggs’ dependency specifier (or has wheel tags that don’t match the target environment).

It’s one of many possible abstractions. uv has three options for --index-strategy (I don’t know much about it beyond the fact that it exists) and I believe it has package-specific index specifiers as well. We could probably (given sufficient time and resources) add similar capabilities to pip, but it could overload pip’s already-complex option structure to breaking point. That’s why I’d rather see a lightweight way of letting users manage a proxy that handles this. I proposed An option to start a local index proxy when running pip · Issue #11771 · pypa/pip · GitHub as a way of doing that, but it didn’t get much traction, unfortunately.
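For context, uv’s three strategies are selected roughly like this (option names as I understand them; check uv’s help output for the authoritative list and semantics):

```sh
# uv's index strategy options, roughly: stop at the first index that
# knows the package (default), exhaust indexes in order, or consider
# all indexes and pick the best version.
uv pip install torch --index-strategy first-index
uv pip install torch --index-strategy unsafe-first-match
uv pip install torch --index-strategy unsafe-best-match
```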

If I were to install scipy from a particular index, it’s going to expect the copy of numpy it was built with. If that same index also provides numpy, it’s likely going to be the consistent one (if I’m setting up the index, it’s 100% the right one, but I can’t speak for what other people might do).

So if you grab scipy from my index and numpy from PyPI even though there was a numpy on my index, things could break. And if the user is only thinking “I want scipy”, they’re not going to extend their list of packages to specify all dependencies so that they come from the right place - they’re just going to want the right ones.

Index prioritisation works here, because in order to get my scipy, my index must’ve been higher than PyPI, which means my numpy is also higher. In the “all indexes are equal” model there’s no way to guarantee this.

Ah, OK. That’s a case where wheel tags and dependency specifiers aren’t fine-grained enough. Yes, index prioritisation is a workaround for that problem. I don’t know if we should promote it to being the “officially sanctioned solution”, though…

Of course, there is not much of a design yet. :D

Good, that is all I wanted to hear.

My gut feeling is to disagree here, but I rarely use the pypackaging-native kind of packages, so my gut feeling might be very wrong. Anyway, from my point of view, the “rules” should be followed strictly: if I configure things to get scipy from a specific index but nothing for numpy, then numpy should come from PyPI. Shouldn’t this kind of issue rather be solved by one of the following?

Also, in my mind, things would come with instructions anyway. I assume that somewhere I would read something like the following (reusing the hypothetical commands from my first post):
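```sh
# Hypothetical README snippet, reusing the made-up commands from my
# first post; none of these pip subcommands exist today.
pip add-repo --name=pytorch --url=https://download.pytorch.org/whl/cu118
pip add-rule --packages=torch,torchaudio,torchvision --repo=pytorch
pip install torch torchvision torchaudio
```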

The model I have in mind is, for example, the instructions to install Docker on Debian. I am not saying it is a perfect or even a very good model, but it does work.

I think this could get pretty annoying once you get a little deeper into such an ecosystem. If I need to install pynndescent from a specific index, I don’t want to have to make sure I specify a compatible index for scikit-learn, numba, and scipy, and all of their dependencies (i.e. numpy isn’t a listed dependency but it is necessary as well).

I think your comment above is the right answer: with something like selector packages, or external dependencies, these packages could self-describe what constitutes a compatible dependency. In that case, you don’t need to pin an index for every package, just for the thing you are directly installing.

I think this idea is dependent on solving that open question: how to specify the non-Python dependencies that are causing these subtle incompatibilities.

Does it get annoying for Linux distros? That’s a genuine question, as I don’t use Linux much, but my impression was that this is very similar to the model used for adding non-default repositories in something like apt-get.

If it works for Linux, I don’t see why it wouldn’t work here as well.

1 Like