Another idea for additional package repositories

jamestwebber · October 8, 2024, 6:28pm

I don’t know the answer either, as I also don’t have enough Linux experience to have dealt with it in that context.

edit: I would think one difference between the two ecosystems is that those tools are language-agnostic, and so dependencies can be anything^[1]. I don’t know if/how they can maintain the kind of compatibility guarantees that python packagers worry about.

whereas scipy can only specify numpy and not “numpy built with openBLAS” ↩︎

fungi · October 8, 2024, 6:53pm

It was especially problematic in Debian and its derivatives (where
APT tools originated) until package management grew sufficient
functionality to filter repository additions by package, so that
users could say “I only want to allow package names foo, bar and baz
to be installed from repository plugh, all other packages should
come from xyzzy instead.” Third-party repositories had a tendency to
supply multiple packages some of which may not have been necessary,
or as they grew stale shadowed updated versions in the main
distribution due to bespoke version numbering.

It’s still somewhat painful, insofar as distributions typically
supply a frozen snapshot of known-compatible versions of packages,
but as soon as you start mixing in packages from third-party
repositories there’s no longer any guarantee that the dep solver can
work out a coherent set of the specific packages you requested.

sirosen · October 8, 2024, 10:02pm

Disclaimer: I’m a long-time Debian/Ubuntu user, but not a package maintainer. So I bet I’m missing some detail in this post and I might not be fully accurate.

Another related evolution in the Debian world is that a few projects provide their own repositories which only host their very specific packages. (Fewer repos these days, in my experience, try to host big chunks of the package space.) They may intentionally shadow package names from the main repos, but only those which belong to the relevant project – so the expectation is that if you’re setting up that repo as a source, you want that shadowing.

In practice, as long as you’re working with well-behaved repositories, it works very well and there’s no need to configure special overrides.

For example, if you follow the instructions for setting up the postgres repo and those for setting up the google repo, then you can

apt install google-chrome-stable postgres-client-15

and expect a sane result.

There’s some complexity being hidden here. Those package repos can target different package versions to different platform versions, so that install command working seamlessly on Ubuntu 22.04 and 24.04 can be the result of careful maintenance by the Postgres and Google Chrome developers of different paltform-specific package channels.

I’d be hesitant to follow this model without a lot more thought and study. But I wouldn’t underestimate the user experience – although you can get yourself in trouble by setting up badly behaved repositories, you can also have a very good experience if you are conservative in what repos you enable.

steve.dower · October 14, 2024, 1:54pm

Conda follows this model as well, so we have some closer-to-home precedent and study. Dependency confusion also doesn’t exist at all in this model, so there’s another unforeseen argument in its favour.

It’s unfortunate that we don’t trust people to run well-behaved repositories

mwichmann · October 14, 2024, 3:52pm

The Fedora/Redhat family dnf tool supports priorities, there’s a short description here (scroll down a bit):

https://dnf.readthedocs.io/en/latest/conf_ref.html#repo-options

But as others have said, it can clearly get painful if extra repositories aren’t quite well-behaved.

sinoroc · October 14, 2024, 5:16pm

@mwichmann @sirosen

Without going into details, would you mind sharing a short example of what a case of “not well-behaved repository” you mentioned would look like?

Would those cases still exist if each repository is associated with an allow-list, where only the packages in the list are allowed to be considered (metadata, download, install) for that repository?

jamestwebber · October 14, 2024, 5:32pm

That would fix it but only if the allow-list is “well-behaved”. Who provides the list? If the user has to curate it, it’s not very scalable. If the repository provides it, it’s no more reliable.

steve.dower · October 14, 2024, 7:12pm

We have hundreds of them at work, where teams are allowed to create their own package feeds freely.

The main misbehaviour is to publish someone else’s package to your feed once and then never update it. For example, if my package spam requires numpy, so I put numpy on my feed today, without further action by me that version of numpy would still be the same one in ten years time. Quite likely my use of it is highly compatible (I deliberately didn’t use scipy as the example), and there’s nothing wrong with users getting a newer version, but because my feed shows up first it will take priority.

Today, doing this with Python wouldn’t matter because any other feed with a newer numpy would win (malicious or otherwise). Under PEP 708, doing this with a Python installer would cause an error for all my users, if there isn’t any metadata on my index telling installers they can ignore my numpy if they find another one.

The correct behaviour on my part would be to not put numpy in my feed at all, and so an installer must always specify a second index that provides it, but they aren’t going to silently get a very old version.

sinoroc · October 15, 2024, 5:32pm

In my mind, the list would be curated on the user side.

I have not considered the case where the list is curated on the repository side. Maybe there is some value to that approach. My first impression is that it would be pointless, either the repository has the package or it does not have it. Why would there be an allow-list on the repository side?

mikeshardmind · October 15, 2024, 5:46pm

Maybe I’m being too optimistic here, but this feels like something most users won’t run into if guidance is good, and there’s a relatively straightforward way to ensure users can always specify exactly what they want while making the “well-behaved case” easy

Configurable numeric priority per index, allowing equal priorities.
When priorities are equal, prefer latest version that contributes to a solved dependency graph.
The ability to bypass that priority and say “use this index for this dep”

jamestwebber · October 15, 2024, 9:34pm

I can’t think of a great reason for it either, but a user-side list doesn’t seem useful to me. You’d need to know the details about every repository you’re using to prevent any mistakes, which seems like a really unpleasant UX. It’s basically just pushing the whole problem of confusing indexes onto the user.

It probably makes sense to allow a lock file^[1] to allow per-package repository specification, but that should be a rare use-case, not the standard solution to package conflicts.

or env file or equivalent ↩︎

pradyunsg · October 22, 2024, 5:13pm

I’m sorry if I’m missing something – Isn’t this what an entire section in PEP 708 (PEP 708 – Extending the Repository API to Mitigate Dependency Confusion Attacks | peps.python.org) is already covering?

Sure, it’s not mandatory but (if I’m reading this correctly) that section is about solving the exactly this problem and how it is leaving the UX design for this to the individual tools with guidance that something should be provided to users.

If I’m not missing something catastrophically, I’m inclined to say that we don’t need to solve this problem in this forum and this is an implementation problem for pip/uv etc to deal with. If we can’t design something and need user feedbac/additional opinions, I think it’s useful to direct folks from this forum to the pip/uv issue tracker for the relevant issue.

steve.dower · October 22, 2024, 9:33pm

Theoretically, yes, but in practice I suspect not. Or at least not without requiring the user doing the configuration to have a deep understanding of the policies of each index involved (e.g. does package “spam” require “eggs” from the same index? Or can I get “eggs” from another index) and likely having to configure per-package (i.e. every transitive dependency, not just the specified requirements).

The user configuration in PEP 708 is unavoidable, but it’s not the ideal case. An index that is deliberately masking another index’s copy of a package is always going to result in a complete resolution failure under PEP 708,^[1] but the assumption made by the PEP (at least during the discussions) is that most masking will be unintended and users are going to have a single preferred feed that they want to use.

What I referred to as a well-behaved index earlier will not do any masking, and so PEP 708 won’t come into play at all. The misbehaving ones will cause UX issues (regardless of PEP 708).

Which is good! Because the alternative (without index prioritisation) is dependency confusion. ↩︎

pf_moore · October 22, 2024, 9:59pm

I think the point here is that it’s still a UI decision that shouldn’t be mandated by a standard. I don’t know which I’d find more frustrating in practice - from pip’s point of view having a standard demand that we implement a UI before we’ve had a chance to do our own design, or from uv’s point of view, having a standard come along and mandate something when they’ve been doing a lot of work designing something else that they think will suit their users^[1].

I’m happy that people are discussing options, but ultimately I agree with @pradyunsg - anything that will lead to actual changes will have to be discussed on the trackers of individual tools. There’s no “one size fits all” solution here (and IMO there shouldn’t be).

IMO what we need to do is get PEP 708 implemented. That’s an approved standard designed explicitly to address (some of) the problems with multiple indexes. Once that’s in place, then we can debate what extra measures, if any, might be needed (with the proviso that as I said above, just debating here won’t actually lead to real change).

There’s a lot of activity right now on uv’s design of index selection ↩︎