"Virtual" pypi indexes?

dhashby · December 5, 2024, 12:58am

Maybe it’s just me, but as a community we seem to be “spinning our wheels” on the topic of PyPI namespaces. We’ve seen lots of good discussion, but at least from my (outsider) view we don’t seem any closer to a consensus on how best to move forward on this important effort.

As I caught up on the PEP 752 discussion, an interesting thought struck me: what if PyPI supported “filtered” indexes that were API-compatible with the “main” API? Filters could take several forms - for example, maybe there’s an endpoint for packages that use “Trusted Publishers” or that have published attestation bundles. Users of a “filtered” endpoint wouldn’t see packages that don’t meet the filter criteria. Maybe eventually client-side tooling is updated to support --trusted-only type switches that automatically make use of the “virtual index” endpoint.
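To make the “filtered but API-compatible” idea a bit more concrete, here is a rough sketch (not a proposal for how PyPI would actually implement it). It assumes the project-list shape of the PEP 691 JSON Simple API (a `meta` object plus a `projects` list of `{"name": ...}` entries). The `filter_project_list` helper, the sample project names, and the `trusted` set are all made up for illustration.

```python
from typing import Callable


def filter_project_list(index_response: dict, is_allowed: Callable[[str], bool]) -> dict:
    """Return a new PEP 691-style project list containing only allowed projects.

    The response keeps the same top-level shape, so a client that can read the
    full index could read the filtered one unchanged.
    """
    return {
        "meta": dict(index_response["meta"]),
        "projects": [
            project
            for project in index_response["projects"]
            if is_allowed(project["name"])
        ],
    }


# Made-up sample data standing in for the full index.
full_index = {
    "meta": {"api-version": "1.0"},
    "projects": [{"name": "alpha"}, {"name": "beta"}, {"name": "gamma"}],
}

# Hypothetical: names PyPI knows to be published via Trusted Publishers.
trusted = {"alpha", "gamma"}

virtual_index = filter_project_list(full_index, lambda name: name in trusted)
```

The point is only that the filter changes which projects are listed, not the response format, which is what would let existing clients point at a “virtual” endpoint without modification.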

Extending this concept slightly, filters could also be leveraged for organizations/publishers/maintainers. So maybe if you only wanted to see packages published by NVIDIA you’d point to https://pypi.org/virtual/nvidia/simple/. Depending on how PEP 766 evolves, maybe you could even “pin” certain packages to certain “virtual” indexes, protecting (even more than some of the mitigations that have recently been implemented) against dependency confusion attacks.
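A minimal sketch of what “pinning” could look like on the client side, assuming the hypothetical /virtual/<publisher>/simple/ URL layout above. `PINNED_INDEXES` and `index_for` are illustrative names, not existing pip or PyPI features; the only real convention used here is PEP 503 name normalization.

```python
import re

DEFAULT_INDEX = "https://pypi.org/simple/"

# Hypothetical pins: normalized package name -> the only index it may come from.
PINNED_INDEXES = {
    "numba-cuda": "https://pypi.org/virtual/nvidia/simple/",
}


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def index_for(package: str) -> str:
    """Return the single index URL a client would query for this package."""
    return PINNED_INDEXES.get(normalize(package), DEFAULT_INDEX)
```

With this, `index_for("Numba_CUDA")` resolves to the NVIDIA virtual index and anything unpinned falls back to the default. Because a pinned package is only ever looked up in its pinned index, a same-named upload elsewhere would simply never be considered.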

Ultimately I think this could be a nice complement to PEP 766. I’m not proposing immediate changes to how the resolvers work, but I would love to see this become a building block that enables more explicit control over dependencies. This could also provide a way to transition from PyPI’s current “flat” namespace to explicit namespaces at some point in the future (I’m not proposing that now, just trying to be mindful of where this could take us).

Is this feasible? Or are there other considerations that would make this impractical?

msarahan · December 5, 2024, 2:00am

My reading of your idea is that it is kind of like private indexes, except that instead of building up indexes by adding distributions, you would instead start with all of PyPI and filter it to some subset. Is that accurate?

If that is indeed accurate, I’m not sure what this approach offers over existing index proxies/mirrors and privately hosted indexes. These mirrors/proxies are already one suggested way of getting index priority (you just implement it on the server side instead of the client side). Paul’s comment on pypa/pip issue #8606 (“index-url extra-index-url install priority order”) describes one way that devpi might serve this purpose.

PEP 766 sounds related in that virtual indexes may already be doing something like “index priority” when creating their contents. The downside of doing this server-side is that end users must run this server/index creator, which raises the bar for package management considerably. You can’t have NVIDIA-only or PyTorch-only indexes, or whatever, because there is no general solution for everyone, and once you start mixing indexes without any way to prefer one or the other, I predict there will be some cases where the installer is not doing what the user wants.

I don’t think a server-side solution will work. It is too much to ask users to run some separate tool when they want to install a package. There are two options for client-side specification of index preference:

1. PEP 766: coarse-grained, but also intuitive relative to a human notion of a trust hierarchy.
2. Per-package index specifications. This is useful for environment definitions and lockfiles, but I think it is overly tedious in the general day-to-day package installation case.

Maybe you’re thinking of a new, 3rd idea that is in-between 1 and 2, where you can alter the fall-through behavior in index priority on a per-package basis, without going all the way to per-package index specs. Is this more what you were thinking?
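To illustrate what that in-between option might mean, here is a toy sketch. It assumes an “index priority” model where indexes are consulted in order and the first one that has the package wins, plus a per-package switch that turns off fall-through. `pick_index`, `no_fallthrough`, and the index and package names are all hypothetical, and this is not how any current installer behaves.

```python
from typing import FrozenSet, List, Optional, Set, Tuple

Index = Tuple[str, Set[str]]  # (index name, names of packages it serves)


def pick_index(
    package: str,
    indexes: List[Index],
    no_fallthrough: FrozenSet[str] = frozenset(),
) -> Optional[str]:
    """Return the name of the index that should supply `package`, or None.

    `indexes` is ordered from highest to lowest priority. Packages listed in
    `no_fallthrough` may only come from the highest-priority index.
    """
    allowed = indexes[:1] if package in no_fallthrough else indexes
    for index_name, contents in allowed:
        if package in contents:
            return index_name
    return None


# Made-up example: an internal index ranked ahead of the public one.
indexes = [
    ("internal", {"corp-utils"}),
    ("public", {"requests", "corp-secrets"}),
]
```

Here `pick_index("requests", indexes)` falls through to "public", while `pick_index("corp-secrets", indexes, frozenset({"corp-secrets"}))` returns None rather than silently picking up a same-named package from the lower-priority index.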

dhashby · December 5, 2024, 2:09am

I guess I look at it as leveraging “PyPI’s knowledge” of packages and certain aspects of their provenance. That’s not something that private indexes and proxies can easily do. To the extent that this provenance is useful in choosing a particular package to install, that’s where I think it could be relevant to future changes to resolver behavior (whether in the form of what’s proposed under PEP 766 or elsewhere).

Does that help?

dhashby · December 5, 2024, 2:19am

(And you’re absolutely correct that you need some client-side syntax to “hint” which index should be used for a given package or set of packages. Some of that has been discussed in the PEP 752/755 threads, considering a syntax like nvidia::numba-cuda or numba-cuda by nvidia. If this concept is viable, then a future step would be to settle on a particular syntax to communicate those preferences.)
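Purely as an illustration of the first of those spellings, a client could split a publisher::package hint and map it to a virtual index along these lines. No such syntax has been agreed on; `parse_hint`, `index_url_for`, and the /virtual/<publisher>/simple/ layout are assumptions for the sake of the example.

```python
from typing import NamedTuple, Optional


class Hint(NamedTuple):
    publisher: Optional[str]  # None when no publisher was given
    package: str


def parse_hint(spec: str) -> Hint:
    """Split 'publisher::package' into its parts; a bare name has no publisher."""
    publisher, separator, package = spec.partition("::")
    if not separator:
        return Hint(None, spec.strip())
    publisher, package = publisher.strip(), package.strip()
    if not publisher or not package:
        raise ValueError(f"malformed hint: {spec!r}")
    return Hint(publisher, package)


def index_url_for(hint: Hint) -> str:
    """Map a hint to a (hypothetical) virtual index URL, or the main index."""
    if hint.publisher is None:
        return "https://pypi.org/simple/"
    return f"https://pypi.org/virtual/{hint.publisher}/simple/"
```

So `parse_hint("nvidia::numba-cuda")` yields publisher "nvidia" and package "numba-cuda", and a bare `numba-cuda` keeps today’s behavior of using the main index.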

msarahan · December 5, 2024, 2:20am

To be sure, the provenance is a kind of trust measure, and anything that can help the user express preference based on that trust over something like version will be helpful. If you provided a way to filter and split PyPI into trusted publishers and everything else, and then prioritize the trusted stuff, yes, that would be valuable.
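As a toy example of “trust over version”: if each candidate carried a trusted flag, the selection could sort on that flag first and on version second. The `Candidate` type, the `pick` function, and the sample data are invented for illustration, and versions are simplified to integer tuples.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    name: str
    version: Tuple[int, ...]  # simplified stand-in for a real version object
    trusted: bool             # hypothetical provenance flag from the index


def pick(candidates: List[Candidate]) -> Candidate:
    """Prefer trusted candidates; among equally trusted ones, take the newest."""
    return max(candidates, key=lambda c: (c.trusted, c.version))


candidates = [
    Candidate("example-pkg", (2, 0), trusted=False),
    Candidate("example-pkg", (1, 9), trusted=True),
    Candidate("example-pkg", (1, 8), trusted=True),
]
chosen = pick(candidates)
```

In this sample the trusted 1.9 release is chosen even though an untrusted 2.0 exists, which is the kind of preference a plain newest-version-wins rule cannot express.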

I spent a lot of time working on the Conda ecosystem, and this was always a weak spot. The packages on PyPI were provided by the teams working on the project. This made them official. Conda builds were always secondary - they were built from the source that was posted to PyPI, but they were not built by the project team. In some cases, the project team did help maintain the conda recipe, but it was not safe to assume that the conda packages were “blessed” by that team in any way. Some people would not use conda packages for this reason - the packages were not official enough.

dhashby · December 5, 2024, 2:25am

Yes, you’re absolutely correct. Even “trusted publishers” aren’t perfect, but they’re a huge step forward compared to the days of old. The Python ecosystem has come a long way in just the last few years when it comes to real improvements in software supply chain security, but there’s still plenty of work ahead.