Idea: selector packages

We somewhat frequently see problems like this one arise, and proposals like Archspec (and I’ve seen more that have never made it public). But I honestly can’t see any of them being integrated into every install tool that we may care about.

Essentially, the problem comes down to this: based on something about the target environment, users should select a different wheel.

To use the example in the first link above, they should have selected fluidsim[pythran, mpi] if they have Pythran and MPI support, and fluidsim[purepy] otherwise. Other examples we’ve seen include GPU-specific packages and packages that depend on the availability of certain CPU features.

The fundamental problem is that users can’t just “install (package)” - they have to read some documentation and go and figure out which of a set of packages they require and then specify that name. (See https://pytorch.org/get-started/locally/ for a very concrete example.) This prevents specifying the package as a dependency, since you cannot reliably acquire it by name alone.

So here’s an alternative proposal:

A selector package is a wheel tagged with a new platform (e.g. “mypackage-1.0.0-selector.whl”) that indicates it is a selector.

Selector packages are not installed into the target environment and do not specify any install dependencies (meaning they are trivial to resolve when specified as a dependency of other projects).

During overall dependency resolution (at some point that I need help finding), the selector package is extracted and executed in the target environment’s interpreter. It prints a set of requirements to the console, which is captured and fed back into the dependency resolution process in place of the selector package. The selector package files are then deleted. (Update: It’s probably better to always use the latest selector package and pass in the requested version, as this means that finding the selector only requires the name and not full dependency resolution.)

This allows a reasonable amount of flexibility to choose a specific package based on what is currently (or about to be) installed, and anything that can be determined through the standard library or vendored code. The resulting environment does not depend on the selector package, which means if you freeze and reproduce the env elsewhere, you’ll bypass it completely.

Structurally, this would mean that anyone using a selector package will actually have multiple packages, e.g. “pytorch” (the selector), “pytorch-cpu”, “pytorch-cuda-9.2”, etc., where one of the latter is required to actually get the pytorch module. In some ways, it operates similarly to extras, and also similarly to environment markers, but allows people to handle the more complex cases that those don’t.

Things mentioned above that I don’t have strong preferences about:

  • exactly how we mark a package as a selector package (I suggested a platform tag)
  • how the package is executed (executing a well-known .py filename with package name/version/other reqs on argv seems reasonable)
  • exact format of the requirements (obviously needs to be specified though)
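
To make the mechanics concrete, here is a minimal sketch of what a selector script could look like, assuming the argv convention floated in the list above (package name and requested version as arguments, one requirement per line on stdout). The `-cpu`/`-cuda` suffixes, the `has_cuda` probe, and the function names are all hypothetical; none of this is an existing pip feature.

```python
import ctypes.util
import sys


def has_cuda() -> bool:
    """Hypothetical probe: True if a CUDA runtime library can be located."""
    return ctypes.util.find_library("cudart") is not None


def select_requirement(name: str, version: str, cuda_available: bool) -> str:
    """Map the generic name onto a concrete (hypothetical) variant package."""
    variant = "cuda" if cuda_available else "cpu"
    return f"{name}-{variant}=={version}"


def main(argv: list) -> None:
    # Assumed convention: argv[1] is the package name, argv[2] the version.
    name, version = argv[1], argv[2]
    # The installer would capture this line and resolve it in place of
    # the selector package itself.
    print(select_requirement(name, version, has_cuda()))


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv)
```

Because the script only uses the standard library, it could run in the target interpreter before anything else is installed, which is the constraint described in the proposal.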

Thoughts?

+1 on the general approach of trying to find a creative solution to these sorts of issue. They do come up relatively frequently, and we don’t really have a good answer right now.

However (you saw that coming :wink:) this proposal sounds like it would add some non-trivial complexity to Python’s dependency resolution process (which already has an issue in that it’s not something that is documented anywhere, it’s mostly a pip internal mechanism, although we are making some inroads into breaking it out into reusable library code). And algorithmically, not all resolution algorithms can handle new dependencies being injected part way through the process - so this would limit options if we ever wanted to explore better (faster, more reliable, better error reporting) algorithms.

Also, the “ask the user” approach is problematic in non-interactive situations like CI, so we’d need to specify how the proposal would work in those situations.

And finally, this is in direct contradiction of the direction we’re trying to move in, where dependency resolution data (name, version and dependencies) is statically available from the package index, and does not require execution of package code to determine.

I’m very much in favour of exploring this idea further, but it does have some fairly substantial hurdles it will have to address if it’s going to be usable in practice.

Also, we need to remember that 99% (yes, made up number :slightly_smiling_face:) of packages don’t need this sort of mechanism, so we need to take care not to end up with a solution that penalises the masses for the sake of the few.

I agree w/ @pf_moore, this seems like it would make dependency resolution far more complex, and would exacerbate the current “packages can have dynamic dependencies” issue that setup.py produces, and that we’re trying to move away from.

I’d be interested in seeing what the most common use-cases we’re trying to solve are. My suspicion is that the large majority of them would be projects attempting to publish support for multiple GPU architectures.

I think those projects would probably be better served by adding a platform compatibility tag for GPUs instead, and giving pip the ability to detect GPU architectures (via a common shared library, of course).

The remaining use cases would probably only be a small fraction of projects that would use this (which as @pf_moore already mentioned, is very small as-is) so overall I don’t see this as having a good balance between “how many users we can help” and “how much effort it would take”.

Yeah. Unfortunately, I can’t think of another way to handle it. Maybe always require selector packages to use the latest but then pass in the requested version when invoking them? That way they can be filtered out up-front and converted into a concrete list before the real dependency resolution starts. (Edit - Added this idea to the original post so people who read that and skip this one can appreciate the idea.)

Yeah, I’m definitely aware of this. But ultimately, unless the interpretation of this static data becomes complicated enough to handle all the cases we need, this will never be feasible. And you can say “all dependency metadata on PyPI is static”, and it will really mean “PyPI gives you the slowest packages and if you want the real ones you have to go somewhere else”.

A target-specific preprocessing step to turn some requirements into more specific ones seems like a workable compromise.
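
As a rough illustration of that preprocessing step, the sketch below rewrites selector requirements into concrete ones before any real resolution starts. It assumes a hypothetical registry mapping selector names to callables; a real tool would presumably run the selector in the target interpreter instead. The requirement parsing is deliberately naive (only `name==version` or a bare name).

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical: a selector takes the requested version ("" if unpinned)
# and returns the concrete requirements that replace it.
Selector = Callable[[str], List[str]]


def split_requirement(req: str) -> Tuple[str, str]:
    """Split 'name==version' into (name, version); version may be empty."""
    name, _, version = req.partition("==")
    return name.strip(), version.strip()


def expand_selectors(
    requirements: List[str], selectors: Dict[str, Selector]
) -> List[str]:
    """Replace every selector requirement with its concrete output."""
    expanded = []
    for req in requirements:
        name, version = split_requirement(req)
        selector = selectors.get(name)
        if selector is None:
            expanded.append(req)  # ordinary requirement, passed through
        else:
            expanded.extend(selector(version))
    return expanded


# Example: a selector that always picks the CPU variant.
selectors = {"pytorch": lambda version: [f"pytorch-cpu=={version}"]}
print(expand_selectors(["requests==2.0", "pytorch==1.0.0"], selectors))
# prints ['requests==2.0', 'pytorch-cpu==1.0.0']
```

The resolver itself never sees the selector name, so it would not need to handle dependencies being injected part way through.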

Also, one avenue which never seems to get adequately explored is the possibility of having separate indexes serving versions of projects for a given GPU architecture. There are issues that would need to be solved there (not least being that pip currently doesn’t allow prioritising one index over another, and changing that is a non-trivial alteration in philosophy) but it’s IMO something that should be explored more openly. (My personal suspicion is that it’s an unpopular idea because it requires “someone” to invest in setting up the relevant infrastructure).

Some are hardware, some are other desired dependencies, some are CPU features, some are pre-existing dependencies, some are (spoken) language-specific, some are external datasets or models, some are as-yet unimagined.

This is what I meant by “unless the interpretation of this static data becomes complicated enough”. If you want to be responsible for PEP-specifying every single exception, you’ll quickly get buried under the work (e.g. manylinux). Allowing an escape hatch like this helps people get their own stuff working without having to defer to something “official”.

More important is cross-index dependencies, so that I can put a stub package on PyPI that pulls my real one from another server and other dependencies from PyPI itself.

This would also help people use their own storage for large packages, as exceptions on PyPI aren’t being approved right now for some reason.

I don’t understand why you would need that. pip install not-a-stub other third --extra-index-url https://where.the.stub.would.point.to should work just fine, with no need for a stub on PyPI.

It doesn’t remove the need for users (and transitively users-of-those-users) to find your documentation to find your URL.

Plus I suspect people would prefer “package X from index Y” rather than “all my packages potentially from index Y”.

Just a remark that detecting CPU architectures could be useful for the whole scientific Python community. I mean, some speedup could be obtained by having more specialized wheels for the most popular packages of the scientific Python ecosystem.

So the argument that 99% of packages do not need such a mechanism may not be so relevant (even if it is surely true, it may not be once you count the packages depending on packages for which such mechanisms could be useful).

However, there may be simpler/better ways to do that.

The platform compatibility tag already considers CPU architecture: https://www.python.org/dev/peps/pep-0425/#platform-tag

And scientific Python projects already publish distributions for each of these architectures, e.g. https://pypi.org/project/numpy/#files

Or are you suggesting something more fine-grained?

Yes, I wasn’t clear. More fine-grained. I meant using advanced CPU instructions to get the perf that one gets when recompiling with -march=native but without recompiling. I know that for some extensions, it can really improve the performance (sometimes with drawbacks but…).

I think the biggest challenge for any kind of fine-grained auto-detection of system attributes is that you don’t actually know if the system you’re installing on is identical to the system you’ll eventually run on. E.g. just because you ran docker build on a system purchased in 2020 with support for all the latest SIMD instructions doesn’t mean that you want the docker image to only work on CPUs manufactured in 2020.

Agreed, but that’s so far out of the hands of the package developer that there’s nothing they can do about it, or that we (as pip/PyPI/whoever) can do about it. We don’t officially support relocatable environments, even within Docker containers, as far as I’m aware.

The one we can help with is that if the library developer built it with certain SIMD instructions, or with a choice of dependencies, they can release a wheel like that and users installing into an environment that has those dependencies can acquire it. Right now, all wheels have to be lowest-common-denominator (or “fat”) once you get beyond the platform tag, whether it’s CPU or MPI or something else.

This problem and these ideas are not new:
https://mail.python.org/pipermail/distutils-sig/2013-December/023238.html

Numpy is one of the most common dependencies on PyPI and the numpy project was reluctant for some time to distribute wheels on Windows because of the inability to ship variants based on the level of SSE provided by the host CPU. In the end numpy switched to providing binaries built with OpenBLAS instead of ATLAS meaning that there was no longer a need to have these variants. This is in some sense distributing a “fat” wheel except that the multiple variants are inside the vendored dependency.

There was nontrivial work (not done by me!) in making that possible, including fixing bugs and improving threading support in OpenBLAS. There were other reasons for wanting to switch to OpenBLAS, so it wasn’t done simply to put wheels on PyPI. Putting wheels on PyPI was not as obviously a high priority at the time because most scientific users were already used to the idea that pip didn’t work for scientific packages and used Canopy/Anaconda etc. instead. The current success of the pip/PyPI ecosystem for many communities owes a lot to numpy making this work: I wouldn’t use pip and related tools now if they were still unable to install numpy, scipy, matplotlib… all the other stuff that depends on numpy.

It still remains the case that many projects need to be able to provide variants for all sorts of things that can’t/shouldn’t be catered for by enumerating all possible environment features as static tags. The solution that numpy came up with was part-luck rather than a generally applicable approach for others to follow.

We do support them though, in the sense that it works fine today and tons of people’s workflows depend on that. The rule isn’t “we only care about things that are officially listed as supported”, it’s “we care about not breaking our users”.

I think the biggest incremental improvement packaging could do here would be to implement a workable Provides: mechanism, so that downstream packages could e.g. depend on a generic tensorflow, and then individual environments could fulfill that using tensorflow-generic, tensorflow-nvidia, tensorflow-avx512, etc.
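
A toy sketch of the lookup this would imply, assuming each candidate distribution declares a list of names it provides. The `candidates` mapping and `find_providers` are hypothetical illustrations, not how any current tool behaves.

```python
from typing import Dict, List


def find_providers(wanted: str, candidates: Dict[str, List[str]]) -> List[str]:
    """Return distributions that are named `wanted` or declare they provide it."""
    return sorted(
        dist
        for dist, provides in candidates.items()
        if dist == wanted or wanted in provides
    )


# Hypothetical metadata: distribution name -> names it claims to provide.
candidates = {
    "tensorflow-generic": ["tensorflow"],
    "tensorflow-nvidia": ["tensorflow"],
    "numpy": [],
}
print(find_providers("tensorflow", candidates))
# prints ['tensorflow-generic', 'tensorflow-nvidia']
```

Choosing between several providers is the hard part, and this sketch leaves it to the environment or the user.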

So, Provides-Dist exists already, but isn’t supported well by tools (that’s what the documentation says, and I can confirm this is true for pip, at least).

Is the only problem that no-one has stepped up to do the work to implement this? I suspect not, but if it is, and if your assessment of the value of such a feature is correct, then it sounds like a good candidate for a funded piece of packaging work.
