Implementation variants: rehashing and refocusing

I see it like this:

  • NumPy uploads wheels to PyPI that bundle openblas and tags them as, say, numpy[openblas].
  • Most projects just require, say, numpy >= 2.0, i.e. they don’t care which kind of numpy they get.
  • A small number of projects (perhaps only scipy) want to reach into that internal BLAS library, so they upload scipy[openblas] wheels that require numpy[openblas].
  • There are no alternate NumPy or SciPy wheels on PyPI, but you can install numpy[mkl] from conda.
  • If you install numpy[mkl] and then pip install scipy, then pip cannot simply install the SciPy wheels that require numpy[openblas], so it either needs to replace numpy or build SciPy from source.
2 Likes

This perspective clears things up a lot; I was primarily thinking about PyPI. I do think that any feature like this has to consider how it will interact with PyPI, though–it seems possible for a mess of variants to be uploaded, and that could lead to installation headaches for novice users.

One thing I’m unclear on[1]–does the variant really only matter if you’re going to use BLAS internals, or is any compiled extension tied to a specific variant? Is it possible to build an extension that uses numpy internals but can be universally compatible with the variants? edit: and is doing this trivial, or does it require some careful design choices?

I think it’d be a better outcome if the installer just fails with an explanation, but isn’t that a significant change to behavior? If installers default to “only install wheels” this makes more sense (maybe that’s taken as a given, here).


  1. likely due to my inexperience with writing C libraries ↩︎

1 Like

Compiled extensions could use NumPy’s C API without knowing what BLAS library NumPy is using and would therefore be compatible with all variants. The issue is if they want to e.g. use the BLAS library directly i.e. bypassing both NumPy’s Python API and its C API.
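As a rough Python-level illustration of the same distinction (hypothetical usage, not something any particular package actually does): code that only goes through NumPy’s public API is variant-agnostic, while code that reaches for the vendored BLAS library directly only works with one particular build:

import ctypes, glob, os
import numpy as np

# Variant-agnostic: NumPy dispatches to whichever BLAS it was built against.
a = np.random.rand(100, 100)
b = a @ a

# Variant-specific: loading the vendored OpenBLAS directly only works with the
# PyPI (openblas) wheel of NumPy; a conda MKL build has no such file to find.
libs_dir = os.path.join(os.path.dirname(np.__file__), os.pardir, "numpy.libs")
candidates = glob.glob(os.path.join(libs_dir, "libopenblas*"))
blas = ctypes.CDLL(candidates[0]) if candidates else None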

An important case for this is NumPy and SciPy since they both want to have direct use of a BLAS library. Currently the wheels for both ship separate BLAS libraries so if you pip install numpy scipy then you have two separate BLAS libraries e.g. here in a venv I have:

$ ls -lh site-packages/*/libopenblas*
-rwxrwxr-x 1 oscar oscar 34M Oct 13  2023 site-packages/numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so
-rwxrwxr-x 1 oscar oscar 32M Oct 13  2023 site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so

That’s two separate shared libraries each weighing in at over 30MB. At runtime both of these libraries will even be loaded separately into memory simultaneously.

If SciPy were to use the NumPy wheel’s libopenblas.so rather than shipping its own then it would mean that the PyPI wheels for SciPy would only be compatible with the PyPI wheels for NumPy. Currently there is no way to express this requirement in wheel metadata so they instead ship redundant BLAS libraries. Note that this is only done for PyPI wheels: in every other distro openblas would be a separate non-Python library and numpy and scipy would both depend on it.
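Just to make concrete what “expressing this requirement” would mean: core metadata can already name an extra of a dependency, e.g. something like

Name: scipy
Requires-Dist: numpy[openblas] >= 2.0

but today an extra is just a set of additional requirements; nothing about it pins you to a particular build of the numpy wheel, so treat this as notation for the missing feature rather than something that works now.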

2 Likes

This is correct, and a pretty important procedural point to note.

We’re talking about a public specification here. A lot of this stuff can already be done “legitimately” with custom tooling or custom indexes.

But a public specification must include how PyPI works, and ultimately is largely going to focus on how PyPI works, because other indexes don’t require PEPs to allow divergent behaviours.

I think this feature is highly focused on people who aren’t downloading wheels from PyPI. I don’t think anyone is forgetting that. I think what is being forgotten, though, is that those people are a tiny minority of Python’s packaging system.

What’s important to me is that we support the users and package developers who need build variants, without harming the workflow for people who do simply download wheels from PyPI. At the moment, pip install numpy scipy sympy matplotlib torch pyflint just works (on my Python 3.12, Windows machine). This is an amazing achievement and is what we should make sure we preserve. Yes, it would be nice if users got optimised builds for their environment. It would be awesome if they got those without having to do anything more than the pip install command I quoted above. But that’s secondary to not making the default experience worse than it currently is.

This is a very good point that hasn’t been explained properly yet. There’s been a lot of general discussion about “build variants”, but I don’t think anyone has gone step by step through a single worked example yet.

Taking mkl as an example, is it necessary for every binary to be linked with mkl? Or is it OK if numpy uses mkl but scipy doesn’t? How does the proposal communicate that requirement? If I say pip install numpy[mkl] scipy does that install the mkl build of numpy and the standard PyPI build of scipy? Or is it an error? Or does it force the installer to build scipy from source (because there’s no mkl build of scipy on PyPI)?

Is it necessary for all binaries to link to the same version of MKL? How is that communicated? If we say the user should specify (which is basically what GPU variants seem to expect), how is the end user supposed to know which version to use? If they choose mkl1.0, but it turns out that not every library they need has an mkl1.0 variant, but they do all have mkl2.0 variants, is it really user error that they picked the wrong variant, or should the system have somehow “upgraded” to an MKL version compatible with all the requested libraries?

If the MKL 2.0 build of mylib depends on foo, but the MKL 1.0 build doesn’t, and there’s no binary of foo for my platform, should the system downgrade to using MKL 1.0? Or are we again making it all the user’s problem? Has anyone confirmed that the users (as opposed to the package maintainers) are happy with having to make all these choices?

This is the level of detail I mean when I suggest that we need fully worked through examples. I genuinely don’t know the answers to any of these questions, and because of that I can’t even comment on whether this is something that’s easy, hard, or even impossible for pip to do.

2 Likes

This triggers another idea[1] - how would reverse constraints work?

What I mean by that is, could we have two variants of a scipy package, where one specifies that it reverse-requires mkl, and if that package is not available (or not a candidate for install already) then the variant of scipy is excluded?

So then pip install mkl numpy scipy could get MKL-specific variants of both numpy and scipy (if they exist), while pip install numpy scipy would not (unless it was already installed).

This extends somewhat to having fake packages that represent CPU/GPU info, if real ones aren’t available (e.g. a SIMD feature package is probably fake, while a CUDA package is potentially real), but ought to behave properly in both cases. (Users have to manually install the feature package, but that’s no worse than today, and there are ways to pre-install things or there could be ways to otherwise inject them into the resolution process.)

Because it’s purely about dropping candidates from the resolve process, it should degrade into “no option available for package ‘foo’” rather than massively complicating the resolution by adding new candidates. If something directly depends on mkl then it may be updated, as normal, but typical reverse-requires would only check that it’s going to be in the final environment.

We still need some way of providing more variants of a wheel that match, but that ought to just be naming (and ordering) convention, right? Resolving which should be installed is then taking the “top-most” variant that isn’t excluded by way of a reverse requirement.
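To make the brainstorm a little more concrete, a purely hypothetical metadata sketch (none of these fields exist; the names are made up for illustration) might look something like:

Name: scipy
Version: 1.12.0
Build-Variant: mkl
Reverse-Requires: mkl

i.e. “only consider this candidate if mkl is already going to be in the final environment”, with an unmarked wheel acting as the fallback variant.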

</brainstorm>


  1. Apologies for distractions, but I don’t think we’re committed enough to a particular approach that I’m being too disruptive yet. ↩︎

There will be no numpy[mkl] wheel on PyPI. The only way that most users would ever have numpy[mkl] is if they got it from conda. Someone who just does pip install numpy scipy without using conda will get a consistent set of wheels that uses openblas.

Other distributions that depend on numpy and scipy could ship wheels that would work equally well with numpy[mkl] or numpy[openblas] simply by using numpy’s normal API. They would have no need to require a particular variant and would just require numpy >= 2.0 or something.

The only place where a particular variant would be required is that the SciPy wheel would require numpy[openblas] or perhaps numpy[openblas_pypi]. Basically this works like a flag that says “we built these numpy and scipy binaries for PyPI together and they should be used together”.

If you pip install numpy[mkl] then there would be no wheel. If pip were to attempt to build from sdist then it would only succeed if the user had already installed MKL. This is already the case now: if you try to build numpy from sdist without having a BLAS library the build will fail.

If you used conda to install numpy[mkl] and then asked pip to install scipy then pip would either need to replace numpy[mkl] with numpy[openblas] from PyPI, build SciPy from source or exit with an error. Ideally an error in this case because the package was installed by conda:

scipy requires numpy[openblas] and you have numpy[mkl] which is incompatible. We are not going to replace numpy[mkl] which was installed by conda.

NumPy and SciPy would have to coordinate closely if sharing a BLAS library like this (openblas not MKL) in the PyPI wheels. There would be some constraints about how they do that but it is basically just something that those projects would manage. Basically they are going to put a consistent set of wheels on PyPI but they don’t want you to mix their wheels with other builds of numpy and scipy: the SciPy wheel that is on PyPI requires the NumPy wheel that is on PyPI and not some other random build of NumPy.

Final note here is that when I refer to NumPy and SciPy sharing a BLAS library I am referring to a hypothetical future situation: currently they do not share a BLAS library because there is no way to express the fact that a particular build of SciPy requires a particular build of NumPy.

There would be no problem in the numpy/scipy case if we could assume with 100% certainty that all binaries were wheels from PyPI. The fact that pip/PyPI gets mixed up with binaries from other places is the problem. We still need a way to distinguish different builds even if there is only ever one variant on PyPI.

1 Like

How is this guaranteed? numpy/scipy is probably not the best example here, because they’re certainly going to do the right thing to maintain compatibility. But in the general case, it seems like either a) projects will be able to upload any variant wheels they want, and so these compatibility issues can start to arise, or b) someone has to choose, for any set of incompatible variants, which one can be uploaded to PyPI?

While it’s true that we’re talking about a very small number of packages here, that sort of gets back to the “is the complexity even worth it” question. And in the long run I’d hope a solution encourages more packages to do stuff like this, so it should be able to scale.

1 Like

In this particular case the issue is MKL’s license. It is not open source and there are some restrictions around it. I’m not sure why this is different for conda, but apparently they can distribute MKL, whereas numpy et al. do not want to do that on PyPI.

I’m describing the particular details of the case because it could make use of build variants to do something useful: removing the duplicate BLAS library. This is a simple example that you can extrapolate across many other cases where we don’t want to duplicate shared libraries, but where the fix requires some notion that a build of one distribution requires a particular build of another.

No compatibility issues arise from the presence of multiple variants. The potential difficulty would be if many projects start requiring incompatible variants of other projects. There is usually no need to require particular variants at all though.

Consider python-flint uploading variants for CPU SIMD features. A user could choose which to install like pip install python_flint[x86_v4]. There is never a reason that any downstream package that depends on python_flint actually needs to require a particular variant because the different variants don’t affect compatibility. Ideally in the future it becomes possible to select the best variant automatically and then no one ever needs to specify a particular variant explicitly.

1 Like

That in itself is a major complication, then. Pip’s resolution process only considers packages named in the install command and, recursively, the dependencies of those packages. It does not consider packages already installed in the environment that don’t appear in that list.

I’m not 100% sure that impacts your comment, as what gets installed is numpy, not numpy[mkl]. There’s currently nothing in the installed package metadata that indicates that it’s the [mkl] variant, and I haven’t seen any concrete proposal for how something like that would be added. So if the user did pip install numpy[mkl] followed by pip install scipy, the second install would see numpy installed and so not bother installing it. But it would have no way of knowing that it should install the mkl variant of scipy, or even that there is an mkl variant of scipy to prefer.

Given that you’re claiming that mkl wheels won’t be on PyPI, “I need to install an mkl variant of X” means “I need to ignore any wheels that do exist, and build an MKL variant from source”. And to do that you need (1) a standardised way to request an MKL variant build from an arbitrary source distribution, and (2) a way to know, before deciding to discard all the wheels, that the sdist supports an “MKL” build. You need (2) because if numpy[mkl] is installed, and the user does pip install requests scipy, you need to know somehow that scipy needs an MKL build but requests doesn’t.

I’ll pass on discussing how conda-installed packages integrate into this. Either they look the same as packages installed via pip install, or they are using non-standard metadata, and that excludes them from this discussion (which is about what the standards can offer).

If you want more background on pip’s behaviour with already-installed packages, Clarify and define install/upgrade behaviour for the new resolver · Issue #8115 · pypa/pip · GitHub gives a lot of information. And if you want a discussion of one particular case that has no ideal solution, there’s Warn users about dependency conflicts when updating other packages · Issue #7744 · pypa/pip · GitHub.

OK, this confuses me further. We now have two examples of “variant” under discussion. CPU SIMD features can be mixed freely, with no detrimental effects. But MKL/BLAS has to be carefully matched in order to work. How can I tell if a given variant type is mixable or not?

I guess that if, for a moment, I forget the idea of “variant” and think in terms of separate projects (so there’s python_flint_x86v4, numpy_mkl, scipy_mkl etc.) then this is just dependency resolution. So I guess what I’m saying is:

  1. What exactly is different when you have two “variants”, numpy and numpy[mkl], as compared to having two projects numpy and numpy_mkl?
  2. How do you know that the differences aren’t going to impact the resolution process?

The reason I ask the second question is because as far as I understand the “variant” concept, it does critically impact the resolution process (and I’ve said as much), so I’m very interested in why you think I’m wrong…

Basically, Anaconda (and I assume conda-forge) are not concerned about their users linking their numpy builds into a copyleft-licensed application, which would be incompatible with MKL’s license. There are no restrictions on redistributing MKL as a dependency of numpy itself, but the numpy project was not happy with the idea of its PyPI builds not being further redistributable under any license. Hence they use only licenses that are copyleft-compatible.

2 Likes

Is this even the case? Right now if you mix MKL numpy with openblas scipy they should both work; it’s just a bit odd[1]. The proposal is actually introducing the possibility that this wouldn’t work, because scipy would be relying on the library that numpy installed.


  1. I think? ↩︎

Out of curiosity I decided to test this in a fresh conda env…but conda installed numpy[openblas] :joy:

This is why I mentioned PEP 376 above. The build variant that is installed needs to be recorded somehow even when it is not pip that installs the build (conda, apt, …).

The situation that I am describing where there is a mix of build variants installed from different sources already exists right now. There just is no metadata that describes it and no way of making it work when publishing binaries to PyPI without isolating every binary by duplicating all the shared libraries.
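As a sketch of what recording this might look like (entirely hypothetical field name; nothing like it exists in installed-distribution metadata today), whichever installer puts numpy on the system could write something like this into site-packages/numpy-2.0.0.dist-info/METADATA:

Name: numpy
Version: 2.0.0
Build-Variant: mkl

so that a later pip install scipy could at least see which build it is dealing with, regardless of whether conda, apt or pip itself did the installing.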

No one should ever do pip install numpy[mkl] unless they are using their own index where those wheels exist. Likewise no projects should require numpy[mkl] unless they themselves are only published on an index where those wheels exist. Practically no projects on PyPI should require numpy[openblas] (rather than just numpy) either, even though those would be the wheels on PyPI. The set of projects that require any particular variant of numpy would be very small.

The situation where someone has numpy[mkl] and is trying to install scipy from PyPI is a situation where the user is doing things incorrectly and should really be given an error message. They are mixing conda and pip in a way that cannot be assumed to work, and they need to be told that. Right now it isn’t possible for pip to give that error message because it cannot distinguish different builds of numpy: it sees them all as being equivalent and interchangeable. This isn’t just a problem for builds from conda but also for the builds that pip itself produces: they are not built the same way as the PyPI wheels and are not interchangeable with them.

There should be a standardised way to request a particular build variant along the lines that I described above. Basically, if the user does pip install .[mkl] then the mkl part is passed through to the build backend. This will still always fail unless the user has already installed MKL, because the build backend is not going to install it. Likewise pip install .[openblas] will fail if openblas has not previously been installed. In principle it could be possible to extend the backend to download and build openblas. It would not be straightforward to download and build MKL in the same way because it is not open source. So, yes, there should be a standardised way to request the build, but this is still not something that is likely to succeed for a typical pip user if the project has external dependencies.
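For what it’s worth, something vaguely along these lines can already be done in an ad hoc, non-standard way by passing build options through pip’s --config-settings. For a meson-python backed project that exposes a blas option (NumPy does something like this; the exact option names may differ), it might look like:

pip install numpy --no-binary numpy --config-settings=setup-args=-Dblas=mkl

but this only works if MKL is already installed and findable, and nothing about the resulting install records that it is an “mkl” build, which is exactly the gap being discussed.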

I think this is where the numpy/scipy example is limiting the discussion a little–it’s very well-defined and that fact restricts the potential issues.

It’s not clear to me how an arbitrary use of variants will be resolved, and what that experience is like for the user.

2 Likes

This issue is why I’d focused in the earlier thread on ways to automate the variant detection, and why I don’t think we want to use the extras syntax to express the variant names. We don’t want package A depending on a specific variant of package B. We want the best combination of packages selected based on compatible variants.

For the NumPy/SciPy situation, the SciPy authors could produce a selector package that probes the installed NumPy, if any, to determine what variant should be selected. That way if users start mixing and matching, there’s some protection against installing incompatible components.
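As a very rough sketch of what such a selector probe could do (a hypothetical helper, not an existing package; it leans on the fact, shown earlier in the thread, that the PyPI wheels vendor their BLAS under numpy.libs):

from pathlib import Path
import numpy

def detect_numpy_blas():
    """Guess which BLAS the installed NumPy build is using."""
    # PyPI wheels vendor OpenBLAS next to the package,
    # e.g. site-packages/numpy.libs/libopenblas*.so
    libs_dir = Path(numpy.__file__).resolve().parent.parent / "numpy.libs"
    if libs_dir.is_dir() and any(p.name.startswith("libopenblas") for p in libs_dir.iterdir()):
        return "openblas"
    # A conda/MKL build would need a different probe
    # (e.g. inspecting numpy.show_config() output).
    return "unknown"

variant = detect_numpy_blas()  # e.g. "openblas" -> prefer the matching SciPy build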

1 Like

I’m not sure this is a requirement. Part of the goal here is to make it so that people publishing variants can do so in a way that more users can consume them more easily. If there’s no valid variant for a package, and there’s no fallback to a “generic” variant, maybe it’s OK to just fail to install? Perhaps that means it’s not possible to build the package for the requested variant at all, for example.

2 Likes

Given the cross-package compatibility concerns, I would express this differently. We want a user to be able to express via some lock file syntax that they want a given variant. That variant should apply to all of the packages in the list, though. We don’t want it to be possible for a lock file to specify incompatible variants for different packages.

1 Like

Do we want old versions of pip to work (selecting something to install) or do we want them to fail gracefully with a useful message?

If the wheel filename format gets an extra field, is an old pip going to decide that newer wheels aren’t valid and just ignore them?