Implementation variants: rehashing and refocusing

I agree those are distinct. I don’t think we want different solutions, but we do need to consider the cases separately.

There are going to be some axes of variation for which automation can help make the right choice. For now that’s the CPU & GPU cases, as you say, and there may be others like Linux-distro-specific builds.

There will be other axes of variation that require the user to be explicit about their intent. I consider it an acceptable limitation that the user has to express that intent, at least for some cases.

Choosing BLAS1 vs BLAS2 is an example of that where I think the user would either want some default selected for them (because they don’t care) or would care and would want to be explicit somehow. In the default case, it should be up to the package publisher to express that, as @ncoghlan described. The installer should just honor the default.

It’s mostly orthogonal. The key point is that an installer won’t try to somehow surface the current variant matrix to an sdist build if it can’t find a matching variant wheel file.

I’d also like to find a way to allow packagers of difficult-to-build packages to be able to upload “non-building” sdists, in order to solve the “two mix-ups of purposes” for sdists. This is orthogonal to the discussion in question though.

Yes, that’s fair. If there is an sdist, and the installer decides it needs to build it, then that build won’t be told about any variant information the installer has determined, regardless of the source. I suppose the user could still set environment variables to control that build process, but the installer wouldn’t do anything like that automatically.

I see describing how to build variants as being in the purview of the build tools, and we’ve already said we’d ignore that for the time being.

I think backtracking isn’t an option, but confirmation would be helpful.

I think you’re exactly right that in the case Paul outlines, the installer would just have to fail, but it could at least keep track of enough information to give a helpful error message. Something to the effect of “I couldn’t find a consistent set of variants to install from the set of BLAS1 and BLAS2. Please make your choice explicit by editing the pyvariants.toml file in your virtual environment.”
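
To make that concrete, here’s a rough sketch of the kind of check an installer could make before failing; the pyvariants.toml layout and all of the logic here are invented purely for illustration, since nothing like this is specified anywhere yet:

```python
# Hypothetical sketch only.  `demanded` maps each variant axis to the values
# that the available wheels for different packages would force, e.g.
# {"blas": {"BLAS1", "BLAS2"}} when package A only ships BLAS1 wheels and
# package B only ships BLAS2 wheels.
import tomllib
from pathlib import Path

def load_pins(venv: Path) -> dict[str, str]:
    """Read the user's explicit variant choices, e.g. {"blas": "BLAS1"}."""
    pin_file = venv / "pyvariants.toml"
    if not pin_file.exists():
        return {}
    return tomllib.loads(pin_file.read_text()).get("variants", {})

def check_consistent(demanded: dict[str, set[str]], pins: dict[str, str]) -> None:
    for axis, values in demanded.items():
        pinned = pins.get(axis)
        if pinned is not None and values - {pinned}:
            others = sorted(values - {pinned})
            raise SystemExit(
                f"Wheels requiring {axis} values {others} conflict with your "
                f"pinned choice {pinned!r}."
            )
        if pinned is None and len(values) > 1:
            raise SystemExit(
                f"I couldn't find a consistent set of variants to install from "
                f"the set of {sorted(values)}. Please make your choice explicit "
                f"by editing the pyvariants.toml file in your virtual environment."
            )
```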

In the grand scheme of everything else being discussed here, it would really not be difficult to extend the build backend interface with a hook that allows specifying which variant to build. I see that as something that is not necessary to discuss in detail at this point, mainly because it would not be a difficult problem to solve if/when the other aspects of build variants are agreed.
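
For illustration only, such a hook could be quite small; the name and signature below are invented and nothing in PEP 517 defines anything like it today:

```python
def build_wheel_variant(wheel_directory, variant, config_settings=None,
                        metadata_directory=None):
    """Hypothetical sibling of PEP 517's build_wheel that is told which
    variant to produce, e.g. variant={"blas": "BLAS1", "cpu": "x86-64-v3"}.

    A real backend would translate `variant` into the compiler flags or
    vendored-library choice it already knows how to apply, then reuse its
    normal wheel-building path and return the built wheel's filename.
    """
    raise NotImplementedError("sketch only; no backend implements this today")
```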

This is definitely a concern in the sense that pip needs to have an answer for handling this situation when it arises. I think it is important to note though that most projects are not interested in shipping multiple wheels for different BLAS libraries like this on PyPI.

I don’t think that you will see NumPy and SciPy shipping multiple variants with different BLAS libraries for example. They just want to ship compatible wheels on PyPI so people can pip install numpy scipy. If people have more specific needs then they can go build it themselves or get their binaries from another distributor.

The complication comes from the fact that other sources (conda etc.) may install variants that are not compatible with the PyPI wheels. The advantage of having variants in this situation is just that you have the metadata to distinguish the fact that they are incompatible. If the metadata existed to express this, though, then there would not actually be any SciPy wheel on PyPI that claimed compatibility with the conda NumPy package, so pip would have no hope of finding one.

What most of these projects would really like is just for e.g. BLAS to be a separate library that they could depend on rather than vendoring it. Then they could all share a BLAS library and they would probably all converge on a particular one (e.g. openblas) built in a particular way. There is a strong incentive for projects to converge on a consistent set of binaries for PyPI: building and bundling all these dependencies is a huge amount of work for them!

It would still not be possible to guarantee that those PyPI binaries would be compatible with binaries from elsewhere though.

For backtracking to be an option, we’d need variants to be a first-class part of the data the resolver works on. Which isn’t impossible (extras are a bit like this) but it’s complex and not something I’d recommend. And it might not be possible to map the logic onto standard resolution algorithms (that’s one reason extras are so messy).

Let’s just say “backtracking isn’t an option” :slightly_smiling_face:

Pip has a lot of trouble giving informative errors even with the current resolution algorithm. I agree that helpful error messages would be important here, but that may be difficult to achieve in practice.

Okay! Let’s go with that. :smiley:

We might be able to do better with variants. My thinking about “dynamic” (a.k.a. “narrowing”) variants above would require some capture of information in a JSON file, e.g. so that subsequent installer runs would pick a compatible set of ABI variants. Conceivably, that might provide enough to give some useful error messages too.
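
A minimal sketch of what that capture could look like, with the file name and layout invented here: the first run records the variant axes it resolved, and later runs read them back as constraints.

```python
import json
from pathlib import Path

STATE_FILE = "variant-state.json"  # hypothetical name; nothing standard exists

def record_variants(venv: Path, chosen: dict[str, str]) -> None:
    """Remember the choices made during an install, e.g. {"blas": "BLAS1"}."""
    state_path = venv / STATE_FILE
    state = json.loads(state_path.read_text()) if state_path.exists() else {}
    state.update(chosen)
    state_path.write_text(json.dumps(state, indent=2))

def constraints_for_next_run(venv: Path) -> dict[str, str]:
    """Feed previously recorded choices back into a later resolution."""
    state_path = venv / STATE_FILE
    return json.loads(state_path.read_text()) if state_path.exists() else {}
```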

Absolutely. One of the big concerns I have here is the risk that we end up over-engineering a solution to handle edge cases that we would be much better just declaring out of scope. But we’ve traditionally not had much success doing that - the simple message “don’t mix package managers” addresses an awful lot of the complexity, but people still insist on using pip to install stuff into conda environments, or into their distribution Python (we prevent it in physical systems these days, but people still do it in docker…)

I’m very much playing devil’s advocate in these discussions. For many of the scenarios I’m presenting, I’d be more than happy if people were to say “we won’t support that, and we can prohibit it by the following means”.

For example, why don’t we just publish a single, blessed BLAS implementation on PyPI, and state that (1) any package published on PyPI that uses BLAS must depend on that specific implementation, and (2) users are not allowed to install BLAS-using packages from multiple indexes (or “package sources”) into the same environment? The answer is, as far as I know, simply “because we’d never get away with doing that”.

That is what happens in other distros, although there you also get virtual packages and alternatives etc., so I think we would still need some version of variants somehow. I don’t see enforcing a single BLAS library being achievable in PyPI-land because of the general lack of overall authority, as you say.

I think that what would work, though, is providing the mechanisms that enable projects to do this, so that they can converge on making something that works organically. Currently it just isn’t really possible to share binaries of underlying native libraries across different PyPI wheels.

Take a very limited scope and consider python-flint. Its main dependency is Flint, which is a C library. If we could, I would split it into two packages where one (say libflint) contains the C library and the other (python-flint) has all the actual code, i.e. the Cython wrappers and the public interface for users. It would be useful for python-flint to be able to do this even if no other project ever depended on libflint:

  • From python-flint’s perspective the Flint library becomes an installable binary.
  • Making a dev setup for python-flint would be much easier.
  • Most python-flint contributors would never need to build Flint themselves.
  • Building python-flint would be about 10x faster.
  • CI would take about 5 minutes rather than 45 minutes.
  • python-flint wheels would be about 20x smaller.
  • The libflint wheels would still be large but would not be specific to the Python version, so there would be fewer of them, and they would only need new releases for new Flint releases, so less total storage on PyPI.
  • If we did have CPU variant builds then it would only need to be for libflint and not for python-flint.
  • Smaller/fewer wheels mean less total space on PyPI, so we can have more regular releases and support more platforms.

We (as in python-flint) would be in control of both packages so we could do what we need to keep them compatible with an exact version pin. (Exact pins are part of the problem in this approach, particularly for BLAS.)

Even though the benefits are clear we don’t do it. It takes us into the realm of packaging non-Python libraries as Python distributions which has always been something that was discouraged (or disallowed on PyPI?). Note also that doing this would necessarily translate into wheels having different dependency constraints from sdists (cf the consistent metadata thread).

It is also something for which the basic machinery isn’t there: how do I load libflint.so at runtime if it comes from a different package? How do I build against dynamic libraries that can only be found via sys.path and how do I find the headers? How do I prevent users from ending up with incompatible binaries (e.g. if they build libflint themselves)?
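
For the runtime-loading part at least, something like the following is what I imagine python-flint would have to do; the internal layout of the hypothetical libflint package is an assumption here:

```python
# Assumes a hypothetical `libflint` wheel that ships the shared library
# inside its package directory.  python-flint would run this before
# importing its own extension modules.
import ctypes
from importlib import resources

def load_libflint() -> ctypes.CDLL:
    # Locate the .so inside the installed libflint package (name and layout
    # are invented here) and load it with RTLD_GLOBAL so that the extension
    # modules loaded afterwards can resolve its symbols.
    lib_path = resources.files("libflint") / "lib" / "libflint.so"
    return ctypes.CDLL(str(lib_path), mode=ctypes.RTLD_GLOBAL)
```

That only touches the runtime half, though; the build-time questions (finding the headers and linking against libraries that are only discoverable via sys.path) don’t have an equally tidy answer, which is part of the point.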

Unbundling in the BLAS case is more complicated. More libraries want BLAS and I don’t know how the ABI constraints would translate into version constraints among the wheels for the different projects. There would be a strong incentive to solve this but I don’t know enough to say exactly how it could pan out. The risk is lots of projects making tight dependency constraints on that BLAS library and then resolves becoming impossible.

+1 Huzzah for agreeing on limits to complexity!

I’ve been thinking about the fact that some of the rules for selecting variants will need to be in a separate metadata file, outside of the distributions, as @oscarbenjamin originally proposed. Or at least that doing that may make some parts of the process easier. It also provides an opportunity for package publishers to provide help text of some sort that pip could emit if the selection process fails. That message could be “We, the authors, only publish packages built against BLAS1 on pypi.org, please refer to https://helpful-website-about-our-package.org/install.html for details about how to install for BLAS2.”, for example.
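
As a rough illustration (the file layout is entirely made up here), such a metadata file could pair the selection defaults with that publisher-supplied message, and the installer would just print it verbatim when selection fails:

```python
import tomllib

# Invented layout for the out-of-distribution metadata file described above.
EXAMPLE = '''
[variant-selection.blas]
default = "BLAS1"
help = "We, the authors, only publish packages built against BLAS1 on pypi.org, please refer to https://helpful-website-about-our-package.org/install.html for details about how to install for BLAS2."
'''

rules = tomllib.loads(EXAMPLE)["variant-selection"]

def fail_selection(axis: str) -> None:
    # Surface the publisher's own guidance instead of a generic resolver error.
    raise SystemExit(rules[axis].get("help", f"No compatible {axis} variant found."))
```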

Even if you did this, it would only work until that one package had an update that broke compatibility. At that point, the downstream projects would need to differentiate which version they wanted, which violates the intent of having just the one option. When you have a shared resource like this, everything that uses it must be updated together, such that it is possible to have an aligned set. Conda-forge calls these “migrations,” and without centralized tracking, I can’t imagine coordinating any non-trivial number of packages.

Essentially, you’d never get away with it because people want options, but you also get bitten by the whole time dimension.

It seems like if that problem is going to be solved at all, the best resolution will come through the owners of a set of related packages agreeing on what they will do as a community, rather than having any sort of rules imposed from the outside.

Yeah, that’s conda-forge in a nutshell. My point was that you can’t get away with simplification through limiting options. Compatibility breaks in one library are themselves options, and without resorting to something like epochs to achieve separation in time (like distro release cycles), I think we have to have metadata that accounts for differences, and we have to be able to utilize that metadata to achieve aligned sets.

I agree with this, but it leads straight back to that need to depend on specific variants rather than only depending on default variants, or always independently selecting variants based on target environment characteristics.

I don’t come to the same conclusion.

There are 2 cases, I think? (A) installing everything at one time and (B) installing one package, then some time later installing another package.

For (A) the variant parameters are somehow specified when the installer runs (manually, auto-detected, whatever). All packages have the same parameters applied when selecting a file, and therefore should be compatible, so direct dependencies on variants aren’t needed.

For (B), pip already doesn’t account for what’s installed in a virtualenv, so expressing a direct dependency from something that is already installed to something else being installed separately doesn’t help. Barry’s idea of recording the variant parameters so they can be reused would help, but that’s just another source of those parameters in subsequent calls; it doesn’t require direct dependencies on variants.

That’s not entirely true. Pip certainly makes sure when resolving that the dependencies of installed packages are still respected. But it only does that for packages it’s aware of, and that means ones that are discovered as part of the dependency tree.

So for example, if 2 independent packages A and B are installed, and C depends on A, then pip install C will account for the dependencies of the installed copy of A. It won’t consider B, because the presence or absence of B won’t affect the resolution. With variants, though, the existence of B (variant BLAS1) does affect the resolution, as BLAS2 variants must no longer be considered valid options. That’s a significant change.
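
To illustrate the extra step, here is a toy version of the filter the resolver would need; the names and data shapes are invented, not pip internals:

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    name: str
    version: str
    variants: dict[str, str] = field(default_factory=dict)  # e.g. {"blas": "BLAS2"}

def filter_candidates(candidates: list[Candidate],
                      installed_variants: dict[str, str]) -> list[Candidate]:
    """Drop candidates whose variant tags conflict with what is already installed.

    With B (variant BLAS1) installed, installed_variants == {"blas": "BLAS1"},
    so every BLAS2 candidate for C (or A) is removed before resolution starts.
    """
    return [
        c for c in candidates
        if all(installed_variants.get(axis, value) == value
               for axis, value in c.variants.items())
    ]
```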

Whether the above is relevant, though, is something I’m not sure about. You’re talking about “expressing a direct dependency from something that is already installed to something else being installed separately” - but if that dependency exists, then surely the second thing must already be installed as well (because otherwise the environment is broken, and all bets are off)? Can you give a more concrete example of what you mean here?

Sorry, I said it backwards. Your example with A, B, and C is the scenario I was thinking of.

In that case, C depends on A but because I don’t think we want to allow C to depend on a specific variant of A, choosing the right variant of C to install becomes harder unless we save the variant parameters for a given virtualenv or the user specifies the same information a second time.

I don’t want dependencies on variants because it introduces more complexity in the common case. I’m willing to accept more complexity in the unusual case we’re talking about here.

This might be widely understood in this thread, but I want to emphasize that this would be new to the Python ecosystem.

You can install A[BLAS1] and B[BLAS2] right now; you just need to also install BLAS1 and BLAS2, which is a bit of a waste. numpy and scipy are typically installed using the same BLAS, but the versions of those on PyPI have vendored their BLAS libraries separately. So in practice you use the same one, but you have two copies.

The requirement that “you can only have one BLAS in an environment” would be a new rule for installers to deal with. I’m curious if anyone in the conda ecosystem knows of cases where that would be undesirable. Like if there’s only one variant available for a given package, you might want to install it even though it brings in a second BLAS library. Is that going to be broken now?
