Selecting variant wheels according to a semi-static specification

What’s the UI for this on PyPI and other repos? Right now, each of these would have their own project page. That has been a big headache for the CUDA packages, but the environment marker idea you have here would be an improvement. It would still be nice to classify the variants as children of the main project page, though, and not as independent project entries (or folders, in the simple API).

UI on PyPI doesn’t need to be in the PEP, or in any PEP (unless they prefer it to be).

It doesn’t seem difficult to have a long description that says “Don’t use this package, use <href=“this one”>”. Users will figure that out, and it gives the projects far more flexibility than restricting their UI options by hiding things.


That said, I would love to see PyPI be searchable/groupable by the name of the top level module(s) that will be installed. Under such a system, I’d expect this module to get grouped together with any other alternative implementations of the same top level module.

4 Likes

It’s not just about UI on PyPI. Having multiple distribution names for what are really just variant builds of the same distribution is awkward in many ways and does not really reflect what is actually happening.

It’s awkward for project authors to create and upload the variant wheels under different distribution names. It’s confusing for people looking at PyPI or other metadata to figure out what is going on, where the real packages are, and which are equivalent. In reality you can only have one variant of each real distribution installed, but now some installers might end up attempting to install conflicting ones like python_flint_x86_v2 and python_flint_x86_v3.

It’s awkward to end up needing more names, or even an open-ended list of names, on PyPI, to maintain secure control of them, and to check each of them to see that they have the right files and release versions. If you have 50 cu11 distributions and 50 cu12 distributions, what are you going to have to do when cu13 comes out? Other indexes that host subsets of PyPI then need to add the 50 new names, so it’s awkward for indexes and other things as well. You also have a big name-squatting surface, because someone can come and register cudf-cu13 etc. on PyPI.

It also messes up the install-from-sdist model, because if you build from sdist you are going to get one of the build variants, but none of the tooling will understand that you have e.g. the cu12 variant. Then, after installing bar from sdist, you might find that pip install foo installs foo_cu11, or that foo_cu11 requires bar_cu11 and pip tries to install that as well.

Realistically the different build variants are variant builds of a particular distribution corresponding to a particular sdist and an installation can only contain one variant of that distribution. It would be better if we can have the tools understand that fact on a basic level rather than trying to workaround the limitations of the current model with dummy distribution names.

In fact, not understanding build variants is already problematic, because we implicitly have build variants today, e.g. non-portable vs portable PyPI wheels. Currently I can build e.g. numpy from source and then pip install scipy from a PyPI wheel, and those are not necessarily going to be compatible. Likewise I can conda install numpy and then pip install scipy and that might also be incompatible. A similar situation existed in the past with Christoph Gohlke’s wheels vs PyPI vs conda. There are already many build variants, but we currently have no way to distinguish them, and tools like pip just assume that they are all equivalent. If we can make it so that build variants are distinguished, then there are other problems that could be solved at the same time.

1 Like

I agree with this. What happens if the user installs both python_flint_x86_64_v3-0.6.0-cp312-cp312-win_amd64.whl and python_flint_x86_64_v4-0.6.0-cp312-cp312-win_amd64.whl? The two are different projects, so there’s nothing in the current ecosystem that would stop that.

In principle, I like the idea of the selectors just being custom markers, but we need to work within current semantics.

I haven’t thought this idea through (and I don’t have the time to right now) but could we use wheel build numbers to handle this?

python_flint-0.6.0-1.x86_64_v4-cp312-cp312-win_amd64.whl
python_flint-0.6.0-1.x86_64_v3-cp312-cp312-win_amd64.whl

These are both wheels for python_flint version 0.6.0. They sort based on (<integer prefix of build number>, <string rest of build number>). This allows the project to control priority. That just leaves the selection question. Maybe we could add a new metadata item, similar to Requires-Python in that it gets exposed in the simple API (we might even be able to simply add a marker expression to Requires-Python itself). Then, the installer evaluates the marker for each wheel and rejects any where the marker doesn’t return True.
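
To make the selection step concrete, here’s a rough sketch. The x86_64_level variable and the stub evaluator are purely illustrative assumptions; stock packaging rejects unknown marker names, which is the plugin problem discussed below.

from packaging.utils import parse_wheel_filename

# (filename, hypothetical marker) pairs, as an index might expose them.
candidates = [
    ("python_flint-0.6.0-1.x86_64_v4-cp312-cp312-win_amd64.whl", "x86_64_level >= '4'"),
    ("python_flint-0.6.0-1.x86_64_v3-cp312-cp312-win_amd64.whl", "x86_64_level >= '3'"),
]

# In practice this would come from a marker plugin; hard-coded for the sketch.
environment = {"x86_64_level": "3"}

def marker_ok(marker: str) -> bool:
    # Stand-in for a marker evaluator extended with custom variables.
    name, op, value = marker.split()
    return op == ">=" and environment.get(name, "") >= value.strip("'")

# Build tags parse to (int, str) tuples, so sorting them descending lets
# the project control priority between its own variants.
candidates.sort(key=lambda c: parse_wheel_filename(c[0])[2], reverse=True)

chosen = next((f for f, m in candidates if marker_ok(m)), None)
# -> the x86_64_v3 wheel in this environment; the v4 wheel's marker fails.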

The user still needs to install something that exposes the additional custom markers. Evaluating a marker expression with an unknown marker will fail, and the user needs to ensure the appropriate marker plugins are installed. Making that user friendly is “simply” a matter of good UI, though; it’s not a standardisation problem.

Would the build number idea allow multiple variants to be passed as candidates from the finder to the resolver for a given package version? I think not, because the build number isn’t part of the version. If so, though, that sounds promising. Otherwise, I think we need some kind of variant “alignment phase” prior to the finder, so that we ensure that variants throughout the dependency tree are consistent. This is the kind of idea that I described in Selecting variant wheels according to a semi-static specification - #99 by msarahan, but I’m fearful that this may be classified as “rewrite the installer resolution algorithms” and thus a non-starter for this effort.

One thing that occurred to me here is that we haven’t yet talked much about how this would affect lockfiles - and specifically the proposal at Lock files, again (but this time w/ sdists!). I would be very cautious about approving any proposal that didn’t work with lockfiles - regardless of where we are on standardisation, they are an important feature used by a large part of the community.

In particular, if we had a package with multiple wheels depending on, say, CPU instruction set, how would we handle creating a lockfile? Would the lockfile only contain one variant (and if so, how would it specify the CPU instruction set(s) it was valid for), or would it contain all variants (in which case how would tools like audit scanners and lockfile installers that don’t have a full resolver know which wheel would actually get installed)?

The lockfile discussion is still at pre-PEP stage, so there’s no “official lockfile spec” to consider here, but the linked thread contains a lot of questions like this which existing lockfile solutions are having to consider, and which any spec will need to provide answers for.

1 Like

It hasn’t been discussed much in this thread, but it is mentioned in the OP.

The OP design was explicitly intended as a compromise between wanting dynamic wheel selection and wanting static resolution and lock files. That is why the title of this thread mentions “semi-static specification” and why the wheel selector file has an explicit wheel_tags table.

It would be good to hear more thoughts from people who are interested in making lockfiles, though, because it does not seem like that side has been well represented in this discussion.

2 Likes

Agreed. I’m not an expert here, but I know there are two distinct groups, and any lockfile-related proposal needs to address both of them to some extent:

  1. “Portable” lockfiles, that can be used on multiple systems (with similar-but-not-identical profiles) and will allow installers to pick the right wheel to install. This generally requires at least some level of intelligence in the lockfile consumer.
  2. “Reproducible” lockfiles, that specify exactly what will be installed, and what systems it will work on. These require minimal complexity in consumers - a common example of a non-installer consumer here is auditing, which won’t involve a resolver and may even be a manual process (someone reviewing the lockfile by hand).

I think this thread has gotten long enough to be unwieldy. I tried to summarize and lay out a plan going forward at Implementation variants: rehashing and refocusing - feel free to ignore that thread if it is not a productive contribution to this topic.

1 Like

Well, on my macOS machine, that’s already the case:

>>> sys.version
'3.12.4 (main, Jun  6 2024, 18:26:44) [Clang 15.0.0 (clang-1500.3.9.4)]'
>>> len(list(packaging.tags.sys_tags()))
2769

I wanted to put a proposal out here and solicit some advice. This proposal is mostly based on the top post, which is why I’m posting in this thread.

There are three parts to this proposal, which combine other proposals we’ve seen in this and other threads.

Packages:

  • Contain a record of variants (metadata provider programs, keys and values) that were used for producing the packages. These are namespaced, like metadata-provider:variable=value
  • Hash the record of variants, and use this hash in the build tag for differentiating variants

Example filename: myproject-1.2.3-h8d84c3-py3-abi3-linux_x86_64.whl

On the hosting side

I propose a file, variants.json, that the repo/index software is responsible for creating. When a wheel file is uploaded, the server uses the record in the package to create or update the variants.json file. That file looks like:

{
  "8d84c3": {
    "provider_abc:variable1": 123,
    "provider_xyz:variable2": 123
  },
  "b5670a": {
    "provider_xyz:variable2": 123
  }
}

The hash keys there are just the first 6 characters (the exact length is arbitrary) of the sha256 hash of the dict values as strings.

This file does not list any individual files. Instead, it is a map of available variants, and a way to associate meaning to the hashes (by associating the hash key to human-readable info about the input content).
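
For concreteness, here’s one plausible reading of that hashing rule (the exact serialisation isn’t pinned down above, so json.dumps over sorted keys is an assumption here):

import hashlib
import json

def variant_hash(record: dict, length: int = 6) -> str:
    # Canonicalise the provider:variable -> value mapping, then hash it.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:length]

variant_hash({"provider_abc:variable1": 123, "provider_xyz:variable2": 123})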

On the client’s system

There will be a file that specifies the state of variants that tools should find/install. Right now, I think it lives at the environment level, in line with @barry’s thoughts. That file might look something like:

[providers.provider_abc]
version = "1.0.0"

[providers.provider_abc.variables.variable1]
description = "This is a description"
values = ["123", "456"]

[providers.provider_xyz]
version = "1.0.0"

[providers.provider_xyz.variables.variable2]
description = "This is a description"
values = ["primary", "secondary"]

and this file could be generated manually, or with some combination of hardware detection programs (the standalone executables that have been mentioned might be relevant here). The values here do not have to come from the set of values that are available remotely, but of course if they don’t, then that variant is considered unavailable.

This file would be “compiled” locally into a cached collection of hashes, exhaustively hashing the combinatoric space of all combinations of all variables (a rough sketch follows the list below). The ordering in this file would prioritize:

  1. Combinations with more variables (more specific variants)
  2. Position of the provider/variable entry in the variants.toml file
  3. Sorting based on order in the list of values provided in the variants.toml file
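
Here’s a rough sketch of that compile step, under the same assumptions as the hashing sketch above (variant_hash is repeated so the snippet stands alone):

import hashlib
import json
from itertools import combinations, product

def variant_hash(record: dict, length: int = 6) -> str:
    # Same (assumed) hashing rule as the earlier sketch.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:length]

# Parsed from the variants.toml example above, preserving file order.
variables = {
    "provider_abc:variable1": ["123", "456"],
    "provider_xyz:variable2": ["primary", "secondary"],
}

def compiled_hashes(variables: dict) -> list[str]:
    names = list(variables)                       # rule 2: position in the file
    ordered = []
    for size in range(len(names), 0, -1):         # rule 1: more variables first
        for chosen in combinations(names, size):
            for values in product(*(variables[n] for n in chosen)):  # rule 3: value order
                ordered.append(variant_hash(dict(zip(chosen, values))))
    return ordered

priority = compiled_hashes(variables)             # most preferred hash first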

When a tool wants to look up matching variants, it does the following (a code sketch follows the list):

  1. fetches package/variants.json
  2. loads the local collection of pre-computed hashes
  3. does a set intersection of the remotely available variants with the locally available hashes
  4. Sorts the result according to the order in the locally available hashes
  5. Retrieves the file listing, which is named according to the hash - package/files-8d84c3 or similar. This file listing would follow the existing PEP 503 and/or PEP 691 formats, but would only show files for that variant. This is similar to filtering by filename using the build string, if a flat hierarchy makes more sense, but then the variant files may confuse “normal” package resolution.
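
Putting those steps together, here’s a sketch of the client side, assuming the priority list compiled in the previous sketch and the hypothetical package/variants.json and package/files-<hash> endpoints:

import json
import urllib.request

INDEX = "https://example.invalid/simple/myproject"   # placeholder URL

def fetch_json(url: str) -> dict:
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def candidate_listings(priority: list[str]):
    remote = set(fetch_json(f"{INDEX}/variants.json"))   # step 1
    for h in priority:                                   # steps 2-4: intersect, keep local order
        if h in remote:
            # step 5: per-variant file listing, same shape as a normal
            # PEP 503/691 project page; ordinary tag matching applies from here.
            yield h, fetch_json(f"{INDEX}/files-{h}")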

What I’m not clear on is:

  • Is it feasible to add these new endpoints (one for variants.json, another for the per-variant files listing)?
  • If the variant matches, but there ends up being no packages that match the user system’s platform tags, what recourse is there to find either other variants, or fall back to the no-variant listing?

I’m playing with an implementation that I’ll post soon. I don’t know if the repo stuff (extra files/endpoints) I’m doing is going to be viable, but hopefully it will at least be good food for thought.

So variant wheels can’t have build tags w/ this proposal, correct? Basically how do you tell that apart from an actual build tag that used a hash of something? If we aren’t thinking Wheel 2, then maybe a prefix for the tag and a way to separate the variant hash from the build tag?

Exposing a file via the Simple API has been done before, so I don’t see any reason that’s problematic for serving. The creation of the file on the index side is somewhat new, but if you view it as just repackaging metadata then it’s just a part of the index data itself and “normal” for an index to do.

I’m not sure what you mean by “per-variant files listing”: you already pointed out variants.json, and your proposal embeds what a file supports in the wheel file name, which is nothing different from what indexes already do to list wheel files.

I would say the latter if such a wheel exists. Otherwise it’s back to the sdist (i.e. I don’t see it as any different than when no supported wheel is found).

Build tags aren’t ruled out here, because the build tag field’s primary purpose in this proposal is avoiding filename collisions. The build tag could have other stuff in it, and actually could be anything at all. The hash is a way of encoding arbitrarily long numbers of key/value pairs, but plays no role in file finding.

I’ve been working with warehouse, and I think this is pretty straightforward. It might be a new table to store variants, some relationships between variants and files, and some jinja2 templates for any new pages.

My work-in-progress is at GitHub - wheel-next/warehouse at variant-metadata.

By per-variant file listing, I mean that the index would not show files with variants by default. This is both for simplicity (avoiding any changes to the solver), and to maintain behavior for installers that don’t support variant metadata.

For example, let’s look at 3 endpoints that I think would be involved:

/simple/project/ - this stays the same. Any files with variants associated do not show up here. In pseudo-query terms, “list project files where variant is null”

/simple/project/variants/ - this is new. It is a list of all of the variants that any files use. In query terms, “unique set of variant values for all files in this project”. It looks like a bunch of hashes.

/simple/project/variants/<some_possibly_shortened_hash_value> - this is new, but it’s the same template as /simple/project. The query is “list project files where variant equals provided value”
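
For illustration only (this isn’t a proposed schema), the variants listing could be PEP 691-style JSON along these lines, with each per-variant page keeping the existing project-page shape but filtered to that variant:

{
  "name": "project",
  "variants": ["8d84c3", "b5670a"]
}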

This is why the build tag doesn’t matter. The variant value comes from the package’s variant metadata file, not the filename. The build tag can be arbitrary, as long as the variant metadata file does its job.

I have not gotten too deep into pip or other installers yet. I believe I’ll run into the --extra-index-url priority issue (also), because my design effectively treats each variant as a different index url, and I’d definitely need priority for considering more preferred variants first, as well as preferring variants to the standard non-variant builds. I’m willing to put some work into a feature PR for this behavior for pip. It doesn’t seem like something for a PEP, because it is installer behavior. Is that accurate?

OK, but you said …

… and so I don’t know how to choose which file to install. But I think you answered it later in your reply.

How is this different from just gathering the hashes from variants.json? And if they are the same hashes, why both this endpoint and variants.json?

Ah, I didn’t pick up on this from your other post. OK, so to determine what could be installed:

  1. Take the exhaustive set of hashed variant combinations that the environment supports
  2. Get the set of variant hashes from /simple/project/variants/
  3. See what’s in the intersection of those sets
  4. Look at each /simple/project/variants/<some_possibly_shortened_hash_value> in the priority order specified for the environment
  5. If one of /simple/project/variants/<some_possibly_shortened_hash_value> has a wheel that can be installed then you’re done
  6. If nothing is found then fall back to /simple/project/

Is that correct? If so, is there a concern about the increased HTTP traffic, since step 4 has <=N HTTP requests for the N hashes found in step 3?

This could be avoided if a “variants” key was added to Simple repository API - Python Packaging User Guide that had the hashes as keys and the values were the same as “files”.

OK, so the addition to the file name is just to prevent file name clashes locally, not to convey info. :+1:

I don’t know who the PEP delegate who would give the final say on this would be, but I would expect the PEP to cover how the installer is expected to determine what to install; otherwise it’s just info that no one quite knows how to use.

1 Like

Because I hadn’t gotten into the meat of warehouse when I wrote the first post. :wink:

Whether it’s a file or an endpoint is an implementation detail, but warehouse and PEP 691 have convinced me that an endpoint that can return either HTML or JSON is the nicer way to go.

That is correct! Thanks for thinking about it. I’m hoping the HTTP traffic increase isn’t too bad because:

  • I plan on fetching different variant file pages lazily
  • Ideally, you would only hit the first variant file page. This would probably only fail because of a dependency conflict or platform tag incompatibility (manylinux?)
  • Things would only be really bad if a lot of variants ended up giving you non-matching or unavailable wheels. I’m planning on adding the ABI tag and the platform tag to the recommended variants, so I don’t see this happening often. It would be far more likely that none of the hashes would match, which should be cheap as a set intersection rather than a series of HTTP fallback requests.

I think this is true for PEP 691, but I haven’t figured out how it might work for the PEP 503 simple API. Maybe that just doesn’t matter. If I understand correctly, you’d have something like:

{
  "name": "project",
  "files": [...],
  "metadata": {...},
  "variants": {
    "deadbeef": [files list for variant deadbeef]
  }
}

where each entry in a files list must match the same criteria as the top-level files. Do I have that right?

Speaking as a pip maintainer I definitely would want a PEP if the design required imposing a priority order on indexes. Moreover, now that uv exists, “installer behaviour” is very much something that can only be relied on if there’s a PEP standardising it. (That always was the case, but people used to assume that if pip did something, that was good enough in spite of us reminding them that it wasn’t :slightly_frowning_face:).

Speaking as a potential PEP delegate, if the PEP doesn’t specify the behaviour precisely enough for it to be implemented in pip without “hidden” implications like this, then the PEP isn’t sufficiently detailed, and I’d want that fixed.

To put what I said above another way, any design that involves having two different index URLs serve the same filename with different content, is extremely unlikely to get accepted. This would break a lot of code, including caching algorithms based on the wheel filename.

Basically, don’t assume that adding index priority to pip will be acceptable. The issues you linked to give some sense of how complex the whole matter is, and either you need to address all of that complexity (good luck!) or you’re going to add a special case that implements index priority in just one part of pip, and that’s going to be a maintenance nightmare.

In all honesty, if you want to go down this route, I’d want a separate PEP, entitled something like “Selecting Python package files across multiple sources”, which defined standard behaviour required of all installers and consumers of data from package indexes. It would need to integrate with PEP 708[1] as well, which is how we currently handle this issue. The variant wheel PEP would then depend on the guarantees provided by that other PEP.


  1. which is (provisionally) accepted, but still hasn’t been implemented in the ecosystem - so there’s that, as well :slightly_frowning_face: ↩︎

1 Like

Thanks! This is good clarity about the index priority PEP, and I’ll work on putting something together. It will be nice to reduce some of the complexity of the metadata work by handling this part separately.

I wasn’t clear enough in my description. The index URLs differ between variant combinations, but the filenames should also differ between index URLs. The thought was that the build tag would be used as the differentiator and would contain the variant hash or some shortened form of it, but it doesn’t matter exactly where the differentiator goes in the filename, or really even what it contains, so long as it makes distribution files uniquely named. I think it’s important to keep the requirement that filenames be unique for each distribution, for the reasons you listed, as well as confusion from having them together locally when building/uploading.

I believe that having separate index URLs for each variant will allow dependencies to differ between variants of a given package while still respecting the static metadata assumption, but that’s a can of worms that I’ll wait to open until the implementation is further along.

To be clear, I think the chance that a PEP that requires some sort of order-based priority for index items will get accepted is very low. I hesitate to say “not on my watch”, but I’ve seen too many over-simplistic interpretations of both the behaviour and the benefits[1] of this idea to be sympathetic to it.

It’s perilously close to “mandating tool behaviour” rather than “providing interoperability”, and as I said above, it breaks the basic assumption that “if you have a file with the correct name, you don’t need to care where it came from”[2].

I’d recommend that instead, you design the variant mechanism so that different variants have different filenames. And if that means you need a new version of the wheel spec to allow an extra field in the filename, then so be it. We can’t avoid that forever, and we shouldn’t design suboptimal standards just in an attempt to put it off a little while longer.
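
For example (purely illustrative, not a proposed syntax), a revised filename scheme could carry the variant label as its own field rather than overloading the build tag:

myproject-1.2.3-cp312-cp312-linux_x86_64-8d84c3.whl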


  1. hint: it is of no help in handling dependency confusion attacks, which is typically the biggest reason people ask for it ↩︎

  2. in terms of functionality - let’s keep audit trail and provenance issues out of this discussion ↩︎

Hang on - I missed this point on first reading. If the filenames differ, you don’t need index priority. Why did you suggest that you did?

I don’t think it’s a “need” per se, but (ignoring nuanced complexity here):

  • It is sharding the package’s file list, which simplifies the steps after collection. Assuming minimal fallback indexes, this could be a good speedup. On the flipside, if we don’t shard, how will many new variants affect the scale of the package finding/solving problem?
  • I was concerned about the discussion regarding platform tags already having scaling problems
  • I wanted something that preserved the existing solver behavior as much as possible (not considering index priority as a behavior change in this, though it is certainly a major behavior change)
  • As mentioned above.

EDIT: Adding link to Paul’s post on github, which is very helpful in outlining what kinds of questions need to be answered for a prioritization implementation: index-url extra-index-url install priority order · Issue #8606 · pypa/pip · GitHub