I’m not sure I agree here. One of the main advantages of having mutual exclusivity and such encoded in some central location (like a solver “plugin” or “helper” for a variant) is that you don’t need to spread that logic around in every package. The “run_exports” feature in conda-build, where runtime dependencies get added automatically based on build-time usage, is a relevant, if simplistic, example. The problem in Conda-land was that it was really hard to know what the correct compatibility bounds should be. Every package had to guess or cargo-cult sane constraints for its runtime libraries. “run_exports” was created to give the power and responsibility of defining sane compatibility bounds to the shared-library package author, and to allow consumers to simply trust that the author understands the package’s compatibility bounds.
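To make that concrete, here’s a minimal Python sketch of the run_exports idea, not conda-build’s actual implementation (the function name echoes conda-build’s helpers, but the behaviour is simplified): the library author states a pinning policy once, and each consumer’s runtime requirement is derived from whatever version it actually built against.

```python
# Illustrative sketch of the run_exports idea (not conda-build's real code).
# The shared-library author declares how tightly consumers should pin, and
# the build tool turns the concrete build-time version into a runtime bound.

def pin_compatible(name: str, built_version: str, max_pin: str = "x.x") -> str:
    """Turn the version used at build time into a runtime constraint.

    max_pin="x.x" means "compatible up to the next minor release".
    """
    parts = built_version.split(".")
    keep = max_pin.count("x")
    lower = ".".join(parts[:keep])
    bumped = parts[:keep]
    bumped[-1] = str(int(bumped[-1]) + 1)
    upper = ".".join(bumped)
    return f"{name} >={lower},<{upper}"

# The author of libfoo declares the policy once:
RUN_EXPORTS = {"libfoo": {"max_pin": "x.x"}}

# A consumer that built against libfoo 1.2.3 gets this added to its runtime
# requirements without having to guess the bounds itself:
print(pin_compatible("libfoo", "1.2.3", **RUN_EXPORTS["libfoo"]))
# -> "libfoo >=1.2,<1.3"
```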
I may be misunderstanding what you mean by “machinery” here. Where do you draw the line regarding “machinery” and whatever is interpreting the metadata?
This delocalization of variant meaning makes my head hurt. I’ve been thinking that there would be common key/value kinds of things, and that we’d have something like
given variant “cuda=12”, obtain package A and B where they both have that variant, if they have the “cuda” key at all.
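As a rough sketch of that selection rule (all names here are hypothetical, not an existing API): a wheel that declares the key must match the requested value, and a wheel that doesn’t declare the key at all remains acceptable.

```python
# Rough sketch of the key/value selection rule described above.

def select_variant(candidates: list[dict], requested: dict[str, str]) -> list[dict]:
    """Keep candidates whose declared variant keys don't contradict the request."""
    selected = []
    for wheel in candidates:
        variant = wheel.get("variant", {})
        # A missing key is acceptable; a conflicting value is not.
        if all(variant.get(key, value) == value for key, value in requested.items()):
            selected.append(wheel)
    return selected

package_a = [
    {"name": "A", "variant": {"cuda": "12"}},
    {"name": "A", "variant": {"cuda": "11"}},
]
package_b = [
    {"name": "B", "variant": {}},              # no "cuda" key: always acceptable
    {"name": "B", "variant": {"cuda": "12"}},
]

print(select_variant(package_a, {"cuda": "12"}))  # only the cuda=12 build of A
print(select_variant(package_b, {"cuda": "12"}))  # both builds of B qualify
```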
I do see the value in having the package/function from Oscar’s idea, and I see that with that design, different packages/functions could respond differently to a particular key/value combination, but I still think the key/value unification is important.
This also gets back to the idea of putting a hash into filenames as the filename differentiator, since key/value combinations would not be especially workable due to length and syntax complications. Here’s conda’s docs on the topic: Build variants — conda-build documentation
I should note that my hope was that such a scheme could eventually replace aggregate approximations, such as manylinux, and instead encode all compatibility information in a hash_input.json file that is then captured in the filename as a hash. This does not preclude putting some human-readable variant identifiers in the filename; it just avoids relying on those identifiers for package/variant resolution.
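For illustration, here’s a minimal sketch of the hashing idea; the choice of JSON plus SHA-256 with a 7-character truncation mirrors conda-build in spirit, but treat the exact details as assumptions.

```python
# Minimal sketch of deriving a filename hash from variant key/value pairs.
import hashlib
import json

def variant_hash(variant: dict[str, str]) -> str:
    # Sort keys so the same variant always produces the same hash,
    # regardless of the order the keys were collected in.
    hash_input = json.dumps(variant, sort_keys=True)
    return hashlib.sha256(hash_input.encode("utf-8")).hexdigest()[:7]

variant = {"cuda": "12", "blas": "openblas", "python": "3.12"}
print(variant_hash(variant))  # short, filename-safe stand-in for the full key/value set
```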
You’re misunderstanding how clients currently work.
There is no assumption that all wheels have the same dependencies. What there is, is an expectation[1] that an installer can legitimately choose any wheel that is compatible. Dependencies are only checked for the selected wheel, not because we assume they are the same, but quite the opposite - because we don’t care whether they are the same. Dependencies are just a way for the wheel to implement the required behaviour.
not an assumption, as it’s fundamental to what a wheel is ↩︎
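In other words (a deliberately simplified sketch, not pip’s actual internals): compatibility narrows the candidates, any one of them is a legitimate choice, and dependencies are read only from the wheel that gets chosen.

```python
# Simplified sketch of the behaviour described above (not pip's real code).

def choose_and_resolve(candidate_wheels, supported_tags):
    compatible = [w for w in candidate_wheels if w["tag"] in supported_tags]
    chosen = compatible[0]        # any compatible wheel would be legitimate
    return chosen["requires"]     # other candidates' dependencies are never consulted
```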
It’s a wheel. All we are ultimately doing is compiling the same code w/ different assumptions about what will be available on the user’s machine. I don’t think that requires an entirely new file format, just a way to encode what assumptions were made when the wheel was built (e.g. CPU extensions, CUDA, etc.).
A “sibling wheel” to me is a wheel compiled from the same sdist for the same platform, but with some variance (see what I did there?) in how it was built, e.g. what CUDA version something was compiled for.
I’m not sure @ncoghlan was suggesting that. To me, whatever is used to generate the list of stuff/variants a machine supports will determine what’s exclusive and what’s not. And in my head that “stuff” is not run by the package itself; more likely some tool the user runs.
What is the commonality? Is it up to the community to come up w/ that? I thought that’s where this was heading, but I want to make sure we are all thinking the same thing.
When you say “whatever is used to generate the list of stuff,” I think you’re implying that there is one tool to do this generation. It will be difficult to have only one tool, and if there is only one, it will likely fall on PyPA to maintain it. It is important that the detection functionality be distributed, so that domain experts are the ones responsible for maintaining their parts. Some kind of plugin architecture or standard API is likely what PyPA would provide.
I’m not settled on whether the detection tool absolutely must be run prior to any installs. I think there could be a “refresh” run as part of installs, but I also think this is an implementation detail that isn’t worth discussion time. The main idea is that you shouldn’t be installing detection tools and running them in the same install, because it’s cyclical.
If there are two packages that employ a variant for cuda, the name of that variant variable and the domain of its values should be compatible. This implies a global namespace of variants. I believe the responsibility to publish and communicate these variant variables and value domains rests with the domain-expert teams that I think should be maintaining the detection tools. This is an aggregate of communities.
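As a rough illustration of how those two points could fit together (every name here is hypothetical, not an existing standard): each domain-expert team ships a detection plugin that reports values only for the keys it owns in the shared variant namespace, and a thin driver merges the results.

```python
# Hypothetical sketch of a distributed detection API; none of these names
# are an existing standard. Each domain team owns the keys (and value
# domains) its plugin reports in the shared variant namespace.
from typing import Protocol

class VariantDetector(Protocol):
    # Keys this plugin is authoritative for, e.g. {"cuda"} or {"blas"}.
    keys: frozenset[str]

    def detect(self) -> dict[str, str]:
        """Return key/value pairs describing this machine, e.g. {"cuda": "12"}."""
        ...

def gather_variants(plugins: list[VariantDetector]) -> dict[str, str]:
    """Merge plugin results, refusing conflicting claims on the same key."""
    detected: dict[str, str] = {}
    for plugin in plugins:
        for key, value in plugin.detect().items():
            if key in detected and detected[key] != value:
                raise RuntimeError(f"conflicting detections for {key!r}")
            detected[key] = value
    return detected
```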
What I meant was that we should not have two or more packages, each with two different variant strings, that end up representing the same thing with potentially conflicting meaning.
I genuinely am suggesting that, as I think only individual packages will be able to judge the mutual exclusivity of the variants they support.
For example, package A might be able to adapt between a range of CUDA versions at runtime. It may want to be able to advertise that it supported multiple CUDA variants in one build, while its default build didn’t rely on CUDA at all.
By contrast, package B might be more tightly coupled to the CUDA version and define a separate variant for each option defined in the CUDA selector package.
It’s why my design sketch early in the thread allowed for combined variants, where a single installed combined variant could satisfy dependencies on multiple distinct individually named variants. The package metadata would then use environment markers to specify whether or not variants imposed transitive requirements on the exact variants used by the package’s own dependencies. Only the packages themselves know that info with certainty. Any centralised logic (even partially distributed across selector modules) would be making assumptions that may not be true in general.
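Purely to illustrate the distinction (the field names and marker syntax below are invented for this example, not text from any proposal): package A publishes one combined build covering several named variants without constraining its dependencies, while package B pins a single variant and passes that requirement on.

```python
# Invented illustration of the combined-variant idea; the field names and
# the marker syntax in the requirement string are placeholders, not spec text.

package_a = {
    "name": "A",
    # One combined build satisfies several individually named variants;
    # the default (variant-free) build doesn't rely on CUDA at all.
    "provides_variants": ["cuda11", "cuda12"],
    # A adapts to whatever CUDA runtime it finds, so it imposes no
    # variant requirement on its own dependencies.
    "requires": ["numpy"],
}

package_b = {
    "name": "B",
    # B is tightly coupled to one CUDA version per build...
    "provides_variants": ["cuda12"],
    # ...and uses a marker to require the matching variant of A transitively.
    "requires": ['A; variant == "cuda12"'],
}
```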
Sorry, I didn’t mean to imply that, beyond maybe we all agree to some plug-in architecture and so one tool drives e.g. the CUDA detection plug-in, the OpenBLAS plug-in, etc. IOW I don’t expect a single tool to own the detection logic.