Enforcing consistent metadata for packages

rgommers · April 4, 2024, 3:49pm

Thanks for the answers @konstin.

Yes indeed. It’s certainly happened to me that Pip cached some custom build, stored it in cache, and then grabbed that again once I wanted a regular install of the same package. It’s rare, but it happens.

Yes it is. I gave three reasons for why specifying as a regular dependency doesn’t work in this comment higher up.

pf_moore · April 4, 2024, 4:53pm

But these are the cases where a sdist really is nothing more than a generic distribution format, though, aren’t they? So the project could easily^[1] set up CI to build wheels as well as sdists, and that would eliminate those sdists from consideration.

One idea from the previous lockfile discussion, when lockfiles where wheel-only, was having a way for people to ship a lockfile plus a set of wheels that the lockfile refers to - so the person running the locker can build wheels for any sdists that would otherwise need to be locked. That’s not sufficient - if it was, the previous proposal without sdist support would have been approved - but it does offer the possibility of a workaround for a certain class of problems of this nature.

Yes, I know “you could easily do this” is not the right way to frame this sort of request. ↩︎

oscarbenjamin · April 4, 2024, 7:02pm

To be clear I take no offence but it is definitely unfair to compare my proposal with space bar heating!

The idea of having wheels on PyPI with different dependencies from the sdist is aimed at solving real problems. It is not something that is done now but rather something that would be better from my perspective. I described what is done now with duplicate bundling etc above.

We might be talking past each other here though. I see from @konstin’s description of the lockfile process that the expectation is that building a wheel via the PEP 517 interface (python -m build) produces a wheel whose (Python) dependencies are the same as the sdist. This is different from what Ralf and I are talking about which is having wheels on PyPI that are different from the output of a PEP 517 build. I explained above that the output of a PEP 517 build is not suitable for uploading to PyPI and what is done instead to make the wheels that actually get uploaded.

rgommers · April 4, 2024, 7:31pm

These things are not inconsistent with each other. The python -m build result will not be portable indeed, but it will have the same dependencies in its METADATA file. Running auditwheel to make it portable doesn’t change the METADATA file. @konstin just isn’t worrying about whether auditwheel is used or not, when they wrote “via the PEP 517 interface” that can be replaced with “via the PEP 517 interface plus vendoring deps via auditwheel or similar”. This process is a huge pain for package authors, but since it currently doesn’t impact metadata, it is quietly ignored by installers/resolvers.

rgommers · April 4, 2024, 7:55pm

Sure:

Current situation (leaving off version constraints for simplicity)

pyproject.toml:

dependencies = ["numpy"]

PKG-INFO:

Requires-Dist: numpy

METADATA:

Requires-Dist: numpy

Desired situation

pyproject.toml:

dependencies = ["numpy"]

PKG-INFO:

Requires-Dist: numpy

METADATA:

Requires-Dist: numpy
Requires-Dist: scipy-openblas32==x.x.x

That desired situation breaks sdist-wheel consistency. If we fix that with a new metadata field, it’d look something like:

Desired situation (with new metadata)

pyproject.toml:

dependencies = ["numpy"]

[project.pypi-binary-wheels]
dependencies = ['scipy-openblas32']

PKG-INFO:

Requires-Dist: numpy
PyPI-Wheels-Requires: scipy-openblas32==x.x.x

METADATA:

Requires-Dist: numpy
Requires-Dist: scipy-openblas32==x.x.x

pf_moore · April 4, 2024, 8:13pm

Note that this only fixes the consistency problem by defining a way to explicitly state that the dependencies aren’t consistent. I’m not convinced it’s any better than

pyproject.toml:

dynamic = ["dependencies"]

[tool.build-backend]
base-dependencies = ["numpy"]

PKG-INFO:

Dynamic: Requires-Dist
Requires-Dist: numpy

METADATA:

Requires-Dist: numpy
Requires-Dist: scipy-openblas32==x.x.x

(Note that the Requires-Dist in the sdist is, as per PEP 643, not canonical but can be treated as a hint about what the calculated value might be. If you don’t like the fact that the value isn’t canonical, Requires-Dist can simply be omitted from the sdist).

Both approaches explicitly say that the value in the wheel might vary, and neither helps lockers to ensure that their assumption that metadata will be the same in the wheel as in the sdist is valid (because it isn’t!).

I’ll also note that I don’t like the idea that PyPI-Wheels-Requires treats PyPI as special. I think that if we wanted to take this proposal further, we’d need to work out a way to define what it means to be “published” in a way that doesn’t mean “is on PyPI”. Which is essentially the idea of “visibility” that I referred to in my original post.

jamestwebber · April 4, 2024, 8:25pm

Is it a fair assessment that the previously-mentioned PEP 725 would resolve some (most?) of the conflict, in that it allows both the sdist and wheel to state external dependencies in a consistent way? Or is there still more metadata that needs to be recorded?

I guess I’m curious how much it resolves the issue, and what problems would be left unsolved.

pf_moore · April 4, 2024, 8:41pm

My impression is it lets us express the problem (“package X needs a BLAS library”) but it doesn’t help address the problem (there’s still no way to actually find a suitable BLAS library to install). Unless I’m missing something, there’s nothing in PEP 725 that would, for example, help me (or a tool I’m using) find a library with a compatible ABI.

And I don’t think it helps the “consistent metadata” problem, because aren’t we just adding more metadata that will need to vary between sdist and wheel?

jamestwebber · April 4, 2024, 9:12pm

It doesn’t help address the problem but I was thinking it might allow the metadata to be consistent for these cases. Maybe not though, I think I was misreading what the PEP is currently proposing.

I think that’s where I don’t understand, but maybe I need to understand PEP 725 better, and how it translates to metadata in wheels. Now that I start to look at it more closely, it is currently written in such a way that inconsistent metadata is standardized as best practice (e.g. this section implies to me that vendor-and-remove-from-metadata is the right way to do things)

I was thinking that e.g. numpy would add host-requires = [ "virtual:interface/blas" ], etc to its pyproject.toml (thus the sdist) and the metadata in the wheel would say the same thing–thus they are consistent. But if the wheel would instead lists the specific BLAS library it is using^[1], it’s just a new kind of inconsistency between the two.

Leaving the generic specifiers in a wheel might allow people to create wonky environments (i.e. somehow you end up with numpy using a different BLAS from scipy, except when scipy is calling a numpy function), but I don’t know if it breaks things?

or removes the requirement entirely ↩︎

ntessore · April 4, 2024, 9:43pm

But is it really any more inconsistent than dependencies with environment markers? There is only a syntax-level difference between having a separate field and

Requires-Dist: numpy
Requires-Dist: scipy-openblas32==x.x.x; some-imaginary-wheel-only-tag

in both PKG-INFO and METADATA.

oscarbenjamin · April 4, 2024, 10:17pm

This is the first bullet point from the motivation section:

Enable tools to automatically map external dependencies to packages in other packaging repositories,

PEP 725 does not translate to metadata in wheels or at least not in any way that the metadata would be used meaningfully by pip et al. The dependencies are referred to as “external” because they are not available as PyPI packages. The purpose of PEP 725 is to provide a way to represent the metadata that is not used in wheels but that is needed in other packaging systems.

There are possible fringe benefits that would come from this metadata being used by pip-like tools for better error messages but the real purpose is to translate to metadata in conda, apt etc.

jamestwebber · April 4, 2024, 10:32pm

Yes, of course, I get the specific motivation for the PEP itself. The relevant question here is how adding a way to specify external dependencies would affect the issues of metadata consistency between sdist and wheel that have been under discussion in this thread. If the sdist specifies the external dependencies, and the wheel reflects that specification, is there still an inconsistency possible?

konstin · April 5, 2024, 10:27am

Thanks, i’ll reply to that too.

Ralf Gommers:

To improve binary sizes by unvendoring:

the sdist still has zero dependencies

all numpy wheels gain a dependency on scipy-openblas64

That’s a mismatch between sdist and wheels (metadata 2.2 nor your proposal here allow for this). I think your question is: why can’t the sdist gain the dependency too? There are multiple reasons why that is not possible:

There is no sdist for openblas, so adding the dependency would break numpy installs on platforms not supported by wheels

It wouldn’t make sense to create such an sdist for openblas, not only because it’s not a Python package but also because we build the wheel in very specific ways to make it work - no from-sdist build on a random end user’s machine is going reproduce that.

OpenBLAS is not a required dependency of NumPy, but only of the binary numpy wheels. We have users who want to build from source with MKL instead of OpenBLAS, and distro packagers all have different ways of building against some BLAS library too.

A resolver ^[1] doesn’t know yet whether the installation target will use the source dist or wheel, so it needs to create a lock including scipy-openblas64. That means the wheel and the source dist all need to declare it as a dependency in some form. You can use environment markers for that, you could add a scipy-openblas64 source dist that’s a noop or we could add something like a distribution = "source" environment marker. None of these, nor the other syntaxes you proposed to the same effect, change that this information is consistent across a version of a package and can be specified on all wheels: This doesn’t block specifying that you should be able to read a single wheel and get all the resolver information for a version of a package.

To give this an easier example: If i say foolib ==1.2.3 ; sys_platform == "linux", it’s clear that this dep only exists for some wheels (those with *linux tags) and/or only for some source dist build (those on linux), but the information is always present, so even if we run the resolver on windows it adds an entry for foolib in the lockfile.

To give a bad example for comparison: The source dist checks if foolib.h is present on the system and only then depends on foolib if it isn’t. On the locking machine foolib.h is present, so foolib is not in the lockfile, but we resolve torch==2.2 form other deps. On the install machine foolib.h is not present, the source dist now has a foolib=1.2.3 depedency. foolib=1.2.3 wants torch<2, and the entire lockfile falls apart.

This can happen to you and me and other people in this thread who mess with custom builds, but i hope we don’t expect regular users to clear specific wheels from the cache to restore their packaging state. Many of the scientific python users i’ve met weren’t even aware what source dist and wheels are, and i believe we should build tooling that doesn’t require them to know. This does not mean that we don’t want to support people who do things like replace openblas with mkl, we have mechanisms to explicitly support custom things.

We have two cases, one is there are packages that could have totally provided wheels but just didn’t, the other is platform support outside {windows x86_64, linux x86_64, macos universal2} since cross compiling is hard.

I’m not speaking for the case where you have a wheel only deploy to a single docker container, but a lockfile shared between developers and CI on different platforms ↩︎

oscarbenjamin · April 5, 2024, 11:09am

As I understand it the situation with Ralf’s PyPI BLAS package and PEP 725 would be like this:

The SciPy sdist has an external dependency on BLAS that would be described as host-requires = ["virtual:interface/blas"] in pyproject.toml.
A plain built wheel would have been built against a particular BLAS library and so would have an ABI dependence on that particular build of BLAS. I am not sure how this would be reflected by Requires-External in the wheel metadata but the PURL specifiers would not be able to describe this requirement fully.

Currently for SciPy wheels the next step is:

A repaired scipy wheel with BLAS bundled in would have no external dependency and is uploaded to PyPI.

Ralf’s proposal for devendored BLAS is basically that:

A repaired wheel would have the external dependency on BLAS transformed into an internal dependency on scipy-openblas32 which would be an installable Python package. These scipy and scipy-openblas32 wheels would be what is uploaded to PyPI.

In practice to make such a wheel I imagine that you would not literally “repair” it like that but instead build the scipy wheel against the scipy-openblas32 wheel using a bespoke build process whose purpose is solely for PyPI upload. The ABI dependency would then be captured as an exact version pin between wheels with fully matching tags and there would be no sdist for scipy-openblas32.

Someone building SciPy using the normal python -m build would still need to have BLAS installed system-wide and the resulting wheel would still have an external dependency on that particular BLAS. The difference then is that when you build the wheel yourself you don’t get the same wheel as the one on PyPI which has an additional internal dependency on scipy-openblas32 rather than an external dependency on your local system BLAS library.

I don’t see how the PEP 725 metadata would be able to represent the fact that the wheels on PyPI have an additional internal dependency that the sdists do not.

aragilar · April 5, 2024, 11:11am

The natural solution to this would to be able to specify something like cargo’s features (Features - The Cargo Book), and be able to set whether foolib is to be used or not. Then you could specify what features to lock with.

jamestwebber · April 5, 2024, 2:28pm

Right, and this is precisely the part I was talking about in the post you quoted.

The PEP doesn’t specify what should happen to this metadata, but I can imagine a couple versions:

The external dependency is vendored and it is removed from the metadata. This is described^[1] in the PEP but it’s the problem that this thread is discussing: inconsistent metadata between sdist and wheel
The external dependency is precisely specified in the metadata (e.g. virtual:interface/blas becomes openblas32). This is still inconsistent but the wheel is fully described–if this was the approved solution then perhaps installers could adapt to this expectation. This would require canonical names for all of the external dependencies but I think that is part of the PEP 725 purview
The metadata is left unchanged. You end up with a wheel that says “I have blas but I can’t say which version”. The metadata is consistent but not precise

One question I tried to ask earlier, about scenario 3: does that break package resolution? Is there a hard requirement that e.g. numpy and scipy vendor the same BLAS^[2]? In practice, if a user is installing from a consistent source (PyPI, conda, whatever) they should be getting wheels built against the same external dependencies, but it’d be better to warn/fail on installation if they try to install incompatible packages.

possibly even endorsed ↩︎
it makes sense, but is it necessary ↩︎

mdrissi · April 5, 2024, 8:00pm

PyTorch pretty much does this today. Due to interesting index situation/missing environment markers for gpu my experience with PyTorch is it does in practice break universal lock files and at work I have commonly recommended either being very careful with lock files involving PyTorch and having lock per specific machine environment (includes details not captured by python metadata) or simply excluding PyTorch from lock file entirely. I have seen users unfamiliar with python packaging run into this pain point and feel confused/stuck for awhile. A correct lock file for both mac and Linux is tricky here as key relevant metadata for some dependencies (cuda) has no marker. The last time I helped someone with this issue was yesterday.

I think consistency assumptions you want are mostly true but when they break a normal user tends to either move to a different tool or explore hacky options like install some packages separate from rest and manually handle it’s dependencies.

Edit: I think there are multiple consistency assumptions being discussed.

Two wheels with same name/version/tags and maybe index have same dependencies/metadata.
The dependencies of a wheel specify with environment markers are relevant platform variations.
The sdist and wheel dependencies are same when built in same “environment”.

1 I think is safe reasonable assumption. 2 is false for various commonly used ml/data libraries. 3 is borderline but I would consider false as the definition of environment based on python markers/tags does not encompass (nor can it with arbitrary setup.py code) cover all ways library may choose to add dependencies.

3 is better assumption then 2 as libraries it fails on have good chance of not even publishing sdist to pypi. Still can sometimes break though.

kknechtel · April 5, 2024, 9:46pm

As of now, PEP 725 is still a draft, and there is nothing to express the "virtual:interface/blas" dependency. From Pip’s current perspective, it’s an internal dependency against nothing, rather than against an external dependency.

It seems to me that, if PEP 725 gets accepted and package tools are going to take into account, then we can resolve the discrepancy by having some way of communicating that scipy-openblas32 provides the virtual dependency. But then, I suppose this would require the metadata standard to understand and represent such virtual dependencies…

aragilar · April 6, 2024, 1:07am

I’m not sure 1 is a good assumption, in that name/version/tags (currently) gives a unique wheel filename, and so for PyPI which never lets you replace it’s always the exact same file (with the same hash), and for indices which allow replacement, one reason someone might replace said wheel is the original file had bad metadata (and so the metadata has changed).

pf_moore · April 6, 2024, 6:31am

Pip will assume that any two wheels with the same filename are identical, so there’s no guarantee in practice that this will work.