Enforcing consistent metadata for packages

konstin · April 2, 2024, 11:42am

From uv’s perspective: Yes please! We already assume that any wheel will have the same metadata as any other wheel and as the built source dist, and we’d love to have that codified.

For me, the goal would be to have this on packaging.python.org (somewhere we can point people to as an authoritative source) and eventually have PyPI enforce this invariant.

aragilar · April 2, 2024, 11:44am

Splitting out each of the cases:

Same wheel filename, different contents:
- Yes (but I suspect you already know this, and are asking only about metadata), win-amd64 will be used as the platform tag whether or not you vendor the required DLLs in the wheel (this seems like a fairly minor issue, but could trip up something that assumed the published/vendored PyPI wheel was the same as a locally built wheel).
(PyPI) Published wheels having different metadata than the built wheel from sdist:
- Yes, if you build h5py with MPI (which is controlled by an environment variable i.e. it’s not an extra, and not expressible as one), the metadata will differ (but only for dependencies which is already marked as dynamic) from those on PyPI (this is done as part of the wheel building process though, so the next case isn’t applicable).
Projects changing wheel metadata as part of the repair process:
- I don’t know of any projects doing this (I can conceive of toy examples, but no real ones), but cibuildwheel would allow it to happen from what I can see (using Options - cibuildwheel). Possibly in addition to PyPI warning about consistency (if that’s the plan), cibuildwheel should also do similar checks (so as to catch this as early as possible)? This also would seem not to be an issue if dynamic is set correctly?
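To make the h5py-style case concrete, here is a minimal sketch of a build whose recorded dependencies depend on an environment variable. The variable name `EXAMPLE_WITH_MPI` and the dependency names are illustrative assumptions, not h5py's actual build logic.

```python
import os


def compute_dependencies(environ=None):
    """Return the runtime dependencies a build would write into its metadata."""
    environ = os.environ if environ is None else environ
    deps = ["numpy"]
    # EXAMPLE_WITH_MPI is a hypothetical switch used only for this sketch.
    if environ.get("EXAMPLE_WITH_MPI", "").upper() == "ON":
        deps.append("mpi4py")
    return deps


# Build as done for the published wheels (switch unset).
published = compute_dependencies({})
# Local build from the same sdist with the switch set.
local_mpi = compute_dependencies({"EXAMPLE_WITH_MPI": "ON"})

# Same name and version, different dependency metadata.
print(sorted(set(local_mpi) - set(published)))  # ['mpi4py']
```

Since the difference comes from the build environment rather than from an extra, nothing in the wheel filename reveals it.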

konstin · April 2, 2024, 11:48am

I haven’t seen any such case, nor do I remember seeing any bug reports about it; I’m not aware of any tool that would currently rewrite the list of dependencies prior to publishing. We should of course check with the PyPI metadata, but I’m confident we’re lucky here for at least the large majority of cases.

pf_moore · April 2, 2024, 12:07pm

The key thing here would then be, how many bug reports have you had where that assumption has turned out to be invalid, and what are the reasons for those cases?

Let me give an example. Suppose packages A, B and C exist on PyPI. And suppose that A depends on B, and B depends on C. Now suppose that I download B’s sdist, unpack it and edit the source to remove the dependency on C, but I do not change the project name or version. Maybe I’m patching out a dependency that doesn’t work on my platform, but I’m not using the affected functionality. Now, suppose that I install the source tree I’ve just created, and then install A. The resulting environment is valid, but does not contain C.
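The A/B/C example can be sketched as follows. This is a toy model, assuming dependencies are just name-to-names mappings; it is not how pip or any locker represents them.

```python
def closure(roots, requires):
    """Return every package reachable from `roots` through `requires`."""
    seen = set()
    stack = list(roots)
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stack.extend(requires.get(name, ()))
    return seen


# Dependency metadata as published on the index.
index_metadata = {"A": ["B"], "B": ["C"], "C": []}
# Metadata of what is actually installed: B was patched locally to drop C.
installed_metadata = {"A": ["B"], "B": []}

expected = closure(["A"], index_metadata)
actual = closure(["A"], installed_metadata)

# A tool trusting the index metadata expects C, which the valid environment lacks.
print(sorted(expected - actual))  # ['C']
```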

How would you handle this environment (for example, if you were locking it)? The only available source for B is PyPI, but it has inconsistent metadata that would give a different set of packages to install. So you’re going to fail somehow. And yet, this is roughly what I understand Linux distributions do (in principle - in practice, I don’t think they deliberately set up this sort of breakage, of course).

We can declare this sort of behaviour invalid and unsupported, but if we do, then how do we tell the Linux distributors to change their workflows? Do we require them to change the version to add a local identifier every time they patch? Or create a dummy C so that they can leave the metadata unchanged?

konstin · April 2, 2024, 12:28pm

I haven’t seen a single report where this was the problem.

I don’t think we really handle this at all; we have some consistency sanity checks where we report problems to the user but otherwise we just assume the deps in a single release to be the same.
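For illustration only, a consistency check of this kind could look roughly like the sketch below. It is not uv's implementation; the file names and metadata are made up, and requirement strings are compared after whitespace normalisation only, which is far cruder than a real tool would need.

```python
from email.parser import Parser


def requires_dist(metadata_text):
    """Extract the Requires-Dist entries from core metadata text."""
    message = Parser().parsestr(metadata_text)
    return {" ".join(value.split()) for value in message.get_all("Requires-Dist", [])}


def dependency_mismatches(metadata_by_file):
    """Compare every file's dependencies against the first file's."""
    items = list(metadata_by_file.items())
    _, reference_text = items[0]
    reference = requires_dist(reference_text)
    mismatches = {}
    for filename, text in items[1:]:
        difference = requires_dist(text) ^ reference
        if difference:
            mismatches[filename] = difference
    return mismatches


# Hypothetical metadata for one release of a made-up project.
SDIST = "Metadata-Version: 2.1\nName: example\nVersion: 1.0\n"
WHEEL = SDIST + "Requires-Dist: scipy-openblas64\n"

report = dependency_mismatches(
    {"example-1.0.tar.gz": SDIST, "example-1.0-py3-none-any.whl": WHEEL}
)
print(report)  # {'example-1.0-py3-none-any.whl': {'scipy-openblas64'}}
```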

My preferred solution would be distro packages being entirely separate (externally-managed) from what “pypa-style” tools do.

pf_moore · April 2, 2024, 12:28pm

Pip will break on this already - we don’t distinguish between different wheels with the same filename. This is probably the basis for a lot of requests for index priorities in pip, but that’s a side issue here. Let’s just point out that no existing standard supports doing this.

OK, the h5py case is a real issue. It may be that the “visibility” idea I mentioned in my post will address this, if we can be sure that a system built around a build with the environment variable set will never “see” the PyPI distributions. But we need to understand the use cases better. It’s not enough to simply know that h5py does this (otherwise, one counterexample is enough to block any proposal), we need to understand why they do it and what limitations of the current system they are trying to work around.

Yes, this is very similar to the “patching a sdist” case, and like you, I am only able to come up with invented examples.

I’m starting to get the feeling that the “visibility” rule will be crucial - in particular, someone is going to need to try to formalise it better than I did in my post. People have tended to focus on PyPI in this discussion, and I’m concerned that PyPI is not actually the issue here. We can control PyPI, enforce consistency, etc., but as soon as the user adds --extra-index-url, or --find-links, or a direct URL requirement or a reference to a source tree, all those guarantees are lost. Not just for that one run, but for that environment, for as long as it exists. See the constructed example I posted in reply to @konstin for why.

pf_moore · April 2, 2024, 12:30pm

Given that distros use the standards-based tools to build their distribution packages, I don’t think that’s a realistic possibility.

rgommers · April 2, 2024, 3:58pm

One significant issue that I can think of is related to dealing with binary size issues and with native dependencies.

Multiple times when we’ve discussed size limit requests and questions like “why are packages using CUDA so large?”, the suggestion given to package authors to reduce binary size consumption is to split out the non-Python parts into a separate wheel, and depend on that in the main package (which uses the CPython C API and has to be built 5 times if one supports 5 Python 3.x versions). As a concrete example: PyTorch (torch on PyPI) has 5 manylinux-x86_64 wheels (cp38-cp312) of ~750 MB each for its latest release. The conda-forge team just did the “split to reduce binary size” thing, and the CUDA 12 wheels for pytorch are ~25 MB only - and they share the non-python-dependent libtorch part, which is ~400 MB. So it really was an almost 5x improvement. Having that for wheels would be super useful. Note that PyTorch doesn’t publish its sdists, so this may still work. However, there will be quite a few other projects that want to do something similar - making it possible while also publishing sdists seems important.

Similarly, for native dependencies, NumPy and SciPy both vendor the libopenblas shared library (see pypackaging-native’s page on this for more details). It takes up about 67% of the numpy wheel sizes, and ~40% of scipy wheel sizes. With four minor Python versions supported, that’s 8x the same thing being vendored. We’d actually really like to unvendor that, and already have a separate wheel: scipy-openblas32 · PyPI. However, depending on it is forbidden without marking everything as dynamic, which isn’t great. So we’ve done all the hard work, dealing with packaging, symbol mangling and supporting functionality to safely load a shared library from another wheel. But the blocker is that we cannot express the dependency (important, we don’t want to ship an sdist for scipy-openblas32, it’s really only about unvendoring a binary).

PyArrow wants to do something very similar to NumPy/SciPy to reduce wheel sizes. I’m sure there will be more packages.

Another related topic that comes to mind is the pre-PEP that @henryiii posted a while back: Proposal for dynamic metadata plugins - #46 by henryiii. IIRC he was going to include the idea of “dependency narrowing” in it (i.e. wheels that can have narrower ranges of the same dependency as in the sdist, e.g. due to ABI constraints).

A conceptual problem here is that we’re striking an uncomfortable balance between:

1. “sdists are generic and the source release for all redistributors”, and
2. “sdists must match binary wheels”.

We like to pretend that both are true, but that cannot really be the case when you get to packages with complex builds/dependencies. See The multiple purposes of PyPI - pypackaging-native for more on this. @pfmoore’s idea here heavily leans to (2), but (1) was in earlier discussions considered as quite important and the original purpose of PyPI. For the examples I gave above, we’d like to be able to add a runtime dependency to wheels without touching the sdist metadata (because adding it to the sdist would impact (1) and may not even make sense for from-sdist builds by end users).

I think a requirement that all binary wheels must have the same metadata would be easier to meet, and this would address the need in the cases I sketched above. Requiring the sdist metadata to match is problematic, and should be loosened rather than tightened if we want to do anything about the large binary sizes problem.

mdrissi · April 2, 2024, 3:59pm

PyTorch can behave similarly to h5py: in PyTorch’s case, the dependencies can change based on the presence of CUDA, and there is no environment marker or wheel tag to distinguish a CUDA/GPU environment. That said, PyTorch does not even publish its sdist to PyPI, so building from source is less common there.

Edit: One other aspect specific to PyTorch: what about local version tags? When you say two wheels with the same name/version should be the same/consistent, does that include local versions? PyTorch commonly has multiple wheels that differ only in local version, published to custom indices, where dependencies can differ across wheels.

pf_moore · April 2, 2024, 4:16pm

That sounds awesome, but I don’t see how the wheel metadata is affected? Is the problem around not having markers that encode the information needed to choose the right libtorch wheel?

Again, can you clarify why depending on it is forbidden? I feel like I’m missing some context here, as I know of no standard that would forbid this.

I agree, the dual nature of sdists is a big part of the problem here. As you know, pip has an open issue to make --only-binary :all: the default, and it’s possible that the right solution here is to go even further and (somehow) make wheel-only installs the only supported approach. But I don’t know how feasible that is - we’d break a lot of people who use pip as a build tool rather than an installer, and it would be a lot of disruption in the packaging tool ecosystem.

(Historical background - pip originated as a tool that only installed from sdists. When wheels were invented, pip added the ability to install from them, and that turned out to be a huge success. But we’ve never been able to remove sdist installs, and one reason for this is because there’s a lot of other things you can do with a sdist beyond just installing them from PyPI, and people have built workflows around using pip for these other things).

pf_moore · April 2, 2024, 4:21pm

Right. That’s the detail I was missing. I don’t think varying metadata is the right solution here, but at the moment it’s the only one that works, and practicality beats purity.

I think local versions are different versions, in the sense we’re using here. Although it’s possible that people using the “consistent metadata” assumption for locking might think differently - I don’t know if locking down to the local tag that encodes the CUDA situation is what is wanted here.
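To show what “different versions” means in practice, here is a small sketch that groups wheels either by full version or by public version only. The version strings are illustrative, and the split on `+` is a simplification of the local version syntax rather than a full version parser.

```python
def split_local(version):
    """Split a version string into (public version, local label or None)."""
    public, _, local = version.partition("+")
    return public, local or None


# Illustrative versions of the kind published to per-CUDA-variant indices.
versions = ["2.2.0+cu118", "2.2.0+cu121", "2.2.0+cpu", "2.2.0"]

by_full_version = {}
by_public_version = {}
for version in versions:
    public, local = split_local(version)
    by_full_version.setdefault(version, []).append(local)
    by_public_version.setdefault(public, []).append(local)

# Keyed on the full version, each variant is its own release and may carry
# its own metadata; keyed on the public version, they all collide.
print(len(by_full_version), len(by_public_version))  # 4 1
```

Whether a locker should pin the full version or only the public part is exactly the open question here.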

… and once again the “visibility” approach looks like it might be a practical compromise. If users pick an index based on the CUDA variant they want, they will never see wheels with conflicting metadata, so maybe we’re OK? (As @rgommers points out, sdists are the fly in the ointment here, but if sdists are not published, or only published on a sdist-only index, then maybe that’s enough?)

rgommers · April 2, 2024, 5:03pm

Example for numpy. Current situation:

- the numpy sdist has zero runtime dependencies
- all numpy wheels also have zero dependencies, and they vendor libopenblas.so|dll (and other things they need, like libgfortran.dll)

To improve binary sizes by unvendoring:

- the sdist still has zero dependencies
- all numpy wheels gain a dependency on scipy-openblas64

That’s a mismatch between sdist and wheels (neither metadata 2.2 nor your proposal here allows for this). I think your question is: why can’t the sdist gain the dependency too? There are multiple reasons why that is not possible:

- There is no sdist for openblas, so adding the dependency would break numpy installs on platforms not supported by wheels.
- It wouldn’t make sense to create such an sdist for openblas, not only because it’s not a Python package but also because we build the wheel in very specific ways to make it work - no from-sdist build on a random end user’s machine is going to reproduce that.
- OpenBLAS is not a required dependency of NumPy, but only of the binary numpy wheels. We have users who want to build from source with MKL instead of OpenBLAS, and distro packagers all have different ways of building against some BLAS library too.

I think the solution here doesn’t have to be that radical. The problem here seems to be not the mixing of installing from source and from binaries, but only the assumption made by resolvers that sdist and wheels always have the same dependencies. Which isn’t a great assumption to be making. If instead they would only assume that all wheels have the same dependencies, that’s still not perfect but already a lot better.

Or, there could be some way for package authors to already declare the differences in the sdist. E.g., you may add to or overwrite metadata fields:

[project.pypi-binary-wheels]
dependencies = ['scipy-openblas64']

This would make sense conceptually, since binary wheels on PyPI are definitely not the same as just building the sdist from source (e.g., it’s not like pip runs auditwheel …).

jamestwebber · April 2, 2024, 5:59pm

It seems like OpenBLAS is in effect an extra dependency, but one that should be included by default for most users. The fact that it can’t be built by anyone else is unfortunate but a separate situation, I think?

In YACR[1], this is analogous to a feature flag that is part of the default. Users who want to build with MKL would explicitly disable it, and if they did, they wouldn’t find a wheel available on PyPI.

[1] Yet Another Comparison to Rust

cemici · April 2, 2024, 5:59pm

In a world where we nevertheless insist on fully consistent metadata across built and source distributions, this might take us to one or both of

- more use of extras, e.g. numpy and numpy[with-openblas]
  - since you want with-openblas to be the default, excludable extras in the style of cargo features would be helpful here
- or separate packages numpy-for-most-people and numpy-without-openblas (but with better names!)

Perhaps it is a matter of taste whether this is an improvement, or is the tail wagging the dog. I think I might be ok with it.

jamestwebber · April 2, 2024, 6:02pm

The only viable future is one where pip install numpy downloads a working wheel…if it tries to build from source, or installs a broken wheel, then numpy will drown under the weight of bug reports/help requests.

But numpy[without-openblas] or numpy = { default-features = false } or something like that seems reasonable for the other case.
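A toy sketch of what “default extras that can be switched off” might mean for dependency selection. Nothing like this exists in current packaging standards; the extra names and the mapping are made up.

```python
# Hypothetical mapping from extras to the dependencies they pull in.
EXTRA_DEPENDENCIES = {"openblas": ["scipy-openblas64"], "mkl": []}
# Hypothetical set of extras enabled unless the user opts out.
DEFAULT_EXTRAS = {"openblas"}


def runtime_dependencies(requested=(), use_defaults=True):
    """Resolve the dependency list for a set of requested extras."""
    extras = set(requested)
    if use_defaults:
        extras |= DEFAULT_EXTRAS
    deps = []
    for extra in sorted(extras):
        deps.extend(EXTRA_DEPENDENCIES[extra])
    return deps


# Plain install: the default backend comes along.
print(runtime_dependencies())  # ['scipy-openblas64']
# Explicit opt-out, e.g. for a from-source build against MKL.
print(runtime_dependencies(["mkl"], use_defaults=False))  # []
```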

rgommers · April 2, 2024, 6:30pm

None of the suggestions in the above three posts work at all, I’m afraid. It’s not optional in the “you can leave it out” sense (you can only replace it with another equivalent thing). And it isn’t really specific to numpy, it’s just one example. The problem can be stated more generically: any time auditwheel/delocate/delvewheel/repairwheel are used to vendor shared libraries as part of a release process, that:

- increases binary size of the wheels
- breaks the equivalence between “build wheel from sdist with pip/build” and the wheels a project actually releases
- means a regular from-sdist pip install on Windows is likely broken

Many projects have to do really convoluted things to produce their wheels. Only for the simplest projects with C/C++/Cython/etc. code are released wheels equivalent to the result of python -m build. I think capturing some of that complexity explicitly where it affects metadata will be necessary.

jamestwebber · April 2, 2024, 7:02pm

I think this is still inside the remit of feature flags/extras, though. It’s valid for a package to have N optional backends, and nothing works if you don’t install at least one of them. I think the key is that there exists a default backend for standard installs, which can be disabled explicitly[1].

It might not be possible to fail gracefully if users do something very silly (i.e. refuse to install any backend) but they shouldn’t do that.

This isn’t something that can be accomplished in PyPI packages right now but it feels like PEP 735 and the like might get to it eventually.

[1] This isn’t possible now, to my knowledge.

pf_moore · April 2, 2024, 7:07pm

Thanks for all of the explanations. I now understand the issue much better. One question that I still don’t know the answer to, is why this doesn’t cause issues for lockers like PDM and Poetry that assume consistent metadata already? Or for that matter, for uv which makes basically the same assumption in this case, according to @konstin above. After all, it’s not like numpy is exactly a niche library!!!

Edit: Whoops, sorry, I got confused between “the situation now” and “what we’d like to do”. Given that neither numpy wheels nor the numpy sdist have any dependencies now, there’s clearly no problem at the moment. The issue is that the assumption being made by these tools, which this discussion is about standardising, prohibits the improvements you would like to make.

Which I guess begs the question - how does the way lockers work (or the principles they work on) impact your plans in this area? Because if there is an impact, you should probably be flagging the issue on the latest lockfile discussion, which is looking like it might actually result in a standard this time

cemici · April 2, 2024, 7:12pm

But the problem with possibly-inconsistent metadata came when you described unvendoring.

Is there even a problem when vendoring? All distributions just do not declare that they require the vendored thing, no?

This is also the answer to

why this doesn’t cause issues for lockers like PDM and Poetry that assume consistent metadata already?

because - as earlier -

Current situation:

the numpy sdist has zero runtime dependencies

all numpy wheels also have zero dependencies

which could hardly be simpler for resolvers

kknechtel · April 2, 2024, 10:53pm

I’m afraid I really don’t have a clear picture of the overall system here.

I wrote out my thought process below, but first I want to check something:

I lack experience here obviously, but I’m struggling to imagine what some of these convoluted things might look like, or why they would be necessary. In my mind, a wheel can only contain two kinds of code: native Python modules, and… everything else. The native Python modules come from copying .py files in the sdist (perhaps compiling to .pyc), while everything else comes from some automated process with setup.py at the top (even if it just in turn invokes Ninja or CMake etc.)

Creating a wheel, as I understand it, entails:

1. putting the Python code in the right places;
2. creating the non-Python-code pieces;
3. putting the non-Python-code pieces in the right places;
4. adding metadata.

As far as I’m aware, any build backend (including vanilla Setuptools) knows how to do 1/3/4 - at least, it can read some tool-specific config data (like [tool.setuptools.package-dir] etc. in pyproject.toml for Setuptools) to figure out what needs to go in the wheel, where it is prior to wheel building, and where in the wheel it should go.

So I would think that the only interesting part is writing code in setup.py that creates the non-Python pieces and puts them in appropriate places. Once that’s done, the rest is formulaic.
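To check my own understanding of the “formulaic” part, here is a minimal sketch that assembles a pure-Python wheel by hand using only the standard library. It skips creating any non-Python pieces, uses a made-up project name, and is not what any real build backend does internally.

```python
import base64
import hashlib
import io
import zipfile


def build_wheel(name, version, files, requires=()):
    """Return the bytes of a minimal pure-Python wheel archive."""
    dist_info = f"{name}-{version}.dist-info"
    metadata = f"Metadata-Version: 2.1\nName: {name}\nVersion: {version}\n"
    metadata += "".join(f"Requires-Dist: {req}\n" for req in requires)
    wheel = (
        "Wheel-Version: 1.0\nGenerator: sketch\n"
        "Root-Is-Purelib: true\nTag: py3-none-any\n"
    )

    # Payload files go at the paths they should have once installed.
    entries = dict(files)
    # Metadata lives in the .dist-info directory.
    entries[f"{dist_info}/METADATA"] = metadata.encode()
    entries[f"{dist_info}/WHEEL"] = wheel.encode()

    record = []
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path, data in entries.items():
            archive.writestr(path, data)
            digest = base64.urlsafe_b64encode(hashlib.sha256(data).digest())
            record.append(f"{path},sha256={digest.rstrip(b'=').decode()},{len(data)}")
        # RECORD lists itself without a hash or size.
        record.append(f"{dist_info}/RECORD,,")
        archive.writestr(f"{dist_info}/RECORD", "\n".join(record) + "\n")
    return buffer.getvalue()


data = build_wheel("example_pkg", "1.0", {"example_pkg/__init__.py": b"VALUE = 1\n"})
print(zipfile.ZipFile(io.BytesIO(data)).namelist())
```

Note that the dependency list is just text written into METADATA at this point, so anything that rewrites the archive afterwards could in principle change it.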

So - where does the convolution come in? Why would it be necessary to do things that build can’t do, or that are more than just putting some code in setup.py that ultimately just shells out to some compilers and maybe moves some files around afterward?

Moving on, let’s see if I understand the situation with scipy-openblas32 properly.

Let’s first suppose I have a project where I’ve installed Scipy and call some Scipy function in the code, and Scipy requires some BLAS functionality (i.e., uses a non-Python dependency). I know of a few fundamentally different ways that this could be interfaced:

The code is written in C (or perhaps C++), in such a way that it already conforms to the Python-C FFI. The built and installed distribution contains a corresponding .so (or .pyd on Windows) file which Python can just import directly. My understanding is that BLAS has a decades-long history, its own API, and is normally implemented in Fortran, so this doesn’t apply.
The Python code uses ctypes to communicate with a vendored DLL (still .so, or .dll on Windows).
The Python code expects the system to provide a DLL already; it looks up that DLL (whether by a hard-coded path, or some more sophisticated search/discovery mechanism) and communicates with it via ctypes.
A Python wrapper chooses one of the above strategies at runtime (when using ctypes, a wrapper would normally be used anyway just to avoid littering ctypes calls throughout the rest of the code).
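The ctypes-based strategies in the list above (vendored library, system library, and a wrapper choosing between them) could be sketched like this. The directory, the filename matching and the library name are assumptions for illustration, and this makes no claim about how SciPy itself finds BLAS.

```python
import ctypes
import ctypes.util
import os


def load_shared_library(name, vendored_dir=None):
    """Try a vendored copy first, then the system; return None if neither loads."""
    candidates = []
    if vendored_dir and os.path.isdir(vendored_dir):
        for filename in sorted(os.listdir(vendored_dir)):
            is_library = ".so" in filename or filename.endswith((".dll", ".dylib"))
            if name in filename and is_library:
                candidates.append(os.path.join(vendored_dir, filename))
    # Ask the platform's usual lookup mechanism for a system-provided copy.
    system_path = ctypes.util.find_library(name)
    if system_path:
        candidates.append(system_path)
    for candidate in candidates:
        try:
            return ctypes.CDLL(candidate)
        except OSError:
            continue
    return None


# Hypothetical vendored directory; on most machines this falls through
# to whatever the system provides, if anything.
lib = load_shared_library("openblas", vendored_dir="/nonexistent/libs")
print("loaded" if lib is not None else "no BLAS library found")
```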

Do I understand properly so far? Did I overlook anything?

Then, let me try to shift to the building/packaging perspective. I infer that SciPy is taking the ctypes approach, and it dynamically wants to use either a vendored DLL or a system-provided one. The existing SciPy sdist includes the necessary pieces to build a vendored DLL, as well as the logic to build and include that DLL in wheels for platforms where it’s necessary. If again as an end user there isn’t a wheel for my platform, I can ask Pip to install from the sdist, and hopefully it will succeed in building the vendored DLL if I need one.

Am I still on the right track?

So, now the goal is to move the DLL-specific stuff into a separate (already existing, in fact) scipy-openblas32 package that doesn’t actually contain any Python modules, and is only provided in wheel form; and then have the vendored DLL come from there when needed.

But the problem is that only some subset of wheels should have this as a dependency; describing it as an “optional dependency” is insufficient because the decision to include it should be made automatically and not by user preference? I.e. the following two situations are unacceptable:

a user who lacks a system BLAS, opts to try to install SciPy without the separate BLAS “extra” and then has code fail at runtime when the BLAS functionality isn’t found
a user who has a system BLAS, opts for an installation with the “extra” and it’s simply redundant

Aside from that, this isn’t clear to me:

Why is this different from the situation with the overall sdist for SciPy? Surely the work required to build and install SciPy from source, for platforms where BLAS support isn’t provided by the system already, would be a superset of the work required to build and install the BLAS support?