Enforcing consistent metadata for packages

rgommers · April 3, 2024, 7:34am

Things work right now indeed. Reducing binary sizes by unvendoring is a wish/opportunity - just like the proposal in this thread is.

I’ve been thinking and talking to other people about binary size topics for quite a while. The problem seems large, and getting more urgent. To illustrate:

The PyPI hosting costs were ~$1.8M/month three years ago. So it’s probably significantly higher by now, given the growth in Python usage and AI in particular. Tens of millions of dollars a year is a lot …
Download volumes are huge. For numpy: 200 million downloads/month * 12 months/yr * 15 MB/wheel = 36 PB/year
More anecdotally: size limit increase requests on PyPI are not getting approved (in time or at all) more often, and I’m not sure if that is only due to PyPI maintainer bandwidth or also due to increasing concerns about binary size.

No one person has a full picture here I think; my personal perception is that the binary size issue/opportunity is overall significantly more important than the opportunity to simplify implementations for resolvers and lock file generation tools. Very hand-wavy: the former seems to be one of the primary concerns for the sustainability of PyPI as a whole, the latter can probably be measured in months of engineering time.

aragilar · April 3, 2024, 7:36am

This metadata difference reflects that if you build h5py with MPI, then you need mpi4py in order for the code to work, but mpi4py is not able to produce wheels for PyPI (because MPI is coupled to the system you run it on, so even if you tried to produce something that worked for a subset of users, it would cause major issues for all other users), so if you don’t use MPI, then building h5py with MPI only causes issues.

I personally use devpi to do your “visibility” idea (and I think it’s a good idea), but the lack of a local/cache wheel tag (or a way to express features, as existed previously in setuptools, and is how rust ecosystem handles optional dependencies/configuration options, which I feel is really what is needed here to correctly express dependency relationships) does make this somewhat more work than it needs to be.

pradyunsg · April 3, 2024, 7:53am

I think it’s OK to enforce the sdist → wheel consistency, except when the project explicitly declares the metadata as dynamic (as permitted under metadata 2.2). I don’t think it’s reasonable to enforce all wheels need to have the same metadata as each other or as the sdist, especially when the metadata is marked as dynamic.

So the point of this post is to give people a chance to publicly describe possible situations where a rule that “all artifacts for a given version of a project must have the same metadata” would cause issues.

I strongly agree that one of the pre-requisites here is identifying usecases for deviations, and deciding if we care about them and how we could cover them. I don’t think we should make a decision here without considering those needs.

My sense is that people with complex dependency relationships that violate those assumptions don’t use these tools. I’ve definitely seen users “silently” not use a tool A that doesn’t work for their use case and switch to a different tool B that does, even if they would miss certain functional behaviours from tool A.

For most of such instances I’ve been exposed to, they end up falling back into pip-compile + pip + venv, instead of any of the other resolver/locker tools that deviate and have different set of bugs.

For example, pytorch is non-trivial to use with Poetry (which was an improvement over impossible) – Instructions for installing PyTorch · Issue #6409 · python-poetry/poetry · GitHub – and this is a part of why (anecdotally) it’s rarer to see ML/AI projects using Poetry.

rgommers · April 3, 2024, 8:13am

This is not the thread to revisit this topic in detail I think, so I’m just going to give you some of the right things to read here:

Neither pip nor build deal with external shared libraries. If there are any and the build succeeds, this will result in either a non-portable wheel or (on Windows, likely) a broken wheel. You can look at GitHub - pypa/auditwheel: Auditing and relabeling cross-distribution Linux wheels. and PEP 513 for some descriptions. As well as the difference between the “install sdist from PyPI” and “build wheel for uploading to PyPI” steps at Build & package management concepts and terminology - pypackaging-native.

It doesn’t, this is the job of auditwheel & co right now.

No, this is a hard error at build time. There are no extras. To learn more, see:

frenzy · April 3, 2024, 9:38am

When packaging Python packages for Linux distributions, we sometimes have to patch the code as well as the metadata. At the same time, we prefer to use the metadata from packages instead of copying them into our configuration so when something changes in the future version, it’s reflected without any manual work required. There are many reasons for that so let me mention a few of them:

Locked versions. If package A requires B<2 and package C requires B>=2, we have a problem. We can have only one version of package B so we have to either patch metadata in package A or C. Usually, the versions restrictions are too strict so we just need to remove the limit but sometimes it requires to patch the code as well. We might as well take an existing unreleased commit from sources of the package etc. There are many ways to solve the issue. But at the end, we change the little bit of metadata and let the rest be used by automation.
Vendored libs. It’s discouraged to use vendored libs in Linux distros when a lib is available as an RPM package. In that case, we have to remove the vendored version and add a requirement on the packaged one.
Fixing a bug or security issue by backporting patches from newer versions. Sometimes we cannot update a package to fix a bug or security vulnerability. Then an option is to backport a fix to the older version - and that fix might also alter the metadata (dependencies, description …).
Removing dependencies we don’t need or have available. When building a package, we run tests to detect problems or incompatibilities with system libs. That makes sense. But it does not make sense to run type checkers or test coverage so we usually remove those from test dependencies before we install them and run the tests.

In all of those cases, using projects’ metadata allows us to maintain a much smaller configuration on our side and remove the need to check the metadata changes between releases.

pf_moore · April 3, 2024, 9:50am

Cool. So what do you think is the right approach for lockers that currently make the “consistent metadata” assumption? Currently, I believe that they silently produce incorrect lockfiles, which feels wrong to me.

Maybe this is a “practicality beats purity” matter, where standards should simply keep out of the question of how lockers can produce lockfiles in the face of the realities of Python packaging. But if so, that would mean that the work on a lockfile spec would likely have to declare the question of how lockers can produce cross-platform lockfiles as “out of scope”, and my understanding is that that’s quite an important use case.

To be clear, I don’t have an opinion on this - and it’s fine if you don’t either. At this point, I’m just trying to explore the options for how a lockfile PEP and/or any future metadata consistency PEP should view the situation.

I agree. However, I’m not aware of any significant standards or infrastructure work going on in the binary size area^[1]. As you’ve mentioned, individual projects are putting a lot of work into this, but IMO we need to look at the wider picture. Otherwise the classic open source “whatever gets developer attention wins” prioritisation technique ends up applying…

PEP 725 might be related, but the metadata it’s proposing seems to be more informational than actionable ↩︎

kknechtel · April 3, 2024, 9:51am

So in this case, the build would actually be impossible for an end user? As I understood it, auditwheel repair fixes the wheel by copying in a system-provided library, but the entire reason we would need to do this is because of the system not providing that library. And then, auditwheel patches a wheel after the fact (rather than setting up the build environment with the binary blob first), and has to be inserted into the process of building for PyPI upload (whereas it can’t work on the user’s side). Is that the main “convoluted” part?

Aside from that, I didn’t get a clear understanding of how auditwheel knows what to include. It seems like it’s necessary for the project to include small wrapper extension modules in C, which then specify those dependencies at a C level?

I was describing a hypothetical, to make sure I understand why the problem is seen as a problem. I understand that you want to make this an error per PEP 725. What I understood from @jamestwebber 's idea is that it should be easy enough to detect at runtime that the dependency is missing (provided neither by the installed wheel nor by the system) and give a suitable message (perhaps with advice like “try using the system package manager to install foo-dep” or “please downgrade Python and/or this package, and wait for a wheel for your platform”).

Let’s suppose that we have a none-any wheel that simply doesn’t vendor any non-Python code, where these checks are done. You describe such a wheel as broken or non-portable, but I see it as just limited-functionality. While this situation certainly isn’t ideal (and it’s why ideas like “only-binary by default” are under consideration), it sounds better to me than getting such a message from a preemptive check in setup.py (assuming the build-time dependencies could be installed, and the logic for the check is practical to express, and it doesn’t lead to yanked releases when new wheels become available…), which in turn is much better than having to infer the problem from a failed build (since it seems that most users’ comprehension begins and ends with “I got a subprocess-exited-with-error error and everything I can find seems specific to some other project”).

oscarbenjamin · April 3, 2024, 11:06am

It is not impossible but it might be difficult. The PEP 517 build interface assumes that non-Python dependencies and build tools are just available somehow externally. If a user has those dependencies available (e.g. from apt, homebrew, built from source etc) then they can build a wheel from source and install it. The wheel built in this way is not portable but that is okay if they only want to install it locally.

To make a portable wheel you need to wrap the PEP 517 build interface between steps that first install/build the non-Python dependencies and afterwards bundle them into the wheel. Typically cibuildwheel is used for this and codifies these additional steps:

Install non-Python dependencies (CIBW_BEFORE_ALL).
Perform per-Python-version setup (CIBW_BEFORE_BUILD).
Run the PEP 517 build
Repair the wheel (CIBW_REPAIR_WHEEL_COMMAND).

The repair wheel step bundles the libs into the wheel. Some of those libs might be things that the user does not have on their system. Other things might be libs that they do have but that the extension modules in the wheel might not be ABI compatible with so the wheel needs to bundle its own versions anyway.

When installing from source with e.g. pip you only get the PEP 517 build step. That can only work if you first install the dependencies and build tools system wide e.g.:

sudo apt-get install libopenblas-dev python3-dev build-essential
pip install --only-binary numpy numpy

For certain dependencies it would be nice to build them as part of the PEP 517 build step and have them bundled into the wheel but there is not currently an easy way to make this work.

It is not just about the dependency being present or not but about ensuring ABI compatibility. You need to know that the extension modules were built against the same build of the libraries that will be used at runtime. There is no general way to detect ABI compatibility except by ensuring that the extension modules and the libraries were built together. There are two common ways to ensure this:

Install the libraries system wide, build against them and only use the resulting wheels on that system.
Bundle all the libraries that were used when building into a wheel that is portable because it only uses its bundled libraries at runtime.

If you have no information about the provenance of the built library files that are found at runtime then there is no way to detect whether they would be ABI compatible with the extension modules.

This is also why ABI dependency between portable wheels is difficult/impossible: you would need to be able to say “this wheel is only ABI compatible with that exact other wheel” regardless of what the version constraint metadata in any sdist says. There are ways that this could be made to work but it necessarily involves the wheels having much tighter dependency constraints than the sdists from which the wheels are built.

kknechtel · April 3, 2024, 11:47am

Ah, so the auditwheel invocation isn’t even originating in setup.py?

So at this point, the wheel no longer needs to include the dependencies anyway, so the lack of an auditwheel step is no longer a problem. But then - how exactly does the presence of these dependencies influence the PEP 517 build step?

What I think I’m missing here: why is it necessary to have these extension modules, as opposed to interfacing via ctypes (as some projects seem to get away with)? And what’s actually in them?

I don’t think I understood the linked example. It seems to me like the dependency isn’t between the two wheels, but rather both would be dependent on the same libgmp, which needs to be made available somehow. So, let’s suppose we had libgmp wheels that somehow or other contain just a binary for the shared library, with various ABIs represented by separate wheels. Between specifying the library version and the wheel tags, why doesn’t that adequately specify the ABI? Isn’t the necessary ABI tag for the shared library, just the same as the ABI tag for the wheel that’s depending on it? I think I need a more concrete example of how this fails.

Or else, just spitballing here - is this something that could be fixed if, say, PEP 508 dependency specifications could mandate a specific ABI for the dependency?

oscarbenjamin · April 3, 2024, 12:03pm

These questions are too tangential for this thread. I suggest to go and look at the code in some projects, try building them, try to make a extension module yourself etc. Many of your questions would be answered.

I will just address this one:

Using ctypes does not solve any of the problems of ABI compatibility. In fact it makes them harder because you have to presume an ABI in your Python code but you lose the benefit of having any of it be checked by a C compiler. With ctypes it is hard to even ensure ABI compatibility with the libraries that are on your local system.

Another reason for not using ctypes is that it is very slow compared to e.g. Cython.

pf_moore · April 3, 2024, 12:26pm

To bring this part of the discussion back on topic, what I’m hearing is that:

In order to reduce the size of wheels, putting shared libraries into their own wheel is an important potential optimisation that projects are working on, but which isn’t yet in use in published wheels.
Requiring consistency between sdists and binary wheels prohibits this optimisation, because the sdist should not have a dependency on the shared library wheel, but should instead build using a locally installed version of the dependency (in order to allow building from sdist on platforms not supported by wheels).

Is that an accurate statement of the issue here?

(As a side note, it seems to me that for projects in this category, building from sdist is something that almost certainly needs special expertise, or a carefully set up build environment - meaning that for general tools like pip or lockers, ignoring the sdist is actually likely to give a better user experience than trying to use it as a fallback installation option).

ntessore · April 3, 2024, 12:43pm

This seems more of a practical problem than a conceptual one: We currently lack the means to express more complicated, real-world situations like “this dependency applies to wheels only” in static metadata. For example, the “wheels have an extra dependency” could be made consistent with e.g. a simple “wheel-only” pseudo-marker. ^[1]

Only to illustrate the point, I’m not putting that forward as a good idea. ↩︎

pf_moore · April 3, 2024, 12:51pm

I think there’s another issue here as well, which is that some sdists are trivial to build^[1], whereas others need specialist knowledge or tools. That results in another “uncomfortable balance”, between “sdists are a format for end users to install from” vs “sdists are a format for redistributors to build binary packages from”.

Pure Python is the obvious case. ↩︎

sinoroc · April 3, 2024, 6:42pm

By chance do you know what metadata gets written at installation in *.dist-info/METADATA? For example if we run something like importlib.metadata.metadata('library-name').get_all('Requires-Dist'), do we get the metadata that was in the original source-tree/sdist/wheel (or whatever the conda packager used) or the conda-altered metadata?

Same question applies for Linux system packages.

I would not know how to test this because I do not know any package whose metadata has been altered by repackager (conda or Linux). I only recall that a while ago on Ubuntu/Debian pip freeze would list pkg_resources==0.0.0, but that is slightly different topic I think.

oscarbenjamin · April 3, 2024, 6:50pm

Yes, although as usual it is not the only issue

Requiring that sdists and wheels have the same compatibility constraints is problematic but reducing the size of wheels is only one part of this. I will describe the situation as it concerns python-flint (which I maintain) and gmpy2 which is closely related. I don’t know as much about pytorch etc as Ralf does but ultimately I think many of the issues are similar, just more extreme in terms of file size because of the larger CUDA binaries.

The C-level dependencies for gmpy2 are GMP, MPFR and MPC. For python-flint the dependencies are GMP, MPFR and Flint so GMP and MPFR are shared dependencies. The primary purpose of both gmpy2 and python-flint is to expose the functionality of the underlying C libraries so that it can be used from Python.

In every other packaging system (conda, apt, homebrew etc) there would be a single shared copy of the GMP and MPFR libraries that would be used by both gmpy2 and python-flint (that’s why they’re called shared libraries!). There would only need to be one GMP package to be maintained. The build farm would only build this package once. A user would only download it once. It would only be in one location on disk and one location in memory at runtime. Projects like python-flint and gmpy2 could just use that GMP package as it comes and could build directly against its binaries.

In the case of the PyPI ecosystem it isn’t easily possible to have a GMP package that gmpy2 and python-flint can share because any binary wheels would need to be ABI compatible. Instead both gmpy2 and python-flint need to build GMP and bundle it. Building GMP is especially difficult on Windows and both projects have had to figure out ways of solving that problem along with all of the associated CI tooling to make it happen and both projects need to maintain that going forwards. Both projects have to carry patches for GMP. Both projects have to have slow CI jobs that build all of these dependencies from scratch. Both projects upload binary wheels containing effectively duplicate libgmp.so etc files.

Many users of python-flint are also users of gmpy2 and so they have to pip install larger wheels containing duplicate libraries. After install those bundled libraries are duplicated on disk in every virtual environment. The duplicated libraries are loaded separately into memory at runtime within each process. Note that in e.g. Linux if you have a single system-wide libgmp.so it would be shared in physical memory by all running processes that use the library. When you install manylinux wheels though you get different copies in each venv and even duplicates across different packages within a venv and each of those would use separate physical memory at runtime.

From the perspective of python-flint and gmpy2 maintainers the primary issue here is duplication of effort: it is a lot of work for each project to package the same libraries. It would be easier if we could share the same build of the same libraries. We would need to either have separate wheels that literally just bundle the C libraries or for python-flint to depend on gmpy2. For that to work with wheels on PyPI the dependency arrangement between the wheels would be an ABI dependency and would need to be an exact constraint between particular wheels like gmpy2==<hash of a gmpy2 wheel file>.

It is also possible to link Flint with a BLAS library in order to accelerate some operations. I imagine that 99% of python-flint users have a BLAS library from NumPy but it just seems too difficult to leverage that with python-flint wheels because it would again introduce an ABI dependency. Even closely coupled projects like NumPy and SciPy struggle to share a BLAS library in their wheel builds. So python-flint is left with the option of either building and bundling another duplicate BLAS library or forgoing those optimisations even though the user has the necessary library installed.

In the case of GMP file size is not such a major issue but to me it seems completely absurd to have multiple copies of libgmp in the same virtual environment: it is a clear sign that something went wrong somehow. This is the best that package authors have been able to come up with though while working with the constraints of Python’s packaging standards. All ABI dependencies have to be avoided because otherwise tools like pip won’t know how to install compatible binaries.

In the case of things like pytorch etc all of the same considerations apply except that the duplicated shared binaries are massive. It would be much better if they could be split out into separate wheels but you would still need to be able to encode the ABI constraints somehow when doing so.

Also I might be wrong about this but I think that the reason the CUDA binaries are so large in the first place is just because they are really provided as a bundle of subpackages for particular GPUs. The user likely has only one particular GPU but they end up absorbing a gigabyte of other code because with wheels there is no way to choose the right subpackage at install time based on something like what GPU they have.

a-reich · April 3, 2024, 7:20pm

In theory, the conda recipe could do almost anything during its build step, so there’s nothing from the Conda perspective even requiring that there will be a dist-info directory after install! And if it is, the contents could match the original, or be analogous to different metadata patched for the conda package, or anything else. But in practice most recipes I’ve seen that have an analogous PyPA-compliant project basically just build & install the wheel into a temp location and put the installed files into the conda artifact. If they are patching metadata I’m not sure if they would leave the PyPA metadata intact or not.

rgommers · April 3, 2024, 7:33pm

Agreed - the mythical packaging steering council would come in very handy for this kind of thing to help set overall priorities and constraints.

Re binary size & standardization: some metadata-related things require standardization, or avoiding standardization in this case (this probably didn’t come up at all in the Metadata 2.2 review) - but a lot of what’s in place in terms of standards (PEP 517, 518, 621) should be fine as is. Tooling support in auditwheel to allow unvendoring was fixed last year; delvewheel/delocate were already okay. So really the main thing missing for unvendoring specifically is what we just discussed.

PEP 725 is related and is indeed only adding missing metadata (that’s step 1), but doesn’t prescribe using it yet in any way (that’d be step 2). Here’s a small presentation I gave recently that touches on what I’d hope comes next: Reliable from-source builds (Qshare 28 Nov 2023).pdf. It doesn’t focus on unvendoring and binary size reductions, but more on external dependency handling at build time (before the vendoring step).

BrenBarn · April 3, 2024, 7:38pm

I’m not totally sure, but my impression is that for the most part conda packages don’t mess with that metadata. So if you use importlib to get something like requires-dist, you’ll likely get the same metadata as you’d get if it were pip-installed; it’s just that that metadata won’t accurately reflect what is required on the conda level.

As an example, I just tried pip-installing and conda-installing scipy. The result of your importlib call is the same for both, but in neither case does it mention things like libblas, which is actually a dependency of the conda package.

Of course, that’s just an example, and probably not a great one since libblas isn’t even something that would be available via pip. I just give it to show that you can get “the same” metadata on the level of the PyPI standards but still have that be relatively meaningless in the conda context.

As @a-reich mentioned, there’s nothing stopping the conda recipe from mucking with the Python-level package in arbitrary ways. What I’m not sure about is whether there are conda packages that feel the need to patch the PyPI-standard metadata as part of getting a particular package to work. My hunch is that that would be rare, because from the perspective of actually installing the package with conda, the PyPI-standard metadata is essentially null and void: the conda package could have a different name, different dependencies, and so on, and the corresponding wheel metadata is ignored when installing from conda. So it would seem unlikely that a conda package would get any benefit from patching the PyPI-standard metadata. For the most part (as @a-reich also mentioned), the conda build process for a PyPI-derived package just runs the pip install as a sub-task and then copies everything wholesale into the built conda package, so there won’t be patching of metadata or anything else.

groodt · April 3, 2024, 7:47pm

From the perspective of a user and as a member of a team who manages Python application builds and environments for internal teams, we’ve noticed that things have improved a bit for unbundled nvidia wheels. There’s no sdist in this cuda part of the forest.

Metapackages

The following metapackages will install the latest version of the named component on Linux for the indicated CUDA version. “cu12” should be read as “cuda12”.

nvidia-cuda-runtime-cu12
nvidia-cuda-cupti-cu12
nvidia-cuda-nvcc-cu12
nvidia-nvml-dev-cu12
nvidia-cuda-nvrtc-cu12
nvidia-nvtx-cu12
nvidia-cuda-sanitizer-api-cu12
nvidia-cublas-cu12
nvidia-cufft-cu12
nvidia-curand-cu12
nvidia-cusolver-cu12
nvidia-cusparse-cu12
nvidia-npp-cu12
nvidia-nvjpeg-cu12
nvidia-opencl-cu12
nvidia-nvjitlink-cu12
These metapackages install the following packages:

nvidia-nvml-dev-cu124
nvidia-cuda-nvcc-cu124
nvidia-cuda-runtime-cu124
nvidia-cuda-cupti-cu124
nvidia-cublas-cu124
nvidia-cuda-sanitizer-api-cu124
nvidia-nvtx-cu124
nvidia-cuda-nvrtc-cu124
nvidia-npp-cu124
nvidia-cusparse-cu124
nvidia-cusolver-cu124
nvidia-curand-cu124
nvidia-cufft-cu124
nvidia-nvjpeg-cu124
nvidia-opencl-cu124
nvidia-nvjitlink-cu124

It used to be a nightmare of huge wheels for us because torch would vendor cuda (which made for easy installs, but fat wheels) and tensorflow would not (which required independent installation of cuda, which was a nightmare in other ways). Then it got worse because if anyone wanted both libraries in a venv, they would often be confused about which cuda to use. Then other libraries also started to depend on cuda or vendor and it started to get out of hand.

Nvidia did notice this and both the torch and tensorflow ecosystems have been able to shift to the unbundled nvidia wheels. It seems to work well from my perspective. I realise it doesn’t solve all the issues and is probably complicated for nvidia to manage compatibility.

sinoroc · April 3, 2024, 7:50pm

Should the posts about wheel size be split into their own topic?