Enforcing consistent metadata for packages

One significant issue I can think of relates to binary size and to native dependencies.

Multiple times when we’ve discussed size limit requests and questions like “why are packages using CUDA so large?”, the suggestion given to package authors to reduce binary size is to split out the non-Python parts into a separate wheel, and depend on that from the main package (which uses the CPython C API and has to be built 5 times if one supports 5 Python 3.x versions). As a concrete example: PyTorch (torch on PyPI) has 5 manylinux-x86_64 wheels (cp38–cp312) of ~750 MB each for its latest release. The conda-forge team just did the “split to reduce binary size” thing, and the CUDA 12 packages for pytorch are only ~25 MB - they share the non-Python-dependent libtorch part, which is ~400 MB. So it really was an almost 5x improvement. Having that for wheels would be super useful. Note that PyTorch doesn’t publish sdists, so such a split may still work for it even if wheel metadata has to match the sdist. However, there will be quite a few other projects that want to do something similar - making it possible while also publishing sdists seems important.

Similarly, for native dependencies: NumPy and SciPy both vendor the libopenblas shared library (see pypackaging-native’s page on this for more details). It takes up about 67% of the numpy wheel size, and ~40% of the scipy wheel size. With four minor Python versions supported, that’s 8 copies of the same library being vendored per platform (four each for numpy and scipy). We’d really like to unvendor that, and already have a separate wheel for it: scipy-openblas32 on PyPI. However, depending on it from the wheels only isn’t allowed without marking the dependency metadata as dynamic, which isn’t great. So we’ve done all the hard work - packaging, symbol mangling, and the supporting functionality to safely load a shared library from another wheel. But the blocker is that we cannot express the dependency (and, importantly, we don’t want to ship an sdist for scipy-openblas32 - it’s really only about unvendoring a binary).
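To make the “safely load a shared library from another wheel” part concrete, here is a minimal sketch of what the consuming package could do at import time, before any of its own extension modules are loaded. The directory layout, library filename and Linux-only `.so` suffix are assumptions for illustration; the real machinery in scipy-openblas32 also covers symbol mangling and other platforms.

```python
import ctypes
import importlib.util
from pathlib import Path


def _preload_openblas():
    """Preload the OpenBLAS shared library shipped in a companion wheel,
    so that our own extension modules can resolve its symbols."""
    spec = importlib.util.find_spec("scipy_openblas32")
    if spec is None or spec.origin is None:
        raise ImportError(
            "scipy_openblas32 is required at runtime; it provides the "
            "OpenBLAS shared library that used to be vendored"
        )
    pkg_dir = Path(spec.origin).parent
    # Hypothetical location and name of the library inside that wheel.
    candidates = sorted(pkg_dir.glob("lib/libscipy_openblas*.so*"))
    if not candidates:
        raise ImportError(f"no OpenBLAS shared library found under {pkg_dir}")
    # RTLD_GLOBAL makes the symbols visible to extension modules loaded later.
    ctypes.CDLL(str(candidates[0]), mode=ctypes.RTLD_GLOBAL)


_preload_openblas()
```

The missing piece is not this loading logic but the metadata: the scipy wheels need to declare a runtime dependency on scipy-openblas32 while the sdist does not.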

PyArrow wants to do something very similar to NumPy/SciPy to reduce wheel sizes. I’m sure there will be more packages.

Another related topic that comes to mind is the pre-PEP that @henryiii posted a while back: Proposal for dynamic metadata plugins - #46 by henryiii. IIRC he was going to include the idea of “dependency narrowing” in it (i.e. wheels that can have narrower version ranges for the same dependency than the sdist does, e.g. due to ABI constraints).
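To illustrate what dependency narrowing means in practice, here is a sketch of the kind of hook such a plugin could expose. The hook name, signature, narrowing policy and version numbers are made up for this example and are not taken from the pre-PEP; the point is only that the wheel ends up with a tighter requirement than the sdist.

```python
# Hypothetical dynamic-metadata hook; name, signature and versions are
# illustrative only, not the interface from the pre-PEP.
def narrow_dependencies(sdist_requires: list[str]) -> list[str]:
    """Narrow the sdist's generic requirements to the range that the wheel
    being built is actually ABI-compatible with."""
    import numpy

    built_against = numpy.__version__  # the version present at build time
    major = int(built_against.split(".")[0])
    narrowed = []
    for req in sdist_requires:
        if req.startswith("numpy"):
            # Example narrowing policy: the sdist says e.g. "numpy>=1.23",
            # but this wheel requires the version it was compiled against
            # (or newer, within the same major version).
            narrowed.append(f"numpy>={built_against},<{major + 1}")
        else:
            narrowed.append(req)
    return narrowed
```

So a cp312 wheel built against NumPy 2.0.1 would declare `numpy>=2.0.1,<3` even though the sdist only says `numpy>=1.23`.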

A conceptual problem here is that we’re striking an uncomfortable balance between:

  1. “sdists are generic and the source release for all redistributors”, and
  2. “sdists must match binary wheels”.

We like to pretend that both are true, but that cannot really be the case once you get to packages with complex builds or dependencies. See The multiple purposes of PyPI - pypackaging-native for more on this. @pfmoore’s idea here leans heavily towards (2), but in earlier discussions (1) was considered quite important, and the original purpose of PyPI. For the examples I gave above, we’d like to be able to add a runtime dependency to wheels without touching the sdist metadata (because adding it to the sdist would impact (1), and may not even make sense for from-sdist builds by end users).
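Concretely, for the SciPy example above that would look something like this (version numbers are illustrative, and the `#` lines are annotations rather than metadata syntax): the sdist’s PKG-INFO stays as it is, and only the wheels’ METADATA gains the extra runtime requirement.

```
# scipy sdist (PKG-INFO) - unchanged
Requires-Dist: numpy>=1.23.5

# scipy binary wheels (METADATA) - same, plus one wheel-only dependency
Requires-Dist: numpy>=1.23.5
Requires-Dist: scipy-openblas32
```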

I think a requirement that all binary wheels must have the same metadata would be easier to meet, and it would address the need in the cases I sketched above. Requiring the sdist metadata to match as well is problematic, and should be loosened rather than tightened if we want to do anything about the large-binary-size problem.
