Python Packaging Strategy Discussion - Part 1

The fundamental difference here is that conda-forge is patching the metadata before a conda install ever sees the built conda packages, whereas the pip install + wheel-based model can’t rebuild wheels.

Whatever model we pick here will involve figuring out how pip should (a) find out about the metadata patches and (b) deal with those patches post-install. I’ll let someone else do the galaxy-brain thinking that I feel would be needed to come up with a tractable design for this.

I should note that an “edit metadata” capability was available in pypi-legacy (i.e. pypi.python.org) but was explicitly removed in warehouse (i.e. pypi.org) because of the quirks of either leaving the change incomplete or getting into the business of modifying uploaded files.

One expensive way to solve this would be a build farm for PyPI and a bunch of buttons that let authors modify their declared dependencies and update the wheels, w/ signing from the build farm. The less expensive way is to rely on authors to do this (see the limitations of the author-led social model for issues with that :P).

2 Likes

FWIW, this is doable with .post releases in a way that is mostly transparent to end users – if you published foo-1.0.0 with bad metadata, you can publish foo-1.0.0.post1 with fixed metadata and pip will pick up the latter for anyone who hasn’t pinned an exact version (a strict foo==1.0.0 pin still resolves to the original, though foo==1.0.0.* does match the post release).
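
Concretely, with the packaging library (which pip vendors for version and specifier handling), the matching behaviour looks like this – the project name and versions are just illustrative:

```python
from packaging.specifiers import SpecifierSet

post = "1.0.0.post1"  # the republished release with the fixed metadata

print(SpecifierSet("==1.0.0").contains(post))    # False – strict pins keep the old metadata
print(SpecifierSet("==1.0.0.*").contains(post))  # True – prefix pins pick up the post release
print(SpecifierSet(">=1.0.0").contains(post))    # True – range requirements pick it up too
```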

Users checking hashes will still get the old thing. :slight_smile:

1 Like

I’m not sure I follow all you’re saying. No wheels need to be rebuilt; that is the whole point of making metadata editable. And conda also has a local binary cache just like pip, and can install into an existing environment where the old, unconstrained version of a package may already be present. I don’t yet see a reason why the repodata patching model can’t be followed closely.

I’m not sure I follow all you’re saying.

Likely because I was saying incorrect things. :see_no_evil:

I didn’t catch the detail that Conda doesn’t use the metadata from the <...>.conda -> /info/index.json files as the source of truth, but instead uses a repodata.json served by the channel as the source of truth for package metadata. That’s the difference then, and not what I was saying – in Conda channels the source of truth for metadata is not the package files themselves but a separate file, whereas the Simple API + wheel model just trusts the metadata within the wheel. This centralized storage of metadata is something Conda can do by being a binary-only package distribution, but PyPI can’t.

From Channels and generating an index — conda-build 0.0.0.dev0+placeholder documentation

Package repodata is bootstrapped from the index.json file within packages. Unfortunately, that metadata is not always correct. Sometimes a version bound needs to be added retroactively. The process of altering repodata from the values derived from package index.json files is called “hotfixing.”
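
To make that concrete, a repodata hotfix is roughly a patch applied on top of the per-package records, something like this (the record shape follows repodata.json, but the patch function and names are purely illustrative):

```python
# Simplified repodata.json record for one built package (illustrative values):
record = {
    "name": "foo",
    "version": "1.0.0",
    "depends": ["python >=3.8", "bar"],
}

def hotfix(record):
    """Retroactively add an upper bound on 'bar' in the served metadata,
    without touching the built package archive itself."""
    record["depends"] = [
        "bar <2" if dep.split()[0] == "bar" else dep
        for dep in record["depends"]
    ]
    return record
```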

At this point, I’m gonna stop this digression – if someone wants to discuss this at length, please request the moderators to split this out into a separate topic before continuing to discuss this. :slight_smile:

1 Like

Without going into more detail on metadata editing (best to park that conversation indeed), let me just state that I’d be perfectly happy with binary-only metadata patches. Doing a .post release for an sdist takes all of 5 minutes; regenerating hundreds of MBs of wheels can take days.

This is just one aspect of a larger meta issue. There are lots of binary-only package managers, and quite a few source-only ones too. Mixing binaries and building from source is extremely ambitious. Only a few package managers that do this come to mind, like Nix and Spack. In those systems, the binary repository really serves as a cache - the binary is guaranteed to match a local build from source. For PyPI, the opposite is true - as soon as you have some C code, a locally built wheel is more likely than not to differ from the wheel on PyPI in significant ways. This leads to all sorts of issues, which is why I focus on a few ways to decouple from-source and binary installs.

1 Like

FWIW, I think whether “pip install” builds from source should be controllable by the package author, but I wouldn’t globally force it from on to off. On Linux, if you don’t have many dependencies, it often works. Some libraries are designed to be really easy to build and generally just work (boost-histogram & awkward have zero dependencies; GooFit must be built from source to optimize itself for your hardware, and integrates its dependencies; etc.). Even numpy often builds fine from source (on Linux), it’s just a bit slow. Much of the work on good build systems like PEP 517 and scikit-build-core/meson-python is about making builds from source as easy as possible and getting them to work in as many cases as possible.

I think it would be fantastic if a library author could add metadata to the SDist that changed the default to “please prefer a binary” or “only binary” for that package, but I wouldn’t remove the ability to build all SDists without opting in. A few popular packages (like futures and docopt) would break if this happened, too, and platforms without wheels or with very few wheels would suffer if they suddenly couldn’t pip install anything. I wouldn’t be totally against opt-in instead of opt-out for SDists, but a per-package control for authors would be much better, IMO. It would also keep packages (like PyTorch) from deleting all their SDists – which they did precisely because building from source is non-trivial and isn’t going to “just work” for anyone.

A lot of packages are building PyPy and musllinux wheels, due to cibuildwheel enabling them by default. Quite a few more are building PowerPC. The biggest issue is usually dependencies (which usually means numpy :wink:) missing these wheels - I’m hoping that now that numpy is using cibuildwheel some of these will become more common. Good cross-compile support would really help for the special architectures - that’s a mixed bag right now, especially on Linux. The other big issue is space - with a lifetime PyPI storage cap, adding a bunch of platforms with minimal usage might not be a good idea. I don’t build a few special archs (like PowerPC) for some libraries just because I don’t know if anyone is using them and I’d rather not waste storage (and a bit of CI time).

And there isn’t a good way to build armv7, as there’s no manylinux image for it. It’s currently handled by piwheels. I’m currently working with them to get boost-histogram in, and I’m not a fan of this model; I’d much rather build the wheels myself with cibuildwheel if I could.

You can just edit the wheels. We don’t have a great tool for that, but there’s no need to rebuild everything if you aren’t changing the binary. I’d be much more worried about the storage - you’d have to store the new one too. If you have 10MB wheels times 50 or 60 wheels for Python versions and platforms, that’s not trivial just to change a few lines of text.
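
For what I mean by “just edit the wheels”, here’s a rough sketch using only the standard library (it ignores the dist-info renaming you’d need if the version itself changes, and the package/dependency names are illustrative):

```python
import base64
import hashlib
import zipfile

def edit_wheel_metadata(src, dst, fix):
    """Copy a wheel, applying `fix` to the METADATA text and updating RECORD."""
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        names = zin.namelist()
        meta_name = next(n for n in names if n.endswith(".dist-info/METADATA"))
        record_name = next(n for n in names if n.endswith(".dist-info/RECORD"))
        new_meta = fix(zin.read(meta_name).decode()).encode()
        digest = base64.urlsafe_b64encode(hashlib.sha256(new_meta).digest()).rstrip(b"=").decode()
        record_lines = []
        for line in zin.read(record_name).decode().splitlines():
            path = line.split(",")[0]
            if path == meta_name:
                line = f"{path},sha256={digest},{len(new_meta)}"  # keep RECORD consistent
            record_lines.append(line)
        for name in names:
            if name == meta_name:
                zout.writestr(name, new_meta)
            elif name == record_name:
                zout.writestr(name, "\n".join(record_lines) + "\n")
            else:
                zout.writestr(name, zin.read(name))

# e.g. tighten a dependency bound without rebuilding anything:
edit_wheel_metadata(
    "foo-1.0.0-py3-none-any.whl",
    "fixed/foo-1.0.0-py3-none-any.whl",
    lambda text: text.replace("Requires-Dist: bar", "Requires-Dist: bar<2"),
)
```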

2 Likes

(maintainer of proxpi, but replying as a user)

If third-party (particularly Linux) distributors will be responsible for providing binary wheels for their platforms, does that mean I would have to wait up to two years to get new features or major bug fixes for those libraries? What if a pure-Python, volunteer-provided library I use has a significant security fix which needs a different version of this binary library?

Not to mention not even Linux package managers can solve the GPU compatibility problem (AMD/NVIDIA/Intel, CUDA compatibility, etc).


Before these last few weeks, it seemed the plan for packaging was to better integrate with external package managers, not offload a section of Python packaging onto them. I had understood this to mean to specify non-Python dependencies in a platform-agnostic manner: is this the intractable problem that Steve is talking about?

There are still no plans; we’re discussing high-level strategy (and doing a decent amount of context setting).

For me in particular, my style is to throw out a lot of ideas and see which ones make sense to the group. So don’t take anything I suggest as a “plan” until at least a few other people are saying we should do it :wink:

In this hypothetical scenario you describe, that would be a question to bring up with your distributor. If you don’t like their answer, you get your Python stuff from a different distributor (who might base it on the same Linux distro’s system packages and only build things they don’t provide). Nobody has a monopoly on packages for their platform, but also very few people have put energy into providing alternatives (some notable exceptions being conda-forge, Deadsnakes, and Christoph Gohlke).

1 Like

Editing wheels would open up horrible supply chain attack vectors[1]. Binaries just shouldn’t be touched (especially with signing etc.) – IMO it needs to be done in (wheel-external) metadata.

I don’t see what binary-vs.-source has to do with a metadata index. You can index the packages you know you’re hosting, and point installers to query that index first before actually fetching wheels or sdists.

The fact that metadata patches to that index would also apply to sdists is actually a welcome benefit IMO, because if a new version for an uncapped dependency turns out to break a package, installing from source will not help anyone either.
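
As a sketch of what “query that index first” could mean on the installer side (the overrides index and its shape are entirely hypothetical):

```python
def effective_requires_dist(name, version, declared_requires_dist, overrides):
    """Use the patched dependency list from a (hypothetical) overrides index
    if one exists for this (name, version); otherwise trust whatever the
    wheel or sdist itself declares."""
    return overrides.get((name, version), declared_requires_dist)

overrides = {("foo", "1.0.0"): ["bar<2"]}  # retroactively added upper bound
print(effective_requires_dist("foo", "1.0.0", ["bar"], overrides))  # ['bar<2']
```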


  1. I understand the distinction between editing the binary in the wheel and the metadata in the wheel. But to any consuming tool that doesn’t understand that distinction, the hash of the artefact has changed, and if you start telling those tools to just permit that (or they half-ass or otherwise mess up the implementation of that distinction), you open up huge problems. ↩︎

2 Likes

I think the suggestion was that you could make the new wheels for the new version (with .post1 added to version) by starting from the previous wheels and just editing the metadata in them. The point is you don’t need to run any process that takes days just to make new wheels that are the same as the old ones except for a couple of pieces of metadata being changed.

3 Likes

I don’t know if we could pull it off, but if we could get the operating system vendors (and in this instance this includes Linux distros as separate entities) to build packages and provide their own feed for such wheels, that would be amazing. Tie in some way to have those wheels specify what they require from the OS package manager and you start to get your controlled environment in a somewhat distributed fashion.

But that’s costly and I don’t know if any distributor cares enough to be the first mover on this idea to see if users would go for it. Otherwise we start to talk about distro platform tags for wheels instead of manylinux, which I’m not sure is a direction people want to go in on PyPI.

We would also want to make sure installers didn’t fall through to other indexes just because a higher-priority one happened to only have an older version. E.g. if there was a Fedora index and PyPI, you would want to always take the Fedora index’s files and only fall back to PyPI if the Fedora index was missing a project entirely.

And I would love to have it as well as an easy way to bootstrap a list of projects compatible with WebAssembly/WASI.
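
On the fall-through point, the rule I have in mind is roughly this (the index-client call is hypothetical):

```python
def candidate_files(project, indexes):
    """The first index that serves the project at all “owns” it; never fall
    through to a lower-priority index just because it has other versions."""
    for index in indexes:                  # e.g. [fedora_index, pypi_index]
        files = index.files_for(project)   # hypothetical index-client call
        if files:
            return files
    return []
```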

Good thing I added a footnote just before your post :stuck_out_tongue:

That would still require the storage of an entirely new wheel with that changed metadata and therefore duplicate the storage footprint (unless we start doing deep magic with symlinking the binaries within the wheels, which I don’t think anyone’s seriously proposing), so I don’t think that was what @henryiii was describing.

This seems to (pardon the pun) wish for a reinvention of the wheel, aside from the (IMO) completely open question of why OS vendors would even want to invest many, many person-years[1] into something that’s already working from their POV (the introduction of a new architecture is a very exceptional event in that regard, where the vendor has a strong interest in bootstrapping the new ecosystem, but without necessarily tying themselves to long-term maintenance).

I don’t see why an OS-provided package would even be preferable to one from any other distribution for that OS, because it couples you to the OS’s upgrade cycle for various key infrastructure packages, which – while the norm – can be avoided[2].


  1. It’s a black hole: once you start building a couple of packages, your users will want other packages, and now you’re running a new distribution that needs to be kept up to date, which is a huge job even assuming you already have all the infrastructure for it. ↩︎

  2. Perhaps not for libc or the graphics support stack, but, for example, by shipping its own libcxx, conda-forge can use newer features than are available in the OS-provided one, and can provide current packages on old macOS versions that are still in broad use despite being EOL’d by Apple. ↩︎

Good point, thanks. If the PyPI maintainers prefer this method, which uses more storage but avoids adding a new API that needs long-term support in multiple places, then that works for me. Building a new tool for this is quite a bit of work, but not technically difficult. The important thing is that there’s one easy-to-use, recommended workflow for applying metadata edits post-release.

It’s because metadata in sdists isn’t static, and that can’t be changed. The build process itself can dynamically add whole new constraints, or pretty much anything else. You can mark most fields as dynamic in pyproject.toml.

That was indeed what was described. And it seems fine. A bit brute-force, but whatever works.

You seem to assume that it must be wheels? For the Windows on ARM case, wheels in a separate repository make perfect sense. For Linux distros, it probably won’t work out - it’s both too much work, and being stuck with the wheel format means you still haven’t solved the non-PyPI shared-library problem. A name mapping mechanism from PyPI names to distro-specific names, and then using the distro’s package format, seems better.
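
A minimal sketch of the kind of name mapping I mean (the entries and lookup are illustrative; real mapping data would have to be maintained by or together with the distros):

```python
# PyPI project name -> distro package name, per distro (illustrative entries)
PYPI_TO_DISTRO = {
    "numpy": {"fedora": "python3-numpy", "debian": "python3-numpy"},
    "pillow": {"fedora": "python3-pillow", "debian": "python3-pil"},
}

def distro_package(pypi_name, distro):
    """Map a PyPI requirement onto the distro's own package, if known."""
    return PYPI_TO_DISTRO.get(pypi_name.lower(), {}).get(distro)

print(distro_package("NumPy", "fedora"))  # python3-numpy
```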

I explicitly mentioned this (for conda-abi wheels) in my post as a thing not to do. The key problem is that it adds another very large collection of wheels to produce for all package maintainers. The current demands on package maintainers are already unsustainable, so this doesn’t sound good to me. Unless you are thinking of a way so it can be done by a third party and not by package maintainers?

It’s not just numpy, it’s most projects that are low in the dependency tree. Cibuildwheel helps, but not nearly enough. Every build one adds fails at build or test time occasionally, so it’s extra maintainer load. For platforms that maintainers don’t have access to for local debugging - or worse, can’t even run in regular CI but have to run under QEMU - that cost is way higher. See Expectations that projects provide ever more wheels - pypackaging-native.

With various projects we’ve been discussing whether we need a platform support recommendations policy, a la NEP 29 and PEP 11. That’s probably going to be written at some point - better than ad-hoc decisions every time someone wants a new niche wheel for one project. The examples I gave are almost certainly going to fall into the “don’t provide” bucket. Perhaps with the exception of musllinux.

I suspect you’re the exception rather than the rule. You are intimately familiar with all the tooling and have the time and interest to maintain all those builds. This is rather uncommon. The average package maintainer has enough pain already from packaging support for regular Windows, macOS and manylinux x86-64/aarch64 builds for 3-4 CPython versions.

1 Like

It’s not that hypothetical. IMO it’s a fairly common reason we see for people wanting to pip install into their system Python. And “bring the question up with your distributor” is frankly incredibly naïve. The vast majority of users have essentially no influence over what the various distributors do, so “switch distributors” is the only practical option they have, and it’s not actually practical in most cases.

In fact, I’d hazard a guess and say that for a lot of people, they use the python.org distributions of Python (or equivalents like pyenv which simply automate the “build it yourself” aspect of using the standard distribution) precisely because the choices their available distributors make don’t suit them. You could describe the pip/wheel/PyPI toolchain as “the distributor of choice for people who don’t like all the others”. And as that distributor (which is at least a part of our role as PyPA), what we’re doing here is responding to our users’ requests. So saying “go to a different distributor” is essentially rejecting them, just like their other options have.

2 Likes

Surely editing metadata is just as bad? You can add a dependency on some malware if you edit the metadata. (And it doesn’t need the user to import the injected package for it to do its dirty work).

In the days before we standardised sdist metadata (and even now that we have, it’s not yet widely implemented), the only way to get metadata for an sdist was to build it. And even with the standard, there are still “dynamic” metadata fields that can’t be determined without a build. So an index for source metadata would of necessity have to be backed by an actual build to pin down the dynamic fields.

I think that if people want to look at modifiable metadata, they should write at least a draft PEP. There are a lot of details that have the potential to make the idea unworkable in practice. For example, you’d have to limit what metadata fields are modifiable - name and version are out, as that would break the link to the project files. And what are the legal implications of modifying the license in the metadata but not in the files? And as I noted above, adding dependencies is a security risk.

1 Like

Does that not require Python installers like pip to know how to install distro packages? I don’t see that happening in any scalable way. Furthermore, how would it work for virtual environments? If I want numpy in my virtualenv, but don’t want to pollute my system with it, how does a distro-supplied numpy help?

2 Likes

FWIW, Fedora’s PoC rebuilds for all of PyPI are here: Packages for @copr/PyPI
Obviously most of it is failing, and the automation is flaky, but recent packaging improvements like PEP 517 made it possible to attempt the builds, and to start analyzing and solving common issues.

6 Likes

@smm Just to check in here, the discussion has drifted quite far from the original question about unification and reduction of complexity :astonished:. Is this still useful for your original purpose here, or does the conversation need to be brought back on track?

1 Like