PEP 711: PyBI: a standard format for distributing Python Binaries

To really drive home the point:

Let’s try a NixOS system with one of the “manylinux” pybi archives.

$ mkdir python
$ wget https://pybi.vorpus.org/cpython_unofficial-3.10.1-manylinux_2_17_x86_64.pybi
$ unzip cpython_unofficial-3.10.1-manylinux_2_17_x86_64.pybi
$ ./bin/python
-bash: ./bin/python: No such file or directory

Welp…

$ nix run nixpkgs#file ./bin/python3.10
./bin/python3.10: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=1ce40ff66bcbe8e82fd579d1ae31fb9e9bcae9e2, stripped

Ah, so an absolute path did sneak in there. Let’s try running autoPatchelf.

$ nix-shell -p autoPatchelfHook xlibsWrapper
$ ./bin/python
Python 3.10.1 (main, Nov 20 2022, 20:21:32) [GCC 9.3.1 20200408 (Red Hat 9.3.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

Alternatively:

$ nix run nixpkgs#python310
Python 3.10.10 (main, Feb  7 2023, 12:19:31) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

My point is that we will always be dealing with some number of absolute paths. While the manylinux approach does make things easier, it still assumes things (like the location of /lib64).

As a maintainer of various packages in nixpkgs and a fan of many aspects of Nix, I don’t think Nix is a viable replacement or alternative to PyBI.

Both nix and conda ecosystems are “all or nothing” and are not easily compatible with the diversity of other systems out there. Mixing nixpkgs with non-nixpkgs or conda packages with packages from PyPI often ends in tears.

In both cases (nix and conda), these ecosystems would be able to repackage PyBI artifacts to make them work in their ecosystems (or build from source) as desired.

3 Likes

Oh right, I forgot Provides-Dist and Obsoletes-Dist because nothing actually uses them :-). Added.

Provides-Extra is one of the 3 fields the text did mention – is there another one I missed?

When you unpack a pybi you get a real environment, so I guess technically that means it’s not a virtual environment :slight_smile: but yes, you could use pybis as a replacement for venvs, or unpack one pybi somewhere and then create venvs from it.
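For illustration, here’s a rough sketch of that last workflow using nothing but the stdlib (the filename is just the example from earlier in the thread, and a real installer would be more careful, e.g. about permissions):

import subprocess, zipfile
from pathlib import Path

target = Path("python")   # directory to unpack the pybi into
target.mkdir(exist_ok=True)
with zipfile.ZipFile("cpython_unofficial-3.10.1-manylinux_2_17_x86_64.pybi") as zf:
    zf.extractall(target)  # pybis are zip archives, like wheels
# Note: ZipFile.extractall() does not restore executable permission bits,
# so a real installer has to fix those up itself.
subprocess.run([str(target / "bin" / "python3.10"), "-m", "venv", "my-env"], check=True)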

I think these questions are all the same for both pybis and regular python packages on PyPI, and so are the answers? Broken packages are between the uploader and their users; malware generally does get removed by admins when reported; trust is established out-of-band by eg seeing people recommend a package, looking at their github, stuff like that?

That’s one way to do it, but there are other options, like the mangling that auditwheel/delvewheel do. Part of the reason behind this pybi stuff is like, over the last ~decade we’ve built up some mature and sophisticated tools for dealing with this stuff for wheels – platform tags, vendoring, the pypi infrastructure, etc. So we might as well re-use it for python itself.

That absolute path to ld-linux.so is effectively part of the glibc ABI, and inherently required in any executable you distribute on Linux – it’s like #!/bin/sh but for ELF executables. Nix chooses to do its own nix-y thing and refuse to run any kind of ELF binary that someone else built, which is cool for them, it’s a valid choice – but it seems like you’re saying that no-one should ever distribute prebuilt binaries on Linux because nix chose not to support them? That doesn’t make much sense to me :slight_smile: If you’re on nix, you probably want to do nix-y things, and the way other systems choose to distribute binaries is irrelevant to you.

3 Likes

@njs Mainly, I’m making sure its place in the ecosystem is clear when compared to others. If you want a manylinux binary, sure, that works, but do keep in mind you’ll be chasing down libraries and trying to keep them from colliding until you get very tired of it and run into things like libc versions (see also the X11 example - I explicitly had to add it to my shell for autoPatchelf to detect it). This may also be something you may not want to bite off, specifically because tools like Nix just do it for you correctly system-wide.

@groodt Correct on both the Conda and Nix ecosystems, though I’d bring up the example of poetry too, which is Nix-compatible through poetry2nix. See also - making sure the place in the ecosystem is clear, with both its strengths and limitations. :slight_smile:

That’s fair! But foolishly I (with help from others) already bit it off and chased down the libraries and all that – that’s what manylinux is :-). So it’s too late for me.

2 Likes

A fun challenge would be to reroot a Nix build of Python with its entire closure, and make that into a pybi archive… :slight_smile:

Much worse has been done, like getting Nix working on Android.

Poetry isn’t a different ecosystem though. It uses PyPI sdist and bdist (wheels) packages. Conda and nixpkgs are new builds hosted on different servers. Poetry is just an alternative installer to pip with a lockfile format and some workflow conventions. Nix will continue to be able to use packages from PyPI through poetry or through repackaging as is currently done. It probably means that Nix has no need for PyBI, which is fine.

1 Like

Sounds good. Thank you for the detailed clarification.

FWIW, unbundling all the deps is what the builds in conda-forge do, and it’s quite a handful, not least since it’s OS-specific (i.e. Windows needs very finicky patching – check the unvendor-* patches here; Unix is easier, more or less just needing to point to conda’s prefix, where we have all the libs/bins).

For us, it would be amazing if Python supported a way to build against system dependencies.

Back to the PEP:

Even though I’m quoting this out of context, I think this applies not just to python but to many other libs that underlie vast parts of the ecosystem. I’d be really interested in having PyBI become general enough to support sane distributions of binary artefacts for things like openblas and other C/C++/Fortran/Rust/etc. libraries. I feel it’s already tantalizingly close, which is why I bring it up even though I’d understand a natural resistance to such scope creep.

And by “close” I mean that something like Pybi-Paths: would already be able to cover much of what’s needed (e.g. /lib, /include, /bin in a standard location), even though I’m fully aware of how much potential ABI pain lies down that road. It might not be easily manageable, but it can be managed. Examples exist – though it would be great if we could eventually achieve the “standard, abstract way” for this for everyone (at least in Python-land).
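For concreteness, this is roughly the kind of path information I mean; on a running interpreter you can compute it along these lines (just an illustrative sketch, not the normative definition of Pybi-Paths):

import os, sys, sysconfig

prefix = sys.prefix
relative_paths = {
    key: os.path.relpath(path, prefix)
    for key, path in sysconfig.get_paths().items()
}
print(relative_paths)  # e.g. on Linux: {'include': 'include/python3.10', 'scripts': 'bin', ...}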

I tried to explain it before, but I like the idea of the venv tool explicitly supporting this, by including a way to grab a pybi, unpack it and use the result as a venv (by adding the activate/deactivate scripts… I think that’s all it should take, really?).

1 Like

It’d be a bit awkward for venv to support this, since venv would be part of a pybi that contains a pybi (to create a new env), which in turn contains a copy of venv, and it becomes matryoshka dolls. Not impossible, but conceptually and likely practically awkward. I can see virtualenv supporting this, though.

2 Likes

I didn’t mean for venv to bootstrap from its own pre-existing pybi; I meant a command-line option to tell it a pybi file to use (or maybe a URI to grab one, or a specification for looking one up in a repository…).

I’ve never really been clear on why virtualenv is a separate product that still exists, honestly.

I figure I should weigh in here since I’ve “solved” similar problems that PyBI is attempting to solve with python-build-standalone.

Foremost, excellent work, Nathaniel! I’ve long wanted to see a PEP to formalize standalone Python distributions. You’ve beat me to it and this PEP is off to a great start!

Apologies for the wall of randomly ordered thoughts below.

The technological purist in me doesn’t like the choice of zip files because they yield sub-optimal compression: a) they use a 40+ year old compression algorithm (deflate / zlib), and b) compressing each file individually means repeated segments across files can’t be shared, so the overall archive size is larger. A big benefit of zip is that you get a file index and can address/decompress individual files. But since you’ll likely need to extract all archive members to get a usable Python distribution, the choice of zip is not ideal. Then again, discarding the precedent of wheels being zips and having to reinvent the wheel (har har) is also not ideal. Zips are fine, I guess. But I wish they were tars using a modern compression, like zstd (or lzma, since that’s in the stdlib).
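(If anyone wants to put rough numbers on that, here’s a quick stdlib-only sketch; it assumes ./python holds an unpacked distribution tree and just compares per-file deflate against whole-archive lzma:)

import os, tarfile, zipfile
from pathlib import Path

root = Path("python")  # an unpacked Python distribution tree
with zipfile.ZipFile("dist.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for p in sorted(root.rglob("*")):
        if p.is_file():
            zf.write(p, p.relative_to(root))  # each file compressed individually, like a wheel
with tarfile.open("dist.tar.xz", "w:xz") as tf:
    tf.add(root, arcname=".")  # one lzma stream over the whole tree
print(os.path.getsize("dist.zip"), os.path.getsize("dist.tar.xz"))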

One of the potential use cases for PyBI is to facilitate more turnkey distribution of Python-based applications. There’s a lot of value in being able to take a pre-built Python distribution off the shelf and integrate it into a larger application. As I learned writing PyOxidizer, you can need a lot of metadata about the distribution to pull this off. See Distribution Archives — python-build-standalone documentation for all the metadata I ended up adding. Much of this metadata was added to facilitate cross-compilation. When cross-compiling, you can’t just run the interpreter to resolve things like the bytecode cache tag, the path to the site-packages directory, or the compiler flags used to build the distribution. I detailed this at What information is useful to know statically about an interpreter? - #7 by indygreg. The metadata currently in PEP 711 is inadequate for doing some of these more advanced things. I recognize that defining all this metadata is arguably scope bloat. But if we add a few more of the missing pieces, there might be enough here to allow me to delete the python-build-standalone project or refactor it to produce PyBIs. At the very least I’d love for PyOxidizer to consume PyBIs: if this happens it means others can write tools like PyOxidizer without having to solve the build-your-own-Python-distribution problem, which is non-trivial.
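To make “can’t just run the interpreter” concrete: these are the kinds of values that are trivial to query at run time but have to come from static metadata when cross-compiling (illustrative only):

import sys, sysconfig

print(sys.implementation.cache_tag)            # e.g. "cpython-310"; names __pycache__ entries
print(sysconfig.get_path("purelib"))           # where site-packages lives
print(sysconfig.get_config_var("EXT_SUFFIX"))  # filename suffix for extension modules
print(sysconfig.get_config_var("CFLAGS"))      # compiler flags recorded at build time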

On the theme of distribution metadata, python-build-standalone’s full distributions contain the raw object files used to link libpython and extensions and JSON metadata describing them all. PyOxidizer can take this metadata and produce a custom libpython with only the components an application needs. Or it can link a single file binary embedding Python. Powerful functionality. Not something you can currently do with PyBI. Probably way too complicated for what you want PyBI to do (since you ruled out PyBI sdists as out of scope). But I thought I’d mention it as a possible future extension of this work.

Also, as noted in the other thread, there’s the question of licensing metadata. PyOxidizer can strip copyleft components out of a Python distribution and emit licensing info for all included components to make it easier for downstream customers to satisfy legal distribution requirements. It would be amazing to have licensing annotations in PyBI. At the very least I think you need to include the license texts in the PyBI to satisfy legal requirements for distribution. CPython currently contains licenses for components in the CPython source repo. But 3rd party components like OpenSSL, liblzma, tcl/tk, etc. need to have their own license texts distributed if those libraries are in the PyBI.

One thing that both current PyBI and python-build-standalone fail to distribute is the terminfo database. readline/libedit encode a path to the terminfo database at build time. If you copy the distribution to a machine or environment that doesn’t have the terminfo database at the same path as the build machine, readline doesn’t work and the Python REPL behaves poorly. Users complain. PyOxidizer works around this by having Rust code sniff for terminfo files in well-known locations at run-time before the interpreter is initialized. But the correct solution is to build this sniffing into CPython and bundle a copy of the terminfo database with the Python distribution in case one cannot be found.
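A hedged sketch of the kind of run-time sniffing I mean (the directories are common ncurses conventions, not an exhaustive list; this has to happen before readline/libedit is first initialized):

import os
from pathlib import Path

if "TERMINFO" not in os.environ and "TERMINFO_DIRS" not in os.environ:
    for candidate in ("/usr/share/terminfo", "/lib/terminfo", "/etc/terminfo"):
        if Path(candidate).is_dir():
            os.environ["TERMINFO_DIRS"] = candidate  # tell ncurses where to look
            break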

Everything I just said about the terminfo database arguably applies to the trusted certificate authorities list as well. On Windows and macOS you should always have the OS database to use. (Can’t remember if CPython supports this out-of-the-box yet - it should.) On Linux, you may get unlucky and not have one (common in container environments).
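Same idea for the CA bundle: on Linux you end up probing a handful of conventional distro locations (a sketch only; find_ca_bundle is a hypothetical helper, and the path list is not exhaustive):

import os, ssl

def find_ca_bundle():
    for path in (
        "/etc/ssl/certs/ca-certificates.crt",  # Debian/Ubuntu/Alpine
        "/etc/pki/tls/certs/ca-bundle.crt",    # RedHat/Fedora
        "/etc/ssl/ca-bundle.pem",              # openSUSE
    ):
        if os.path.exists(path):
            return path
    return None

bundle = find_ca_bundle()
context = ssl.create_default_context(cafile=bundle)  # cafile=None falls back to the defaults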

Another limitation with PyBI will be that references to the build toolchain and config are baked into sysconfig data and read by distutils, pip, etc. to compile extension modules. (I think I mentioned this in the topic when Nathaniel first introduced PyBI.) There’s a non-negligible chance that the compiler and flags used on the build machine won’t work on the running machine. So if people attempt to e.g. pip install using a PyBI interpreter and there isn’t a binary wheel available, chances are high they’ll get a compiler error. To solve this problem you either need to do the logical equivalent of reinventing autoconf (distutils kinda sorta does aspects of this) or you need to distribute your own compiler toolchain and use it. Hello, scope bloat! You may want to have interpreters advertise that their sysconfig metadata for the compiler came from an incompatible machine so downstream tools like pip can fail more gracefully. Note that this is an existing problem, but it will get much worse with PyBI: many people today just install a python-dev[el] system package to pull in build dependencies, and that just works because the Python interpreter was built with the same toolchain used by your OS / distro. PyBI opens us up to e.g. RedHat vs Debian, gcc vs clang, msvc vs gnu, etc. toolchain config differences. I think the path of least resistance is distributing your own toolchains, since otherwise you’ll be debugging compatibility with random toolchains on users’ machines. Fortunately Python already has a mostly working solution here in the form of quay.io/pypa/manylinux* container images and projects like cibuildwheel to automatically use them. But you might want to start pushing these toolchains’ use in build tools like distutils and pip.
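For the “fail more gracefully” part, even a check as simple as this (a sketch, not a proposal for pip’s actual behaviour) would beat a wall of compiler errors:

import shutil, sysconfig

cc = (sysconfig.get_config_var("CC") or "").split()
if not cc or shutil.which(cc[0]) is None:
    raise SystemExit(
        "This interpreter was built with a compiler that isn't available here; "
        "install a binary wheel or a matching toolchain instead of building from source."
    )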

It looks like your current PyBIs strip debug symbols. (Presumably for size savings.) Debug symbols are useful. People like me who work on enabling [performance] debugging at scale for engineering organizations like having debug symbols readily available. (Having the ability to get meaningful flamegraphs for any process running in your fleet is life changing.) It’s fine to ship PyBIs without debug symbols to cut down on size. But there needs to be a way to get the debug symbols: either PyBI variants with them unstripped, or supplemental PyBI-like archives with just the debug symbols (similar to how the Linux packaging ecosystem does it), or maybe support for a symbol server. The location of the debug symbols may need to be built into the PyBI metadata. And/or tools consuming PyBIs may need to be aware of PyBI variants with debug symbols so users can prefer to fetch them by default. (This problem already exists for binary wheels and I’m unsure if there are any discussions or PEPs about it. Please remember that CPython has its own debug build / ABI settings that are different from debug symbols, and therefore debug symbols exist outside Python platform tags. For some reason a lot of people seem to not understand that debug symbols and compiler optimizations are independent and it is fully valid to have a PGO+LTO+BOLT binary with debug symbols - probably because lots of build systems strip debug symbols when building in optimized mode.)

To be pedantic, this stuff is defined by the Linux Standard Base (LSB) specifications. See LSB Specifications

Requirements lists all the libraries that are mandated to exist in the specification. These should all exist in every Linux distribution. So in theory if your ELF binaries only depend on libraries and symbols listed in the LSB Core Specification, they should be able to run on any Linux install, including bare bones container OS images. Python’s manylinux platform tags are kinda/sorta redefinitions/reinventions of the LSB.

But as I learned from python-build-standalone, not all Linux distributions can conform with the LSB specifications! See Fedora 35(x64), error while loading shared libraries: libcrypt.so.1 · Issue #113 · indygreg/python-build-standalone · GitHub and 2055953 – Lack of libcrypt.so.1 in default distribution violates LSB Core Specification for an example of how distros under the RedHat umbrella failed to ship/install libcrypt.so.1 and were out of compliance with the LSB for several months!

Fortunately macOS and Windows are a bit easier to support. But Apple has historically had bugs in the macOS SDK where it allowed not-yet-defined symbols to be used when targeting older macOS versions. And CPython doesn’t have a great track record of using weak references/linking and run-time availability guards correctly either.

I highly advise against doing this. If you allow external libraries to take precedence over your own, you are assuming the external library will have complete ABI and other logical compatibility. This may just work 99% of the time. But as soon as some OS/distro or user inevitably messes up and breaks ABI or logical compat, users will be encountering crashes or other bugs and pointing the finger at you. The most reliable solution is to bundle and use your own copies of everything outside the core OS install (LSB specification on Linux) by default. Some cohorts will complain and insist to e.g. use the system libssl/libcrypto. Provide the configuration knob and allow them to footgun themselves. But leave this off by default unless you want to impose a significant support burden upon yourself.

As I wrote in the other thread, there are several *.test / */test/ packages / directories that also need to be accounted for.

While the justifications for eliding may remain, I think you’ll find the size bloat is likely offset by ditching zip + deflate for tar + <modern compression>.

I’ll note that a compelling reason to include the test modules in the PyBI is that it enables end-users to run the stdlib test harness. This can make it vastly easier to debug machine-dependent failures, as you can ask users to run the stdlib tests as a way to assess how broken a Python interpreter is. That’s why python-build-standalone includes the test modules and PyOxidizer filters them out during packaging.
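(For what it’s worth, that harness is just the stdlib’s test package; e.g. you can ask a user to run a couple of targeted modules with the interpreter in question, something like:)

import subprocess, sys

# run a few stdlib test modules with the interpreter under suspicion
subprocess.run([sys.executable, "-m", "test", "test_ssl", "test_readline"], check=False)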

7 Likes

It’s not clear to me why the standard should have to specify whether or not the test folders are included. Obviously official distributions would want to include them; on the other hand, it seems clear to me that Sir Robin (maintainer of the hypothetical “minimal” CPython) wouldn’t. Sir Robin’s distribution, after all, is for people who have done this sort of thing before, have a simple setup unlikely to cause machine-dependent failures (especially given the other things that were excluded), and want to prioritize disk space. Similarly for .pyc files. Those continue to take up space after unpacking, after all (and I might imagine Sir Robin’s clients are the sort to disable bytecode caching).

As for compression, I don’t see why this format should have to do things the same way wheels do simply because the idea is inspired by wheels. Especially given, as you say, lzma is in the standard library. (But is it actually that much better than deflate? .tgz is still a thing, right?)

My gut feel is that if we want to handle these, we should treat them more like wheels? In fact I think you could handle them as wheels already, though you need some significant infrastructure to make it practical. See:

To me the key difference between pybis and wheels is just that each environment has exactly one pybi, and any number of wheels.

In order to keep the scope under control, I’m trying to keep PEP 711 restricted to things that are different between wheels and interpreters. I agree that zip files have a lot of downsides (personally I’d love to see something like a zip file, but where entries are compressed with zstd, and each entry can optionally refer to a shared zstd dictionary, so you get the best of both worlds: random access + a high compression ratio). But if we’re going to make a better archive format, we should make an orthogonal PEP and define it for both wheels and pybis simultaneously :-).

Already replied in the other thread, but for posterity: yeah, totally agreed.

IIRC the stdlib ssl module would need adjustments to its public API before it could use the system trust store (in particular, the system APIs are blocking, and ssl’s non-blocking API assumes that cert checking doesn’t block).

…this is also a fantastic idea. But it also bumps into my scope rule about avoiding anything that applies equally to wheels :-). I’d love to see a PEP adding core metadata fields for “here’s where you download symbols” or “here’s the url for a symbol server”, though.

Yeah, unfortunately LSB was a nice aspiration? but it never really took off and is de facto dead. Manylinux takes the opposite approach of adapting to how the world is, rather than specifying how the world ought to be.

Yeah, it doesn’t have to. In the future I’d like to see us start splitting up the current python distribution into a “core” + a set of preinstalled wheels, and maybe we’d want extensions to the pybi format for that? But that’s very much a future work thing.

I have mixed feelings about this.

First, it’s very cool that you’ve been able to create a self-contained binary Python. It’s something that several people mentioned as a desideratum on one of the other threads. It also seems like it could potentially be a step towards a manager-first ecosystem for official Python releases[1], since a manager could draw on these builds to populate the environments it manages. And it could be great for distributing self-contained applications that don’t need to assume an existing Python install. So in terms of what can potentially be done with it, it seems good.

On the other hand, the given rationale seems focused in a different direction, particularly on doubling down on the existing PyPI model, which I see as largely a hindrance rather than a help to the improvement of the pypackaging world.

So basically I agree with you that PyBI may potentially be very useful, but I think my reasons for thinking that are not the same as yours. :slight_smile:

The most concrete question I have in this regard is: right now, as far as I know, everything available on PyPI is meant to be installed by Python (specifically, by pip). How does PyBI fit into this, when it is by nature something that has to be (or at least may need to be) installed before Python can install anything? What tool would someone use to install Python from a PyBI? What is the gain in leveraging the wheel format for this, when its role in the installed “stack” is going to be so different?

My other comments are really about the non-normative sections of the PEP, because they’re more about the conceptual path on which you (or others) see this PEP as a step.

To be frank, this strikes me as a very weak rationale. It’s something I’ve heard on these threads before, and I’m still puzzled by it. Again and again it is mentioned that conda solves problems that people have, but the response is “well I don’t want to use conda”, with a generic justification like “I had issues with it”, followed by some attempt to re-implement or re-conceive what conda does. What irks me about this is that there does not seem to be corresponding sympathy or uptake for those who said “well I don’t want to use pip because I had issues with it”. I hope by that I make clear that what I’m concerned with is not the use of a tool spelled c-o-n-d-a but the actual functionality provided by each tool and what needs it does or does not serve. As I mentioned above, I can see that having a self-contained Python binary is useful; I do not see how tying that into the existing PyPI ecosystem is more useful than using it to create a better ecosystem.

As I’ve mentioned on other threads, my view is that the main and in some cases the only reasons people use PyPI are:

  1. it is the default repository for the default installer that comes with Python
  2. it has a lot of packages people want

If a different installer came with Python that used a different repository but still had the packages people want, they would use that instead. So my view is that there is no need to hew closely to the way things have heretofore been done on PyPI, because many people will happily switch to a new system if it is better.

Moreover, as discussed on pypackaging-native, the multiple usage of PyPI as a source for end-users pip-installing things as well as for distro maintainers, plugin-architecture programmers, etc., is in some ways actually a problem with PyPI, not an advantage. Probably those functions should be separated.

I don’t think that is special, or insofar as it is special, some of its specialness is of a negative kind. The PyPI dependency mechanism is, for instance, “special” in that it precludes non-Python dependencies, which I see as a disadvantage. As mentioned occasionally on other threads, perhaps most cogently by @steve.dower here, wheels are not the solution to everything and the PyPI ecosystem has major limitations.

In my view, the main reason PyPI and the wheel format are special, the main reason that people use them, is, like Mount Everest, because they’re there. It is not because of any wonderful qualities they have; they are simply the most convenient vehicle for access to the wonderful qualities of Python and various libraries written in Python. For me the utility of something like PyBI would be to move away from the existing limitations, down a different path.

So, like I said, I think from a technical perspective PyBI is interesting, and actually I think it has the potential to improve the Python packaging world. But I wouldn’t say that crafting a wheel-like format to fit into PyPI is the way to achieve that improvement. Rather it would be to use PyBI as the kernel of an alternative packaging ecosystem, in which a manager installs Python by using a PyBI. This would allow moving all the dependency resolution, environment solving, etc., out of Python-level libraries like pip and into the manager. The main question for me is just whether PyBI would be a more effective starting point for this than conda.


  1. which as I’ve mentioned repeatedly is what I’d like to eventually see ↩︎

1 Like

Haven’t read the spec thoroughly yet, but as you know I’m +1 on the general idea and exactly 0 (no +/-) on whether it will turn out to actually be practical.

And in case it’s not clear, the NuGet distribution is literally the same files that go into the python.org installer and the embeddable distro. They all get packaged up and published as part of the same automatic/scripted build - the only difference is that NuGet uploads must be attached to a username, and since it’s my API key right now, they’re on my name.

FWIW, if this gets going, I’ll be putting up the embeddable distro as a PyBI for sure. A more convenient way to <command> install python-embeddable==3.10.* into your project (e.g. Blender) than curl’ing from a URL you had to construct yourself would be great.

And what I’ve done with the embeddable distro is strip it down as much as is physically possible without losing core functionality (if you want to strip it down further, you can delete extension modules). The definition of “core” functionality is vague right now, but I expect if there’s an ecosystem of people trying to share slimmed down Python installs then we’ll figure out which bits aren’t actually that important - right now it’s a bit “what I say goes”, so it’s not really been explored.

I’m also excited about this (it’s certainly what I had in mind for the Nuget packages). Very grateful you’ve done the work (along with others) to make it more feasible cross-platform.

FWIW, I’d be interested in taking part in that discussion.

Depends on whether you also have to deal with packages, and thus are going to have to handle both situations, or not. I’m in the latter camp. :wink:

Or some mechanism to know where to check the file system for the metadata once you have located the path to the interpreter.

  • Backwards-compatibility (it can create them the “old” way or just call out to venv)
  • Speed (virtualenv can create a virtual environment and then get pip and setuptools in there quickly, I think by symlinks)

It’s also a question of compatibility. For instance, on Windows you can just rename a .whl file to .zip and then unzip it straight from Windows Explorer; can’t do that with a tarball.

Have to pick your battles. :wink:

I suspect it won’t in the end, and whatever comes from python.org will contain everything in the stdlib. We can leave it to the community to customize things for one’s needs as we can’t easily guess what exactly people will not want included.

Nathaniel created posy which is written in Rust. I would probably look at adding support to the Python Launcher for Unix which is also written in Rust. There’s a myriad of ways that tooling could be built around downloading a PyBI.

That’s up to those who feel that way to speak up. But the key thing is the people who do feel that conda doesn’t meet their needs are speaking up and doing the work.

Momentum. Trying to create an entirely new packaging ecosystem is a massively hard undertaking. If someone wants to attempt it and try to convince the community to shift then they are obviously welcome to. But as I said above, the people doing the work don’t want to go that route, and so they are taking the approaches they are where PyPI still plays a part.

8 Likes