PEP 711: PyBI: a standard format for distributing Python Binaries

indygreg · April 9, 2023, 10:12pm

I figure I should weigh in here since I’ve “solved” similar problems that PyBI is attempting to solve with python-build-standalone.

Foremost, excellent work, Nathaniel! I’ve long wanted to see a PEP to formalize standalone Python distributions. You’ve beat me to it and this PEP is off to a great start!

Apologies for the wall of randomly ordered thoughts below.

The technological purist in me doesn’t like the choice of zip files because they yield sub-optimal compression because they use a) a 40+ year old compression algorithm (deflate / zlib) b) individual compression of each file means repeated segments across files can’t be shared and overall archive size is larger. A big benefit of zip is you get a file index and can address/decompress individual files. Since you’ll likely need to extract all archive members for a usable Python distribution, the choice of zip is not ideal. But discarding the precedent of wheels being zips and having to reinvent the wheel (har har) is also not ideal. Zips are fine I guess. But I wish they were tars using a modern compression, like zstd (or lzma, since that’s in the stdlib).

One of the potential use cases for PyBI is to facilitate more turnkey distribution of Python-based applications. There’s a lot of value in being able to take a pre-built Python distribution off-the-shelf and integrating it into a larger application. As I learned writing PyOxidizer, you can need a lot of metadata about the distribution to pull this off. See Distribution Archives — python-build-standalone documentation for all the metadata I ended up adding. Much of this metadata was added to facilitate cross-compilation. When cross-compiling, you can’t just run the interpreter to resolve things like the bytecode cache tag, the path to the site-packages directory, or compiler flags used to build the distribution. I detailed this at What information is useful to know statically about an interpreter? - #7 by indygreg. The metadata currently in PEP 711 is currently inadequate for doing some of these more advanced things. I recognize that defining all this metadata is arguably scope bloat. But if we add a few more missing pieces there might be enough here to allow me to delete the python-build-standalone project or refactor it to produce PyBIs. At the very least I’d love for PyOxidizer to consume PyBIs: if this happens it means others can write tools like PyOxidizer without having to solve the build your own Python distribution problem, which is non-trivial.

On the theme of distribution metadata, python-build-standalone’s full distributions contain the raw object files used to link libpython and extensions and JSON metadata describing them all. PyOxidizer can take this metadata and produce a custom libpython with only the components an application needs. Or it can link a single file binary embedding Python. Powerful functionality. Not something you can currently do with PyBI. Probably way too complicated for what you want PyBI to do (since you ruled out PyBI sdists as out of scope). But I thought I’d mention it as a possible future extension of this work.

Also as noted in the other thread is the presence of licensing metadata. PyOxidizer can strip copyleft components out of a Python distribution and emit licensing info for all included components to make it easier for downstream customers to satisfy legal distribution requirements. It would be amazing to have licensing annotations in PyBI. At the very least I think you need to include the license texts in the PyBI to satisfy legal requirements for distribution. CPython currently contains licenses for components in the CPython source repo. But 3rd party components like OpenSSL, libxzma, tcl/tk, etc need to have their own license texts distributed of those libraries are in the PyBI.

One thing that both current PyBI and python-standalone-distributions fail to distribute is the terminfo database. readline/libedit encode a path to the terminfo database at build time. If you copy to a machine or environment without the terminfo database in the same path as the build machine, readline doesn’t work and a Python REPL behaves poorly. Users complain. PyOxidizer works around this by having Rust code sniff for terminfo files in well-known locations at run-time before the interpreter is initialized. But the correct solution is to build this sniffing into CPython and bundle a copy of the terminfo database with the Python distribution in case one cannot be found.

Everything I just said about the terminfo database arguably applies to the trusted certificate authorities list as well. On Windows and macOS you should always have the OS database to use. (Can’t remember if CPython supports this out-of-the-box yet - it should.) On Linux, you may get unlucky and not have one (common in container environments).

Another limitation with PyBI will be that references to the build toolchain and config are baked into sysconfig data and read by distutils, pip, etc to compile extension modules. (I think I mentioned this in the topic when Nathaniel first introduced PyBI.) There’s a non-negligible chance that the compiler and flags used on the build machine won’t work on the running machine. So if people attempt to e.g. pip install using a PyBI interpreter and there isn’t a binary wheel available, chances are high they’ll get a compiler error. To solve this problem you either need to do the logical equivalent of reinvent autoconf (distutils kinda sorta does aspects of this) or you need to distribute your own compiler toolchain and use it. Hello, scope bloat! You may want to have interpreters advertise that their sysconfig metadata for the compiler came from an incompatible machine so downstream tools like pip can fail more gracefully. Note that this is an existing problem but it will get much worse with PyBI since many people today just install a python-dev[el] system package to pull in dependencies. But this just works today because the Python interpreter was built with the same toolchain used by your OS / distro. PyBI opens us up to e.g. RedHat vs Debian, gcc vs clang, msvc vs gnu, etc toolchain config differences. I think the path of least resistance is distributing your own toolchains since otherwise you’ll be debugging compatibility with random toolchains on users’ machines. Fortunately Python already has a mostly working solution here in the form of quay.io/pypa/manylinux* container images and projects like cibuildwheel to automatically use them. But you might want to start pushing these toolchains’ use in build tools like distutils and pip.

It looks like your current PyBI strip debug symbols. (Presumably for size savings.) Debug symbols are useful. People like me who work on enabling [performance] debugging at scale for engineering organizations like having debug symbols readily available. (Having the ability to get meaningful flamegraphs for any process running in your fleet is life changing.) It’s fine to ship PyBI without debug symbols to cut down on size. But there needs to be a way to get the debug symbols. Either PyBI variants with them unstripped or supplement PyBI-like archives with just the debug symbols (similar to how Linux packaging ecosystem does it). Maybe support for a symbol server. The location of the debug symbols may need to be built into the PyBI metadata. And/or tools consuming PyBI may need to be aware of PyBI variants with debug symbols so users can prefer to fetch them by default. (This problem already exists for binary wheels and I’m unsure if there are any discussions or PEPs about it. Please remember that CPython has its own debug build / ABI settings that are different from debug symbols and therefore debug symbols exist outside Python platform tags. For some reason a lot of people seem to not understand that debug symbols and compiler optimizations are independent and it is fully valid to have a PGO+LTO+BOLT binary with debug symbols - probably because lots of build systems strip debug symbols when building in optimized mode.)

To be pedantic, this stuff is defined by the Linux Standard Base (LSB) specifications. See LSB Specifications

Requirements lists all the libraries that are mandated to exist in the specification. These should all exist in every Linux distribution. So in theory if your ELF binaries only depend on libraries and symbols listed in the LSB Core Specification, they should be able to run on any Linux install, including bare bones container OS images. Python’s manylinux platform tags are kinda/sorta redefinitions/reinventions of the LSB.

But as I learned from python-build-standalone, not all Linux distributions can conform with the LSB specifications! See Fedora 35(x64), error while loading shared libraries: libcrypt.so.1 · Issue #113 · indygreg/python-build-standalone · GitHub and 2055953 – Lack of libcrypt.so.1 in default distribution violates LSB Core Specification for an example of how distros under the RedHat umbrella failed to ship/install libcrypt.so.1 and were out of compliance with the LSB for several months!

Fortunately macOS and Windows are a bit easier to support. But Apple has historically had bugs in the macOS SDK where it allowed not-yet-defined symbols to be used when targeting older macOS versions. And CPython doesn’t have a great track record of using weak references/linking and run-time availability guards correctly either.

I highly advise against doing this. If you allow external libraries to take precedence over your own, you are assuming the external library will have complete ABI and other logical compatibility. This may just work 99% of the time. But as soon as some OS/distro or user inevitably messes up and breaks ABI or logical compat, users will be encountering crashes or other bugs and pointing the finger at you. The most reliable solution is to bundle and use your own copies of everything outside the core OS install (LSB specification on Linux) by default. Some cohorts will complain and insist to e.g. use the system libssl/libcrypto. Provide the configuration knob and allow them to footgun themselves. But leave this off by default unless you want to impose a significant support burden upon yourself.

As I wrote in the other thread, there are several *.test / */test/ packages / directories that also need to be accounted for.

While the justifications for eliding may remain, I think you’ll find the size bloat is likely offset by ditching zip + deflate for tar + <modern compression>.

I’ll note that a compelling reason to include the test modules in the PyBI is that it enables end-users to run the stdlib test harness. This can make it vastly easier to debug machine-dependent failures as you can ask users to run the stdlib tests as a way to assess how broke a Python interpreter is. That’s why python-build-standalone includes the test modules and PyOxidizer filters them out during packaging.