Disk space minimization for Python distributors

We found out that Python as distributed in Fedora is larger than it could be. This is not a problem in general use, but in some container environments, people want their system as minimal as possible.

Currently, for every .py source file, we ship the corresponding .pyc bytecode caches. We ship these for all 3 optimization levels (none, -O and -OO).

This means that:

  • For regular users, import is as fast as possible, no matter the optimization level they choose.
  • For the superuser (root), Python does not create files in system locations. (These files would be complicated to track, verify and clean up on uninstall.)
  • Python does not attempt to write files to system locations. Such attempts are indistinguishable from malicious software written in Python attempting to inject executable files. These attempts can be flagged as such by security software (SELinux).
  • All files of the standard library are installed and tracked by the package manager, so their integrity can be verified with standard tools.

However, for minimal environments, shipping 3 .pyc files and the source for each Python module is undesirable.

We’re looking for a way to cut the size down for space-minded people, while preserving functionality and security for everyone else.
Is anyone else having similar issues?

If possible, we would like to standardize how downstream distributors with similar goals can ship Python libraries (starting with stdlib), so we don’t have to all reinvent the wheel, rely on hacks and/or break users’ expectations.

One more restriction we have is that we’d like any minimal environment to be a subset, file-wise, of a more complete one. Installing additional files (or removing unneeded ones) is much easier to deal with than having several different sets of files.

Options

We see two possible approaches to fixing the issues.

Option 1: Shipping only .py files, and disabling creation of .pyc files

Python programs will run fine from only .py files, so shipping .pyc bytecode files is not necessary.
Programs will take a bit longer to start, but in our testing the slowdown is acceptable.
For example, importing importlib from its .py source takes on average 0.025 s longer on our machines than importing it from the .pyc bytecode cache.

The problem is that when Python imports a .py source file without finding a corresponding .pyc in __pycache__, it will try to create it. This is undesirable for the reasons listed in Motivation. Also, under the superuser account, these files are created and the disk footprint starts to grow, defeating the purpose of minimization.

To remedy this, distributors could mark the __pycache__ directories as write-protected.
When Python loads a .py file, it would look for a marker file called __pycache__/__dont_write_bytecode__ and, if found, it would skip writing the .pyc file.
Using a file as a marker would make it easy to configure, manage and verify this mechanism with package managers (and other tools that aren’t Python-specific).

The directory structure could look like this:

project/
├── some_file.py
└── __pycache__
    └── __dont_write_bytecode__

The __dont_write_bytecode__ marker would only prevent creating and updating the .pyc files.
If they already exist, they would be checked and used as usual.
This means that for normal installations, we would ship .pyc files along with the __dont_write_bytecode__ marker.
We could also let the user choose which optimization levels to install.
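The marker lookup described above could be as simple as the following sketch (a plain function for illustration, not the actual importlib patch; the marker name comes from the proposal):

```python
from pathlib import Path

def should_write_bytecode(source_path):
    """Skip writing a .pyc when __pycache__/__dont_write_bytecode__
    exists next to the source file."""
    marker = Path(source_path).parent / "__pycache__" / "__dont_write_bytecode__"
    return not marker.exists()
```

A package manager would then only need to install (or remove) the empty marker file to toggle the behavior for a whole directory.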

Option 2: Shipping the non-optimized .pyc files and compressed .py source files

While option 1 has its advantages, it suffers from somewhat slower start times and requires a new kind of marker.

Alternatively, we can ship non-optimized .pyc bytecode files instead of .py source files.
The non-optimized .pyc files would be placed where the .py source files would have been (i.e. outside of __pycache__: files inside __pycache__ are checked only if a corresponding .py file exists in the directory above).

To save space further, we would not ship the optimized .pyc files (.opt-1.pyc and .opt-2.pyc).
This would mean that users running Python with optimizations (-O, -OO, $PYTHONOPTIMIZE) would get non-optimized library modules.
We believe that this would not have an adverse impact: the optimizations are rather superficial.
If desired, we could devise a mechanism for CPython to handle the relevant bytecode parts (docstrings, __debug__) properly when running in either mode (optimized or non-optimized).
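To see how superficial the optimizations are, a small probe of a child interpreter at each level shows the only user-visible differences (assert removal and __debug__ under -O, docstring stripping under -OO):

```python
import subprocess
import sys

# Probe a child interpreter at each optimization level.
probe = "def f():\n    'doc'\nprint(__debug__, f.__doc__)"
for flags in ([], ["-O"], ["-OO"]):
    out = subprocess.run([sys.executable, *flags, "-c", probe],
                         capture_output=True, text=True, check=True)
    print(flags, out.stdout.strip())
```

No optimization prints `True doc`, -O prints `False doc`, and -OO prints `False None`.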

We can start shipping only .pyc files right now without any changes to Python.
However, this would be problematic because Python tools generally assume the .py source files are available.
One prominent example is Python tracebacks, which need the source to display source line contents, useful for debugging.
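That "shipping only .pyc works today" claim is easy to verify: compile a module, delete the source, and run the bare bytecode file (file names here are illustrative):

```python
import pathlib
import py_compile
import subprocess
import sys
import tempfile

# Compile a module to a .pyc placed where the source was, delete the
# source, and confirm the bytecode still runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp) / "hello.py"
    src.write_text("print('hi')\n")
    pyc = src.with_suffix(".pyc")
    py_compile.compile(str(src), cfile=str(pyc))  # bytecode beside the source
    src.unlink()                                  # ship only the bytecode
    out = subprocess.run([sys.executable, str(pyc)],
                         capture_output=True, text=True, check=True)

print(out.stdout.strip())  # hi
```

Any traceback raised from such a module, however, will show the file name but no source line, which is exactly the problem __pysources__ is meant to solve.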

To fix this, we can add a new optional __pysources__ directory, which would hold the source files.
Python would load the .pyc files for execution, but when it needs the source, it would look in __pysources__.

The directory structure could look like this:

project/
├── some_file.pyc
└── __pysources__
    └── some_file.py

Minimal systems would not have __pysources__ installed at all, while others get almost all benefits of having the sources available.

On normal systems, __pysources__ would be installed, so the system would behave as it does now.
Importing could even become a bit faster, as importlib would only need to stat the .pyc file to run it (compared to 2 files today: the *.py and the corresponding __pycache__/*.pyc).

A caveat is that Python wouldn’t pick up modifications to the sources. Users would need to copy the source outside __pysources__ if they wished to edit installed libraries.
We think that the __pysources__ directory is a sufficiently strong signal to make them research the situation.

For more space savings (and a stronger “don’t edit” signal), the sources in __pysources__ could be compressed.
Since this mechanism is intended for distributors, there would be no problems with compression libraries being optional: only distributions with zlib would compress the files.
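As a rough sketch of the savings, zlib (used here only as an example codec; the proposal leaves the exact format open) round-trips typical source text losslessly at a fraction of the size:

```python
import zlib

# Compress a (repetitive, illustrative) source blob and verify the round-trip.
source = b"def greet():\n    return 'hello'\n" * 20
compressed = zlib.compress(source, level=9)
assert zlib.decompress(compressed) == source  # lossless round-trip
print(len(source), len(compressed))
```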

Size impact

We calculated the size impact on Fedora’s python3-libs RPM package, which contains most of the standard library, but omits tkinter (and IDLE, turtle, turtledemo) for dependency reasons, and test (and test suites of other modules) for size reasons. The omitted parts can be installed from other packages. Such a split is fairly typical in Linux distributions.

The exact numbers will vary between distributions and Python versions, but the following table should be representative.

Option                                                        Size      Difference
Status quo                                                    31.8 MiB       —
Shipping .py and non-optimized .pyc                           22.8 MiB  −8.9 MiB (−28%)
Option 1: shipping only .py, disabling creation of .pyc       15.2 MiB  −16.5 MiB (−52%)
Option 2: shipping non-optimized .pyc and zip-compressed .py  17.1 MiB  −14.6 MiB (−46%)
Minimal 2: shipping non-optimized .pyc only                   13.5 MiB  −18.3 MiB (−57%)

Other ideas

We have a much more thorough brainstorming document.

One idea we already implemented is that large auto-generated files (pydoc_data and several encoding modules) are shipped as .pyc only, without source, since the source is not very informative and differences in optimization levels are negligible.


See also Compress the marshalled data in PYC files, open for 6 years. It’s unclear whether the disk space is really reduced significantly and whether the slowdown in Python startup time is acceptable.

See also this discussion in 2018 about putting the stdlib in a ZIP file: Re: [Python-Dev] Python startup time.

I’ll throw out the Python packed resources data structure I developed for PyOxidizer as a potential path forward. While not yet implemented, I have plans for it to support compression and just-in-time extraction of files to the filesystem to support cases where we can’t or don’t want to load things from memory.

You can play around with this data structure using the oxidized_importer Python package.

My hopes are that eventually this work will mature to the point where someone can propose a PEP to make it a recognized Python resources distribution format with stdlib support (just like zip files are recognized today).

I’ll suggest right now that stdlib support is unlikely, but probably unnecessary. Packaging tools are basically all out of the stdlib already, so you’d be more interested in PyPA adoption of the project. That would allow you more freedom in terms of release schedule, support for older CPython versions, and not having to worry so much about alternate implementations.

I get that being in the stdlib is a badge of honour, but the more practical aspects really make it something that you should want to avoid :wink:

(What you’re more likely to need and get in the stdlib is a way to register import hooks automatically. I’d put effort into that approach, rather than trying to get all your code in there.)


Thanks for the context, Steve! I’m happy to work with PyPA if that is the path forward.

I also think we’re still a ways out from going down this road. I definitely don’t want to “standardize” before I have more confidence in the technical approach. I like where we are where oxidized_importer can be used as a custom meta-path importer installed via pip install oxidized_importer.

That being said, I will say that if we’re talking about alternative ways of packaging the standard library itself, some level of core support is needed. That’s because .py based modules are imported during interpreter startup. But if we can somehow teach interpreter startup to register a custom meta-path importer without having to compile your own executable embedding Python, that could alleviate the need for custom packaging/importing support in the stdlib. I have no clue how this would work though: we’re talking about code that has to run between Py_PreInitialize() and Py_InitializeFromConfig(), and you are effectively limited to loading/running modules with no dependencies outside of built-in extension modules since the interpreter isn’t fully initialized. It is possible to pull off (see PyOxidizer + oxidized_importer). But as a generic extension mechanism, it will require a bit of thought!


A variant on option 1 to consider for some use cases is to ship only source files and set PYTHONPYCACHEPREFIX to a user-writable directory: https://docs.python.org/3/using/cmdline.html#envvar-PYTHONPYCACHEPREFIX

This approach isn’t necessarily suitable for general purpose multi-user systems due to the extra attack vectors that writable cache directories can open up, but for a lot of IoT and cloud use cases, it would allow the py source files in the base image to be cleanly separated from a runtime pyc cache stored in /var or /tmp (the original motivating use case was container based dev environments where the previous workaround of turning off bytecode caching entirely was costing several seconds per test run).
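The redirection can be demonstrated from Python itself (temp directories and the module name are illustrative): run a child interpreter with PYTHONPYCACHEPREFIX set and check where the cache lands.

```python
import os
import pathlib
import subprocess
import sys
import tempfile

# Redirect bytecode caches with PYTHONPYCACHEPREFIX (Python 3.8+) so .pyc
# files land in a writable tree instead of next to the sources.
with tempfile.TemporaryDirectory() as cache, tempfile.TemporaryDirectory() as app:
    pathlib.Path(app, "mod.py").write_text("x = 1\n")
    env = dict(os.environ, PYTHONPYCACHEPREFIX=cache)
    env.pop("PYTHONDONTWRITEBYTECODE", None)  # make sure caching is on
    subprocess.run([sys.executable, "-c", "import mod"],
                   cwd=app, env=env, check=True)
    no_local_cache = not pathlib.Path(app, "__pycache__").exists()
    cache_redirected = any(pathlib.Path(cache).rglob("*.pyc"))

print(no_local_cache, cache_redirected)
```

Both checks come out True: no __pycache__ appears beside the source, and the .pyc is written under the cache prefix, mirroring the source's absolute path.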

Some variants on option 2 to consider:

  • allowing __pysource__ to contain a metadata file that describes how to obtain the source files, or points to a different directory for them (note: I dropped the trailing s deliberately, as I don’t think it improves the clarity of the name)
  • supporting a PYTHONPYSOURCEPREFIX option similar to PYTHONPYCACHEPREFIX to allow parallel trees rather than nested directories

Regarding zipping the entire standard library (from the larger brainstorming doc):
$ python3 -m site | grep zip
    '/usr/lib64/python38.zip',

Compressing the entire stdlib is theoretically supported, so the interpreter includes an entry for it in the default definition of sys.path. Running that way isn’t as well tested as running uncompressed, but it’s expected to work.
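The same zipimport mechanism behind that sys.path entry can be exercised directly (archive and module names are illustrative):

```python
import pathlib
import sys
import tempfile
import zipfile

# Build a zip containing a module and import it via the standard
# zipimport path hook, just like the python38.zip entry on sys.path.
with tempfile.TemporaryDirectory() as tmp:
    zpath = pathlib.Path(tmp) / "lib.zip"
    with zipfile.ZipFile(zpath, "w") as z:
        z.writestr("zipped_mod.py", "VALUE = 42\n")
    sys.path.insert(0, str(zpath))
    import zipped_mod

print(zipped_mod.VALUE)  # 42
```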

It’s how the Windows embedded distribution works, so it’s better tested than you might think at first glance :slightly_smiling_face:
