We found out that Python as distributed in Fedora is larger than it could be. This is not a problem in general use, but in some container environments, people want their system as minimal as possible.
Currently, for every .py source file, we ship the corresponding .pyc bytecode caches. We ship these for all 3 optimization levels (none, -O and -OO).
This means that:
- For regular users, import is as fast as possible, no matter the optimization level they choose.
- For the superuser (root), Python does not create files in system locations. (These files would be complicated to track, verify and clean up on uninstall.)
- Python does not attempt to write files to system locations. Such attempts are indistinguishable from malicious software written in Python attempting to inject executable files, and security software (SELinux) can flag them as such.
- All files of the standard library are installed and tracked by the package manager, so their integrity can be verified with standard tools.
However, for minimal environments, shipping 3 .pyc files and the source for each Python module is undesirable.
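To make the "four files per module" point concrete, here is a small sketch that lists what is shipped for one source file. It relies on the standard importlib.util.cache_from_source function; the helper name shipped_files and the example path are only illustrative.

```python
import importlib.util


def shipped_files(source_path):
    """Return the source path plus its three bytecode cache paths."""
    caches = [
        importlib.util.cache_from_source(source_path, optimization=level)
        for level in ("", 1, 2)  # none, -O, -OO
    ]
    return [source_path, *caches]


# Illustrative path only; any .py path works.
for path in shipped_files("project/some_file.py"):
    print(path)
```

The exact cache file names include the interpreter's cache tag, so they differ between Python versions.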
We’re looking for a way to cut the size down for space-minded people, while preserving functionality and security for everyone else.
Is anyone else having similar issues?
If possible, we would like to standardize how downstream distributors with similar goals can ship Python libraries (starting with stdlib), so we don’t have to all reinvent the wheel, rely on hacks and/or break users’ expectations.
One more restriction we have is that we’d like any minimal environment to be a subset, file-wise, of a more complete one. Installing additional files (or removing unneeded ones) is much easier to deal with than having several different sets of files.
Options
We see two possible approaches to fixing the issues.
Option 1: Shipping only .py files, and disabling creation of .pyc files
Python programs will run fine from only .py files, so shipping .pyc bytecode files is not necessary.
Programs will take a bit longer to start, but in our testing the slowdown is acceptable.
For example, importing importlib.py takes on average 0.025s longer on our machines compared to the .pyc bytecode file.
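A rough way to see where the extra start-up time comes from is to compare compiling source text against unmarshalling already-compiled bytecode. The sketch below uses a synthetic module as a stand-in (an assumption, not one of the files we measured), so its numbers will not match the figure above.

```python
import marshal
import timeit

# Synthetic module source; a stand-in for a real library file.
SOURCE = "\n".join(
    f"def func_{i}(x):\n    return x * {i} + 1\n" for i in range(500)
)


def load_from_source(source=SOURCE):
    """What a .py-only import has to do: parse and compile the text."""
    return compile(source, "<synthetic>", "exec")


CACHED = marshal.dumps(load_from_source())


def load_from_cache(data=CACHED):
    """What a .pyc import does after the header check: unmarshal bytecode."""
    return marshal.loads(data)


if __name__ == "__main__":
    runs = 20
    t_src = timeit.timeit(load_from_source, number=runs) / runs
    t_pyc = timeit.timeit(load_from_cache, number=runs) / runs
    print(f"compile: {t_src:.4f}s  unmarshal: {t_pyc:.4f}s")
```

This only isolates the compile step; a real import also pays for file system access and module execution in both cases.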
The problem is that when Python imports a .py source file without finding a corresponding .pyc in __pycache__, it will try to create it. This is undesirable for the reasons listed in Motivation. Also, under the superuser account, these files are created and the disk footprint starts to grow, defeating the purpose of minimization.
To remedy this, distributors could mark the __pycache__ directories as write-protected.
When Python loads a .py file, it would look for a marker file called __pycache__/__dont_write_bytecode__ and, if found, it would skip writing the .pyc file.
Using a file as a marker would make it easy to configure, manage and verify this mechanism with package managers (and other tools that aren’t Python-specific).
The directory structure could look like this:
project/
├── some_file.py
└── __pycache__
    └── __dont_write_bytecode__
The __dont_write_bytecode__ marker would only prevent creating and updating the .pyc files.
If they already exist, they would be checked and used as usual.
This means that for normal installations, we would ship .pyc files along with the __dont_write_bytecode__ marker.
We could also let the user choose which optimization levels to install.
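The marker check in Option 1 could be sketched roughly as follows. This is not how importlib works today; the function name may_write_bytecode is hypothetical, and only the marker file name comes from the proposal.

```python
from pathlib import Path

MARKER_NAME = "__dont_write_bytecode__"  # name taken from the proposal


def may_write_bytecode(source_path):
    """Return False if the marker file sits in the module's __pycache__.

    Hypothetical helper: existing .pyc files would still be read as usual,
    this only decides whether a new or updated .pyc may be written.
    """
    cache_dir = Path(source_path).parent / "__pycache__"
    return not (cache_dir / MARKER_NAME).exists()
```

For comparison, the switches that exist today are process-wide rather than per-directory: the -B command-line option, the PYTHONDONTWRITEBYTECODE environment variable, and sys.dont_write_bytecode.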
Option 2: Shipping the non-optimized .pyc files and compressed .py source files
While option 1 has its advantages, it suffers from somewhat slower start times and needs a new kind of marker.
Alternatively, we can ship non-optimized .pyc bytecode files instead of .py source files.
The non-optimized .pyc files would be placed where the .py source files would have been (i.e. outside of __pycache__: files inside __pycache__ are checked only if a corresponding .py file exists in the directory above).
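Importing from a .pyc that sits where the .py would have been already works in current Python. The sketch below demonstrates it in a temporary directory; the helper import_sourceless and the module name sourceless_demo are made up for the example.

```python
import importlib
import os
import py_compile
import sys
import tempfile


def import_sourceless(module_name, source_text):
    """Compile source_text to a .pyc next to the source, drop the .py, import."""
    directory = tempfile.mkdtemp()
    source_path = os.path.join(directory, module_name + ".py")
    with open(source_path, "w") as f:
        f.write(source_text)
    # Writing the .pyc beside the source (not into __pycache__) puts it
    # in the location the import system checks for sourceless modules.
    py_compile.compile(source_path, cfile=source_path + "c", doraise=True)
    os.remove(source_path)
    sys.path.insert(0, directory)
    try:
        importlib.invalidate_caches()
        return importlib.import_module(module_name)
    finally:
        sys.path.remove(directory)


module = import_sourceless("sourceless_demo", "ANSWER = 42\n")
print(module.__file__)  # path ends in .pyc
```

The same layout can be produced for a whole tree with `python -m compileall -b`, which writes the .pyc files to these legacy locations.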
To save space further, we would not ship the optimized .pyc files (.opt-1.pyc and .opt-2.pyc).
This would mean that users running Python with optimizations (-O, -OO, $PYTHONOPTIMIZE) would get non-optimized library modules.
We believe that this would not have an adverse impact: the optimizations are rather superficial.
If desired, we could devise a mechanism for CPython to handle the relevant bytecode parts (docstrings, __debug__) properly when running in either mode (optimized or non-optimized).
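To illustrate what the optimization levels actually change, the following sketch compiles the same small function at each level using the optimize argument of the built-in compile(). The sample function and the helper name build are invented for the demonstration.

```python
SOURCE = '''
def check(value):
    """Return value after a sanity check."""
    assert value > 0, "value must be positive"
    return value
'''


def build(optimize):
    """Compile SOURCE at the given level (0 = none, 1 = -O, 2 = -OO)."""
    namespace = {}
    exec(compile(SOURCE, "<demo>", "exec", optimize=optimize), namespace)
    return namespace["check"]


for level in (0, 1, 2):
    check = build(level)
    try:
        check(-1)
        asserts = "stripped"
    except AssertionError:
        asserts = "kept"
    has_doc = check.__doc__ is not None
    print(f"level {level}: asserts {asserts}, docstring present: {has_doc}")
```

Level 1 drops assert statements (and code guarded by __debug__), and level 2 additionally drops docstrings. Under Option 2, library modules would always behave like level 0 regardless of how the interpreter was started.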
We can start shipping only .pyc files right now without any changes to Python.
However, this would be problematic because Python tools generally assume the .py source files are available.
One prominent example is Python tracebacks, which need the source to display source line contents, useful for debugging.
To fix this, we can add a new optional __pysources__ directory, which would hold the source files.
Python would load the .pyc files for execution, but when it needs the source, it would look in __pysources__.
The directory structure could look like this:
project/
├── some_file.pyc
└── __pysources__
    └── some_file.py
Minimal systems would not have __pysources__ installed at all, while others get almost all benefits of having the sources available.
On normal systems, __pysources__ would be installed, so the system would behave as it does now.
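The source lookup for Option 2 might look something like the sketch below. Nothing like this exists in CPython today; find_source is a hypothetical helper, only the __pysources__ directory name comes from the proposal, and compressed sources are not handled.

```python
from pathlib import Path

SOURCES_DIR = "__pysources__"  # directory name taken from the proposal


def find_source(pyc_path):
    """Return the matching file in __pysources__, or None if not installed.

    Hypothetical helper: a tool such as the traceback machinery would call
    this when it needs source lines for a module loaded from a .pyc file.
    """
    pyc_path = Path(pyc_path)
    candidate = pyc_path.parent / SOURCES_DIR / (pyc_path.stem + ".py")
    return candidate if candidate.is_file() else None
```

Returning None on minimal systems lets callers fall back to the behaviour they already have for sourceless modules, such as tracebacks without source lines.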
Importing could even become a bit faster, as importlib would only need to stat the .pyc file to run it (compared to 2 files today: the *.py and the corresponding __pycache__/*.pyc).
A caveat is that Python wouldn’t pick up modifications to the sources. Users would need to copy the source outside __pysources__ if they wished to edit installed libraries.
We think that the __pysources__ directory is a sufficiently strong signal to make them research the situation.
For more space savings (and a stronger “don’t edit” signal), the sources in __pysources__ could be compressed.
Since this mechanism is intended for distributors, there would be no problems with compression libraries being optional: only distributions with zlib would compress the files.
Size impact
We calculated the size impact on Fedora’s python3-libs RPM package, which contains most of the standard library, but omits tkinter (and IDLE, turtle, turtledemo) for dependency reasons, and test (and test suites of other modules) for size reasons. The omitted parts can be installed from other packages. Such a split is fairly typical in Linux distributions.
The exact numbers will vary between distributions and Python versions, but the following table should be representative.
| Option | Size | Difference (MiB) | Difference (%) |
|---|---|---|---|
| Status quo | 31.8 MiB | | |
| Shipping .py and non-optimized .pyc | 22.8 MiB | -8.9 MiB | -28% |
| Option 1: Shipping only .py, disabling creation of .pyc | 15.2 MiB | -16.5 MiB | -52% |
| Option 2: Shipping non-optimized .pyc and zip-compressed .py | 17.1 MiB | -14.6 MiB | -46% |
| Minimal 2: Shipping non-optimized .pyc only | 13.5 MiB | -18.3 MiB | -57% |
Other ideas
We have a much more thorough brainstorming document.
One idea we already implemented is that large auto-generated files (pydoc_data and several encoding modules) are shipped as .pyc only, without source, since the source is not very informative and differences in optimization levels are negligible.