We found out that Python as distributed in Fedora is larger than it could be. This is not a problem in general use, but in some container environments, people want their system as minimal as possible.
Currently, for every `.py` source file, we ship the corresponding `.pyc` bytecode caches. We ship these for all 3 optimization levels (none, `-O` and `-OO`).
This means that:
- For regular users, import is as fast as possible, no matter the optimization level they choose.
- For the superuser (root), Python does not create files in system locations. (These files would be complicated to track, verify and clean up on uninstall.)
- Python does not attempt to write files to system locations. Such attempts are indistinguishable from malicious software written in Python attempting to inject executable files, and security software (SELinux) can flag them as such.
- All files of the standard library are installed and tracked by the package manager, so their integrity can be verified with standard tools.
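As a small illustration of the three caches per source file mentioned above, the standard `importlib.util.cache_from_source` function reports where CPython expects each one. The path `project/some_file.py` is only a placeholder, and the exact file names depend on the interpreter's cache tag.

```python
import importlib.util

# Placeholder path; the file does not need to exist for this lookup.
source = "project/some_file.py"

# optimization="" is the non-optimized cache; 1 and 2 correspond to -O and -OO.
cache_paths = [
    importlib.util.cache_from_source(source, optimization=level)
    for level in ("", 1, 2)
]

for path in cache_paths:
    print(path)
```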
However, for minimal environments, shipping 3 `.pyc` files and the source for each Python module is undesirable.
We’re looking for a way to cut the size down for space-minded people, while preserving functionality and security for everyone else.
Is anyone else having similar issues?
If possible, we would like to standardize how downstream distributors with similar goals can ship Python libraries (starting with stdlib), so we don’t have to all reinvent the wheel, rely on hacks and/or break users’ expectations.
One more restriction we have is that we’d like any minimal environment to be a subset, file-wise, of a more complete one. Installing additional files (or removing unneeded ones) is much easier to deal with than having several different sets of files.
We see two possible approaches to fixing the issues.
Option 1: Shipping only `.py` files, and disabling creation of `.pyc` files
Python programs will run fine from only `.py` files, so shipping `.pyc` bytecode files is not necessary.
Programs will take a bit longer to start, but in our testing the slowdown is acceptable.
For example, importing `importlib.py` takes on average 0.025s longer on our machines compared to the `.pyc` bytecode file.
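To get a rough feel for where that extra time goes, here is a minimal sketch that compares compiling source text with loading an already compiled code object. It is not the measurement we did: the module is synthetic, and the timings will differ between machines.

```python
import marshal
import time

# Synthetic stand-in for a library module: 2000 small functions.
source = "\n".join(
    f"def func_{i}(x):\n    return x + {i}\n" for i in range(2000)
)

start = time.perf_counter()
code = compile(source, "<synthetic module>", "exec")
compile_seconds = time.perf_counter() - start

# A .pyc file is essentially a small header followed by the marshalled code object.
payload = marshal.dumps(code)

start = time.perf_counter()
loaded = marshal.loads(payload)
unmarshal_seconds = time.perf_counter() - start

print(f"compile from source: {compile_seconds:.4f}s")
print(f"load from bytecode:  {unmarshal_seconds:.4f}s")
```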
The problem is that when Python imports a `.py` source file without finding a corresponding `.pyc` file in `__pycache__`, it will try to create it. This is undesirable for the reasons listed in Motivation. Also, under the superuser account, these files are created and the disk footprint starts to grow, defeating the purpose of minimization.
To remedy this, distributors could mark the `__pycache__` directories as write-protected.
When Python loads a `.py` file, it would look for a marker file called `__pycache__/__dont_write_bytecode__` and, if found, it would skip writing the `.pyc` file.
Using a file as a marker would make it easy to configure, manage and verify this mechanism with package managers (and other tools that aren’t Python-specific).
The directory structure could look like this:
    project/
    ├── some_file.py
    └── __pycache__
        └── __dont_write_bytecode__
The `__dont_write_bytecode__` marker would only prevent creating and updating the `.pyc` files.
If they already exist, they would be checked and used as usual.
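The marker check could be sketched roughly as follows. This is not how CPython behaves today: `may_write_bytecode` is a hypothetical helper, and only the marker name comes from the proposal.

```python
import tempfile
from pathlib import Path

MARKER_NAME = "__dont_write_bytecode__"  # name taken from the proposal


def may_write_bytecode(source_path):
    """Hypothetical helper: False if the marker sits in the module's __pycache__."""
    marker = Path(source_path).parent / "__pycache__" / MARKER_NAME
    return not marker.exists()


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    (project / "__pycache__").mkdir(parents=True)
    source = project / "some_file.py"
    source.write_text("x = 1\n")

    before = may_write_bytecode(source)  # no marker yet
    (project / "__pycache__" / MARKER_NAME).touch()
    after = may_write_bytecode(source)  # marker present

print(before, after)
```

Because the check only looks at a file's existence, a package manager can switch the behavior on or off by installing or removing a single empty file.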
This means that for normal installations, we would ship `.pyc` files along with the marker.
We could also let the user choose which optimization levels to install.
Option 2: Shipping the non-optimized `.pyc` files and compressed `.py` source files
While option 1 has its advantages, it suffers from somewhat slower start times and needs a new kind of marker.
Alternatively, we can ship non-optimized `.pyc` bytecode files instead of `.py` source files.
The `.pyc` files would be placed where the `.py` source files would have been (i.e. outside of `__pycache__`: files inside `__pycache__` are checked only if a corresponding `.py` file exists in the directory above).
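The following sketch shows this placement working with an unmodified CPython: a `.pyc` file sitting where the `.py` file would be is imported on its own. The module name `sourceless_demo` and its contents are made up for the example.

```python
import importlib
import py_compile
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "sourceless_demo.py"
    source.write_text("ANSWER = 42\n")

    # Write the bytecode next to the source instead of into __pycache__.
    py_compile.compile(str(source), cfile=str(source.with_suffix(".pyc")), doraise=True)
    source.unlink()  # only sourceless_demo.pyc remains

    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        module = importlib.import_module("sourceless_demo")
        answer = module.ANSWER
        loaded_from = Path(module.__file__).name
    finally:
        sys.path.remove(tmp)
        sys.modules.pop("sourceless_demo", None)

print(answer, loaded_from)
```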
To save space further, we would not ship the optimized `.pyc` files.
This would mean that users running Python with optimizations (`$PYTHONOPTIMIZE`) would get non-optimized library modules.
We believe that this would not have an adverse impact: the optimizations are rather superficial.
If desired, we could devise a mechanism for CPython to handle the relevant bytecode parts (docstrings, `__debug__`) properly when running in either mode (optimized or non-optimized).
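To show what the optimization levels actually change, this sketch compiles the same made-up function at each level using the `optimize` argument of the built-in `compile()`. It only illustrates the existing behavior, not the mechanism suggested above.

```python
source = '''
def check(value):
    """Docstring of check."""
    assert value > 0, "value must be positive"
    return __debug__
'''

results = {}
for level in (0, 1, 2):
    namespace = {}
    exec(compile(source, "<demo>", "exec", optimize=level), namespace)
    check = namespace["check"]
    try:
        debug_flag = check(-1)
        assert_ran = False
    except AssertionError:
        debug_flag = True  # asserts are only compiled in when __debug__ is true
        assert_ran = True
    # (did the assert fire?, value of __debug__, docstring)
    results[level] = (assert_ran, debug_flag, check.__doc__)

for level, row in results.items():
    print(level, row)
```

Level 1 drops the `assert` and turns `__debug__` into `False`; level 2 additionally drops the docstring.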
We can start shipping only `.pyc` files right now without any changes to Python.
However, this would be problematic because Python tools generally assume the `.py` source files are available.
One prominent example is Python tracebacks, which need the source to display source line contents, useful for debugging.
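The effect on tracebacks can be reproduced without touching any installed files. In this sketch the code object claims to come from a made-up path that does not exist, which stands in for a module whose source is not installed.

```python
import traceback

source = "def fail():\n    return 1 / 0\n"

# The file name is made up, so no source file can be found for the traceback.
code = compile(source, "/nonexistent/sourceless_module.py", "exec")
namespace = {}
exec(code, namespace)

try:
    namespace["fail"]()
except ZeroDivisionError:
    report = traceback.format_exc()

print(report)
```

The report still names the file, line number and function for the `fail` frame, but the text of the failing line is missing.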
To fix this, we can add a new optional `__pysources__` directory, which would hold the source files.
Python would load the `.pyc` files for execution, but when it needs the source, it would look in `__pysources__`.
The directory structure could look like this:
    project/
    ├── some_file.pyc
    └── __pysources__
        └── some_file.py
Minimal systems would not have `__pysources__` installed at all, while others get almost all benefits of having the sources available.
On normal systems, `__pysources__` would be installed, so the system would behave as it does now.
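The source lookup described above could be sketched like this. `find_source` is a hypothetical helper written for illustration, not an existing importlib API, and it assumes uncompressed sources that keep the module's name.

```python
import tempfile
from pathlib import Path


def find_source(pyc_path):
    """Hypothetical lookup: map project/mod.pyc to project/__pysources__/mod.py."""
    pyc_path = Path(pyc_path)
    candidate = pyc_path.parent / "__pysources__" / (pyc_path.stem + ".py")
    return candidate if candidate.is_file() else None


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    project.mkdir()
    pyc = project / "some_file.pyc"
    pyc.write_bytes(b"")  # placeholder; only the path matters for the lookup

    minimal_result = find_source(pyc)  # minimal system: no __pysources__

    (project / "__pysources__").mkdir()
    (project / "__pysources__" / "some_file.py").write_text("x = 1\n")
    full_result = find_source(pyc)  # normal system: sources installed
    full_name = full_result.name if full_result else None

print(minimal_result, full_name)
```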
Importing could even become a bit faster, as importlib would only need to stat the `.pyc` file to run it (compared to 2 files today: the `*.py` and the corresponding `.pyc`).
A caveat is that Python wouldn’t pick up modifications to the sources. Users would need to copy the source outside `__pysources__` if they wished to edit installed libraries.
We think that the `__pysources__` directory is a sufficiently strong signal to make them research the situation.
For more space savings (and a stronger “don’t edit” signal), the sources in `__pysources__` could be compressed.
Since this mechanism is intended for distributors, there would be no problems with compression libraries being optional: only distributions with `zlib` would compress the files.
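As a rough idea of what compression buys, this sketch runs a synthetic source file through `zlib`. The on-disk format for compressed sources is not specified above, so the example only covers the compression round trip; real library sources will compress differently than this repetitive sample.

```python
import zlib

# Synthetic stand-in for a library source file.
source = "\n".join(
    f"# helper number {i}\ndef func_{i}(x):\n    return x + {i}\n"
    for i in range(500)
).encode("utf-8")

compressed = zlib.compress(source, 9)
restored = zlib.decompress(compressed)

print(len(source), len(compressed))
```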
We calculated the size impact on Fedora’s `python3-libs` RPM package, which contains most of the standard library, but omits some modules (including `turtledemo`) for dependency reasons, and `test` (and test suites of other modules) for size reasons. The omitted parts can be installed from other packages. Such a split is fairly typical in Linux distributions.
The exact numbers will vary between distributions and Python versions, but the following table should be representative.
| Option | Size | Difference (MiB) | Difference (%) |
|---|---|---|---|
| Status quo | 31.8 MiB | | |
| | 22.8 MiB | -8.9 MiB | -28% |
| Option 1: Shipping only `.py` files | 15.2 MiB | -16.5 MiB | -52% |
| Option 2: Shipping non-optimized `.pyc` files and compressed `.py` source files | 17.1 MiB | -14.6 MiB | -46% |
| Minimal 2: Shipping non-optimized `.pyc` files | 13.5 MiB | -18.3 MiB | -57% |
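To get comparable numbers for another installation, a sketch like the following sums file sizes by kind. It is not the script behind the table above; the bucket names are our own, and the directory returned by `sysconfig.get_path("stdlib")` may also contain `site-packages` on some systems, which would inflate the totals.

```python
import sysconfig
from collections import Counter
from pathlib import Path


def classify(path):
    """Bucket a file by the kinds discussed above; None for anything else."""
    name = path.name
    if name.endswith(".opt-1.pyc") or name.endswith(".opt-2.pyc"):
        return "optimized .pyc"
    if name.endswith(".pyc"):
        return "non-optimized .pyc"
    if name.endswith(".py"):
        return ".py"
    return None


def size_by_kind(root):
    """Sum file sizes in bytes under root, grouped by classify()."""
    totals = Counter()
    for path in Path(root).rglob("*"):
        kind = classify(path)
        if kind is not None and path.is_file():
            totals[kind] += path.stat().st_size
    return totals


if __name__ == "__main__":
    stdlib = sysconfig.get_path("stdlib")
    for kind, size in sorted(size_by_kind(stdlib).items()):
        print(f"{kind:20} {size / 2**20:6.1f} MiB")
```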
We have a much more thorough brainstorming document.
One idea we already implemented is that large auto-generated files (`pydoc_data` and several encoding modules) are shipped as `.pyc` only, without source, since the source is not very informative and differences in optimization levels are negligible.