Hello. In Fedora, we are looking at ways to minimize the filesystem footprint of Python. One of the ideas is to deduplicate identical .pyc files across optimization levels by hardlinking them to one another. In the 3.8 standard library, this could save over 10 % of what we ship by default. If we do it with all our Python packages, we could save a lot of space. As a single data point: on my workstation I have 360 MiB of various Python 3.7 bytecode files in /usr, and I can save 108 MiB by doing this.
We use the compileall module to byte-compile the files. We have recently added the ability to compile for several optimization levels at once (for example, compileall -o0 -o1 -o2).
The idea here is to add an optional flag (called, for example, --hardlink-dupes/-H) that would compare the bytecode caches of the different optimization levels after compiling them and, if they are identical in content, compileall would make them hardlinks instead of separate copies.
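To make the intent concrete, here is a rough sketch of the compare-and-link step (the helper name and structure are only illustrative, not an actual patch), assuming we are handed the cache paths that were just written for one source file:

import filecmp
import os

def _hardlink_identical_pycs(pyc_paths):
    # pyc_paths: the .pyc files produced for one source file at
    # optimization levels 0, 1 and 2, in that order.
    previous = None
    for path in pyc_paths:
        if previous is not None and filecmp.cmp(previous, path, shallow=False):
            # Byte-for-byte identical: replace this cache with a
            # hardlink to the previous one.
            os.unlink(path)
            os.link(previous, path)
        previous = path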
Example:
$ python -m compileall -o0 -o1 -o2 --hardlink-dupes ...
This would hardlink module.cpython-3?.pyc, module.cpython-3?.opt-1.pyc and module.cpython-3?.opt-2.pyc together if all three are identical, or just the two files that are identical, or nothing if all three differ.
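As an illustration, one could then verify that the deduplicated caches really share an inode (the file names below are placeholders):

import os

base = "__pycache__/module.cpython-38.pyc"
opt1 = "__pycache__/module.cpython-38.opt-1.pyc"

# True if the two caches point at the same data on disk
print(os.path.samefile(base, opt1))
# Number of hardlinks sharing that data
print(os.stat(base).st_nlink)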
Given the nature of the bytecode caches, the non-optimized, optimized level 1 and optimized level 2 .pyc
files may or may not be identical.
Consider the following Python module:
1
All three bytecode cache files would be identical.
While with:
assert 1
Only the two optimized cache files would be identical to each other.
And this:
"""Dummy module docstring"""
1
Would produce two identical bytecode cache files (non-optimized and opt-1), but the opt-2 file would differ.
Only a module like this would produce three different files:
"""Dummy module docstring"""
assert 1
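For anyone who wants to reproduce this behaviour by hand, a small experiment with py_compile (the file name is arbitrary) compiles the last example at the three levels and compares the resulting caches byte for byte:

import pathlib
import py_compile

# Write the last example module to disk.
source = pathlib.Path("example.py")
source.write_text('"""Dummy module docstring"""\nassert 1\n')

# Compile it at optimization levels 0, 1 and 2 into separate cache files.
caches = [
    py_compile.compile(str(source), cfile=f"example.opt-{level}.pyc",
                       optimize=level)
    for level in (0, 1, 2)
]

contents = [pathlib.Path(cache).read_bytes() for cache in caches]
print(contents[0] == contents[1])  # False: -O strips the assert
print(contents[1] == contents[2])  # False: -OO also strips the docstring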
Is this idea worth pursuing? We'd rather have the support in compileall itself than in some external tool. We are willing to maintain that code.