Hello. In Fedora, we are looking at ways to minimize the filesystem footprint of Python. One of the ideas is to deduplicate identical .pyc files across optimization levels by hardlinking them to one another. In the 3.8 standard library, this could save more than 10% of what we ship by default. If we did this with all our Python packages, we could save a lot of space. As a single data point: on my workstation I have 360 MiB of various Python 3.7 bytecode files in /usr, and I could save 108 MiB by doing this.
We use the compileall module to byte-compile the files. We have recently added the ability to compile for multiple optimization levels at once (for example, compileall -o0 -o1 -o2).
The idea here is to add an optional flag (called, for example, --hardlink-dupes/-H) that would compare the bytecode caches of the different optimization levels after compiling them; if they are identical in content, compileall would make them hardlinks instead of separate copies.
Example:
$ python -m compileall -o0 -o1 -o2 --hardlink-dupes ...
This would hardlink module.cpython-3?.pyc, module.cpython-3?.opt-1.pyc and module.cpython-3?.opt-2.pyc together if all three are identical, or just the two identical files, or nothing if they all differ.
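To make the behavior concrete, here is a minimal sketch of what such a deduplication step could look like. This is only an illustration, not the actual compileall patch: the helper name and the choice of filecmp/os.link are my assumptions.

import filecmp
import importlib.util
import os

def hardlink_duplicate_caches(source_path, levels=(0, 1, 2)):
    """Hypothetical helper: hardlink identical .pyc caches of one source file.

    Walks the caches for the given optimization levels in order and replaces
    any cache that is byte-for-byte identical to the previous one with a
    hardlink to it.
    """
    caches = [
        # Level 0 has no "opt-" tag in its cache file name (PEP 488).
        importlib.util.cache_from_source(
            source_path, optimization='' if level == 0 else level)
        for level in levels
    ]
    previous = caches[0]
    for cache in caches[1:]:
        # shallow=False compares file contents, not just os.stat() results.
        if filecmp.cmp(previous, cache, shallow=False):
            os.unlink(cache)
            os.link(previous, cache)  # identical content -> hardlink, no copy
        else:
            previous = cache  # different content -> dedupe against this one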
Given the nature of the bytecode caches, the non-optimized, optimization level 1, and optimization level 2 .pyc files may or may not be identical.
Consider the following Python module:
1
All three bytecode cache files would be identical.
While with:
assert 1
Only the two optimized cache files (opt-1 and opt-2) would be identical to each other, since both optimization levels drop the assert while the non-optimized cache keeps it.
And this:
"""Dummy module docstring"""
1
Would produce two identical bytecode cache files (the non-optimized one and opt-1), while the opt-2 file would differ, since docstrings are only stripped at optimization level 2.
Only a module like this one, containing both a docstring and an assert, would produce three different files:
"""Dummy module docstring"""
assert 1
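For anyone who wants to check this on their own modules, here is an illustrative snippet (the file name example.py is just a placeholder): byte-compile one file at all three levels with py_compile and hash the resulting caches.

import hashlib
import py_compile

source = 'example.py'
for level in (0, 1, 2):
    # py_compile picks the PEP 488 cache path for the given level
    # (module.cpython-3?.pyc, .opt-1.pyc, .opt-2.pyc) and returns it.
    cache = py_compile.compile(source, optimize=level, doraise=True)
    with open(cache, 'rb') as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    print(level, cache, digest[:16])

Because the cache header records only the source's metadata, not the optimization level, identical bytecode yields byte-for-byte identical cache files, so matching digests mean the files could be hardlinked.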
Is this idea worth doing? We would rather have the support live in compileall than do it with some external tool. We are willing to maintain that code.