Compileall option to hardlink duplicate optimization levels bytecode cache files

Hello. In Fedora, we are looking at ways to minimize the filesystem footprint of Python. One idea is to deduplicate identical .pyc files across optimization levels by hardlinking them to each other. In the 3.8 standard library, this could save 10+% of what we ship by default. If we did it for all our Python packages, we could save a lot of space. As a single data point: on my workstation I have 360 MiB of various Python 3.7 bytecode files in /usr, and I can save 108 MiB by doing this.

We use the compileall module to byte-compile the files. We have recently added the ability to compile for multiple optimization levels at a time (for example, compileall -o0 -o1 -o2).

The idea here is to add an optional flag (called, for example, --hardlink-dupes/-H) that would compare the bytecode caches of the different optimization levels after compiling them; if they are identical in content, compileall would make them hardlinks instead of separate copies.

Example:

$ python -m compileall -o0 -o1 -o2 --hardlink-dupes ...

This would hardlink module.cpython-3?.pyc with module.cpython-3?.opt-1.pyc and module.cpython-3?.opt-2.pyc if all three are identical. Or just the two files that are identical. Or nothing if they all differ.
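
A minimal sketch of what the deduplication step could look like (maybe_hardlink is a hypothetical helper name for illustration; the actual implementation in compileall may differ):

    import filecmp
    import os

    def maybe_hardlink(source, dest):
        """Replace dest with a hardlink to source if their contents match."""
        # shallow=False forces a byte-by-byte comparison instead of
        # trusting os.stat() metadata alone.
        if filecmp.cmp(source, dest, shallow=False):
            os.unlink(dest)
            os.link(source, dest)

    # For example:
    # maybe_hardlink("module.cpython-38.pyc", "module.cpython-38.opt-1.pyc")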

Given the nature of the bytecode caches, the non-optimized, optimization level 1, and optimization level 2 .pyc files may or may not be identical.

Consider the following Python module:

1

All three bytecode cache files would be identical.

While with:

assert 1

Only the two optimized cache files (opt-1 and opt-2) would be identical to each other.

And this:

"""Dummy module docstring"""
1

Would produce two identical bytecode cache files (non-optimized and opt-1), while the opt-2 file would differ.

Only modules like this would produce 3 different files:

"""Dummy module docstring"""
assert 1
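
You can verify this behaviour yourself by compiling a module at each optimization level and comparing the resulting cache files byte for byte. A quick sketch (file names are made up for the demonstration):

    import py_compile

    # A test module containing only an assert statement.
    with open("example.py", "w") as f:
        f.write("assert 1\n")

    # Compile once per optimization level into explicit cache files.
    caches = [
        py_compile.compile("example.py", cfile=f"example.opt-{n}.pyc", optimize=n)
        for n in (0, 1, 2)
    ]

    # The .pyc headers embed the same source metadata, so equal content
    # means the whole files compare equal.
    blobs = [open(c, "rb").read() for c in caches]
    print(blobs[0] == blobs[1])  # False: -O strips the assert
    print(blobs[1] == blobs[2])  # True: no docstring for -OO to strip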

Is this idea worth doing? We’d rather have the support in compileall than do it with some external tool. We are willing to maintain that code.


That sounds like a clever way to save some disk space with no obvious downsides, assuming the overall wall-clock time of a make install is not significantly increased. If someone does a naive copy of the files afterwards, the worst that should happen is that you end up with separate files again, the same as today. The code would need to handle the case where hardlinks aren’t supported on the filesystem in use. Unless someone thinks of another downside, I’d say let’s do it.
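
For the unsupported-filesystem case, one approach is to create the link under a temporary name and keep the plain copy on failure. A sketch, assuming a hypothetical try_hardlink helper (not the committed implementation):

    import os

    def try_hardlink(source, dest):
        """Replace dest with a hardlink to source, keeping the plain
        copy when the filesystem refuses to create hardlinks."""
        tmp = dest + ".tmp"
        try:
            os.link(source, tmp)   # raises OSError on e.g. FAT
        except OSError:
            return False           # leave the regular file untouched
        os.replace(tmp, dest)      # atomically swap the link in
        return True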

The idea is to make the option non-default, at least for now, so make install won’t be affected yet. We would use it in Fedora, and later we can have another conversation about making it the default on POSIX-like systems. For the same reason, we might want to have the option available on all OSes – users of Windows might try it, but they won’t be affected by hardlink quirks until they opt in. If we realize it may never work there, we can eventually make the option POSIX-only.

There are separate general-purpose utilities that deduplicate files. Why not use them in your build scripts? It is also easy to write a Python script that does this for this particular case.

What happens if you update a .py file and run compileall, or just import the Python file? Would the compiler not overwrite the content of the linked .pyc files for the other optimization levels?

For the optimization to have an effect, you need a Python file that contains no docstrings (very unlikely) and no assert statements or uses of the __debug__ constant.

We use compileall in various places: when we build Python, when we build RPM packages from Python packages, and when we build custom RPM packages that happen to contain some importable Python modules. If we add this functionality to the module, we just need to pass the new flag in those various places, and others can benefit from it as well. If we instead adapted our build scripts with additional code, we would need to maintain it somewhere else or copy it around, and no third party would benefit. Generally, my motivation is that doing it in compileall makes sense – it’s easier for us, and others can benefit as well.

The compiler replaces the files, which effectively “unlinks” them. We might want to add some tests to make sure this is always the case. When you update the source and run Python at different optimization levels, the files are no longer linked, even if their contents are the same. The worst-case scenario is that you end up with the same number of bytes as before using the new deduplication switch.
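
That behaviour follows from how CPython writes bytecode: the cache is written to a temporary file and then moved over the destination with os.replace(), which gives the destination a fresh inode rather than mutating the shared one. A small demonstration of the principle (not the import machinery itself):

    import os

    # Two hardlinked files sharing one inode.
    with open("a.pyc", "wb") as f:
        f.write(b"old bytecode")
    os.link("a.pyc", "b.pyc")
    assert os.stat("a.pyc").st_nlink == 2

    # Recompilation writes a temporary file and renames it over the
    # target, which breaks the link instead of rewriting both files.
    with open("a.pyc.tmp", "wb") as f:
        f.write(b"new bytecode")
    os.replace("a.pyc.tmp", "a.pyc")

    assert os.stat("a.pyc").st_nlink == 1          # fresh inode
    assert open("b.pyc", "rb").read() == b"old bytecode"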

The standard library (Python 3.8.1, various test modules excluded; see the sketch after this list for one way to gather such counts):

  • 607 modules have bytecode files
  • 454 identical optimization 0 and 1 pairs
  • 68 identical optimization 1 and 2 pairs
  • 62 identical optimization 0, 1 and 2 triads (already counted in both of the above)
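
Counts like these can be gathered with a short script along the following lines (the Python path and version tag here are assumptions for the example; the actual numbers above come from the notebook linked below):

    import filecmp
    from pathlib import Path

    total = same01 = same12 = same_all = 0
    for pyc in Path("/usr/lib/python3.8").rglob("__pycache__/*.cpython-38.pyc"):
        opt1 = pyc.with_name(pyc.name.replace(".pyc", ".opt-1.pyc"))
        opt2 = pyc.with_name(pyc.name.replace(".pyc", ".opt-2.pyc"))
        if not (opt1.exists() and opt2.exists()):
            continue
        total += 1
        eq01 = filecmp.cmp(pyc, opt1, shallow=False)
        eq12 = filecmp.cmp(opt1, opt2, shallow=False)
        same01 += eq01
        same12 += eq12
        same_all += eq01 and eq12

    print(total, same01, same12, same_all)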

We anticipate that the files without docstrings are most likely empty; OTOH, most of the non-empty files have no asserts or __debug__ conditionals.

Data from document.md in the hroncok/python-minimization repository on GitHub, computed in python-minimization.ipynb in the same repository.

We have a working draft implementation ready at https://github.com/fedora-python/compileall2/pull/19

How should we proceed? We can either start using our implementation in Fedora and come back with some data, or we can propose an upstream pull request. The discussion here seems stalled. Should we bring this up on the python-dev mailing list instead?

I think the discussion is over rather than stalled. Ned said “let’s do it” and Serhiy’s points have been addressed. (Serhiy, please shout if you’re -1 on this; from here it doesn’t look like you are.)

I’d go for opening a pull request after some testing in Fedora.


The feature is now part of the compileall2 0.7 release.

The installation size of Python 3.9 goes from 125 MiB down to 103 MiB.

I just merged the PR: https://github.com/python/cpython/commit/e77d428856fbd339faee44ff47214eda5fb51d57 will be part of the upcoming Python 3.9.0 beta 1!
