“__pysource__” file layout for installed modules

In our case we also effectively disable pyc generation at runtime so that’s a non-issue. But I’d assume any such importlib change would also not write to a __pycache__ when a versioned pyc was found next to the .py itself.

Performance-wise, your parallel __pysource__ idea might win, since that directory would only ever be looked at when something needs source to go along with a .pyc (at traceback rendering or inspect time), rather than costing a potential additional stat() or directory entry lookup while hunting for the pyc.
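To make the "only looked for on demand" point concrete, here is a minimal sketch. It assumes a layout where the source for pkg/mod.pyc lives at pkg/__pysource__/mod.py; that layout and both function names are hypothetical illustrations, not something importlib implements today.

```python
import linecache
from pathlib import Path


def pysource_candidate(module_file):
    """Map pkg/mod.pyc to pkg/__pysource__/mod.py (hypothetical layout)."""
    path = Path(module_file)
    return path.parent / "__pysource__" / (path.stem + ".py")


def lazy_source_line(module_file, lineno):
    """Return one source line, touching __pysource__ only when asked."""
    candidate = pysource_candidate(module_file)
    if not candidate.is_file():
        # No source installed: a traceback would simply show no code line.
        return ""
    return linecache.getline(str(candidate), lineno)
```

Nothing here runs at import time, so the extra filesystem lookup is only paid by whoever renders a traceback or calls inspect.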

(Disclaimer: I work on the same team as Petr)

From my point of view, it makes sense to have this implemented. I think that the disk space we’ll be able to save with this is more interesting than the import speedup.

I maintain a lot of Python packages in Fedora and RHEL and also some container images on top of those platforms and I think that users running their apps in containerized environments might benefit from this change.

I thought that source-less imports might be even faster than the traditional cache handling with timestamp invalidation (without the sources, there is one file less you have to deal with) but it turned out the times are so small that the difference is almost invisible (at least on Linux). But source-less imports are still 4-7 % faster than imports with hash (in)validation. The exact speedup depends on the size of a module - the bigger a module is the bigger the speedup is.

But I think that the disk space we would be able to save with this in minimal environments is much more interesting. I took 18 of the 20 most popular Python packages on PyPI, found their respective RPMs in Fedora 37 and measured what share of their size belongs to *.py files. The result is avg = 29,6 % and median = 30,2 %. Note that Fedora includes *.pyc files for all optimization levels; if we only included one optimization level, the percentage might be even bigger (but not by much).

Do you happen to have those measurements in absolute size numbers? I.e. 30% smaller is great, but if it’s 70 bytes vs. 100 bytes that isn’t much of a real savings at the cost of having to support this for decades. :wink:

I do. Here are the mentioned 18 of the 20 most downloaded packages from PyPI (botocore and s3transfer are not packaged in Fedora Linux).

| RPM package | total size in kB | .py files in kB | .py files in % |
|---|---|---|---|
| awscli-1.27.4-1.fc37.noarch.rpm | 25420 | 1916 | 7,54 |
| python3-attrs-22.1.0-1.fc37.noarch.rpm | 532 | 208 | 39,10 |
| python3-boto3-1.26.4-1.fc37.noarch.rpm | 1456 | 348 | 23,90 |
| python3-certifi-2021.10.8-3.fc37.noarch.rpm | 52 | 12 | 23,08 |
| python3-charset-normalizer-2.1.0-2.fc37.noarch.rpm | 364 | 140 | 38,46 |
| python3-click-8.1.3-1.fc37.noarch.rpm | 1180 | 356 | 30,17 |
| python3-cryptography-37.0.2-4.fc37.x86_64.rpm | 5428 | 816 | 15,03 |
| python3-dateutil-2.8.2-4.fc37.noarch.rpm | 1036 | 316 | 30,50 |
| python3-google-api-core-2.8.2-4.fc37.noarch.rpm | 1016 | 444 | 43,70 |
| python3-idna-3.3-4.fc37.noarch.rpm | 572 | 280 | 48,95 |
| python3-jinja2-3.0.3-5.fc37.noarch.rpm | 3788 | 552 | 14,57 |
| python3-pyyaml-6.0-5.fc37.x86_64.rpm | 912 | 276 | 30,26 |
| python3-requests-2.28.1-2.fc37.noarch.rpm | 596 | 220 | 36,91 |
| python3-setuptools-62.6.0-2.fc37.noarch.rpm | 9864 | 3612 | 36,62 |
| python3-six-1.16.0-8.fc37.noarch.rpm | 152 | 36 | 23,68 |
| python3-typing-extensions-4.2.0-5.fc37.noarch.rpm | 300 | 72 | 24,00 |
| python3-urllib3-1.26.12-1.fc37.noarch.rpm | 1132 | 428 | 37,81 |
| python3-wheel-0.37.1-4.fc37.noarch.rpm | 428 | 128 | 29,91 |
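For anyone who wants to re-derive the average, the median and the absolute totals, here is a small sketch that recomputes them from the kB columns of the table above. The short package keys are my own shorthand for the RPM names; the numbers are copied from the table.

```python
from statistics import mean, median

# (total size in kB, .py files in kB) per RPM, copied from the table above.
SIZES_KB = {
    "awscli": (25420, 1916),
    "attrs": (532, 208),
    "boto3": (1456, 348),
    "certifi": (52, 12),
    "charset-normalizer": (364, 140),
    "click": (1180, 356),
    "cryptography": (5428, 816),
    "dateutil": (1036, 316),
    "google-api-core": (1016, 444),
    "idna": (572, 280),
    "jinja2": (3788, 552),
    "pyyaml": (912, 276),
    "requests": (596, 220),
    "setuptools": (9864, 3612),
    "six": (152, 36),
    "typing-extensions": (300, 72),
    "urllib3": (1132, 428),
    "wheel": (428, 128),
}

# Per-package share of .py files, in percent.
shares = [100 * py / total for total, py in SIZES_KB.values()]
total_kb = sum(total for total, _ in SIZES_KB.values())
total_py_kb = sum(py for _, py in SIZES_KB.values())

print(f"mean share of .py files:   {mean(shares):.1f} %")
print(f"median share of .py files: {median(shares):.1f} %")
print(f".py files overall: {total_py_kb} kB of {total_kb} kB")
```

In absolute terms the .py files add up to 10160 kB out of 54228 kB for these 18 packages, so roughly 10 MB, with awscli and setuptools dominating both totals.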

Would you distribute .pyc files instead? And if so what sort of space do they take up compared to the equivalent .py files? It seems like Petr’s proposal is for performance reasons by making the .pyc file be the source of truth and for potential space saving by not needing the .py file to begin with, but making it easy to add later as necessary.

I read stdlib and package code often in my work and would not appreciate having to decompress a BLOB to get the source into my editor.

Also file systems like btrfs will transparently compress the data if you use them.

Here is the same table as before with two additional columns describing how much of the total size belongs to .pyc files.

| RPM package | total size in kB | .py files in kB | .py files in % | .pyc files in kB | .pyc files in % |
|---|---|---|---|---|---|
| awscli-1.27.4-1.fc37.noarch.rpm | 25420 | 1916 | 7,54 | 2208 | 8,69 |
| python3-attrs-22.1.0-1.fc37.noarch.rpm | 532 | 208 | 39,10 | 228 | 42,86 |
| python3-boto3-1.26.4-1.fc37.noarch.rpm | 1456 | 348 | 23,90 | 392 | 26,92 |
| python3-certifi-2021.10.8-3.fc37.noarch.rpm | 52 | 12 | 23,08 | 12 | 23,08 |
| python3-charset-normalizer-2.1.0-2.fc37.noarch.rpm | 364 | 140 | 38,46 | 172 | 47,25 |
| python3-click-8.1.3-1.fc37.noarch.rpm | 1180 | 356 | 30,17 | 752 | 63,73 |
| python3-cryptography-37.0.2-4.fc37.x86_64.rpm | 5428 | 816 | 15,03 | 1344 | 24,76 |
| python3-dateutil-2.8.2-4.fc37.noarch.rpm | 1036 | 316 | 30,50 | 464 | 44,79 |
| python3-google-api-core-2.8.2-4.fc37.noarch.rpm | 1016 | 444 | 43,70 | 484 | 47,64 |
| python3-idna-3.3-4.fc37.noarch.rpm | 572 | 280 | 48,95 | 244 | 42,66 |
| python3-jinja2-3.0.3-5.fc37.noarch.rpm | 3788 | 552 | 14,57 | 1224 | 32,31 |
| python3-pyyaml-6.0-5.fc37.x86_64.rpm | 912 | 276 | 30,26 | 344 | 37,72 |
| python3-requests-2.28.1-2.fc37.noarch.rpm | 596 | 220 | 36,91 | 272 | 45,64 |
| python3-setuptools-62.6.0-2.fc37.noarch.rpm | 9864 | 3612 | 36,62 | 5284 | 53,57 |
| python3-six-1.16.0-8.fc37.noarch.rpm | 152 | 36 | 23,68 | 48 | 31,58 |
| python3-typing-extensions-4.2.0-5.fc37.noarch.rpm | 300 | 72 | 24,00 | 168 | 56,00 |
| python3-urllib3-1.26.12-1.fc37.noarch.rpm | 1132 | 428 | 37,81 | 584 | 51,59 |
| python3-wheel-0.37.1-4.fc37.noarch.rpm | 428 | 128 | 29,91 | 252 | 58,88 |

Currently, the RPM packages contain source files and 3 .pyc files for each source file (for all 3 optimization levels, if they are different from each other).

We’d like to ship only one .pyc file for each module and move the source files to separate RPM packages. The dependency between the main package and the one with source files might be weak, which means that you’ll have the sources installed by default in the __pysource__ directory, but in special environments (containers, for example) you can manually omit the weak dependencies and save a lot of space.

If we implement this, we’ll probably start with the Python interpreter itself and its standard library which is also divided into multiple RPM packages so I can try to calculate the same for that as well.

Actually, only for O0 and O1. Also, we hardlink them if identical.
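A rough way to see when two optimization levels produce identical files (and are therefore candidates for hardlinking) is to compile the same source at both levels and compare the bytes. This is only a sketch with a hypothetical helper name; it pins timestamp invalidation so that both files get the same header. As far as I know, compileall also has a hardlink_dupes option (Python 3.9+) that does the deduplication for you.

```python
import py_compile
import tempfile
from pathlib import Path


def opt_levels_identical(source_text):
    """Compile at optimization levels 0 and 1; report whether the .pyc bytes match."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "mod.py"
        src.write_text(source_text)
        src_name = str(src)
        outputs = []
        for level in (0, 1):
            pyc = Path(tmp) / f"mod.opt{level}.pyc"
            py_compile.compile(
                src_name,
                cfile=str(pyc),
                doraise=True,
                optimize=level,
                invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
            )
            outputs.append(pyc.read_bytes())
        return outputs[0] == outputs[1]


# A module without assert statements compiles the same at both levels.
print(opt_levels_identical("x = 1\n"))
# Level 1 strips asserts, so the bytecode differs.
print(opt_levels_identical("def f(x):\n    assert x > 0\n    return x\n"))
```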

Why do you need the __pysource__ folder to package without .py files?
Are you conflating a CPU performance change with a disk space issue?


IIRC Python itself (python3-libs) contains cached files for all three optimization levels.


In general, one source RPM can build multiple binary RPMs, which is the case here. We can, for example, build python3-requests containing only a single .pyc file for each module and python3-requests-sources with __pysource__ folders and source files in them.

If you install only python3-requests, it should be faster and consume less disk space. If you install both python3-requests and python3-requests-sources, it should still be faster because Python ignores the __pysource__ folder at import time, but you’ll have the possibility to copy source files from the directory and modify them for debugging.

I’m just trying to say that this change might be beneficial for both performance and disk space.

You have not answered why it is necessary to have a __pysource__ folder.

FYI, I have years of experience packaging Python in RPMs, for Fedora and at my day job.

I package .pyc, .pyo in one RPM and .py in another without needing a special folder.

P.S. watch out for dunder being the bold markup

Because if you have module.py and module.pyc in the same folder, Python uses SourceFileLoader and imports from module.py which is not what we want. We want to prioritize .pyc files even if the sources are present.
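This priority can be observed with a small sketch that builds a FileFinder with the same loader order the default path hook uses (source before bytecode) and asks it for a module that has both mod.py and a legacy mod.pyc next to it. The function name and the temporary module are just illustrations.

```python
import py_compile
import tempfile
from importlib import machinery
from pathlib import Path


def loader_names():
    """Return the loader class chosen with and without mod.py present."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "mod.py"
        src.write_text("ORIGIN = 'demo'\n")
        # Legacy layout: mod.pyc sits next to mod.py, not in __pycache__.
        py_compile.compile(str(src), cfile=str(Path(tmp) / "mod.pyc"), doraise=True)

        # Source suffixes are registered before bytecode suffixes.
        finder = machinery.FileFinder(
            tmp,
            (machinery.SourceFileLoader, machinery.SOURCE_SUFFIXES),
            (machinery.SourcelessFileLoader, machinery.BYTECODE_SUFFIXES),
        )
        with_source = type(finder.find_spec("mod").loader).__name__

        src.unlink()
        finder.invalidate_caches()  # FileFinder caches directory listings
        without_source = type(finder.find_spec("mod").loader).__name__
        return with_source, without_source


print(loader_names())
```

As long as mod.py is present, the sourceless loader is never selected, which is why the sources would have to live somewhere the finder does not look.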

Ok you are aiming for two changes.

  1. Never check the .py file even if present because it’s faster(?)
  2. Install a smaller package by removing .py files

Never check the .py file even if present because it’s faster(?)

Yes. It is measurably faster than hash-based invalidation.
Hash invalidation is currently the default when SOURCE_DATE_EPOCH is set, as is the case in most distro builds.
AFAIK, both Fedora and Debian turn it off due to the cost. In Fedora we’d like to get rid of that patch. But hash-based invalidation has benefits (see PEP 552). Rather than suggesting that CPython change the default, I propose a way to avoid the cost and still enjoy the benefits.
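For readers who have not looked at PEP 552 pycs, here is a small sketch that compiles a throwaway module in each invalidation mode and reads the header back. The helper name is hypothetical; the header layout (4 bytes magic, 4 bytes flags, then 8 bytes holding either mtime plus size or the source hash) is the one described in PEP 552.

```python
import importlib.util
import py_compile
import tempfile
from pathlib import Path

Mode = py_compile.PycInvalidationMode


def pyc_header(source_text, mode):
    """Compile source_text and return (flags, 8-byte validation field)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "mod.py"
        src.write_bytes(source_text.encode())
        pyc = Path(tmp) / "mod.pyc"
        py_compile.compile(
            str(src), cfile=str(pyc), doraise=True, invalidation_mode=mode
        )
        header = pyc.read_bytes()[:16]
    # Bytes 0-3: magic number, 4-7: flags, 8-15: mtime+size or source hash.
    flags = int.from_bytes(header[4:8], "little")
    return flags, header[8:16]


for mode in (Mode.TIMESTAMP, Mode.CHECKED_HASH, Mode.UNCHECKED_HASH):
    flags, field = pyc_header("x = 1\n", mode)
    print(mode.name, bin(flags), field.hex())
```

With a checked hash, every import has to read and hash the source file to compare it against that field (importlib.util.source_hash computes the same value), which is the cost being discussed here.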

Install smaller package by removing .py files

Yes. It’s not for everyone, I don’t think it should be the default, but there are definitely users who want to trade debuggability for install size.
(Arguably, there are some modules where this should be default: codecs and pydoc topics are big and uninteresting)

Oh: another point that the next version of the proposal should make is that __pysource__ can be easily extended to other currently “sourceless” modules: compiled extensions. Tools like Cython, which synthesize traceback entries corresponding to the original source code, could “just” ship that code in __pysource__ and have Python pick it up.


In other words, __pysource__ is the place where the lines of code referenced by the .pyc/.pyo files are obtained from. And it does not have to be .py source code.

Does the .pyc/.pyo have the full filename of the “source”, including the .py extension?

No, just in .pyc files :)
And yes, .pyc files have the source filename.
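If you want to see the recorded filename yourself, the following sketch compiles a throwaway module, skips the 16-byte header (the size used since Python 3.7) and unmarshals the code object. The function name is made up for the example.

```python
import marshal
import py_compile
import tempfile
from pathlib import Path


def recorded_source_name(source_text="x = 1\n"):
    """Compile a throwaway module and return (source path, co_filename)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "mod.py"
        src.write_text(source_text)
        pyc = Path(tmp) / "mod.pyc"
        py_compile.compile(str(src), cfile=str(pyc), doraise=True)
        # Skip the 16-byte header, then unmarshal the module's code object.
        code = marshal.loads(pyc.read_bytes()[16:])
        return str(src), code.co_filename


path, recorded = recorded_source_name()
print(recorded == path)
```

The name stored in the code object is the full path including the .py extension, and it is what tracebacks print even when the source file itself is gone.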

.pyo files haven’t been used since Python 3.5.


There is another option for that which could keep source… create a future .pyc format to include an optional compressed source file at the end (with a format designed to not have that be loaded and unmarshalled into memory by default - merely loaded and decompressed on demand by the importer when a traceback rendering or debugger asks for source lines).

I’ve never had a need for this, just pointing out the possibility if someone wanted to go that route instead.

Outside of generated code I tend to find Python source code is not huge relative to everything else on a system/container/image today.
