In our case we also effectively disable pyc generation at runtime so that’s a non-issue. But I’d assume any such importlib change would also not write to a __pycache__ when a versioned pyc was found next to the .py itself.
Performance-wise… your parallel __pysource__ idea might win, as it would only ever be looked for when something needs source to go along with a .pyc (at traceback rendering, or inspect time), rather than adding a potential extra stat() or directory-entry lookup while hunting for the .pyc.
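For illustration only, here is roughly what such an on-demand lookup could look like. The function name and the exact naming scheme (source file sitting under a sibling `__pysource__` directory with the same stem) are my assumptions, not something the proposal has fixed; the point is that this path is only ever touched when source is actually requested, never on the import fast path.

```python
import pathlib
import tempfile

def pysource_path(pyc_path):
    """Hypothetical resolution for the proposed layout: look for the
    source next to the .pyc, under a __pysource__ subdirectory.
    Only called when tracebacks/inspect actually need source lines."""
    pyc = pathlib.Path(pyc_path)
    candidate = pyc.parent / "__pysource__" / (pyc.stem + ".py")
    return candidate if candidate.exists() else None

# Tiny demonstration with a throwaway layout.
pkg = pathlib.Path(tempfile.mkdtemp())
(pkg / "__pysource__").mkdir()
(pkg / "__pysource__" / "mod.py").write_text("x = 1\n")
found = pysource_path(pkg / "mod.pyc")     # source shipped -> Path
missing = pysource_path(pkg / "other.pyc") # source omitted -> None
print(found, missing)
```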
From my point of view, it makes sense to have this implemented. I think that the disk space we'll be able to save with this is more interesting than the import speedup.
I maintain a lot of Python packages in Fedora and RHEL and also some container images on top of those platforms and I think that users running their apps in containerized environments might benefit from this change.
I thought that source-less imports might be even faster than the traditional cache handling with timestamp invalidation (without the sources, there is one file less you have to deal with), but it turned out the times are so small that the difference is almost invisible (at least on Linux). But source-less imports are still 4-7 % faster than imports with hash (in)validation. The exact speedup depends on the size of the module: the bigger the module, the bigger the speedup.
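For readers who want to see the mechanics being compared: the sketch below compiles a module with PEP 552 hash-based invalidation and then imports the bare .pyc through the stdlib's SourcelessFileLoader. The sourceless path never stats or hashes a source file, which is where the measured saving comes from. The module name and contents are throwaway examples.

```python
import importlib.machinery
import importlib.util
import pathlib
import py_compile
import tempfile

tmp = pathlib.Path(tempfile.mkdtemp())
src = tmp / "demo.py"
src.write_text("VALUE = 42\n")

# Compile with hash-based invalidation (PEP 552), the mode the
# sourceless numbers above are compared against.
pyc = py_compile.compile(
    str(src),
    cfile=str(tmp / "demo.pyc"),
    invalidation_mode=py_compile.PycInvalidationMode.CHECKED_HASH,
)

# A sourceless import reads the .pyc directly; with no source file,
# there is nothing to hash or timestamp-check.
loader = importlib.machinery.SourcelessFileLoader("demo", pyc)
spec = importlib.util.spec_from_loader("demo", loader)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)
print(mod.VALUE)  # -> 42
```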
But I think that the disk space we would be able to save with this in minimal environments is much more interesting. I took 18 out of the 20 most popular Python packages on PyPI, found their respective RPMs in Fedora 37, and measured what portion of their size belongs to *.py files; the result is avg = 29.6 % and median = 30.2 %. Note that Fedora includes *.pyc files for all optimization levels; if we only included one optimization level, the percentage might be even bigger (but not by much).
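Not the exact script used for those numbers, but the measurement is easy to reproduce along these lines: walk an installed package's file tree and compute what fraction of the bytes sit in .py files. The demonstration tree at the bottom is synthetic, just to show the arithmetic.

```python
import pathlib
import tempfile

def py_fraction(root):
    """Fraction of on-disk bytes under *root* contributed by .py files."""
    total = py_bytes = 0
    for path in pathlib.Path(root).rglob("*"):
        if path.is_file():
            size = path.stat().st_size
            total += size
            if path.suffix == ".py":
                py_bytes += size
    return py_bytes / total if total else 0.0

# Synthetic example: a 300-byte source next to a 700-byte cache file.
root = pathlib.Path(tempfile.mkdtemp())
(root / "mod.py").write_bytes(b"x" * 300)
(root / "mod.pyc").write_bytes(b"y" * 700)
frac = py_fraction(root)
print(f"{frac:.0%}")  # -> 30%
```

In practice you would point `py_fraction` at something like an unpacked RPM payload or a site-packages subdirectory.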
Do you happen to have those measurements in absolute size numbers? I.e. 30% smaller is great, but if it’s 70 bytes vs. 100 bytes that isn’t much of a real savings at the cost of having to support this for decades.
Would you distribute .pyc files instead? And if so what sort of space do they take up compared to the equivalent .py files? It seems like Petr’s proposal is for performance reasons by making the .pyc file be the source of truth and for potential space saving by not needing the .py file to begin with, but making it easy to add later as necessary.
Currently, the RPM packages contain source files and 3 .pyc files for each source file (for all 3 optimization levels, if they are different from each other).
We’d like to ship only one .pyc file for each module and move the source files to separate RPM packages. The dependency between the main package and the one with source files might be a weak dependency, which means that you’ll have the sources installed by default in the __pysource__ directory, but in special environments (containers, for example) you can manually omit the weak dependencies and save a lot of space.
If we implement this, we’ll probably start with the Python interpreter itself and its standard library which is also divided into multiple RPM packages so I can try to calculate the same for that as well.
In general, one source RPM can build multiple binary RPMs, which is the case here. We can, for example, build python3-requests containing only a single .pyc file for each module and python3-requests-sources with pysource folders and source files in them.
If you install only python3-requests, it should be faster and consume less disk space. If you install both python3-requests and python3-requests-sources, it should still be faster because Python ignores the pysource folder during import, but you’ll still be able to copy source files out of that directory and modify them for debugging.
I’m just trying to say that this change might be beneficial for both performance and disk space.
Because if you have module.py and module.pyc in the same folder, Python uses SourceFileLoader and imports from module.py which is not what we want. We want to prioritize .pyc files even if the sources are present.
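This priority is easy to observe with today's machinery. The script below builds a bare .pyc (a sourceless module), imports it, then reintroduces a .py with different contents and imports again: the finder switches from SourcelessFileLoader to SourceFileLoader and the source wins. Module name and contents are throwaway examples.

```python
import importlib
import importlib.util
import pathlib
import py_compile
import sys
import tempfile

tmp = pathlib.Path(tempfile.mkdtemp())

# Build a sourceless module: compile demo.py, then delete the source
# so only a bare demo.pyc sits in the directory.
src = tmp / "demo.py"
src.write_text("WHO = 'pyc'\n")
py_compile.compile(str(src), cfile=str(tmp / "demo.pyc"))
src.unlink()

sys.path.insert(0, str(tmp))
import demo
sourceless_loader = type(demo.__loader__).__name__  # SourcelessFileLoader

# Reintroduce the source file: the finder now prefers it over the .pyc.
src.write_text("WHO = 'py'\n")
del sys.modules["demo"]
importlib.invalidate_caches()  # drop the finder's directory cache
import demo
source_loader = type(demo.__loader__).__name__  # SourceFileLoader
print(sourceless_loader, source_loader, demo.WHO)
```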
Never check the .py file even if present because it’s faster(?)
Yes. It is measurably faster than hash-based invalidation.
Hash invalidation is currently the default when SOURCE_DATE_EPOCH is set, as is the case in most distro builds.
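You can verify that default directly: with SOURCE_DATE_EPOCH in the environment, py_compile picks CHECKED_HASH invalidation (PEP 552), and the chosen mode is recorded in the flags word at offset 4 of the .pyc (bit 0 = hash-based, bit 1 = check_source). The module here is a throwaway example.

```python
import os
import pathlib
import py_compile
import tempfile

tmp = pathlib.Path(tempfile.mkdtemp())
src = tmp / "demo.py"
src.write_text("x = 1\n")

# Any value works; distro builds set this for reproducibility.
os.environ["SOURCE_DATE_EPOCH"] = "315532800"
pyc = pathlib.Path(py_compile.compile(str(src), cfile=str(tmp / "demo.pyc")))

# Flags word lives right after the 4-byte magic number.
flags = int.from_bytes(pyc.read_bytes()[4:8], "little")
print(flags & 0b11)  # -> 3: hash-based with check_source set
```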
AFAIK, both Fedora and Debian turn it off due to the cost. In Fedora we’d like to get rid of that patch. But hash-based invalidation has benefits (see PEP 552). Rather than suggesting that CPython change the default, I propose to avoid the cost and enjoy the benefits.
Install smaller package by removing .py files
Yes. It’s not for everyone, I don’t think it should be the default, but there are definitely users who want to trade debuggability for install size.
(Arguably, there are some modules where this should be default: codecs and pydoc topics are big and uninteresting)
Oh: another point that the next version of the proposal should make is that __pysource__ can be easily extended to other currently “sourceless” modules: compiled extensions. Tools like Cython, which synthesize traceback entries corresponding to the original source code, could “just” ship that code in __pysource__ and have Python pick it up.
In other words, the pysource directory is the place where the lines of code referenced in the .pyc/.pyo files are obtained from. And it does not have to be .py source code.
Does the .pyc/.pyo have the full filename of the “source”, including the .py extension?
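It does: the marshalled code object inside the .pyc carries co_filename, which is the path the module was compiled as, .py extension included. One way to check (the 16-byte header layout below is the format used since Python 3.7):

```python
import marshal
import pathlib
import py_compile
import tempfile

tmp = pathlib.Path(tempfile.mkdtemp())
src = tmp / "demo.py"
src.write_text("x = 1\n")
pyc = pathlib.Path(py_compile.compile(str(src)))

data = pyc.read_bytes()
# .pyc layout since 3.7: magic (4 bytes) + flags (4) + source
# metadata (8), then the marshalled code object.
code = marshal.loads(data[16:])
print(code.co_filename)  # the path demo.py was compiled as, extension and all
```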
There is another option for that which could keep source… create a future .pyc format that includes an optional compressed source file at the end (designed so that part is not loaded and unmarshalled into memory by default, merely decompressed on demand by the importer when traceback rendering or a debugger asks for source lines).
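To make the idea concrete, here is a minimal sketch of such a trailer, under my own invented layout (this is not any real or proposed .pyc format): compressed source appended after the code object, followed by a 4-byte length so a reader can locate it from the end of the file without touching the marshalled data.

```python
import struct
import zlib

def append_source(pyc_bytes, source_text):
    """Hypothetical format: zlib-compressed source appended after the
    code object, with a trailing 4-byte little-endian length."""
    blob = zlib.compress(source_text.encode("utf-8"))
    return pyc_bytes + blob + struct.pack("<I", len(blob))

def extract_source(data):
    """Decompress the trailer on demand; the marshalled code object
    at the front of the file is never read."""
    (blob_len,) = struct.unpack("<I", data[-4:])
    return zlib.decompress(data[-4 - blob_len:-4]).decode("utf-8")

source = "def f():\n    return 1\n"
augmented = append_source(b"<existing pyc bytes>", source)
print(extract_source(augmented) == source)  # -> True
```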
I’ve never had a need for this, just pointing out the possibility if someone wanted to go that route instead.
Outside of generated code I tend to find Python source code is not huge relative to everything else on a system/container/image today.