“__pysource__” file layout for installed modules

encukou · March 25, 2022, 3:24pm

Hello,
Here’s a proposal to fix several niggles we found when distributing Python libraries in Fedora. What do you think?

Abstract

For modules loaded directly from bytecode cache (*.pyc) files, Python will
look for corresponding source in a __pysource__ directory.

The existing ability to load modules from *.pyc files only is
unchanged, but conceptually it becomes a special case of a “pyc-first”
file layout.

Motivation

Most pure Python code is installed as a source file (*.py), combined with a
bytecode cache file (__pycache__/*.pyc), which is created/updated ahead of
time or on demand.

This layout is designed for rapid iteration. Each time a module is imported,
Python assumes the source might have changed: if a bytecode cache is present,
Python normally checks whether it still corresponds to the source.

PEP 552 introduced an “unchecked” mode, in which this check is skipped.
However, this causes updates to the source to be silently ignored, possibly
confusing users that aren’t aware of this rarely used mode.
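
A minimal sketch of that pitfall, using only the stdlib compileall, py_compile and importlib APIs. The module name unchecked_demo and the helper stale_import_value are made up for illustration, and the sketch assumes the interpreter runs with the default --check-hash-based-pycs setting:

```python
import compileall
import importlib
import py_compile
import sys
import tempfile
from pathlib import Path


def stale_import_value(name="unchecked_demo"):
    """Compile a module in unchecked-hash mode, edit its source, then import it."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp, name + ".py")
        src.write_text("VALUE = 'old'\n")
        # Writes __pycache__/<name>.<tag>.pyc with the "unchecked hash" flag set.
        compileall.compile_dir(
            tmp,
            quiet=1,
            invalidation_mode=py_compile.PycInvalidationMode.UNCHECKED_HASH,
        )
        src.write_text("VALUE = 'new'\n")  # this edit is not noticed on import
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            return importlib.import_module(name).VALUE
        finally:
            sys.path.remove(tmp)
            sys.modules.pop(name, None)
```

Under those assumptions, stale_import_value() returns 'old' even though the source on disk says 'new'.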

The remaining checking modes have their own disadvantages.
In both modes, even in the best-case scenario (the cache is present and fresh), Python must
access at least two files (the source and the cache). Further:

In the timestamp-based mode, the source file’s last-modification time is
used as part of the cache key, causing issues with reproducible builds
as described in PEP 552.
In the hash-based mode, the entire source file is read and hashed.
This is potentially a slow operation. [XXX data needed.]
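
The mode a given cache file uses is recorded in its 16-byte header, as laid out in PEP 552. The following sketch reads that header back; the helper names pyc_invalidation_mode and compile_and_classify are hypothetical and not part of any stdlib API:

```python
import importlib.util
import py_compile
import tempfile
from pathlib import Path


def pyc_invalidation_mode(pyc_path):
    """Name the PEP 552 invalidation mode recorded in a .pyc header."""
    header = Path(pyc_path).read_bytes()[:16]
    if header[:4] != importlib.util.MAGIC_NUMBER:
        raise ValueError("not a .pyc written by this interpreter version")
    flags = int.from_bytes(header[4:8], "little")
    if not flags & 0b01:
        return "timestamp"  # bytes 8-15 hold the source mtime and size
    # bytes 8-15 hold a hash of the source; bit 1 is the check_source flag
    return "checked-hash" if flags & 0b10 else "unchecked-hash"


def compile_and_classify(mode):
    """Compile a throwaway module with the given mode and classify the result."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp, "demo.py")
        src.write_text("x = 1\n")
        pyc = py_compile.compile(
            str(src),
            cfile=str(Path(tmp, "demo.pyc")),
            invalidation_mode=mode,
            doraise=True,
        )
        return pyc_invalidation_mode(pyc)
```

For example, compile_and_classify(py_compile.PycInvalidationMode.CHECKED_HASH) should report "checked-hash".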

Another way to install Python modules is to not install the source,
and use the *.pyc file directly in place of the *.py file
(removing Python version tag from the filename and moving the file
out of the __pycache__ directory).
This layout has two main issues:

The Python version tag is not used, meaning that modules using
this layout are only usable by a specific version, and
the source is not available, making it hard to debug (tracebacks
and the inspect module don’t show code; file is unreadable to the
debugging human).
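
The second issue can be observed with today's interpreter. This is only a sketch: the module name pyc_only_demo and the helper import_pyc_only are invented for the example.

```python
import importlib
import py_compile
import sys
import tempfile
from pathlib import Path


def import_pyc_only(name="pyc_only_demo"):
    """Install a module as a bare .pyc, import it, and report what is visible."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp, name + ".py")
        src.write_text("def answer():\n    return 42\n")
        # No version tag, no __pycache__: the .pyc sits where the .py would be.
        py_compile.compile(str(src), cfile=str(Path(tmp, name + ".pyc")), doraise=True)
        src.unlink()
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            module = importlib.import_module(name)
            loader = module.__spec__.loader
            # (result of calling the code, loader class name, available source)
            return module.answer(), type(loader).__name__, loader.get_source(name)
        finally:
            sys.path.remove(tmp)
            sys.modules.pop(name, None)
```

The module runs normally, but it is handled by SourcelessFileLoader, whose get_source returns None, so tracebacks and inspect have nothing to show.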

The first issue is usually not relevant, as most installations are tightly
tied to a specific interpreter. [XXX any examples where this isn’t the case?]

This PEP proposes to solve the second issue by allowing installers to
distribute the source file alongside the file with the bytecode.

Rationale

The new file layout is optimized for “installed libraries”: third-party
libraries installed on a user’s system.
This can include the Python standard library.

We assume that these files will most likely not be edited after installation.
Python will only consult the bytecode file (*.pyc) when loading
a module, and not check whether a *.py file was edited.

We assume that retrieving a module’s source is useful, but it is not a
performance-sensitive operation. It is used when displaying tracebacks
or debugging.
This makes it more palatable for distributors to use the resource-intensive
“checked hash” bytecode files and enjoy their benefits (explained in PEP 552).

On the other hand, we believe that Python should remain “hackable”: if a
source file is available, it should be possible to modify it and use the
result – for example, to add a few print calls to a library for
some quick-and-dirty debugging (in a throwaway virtual environment, of course),
or even to explore the standard library by breaking it.
The proposed file layout makes this relatively straightforward: when the
source (*.py) file is moved out of the __pysource__ directory,
Python will ignore the bytecode file and load the source instead, producing
a cache in __pycache__. (This is the existing behavior when both a
*.py and *.pyc are present for a given name.)
We hope that users who’d like to do this, but aren’t familiar
with the proposed mechanics, will notice the extra directory, search the Web
for __pysource__ and find relevant instructions.
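
The precedence this relies on can be tried out with an unpatched interpreter, where __pysource__ is just an ordinary directory that the import system ignores. A rough sketch, with the module name hackable_demo and both helper functions being hypothetical:

```python
import importlib
import os
import py_compile
import sys
import tempfile
from pathlib import Path


def _import_fresh(directory, name, attr):
    """Import `name` from `directory` and return one attribute, then clean up."""
    sys.path.insert(0, directory)
    importlib.invalidate_caches()
    try:
        return getattr(importlib.import_module(name), attr)
    finally:
        sys.path.remove(directory)
        sys.modules.pop(name, None)


def hack_installed_module(name="hackable_demo"):
    with tempfile.TemporaryDirectory() as tmp:
        # "Installed" state: bytecode in place, source tucked away.
        aux = Path(tmp, "__pysource__", name + ".py")
        aux.parent.mkdir()
        aux.write_text("ORIGIN = 'installed'\n")
        py_compile.compile(str(aux), cfile=str(Path(tmp, name + ".pyc")), doraise=True)
        before = _import_fresh(tmp, name, "ORIGIN")  # only the .pyc is importable

        # "Hacking" state: move the source next to the .pyc and edit it.
        hacked = Path(tmp, name + ".py")
        os.replace(aux, hacked)
        hacked.write_text("ORIGIN = 'edited'\n")
        after = _import_fresh(tmp, name, "ORIGIN")  # the .py now takes precedence
        return before, after
```

Here hack_installed_module() returns ('installed', 'edited'): once a *.py file sits next to the *.pyc, the source is what gets loaded.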

The proposed layout makes it easy to omit the source files, which will be
useful in resource-constrained environments (e.g. minimal Linux containers).
Omitting them should not affect non-debug functionality.
Adding the sources to an installation that omits them involves only creating
directories and copying source files to the right places, which is relatively
easy even for non-Python-specific tools (like Linux package managers).

This PEP does not propose that any particular distributor or installer
(including Python’s build system) should immediately switch to the new layout.
The PEP will be implemented when importlib supports reading the layout
and stdlib tools like py_compile can generate it. Switching to it should be
a separate decision – although one that might not need a PEP.

Specification

importlib.machinery.SourcelessFileLoader, the loader that handles
stand-alone *.pyc files, will be renamed to BytecodeFileLoader.
The old name will remain as an alias for the foreseeable future,
with no DeprecationWarning. However, third-party linters and code-quality
tools are encouraged to treat the old name as suboptimal.

The get_source_filename method of BytecodeFileLoader will
be changed to return the expected location of an auxiliary source file, e.g.
dir/__pysource__/module.py for dir/module.pyc.
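
As an illustration of that mapping only (the function name auxiliary_source_path is an assumption, not the name used in the reference implementation):

```python
from pathlib import PurePosixPath


def auxiliary_source_path(pyc_path):
    """Map dir/module.pyc to dir/__pysource__/module.py (POSIX-style paths)."""
    pyc = PurePosixPath(pyc_path)
    return str(pyc.parent / "__pysource__" / pyc.with_suffix(".py").name)
```

With this sketch, auxiliary_source_path("dir/module.pyc") gives "dir/__pysource__/module.py".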

The get_source method of BytecodeFileLoader will
check if the auxiliary source file corresponds to the bytecode file
(as returned by get_filename).

.. note::
This check is done at the time of the call. There is no check that the
source file corresponds to an in-memory module loaded by the
BytecodeFileLoader. For example, if both *.pyc and *.py are
changed after a module is loaded, tracebacks will show lines of the updated
source, which might not correspond to the running code.
The same “gotcha” applies to current handling of *.py files.
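
One way such a correspondence check could look, based on the PEP 552 header fields. This is a sketch under assumptions, not the reference implementation: source_matches_pyc and demo_match are hypothetical names, and comparing the hash even for "unchecked hash" files is a choice made here for illustration.

```python
import importlib.util
import os
import py_compile
import tempfile
from pathlib import Path


def source_matches_pyc(source_path, pyc_path):
    """Return True if the source file is the one the .pyc was compiled from."""
    header = Path(pyc_path).read_bytes()[:16]
    if len(header) < 16 or header[:4] != importlib.util.MAGIC_NUMBER:
        return False
    flags = int.from_bytes(header[4:8], "little")
    if flags & 0b01:  # hash-based: bytes 8-15 are the source hash
        source = Path(source_path).read_bytes()
        return header[8:16] == importlib.util.source_hash(source)
    # timestamp-based: bytes 8-11 are the mtime, bytes 12-15 the size
    st = os.stat(source_path)
    mtime = int.from_bytes(header[8:12], "little")
    size = int.from_bytes(header[12:16], "little")
    return mtime == int(st.st_mtime) & 0xFFFFFFFF and size == st.st_size & 0xFFFFFFFF


def demo_match(mode):
    """Check a fresh source file, then the same file after an edit."""
    with tempfile.TemporaryDirectory() as tmp:
        src, pyc = Path(tmp, "module.py"), Path(tmp, "module.pyc")
        src.write_text("x = 1\n")
        py_compile.compile(str(src), cfile=str(pyc), invalidation_mode=mode, doraise=True)
        fresh = source_matches_pyc(src, pyc)
        src.write_text("x = 2  # edited\n")
        return fresh, source_matches_pyc(src, pyc)
```

In this sketch a source file that fails the check would simply be treated as unavailable, so get_source would return None rather than show misleading lines.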

The py_compile and compileall modules will gain arguments and CLI
options for compiling to the new layout.
[XXX: This needs fleshing out. The original source needs to be moved. Need to ensure that compilation is still idempotent.]
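
Until those options exist, the intended result can be approximated with the current py_compile API. The sketch below is an assumption about what such a tool might do, not the proposed interface: install_pyc_first and demo_install are hypothetical names, and moving the source before compiling (so that the recorded filename points into __pysource__) is a design choice made here.

```python
import os
import py_compile
import tempfile
from pathlib import Path


def install_pyc_first(directory, module_name):
    """Move module.py into __pysource__/ and compile it to a sibling module.pyc."""
    directory = Path(directory)
    src = directory / f"{module_name}.py"
    aux = directory / "__pysource__" / src.name
    aux.parent.mkdir(exist_ok=True)
    if src.exists():  # already moved on a previous run otherwise
        os.replace(src, aux)
    pyc = directory / f"{module_name}.pyc"
    py_compile.compile(
        str(aux),
        cfile=str(pyc),
        invalidation_mode=py_compile.PycInvalidationMode.CHECKED_HASH,
        doraise=True,
    )
    return pyc, aux


def demo_install():
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "module.py").write_text("x = 1\n")
        first = install_pyc_first(tmp, "module")
        second = install_pyc_first(tmp, "module")  # re-running must not fail
        files = sorted(
            p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*") if p.is_file()
        )
        return first == second, files
```

Running the helper twice leaves the same two files behind, __pysource__/module.py and module.pyc, which is the kind of idempotency the note above asks for.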

Implications

The following follows naturally [XXX verify this!] from the changes above, but will
be tested separately.

inspect.getsource, inspect.getsourcefile, inspect.getsourcelines, and
the python -m inspect CLI will retrieve source for modules using the new
layout (if the __pysource__/*.py file is available and current).

Tracebacks will show source lines for modules using the new layout
(if the __pysource__/*.py file is available and current).

Backwards Compatibility

The proposal is backwards compatible.
However, once an installer (including Python’s build process) switches to the
new layout, tools that are not prepared for it may stop working.

This affects tools like IDEs, debuggers, API doc generators, etc. if they
either don’t use importlib or inspect, or use these modules from a
different version of Python than the code they are handling.

Even in that case, the failure – not being able to retrieve source code
for a third-party module – is usually a quality-of-life issue rather than
a serious flaw.

Security Implications

None known.
The proposal adds source code information to modules that can already be
loaded and executed.

How to Teach This

This change does not affect code that users write directly.
Most teaching materials can stay unchanged.

Authors of existing installer tools should read this PEP.
Authors of future installer tools should read documentation that will be added.

Searching for the __pysource__ directory name in Python’s documentation
should yield relevant documentation.
We hope that people exploring the libraries installed on their system will
naturally reach relevant docs by searching for __pysource__.

Reference Implementation

→ GitHub - encukou/cpython at pysource

Rejected Ideas

Nothing yet.

Open Issues

See XXX’s above.

Copyright

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.

FFY00 · March 25, 2022, 3:41pm

I think the proposal is sound, but I personally struggle to justify the added complexity for the added benefits, especially for a user-facing component. That is subjective though.

Have you benchmarked bytecode only vs source + bytecode with timestamp invalidation? Most of the time should come from the two extra stat operations, right? What does it look like on different operating systems? Does that really warrant the creation of a mechanism like this?

uranusjr · March 25, 2022, 3:45pm

How does the loader check whether the bytecode file and source match in this layout, and what should it do if they don’t match? In the __pycache__ layout, the bytecode would be automatically re-generated, but I’d assume this shouldn’t be the case in the new layout, since the bytecode is the canonical executable in this case.

encukou · March 25, 2022, 4:34pm

This should be a bit faster than timestamp invalidation, but as you say, probably not by much. But timestamp invalidation isn’t the enemy here.
The current default for libraries compiled with SOURCE_DATE_EPOCH (which is AFAIK set in many distro builds) is hash invalidation, and that is costly. Changing that default is one of the ~3 patches to CPython in Fedora.
Still, you’re right, before making this a PEP I’d like to collect some hard current data.
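
For reference, a small sketch of that default as py_compile currently behaves. The helper name default_pyc_flags is made up; it compiles without an explicit invalidation_mode and returns the flags field of the resulting pyc header (0 means timestamp, 0b11 means checked hash):

```python
import os
import py_compile
import tempfile
from pathlib import Path


def default_pyc_flags(source_date_epoch):
    """Compile with or without SOURCE_DATE_EPOCH set and return the pyc flags."""
    saved = os.environ.pop("SOURCE_DATE_EPOCH", None)
    if source_date_epoch is not None:
        os.environ["SOURCE_DATE_EPOCH"] = str(source_date_epoch)
    try:
        with tempfile.TemporaryDirectory() as tmp:
            src = Path(tmp, "demo.py")
            src.write_text("x = 1\n")
            pyc = py_compile.compile(
                str(src), cfile=str(Path(tmp, "demo.pyc")), doraise=True
            )
            return int.from_bytes(Path(pyc).read_bytes()[4:8], "little")
    finally:
        # Restore the caller's environment exactly as it was.
        os.environ.pop("SOURCE_DATE_EPOCH", None)
        if saved is not None:
            os.environ["SOURCE_DATE_EPOCH"] = saved
```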

It should use the usual mechanism (according to the pyc’s invalidation scheme), and it should ignore the source file if it doesn’t match.
It cannot regenerate the source from bytecode, and shouldn’t do it even if it could.

brettcannon · March 25, 2022, 7:12pm

I personally view .pyc files as an implementation detail for performance reasons. This would elevate .pyc files to a higher, more exposed level.

I would rather see a change in cache validation checks if that’s the great performance concern than directly expose .pyc files to this extent (I personally wanted to kill off pyc-only usage in the 2->3 transition, but I obviously lost that argument). There’s also py_compile (see its page in the Python 3.10.4 documentation) to avoid checking the source at all.

FFY00 · March 25, 2022, 7:45pm

Right, but the distributors could select a different cache invalidation scheme when installing the packages. Since this is something they could easily change, instead of introducing a new layout, we could work towards that. If necessary, we could also change the default, or allow distributors to configure it.

Anyway, I think ultimately this might come down to the performance, and if the benefits warrant the complexity.

gpshead · March 27, 2022, 9:31pm

I don’t see this as an issue that exists. Installers already distribute the source file along with the .pyc today and it works.

Things managed by a system package management system should not be using .pyc files that rely on module import time timestamp metadata or content hash validation of installed .pyc bytecode files. They should simply be assumed to be correct (ex: use the unchecked hash option when generating that bytecode) to skip runtime check overhead entirely. It is up to the system package management to verify that what is on the file system remains what it should be in an out of band fashion of its own choosing.

Fedora appears to use checked hashes when building rpm’s by default (at least according to the 00328-pyc-timestamp-invalidation-mode.patch you linked to). That surprises me. Unchecked hashes make more sense for a package as there is little value in checking them and silently not reporting the mismatch upon every import.

What meaningful value does using a checked hash on a .pyc file provide for a system installed package? I can’t think of any. If the check fails, it’ll load from source. But nothing is ever told there is a mismatch and Python has zero information as to if the source .py or the .pyc is the desirable “correct” code that matches what the package is supposed to have installed. So why waste I/O and CPU time on all of these checks upon every process’s import from a package on every computer running this around the world.

(Also, Fedora shouldn’t have to do this by patching their Python interpreter. I’d expect a change to the package build system itself to control how all packages pyc files are generated. I assume this was just deemed a lot more convenient than improving a complicated plethora of package build system stories for now - creative hack!)

FFY00 · March 28, 2022, 3:39am

I think Petr wants to avoid frustration when users change the source, even if that is explicitly not supported by the distributions, and their changes are not picked up.

I think that would be the right approach when ignoring what I mentioned above. But if one wanted to improve the use-case I pointed out, timestamp-based invalidation should provide a very reasonable middle ground, performing much better than checked hash while still invalidating the cache if someone changes the source.

I completely agree, this should be done by the package installer. These days, with the standardizations we have been pushing over the years, it should be perfectly doable to change the cache invalidation scheme in the RPM macros themselves. It is still work nonetheless, so I understand why it hasn’t been done already.

steve.dower · March 28, 2022, 3:33pm

All of this can be done with a custom importer, can’t it? Along with as many other optimisations as you like.

Maybe we just need to make it more obvious how to install a custom importer?

encukou · March 28, 2022, 3:51pm

I don’t see it that way. The pyc would be about as hidden as today – of concern to installers (which need to compile pycs already), and people digging in implementation details (in this case, details of installed libraries).

Sure. But checked hash does have advantages – in a debugging session, the check that the displayed source corresponds to the loaded module is pretty useful IMO.

If all installers should use unchecked hash, why does Python default to checked hash (when building reproducible packages)?

Umm, I’d really like to use Python’s build system when building Python. That beast is not easy to replicate. (And if the change there is then picked up in other packages automatically, why fix it elsewhere?)

Python has an explicit special case for building system packages (=reproducible mode, with SOURCE_DATE_EPOCH). Is that a wrong default? Should it be UNCHECKED_HASH instead? What would be the reason for having CHECKED_HASH at all?

Sure, on both counts. But if it turns out to be useful it should go in stdlib – at least eventually.
But also note that this would technically (not conceptually) be a pretty small addition to the current importer, and that it needs some internal API for the invalidation check. (And that API is not easy to expose.)

One thing that I should have mentioned more prominently is the installation size, which turns out to be important in containers and VMs. Installing without the dead weight of the source for production, with it still being easily installable for debugging, would be quite useful.

pf_moore · March 28, 2022, 4:20pm

Why would installers need to know? Pip for example, just calls compileall from the stdlib.

encukou · March 28, 2022, 4:35pm

Yep, pip is a lucky installer that doesn’t need to care, because Python’s compileall is enough for it!
Fedora is in the same position after Lumír spent quite some effort on it, though we’re ready to diverge again if need be.
I imagine other installers might well have bespoke code, though.

And if not, well, then all this is even more of an implementation detail.

steve.dower · March 28, 2022, 4:55pm

True, but we can easily add new classes to importlib. The issue right now is there’s no easy way for users to activate them (i.e. write a sitecustomize.py), so to be broadly useful (does it need to be?) it somehow has to be on by default.

But if you’ve got a custom importer, you don’t need to rely on the standard validation check. You can create any format .pyc file you like and store whatever info you want. For example, you probably want to combine all the bytecode into a single file, because reading a single larger file is basically always faster than the equivalent size split into smaller ones, and then you can fixup filenames however you like to wherever you keep sources, and validate however you like based on whatever hashes/stamps you’ve collected.

I think you’re just not thinking big enough here. Trying to make the minimal possible changes to enable it makes sense when modifying the core, but if this makes more sense as a separate thing (and I think it does), better to separate from core and make better decisions all the way down.

An alternative “minimal” change that might make more sense is a new checking mode that doesn’t run the check until the source file is loaded (I believe inspect goes back through the importer for this), and then it only warns. That way the regular compileall can create pyc files as usual under __pycache__ and if they exist they’ll be loaded without checking the source file, but all the paths are still correct for when they need to be reported.

brettcannon · March 28, 2022, 11:10pm

Because one setting leads to less false-positive bug reports than the other. But if you know what you’re doing then UNCHECKED_HASH is the faster solution.

gpshead · March 29, 2022, 12:17am

Right. I assume OS distro packages should meet the high bar of “know what you’re doing”. I would not make the same assumption of pip created PyPI packages for installation in a local environment, as such an environment is writable by the user running Python.

gpshead · March 29, 2022, 12:19am

As to custom importers and sitecustomize things… best avoid those. You’ll always regret that in the long run as things will come to depend on it and other things will not work while it is present - you gain maintenance burdens that are hard to untangle from. These are nearly as regrettable as a need to modify your interpreter. Aim to avoid that.

EpicWink · March 29, 2022, 3:18am

With regards to putting the source in the sibling __pysource__ directory, I suggest compressing all the text source. The proposal doesn’t seem to say, but if the __pysource__ layout is achieved during package-install time (after extraction from zip), then there should be no concern about compression format compatibility (the environment performing the compression should have the ability to perform the inverse decompression).

Package-install times for systems with strict size requirements should be unimportant: in my experience, when building a tiny container image, I don’t mind CI taking many minutes to build it, not that I expect compression for many packages of text files to take more than seconds anyway.

steve.dower · March 29, 2022, 2:18pm

Since we’re already at the point (for this discussion) where files are being precompiled, rearranged on disk, and shipped in non-Python-standard packages, I don’t see how this makes things much worse. The context is definitely not “pip starts doing this by default for everyone”.

And my reference to sitecustomize was lamenting that it’s the best we have right now, and we need something better for startup customizations to be realistic.

gpshead · March 29, 2022, 3:47pm

The higher level view of my reply is that I still don’t understand what problem exists and thus why this proposal helps. But I’m trying to.

We can already identify .py files to be removed/separated out from a package. By scanning the package for .pyc files - the .py sources are in an obvious computable location nearby. Rename the .pyc into the .py file’s basedir if it was in a __pycache__. You already identified that in OS package situations the version tag is not usually relevant.

In a way, I can now read your proposal as “seeking a way to distribute sources separately from the application code”. Similar to how shipping debug symbols is sometimes done for compiled binaries. Is that correct?

We use a modified importlib at work that allows loading modules using a .cpython-XY.pyc style name regardless of being in a __pycache__/ subdir or not. That way we don’t have to bother putting files in different locations based on source presence. While unusual for us to have multiple versions present at once, it avoids issues in necessary scenarios where that does happen (ex: mid-upgrade). We didn’t see a reason the rest of the world might want this so hadn’t proposed it as an official feature. But supporting this layout seems like it might also solve the problem I’m assuming you have? By not using __pycache__/ subdirs in your rpms at all. As those are more useful in a development tree where the idea was created to reduce clutter from pyc’s generated at program runtime. I’m happy to share our patch if so. But if you’re able to guarantee you never have multiple versions existing in one tree, you can just use a plain .pyc name outside of __pycache__ already and avoid it all.

Side note: A feature request that might help Fedora drop that patch: A standard ability to control compileall pyc output mode (or any compilation to pyc?) via an environment variable. So you could set that from the rpm package builder parent process instead of a modification relying on a magic rpm specific environment variable within your patched runtime.

encukou · April 1, 2022, 2:31pm

Hm, but I do think of it as something that should be in core – specifically, adding missing functionality to the existing “pyc only” layout. (Which does have its uses.)

That would make it harder to tell which file actually runs if they don’t match. You modify the source, wonder why nothing happens – and find that a file in a “cache” subdirectory has a bit set that makes Python not look at the source at all. Quite non-obvious, IMO.
(Unchecked hash mode has the same issue, of course.)

Not always. Mucking around with pip-installed packages is easier for a Python developer, but there are people who don’t fear modifying system packages. Especially if the system is a VM/container.

That’s possible, but the proposal allows not installing any source at all on production (for smaller install sizes), while making it installable for debugging (without needing to unzip to read it).
I don’t see the appeal. (Maybe the .pyc should be zipped, but that’s a whole different discussion.)

Yes, that’s a very good analogy!
And I’m arguing that for code you’re not actively developing, this approach is much better than Python’s current source-first philosophy.

For this use case, that doesn’t seem too different than distributing .pyc without the tag outside __pycache__, except that it solves the version mismatch on upgrades.
But as far as I can see, it doesn’t work as well if you install the source for debugging: once you add a .py file, it becomes the source of truth, the existing .pyc is ignored, and a new .pyc is created in __pycache__, right?
(In some security contexts, even attempting to create executable files in system locations at runtime is a big red flag. I’d much prefer to have all the .pycs in their place when software is installed.)

On systems I know about, nearly all code is in version-specific directories already, and an upgrade switches to a new tree. Mid-upgrade state isn’t much of an issue.

Dropping the patch is an eventual goal, but I’d be happier if we could get a solution into CPython that would be useful to more than Fedora :)
