Hello,
Here’s a proposal to fix several niggles we found when distributing Python libraries in Fedora. What do you think?
Abstract
For modules loaded directly from bytecode cache (*.pyc) files, Python will
look for corresponding source in a __pysource__ directory.
The existing ability to load modules from *.pyc files alone is
unchanged, but conceptually it becomes a special case of a “pyc-first”
file layout.
Motivation
Most pure Python code is installed as a source file (*.py), combined with a
bytecode cache file (__pycache__/*.pyc), which is created/updated ahead of
time or on demand.
This layout is designed for rapid iteration. Each time a module is imported,
Python assumes the source might have changed: if a bytecode cache is present,
Python normally checks whether it still corresponds to the source.
PEP 552 introduced an “unchecked” mode, in which this check is skipped.
However, this causes updates to the source to be silently ignored, possibly
confusing users that aren’t aware of this rarely used mode.
The remaining checking modes have their own disadvantages.
In both, even in the best-case scenario (the cache is present and fresh),
Python must access at least two files (the source and the cache). Further:
- In the timestamp-based mode, the source file’s last-modification time is
used as part of the cache key, causing issues with reproducible builds
as described in PEP 552.
- In the hash-based mode, the entire source file is read and hashed.
This is potentially a slow operation. [XXX data needed.]
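As a rough illustration of the three PEP 552 modes mentioned above, the sketch below compiles one throwaway file in each mode using the existing py_compile API and reads back the flags word from the *.pyc header. The file name demo.py and the helper pyc_flags are made up for this example.

```python
import pathlib
import py_compile
import tempfile


def pyc_flags(pyc_path):
    """Return the PEP 552 flags word stored after the 4-byte magic number."""
    data = pathlib.Path(pyc_path).read_bytes()
    return int.from_bytes(data[4:8], "little")


with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp) / "demo.py"  # hypothetical module
    src.write_text("ANSWER = 42\n")
    results = {}
    for mode in py_compile.PycInvalidationMode:
        # Compile the same source once per invalidation mode.
        pyc = py_compile.compile(
            str(src),
            cfile=str(src.with_suffix(f".{mode.name}.pyc")),
            invalidation_mode=mode,
            doraise=True,
        )
        results[mode.name] = pyc_flags(pyc)

# Bit 0 set: hash-based pyc. Bit 1 set: the hash is checked against the source.
print(results)
```

In the timestamp mode the flags are zero and the header carries the source mtime and size; in the two hash modes it carries a hash of the source instead.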
Another way to install Python modules is to not install the source,
and use the *.pyc file directly in place of the *.py file
(removing the Python version tag from the filename and moving the file
out of the __pycache__ directory).
This layout has two main issues:
- The Python version tag is not used, meaning that modules using
this layout are only usable by a specific version, and
- the source is not available, making it hard to debug (tracebacks
and the inspect module don’t show code; the file is unreadable to the
debugging human).
The first issue is usually not relevant, as most installations are tightly
tied to a specific interpreter. [XXX any examples where this isn’t the case?]
This PEP proposes to solve the second issue by allowing installers to
distribute the source file alongside the file with the bytecode.
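For reference, the existing source-less layout described above can be reproduced with today's tools. The sketch below is only a demonstration in a temporary directory; the module name sourceless_demo is made up. It shows that such a module imports and runs, but its loader has no source to offer.

```python
import importlib
import pathlib
import py_compile
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    src = root / "sourceless_demo.py"  # hypothetical module
    src.write_text("def greet():\n    return 'hello'\n")
    # Write the bytecode where the source would be, without a version tag.
    py_compile.compile(str(src), cfile=str(root / "sourceless_demo.pyc"),
                       doraise=True)
    src.unlink()  # ship only the *.pyc file

    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        module = importlib.import_module("sourceless_demo")
        loader_name = type(module.__spec__.loader).__name__
        greeting = module.greet()
        # The loader for stand-alone *.pyc files cannot provide source today.
        source = module.__spec__.loader.get_source("sourceless_demo")
    finally:
        sys.path.remove(tmp)
        sys.modules.pop("sourceless_demo", None)

print(loader_name, greeting, source)
```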
Rationale
The new file layout is optimized for “installed libraries”: third-party
libraries installed on a user’s system.
This can include the Python standard library.
We assume that these files will most likely not be edited after installation.
Python will only consult the bytecode file (*.pyc
) when loading
a module, and not check whether a *.py
file was edited.
We assume than retreiving a module’s source is useful, but it is not a
performance-sensitive operation. It is used when displaying tracebacks
or debugging.
This makes it more palatable for distributors to use the resource-intensive
“checked hash” bytecode files and enjoy their benefits (explained in PEP 552).
On the other hand, we believe that Python should remain “hackable”: if a
source file is available, it should be possible to modify it and use the
result – for example, to add a few print calls to a library for
some quick-and-dirty debugging (in a throwaway virtual environment, of course),
or even to explore the standard library by breaking it.
The proposed file layout makes this relatively straightforward: when the
source (*.py) file is moved out of the __pysource__ directory,
Python will ignore the bytecode file and load the source instead, producing
a cache in __pycache__. (This is the existing behavior when both a
*.py and *.pyc are present for a given name.)
We hope that users who’d like to do this, but aren’t familiar
with the proposed mechanics, will notice the extra directory, search the Web
for __pysource__ and find relevant instructions.
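The existing precedence that the “hackable” workflow relies on can be checked with current Python. In this sketch (module name hackable_demo is made up), a stand-alone *.pyc and an edited *.py with the same name sit side by side, which is the state a user would reach after moving the source out of __pysource__ and editing it.

```python
import importlib
import pathlib
import py_compile
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    src = root / "hackable_demo.py"  # hypothetical module
    src.write_text("VALUE = 'from bytecode'\n")
    # Stand-alone bytecode file next to the source, as in the pyc-first layout.
    py_compile.compile(str(src), cfile=str(root / "hackable_demo.pyc"),
                       doraise=True)
    # Simulate a user editing the source after installation.
    src.write_text("VALUE = 'from edited source'\n")

    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        module = importlib.import_module("hackable_demo")
        value = module.VALUE
        loader_name = type(module.__spec__.loader).__name__
    finally:
        sys.path.remove(tmp)
        sys.modules.pop("hackable_demo", None)

print(loader_name, value)
```

The source file wins over the adjacent *.pyc, so the edit takes effect on the next import.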
The proposed layout makes it easy to omit the source files, which will be
useful in resource-constrained environments (e.g. minimal Linux containers).
Omitting them should not affect non-debug functionality.
Adding the sources to an installation that omits them involves only creating
directories and copying source files to the right places, which is relatively
easy even for non-Python-specific tools (like Linux package managers).
This PEP does not propose that any particular distributor or installer
(including Python’s build system) should immediately switch to the new layout.
The PEP will be implemented when importlib supports reading the layout
and stdlib tools like py_compile can generate it. Switching to it should be
a separate decision – although one that might not need a PEP.
Specification
importlib.machinery.SourcelessFileLoader, the loader that handles
stand-alone *.pyc files, will be renamed to BytecodeFileLoader.
The old name will remain as an alias for the foreseeable future,
with no DeprecationWarning. However, third-party linters and code-quality
tools are encouraged to treat the old name as suboptimal.
The get_source_filename method of BytecodeFileLoader will
be changed to return the expected location of an auxiliary source file, e.g.
dir/__pysource__/module.py for dir/module.pyc.
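The path mapping in that example can be sketched as a small helper. The function name auxiliary_source_path is hypothetical and not part of the proposal or of importlib; it only restates the dir/module.pyc to dir/__pysource__/module.py rule.

```python
from pathlib import PurePosixPath


def auxiliary_source_path(bytecode_path):
    """Map dir/module.pyc to dir/__pysource__/module.py (illustrative only)."""
    path = PurePosixPath(bytecode_path)
    return path.parent / "__pysource__" / (path.stem + ".py")


print(auxiliary_source_path("dir/module.pyc"))
```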
The get_source method of BytecodeFileLoader will
check if the auxiliary source file corresponds to the bytecode file
(as returned by get_filename).
.. note::
This check is done at the time of the call. There is no check that the
source file corresponds to an in-memory module loaded by the
BytecodeFileLoader. For example, if both *.pyc and *.py are
changed after a module is loaded, tracebacks will show lines of the updated
source, which might not correspond to the running code.
The same “gotcha” applies to current handling of *.py files.
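The text above does not say how the correspondence check is performed. One possible approach, shown here purely as an assumption, is to compare the source hash stored in a hash-based *.pyc header (PEP 552) with a fresh hash of the auxiliary source file. The helper name source_matches_bytecode is made up, and timestamp-based files are simply treated as non-matching in this sketch.

```python
import importlib.util
import pathlib
import py_compile
import tempfile


def source_matches_bytecode(source_path, bytecode_path):
    """Assumed check: compare the pyc's stored source hash with the source."""
    header = pathlib.Path(bytecode_path).read_bytes()[:16]
    if len(header) < 16 or header[:4] != importlib.util.MAGIC_NUMBER:
        return False
    flags = int.from_bytes(header[4:8], "little")
    if not flags & 0b1:
        return False  # timestamp-based pyc: not handled by this sketch
    source_bytes = pathlib.Path(source_path).read_bytes()
    return header[8:16] == importlib.util.source_hash(source_bytes)


with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    source_dir = root / "__pysource__"
    source_dir.mkdir()
    src = source_dir / "module.py"
    src.write_text("X = 1\n")
    pyc = root / "module.pyc"
    py_compile.compile(
        str(src), cfile=str(pyc), doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.CHECKED_HASH)
    fresh = source_matches_bytecode(src, pyc)
    src.write_text("X = 2\n")  # edit the auxiliary source afterwards
    stale = source_matches_bytecode(src, pyc)

print(fresh, stale)
```

Because the hash is only compared when get_source is called, this kind of check has the same limitation as the note describes: it says nothing about a module that is already loaded in memory.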
The py_compile and compileall modules will gain arguments and CLI
options for compiling to the new layout.
[XXX: This needs fleshing out. The original source needs to be moved. Need to ensure that compilation is still idempotent.]
Implications
The following should follow naturally [XXX verify this!] from the changes
above, but will be tested separately.
inspect.getsource, inspect.getsourcefile, inspect.getsourcelines, and
the python -m inspect CLI will retrieve source for modules using the new
layout (if the __pysource__/*.py file is available and current).
Tracebacks will show source lines for modules using the new layout
(if the __pysource__/*.py file is available and current).
Backwards Compatibility
The proposal is backwards compatible.
However, once an installer (including Python’s build process) switches to the
new layout, tools that are not prepared for it may stop working.
This affects tools like IDEs, debuggers, API doc generators, etc. if they
either don’t use importlib or inspect, or use these modules from a
different version of Python than the code they are handling.
Even in that case, the failure – not being able to retrieve source code
for a third-party module – is usually a quality-of-life issue rather than
a serious flaw.
Security Implications
None known.
The proposal adds source code information to modules that can already be
loaded and executed.
How to Teach This
This change does not affect code that users write directly.
Most teaching materials can stay unchanged.
Authors of existing installer tools should read this PEP.
Authors of future installer tools should read documentation that will be added.
Searching for the __pysource__ directory name in Python’s documentation
should yield relevant documentation.
We hope that people exploring the libraries installed on their system will
naturally reach relevant docs by searching for __pysource__.
Reference Implementation
GitHub: encukou/cpython, pysource branch
Rejected Ideas
Nothing yet.
Open Issues
See XXX’s above.
Copyright
This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.