Thought experiment about adding extra loadable stuff to a .pyc file

larry · April 10, 2023, 3:31pm

In my recent-ish thread about revising PEP 649, Petr brought up the possibility of enhancing .pyc files so we can add additional lazy-loaded stuff. I was discussing this in a private email thread this morning, and had a brainstorm about how it all could work: the API, the semantics, and the implementation.

Quick recap, the current structure of a .pyc file is as follows:

<magic value>
<4-byte flags int>
<8 bytes of stuff, contents vary depending on flags>
<module code object>

AFAIK this structure is invariant for CPython. The “8 bytes of stuff” can be either a 4-byte datetime stamp (seconds since epoch) and a 4-byte size, or an 8-byte hash of the source.

I’m gonna call our new thing an “overlay”. This isn’t technically an overlay in the traditional sense, but it’s a good enough word to hang the concept on for now–we can find a better word later. So.

An “overlay” is an optional code object appended to a .pyc file, referenced by a “name”. An overlay “name” is a marshallable, hashable, constant value–a string, an int, etc.

“Loading” an overlay means running the overlay code object in an existing module’s namespace. This can do whatever it needs to in the module–add new attributes, modify existing ones, run arbitrary code. There is no mechanism to “unload” an overlay.

The overlay loading machinery remembers what overlays have been loaded by maintaining a __overlays__ attribute in the module’s namespace. It’s unset when the module is new, and only added after loading the first overlay. It’s a set object containing all the names of the loaded overlays.

We add a new function to load overlays, something like

int PyImport_LoadOverlay(PyObject *module, PyObject *name, int force_reload);

You’d call it with an already-loaded module object, the name of the overlay you want to load, and the force_reload flag. It’d return nonzero for success and zero for failure. force_reload relates to caching: if force_reload is zero, it doesn’t re-load an overlay that’s already been loaded; if it’s nonzero, it always loads the overlay, whether or not it was previously loaded. The return value could indicate what happened, like 1 for “successfully loaded” and 2 for “already loaded, didn’t reload”.

We’d also provide this function in Python, presumably in the sys module. Something like

def load_overlay(module, name, *, force_reload=False):
    ...

The simplest possible way to store overlays would be appending alternating names and values of the overlays to the .pyc file. In this example, our .pyc file has three overlays with the names 'foo', 'bar', and 'third_thing':

<magic value>
<4-byte flags int>
<8 bytes of stuff, contents vary depending on flags>
<module code object>
<constant, string 'foo'>
<overlay code object 'foo'>
<constant, string 'bar'>
<overlay code object 'bar'>
<constant, string 'third thing'>
<overlay code object 'third thing'>

But this would force us to wade through N overlays to find the one we wanted.

With only slightly more data–one 4-byte int per overlay code object, and two additional 4-byte ints–we could create a structure (inspired by zip files) that could still be written in one pass but lets us load any overlay with at most three seeks.

<magic value>
<4-byte flags int>
<8 bytes of stuff, contents vary depending on flags>
<module code object>
<overlay code object 'foo'>
<overlay code object 'bar'>
<overlay code object 'third thing'>
<number of overlays, 4-byte int>  # start of "directory
<constant, string 'foo'>
<absolute seek offset for overlay code object 'foo'>
<constant, string 'bar'>
<absolute seek offset for overlay code object 'bar'>
<constant, string 'third thing'>
<absolute seek offset for overlay code object 'third thing'> # last entry in "directory"
<absolute seek offset to number of constants>

In practice we could get it down to two seeks or even one. For example, we could seek to EOF-4k bytes and read 4k bytes. With luck, that’ll get you the entire “directory” and the ending absolute seek offset, and if you’re extra-lucky the overlay you want to load would be in that 4k chunk too.

Any good?

barry · April 10, 2023, 3:42pm

Interesting idea. Can you talk a little bit about how you seeing this relate to the source in a .py file (like how would bits of the source be “allocated” to an overlay), and how that might all tie together to give lazy loading type functionality?

Also, maybe load_overlay() should just be a method on module objects?

larry · April 10, 2023, 3:55pm

I hadn’t thought about how you’d create overlays from Python. This was in the context of “how could we store the original source code of annotations in a .pyc file but only load them on demand”, so I was only working on how the Python compiler could create overlays.

If we want to extend Python to be able to write overlays, here’s my off-the-cuff idea: we add a new statement to a Python, perhaps

overlay <name>

This ends compilation into the current module, or the current overlay, and creates a new overlay and starts compiling to it.

Notes:

There’s no way to “resume” compiling into the module or an already-defined overlay.
Specifying overlay <name> with the same name twice raises an exception at compilation time.
The overlay statement can only be executed at module scope, unindented. (It can’t be in an if block, a class, a function, a for loop, a match block, etc.) Executing an overlay statement anywhere else raises an exception at compilation time.
Maybe Python reserves all overlay names that are strings starting with two underscores. If so, executing the overlay statement with one (e.g. overlay __something) raises an exception at compilation time.

Yeah, that’s probably better. I should learn OOP programming!

guido · April 10, 2023, 5:10pm

See also various discussions we’ve had about lazily loading parts of code objects, e.g. some of the threads returned by this query.

@markshannon ^^

larry · April 10, 2023, 5:12pm

You’ll notice I didn’t propose any implicit lazy-loading–for now you have to manually load your overlay. But objects could build lazy-loading mechanisms themselves. For now I’m thinking Python shouldn’t do any implicit lazy loading of overlays.

But here’s more off-the-cuff thinking, on how the module object could support lazily-loaded attributes. Name the attribute

('lazy', 'attribute_name')

When used with the overlay statement, this looks reasonable:

overlay lazy, attribute_name

The module object could overload __getattribute__. If someone references an attribute that doesn’t exist, and an overlay exists with the right name, it loads the overlay and tries again. (If the attribute still isn’t defined, hmph! The module would throw AttributeError like normal. And shame on you!)

It might be nice if modules knew what overlays they had available. Again, a module-level attribute, a set called __overlays_available__ or something.

Glenn · April 10, 2023, 5:38pm

So it has been said that the benefit of lazy loading is in two parts:
avoiding finding the file, and avoiding loading it.

This scheme doesn’t seem to have the first benefit, and seems mostly
aimed at having optional parts not loaded unless needed for memory saving?

On the other hand, if it supported a hierarchy of names, I could see the
whole stdlib in one .pyc. Not sure if there is any benefit versus the
whole stdlib in one .zip, with non-hierarchical lazy pieces.

Not going quite that far, one could assemble modules that have
submodules (os, os.path for example) into one file.

Either of these latter two could have some of the benefit of avoiding
finding the file.

Both would probably need external utilities to assemble the parts into a
whole.

Maybe, since you borrowed the directory structure from .zip, you should
just borrow .zip completely, at the cost of a few more header and
directory bytes, an uncompressed .zip could be your lazy load format. Or
even a compressed .zip.

larry · April 10, 2023, 5:50pm

It’s aimed at the latter. The use case was a hypothetical future in which we stored the source code to annotation expressions in the .pyc file, rather than rebuilding it at runtime as PEP 649 proposes to do. Folks pointed out that this might make modules much larger, other folks counter-proposed lazily loading the annotation expression strings, and here we are.

You seem to be proposing a very different facility. I encourage you to make your own proposal. For me, I’m not going to join you in this avenue of thinking. My idea for overlays was a small, cheap facility to allow delayed loading of parts of a module; I wasn’t trying to combine the entire standard library into a single file.

I don’t see much benefit, and it has some costs. Zip files deal with “filenames”, which doesn’t map neatly to “Python constants”. So e.g. ('lazy', 'attribute-name') wouldn’t work. We’d have to restrict overlay names to strings, or do something janky like store the repr of the overlay name.

It would also add a lot of unnecessary junk, like “uncompressed size” vs “compressed size” (if we store the code objects uncompressed), and a checksum, and attribute bits, and modification time.

I guess it would make it easy for external users to unpack (and maybe repack?) overlays. But what’s the use case for that?

Glenn · April 10, 2023, 6:56pm

Larry:

You seem to be proposing a very different facility. I encourage you to
make your own proposal. For me, I’m not going to join you in this avenue
of thinking. My idea for overlays was a small, cheap facility to allow
delayed loading of parts of a module; I wasn’t trying to combine the
entire standard library into a single file.

Absolutely! But there’ll be no proposal, just the idea.
I’d love to have time to become a Python developer, instead of just a
Python user, but it seems doubtful as my time is already over-committed.

I was just trying to point out that there might be a synergy between
this type of lazy loading, and other types of lazy loading that could be
solved with the same solution.

names probably should be readable strings anyway, and separating the use
cases is just a matter of avoiding name collisions between whole files
and subsets of files.

FFY00 · April 10, 2023, 11:57pm

Can you elaborate on the benefits of this proposal over just using __getattr__ (which we support on pure Python modules, per PEP 562) for lazy loading?

larry · April 11, 2023, 12:28am

I don’t understand your question. By “this proposal” do you mean the whole thing? Just part of it? If just a part, which part?

brettcannon · April 11, 2023, 12:51am

Technically the magic number is the only invariant as that versions the file format (along with the bytecode). Past that, you probably need to support the rest of what’s in the header, but you can do that however you want (albeit while breaking folks who make some assumption of the format which has changed over the years).

larry · April 11, 2023, 12:54am

Sorry, should have been clearer–I meant “invariant” as a practical matter. I’m not aware of CPython ever reading or writing a .pyc file in any other format, in any released version.

[edit]
Technically, there’s a third format: older versions of Python didn’t include the size. Back then a .pyc file was just

    <4-byte int magic>
    <4-byte int source file mtime>
    <code object>

I think that this list of CPython .pyc file formats is now exhaustive. It’s always been one of those three. (Unless there’s a fourth variant. If there is, I bet it’s from very early in the .pyc file’s history, from before I was paying attention to details like that.)

gpshead · April 11, 2023, 5:49am

The 4 + 8 bytes of stuff past the magic number used to just be 8 bytes of stuff pre-3.7. From the perspective of anything reading .pyc files that may as well have been a different format.

We’re free to do whatever we want to the rest of the format so long as we support the existing pyc feature concepts regardless of how they are stored by simply changing the magic number.

But I don’t think discussing binary file format layout implementation details is super exciting for now, there are plenty of ways to represent that. The more interesting one is motivating reasons why we might want to do it.

I can forsee a future where .pyc files can contain a lot more than they do today. Perhaps as you’re suggesting some of it will be “optional” stuff that isn’t always necessary to load. You came at this from a storing additional details about annotations perspective. Other ideas for additional pyc storage include the equivalent of debug info, unoptimized bytecode, lazy loaded docstring data, a portrait of our FLUFL, platform specific pre-deepfrozen structures, or even multiple precomputed native code translations - all possible future desires.

In terms of structure, we explicitly do not need to remain compatible with anything.

The only thing to not regress on is rapid loading for data needed during typical application startup imports. Some kinds of additional optional data we could add in the future could be desirable for fast application startup… So lets not overthink the format. We can reredesign the format as necessary for whatever our new motivating measured reasons are.

encukou · April 11, 2023, 8:15am

tl;dr: I don’t think this is the right mechanism for lazy-loading annotations.

When it comes to lazy loading annotations (__doc__, type annotations, line number tables, etc.), we should think about what happens when things are changed between the initial load and the lazy one:

What happens if the modules/classes are changed?
What happens when the .pyc is changed?

Some options, from most desirable to least:

It behaves as if everything was loaded eagerly
It behaves as if the annontation wasn’t present
Annotations are attached to the wrong object
Undefined behavior

It looks like PyImport_LoadOverlay would typically run code in a module whose classes/functions can be different from what it expected, e.g. it could inject annotations into unrelated classes. Optional C speedups is a common, if rather benign, example of what to watch out for.

If the .pyc changes, the overlay could attach annotations to the “wrong” object – though one with the same module & name, so it’s probably OK.

For reference, here’s a rough sketch of what I had in mind:

.pyc files will end with an additional marshalled tuple at the end, which doesn’t get unmarshalled at first
annotations (type.__doc__, ModuleType.__doc__, FunctionType.__doc__, whatever ends up storing type annotation data) become non-data descriptors
when first loaded, for each of those attributes type/module/function objects store a shared “stub” and an index within it
initially, the “stub” “remembers” the .pyc file and the offset at which the extra tuple starts
on first access to the “stub”, it unmarshals the tuple, and closes/unmaps the file. The stub keeps a reference to the tuple.
on first access to an attribute, the value is taken from the stub’s tuple and stashed in __dict__, and the “stub” is decref’d.

Ideally, the .pyc can be opened/mapped with copy-on-write semantics on major platforms, and we don’t blow limits on open files. We can fall back to eager loading on exotic platforms.

larry · April 11, 2023, 10:06am

I agree that figuring out the binary format isn’t hard. I wrote it up partially for fun, partially just to present this as being completely thought through and ready to go (or “shovel-ready” as a politician might say).

I assume we won’t move forward with my specific proposal, because I don’t think we have a burning need for the functionality yet. Our as-yet-unknown important use case will tell us what semantics we need once it arrives. I’d be surprised–if pleasantly so–if those semantics aligned neatly with my “overlay” proposal as I sketched it here.

All very true! I figured, the best way to not impede existing rapid loading was to leave the existing mechanism unchanged. As you surely noticed, my proposal wouldn’t impact the speed of the initial module load–in fact it should be identical to what it is today.

You seem to assume that in the overlay we’d attach the source code string annotations directly to the object, under some novel dunder name (perhaps __source_code_annotations__). That’s certainly one obvious and reasonable way, but it’s not the only way.

Call me weird–or overworked and distracted–but the scheme I actually had in mind was quite different, and would continue to produce correct results even if you shuffled the names around in the module after the initial load but before loading the “source code string annotations” overlay. As follows:

We define a new “format” for inspect.get_annotations and __compute_annotations__ etc, let’s call it SOURCE_CODE_KEY. This is a per-module unique identifier (probably just a monotonically increasing serial number) assigned during compilation to each annotated object. The __compute_annotations__ method generated for that object gets an extra bit of code added to the top:

if format == 5: # SOURCE_CODE_KEY = 5
    return 38 # this object's assigned SOURCE_CODE_KEY serial number

Then, in the overlay, we assign to a single module-scope attribute called __source_code_annotations__, a dict mapping these keys to the source code annotation strings.

overlay 'source code annotations'
__source_code_annotations__ = { 38: {'a': 'int', 'b': 'str'}, ... }

inspect.get_annotations(o, STRING) would then be implemented something like this:

if format == STRING:
    key = o.__compute_annotations__(SOURCE_CODE_KEY)
    m = o.__module__
    m.load_overlay('source code annotations') # assume it raises on error
    return dict(m.__source_code_annotations__[key])

Of course, this approach isn’t bullet-proof either. Malicious user code could rebind o.__globals__ to a different module, or pre-load the module and modify m.__source_code_annotations__, at which point we’d return the wrong source code annotations for STRING. It’s Python, you can only do so much.

encukou · April 11, 2023, 11:28am

I see.
Why does load_overlay run a code object? Could it not simply return the stored constant?

larry · April 11, 2023, 11:37am

I like the unlimited flexibility of running code. It seemed like approximately the same amount of work–or maybe even less–than a less flexible approach. After all, unless we severely restrict the lazy-loaded objects to constants expressible as marshalled objects, we’re going to have to run some code anyway.

edit: Of course, there are infinitely-many other approaches, and perhaps we’d prefer some sort of “happy medium” that gives us sufficient flexibility without untrammeled power. For example, we could store the lazy-loaded attributes as pickled objects rather than marshalled constants. That opens up the possibility of storing a lot larger variety of values as these lazy-loaded attributes. Of course, maybe pickle is a bad example; it’s also known to be insecure, so if security is a concern this wouldn’t help. I also don’t know how convenient it would be to use pickle as the serialization format, for the compiler or for user-created “overlays”. Again, these are all off-the-cuff ideas, I’m not making any serious proposals here.

FFY00 · April 11, 2023, 11:42am

Sorry, I was not clear! Let me try to clarify.

This is a fairly big change, so it’d make sense to have a strong motivation behind it. In your post, the main goal you mention is lazy-loading stuff, however, this is already possible via a module __getattr__ for eg., unless I am missing something here (which I very well might be ). So, my question is what kind of benefit do .pyc overlays have over the current available mechanisms?

IMO it’d make sense to provide a stronger, or perhaps clearer, argument for the motivation behind this proposal, which is what I was trying to get at here

encukou · April 11, 2023, 11:42am

Sure, but it’s bytecode generated by the compiler, and tightly coupled to it. It sounds like it should be more efficient to only serialize the parameters, and share the logic interpreter-wide.

larry · April 11, 2023, 11:51am

Module __getattr__ is where you’d put the code to load or compute a lazily-loaded attribute. But it’s not, itself, a lazy-load-things-from-disk technology. It doesn’t specify where to load things from, nor how they would get loaded, nor how to put them there in the first place. My “overlay” proposal answers those questions.