Thought experiment about adding extra loadable stuff to a .pyc file

Aren’t .pyc files an implementation detail of CPython and not a feature of the Python language?

MicroPython will not be forced to do anything as a result of .pyc implementation changes I assume.

2 Likes

The internal structure of .pyc files is an implementation detail. However, the existence of .pyc files is a defined part of the language. They aren’t mandatory, but e.g. PyPy uses them. I believe MicroPython doesn’t.

If CPython used SQLite as an on-disk storage format, CPython would presumably vendor SQLite, both for stability and to ensure it was always available. It turns out SQLite has a pretty large footprint. The current source tree compresses into a 2.6MB ZIP file, and when compiled with size optimizations the library is “slightly larger than 500k” on x86_64.

If Python takes advantage of SQLite features to implement its on-disk storage format for lazy-loaded attributes, and defines language-level behavior based on those features, it seems like that would require other implementations to also use (and presumably vendor) SQLite, to replicate those features and that defined behavior.

And if Python isn’t taking advantage of SQLite features to implement its on-disk storage format for lazy-loaded attributes… why bother using it? That’s a pretty big library to add to CPython and then not particularly use.

Using SQLite as a simple key-value store is overkill–there are faster and more efficient solutions, including implementing one ourselves. ZIP might be a better choice, if we abused the “filename” field to smuggle in a marshalled object. (It appears the ZIP file format could encode a “filename” containing arbitrary bytes, including embedded NUL characters, as it stores the filename’s length separately. However, ZIP tools might be allergic to this, as ZIP was invented on DOS, where there are many restrictions on permissible filename characters.)
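As a quick sanity check on why the filename trick is dicey (and what the tamer variant of storing marshalled blobs as ordinary ZIP entries looks like), here's a sketch; the entry name `spam.__doc__` is invented for illustration:

```python
import io
import marshal
import zipfile

# A marshalled object is full of NUL bytes, which is exactly why
# stuffing it into a ZIP "filename" field would upset most tools.
code = compile('"""My docstring"""', "<overlay>", "exec")
blob = marshal.dumps(code)
print(b"\x00" in blob)  # True

# The tame variant: store the blob as member *data* rather than
# abusing the filename field, so ordinary ZIP tools stay happy.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("spam.__doc__", blob)

with zipfile.ZipFile(buf) as zf:
    restored = marshal.loads(zf.read("spam.__doc__"))

ns = {}
exec(restored, ns)
print(ns["__doc__"])  # My docstring
```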

It is already included in the stdlib, and SQLite has an astonishing stability guarantee. From the SQLite web site’s “High Reliability” page:

The developers of SQLite intend to support the product through the year 2050. To this end, the source code is carefully documented to promote long-term maintainability. We prefer mature and stable over trendy and cutting-edge.

It is currently an optional dependency, in that it is possible to build CPython without it. Larry’s idea might require SQLite for building CPython, which currently isn’t the case.

2 Likes

Wasn’t my idea. Steve Dower proposed it in this thread. So far I haven’t particularly liked or championed the idea; I’m willing to have my mind changed but so far it hasn’t happened.

Not that it’s important of course, I doubt the proposals in this thread are going anywhere for now. As I said before:

I don’t know if it was ever true, but it certainly hasn’t been true anytime recently. SQLite’s a full-fledged RDBMS with a transactional page store, B+ trees, constraints, query planner, the works.

Dunno what the point would be. SQLite’s surprisingly usable as a superpowered on-disk or in-memory data structure, and can be used for things like complex office documents. But why would you need to do indexed lookups or joins in a .pyc? I can’t think of any reason you’d want more than some blobs plus an index to them.

/me deletes a rambling reply

This thread seems like it’s going backwards: as proposed it’s a solution looking for its problem.
The implementation will depend on the use cases. We know about a few of those, but IMO we need to focus on those first. That’ll answer questions like “do we need __overlays_available__?”, “do we need to run code?”, or “are the names str/int, or arbitrary marshallable constants?”, which should then drive the discussion about the API and storage mechanism.

4 Likes

For the use cases that need that, you can marshal a code object, and run it in a higher layer, like the function that gets annotations.
I don’t know enough about the intended use cases of overlays. Type annotations, __doc__, line number tables… What else? Why would you need to open files in an overlay?

For use cases that only need a constant (think __doc__), generating and running code that stashes it in some attribute, and then retrieving it from there, is overhead that can be avoided.

Compilers have been generating clever code for a long time for this kind of additional metadata (I just spent a whole day generating native function tables for my generated native functions to enable native call stack profiling, so my head is still there).

How about this: the extra loadable stuff gets stored in a secondary .pyc as a set of functions that return the needed information. Over time, the compiler can learn how to defer particular calculations into this file as we decide they’re needed.

I’ll try and show a __doc__ example, and hopefully the reader can extend the idea to the more complex options.

We start with my code file:

def spam():
    """My docstring"""
    my_code()

Today, this gets compiled into my_code_file.pyc[1] containing roughly:

  • constant "My docstring"
  • bytecode to execute spam’s body
  • constant "spam"
  • bytecode to execute FunctionType("spam", spam_bytecode, "My docstring")
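You can poke at this layout yourself: the docstring really does travel as a constant of `spam`’s code object inside the compiled module (exact const ordering varies between CPython versions, so treat this as illustrative):

```python
import types

src = '''
def spam():
    """My docstring"""
    my_code()
'''
module_code = compile(src, "my_code_file.py", "exec")

# The module's constants include spam's own code object; the
# docstring lives inside that nested code object's constants.
spam_code = next(
    c for c in module_code.co_consts if isinstance(c, types.CodeType)
)
print(spam_code.co_name)                      # spam
print("My docstring" in spam_code.co_consts)  # True
```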

If we assume that most of the time, the docstring isn’t interesting, we could instead generate a second my_code_file-overlay.pyc from:

def get_spam___doc__():
    return """My docstring"""

And now my_code_file.pyc contains:

  • bytecode to execute spam’s body
  • constant "spam"
  • bytecode to execute FunctionType("spam", spam_bytecode, _overlay_docstring_getter)

Where that _overlay_docstring_getter is a new (internal) getter that knows how to import my_code_file-overlay in context and call get_{name}___doc__ to return the docstring. So the main .pyc no longer contains rarely used string literals, and we get smaller files that load quicker for the main code.
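A rough pure-Python sketch of that getter, with the overlay module faked in-memory instead of living on disk; everything here (`my_code_file_overlay`, the `get_{name}___doc__` convention, `_OverlayDocstringGetter`) is invented to mirror the idea, not a real CPython API:

```python
import importlib
import sys
import types

# Fake the overlay module in-memory; in reality this would be
# my_code_file-overlay.pyc loaded lazily from disk.
overlay = types.ModuleType("my_code_file_overlay")
exec('def get_spam___doc__():\n    return """My docstring"""', overlay.__dict__)
sys.modules["my_code_file_overlay"] = overlay

class _OverlayDocstringGetter:
    """Hypothetical stand-in: resolves the docstring on first access."""

    def __init__(self, overlay_name, func_name):
        self.overlay_name = overlay_name
        self.func_name = func_name
        self._doc = None

    def fetch(self):
        if self._doc is None:
            mod = importlib.import_module(self.overlay_name)
            self._doc = getattr(mod, f"get_{self.func_name}___doc__")()
        return self._doc

getter = _OverlayDocstringGetter("my_code_file_overlay", "spam")
print(getter.fetch())  # My docstring
```

In the real thing the getter would be processed internally before a user ever sees it; here `fetch()` stands in for whatever hook `obj.__doc__` would trigger.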

Basically, it’s lazy micro-imports, built pretty much on the same system we already have, just massively complicating the compiler :wink: But since the compiler is choosing where to place the extra information, it should be able to restrict itself to things that match safe patterns, and over time we can expand it without impacting compatibility (e.g. we could detect/decorate rarely used functions and not load their definitions until used).

Clever users can even inspect it themselves and potentially experiment with new patterns. No need for a new public API (though having one would of course be helpful).


  1. Yes, I’m skipping irrelevant details like cache tags. This is an idea, not a specification. ↩︎

  • So if my_code_file-overlay.pyc contained two of your “overlays”, it might be opened/loaded/run twice?
  • What arguments would you pass in to _overlay_docstring_getter? No arguments? '__doc__'? 'spam', '__doc__'?

Haven’t thought these through completely, but off the top of my head:

Once it’s loaded, it’s loaded. Just like importing a module. So we’d want to move data there that is all clearly for “the other” scenario, whatever that may be. (But since probably 99% of modules don’t normally need docstrings or annotations or anything, we still get a serious win even if we load docstrings for one module because someone asked for the annotations.)

Just the name of the thing to get the docstring for, and the module it belongs to, both of which we should know already when loading the main .pyc. It could be a special marker or a normal-ish function, it doesn’t really matter, because we get to process it internally before a user would ever see it - they’d just do obj.__doc__ (or help(obj)) as usual.

Modules are reachable through sys.modules. Once my_code_file-overlay.pyc is loaded, where does the stuff inside it live? Is it a module?

Or, are you saying, lazy-loading any single thing inside my_code_file-overlay.pyc lazy-loads all the things, and it’s one-and-done?

So if you loaded the docstring for a method on a class, the name passed in to the function would contain a dot? It addresses “object I’m going to glue a __doc__ to” relative to the module namespace, using the names as they were defined at the time the compiler compiled it?

How would this scheme lazy-load the docstring and annotations to a function defined inside another function (probably with a closure) that is returned and gets stored somewhere?

Okay, maybe we store an integer in the main .pyc and have a lookup for it in the other one. It’s our compiler, we can make it do whatever we need.

The point was to throw in a slightly different idea to defer loading so much data while making the most use of existing infrastructure, including arbitrary code execution on load, and hopefully in a way that works with existing and future innovations based on that infrastructure. I wasn’t trying to spec out every possible edge case.

I don’t think this really matters. I’d be far more interested in contemplating a merged .pyc format (SQLite DB?) so that we can load a whole set of cached data in one go rather than individual filesystem requests.
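For what a merged cache might look like, here's a toy sqlite3 sketch; the `pyc_cache` schema and module set are made up, and a real design would need cache tags, invalidation, etc.:

```python
import marshal
import sqlite3

# Hypothetical merged bytecode cache: one database holds the
# marshalled code objects for many modules, so a whole set of
# cached data loads with a single file open.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pyc_cache (module TEXT PRIMARY KEY, code BLOB)")

for name, src in {"spam": "x = 1", "eggs": "y = 2"}.items():
    blob = marshal.dumps(compile(src, f"{name}.py", "exec"))
    db.execute("INSERT INTO pyc_cache VALUES (?, ?)", (name, blob))

# One query instead of one filesystem request per module.
codes = {
    name: marshal.loads(code)
    for name, code in db.execute("SELECT module, code FROM pyc_cache")
}
ns = {}
exec(codes["spam"], ns)
print(ns["x"])  # 1
```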

Surely if files are the best way to store the data then a single zipfile containing all necessary fixtures would indeed allow you to load a whole set of cached data in one go. Not only that, the recent enhancements to importlib.resources make it easy to distribute the zip file alongside your code. Is a modification to .pyc files really needed?
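For the record, reading a file shipped inside a package is a one-liner with importlib.resources; here `json` and `__init__.py` merely stand in for your own package and a bundled data file such as a zipped overlay:

```python
from importlib.resources import files

# Traversable API (Python 3.9+): works the same whether the package
# lives on the filesystem or inside a zip archive.
data = (files("json") / "__init__.py").read_bytes()
print(len(data) > 0)  # True
```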

1 Like

Adding more files makes an existing problem worse. We’re talking about data that is directly tied to the exact contents of today’s .pyc file. Adding another file is another thing that will (from experience: will, not can) get out of sync and cause difficult problems as a result. Keep it in the existing .pyc file as extra sections.

The idea is not to do anything that’d be slower than loading today’s pyc files at process startup today. Just opening up opportunities by adding additional information. We don’t need to read and process the whole file. Some of it may be useful outside of the process (tooling doing code flow analysis for example). Some of it could be useful in process on an as rarely needed on-demand basis (docstrings for example?). None of that would be touched at import time pyc loading. Nothing gets slower. Nothing possibly gets out of sync worse than today. New opportunities are provided.