Thought experiment about adding extra loadable stuff to a .pyc file

I note that the rest of the .pyc file is also bytecode generated by the compiler and tightly coupled to it. So I’m not sure what you’re getting at here. It’s not like that’s a downside for my proposal.

If you’re suggesting that tightly-coupled-to-the-compiler is a misfeature, and you’d prefer an alternate approach that would be Python-version-independent, by all means propose one! But that wasn’t one of my goals.

Again, my use case was literally for data generated by the Python compiler as it went about its work, data we might want to lazy-load so we don’t waste memory on it in the likely scenario that we don’t need it at runtime. I think my proposal is a reasonable approach to solving that problem. Solving my problem didn’t need version independence, backwards/forward compatibility, etc.

Sorry, I guess I’m not communicating clearly.
Being tightly coupled is necessary (it’s, like, the advantage of .pyc over .py).
But being tightly coupled means we should know, in advance, what the code object for __source_code_annotations__ will do (given the parameters to the compiler function that generates it). So, serializing the bytecode and co_* data into each individual .pyc sounds wasteful: we only need to store the parameters.

I don’t understand the flexibility argument. I don’t see a situation where __doc__ would need running code, which suggests that a .pyc-level overlay¹ mechanism doesn’t need this flexibility. Type annotations might need to run code (I don’t know enough about those), but… code objects are marshallable constants, so inspect.get_annotations could expect a code object and run it itself:

if format == STRING:
    key = o.__compute_annotations__(SOURCE_CODE_KEY)
    m = sys.modules[o.__module__]  # btw, I think a fragile indirection is necessary here
    try:
        annotations = m.__source_code_annotations__
    except AttributeError:
        # first access: load the overlay and exec it, which defines the attribute
        code = m.load_overlay('source code annotations')
        exec(code, m.__dict__)
        annotations = m.__source_code_annotations__
    return dict(annotations[key])

I guess the name “extra data” would fit this better than “overlay”.
You could avoid remembering which overlays were loaded (__overlays__), and drop the force_reload argument, without (AFAICS) sacrificing flexibility.


I think we’re talking past each other. I don’t know what you’re talking about. What parameters? What compiler function that generates it?

I don’t follow this either. You seem to be saying “because one use case doesn’t need the flexibility of running code, I extrapolate that to mean that no use case would need the flexibility of running code”.

For the record, annotations often do need to run code. They’re expressions, and annotations that are type hints may need, for example, to construct objects defined in other modules.

In my specific use case, where all I need is a dict mapping strings to strings, I think it literally is a marshallable object. So we could solve my specific use case by loading a marshalled constant and storing it as an attribute on the module. But I didn’t realize that until just now, long after I made this proposal. Anyway, I would worry that such a solution would be too tailored to a specific use case and would make the facility useless for solving other problems.

For example, you couldn’t lazy-load a function with “load a marshalled constant and store it as an attribute on a module”, because you can’t marshal a function object. You can only marshal code objects; you have to run code in order to bind them to functions, at runtime.
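
To illustrate the distinction (a small, self-contained example, not from the proposal): a code object round-trips through marshal, but the function only comes into existence once that code is executed in a namespace.

import marshal

# A code object survives a marshal round trip...
code = compile("def greet(name):\n    return 'hello, ' + name", "<overlay>", "exec")
restored = marshal.loads(marshal.dumps(code))

# ...but the function object only exists after we run that code in some
# namespace, binding the name at runtime.
namespace = {}
exec(restored, namespace)
print(namespace["greet"]("world"))      # hello, world

# Trying to marshal the function itself fails:
# marshal.dumps(namespace["greet"])     # ValueError: unmarshallable object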

I concede that “overlay” isn’t a great name. Its main strength as a name is that it’s a rarely-used word, so in the context of this conversation if you say “overlay” everyone knows what you’re referring to. I’d contrast that with “extra data”, which is pretty generic and nonspecific. Also, the semantics I proposed enable not just “extra data”, but also extra functions and classes, and code that runs automatically when the thing (the “overlay”) is loaded.

Given those semantics, maybe “modification” is a better name. A little long, but “module” and “modification” start with the same three letters, so maybe that’d be a helpful mnemonic for the harried Python programmer.

Again, I’m not worried about it, as I remain dubious we’re going to go forward with this proposal.

Yeah, that’s probably it. We’re talking about different use cases, and focusing on different aspects of the proposal.
I’ll wait for the PEP…

… which might never come :)
Sorry for taking your time by nitpicking an initial idea!


In general I agree. This is a real-world problem I’ve seen in production with the way Python already works: if source or .pyc files are updated out from underneath a running application, it needs to not attempt to use them anymore. Hilarity ^W bugs that are hard to debug ensue in environments like Linux distro packages, where in-place upgrading of files is still a thing and swaps stuff out from underneath existing processes…

Ideally, any file we want to lazily load something from should be kept open from the time of first opening so that it sticks around (this is the happy path on POSIX; on Windows, which has strange semantics, it probably blocks in-place upgrades… but are those even a thing on Windows?). Our import mechanisms (yes, plural) do not do this.

In the absence of being able to hold the exact thing we want open and present as long as anything refers to it, we’d need a mechanism to confirm it matches what we expect. And code that might get something lazily would need to be prepared to handle an error or at least get None for the thing it rightfully expected to be there… that quickly becomes complicated.

(the thread is getting long, I haven’t read any past what I just quoted yet)

-gps

Keeping files open on Windows costs resources. Much better to close them if we can. If things changing out from underneath is a concern, grabbing the file timestamp at first access and making sure that the associated file is no newer than that when we eventually open it is probably safe enough.
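
A minimal sketch of that check, with invented names (not part of any existing loader API): record the .pyc’s mtime when the module is first loaded, and refuse the lazy read if the file is newer.

import os

def remember_mtime(pyc_path):
    # call this at initial module load
    return os.stat(pyc_path).st_mtime_ns

def read_pyc_if_unchanged(pyc_path, recorded_mtime_ns):
    # call this at lazy-load time; bail out if the file is newer than we remember
    if os.stat(pyc_path).st_mtime_ns > recorded_mtime_ns:
        raise ImportError(f"{pyc_path} changed since the module was first loaded")
    with open(pyc_path, "rb") as f:
        return f.read()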

In place upgrades while things are running aren’t really trustworthy at the best of times. At least the main installers are going to require that python.exe isn’t running before they do anything, but that can’t stop people from using pip/equivalent to modify libraries.

I don’t think this really matters. I’d be far more interested in contemplating a merged .pyc format (SQLite DB?) so that we can load a whole set of cached data in one go rather than individual filesystem requests.

This is all true. It’s also orthogonal to the lazy loading discussion at hand. As Greg observes, source files can already get edited after the initial load of a module, throwing the lnotabs etc out of date. Adding lazy loading neither ameliorates nor exacerbates this existing problem.

[edit]
Oh, my apologies! I thought you were only talking about changing source files. I missed it when you also asked, what happens if a .pyc file changes between its initial load and a later attempted lazy-load?

I don’t know what the right answer is here. All I can say is, any lazy-loading system is going to have to answer that question; the problem isn’t unique to my proposal. I don’t have any special insight into the problem. I imagine you’ve already thought of all the solutions I might come up with.

If you’re still interested in my opinion, I’m firmly in the “in the face of ambiguity, decline to guess” camp. So if the .pyc file changed between initial module load and later lazy-load, I’d want the lazy loader to throw an exception by default. I’d also want the user to be able to suppress that exception and force it to attempt to load anyway by supplying a keyword-only boolean argument set to True.
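
Concretely, something along these lines; every name here is invented for illustration, and the staleness test could be the timestamp comparison mentioned earlier in the thread:

def pyc_is_stale(module):
    # placeholder staleness test; a real loader might compare mtimes or hashes
    return False

def read_overlay_section(module, name):
    # placeholder for actually reading the overlay out of the .pyc
    return {}

def load_overlay(module, name, *, force=False):
    if pyc_is_stale(module) and not force:
        raise ImportError(f"{module.__name__}: .pyc changed since initial load; "
                          "pass force=True to attempt the load anyway")
    return read_overlay_section(module, name)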

Anecdote: one of the things that killed the long-defunct Monotone DVCS was its repository file format: a SQLite database. This was a permanent sandbag dragging down performance; IIRC, they wanted to get rid of it and change to a more efficient repository format, but the design choice was so baked in to the project (and the team working on it so small) that this was untenable. If Monotone had been fast enough for Linus when he played with it back in 2005 we might not have git (and Mercurial!) today.

So if we’re interested in performance at all I recommend against SQLite as a new .pyc file format. Really I’d recommend against using it for anything except relational databases. If we want to store arbitrary tagged data in .pyc files I’d prefer we roll our own simple tagged chunk file format a la RIFF.
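
For reference, a RIFF-style tagged-chunk layout is tiny to implement. Here’s an illustrative sketch (tags and layout invented, not a concrete proposal): each chunk is a 4-byte tag, a 4-byte little-endian length, then the payload.

import io
import struct

def write_chunks(stream, chunks):
    for tag, payload in chunks:
        stream.write(struct.pack("<4sI", tag, len(payload)))
        stream.write(payload)

def read_chunks(stream):
    while True:
        header = stream.read(8)
        if len(header) < 8:
            break
        tag, length = struct.unpack("<4sI", header)
        yield tag, stream.read(length)

# e.g. a bytecode chunk plus a lazily-loadable annotations chunk
buf = io.BytesIO()
write_chunks(buf, [(b"CODE", b"...marshalled code..."), (b"ANNO", b"...overlay...")])
buf.seek(0)
for tag, payload in read_chunks(buf):
    print(tag, len(payload))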

(Fun trivia: the creator / original project lead of Monotone was Graydon Hoare, who went on to great success as the creator of Rust. I met him once, at a Monotone summit, and NJS probably knows him really well as I remember him kind of being Graydon’s right-hand-man at the time.)


Unix is similar. My Mac can have only 256 fds by default.

$ ulimit -n
256

This setting can be configured, but Python should work fine with the default setting.

I guess you would raise this limit if that is required for Python to work well.
The hard limit is 9223372036854775807 on my mac.

I can, of course.
But if Python starts keeping hundreds of .pyc files open, many Python users would have to do it too.

I think we need to store the bytecode cache in one file per site for lazy loading (and mmap-ping, hopefully).
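
Purely as illustration (the path and index are invented), mmap-ping one combined per-site cache file would let the OS page in only the parts that actually get touched:

import mmap

def open_site_cache(path):
    # map the combined cache read-only; pages are faulted in only when a
    # slice is actually accessed
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# cache = open_site_cache(".../site-packages/__bytecode__.cache")  # hypothetical
# code_bytes = cache[offset:offset + length]  # offsets from a hypothetical index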

I was thinking that Python would raise the limit when it starts up if it is deemed that 256 is too small.

I don’t know about Mac and Windows, but on Linux, raising the per-process file limit is a systemwide setting and requires privilege escalation. It would be tiresome if running a userspace Python script required typing in your password and modifying a systemwide setting. And if you ran two Python scripts at the same time, how would they negotiate who reset the setting after they complete?

[edit]

And what if your account was on a locked-down machine, such as a corporate laptop, and you didn’t have privilege escalation privileges?

I think maybe I understand your proposal now. If I understand you correctly, you point out that we know in advance what the code will do. So, instead of storing the overlay as a code object, we simply run the code at compile-time, then simply observe what changed, and simply store that in the “overlay”. If that’s what you’re proposing, I see some complications to that approach.

First, I think we hit the halting problem there.

Second, this would mean analyzing what the overlay did. So we’d have to have insight into the makeup of the objects in the module, so we could compare them before and after. For pure Python objects, it’s possible to completely understand their structure, but if there were objects whose implementation is in C (etc) we might not have visibility into their internal structure.

Third, what if the code in the overlay isn’t a pure function?

  • What if it reacts to information accreted over time by code in the original module?
  • What if it loads information from an external file, which could change after the .pyc file is generated?
  • What if the overlay opened a file and appended to it?
  • What if the intention was for the initial module to load, then compute some information that might change from one run to the next, and the code in the overlay reacts to the computed information?
  • What if the overlay loads a new module, then instantiates types from that module?

Fourth, if code in the “overlay” constructs objects that can’t be marshaled, it follows that we can’t use marshal to recreate them when the overlay is loaded. The only technique guaranteed to be able to reproduce the objects is a code object. And if we have to use a code object anyway, what has this approach saved us?

The C runtime on Windows supports 8192 open file descriptors. The maximum number of FILE streams is 512 by default, but it can be extended to 8192 via _setmaxstdio().

At the OS level, there’s no practical limit to the number of open handles in a process. It’s in the millions, subject to available memory.

In terms of deleting an open file, starting with Windows 10[1], the NTFS filesystem[2] supports POSIX delete semantics. In this case, the file gets unlinked as soon as the open that’s used to delete it is closed. That’s in contrast to a classic delete on Windows, in which a deleted file is only unlinked from its parent directory after all opens have been closed, which prevents deleting the parent directory. There are a couple of caveats to note:

  • All opens of the file that have read, write, or delete data access must share delete access, else trying to open the file with DELETE access in order to delete it will fail as a sharing violation (error 32).
  • A PE binary file cannot be mapped as a process image – i.e. referenced by a section object with the SEC_IMAGE attribute (e.g. a loaded EXE/DLL), else the delete request will fail with access denied (error 5). Regular data mappings are allowed (e.g. mmap.mmap).

If a file descriptor is needed, open with delete sharing using CreateFileW(), and then get a file descriptor via _open_osfhandle(). Use fdopen() to get a C FILE stream, if needed.
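
As a rough, Windows-only sketch of that recipe (flag values from the Win32 headers; untested here, so treat it as illustrative):

import ctypes
import msvcrt
import os

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
kernel32.CreateFileW.restype = ctypes.c_void_p
kernel32.CreateFileW.argtypes = [
    ctypes.c_wchar_p, ctypes.c_uint32, ctypes.c_uint32, ctypes.c_void_p,
    ctypes.c_uint32, ctypes.c_uint32, ctypes.c_void_p]

GENERIC_READ         = 0x80000000
FILE_SHARE_READ      = 0x00000001
FILE_SHARE_WRITE     = 0x00000002
FILE_SHARE_DELETE    = 0x00000004
OPEN_EXISTING        = 3
INVALID_HANDLE_VALUE = ctypes.c_void_p(-1).value

def open_with_delete_sharing(path):
    # open the file with delete sharing, then wrap the handle as a CRT fd
    handle = kernel32.CreateFileW(
        path, GENERIC_READ,
        FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
        None, OPEN_EXISTING, 0, None)
    if handle == INVALID_HANDLE_VALUE:
        raise ctypes.WinError(ctypes.get_last_error())
    fd = msvcrt.open_osfhandle(handle, os.O_RDONLY)
    return os.fdopen(fd, "rb")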


  1. Python 3.12 is the first version that requires at least Windows 10. ↩︎

  2. Not exFAT or FAT32, and probably not most non-Microsoft filesystems – not yet at least. ↩︎

On Fedora and other Unixes I have used there are two settings: the default (soft limit) for a resource and a max (hard limit) for that resource.
In the case of Fedora 37 I see this (I have not changed the Fedora-installed config).

>>> import resource
>>> resource.getrlimit(resource.RLIMIT_NOFILE)
(1024, 1048576)
>>> resource.setrlimit(resource.RLIMIT_NOFILE, (1048576, 1048576))
>>> resource.getrlimit(resource.RLIMIT_NOFILE)
(1048576, 1048576)

I can raise to 1M FDs without privs per process.

I don’t think the Monotone experience is very relevant here – sqlite was never really an issue as I remember it (people blamed it because people love to blame things on “bloat”, but the reality was that monotone was research code holding production data, so everything was optimized for flexibility and data integrity checking, neither of which are conducive to fast highly-tuned code), and anyway sqlite of 10-15 years ago is a totally different beast from sqlite today.

Yeah, fd limits are an archaism that folks are trying to untangle – see File Descriptor Limits. Practically speaking they’ll still be an issue for a while, at least in some situations.

Throwing out another idea: consider an intermediate step of loading the .pyc file into memory as a flat chunk of bytes, but then being lazy about parsing parts of it out into full-fledged useful structures. That might give a substantial memory win in the common case while avoiding all the tricky issues raised by keeping some data in memory and some on disk.
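
A toy illustration of that shape (the section index here is invented; a real format would need to carry one): read the whole file in one go, but only unmarshal a section the first time something asks for it.

import marshal

class LazyPyc:
    def __init__(self, raw, sections):
        self.raw = raw              # the whole .pyc, read in one shot
        self.sections = sections    # name -> (offset, length), from a hypothetical index
        self._parsed = {}

    def load(self, name):
        if name not in self._parsed:
            offset, length = self.sections[name]
            self._parsed[name] = marshal.loads(self.raw[offset:offset + length])
        return self._parsed[name]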


I wouldn’t be at all surprised if I was out of date on SQLite. I’ve only glancingly used it, and not for years now.

If you’re reasonably conversant, let me ask: I remember reading once that SQLite’s on-disk format was really just the log of all the SQL statements used to create the current state of the database, from the earliest CREATE DATABASE forwards. Was that ever true? Is the current on-disk format an efficient binary format?

And, blue-sky, ignoring for now other salient aspects of the discussion, and really just out of idle curiosity: do you think a SQLite database would be a reasonable solution for .pyc files in this day and age?

Would be easy to prototype. Having SQLite define the format opens up lots of possibilities for adding custom sections.

You could take the contents of a .pyc and put it into a SQLite .db and see what the size and access speed are like.
First with pure Python, then using SQLite’s C API – that is very nice, BTW.
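
A rough starting point for such a prototype (table layout invented): shovel each .pyc’s bytes into a SQLite database keyed by module name, then time the lookups.

import sqlite3

def build_cache_db(db_path, pyc_paths):
    # pyc_paths: mapping of module name -> path to its .pyc
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS pyc (name TEXT PRIMARY KEY, data BLOB)")
    for name, path in pyc_paths.items():
        with open(path, "rb") as f:
            con.execute("INSERT OR REPLACE INTO pyc VALUES (?, ?)", (name, f.read()))
    con.commit()
    return con

def read_pyc(con, name):
    row = con.execute("SELECT data FROM pyc WHERE name = ?", (name,)).fetchone()
    return row[0] if row else None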

Oh, and SQLite has a 20(?) year API stability guarantee.

If we made CPython use SQLite for language-visible features, would that force MicroPython to ship SQLite too?

What language visible features are you thinking of? The language doesn’t even assume a file system. The stdlib does, but micropython presumably doesn’t require one (it doesn’t support much of the stdlib).

One blue-sky proposal is that lazily-loaded module attributes could use a SQLite database as their on-disk storage format. The proposal didn’t make it clear whether the feature would require, or expose, advanced features of SQLite; if it did, this would explicitly require other implementations to also depend on SQLite, which I suspect would be a hardship for MicroPython.