Why do bytecode files store the length of the corresponding source code?

kknechtel · April 28, 2024, 8:57pm

My research so far has told me:

Up until Python 3.2 (including all 2.x branches), Python (at least CPython) bytecode (.pyc) files consisted of: a 4-byte magic number (2 bytes identifying the bytecode version, plus a constant 0d 0a used to detect corruption by reading or writing in text mode); a 4-byte timestamp; and a marshalled code object.
Python 3.3 added a 4-byte value after the timestamp representing the length in bytes of the corresponding source code file.
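Python 3.7 implemented PEP 552, thereby adding a 4-byte bit field before the timestamp field (when its lowest bit is set, the 8 bytes that would otherwise hold the timestamp and source length are interpreted as a 64-bit SipHash of the source instead; a second bit controls whether that hash is checked against the source - incidentally, the corresponding link in the PEP is broken).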
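
To make the 3.7+ layout concrete, here is a minimal sketch of decoding the 16-byte header. The function name `parse_pyc_header` and the dictionary keys are made up for illustration. The field order follows PEP 552: magic, flags, then either mtime and size, or an 8-byte source hash.

```python
import struct


def parse_pyc_header(header: bytes) -> dict:
    """Decode the 16-byte header of a CPython 3.7+ .pyc file (illustrative helper)."""
    if len(header) < 16:
        raise ValueError("need at least 16 bytes of header")
    # 4-byte magic number, then a 4-byte little-endian bit field (PEP 552).
    magic, flags = struct.unpack("<4sI", header[:8])
    info = {"magic": magic, "flags": flags}
    if flags & 0x1:
        # Hash-based pyc: the remaining 8 bytes are a hash of the source.
        info["source_hash"] = header[8:16]
    else:
        # Timestamp-based pyc: source mtime, then source size, 4 bytes each.
        mtime, size = struct.unpack("<II", header[8:16])
        info["source_mtime"] = mtime
        info["source_size"] = size
    return info
```

For a real file, something like `parse_pyc_header(open(path, "rb").read(16))` should work, where `path` points at a `.pyc` written by Python 3.7 or later.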

What I don’t understand is why the source-length field was added. The “What’s New in 3.3” doc tells me:

importlib.abc.SourceLoader.path_mtime() is now deprecated in favour of importlib.abc.SourceLoader.path_stats() as bytecode files now store both the modification time and size of the source file the bytecode file was compiled from.

While there are many other mentions of changes to the import process and bytecode format, this was the only thing I could find in the document that concretely describes a purpose for, or use of, the new field. I also don’t see a PEP describing the addition of this field.

Surely it wasn’t added just so that importlib.abc.SourceLoader.path_stats() could exist and get both pieces of information in the same place?

In what contexts does this information actually matter?

jamestwebber · April 29, 2024, 12:06am

From the changelog:

I don’t know why that helps but maybe the bpo-13645 discussion would have details.

cameron · April 29, 2024, 9:35pm

Without having looked at the bpo, let me describe a scenario (which I’ve
invented out of whole cloth).

You want to know if the pyc’s bytecode is current so that you can avoid
remaking it. If you do that based just on the modification time of the
pyc and the source it is possible that a sufficiently rapid source
update will not be recognised as “new”, thus not invalidating the stored
byte code.

Imagine if you will some process which goes:

checkout some code revision, or just download some code, time X
compile the bytecode, time Y
update the code with a patch/diff, time Z

In a script (eg a build system), this could be quite fast.

If the patch is fast enough, the time Z will not be greater than Y as
stored in the filesystem timestamp, or in the 4-byte timestamp. And the
bytecode will not look out of date.

By including the file size the likelihood of this error is greatly
reduced.
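
A minimal sketch of that freshness check, assuming the cached state is the source's mtime (truncated to whole seconds) and its size, both masked to 32 bits because each field is 4 bytes wide. The names `source_state` and `is_stale` are hypothetical, and this is not the import system's actual code.

```python
import os


def source_state(path: str) -> tuple[int, int]:
    """Return (mtime, size) of the source, each reduced to a 32-bit value."""
    st = os.stat(path)
    return int(st.st_mtime) & 0xFFFFFFFF, st.st_size & 0xFFFFFFFF


def is_stale(recorded_mtime: int, recorded_size: int, path: str) -> bool:
    """True if the source no longer matches the recorded (mtime, size) pair."""
    mtime, size = source_state(path)
    # An edit within the same second leaves mtime unchanged,
    # but it usually changes the size, which this second test catches.
    return mtime != recorded_mtime or size != recorded_size
```

An edit that keeps both the truncated mtime and the byte count identical would still slip through, which is why this only reduces the likelihood of the error rather than eliminating it.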

Similarly, the rsync(1) command has for many many years, possibly since
inception, used file (size,mtime) as a heuristic: if the local and
remote files have the same (size,mtime) then it does not examine their
contents for changes. This is very fast. When it’s correct.

With respect to low-resolution timestamps, modern OSes and filesystems store a
fair bit of precision. But the rsync(1) man page mentions:

 --modify-window
     When comparing two timestamps, rsync treats the timestamps
     as being equal if they differ by no more than the modify-window
     value.  This is normally 0 (for an exact match), but you
     may find it useful to set this to a larger value in some
     situations.  In particular, when transferring to or from
     an MS Windows FAT filesystem (which represents times with
     a 2-second resolution), --modify-window=1 is useful (allowing
     times to differ by up to 1 second).

This is kind of the inverse problem: an option to make the rsync command
more likely to consider files the same with this quick heuristic.
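
As a rough illustration of that quick heuristic (not rsync's actual implementation; the function name and parameters are invented here), the comparison amounts to:

```python
def same_by_quick_check(
    size_a: int, mtime_a: float, size_b: int, mtime_b: float, modify_window: float = 0
) -> bool:
    """Treat two files as unchanged if sizes match and mtimes are within the window."""
    return size_a == size_b and abs(mtime_a - mtime_b) <= modify_window
```

With `modify_window=1`, two files of equal size whose timestamps differ by one second compare as the same; with the default of 0 they do not.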

Only last night I was refactoring a little content cache I use for
content checksums, and considering exactly this issue (in no small part
because I’ve used rsync for many years and knew of the above scenario),
and accordingly my default “is this cached value current” state function
is this:

 # Requires `import os` at module level.
 @staticmethod
 def stat_size_mtime(
     fspath: str, round_mtime=int, follow_symlinks=True
 ) -> dict:
   ''' Return the default cache state mapping.
       This function `stat`s the `fspath` and returns
       `{'st_size': st.st_size, 'st_mtime': round_mtime(st.st_mtime)}`.
       Symlinks are followed unless `follow_symlinks` is false.
   '''
   st = os.stat(fspath) if follow_symlinks else os.lstat(fspath)
   return dict(st_size=st.st_size, st_mtime=round_mtime(st.st_mtime))

See the default rounding function, round_mtime=int? It truncates the mtime to whole seconds.

kknechtel · April 29, 2024, 10:53pm

Thanks, I think that makes a complete answer, then.