Change .pyc file format to record more accurate timestamp and other information

This topic is related to the issue "pycache 32-bit timestamp seems to have second granularity and uses pycache when they're the same" (Issue #121376 · python/cpython · GitHub).

#!/bin/sh 
 
mkdir -p src 
rm -fr src/__pycache__ 
rm -fr __pycache__ 
 
echo 'from src.test import f' > main.py 
echo 'print(f())' >> main.py 

echo 'def f():' > src/test.py 
echo '    return "Hello!"' >> src/test.py 
# prints "Hello!"
python3 main.py 

echo 'def f():' > src/test.py 
echo '    return "Goodbye!"' >> src/test.py 
# also prints "Hello!", but note the file does not say that anymore
python3 main.py

In the script above, Python uses the old bytecode cache because of the timestamp issue (the 32-bit timestamp has one-second granularity, so the old source and the new source appear to have been written at the same time).

So I think we may need to change the timestamp from 32 bits to 64 bits to avoid this issue. Or could we provide a more stable signature for the pyc file?

cc @storchaka

A better way already exists, PEP 552, but I don’t think there is a way to always generate such files; you need to do that manually with e.g. compileall.
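For illustration, here is a minimal sketch of producing a PEP 552 checked-hash pyc for a single file with py_compile and looking at the hash it stores. The file name example.py and the helper compile_checked_hash are made up for this example, and the sketch assumes the 8 bytes after the flags field hold the value returned by importlib.util.source_hash.

```python
import importlib.util
import os
import py_compile
import tempfile

def compile_checked_hash(source_path):
    """Compile source_path to a hash-based pyc and return the pyc path."""
    return py_compile.compile(
        source_path,
        doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.CHECKED_HASH,
    )

original = b'def f():\n    return "Hello!"\n'
edited = b'def f():\n    return "Howdy!"\n'  # same length, different content

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "example.py")  # hypothetical file name
    with open(src, "wb") as fh:
        fh.write(original)
    pyc = compile_checked_hash(src)
    with open(pyc, "rb") as fh:
        stored_hash = fh.read(16)[8:16]  # bytes 8..15 of the header

# The stored hash matches the original source but not the same-length edit,
# so the edit is detected regardless of mtime or size.
hash_matches_original = stored_hash == importlib.util.source_hash(original)
hash_matches_edited = stored_hash == importlib.util.source_hash(edited)
print(hash_matches_original, hash_matches_edited)
```

For a whole tree, compileall accepts the same choice on the command line, e.g. python -m compileall --invalidation-mode checked-hash src.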

The problem is that timestamp resolution finer than 1 second is not guaranteed to be available across OSes.

But also, are you completely sure that your bash script does what you think it does? The pyc file should have the source file length in addition to the timestamp, and from what I can tell you are changing the length.
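To make the point about the length concrete, here is a rough sketch of the timestamp-mode check as I understand it: the cache is accepted only if both the truncated mtime and the source size match. The function name cache_is_fresh and the mtime value are hypothetical; this imitates the idea, it is not the importlib code itself.

```python
def cache_is_fresh(cached_mtime, cached_size, source_mtime, source_size):
    """Sketch of the timestamp-mode check: both 32-bit fields must match."""
    same_mtime = cached_mtime == (int(source_mtime) & 0xFFFFFFFF)
    same_size = cached_size == (source_size & 0xFFFFFFFF)
    return same_mtime and same_size

hello = 'def f():\n    return "Hello!"\n'
goodbye = 'def f():\n    return "Goodbye!"\n'
howdy = 'def f():\n    return "Howdy!"\n'  # same length as the "Hello!" version

mtime = 1_720_000_000.25  # hypothetical; all writes land within the same second

# What the pyc would record when the "Hello!" version is compiled.
cached = (int(mtime) & 0xFFFFFFFF, len(hello.encode()))

# Same second, different length: the stale cache is rejected.
stale_detected = not cache_is_fresh(*cached, mtime + 0.5, len(goodbye.encode()))

# Same second, same length: the stale cache would be accepted.
stale_missed = cache_is_fresh(*cached, mtime + 0.5, len(howdy.encode()))

print(stale_detected, stale_missed)
```

Under these assumptions, only an edit that keeps the size unchanged and lands in the same second slips through.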

Actually, PEP 552 is a good choice. I think it would be better if hash mode were the default.

I am sure this has been discussed before. I suspect it isn’t done because reading in the entire source can be very slow, which defeats the point of the cache. For the detailed discussions you would have to search around a bit.

I have the following plans for a future pyc format:

  • Larger (at least 4 bytes, preferably 8 bytes) signature to detect pyc files. Currently the starting bytes of the pyc file change with every Python version (and several times during development), so utilities like file need to support a list of signatures for different Python versions.
  • Explicit Python version. Currently the py launcher maintains a table of Python versions to detect Python version.
  • More precise timestamp of the source file. Most filesystems support finer than one-second precision. This will reduce the chance of errors when the source file is very quickly overwritten with new content of the same size.
  • Larger range for the timestamp. This is not critical. Currently the problem can only occur after the 2120s when loading pyc files created in the 1990s.
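The arithmetic behind the last two points can be sketched as follows. This is only an illustration: the file name is made up, and the sub-second part you see depends on what the filesystem actually records.

```python
import os
import tempfile

# Range: a 32-bit seconds counter wraps after 2**32 seconds.
SECONDS_PER_YEAR = 365.2425 * 24 * 60 * 60
wrap_years = 2**32 / SECONDS_PER_YEAR

# Precision: compare the nanosecond mtime with what survives in whole seconds.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.py")  # hypothetical file name
    with open(path, "w") as fh:
        fh.write("x = 1\n")
    mtime_ns = os.stat(path).st_mtime_ns

stored_seconds = (mtime_ns // 10**9) & 0xFFFFFFFF  # what a 32-bit field can hold
lost_ns = mtime_ns % 10**9                         # sub-second part that is dropped

print(f"wraps after about {wrap_years:.1f} years")
print(f"stored {stored_seconds} s, dropped {lost_ns} ns")
```

A wrap period of roughly 136 years is what puts the collision for a 1990s file into the 2120s.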

We can also consider larger changes. For example, change the marshal format to support lazily loadable content (docstrings, line numbers, annotations).

Thanks for the reply, sir. I have some questions about your plan.

I'm not sure about this part. For now, the pyc header consists of four 32-bit fields:

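  1. The first 32 bits hold the version magic number.
  2. The second 32 bits are used as control flags.
  3. When not in hash mode, the third and fourth 32-bit fields hold the source timestamp and the source file size.
  4. In hash mode, the third and fourth fields are merged into one 64-bit field that holds the hash of the source.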
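The layout listed above can be read back with a few lines of Python. This is a sketch under the assumption that the header is 16 bytes and little-endian, as on current CPython; read_pyc_header and example.py are names made up for this example.

```python
import importlib.util
import os
import py_compile
import struct
import tempfile

def read_pyc_header(pyc_path):
    """Split the 16-byte pyc header into its fields (sketch, not an official API)."""
    with open(pyc_path, "rb") as fh:
        raw = fh.read(16)
    magic = raw[:4]
    flags, word3, word4 = struct.unpack("<III", raw[4:])
    if flags & 0x1:  # lowest flag bit set: hash-based pyc (PEP 552)
        return {"magic": magic, "flags": flags, "source_hash": raw[8:16]}
    return {"magic": magic, "flags": flags, "mtime": word3, "size": word4}

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "example.py")  # hypothetical file name
    with open(src, "w") as fh:
        fh.write('def f():\n    return "Hello!"\n')
    pyc = py_compile.compile(
        src,
        doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
    )
    header = read_pyc_header(pyc)
    src_stat = os.stat(src)

print(header)
```

In timestamp mode the two trailing words should equal the source mtime truncated to whole seconds and the source size, each masked to 32 bits.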

I’m not sure what the signature means. Is this just for the magic number part?

Yes maybe we need to extend the timestamp from 32bit to 64bit!

Can’t agree more!