How does CPython understand that a .py file has changed and a new .pyc needs to be compiled?

Hello! My question is about CPython implementation.
I know that when we try to import a module, it is compiled into bytecode, and this bytecode is written to a .pyc file so that it doesn’t need to be compiled again next time. If the source .py file changes, the .pyc is recompiled. My question is: how does CPython understand that the .py file has changed and a new .pyc needs to be compiled?

The .pyc file stores the timestamp and size of the source, and these are compared against the source file when the module is loaded.
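A quick way to see those fields (a minimal sketch; file names here are arbitrary): since Python 3.7 (PEP 552), a timestamp-based .pyc begins with a 16-byte header of four little-endian 32-bit words: the magic number, a flags field, the source’s mtime, and the source’s size.

```python
import importlib.util
import os
import py_compile
import struct
import tempfile

# Write a tiny module and compile it (file names are arbitrary).
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "demo_mod.py")
with open(src, "w") as f:
    f.write("print(1)\n")

pyc_path = py_compile.compile(src)   # writes __pycache__/demo_mod.*.pyc
st = os.stat(src)

# Header layout (PEP 552): magic (4), flags (4), mtime (4), size (4).
with open(pyc_path, "rb") as f:
    magic = f.read(4)
    flags, mtime, size = struct.unpack("<III", f.read(12))

print(magic == importlib.util.MAGIC_NUMBER)    # True
print(mtime == int(st.st_mtime) & 0xFFFFFFFF)  # True
print(size == st.st_size & 0xFFFFFFFF)         # True
```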

You can also opt into a system of hash-based checks.
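For completeness, a sketch of opting in via py_compile (PEP 552 hash-based pycs; file names are arbitrary). With CHECKED_HASH, the import system re-hashes the source on import instead of comparing timestamps:

```python
import os
import py_compile
import struct
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "demo_hash.py")
with open(src, "w") as f:
    f.write("print(2)\n")

# CHECKED_HASH: the .pyc embeds a hash of the source, and the import
# system recomputes that hash on import, ignoring timestamps.
pyc_path = py_compile.compile(
    src,
    invalidation_mode=py_compile.PycInvalidationMode.CHECKED_HASH,
)

with open(pyc_path, "rb") as f:
    f.read(4)                               # magic number
    (flags,) = struct.unpack("<I", f.read(4))

print(flags & 0b01)  # 1 → this is a hash-based .pyc
print(flags & 0b10)  # 2 → the "check source" bit is set
```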

I’m trying to run old code. For example, we have a file test.py; it is compiled into test.pyc, and then we change test.py, but Python should still run the old code from test.pyc.

That’s what I’m doing:

  1. create a file test.py with content print(1)
  2. copy this file to another folder, preserving the file’s modification-time attribute
  3. create a new file test.py with content print(2)
  4. compile this new file to .pyc
  5. create a new folder with a __pycache__ subfolder inside
  6. copy the old test.py (from steps 1 and 2) into this new folder (from step 5), preserving the modification-time attribute
  7. copy the .pyc file (from step 4) into the new __pycache__ folder (from step 5), preserving the modification-time attribute

Now we have a test.py file with the old timestamp and a test.pyc with the new timestamp, but Python still recompiles the .pyc file and runs print(2)

Ok, so you’re trying to trick it, but it’s too smart. :wink:

I don’t see how you can be sure you’re editing the (correct) timestamp. Maybe a move rather than a copy would work?

FWIW, here’s where it happens: cpython/Lib/py_compile.py at a531fd7fdb45d13825cb0c38d97fd38246cf9634 · python/cpython · GitHub
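Looking at that code, the likely reason the trick fails is that the .pyc compiled in step 4 records the *new* source’s mtime in its header, which doesn’t match the old test.py copied in; the .pyc’s own filesystem timestamp never enters into it. A minimal sketch (module name and paths are arbitrary) of a variant that does run stale bytecode: replace the source with a same-size file and put the old timestamp back, so both header fields still match.

```python
import os
import py_compile
import subprocess
import sys
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "stale_demo.py")

with open(src, "w") as f:
    f.write("print(1)\n")
py_compile.compile(src)           # header records this mtime and size
st = os.stat(src)

# Replace the source with a same-size file, then restore the old
# timestamp so the recorded mtime and size both still match.
with open(src, "w") as f:
    f.write("print(2)\n")
os.utime(src, (st.st_atime, st.st_mtime))

out = subprocess.run(
    [sys.executable, "-c", "import stale_demo"],
    cwd=workdir, capture_output=True, text=True, check=True,
)
print(out.stdout.strip())  # 1  (the stale bytecode runs)
```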

This is intriguing, what is the use case that leads to needing this?

Have you considered letting Python find only the .pyc?
Python will work with the .py file not being present.
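A small sketch of that approach (names are arbitrary): Python will load a bare .pyc sitting where the .py would be, once the source is gone.

```python
import os
import py_compile
import shutil
import subprocess
import sys
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "only_pyc.py")
with open(src, "w") as f:
    f.write("print('from bytecode')\n")

# Compile, move the .pyc out of __pycache__ to where the .py lives
# (dropping the cache tag from its name), then delete the source.
pyc = py_compile.compile(src)
shutil.move(pyc, os.path.join(workdir, "only_pyc.pyc"))
os.remove(src)

out = subprocess.run(
    [sys.executable, "-c", "import only_pyc"],
    cwd=workdir, capture_output=True, text=True, check=True,
)
print(out.stdout.strip())  # from bytecode
```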

I’m just wondering if this is possible.

It is definitely a problem. Nothing below is intended to disparage your
concerns.

But at least with open source the end user can purge any __pycache__ and
.pyc files themselves, and work from the source. You don’t get that
choice with closed source stuff.

I also do not know if PyPI requires eg sdists (the “source distribution”
flavour of the uploaded package) to contain only source code. Which
means that the approach which came to my mind (installing using eg
pip install --no-binary .....) might not be as effective as I
imagined.

Of course, you can (less effectively) conceal malware in source code. I
know little, but I’ve read reports where code is concealed simply by
indenting it a very long way, so that (some) editors don’t display it.
For example, the code snippets in this forum crop at the edge of the
box - I could put code beyond that point, where it would not be
obvious. And someone who blindly copy/pasted from such a view could be
given a nasty surprise.

You could also distribute a source-only package with nice, clear,
safe-looking code which imports some malware-containing package with a
very similar name to some legitimate package (the “typosquatting”
approach, but deliberately using the typo).

In some ways, if you really want safety, you’re better off running a
suspect package in a sandbox of some kind. Docker containers come to
mind, but they are harder to secure than you might think. And anyway,
if the malware just wants your CPU, e.g. to waste energy mining some
cryptocurrency, a Docker container wouldn’t help much.

And of course even then we don’t run programs in total isolation - even
if we prevent them getting at our files or the network etc, at some
point we want to use the data they produce …


It’s not all bad. For the exploit to work, the file modification-time attributes must be preserved by the file system; git does not save these attributes, so git repositories are not vulnerable. Archives are vulnerable because most archive formats save time attributes, so any format based on an archive is vulnerable; this means .exe files are vulnerable in most cases.

The process of installing packages is also vulnerable. I don’t fully understand the process of packaging/installing packages, but I know that Python packages are distributed in zip format, which saves the time attributes of the files.
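A quick check of that claim (file names are arbitrary): the zip format records each member’s modification time, with 2-second resolution and no timezone. Whether a given installer restores those times on extraction is a separate question - zipfile’s own extract(), for one, does not.

```python
import os
import tempfile
import time
import zipfile

workdir = tempfile.mkdtemp()
payload = os.path.join(workdir, "payload.py")
archive = os.path.join(workdir, "demo.zip")

with open(payload, "w") as f:
    f.write("print(1)\n")

# Back-date the file, archive it, and read the stored timestamp back.
old = time.mktime((2020, 1, 1, 12, 0, 0, 0, 0, -1))
os.utime(payload, (old, old))

with zipfile.ZipFile(archive, "w") as zf:
    zf.write(payload, arcname="payload.py")

with zipfile.ZipFile(archive) as zf:
    stamp = zf.getinfo("payload.py").date_time

print(stamp)  # (2020, 1, 1, 12, 0, 0)
```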

It’s not all bad. For the exploit to work, the file modification-time
attributes must be preserved by the file system; git does not save
these attributes, so git repositories are not vulnerable.

Post-checkout hooks? Just wondering.

One reason I was thinking about this is that while Mercurial also does
not record or apply the file modification time on checkout, I’ve got a
script in my prep-a-checkout-for-install step which applies the
timestamp of each file’s last commit to the file, which I think of as
quality of export in some ways. Of course, that only extracts .py
files, so any .pyc files will still get “new” dates.

Archives are vulnerable because most archive formats save time
attributes, so any format based on an archive is vulnerable; this
means .exe files are vulnerable in most cases

Pip’s wheel files are zip archives! One could upload a misleading wheel
to the cheese shop.
