Redesign pycache

On Linux you could check the inode of each file and verify that they are the same (i.e. both names point at the same data on the filesystem).

In Python this is os.stat(file_path).st_ino or the inode() method of an os.DirEntry object.
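
A minimal sketch of both approaches, assuming two hypothetical file names a.py and b.py in the current directory that you want to compare:

import os

a = "a.py"  # hypothetical paths; substitute the files you want to compare
b = "b.py"

# os.stat() exposes the inode number directly.
print(os.stat(a).st_ino == os.stat(b).st_ino)

# os.DirEntry objects (yielded by os.scandir) expose it via inode().
with os.scandir(".") as entries:
    for entry in entries:
        if entry.name in (a, b):
            print(entry.name, entry.inode())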


Okay, here is a demonstration of two venvs created with uv venv and then uv pip install numpy:

$ stat venv1/lib/python3.13/site-packages/numpy/__init__.py | grep node
Device: 811h/2065d	Inode: 17715891    Links: 4
$ stat venv2/lib/python3.13/site-packages/numpy/__init__.py | grep node
Device: 811h/2065d	Inode: 17715891    Links: 4

They both have the same inode, and the link count of 4 shows that this file is shared in four places.

Here is another venv created with python -m venv:

$ stat venv3/lib/python3.13/site-packages/numpy/__init__.py | grep node
Device: 811h/2065d	Inode: 26872598    Links: 1

That’s a different inode with only 1 link, so it is not shared anywhere else.
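
The same check can be done from Python; os.stat exposes both the inode number and the hard-link count (the venv paths below are just the ones from the listings above):

import os

paths = [
    "venv1/lib/python3.13/site-packages/numpy/__init__.py",
    "venv2/lib/python3.13/site-packages/numpy/__init__.py",
    "venv3/lib/python3.13/site-packages/numpy/__init__.py",
]

for path in paths:
    st = os.stat(path)
    # st_ino is the inode number, st_nlink the number of hard links.
    print(path, st.st_ino, st.st_nlink)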


uv will use either hard links or reflinks by default, depending on platform. Both of these options save space.

It would be awesome if uv also hard linked or reflinked pyc files. For one, this could mean faster startup in a lot of scenarios. Currently uv does not compile .pyc files on install by default (pip does; this is an important way uv’s benchmarks against pip are not apples-to-apples). See also Speed up pyc compilation · Issue #2637 · astral-sh/uv · GitHub
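
If you want the .pyc files up front anyway, you can compile them yourself after an install. Here is a minimal sketch using the standard-library compileall module (the site-packages path is hypothetical); this is roughly what pip's install-time compilation step does:

import compileall

# Compile everything under a venv's site-packages up front so imports
# don't pay the bytecode-compilation cost on first run.
compileall.compile_dir(
    "venv1/lib/python3.13/site-packages",  # hypothetical path
    quiet=1,    # only report errors
    workers=0,  # 0 means use all available CPUs
)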


Awesome! Thanks for explaining, and for linking the relevant issue.

Looks like the situation is already pretty great, and only getting better :smile:

ls -l shows the link count in the second column. For an ordinary file, if the link count is greater than 1, then there is another hard link to the file somewhere.

You can also use ls -i to see the inode numbers. Two hard links to the same file will have the same inode number. (But if two files have the same inode number, they're not necessarily the same file: they could be on different file systems. You can use df to find out what file system a file is on.)
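
From Python, that device caveat is handled for you by os.path.samefile, which compares both st_dev and st_ino. A small sketch with hypothetical paths:

import os

a = "venv1/lib/python3.13/site-packages/numpy/__init__.py"  # hypothetical
b = "venv2/lib/python3.13/site-packages/numpy/__init__.py"  # hypothetical

# A matching (device, inode) pair means the two names are hard links
# to the same file; a matching inode alone is not enough.
sa, sb = os.stat(a), os.stat(b)
print((sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino))

# os.path.samefile performs exactly this comparison.
print(os.path.samefile(a, b))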


You can also disable writing the bytecode to disk entirely if it’s adding up that much. There’s a relatively small hit to performance at import time (nothing is cached, so the compile step is repeated on every run), which is more noticeable where Python is used for scripting or other short-lived tasks than for programs that run for a while once started. (I’ve got PYTHONDONTWRITEBYTECODE=1 set in my dev environment, though not in production.)
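
For reference, the same switch is available in-process as well as via the environment; a minimal sketch (the environment variable or the -B flag has to be set before the interpreter starts, while the sys attribute only affects imports that happen after it is set):

import sys

# Equivalent to running with PYTHONDONTWRITEBYTECODE=1 or python -B:
# no __pycache__/*.pyc files are written for later imports.
sys.dont_write_bytecode = True

import mymodule  # hypothetical local module; no __pycache__ entry is created for it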
