Would it be possible to redesign the __pycache__ folders so that it’s not on a per-folder basis?
Proposal 1: Library/script-level Python cache
Could we create a single top-level folder called .pycache at the library/script level?
This would save all of the space taken up by the per-folder __pycache__ directories, which it’s been argued is very significant.
Also, having a single Python cache folder would eliminate this script:
function pyclean
    # delete every compiled .pyc file, then remove the emptied __pycache__ directories
    fd -HI -e pyc -t f . -x rm {}
    fd -HI __pycache__ -t d . -x rmdir {}
end
which comes in handy every now and then, but which a lot of new Python programmers will never even think to do. Instead, they can clear their cache with the much more obvious: rm -r .pycache.
Proposal 2: User/system-level Python cache
Could we put the cache files in a local folder specified by platformdirs, and share them on a per-user or per-system basis?
Now that tools like uv and poetry are more prevalent, nearly every one of my repositories has its own virtual environment. And every virtual environment has its own libraries, and every library folder has a __pycache__ with its own pyc files.
This user cache could be implemented by storing each pyc file under its correct path, but with its name modified to include the hash value: pycache_dir/python_version/path/to/module/module_name__module_hash.pyc. No database needed.
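A rough sketch of how such a cache path could be computed (the function name, the truncated SHA-256, and the layout details are just placeholder choices for illustration, not a worked-out design):

import hashlib
import sys
from pathlib import Path

def proposed_cache_path(cache_root: Path, source: Path) -> Path:
    # Hash the module source so that different file contents never collide.
    source = source.resolve()
    digest = hashlib.sha256(source.read_bytes()).hexdigest()[:16]
    version = f"python{sys.version_info.major}.{sys.version_info.minor}"
    # pycache_dir/python_version/path/to/module/module_name__module_hash.pyc
    rel = source.parent.relative_to(source.anchor)
    return cache_root / version / rel / f"{source.stem}__{digest}.pyc"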
I simplified the proposal so that the pyc file would just be stored at pycache_dir/path/to/module/module_name__hash_value.pyc. Thus, the user could simply delete whatever she doesn’t want anymore.
I’m not sure what you mean exactly, and for which proposal? For the first proposal, each pyc is unique to the library root and module path; for the second, it’s unique to the Python version, module path, and module hash.
I’m not sure what you mean. It uses the same timestamp-based or hash-based invalidation that pyc files always use; nothing changes in that regard. And it uses the full path of the .py file to determine the path of the .pyc file within the PYTHONPYCACHEPREFIX; it doesn’t cause any collisions between pyc files.
It solves the problem of making it easier to clean up pyc files, and not having to git-ignore them. It doesn’t attempt to solve any kind of sharing of pyc files between the same version of libraries installed in different envs; these will still get separate pyc files, because the py files are in different locations.
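For concreteness, Python already exposes this mapping via sys.pycache_prefix (the in-process equivalent of PYTHONPYCACHEPREFIX); the paths below are only examples:

import importlib.util
import sys

# Same effect as running with PYTHONPYCACHEPREFIX=/tmp/pycache
sys.pycache_prefix = "/tmp/pycache"

# The cache path mirrors the full source path, so files in different venvs
# never collide even when the module names match.
print(importlib.util.cache_from_source(
    "/home/alice/projA/.venv/lib/python3.12/site-packages/foo/bar.py"))
# e.g. /tmp/pycache/home/alice/projA/.venv/lib/python3.12/site-packages/foo/bar.cpython-312.pyc
# (the exact tag depends on the interpreter you run this under)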
Why not use a path like: pycache_dir/python_version/path/to/module/module_name.pyc?
That way, as long as both venvs have the same Python version, they can share the same pyc? This would save compilation time and a huge amount of space.
Also, why not just make this the default behavior of Python and use the appropriate platform-directory? (You could throw ~/.python_history into that directory too.)
Because it is entirely possible to have two entirely different foo/bar.py files in two different projects, and now you’ve silently introduced cache thrash, where running either project invalidates the pyc for the other. This is true even if we limit consideration to just libraries installed in a venv’s site-packages; nothing enforces package or module naming uniqueness between pip-installable libraries.
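To make the thrash concrete with a toy example (cache_key is hypothetical, standing in for the proposed version-plus-module-path naming scheme):

def cache_key(python_version: str, rel_path: str) -> str:
    # Hypothetical key under the proposed scheme: Python version + module path only.
    return f"{python_version}/{rel_path}c"

# Two unrelated projects that both happen to contain a foo/bar.py:
key_a = cache_key("3.12", "foo/bar.py")  # project A's module
key_b = cache_key("3.12", "foo/bar.py")  # project B's (different) module
assert key_a == key_b  # same cache entry, so each run keeps invalidating the other's pyc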
If wasted disk usage in virtual environments is a concern, I think it will be both simpler and strictly more effective to go after the duplicated py files themselves, and then the pyc files will follow without any additional work, rather than allowing the py file duplication to remain and trying to de-duplicate only at the pyc level.
Some older packaging tools like zc.buildout used to share libraries across projects by constructing a longer sys.path for each project, with an entry for each shared library used by the project, but this approach fell out of favor because disk space is cheaper than process startup time, and longer sys.path makes every import take longer.
Hard-linking schemes are another option.
There are strong reasons in favor of both behaviors, so the status quo wins: changing the default behavior would be very disruptive.
A centralized cache would grow without bound, whereas __pycache__ directories are deleted when the project tree they are in is deleted. And if you’re debugging issues related to the existence of a pyc file, __pycache__ directories are locally visible.
I guess you could just disambiguate them by storing the module hash in the filename, and then there wouldn’t be any collisions?
That makes perfect sense to me. It’s uv that creates venvs with separate files unfortunately. Maybe they will move to a consolidated (or hard-linking) approach eventually.
You could just trash files every few months and have a config file to control this frequency in the centralized location? I feel like the multiple directories are an anachronism since I don’t know of any other software that does that. But I see your point about local visibility. That’s a definite benefit of the status quo.
That would solve the correctness problem, but at the cost of very fast cache-size growth for any actively developed project, since every file edit would produce an entirely new copy of the pyc file for that module, which would stick around until explicitly removed. I don’t think you’d be winning on the disk-usage front in this scenario.
I’m not sure why this is unfortunate! Changing the behavior of uv (or adding a feature to it) is much, much, easier and more likely than the changes you’re suggesting to Python’s default pyc handling.
Though I don’t really get the sense that disk space wasted by multiple projects on the same system is all that high on the list of concerns of Python packaging users these days?
Yes, sorry if it was unclear, I agree with what you’re saying here completely. I just mean that uv’s current behavior is unfortunate. I would have preferred they did what you’re suggesting.
I don’t know about packaging, but some of these virtual environments in ordinary development environments are 6 GB!
Anyway, I withdraw my proposal. I’ll wait for uv to share source files between virtual environments.
Can you clarify what you mean by “uv’s current behaviour”?
My understanding is that uv creates a global cache of hard-linked files so that when you create a venv and install things into it:
All installed packages in each venv are hard-linked to the global cache.
The disk usage for any given package is shared across all venvs on the same machine that have the same version of the package.
Entries in the global cache are automatically deleted when all linked venvs are deleted.
To me this seems like a great idea for reducing the footprint of venvs and is a major reason that I now use uv for all venvs. Getting this to work correctly did require me to configure the location of uv’s cache so that it is on the same physical drive as all of my venvs (otherwise hard-linking doesn’t work).
Note that since the OP references my comment about the size of venvs if we add 4 KB for every __pycache__ directory, the same point still holds: whether you use uv or not, increasing the total disk usage of every installed Python package by 1% is significant.
It’s possible I don’t know how to use Linux properly, to be honest; I never make hard links. How can we check whether it’s making hard links, and how can we check whether the pyc files are unique?
I moved my uv cache because uv was warning me that it was making copies, since the cache was not on the same filesystem as the venvs. I don’t see the warning now, which I assume means it isn’t making copies any more: now that they are on the same filesystem, hard-linking is possible.
Okay, I’ll just trust you that it’s doing the right thing. It is odd that some of my repos show something like 6 GB of venv, but maybe those venvs have newer things? Or maybe I don’t understand how folder sizes are counted (every set of hard links shows the same folder size)?
Hard-linked files still show the same disk usage, but that usage is shared. I’m not sure of a good way to demonstrate/explore this, but it is probably better to ask how much disk space you have free rather than how much disk space is used by some particular files.
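One low-tech way to check from Python (both paths below are hypothetical; substitute a file from one of your venvs and the corresponding file in uv’s cache):

import os

a = "/home/me/project/.venv/lib/python3.12/site-packages/numpy/__init__.py"  # file inside a venv (example path)
b = "/path/to/uv/cache/.../numpy/__init__.py"                                # corresponding file in uv's cache (example path)

print(os.stat(a).st_nlink)      # a link count > 1 means other paths point at the same data
print(os.path.samefile(a, b))   # True only if the two paths are hard links to the same inode

If the link count on a venv file is 1, that file is not hard-linked anywhere else and really does consume its own space.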