I am not at all sure if this is a good idea or not, but now that a .gitignore is added to virtual environments, I wonder if it’s a good idea to also add one to __pycache__ directories? Maybe it’s not worth the overhead, or there are good reasons to check these into VCS in some cases.
In a lot of beginner projects, and in projects by people who are not career programmers (academics, data scientists), I do see people checking in these files by mistake. It would be nice to avoid this.
I think the main difference here is that venv already has a lot of platform specific logic around adding the right activation and deactivation scripts, so adding a .gitignore to the set of generated files is a much smaller conceptual change.
By contrast, __pycache__ operates at a much lower level inside the interpreter, and doesn’t even necessarily catch all compiled Python files (files may still be explicitly compiled adjacent to their source file, and there are options to allow the compiled file cache to use a completely parallel directory tree instead of nested __pycache__ folders).
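As a rough illustration of the point about cache locations (a sketch, not something from the original post): `importlib.util.cache_from_source` reports where the interpreter would place the compiled file for a given source path, and `sys.pycache_prefix` (Python 3.8+, also settable through `PYTHONPYCACHEPREFIX`) redirects that location to a parallel tree. The helper name `cache_location` and the example paths are made up for this example.

```python
import importlib.util
import os
import sys


def cache_location(source_path, prefix=None):
    """Return where the bytecode for source_path would be cached.

    `prefix` temporarily stands in for sys.pycache_prefix (hypothetical
    helper; normally the value comes from -X pycache_prefix or the
    PYTHONPYCACHEPREFIX environment variable).
    """
    saved = sys.pycache_prefix
    sys.pycache_prefix = prefix
    try:
        return importlib.util.cache_from_source(source_path)
    finally:
        sys.pycache_prefix = saved  # always restore the real setting


# Illustrative paths only; nothing is read from or written to disk.
source = os.path.join(os.sep, "project", "pkg", "mod.py")
default_path = cache_location(source)
prefixed_path = cache_location(source, prefix=os.path.join(os.sep, "bytecode"))

print(default_path)   # nested __pycache__ folder next to the source
print(prefixed_path)  # parallel tree under the prefix, no __pycache__ folder
```

With a prefix in effect, no `__pycache__` folder is created at all, so a `.gitignore` dropped into such folders would not cover that configuration.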
The sheer number of __pycache__ folders that can be generated in a complex source tree also pushes things towards favouring a single __pycache__ entry in a top-level .gitignore file.
We’re also looking at data volumes that are frequently smaller than the corresponding source files, whereas venvs can be enormous if they bring in dependencies like pytorch or tensorflow.
So it’s an interesting idea to consider, but I think the balance of consequences comes down on the side of “the interpreter is not the right level to mitigate the problem of compiled Python caches being checked into source control” (whereas the venv module was a decent place to solve it for venv creation).
This feels to me like a distinction without a difference. It’s not clear what relevance this abstract contrast has to a very real issue of beginners checking in numerous folders that should not be checked in. Adding a 1-byte .gitignore file there costs basically nothing but would be a tangible QoL improvement.
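For readers unfamiliar with the pattern being proposed, here is a minimal sketch of a "self-ignoring" folder, assuming the file holds just the single pattern `*` (which also matches the `.gitignore` file itself). The function name `make_self_ignoring` is hypothetical, and this is not how the interpreter behaves today.

```python
import tempfile
from pathlib import Path


def make_self_ignoring(directory):
    """Create `directory` and place a .gitignore in it that ignores everything.

    The single "*" pattern matches every entry in the folder, including
    the .gitignore file itself, so git reports nothing from this folder.
    """
    path = Path(directory)
    path.mkdir(parents=True, exist_ok=True)
    ignore_file = path / ".gitignore"
    if not ignore_file.exists():  # never overwrite a file the user wrote
        ignore_file.write_text("*\n", encoding="utf-8")
    return ignore_file


# Demonstrate in a throwaway directory rather than a real project.
with tempfile.TemporaryDirectory() as tmp:
    created = make_self_ignoring(Path(tmp) / "pkg" / "__pycache__")
    content = created.read_text(encoding="utf-8")
    print(repr(content))
```

Tools that use this approach often add a comment line saying who created the file, so the exact contents vary from tool to tool.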
Call me cynical but I suspect that lesson will always need to be learned the hard way at some point. If we deal with __pycache__ then it’ll just be some other built/generated/junk file that beginners repeatedly check in before they come to appreciate why the .gitignore exists.
But it seems strange to me to have a lesson whose importance comes mostly from us deciding to keep around the pain points that necessitate learning it. Similar to keeping random junk lying on the floor to teach the lesson of watching your step - which is a valuable lesson, indeed, but maybe not one we need to be taught every time we cross a room?
It does seem like most of the ecosystem is moving to having all auto-generated folders have .gitignore inside. Not only .venv but .idea, .ruff_cache, .mypy_cache, .pytest_cache (to name the most common) also do it. We are facing the reality where __pycache__ becomes the main reason to even have a user-created .gitignore, and for many small and/or beginner projects the only reason.
So the floor now is mostly clean except for a large rake we keep for… pedagogical reasons?
Everyone learns how to cross a room sooner or later. Removing one specific thing from the floor won’t change that - there are plenty of other things on the floor.
Most of my post was talking about the fact that in a growing number of cases that is approaching 100% for small projects, __pycache__ is indeed the only thing left lying on the floor.
Not to mention that the whole point of convenience comes from making unavoidable lessons become a thing of the past and only something rarely needed in special cases.
(I feel like I’ve made my point so I’m going to step aside to let other people contribute)
That hasn’t been the case in my experience, so I don’t think there’s that much value in removing just one thing. There’s inevitably something else around, and a .gitignore file usually has to be customized to the particular workflow and project. Despite having standardized .gitignore files all over the internet, people still need to understand what it’s doing and make one that is appropriate to this project. That’s never going to change.
Adding a .gitignore file into a __pycache__ directory would need to be done for every __pycache__ directory, so it means a great many 4 KB files spread all over the place.
How is the interpreter supposed to know that this __pycache__ directory is in a git repo rather than say site-packages or stdlib etc? Is it going to add 4KB files everywhere?
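To make the question concrete, here is one conceivable heuristic, sketched under loud assumptions: walk up from the cache folder looking for a `.git` entry. CPython does not do this, the name `find_git_root` is invented for the example, and the check would cost extra filesystem lookups while still knowing nothing about other version control systems.

```python
import tempfile
from pathlib import Path


def find_git_root(start):
    """Return the nearest ancestor of `start` that contains a .git entry, or None."""
    current = Path(start).resolve()
    for candidate in (current, *current.parents):
        # .git is a directory in a normal clone and a file in worktrees/submodules
        if (candidate / ".git").exists():
            return candidate
    return None


# Build a fake repository layout in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp).resolve() / "repo"
    cache_dir = repo / "src" / "pkg" / "__pycache__"
    cache_dir.mkdir(parents=True)
    (repo / ".git").mkdir()  # stand-in for a real git metadata folder
    found = find_git_root(cache_dir)
    print(found == repo)
```

Even when such a lookup succeeds, it only says that a repository encloses the folder, not whether the folder is already covered by an existing ignore rule.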
Aside for whoever happens to be reading this thread and does not know it yet: it is possible to set gitignore rules at the user account level, for example in ~/.config/git/ignore.
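A minimal sketch of setting that up from Python, assuming the pattern of interest is `__pycache__/` (the helper name `ensure_ignore_pattern` is made up, and the demo writes to a temporary directory standing in for the home folder):

```python
import tempfile
from pathlib import Path


def ensure_ignore_pattern(ignore_file, pattern="__pycache__/"):
    """Append `pattern` to `ignore_file` unless it is already listed.

    Returns True if the file was modified, False if nothing changed.
    """
    path = Path(ignore_file)
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    if pattern in (line.strip() for line in text.splitlines()):
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        if text and not text.endswith("\n"):
            handle.write("\n")  # keep the new pattern on its own line
        handle.write(pattern + "\n")
    return True


# A temporary directory stands in for the user's home directory here.
with tempfile.TemporaryDirectory() as fake_home:
    target = Path(fake_home) / ".config" / "git" / "ignore"
    first = ensure_ignore_pattern(target)
    second = ensure_ignore_pattern(target)  # already present, so no change
    lines = target.read_text(encoding="utf-8").splitlines()
    print(first, second, lines)
```

Pointing the helper at `Path.home() / ".config" / "git" / "ignore"` would affect every repository for that user, assuming git is using its default location (neither `core.excludesFile` nor `XDG_CONFIG_HOME` is set to something else).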
venvstacks isn’t a super complicated project and doesn’t have a tremendously deep dependency tree, but it’s deep enough that when tox emits environments for a dozen or so different commands, there end up being a couple of thousand directories that contain Python source files, each of which gets its own __pycache__ subfolder.
The Python interpreter has no way of knowing that the vast majority of those cache folders are already being ignored because they’re from virtual environment installations managed by either tox or pdm:
$ find . -type d -name lib | wc -l
13
The total number of .gitignore files in the working tree is on the same order of magnitude as the number of virtual environments created by tox and pdm (it’s inflated a bit by a couple of dependencies that either inadvertently or intentionally ship .gitignore files to ensure that the containing folders are created at installation time).
These numbers are a couple of orders of magnitude smaller than the number of __pycache__ folders in the same working tree.
It’s that difference in scale that really tips the balance: adding an implicit .gitignore file is a reasonable idea when creating a virtual environment (especially since so many tools that create virtual environments already do that anyway), but a far more dubious one when considering doing it for every __pycache__ folder the interpreter creates.
Adding a __pycache__ entry to a top-level .gitignore also has the benefit of keeping the folders themselves out of source control, whereas the “.gitignore in the folder” approach is mainly useful when it’s desired (or at least acceptable) for the checkout to record the folder’s existence.
There are 6,406 __pycache__ directories on this machine. Adding 4 KB in each would take 26 MB. But those existing directories already take up 2.2 GB, so the extra .gitignores would only add about 1%.
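The estimate above can be reproduced on any tree with a short script. This is a sketch that assumes each tiny file occupies one 4 KiB filesystem block (the real allocation depends on the filesystem) and that measures apparent file sizes rather than actual disk usage; `estimate_overhead` and `ASSUMED_FILE_COST` are names invented for the example.

```python
import os
import tempfile

ASSUMED_FILE_COST = 4 * 1024  # bytes; assumes one 4 KiB block per tiny file


def estimate_overhead(root, cost_per_file=ASSUMED_FILE_COST):
    """Return (folder count, bytes already cached, estimated extra bytes)."""
    count = 0
    cached_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        if os.path.basename(dirpath) == "__pycache__":
            count += 1
            # Apparent size of the cached files, not blocks used on disk
            cached_bytes += sum(
                os.path.getsize(os.path.join(dirpath, name)) for name in filenames
            )
    return count, cached_bytes, count * cost_per_file


# Small synthetic tree: three packages, each with a 10-byte cached file.
with tempfile.TemporaryDirectory() as tmp:
    for package in ("a", "b", "c"):
        cache = os.path.join(tmp, package, "__pycache__")
        os.makedirs(cache)
        with open(os.path.join(cache, "mod.pyc"), "wb") as handle:
            handle.write(b"0123456789")
    count, cached_bytes, extra_bytes = estimate_overhead(tmp)
    print(count, cached_bytes, extra_bytes)

# The figures quoted in the post: 6,406 folders against 2.2 GB of cache.
quoted_extra = 6406 * ASSUMED_FILE_COST
quoted_ratio = quoted_extra / 2.2e9
print(quoted_extra, f"{quoted_ratio:.1%}")
```

Running `estimate_overhead` on a real environment (for example a virtual environment's root) gives a per-machine version of the same comparison.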
You say “only” but I would consider 1% to be significant when talking about increasing the install footprint of every Python package in every installation environment in the world. There would need to be some significant benefit to justify that.
The venv I have sitting here uses 241MB with 0.002% of that being the .gitignore file it contains. If I was looking for some way to make venvs smaller I would not waste time considering whether that .gitignore file could be stripped out. However, if I realised that I could make the whole venv 1% smaller by recursively stripping out hundreds of pointless .gitignore files from every package directory, then that would seem like a worthwhile improvement.
It seems like this thread basically balances the space cost of .gitignore files with their benefit.
I don’t think the space cost is that significant in the modern world, but I added a separate topic about redesigning pycache so that it shares its cache on a user or system level, which would eliminate it entirely.
I think people are under-appreciating the benefit of simplifying the gitignore experience for new and experienced users. Not creating a .gitignore is one fewer step to do and to learn. Once you get used to doing it out of habit, it’s easy to forget the mental cost it took you to get there.
It’s notable how incredibly dominant git is in people’s minds here. Not a single thought is given to any OTHER source control system. If Python had to create not one, but a bunch of different ignore files (.hgignore, .bzrignore, etc), would that change things? Or is it assumed that people are going to have to figure those other systems out on their own?
Maybe git really IS that dominant, and I just have older experiences. Maybe Firefox is the only project left in the world that doesn’t use it.
I think it’s dominant enough that future version control systems are likely to accept .gitignore files (since so many other tools already either emit or read them, there is a genuine practical benefit to a new VCS taking that approach).
Yeah, although I would be cautious of assuming that “primary” equates to “sole”. Again, it PROBABLY is true for a lot of people, but I would consider git to be my primary VCS on account of it being the one I’d choose for any brand-new project, while still happily using whatever system someone else is already using.
So, this point is probably moot, and git really IS that dominant. But I’m not 100% certain of that.