Venvs silently install 25,000+ files — a real problem for users on Google Drive

First, thank you to everyone who volunteers their time to maintain Python packaging. I know it’s a massive, often thankless job,
and I genuinely appreciate what you do. I’m writing this to be helpful, not to complain.

About me: I was the VP of Operations for a search engine, so I’m comfortable with infrastructure, large-scale systems, and
debugging. I’m an advanced user. And this problem still cost me 15–20 grueling hours to diagnose and fix.

What happened

I have two Python projects that do audio/video transcription using Whisper. Each project created a virtual environment using venv
and pip install. The resulting venv folders contained:

  • Project 1: 15,627 files, 2.4 GB (the project’s actual code was ~90 files)
  • Project 2: 24,064 files, 898 MB (the project’s actual code was ~48 files)

That’s roughly 40,000 files of dependencies for what amounts to a few Python scripts. Over 99.5% of the files in each project
folder were pip-installed packages.

These project folders lived inside Google Drive. Google Drive tried to sync all 40,000 files. It broke. Completely. Google Drive
synchronization failed, and it took me 15–20 hours across two days to figure out what was wrong and fix it.

Why this matters beyond my situation

Most normal users don’t know that pip install can silently deposit 25,000 files into a folder. They don’t know to put their venvs
outside of synced directories. They don’t know that a single pip install whisper can trigger a cascade of PyTorch, CUDA, numpy,
onnxruntime, and dozens of other packages that each bring thousands of files.

This will hit anyone who:

  • Keeps their projects in Google Drive, OneDrive, Dropbox, or iCloud
  • Uses backup software that scans file trees
  • Works on a machine with antivirus that scans new files
  • Has any tooling that watches file counts or directory sizes

And the trend is going in the wrong direction — ML/AI packages are getting bigger, not smaller. A PyTorch install alone brings
thousands of files.

What I’d love to see discussed

I don’t pretend to have the right solution, but here are some ideas worth considering:

  1. A warning during pip install when the total file count is going to exceed some threshold (e.g., 5,000 files). Something like:
    “This installation will create approximately 24,000 files. Consider placing your virtual environment outside of cloud-synced
    folders.” (A rough sketch of such a check is shown after this list.)
  2. Documentation that prominently warns against creating venvs inside synced folders. The current docs at packaging.python.org
    mention venvs but don’t warn about this. A “Common Pitfalls” section could save thousands of hours of user frustration.
  3. Longer term: Exploring whether package installations could reduce file counts — consolidated archives, lazy extraction, or
    shared package caches that don’t duplicate files per venv.
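
To make idea 1 concrete, here is that rough sketch. It is not an existing pip feature, and the threshold, wording, and function name are only placeholders for illustration:

    import sys
    from pathlib import Path

    FILE_COUNT_THRESHOLD = 5_000  # arbitrary example value

    def warn_if_huge(env_dir: str, threshold: int = FILE_COUNT_THRESHOLD) -> None:
        # Count regular files under the environment and nudge the user if it is large.
        count = sum(1 for p in Path(env_dir).rglob("*") if p.is_file())
        if count > threshold:
            print(
                f"Note: this environment now contains {count:,} files. "
                "Consider keeping virtual environments outside cloud-synced folders.",
                file=sys.stderr,
            )

    warn_if_huge(sys.prefix)  # sys.prefix points at the active venv when one is in use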

My fix

I moved the venvs to a local-only folder (C:\A_Python_libraries_50000_files_venvs_DO_NOT_SYNC) and updated my launch scripts to
point there. Problem solved for me — but I’m technical enough to do that. Most users aren’t.

Thank you for reading. I hope this is useful.

This isn’t implausible. But how does pip detect when the environment is on cloud-synced storage? We definitely wouldn’t want to issue such a warning every time a large install happens, so knowing that the user is actually doing something dangerous, not just potentially doing so, would be important.
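
For what it’s worth, the obvious approaches are crude. A purely name-based heuristic (illustrative only, not something pip implements, and easily defeated by a relocated or renamed sync folder) might look like this:

    from pathlib import Path

    # Folder names commonly used by sync clients; purely an illustrative list.
    SYNC_MARKERS = ("google drive", "googledrive", "onedrive", "dropbox", "icloud")

    def looks_cloud_synced(path: str) -> bool:
        parts = [part.lower() for part in Path(path).resolve().parts]
        return any(marker in part for part in parts for marker in SYNC_MARKERS)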

But if you have a solution for that, I’m sure a PR would be welcomed.

Again, feel free to create a PR for this.

I believe uv might do this sort of thing (although I think it’s more as a performance optimisation, so (a) it’s probably not documented as something you can rely on, and (b) it may not work quite as you’d need for your use case). You might want to look at that as a possible alternative to pip.

More generally, the fact that Python allows non-specialist users to make use of large, complex libraries and ecosystems with limited technical knowledge is both a blessing and a curse. We hide a lot of complexity, but honestly, at times you really do need to know what’s going on “under the hood”. Tutorial documentation for tools like PyTorch could help a lot by reminding users that there’s more going on than you might think. Something as simple as a note in the install documentation saying “This installs many thousands of files taking up around a gigabyte of disk space[1] - luckily you don’t need to care about this most of the time, as packaging tools handle it” might give people the hint they need if they ever do hit issues like you did.


  1. Numbers are made up - I don’t know how big PyTorch actually is ↩︎

6 Likes

This feels like something the providers of device backup solutions could handle for their users. venvs pretty clearly advertise themselves as such, but the inverse isn’t true. There’s no clear, uniform way to ask an OS or filesystem “is this directory synced, or might it be in the future?”, nor a clear, uniform way to mark a directory as one that shouldn’t be synced by default in a manner that would be appropriate for venv use.

Nowadays venvs also include a .gitignore file, which is another pretty clear indication of a directory that typical end-user backup solutions may want to warn users about syncing, since it isn’t meant to be handled even by their VCS.
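
A sketch of what that detection could look like from a backup tool’s side, assuming the tool is willing to special-case venvs at all:

    from pathlib import Path

    def looks_like_venv(directory: Path) -> bool:
        # Every venv created by the stdlib venv module has a pyvenv.cfg at its root;
        # recent versions also drop a .gitignore there, a further hint that the
        # contents are meant to be regenerated rather than synced.
        return (directory / "pyvenv.cfg").is_file()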

6 Likes

Agreed. I think getting whatever sync solution you are using to support .gitignore files is important here.

3 Likes

Python already supports loading pure-Python packages from ZIP files. If this is a big enough worry for you, it could be viable to have a custom package manager that bundles files into a ZIP file on installation. It might break code where users expect __file__ to be an accessible reference to the file system, for example, but there is already importlib.resources, which is designed to handle most of those use cases.
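
As a rough sketch of the import side of that idea (building the archive would be the custom installer’s job; a toy one is created inline here only so the example is self-contained):

    import sys
    import zipfile
    from pathlib import Path

    bundle = Path("bundled_packages.zip")  # hypothetical archive built at install time

    # Stand-in for what the installer would do: put a pure-Python package in the ZIP.
    with zipfile.ZipFile(bundle, "w") as zf:
        zf.writestr("mypkg/__init__.py", "GREETING = 'hello from inside a zip'\n")

    sys.path.insert(0, str(bundle))  # the stdlib zipimport machinery does the rest
    import mypkg

    print(mypkg.GREETING)
    print(mypkg.__file__)  # points inside the archive, hence the __file__ caveat above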

That said, I don’t know exactly what UX you provide to your users, so I am not sure whether something like this is easily viable for you. Even if it isn’t, the Python import system is modular enough, and packaging standardized enough, that with some creative thinking it should be possible to tackle this issue.

Now, this does not address the overarching issue, and if that’s what you are looking for, I would recommend continuing to engage with the community, and perhaps sponsoring some of the volunteers who make it run to work on this issue specifically.

I don’t think it’s pip or venv’s job to police libraries’ design choices. But I didn’t realise it was possible to reach such file counts so easily:

uv pip install torch
tree .

...


    │           ├── _virtualenv.pth
    │           └── _virtualenv.py
    ├── lib64 -> lib
    ├── pyvenv.cfg
    └── share
        └── man
            └── man1
                └── isympy.1

983 directories, 14898 files

In PyTorch’s case, they’re mainly header files:

~/venv/lib/python3.12/site-packages/torch/include $ tree .

...


│   ├── library.h
│   └── script.h
└── xnnpack.h

287 directories, 9501 files

I would’ve thought that if the binaries have already been built, the header files could be deleted from a user’s venv. Or does PyTorch run gcc at runtime, or do something else sophisticated with them?

1 Like

PyTorch has a C++ API that people can build extensions against. Those builds can happen using any arbitrary PyTorch installation.
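
For context, a minimal and purely illustrative setup.py for such an extension looks something like this; the build pulls the headers from whatever torch installation is present, which is why they ship with the wheel:

    # Hypothetical extension name and source file; the torch.utils.cpp_extension
    # helpers point the compiler at the headers bundled with the installed wheel.
    from setuptools import setup
    from torch.utils.cpp_extension import BuildExtension, CppExtension

    setup(
        name="my_torch_ext",
        ext_modules=[CppExtension("my_torch_ext", ["my_torch_ext.cpp"])],
        cmdclass={"build_ext": BuildExtension},
    )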

1 Like

The variants PEP mostly talked about absolute download size as one of the problems of “all in one, choose at runtime” hardware support approaches, but this thread suggests that file count reductions might be worth considering as a benefit in their own right (even the header files could potentially be moved to an opt-in variant)

3 Likes

Thanks Nathan. That’s great, but to keep things tidier for PyTorch users not writing C++ extensions, could those ~10k header files be refactored into an optional extension library?

1 Like

I think that discussions about how PyTorch organises their wheels should be brought up with the PyTorch project (and community) directly. As far as this forum is concerned, the important point is that the language and packaging ecosystem provide tools to manage the problem (or in the case of wheel variants, we’re looking at mechanisms that might help), so there’s nothing more to be done here. It’s up to individual projects to decide if they want to use those mechanisms or not.

I have no idea why PyTorch feel it’s better to ship 10k[1] header files that are only needed sometimes in the main package. And I suspect very few other people here do, either. You should ask that question of the people who do know - the PyTorch project.


  1. Assuming your numbers are correct ↩︎

3 Likes

In my opinion the venv module should default to some kind of standard behavior, e.g. when running python -m venv *without* parameters:

  • Create something like a directory ~/.cache/python-venvs (or the macOS equivalent, e.g. ~/Library/Caches/python-venvs/)
  • Create the venv beneath ~/.cache/python-venvs/$(basename $PWD)

These directories are usually ignored by all backup solutions anyway; a rough sketch of what this could look like is below.
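
Using only the stdlib (the paths and naming are my suggestion, not anything venv does today):

    import os
    import venv
    from pathlib import Path

    def create_cached_venv(project_dir: Path | None = None) -> Path:
        # Name the venv after the project directory, mirroring $(basename $PWD).
        project_dir = (project_dir or Path.cwd()).resolve()
        base = Path(os.environ.get("XDG_CACHE_HOME", Path.home() / ".cache"))
        target = base / "python-venvs" / project_dir.name
        venv.create(target, with_pip=True)
        return target

    print(create_cached_venv())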

1 Like

Why were the projects configured to install dependencies within their own folders?

There’s also CACHEDIR.TAG, which I know uv adds to virtualenvs. Though I don’t think either venv or virtualenv does.

UPDATE: virtualenv does support it

venv does not. You can see the closed issue about doing a similar thing for __pycache__ for the reasons why.
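
For reference, CACHEDIR.TAG is just a small marker file with a fixed signature from the Cache Directory Tagging specification, so adding one to an existing venv by hand is trivial; whether it helps depends entirely on the backup or sync tool honouring the spec. A minimal sketch:

    from pathlib import Path

    def add_cachedir_tag(venv_dir: str) -> None:
        # The first line is the fixed signature defined by the Cache Directory
        # Tagging spec (bford.info/cachedir); anything after it is free-form.
        tag = Path(venv_dir) / "CACHEDIR.TAG"
        tag.write_text(
            "Signature: 8a477f597d28d172789f06886806bc55\n"
            "# This directory holds regenerable data; tools that honour\n"
            "# CACHEDIR.TAG should skip it when backing up or syncing.\n"
        )

    add_cachedir_tag(".venv")  # hypothetical venv path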

1 Like