First, thank you to everyone who volunteers their time to maintain Python packaging. I know it’s a massive, often thankless job,
and I genuinely appreciate what you do. I’m writing this to be helpful, not to complain.
About me: I was the VP of Operations for a search engine, so I’m comfortable with infrastructure, large-scale systems, and
debugging. I’m an advanced user. And this problem still cost me 15–20 grueling hours to diagnose and fix.
What happened
I have two Python projects that do audio/video transcription with Whisper. For each one I created a virtual environment with venv and installed dependencies with pip. The resulting venv folders contained:
- Project 1: 15,627 files, 2.4 GB (the project’s actual code was ~90 files)
- Project 2: 24,064 files, 898 MB (the project’s actual code was ~48 files)
That’s roughly 40,000 files of dependencies for what amounts to a few Python scripts. Over 99.5% of the files in each project
folder were pip-installed packages.
These project folders lived inside Google Drive. Google Drive tried to sync all 40,000 files and broke completely: synchronization failed, and it took me those 15–20 hours across two days to figure out what was wrong and fix it.
Why this matters beyond my situation
Most normal users don’t know that pip install can silently deposit 25,000 files into a folder. They don’t know to put their venvs
outside of synced directories. They don’t know that a single pip install openai-whisper can trigger a cascade of PyTorch, CUDA,
numpy, onnxruntime, and dozens of other packages that together bring tens of thousands of files.
This will hit anyone who:
- Keeps their projects in Google Drive, OneDrive, Dropbox, or iCloud
- Uses backup software that scans file trees
- Works on a machine with antivirus that scans new files
- Has any tooling that watches file counts or directory sizes
And the trend is going in the wrong direction — ML/AI packages are getting bigger, not smaller. A PyTorch install alone brings
thousands of files.
What I’d love to see discussed
I don’t pretend to have the right solution, but here are some ideas worth considering:
- A warning during pip install when the total file count will exceed some threshold (e.g., 5,000 files). Something like: “This installation will create approximately 24,000 files. Consider placing your virtual environment outside of cloud-synced folders.” (A rough sketch of what such a check could look like follows this list.)
- Documentation that prominently warns against creating venvs inside synced folders. The current docs at packaging.python.org mention venvs but don’t warn about this. A “Common Pitfalls” section could save thousands of hours of user frustration.
- Longer term: exploring whether package installations could reduce file counts, for example through consolidated archives, lazy extraction, or shared package caches that don’t duplicate files per venv.
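To make the first idea concrete: wheels are zip archives, so an installer already has enough information to estimate the file count before extracting anything. The sketch below is hypothetical and not pip’s actual code; the `downloaded_wheels` directory and the 5,000-file threshold are assumptions carried over from the suggestion above.

```python
# Hypothetical sketch of the proposed warning, not pip's real logic:
# estimate the number of files an install will create by counting
# entries in the downloaded wheel archives before extraction.
import zipfile
from pathlib import Path

THRESHOLD = 5_000  # example threshold from the suggestion above

def estimated_file_count(wheel_paths):
    total = 0
    for wheel in wheel_paths:
        with zipfile.ZipFile(wheel) as zf:
            # count only file entries, not directory entries
            total += sum(1 for info in zf.infolist() if not info.is_dir())
    return total

wheels = list(Path("downloaded_wheels").glob("*.whl"))  # hypothetical staging dir
count = estimated_file_count(wheels)
if count > THRESHOLD:
    print(f"This installation will create approximately {count:,} files. "
          "Consider placing your virtual environment outside of cloud-synced folders.")
```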
My fix
I moved the venvs to a local-only folder (C:\A_Python_libraries_50000_files_venvs_DO_NOT_SYNC) and updated my launch scripts to
point there. Problem solved for me — but I’m technical enough to do that. Most users aren’t.
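For anyone hitting the same wall, here is a minimal sketch of the workaround, assuming a Windows setup like mine; the paths are placeholders, not my real ones. The idea is that the project code can stay in the synced folder while the venv lives on the local disk.

```python
# Create the venv outside any synced folder and launch the project
# through that interpreter. Paths below are illustrative placeholders.
import subprocess
import venv
from pathlib import Path

VENV_DIR = Path(r"C:\local_venvs\transcriber")  # local-only, never synced
PROJECT_SCRIPT = Path(r"G:\My Drive\projects\transcriber\main.py")  # code stays synced

if not VENV_DIR.exists():
    venv.EnvBuilder(with_pip=True).create(VENV_DIR)

python = VENV_DIR / "Scripts" / "python.exe"  # use "bin/python" on macOS/Linux
subprocess.run([str(python), str(PROJECT_SCRIPT)], check=True)
```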
Thank you for reading. I hope this is useful.