As in Python download stats for March 2019, I recently did some analysis of the python.org download logs. I used better tooling this time, so I have some more interesting numbers, and I’m also happy to run more queries if people have ideas (though all I have access to is essentially the URL).
I filtered out a few obvious scrapers, but it made little difference. They showed up with a high absolute number of hits, but evenly distributed across every single file on the server. The badly behaved downloaders from last year seem to be gone.
Windows any of the
.zip packages, not counting sub-parts of an install (so only the 2.7 MSIs, essentially)
Source any of the archives, except those containing docs
Docs those containing docs
RPM some old
.rpm files that are still floating around
Other anything else
As with last year, this is heavily biased towards Windows, as python.org is the primary source for most users. Though this year, the Microsoft Store is an alternative - according to my dashboard, there were 226,272 downloads in the same time period. So quite small (1.4%) compared to python.org.
This chart is all downloads relating to a particular version (based on the directory name).
Hope this is informative or interesting to people. I’m happy to take questions or requests for other pivots.
I think the Windows bias will vanish by adding:
The 20M downloads that go via anaconda? I’m sure @teoliphant can provide the stats.
The 44M downloads that go via linux / apt? I’m not sure python.org ever sees these.
Others? All the server downloads go via Kubernetes / Docker / …?
Sure, that’s the point: we need other data. This isn’t at attempt to encompass all the data though - someone has to publish the raw measurements (along with the biases) so that the meta-analysis can be done.
Though last I heard, Anaconda is just as biased towards Windows, so that may not “balance” it that much. Data from the Linux distros would be very interesting though!
Maybe you should start being evil and put the Python build in the header of PIP’s request to PyPI
It is, in the user agent string at least. That’s a lot more data though, but if you pay for BigQuery then you can run the numbers against it.
Unfortunately, it’ll be much harder to separate “bot” installs (CI, deployments) from actual users, even if you have source IP addresses, so all its really going to tell you is “who has the biggest clusters and the least caching”
Can’t speak for all distros, but for Fedora:
- we don’t have this data; we don’t track what users install.
- the package manager uses Python, so all installations of Fedora have Python installed (except some super-minimal container images). That’s 3.7 for Fedora 31, 3.8 for f32, 3.9 for f33.
- would creating a VM or a container with Python count as installing Python, anyway? I install Python into a chroot(-ish environment) several times a day when I do packaging.
Yeah, this is a tough question, because it depends on what you’re trying to find out, as well as what the person is actually doing.
For most questions, the number of unique users who have installed any version(s) is a pretty good indicator of “how many people will notice %CHANGE%”. But unique machines doesn’t reflect users, since that’ll count one person installing the same thing on 1000 machines as 1000 instead of 1, while a single Docker install won’t reflect the 1000 users who end up using it. Maybe they average out? But how can we ever know?
It’s a tough area to try and openly share data, but I tried. (Note that I never asked or suggested that anyone else should release their data, only that if people want it then I’m not the one who can provide it.)