Periodically, I am allowed access to the CDN logs of python.org to collect some download statistics. (For hopefully obvious reasons, these are not publicly available: they contain personally identifiable information, and nobody has set up an automated process to transfer and filter them into a public dataset.)
In general, these are not particularly informative, since python.org is the primary source of Python downloads only for Windows users, but as one input for assessing the overall scale of Python usage they are useful. So here are a few summary tables that I extracted from the logs:
|Category|Sum of Hits|% of Hits|% of runtime downloads|
|---|---|---|---|
These have been corrected for a couple of CI systems that I deemed broken (for example, one system downloads Python 3.5.4 for macOS every few seconds, and a version of Chef downloads Python 2.7.10 for Windows unusually frequently). The last column shows percentages of only the rows that have values, excluding downloads that don't represent "give me a Python runtime".
- Windows is a download of the main installer (either the `.exe` or the `.msi`) or the embeddable package (nuget package downloads are visible here)
- Windows deps are optional MSIs downloaded by the installer (debug symbols, etc.)
- Sources are any of the source packages
- macOS are any of the macOS installers
- Docs are any of the documentation files (primarily the Windows `.chm` help file)
- Sigs are any of the GPG signature files (`.asc`)
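The CI-noise correction mentioned above could be sketched roughly like this. The field names and the repeat threshold are my own illustrative choices, not the actual pipeline: the idea is simply to drop any (IP, URL) pair that repeats implausibly often.

```python
from collections import Counter

def drop_broken_ci(rows, threshold=100_000):
    """Drop log rows whose exact (ip, url) pair repeats implausibly
    often -- e.g. a host fetching the same installer every few seconds
    for a month. Field names and threshold are illustrative guesses."""
    counts = Counter((row["ip"], row["url"]) for row in rows)
    noisy = {key for key, n in counts.items() if n > threshold}
    return [row for row in rows if (row["ip"], row["url"]) not in noisy]
```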
If, instead of pivoting by operating system, I switch to Python versions (based on the version number directory in the URL), filter to the Windows/Sources/macOS breakdown used above, and keep only versions with at least 100k downloads, we get this:
|Category|Sum of Hits|% of Hits|
|---|---|---|
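The pivot by version depends on pulling the version number out of the URL path. A rough sketch, assuming python.org's `/ftp/python/<version>/...` download layout (the regex and helper are mine, not the author's actual query):

```python
import re
from collections import Counter

# Assumed path shape: python.org downloads live under /ftp/python/<version>/
VERSION_RE = re.compile(r"/ftp/python/(\d+\.\d+\.\d+)/")

def version_counts(urls):
    """Count hits per Python version, keyed by the version directory
    in each URL; URLs without a version directory are ignored."""
    counts = Counter()
    for url in urls:
        m = VERSION_RE.search(url)
        if m:
            counts[m.group(1)] += 1
    return counts
```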
So the good news is that the latest releases are getting the majority of the downloads (bearing in mind that 3.7.3 was released in the last week of March).
Finally, as a last validation step, to see how often we were getting repeated requests from the same source (e.g. CI systems), I bucketed unique IP addresses. This table shows the number of unique IPs for each range of request counts, including 200, 300 and 400 HTTP responses. So approximately 2 million unique IPs made 10 or fewer requests during March (requests seem to come in pairs, though I didn't figure out why).
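The bucketing itself could be sketched like this; the bucket bounds are illustrative, not the ones used for the table:

```python
from collections import Counter

def bucket_ips(ip_per_request, buckets=(10, 100, 1000, 10000)):
    """Count unique IPs per request-count range.

    ip_per_request is one IP entry per logged request; the result maps
    each bucket's upper bound to the number of unique IPs whose total
    request count falls at or below it (and above the previous bound)."""
    per_ip = Counter(ip_per_request)  # requests made by each unique IP
    result = Counter()
    for n in per_ip.values():
        for bound in buckets:
            if n <= bound:
                result[bound] += 1
                break
        else:
            result[f">{buckets[-1]}"] += 1
    return dict(result)
```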
The top 20 IP addresses accounted for 4.64 million requests. A quick sanity check suggests they are spread across file types and responses (many 300 and 400 responses are included in the request count) and are more likely spiders than actual users. Dropping them from the download counts didn't have a noticeable impact.
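Measuring what share of traffic the busiest addresses account for is a one-liner over the same per-IP counts; a hypothetical helper, not the author's actual analysis:

```python
from collections import Counter

def top_ip_requests(ip_per_request, n=20):
    """Total requests made by the n busiest IPs, to gauge how much
    traffic a handful of heavy clients (e.g. spiders) contribute."""
    per_ip = Counter(ip_per_request)
    return sum(count for _, count in per_ip.most_common(n))
```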
These are the most interesting results as far as I'm concerned. If anyone has suggestions for things they'd like me to take a look at, please let me know, though I've already gotten rid of the original reports and no longer have access to the logs, so I may not be able to answer from my filtered data sets.