I can’t prove it, but personally my guess is that our Python 2 download metrics these days are heavily influenced by large companies installing on large clusters.

Mostly this is because of every time I look at the stats, the python 2 downloads are super *spiky*. For example, I just checked numpy’s recent downloads:

https://pypistats.org/packages/numpy

And sure enough, last monday (06-03), it had 598k python 2 downloads. Compared to recent history, this is highly anomalous: the previous few Mondays had 258k, 236k, 305k. Also, if you look at the per-OS stats, the spike is clearly all or almost-all on Linux.

Now, the law of large numbers tells us that if you have lots of *independent* random events – like say, a few hundred thousand different people with no connection to each other, each independently deciding whether to download numpy – then the noise tends to average out pretty quickly, and you shouldn’t see giant spikes like this. It would be extremely weird for 300k people to all independently say “hey, Monday June 3rd, I like the sound of that, that’s a great day for python 2 linux users like me to upgrade numpy”. But if it’s like, one person rolling out a huge cluster, then it makes more sense.

And I feel like I see these kinds of weird spikes like, practically every time I look at some random project – there’ll be like 1 day or 3 days or something where some specific python version + OS will go wild, and then settle back down to the baseline.

Of course I haven’t done any systematic study of it, and there’s no way to actually check this hypothesis

I guess in theory it might be possible to come up with some kind of clever statistical analysis that tries to infer what distribution of latent users could produce the kinds of spiky patterns we see in the data. Anyone know any stats PhD students that might be nerd-snipable?