PyPI Download Counts and Warehouse Migration?

I ran a download-count analysis of PyPI projects and found something interesting.

There is a significant jump in the number of downloads of PyPI projects from July 2018 to August 2018.

I saw that PyPI migrated to a new backend, Warehouse, in late April 2018, a few months before the jump. Might this have affected the download counts? If so, how?

If anyone has any answers, that would be great. Thanks!

The FAQ addresses this:

Why are there so many more downloads after July 26, 2018?

PyPI download records are generated by a service known as linehaul. The previous iteration of the service had an issue which caused it to restart regularly due to running out of memory, resulting in a large quantity of dropped download records. On July 26, a newer version of the service was deployed, which is much more robust and reliable.

Thank you so much! Does that mean the data before July 26, 2018 are unreliable? I read here that the proportions should nonetheless be the same across packages.

It depends on what you’re doing with the data. The caveat is accurate: the absolute download counts before that date are under-reported, so in that sense they are unreliable. But because the data loss was uniform across all features, relative comparisons and proportions can still be considered reliable.
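
To illustrate why uniform loss leaves proportions intact, here is a minimal sketch with made-up numbers. The package names, the counts, and the 40% keep rate are all hypothetical and are not taken from the real PyPI data. It also assumes every package loses exactly the same fraction of records; with real, random loss this would only hold approximately.

```python
import math

# Hypothetical "true" download counts for three made-up packages.
true_counts = {"pkg_a": 1000, "pkg_b": 500, "pkg_c": 250}

# Assumed fraction of download records that survived (illustrative only,
# not the actual loss rate of the old linehaul service).
KEEP_RATE = 0.4


def shares(counts):
    """Return each package's share of the total downloads."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Uniform loss: every package keeps the same fraction of its records.
observed_counts = {name: n * KEEP_RATE for name, n in true_counts.items()}

true_shares = shares(true_counts)
observed_shares = shares(observed_counts)

for name in true_counts:
    # Absolute counts are under-reported...
    assert observed_counts[name] < true_counts[name]
    # ...but each package's share of the total is unchanged.
    assert math.isclose(true_shares[name], observed_shares[name])
```

So comparing one package against another within the same period should be fine, while comparing raw counts from before and after July 26, 2018 is not.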