PyPI downloads statistics and continuous integration

For folks analyzing PyPI statistics in BigQuery – are there examples out there of people doing data analysis with the file_downloads “CI” field? Does anyone have tips on how best to frame the relevance of that field in data analysis?

Context

I regularly look at package downloads on PyPI thanks to this dataset, and just as regularly get asked how much of the download statistics come from “real users” vs. automation (Continuous Integration (CI) systems and similar). The expectation is that any attempt at inferring usage trends from downloads is better if it’s more directly connected to “manual” downloads, rather than also measuring the prevalence of automation in the Python world. The file_downloads.details.ci field seems spot on for this :person_tipping_hand: – but I’ve never seen anyone relying on it in their data analysis.

What I know so far

Doesn’t feel fair to ask for help without sharing where I’m at, so here goes. This last came up when I blogged about how uv is getting traction in the Wagtail world. Turns out, of all installers trackable in PyPI stats, only pip and uv seem to add ci metadata in their user agent at all. So this premise of “how much traction uv is getting” is an interesting opportunity to dig deeper.

And now that I’ve done the querying, it seems the pattern is pretty different between the two. With the caveat that I don’t know much about how either implements downloads or local caching, uv has higher downloads in “CI” than “not”, while pip is the opposite:
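For anyone wanting to reproduce this, here is a sketch of the kind of query involved. It targets the public `bigquery-public-data.pypi.file_downloads` table; the column names (`details.installer.name`, `details.ci`, `file.project`) follow the published dataset schema, but treat the exact SQL as an untested illustration rather than a ready-made recipe (BigQuery charges by bytes scanned, so keep the date window small):

```python
# Build a BigQuery SQL query that groups a package's PyPI downloads by
# installer and the `details.ci` flag. Only the SQL string is constructed
# here; executing it needs google-cloud-bigquery plus credentials.

def build_ci_query(package: str, days: int = 30) -> str:
    """Return SQL comparing CI vs. non-CI downloads for one package."""
    return f"""
    SELECT
      details.installer.name AS installer,
      details.ci AS is_ci,  -- TRUE, FALSE, or NULL (installer didn't report it)
      COUNT(*) AS downloads
    FROM `bigquery-public-data.pypi.file_downloads`
    WHERE file.project = '{package}'
      AND DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL {days} DAY)
    GROUP BY installer, is_ci
    ORDER BY downloads DESC
    """

sql = build_ci_query("wagtail")
# To execute (assumes configured Google Cloud credentials):
#   from google.cloud import bigquery
#   rows = bigquery.Client().query(sql).result()
```

Note that `details.ci` is nullable, so “not CI” really means “didn’t report CI”, which matters when comparing installers that only started reporting it recently.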

I’ve never seen anyone using this ci field, so I’m not sure what caveats to attach when sharing that kind of analysis. Things like:

  • Caching: Perhaps uv caches much better locally, so it wouldn’t register as many local installs.
  • New release adoption: I believe uv has had this ci metadata reporting since inception; perhaps lots of pip-driven downloads are on older versions without it.
  • Different use cases: Perhaps folks reach for uv to speed up their CI, while retaining pip for local dev.

Keen to hear what others think!

References

If anyone wants to dig deeper on this field, relevant details:

4 Likes

Charlie Marsh and I did a sponsor-track talk at PyCon US 2024 which involved measuring the ‘footprint’ of your development processes, and as part of that I did a quick analysis of CI vs. non-CI downloads from pypi.org over a short period of time (one week, I believe). The numbers were shocking: CI downloads represented more than 80% of the traffic.

3 Likes

The only one I can think of is pepy.tech, which has a toggle to filter out CI downloads, but it’s only for pro accounts and I’ve not tried it.

Yep, uv since version 0.1.22 from March 2024, just a month after its initial release, so pretty much since inception (astral-sh/uv#2493).

pip has had it since version 19.1 – pip uses CalVer so that’s six years (pypa/pip#6273).

Looking at the top 20 installers yesterday (via `pypinfo --limit 20 --percent --days 1 --all "" installer installer-version`):

| installer_name | installer_version | percent | download_count |
| -------------- | ----------------- | ------: | -------------: |
| pip            | 25.1.1            |  23.94% |    367,178,199 |
| pip            | 24.0              |  11.48% |    176,058,144 |
| uv             | 0.7.2             |  10.10% |    154,959,976 |
| pip            | 25.0.1            |   9.95% |    152,533,158 |
| pip            | 23.0.1            |   8.83% |    135,473,693 |
| uv             | 0.7.3             |   4.21% |     64,577,578 |
| pip            | 23.2.1            |   4.21% |     64,576,636 |
| pip            | 22.3.1            |   3.15% |     48,284,870 |
| pip            | 21.3.1            |   3.06% |     46,914,226 |
| pip            | 24.3.1            |   2.87% |     44,042,212 |
| uv             | 0.6.13            |   2.43% |     37,291,945 |
| pip            | 22.2.2            |   2.37% |     36,340,879 |
| uv             | 0.6.17            |   2.18% |     33,449,781 |
| pip            | 24.2              |   2.05% |     31,454,008 |
| pip            | 21.0.1            |   1.66% |     25,468,400 |
| uv             | 0.6.10            |   1.61% |     24,685,021 |
| pip            | 20.2.2            |   1.60% |     24,539,568 |
| pip            | 22.1.2            |   1.55% |     23,696,599 |
| poetry         | 2.1.3             |   1.46% |     22,402,809 |
| uv             | 0.6.9             |   1.29% |     19,768,787 |
| Total          |                   |         |  1,533,696,489 |

All of these versions are with CI detection, so the vast majority of PyPI downloads include it.

Whilst we’re at it, here’s the top installers yesterday (`pypinfo --test --limit 20 --percent --days 1 --all "" installer`):

| installer_name | percent | download_count |
| -------------- | ------: | -------------: |
| pip            |  72.76% |  1,349,210,055 |
| uv             |  23.18% |    429,835,352 |
| poetry         |   2.75% |     51,047,940 |
| requests       |   0.46% |      8,492,492 |
| None           |   0.26% |      4,788,213 |
| bandersnatch   |   0.23% |      4,316,255 |
| Bazel          |   0.12% |      2,136,498 |
| Browser        |   0.09% |      1,702,997 |
| setuptools     |   0.08% |      1,421,528 |
| Nexus          |   0.06% |      1,138,682 |
| pdm            |   0.01% |        145,438 |
| Homebrew       |   0.01% |        110,665 |
| devpi          |   0.00% |         54,966 |
| Artifactory    |   0.00% |         38,664 |
| OS             |   0.00% |          8,784 |
| conda          |   0.00% |          3,106 |
| pex            |   0.00% |          1,035 |
| Total          |         |  1,854,452,670 |

These numbers are from querying BigQuery using the pypinfo CLI – we should expose the ci field there too.

3 Likes

It’s worth noting that comparing HTTP calls for pip vs uv is not a linear mapping to traction.

For example uv eagerly prefetches metadata, so uv might do a dozen metadata HTTP requests where pip might have done one. And there are plenty of other scenarios where they behave very differently, not least because uv supports a lot more use cases (projects, tools, scripts, etc.)

6 Likes

Interesting. I’ve been asked questions like this too (basically “what does it mean”), as I’m sure a ton of people have. Is there anything useful that can be gleaned from download numbers, and if so, what? Say I’ve got a small project that other, perhaps “volatile”, projects depend on. They haven’t got any caching set up on their CI, so every PR and update thereto triggers a “download” – maybe hundreds a day each. Do raw download numbers make the impact of the project look bigger than it is? Or do they?

2 Likes

All these workflow/environment managers we have these days that bump every package version to the latest at every opportunity make even the non-CI downloads less representative of usage. Ruff, for example, is going to get a disproportionately large number of downloads versus flake8 just because it does a new release every 10 seconds. Slower-updating/stable packages will most commonly be installed from cache without incrementing any download counts.

2 Likes

Separately from the question of statistics, it seems inefficient that PyPI is hammered by CI systems in this way.

Can the CI systems not proxy/cache PyPI and keep the traffic mostly internal?

3 Likes

They could and they should. But they don’t.

It’s the Tragedy of the Commons.

The way to do it is to aggressively rate-limit and require authentication. But it’s very difficult and disruptive to change this entrenched behaviour unfortunately. It’s very similar to the DockerHub change (from unauthenticated pulls to authenticated), but likely more disruptive.

On the installer/resolver spectrum, we can (mostly already do) cache downloads. But many CI instances are ephemeral and don’t have a persistent or shared cache across agents.

2 Likes

So it’s down to us to enable caching in our CI workflows.

It’s pretty easy with actions/setup-python, usually just add `cache: pip`.

And it’s enabled by default with actions/setup-uv for GitHub-hosted runners.
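To make the setup-python case concrete, here’s a minimal workflow sketch. The workflow/step names, Python version, and requirements file are placeholders; the relevant part is the `cache` input on the setup step:

```yaml
# Hypothetical minimal GitHub Actions workflow with pip caching enabled.
name: tests
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: pip  # restores pip's download cache between runs
      - run: pip install -r requirements.txt
      - run: pytest
```

Note this caches pip’s wheel/download cache keyed on the requirements file, so PyPI is only hit when dependencies change, not on every run.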

3 Likes

First, keep in mind that PyPI is already heavily proxied/cached by Fastly; without that the servers running pypi.org would burn to a crisp :slight_smile:

Second, caching PyPI is not easy (I can say this as a person who helped to run an Artifactory cache at a prior job). When new distributions are uploaded to PyPI, the index page for that package must be invalidated in the cache, otherwise installers won’t know about the new distributions. You might think “oh, that’s fine, there will just be a delay of 30–60 minutes before the new distribution is available”, and that seems true, but in reality the situation is more complex.

A common problem for caching PyPI is in fact the AWS boto3 group of packages; AWS updates them often (sometimes more than once per week), and they upload a set of distributions which all require each other’s new versions (they can’t be used with older versions). If the cache is aware of one of the new distributions but not the others (because the others were last pulled within the cache timeout window and their indices are stale), then the installer will not be able to install those packages without backtracking to older versions, and that doesn’t always work. This was our lived experience running a cache; we frequently had to manually invalidate the index caches for PyPI so that our employees could successfully install boto3.

Another caching problem is yanking packages; a cache which does not know that a specific distribution has been yanked will continue providing it to installers who ask for it, and if that cache is sitting inside GitHub (for example) and feeding packages to CI workflows which build release artifacts, those artifacts will contain yanked packages.

The current system works because PyPI sends ‘purge’ requests to Fastly when new distributions for packages are uploaded, or when distributions are yanked. Doing that for a distributed network of caches run by numerous third parties would be very challenging, if it’s even possible at all.

5 Likes

@kpfleming that research about CI vs. non-CI downloads sounds pretty spot on! Do you have data / methodology / results shared somewhere that others could reference? Otherwise I’ll go hunt for the talk recording :slight_smile:

@hugovk ty, I didn’t know about pepy.tech, I’ll check out how they present this “CI” filter.


The use is limited and the risk of over-interpreting the data is high; that said, I personally still find it useful for specific comparisons, as long as the uncertainties are acknowledged:

  • Downloads over time for a specific package as a proxy for usage trends (not very trustworthy, but still more useful than GitHub stars)
  • Release adoption. As an example, using PyPI data I recently wrote Supported versions: Django vs. FastAPI vs. Laravel.
  • Pre-release testing (if pre-release downloads are low, I know I need to ask more people)
  • Comparing similar packages within a given subset of the Python ecosystem. For example reviewing Django packages’ downloads compared to the downloads of Django itself, to understand where that ecosystem is going.

All of which are influenced by the prevalence of CI, and automation, and lots of other pitfalls. But IMO still useful enough currently.

Is there anything useful that can be gleaned from download numbers, and if so, what?

They help the PyPI administrators figure out which projects are causing the most load on the donated CDN services and hosting backend.

Can the CI systems not proxy/cache PyPI and keep the traffic mostly internal?

It’s definitely possible. I help run some rather large gating CI/CD systems and we deploy a caching proxy in each cloud region (using Apache mod_proxy+mod_cache) because it makes our jobs faster and more reliable. The Internet is chaotic, networks inside clouds less so, therefore every connection a job makes across the open Internet means an increased frequency of false failure results and reduced developer trust in the project’s testing.

Of course, someone setting up a CI system on their own with limited resources doing the absolute bare minimum necessary to get it running probably doesn’t care about such things, and I suspect in aggregate that’s the sort of environment where the bulk of the hits are coming from.

The way to do it is to aggressively rate-limit and require authentication. But it’s very difficult and disruptive to change this entrenched behaviour unfortunately. It’s very similar to the DockerHub change (from unauthenticated pulls to authenticated), but likely more disruptive.

Veering a bit off-topic, but technically DockerHub’s protocol has always required authentication. “Anonymous” pulls still had to get an ephemeral token to access the repository indices, though this is hidden from userspace by the tools implementing it, so you typically wouldn’t notice. This, along with a number of other cache-busting mechanisms like bogus timestamps, makes DockerHub significantly harder to cache effectively than PyPI (not to discount the other posts describing the challenges with caching PyPI, which I agree is not trivial either).

And now the rate limits added in the past year-ish have made DockerHub caches a hindrance since funneling all your cold cache hits through a single machine means that your cache is likely to get blocked sooner, even though you were being a good citizen and reducing the overall number of requests from your CI in aggregate. For projects I’m involved in, the upshot has been that we’re abandoning DockerHub and Moby’s docker tooling in general, switching to independent registries either run by the project’s community or by other friendly parties (incidentally also easier to proxy/cache).

1 Like

It wasn’t much research, but here’s the recording: https://www.youtube.com/watch?v=hIXUzxmMcAU&t=21s

ty! Was it the “What about PyPI traffic?” section around 34:30, then? The number you have on the slide there has CI downloads at 27% of the total over one day, so not 80% of the traffic unless I’m misunderstanding.

From my side I shared Wagtail’s CI figures for pip and uv above. Irrespective of installer, Wagtail’s download stats have been hovering between 10 and 30% from CI on a monthly basis over the last 12 months. Django is in a similar ballpark. FastAPI’s proportion is higher, between 20 and 50%:

Sharing with the caveat, again, that this is based on the “CI” reporting in user agents, which I’m still trying to understand how to interpret.