Tracking Interactive vs CI use of pip and thus PyPI downloads

The analysis of PyPI downloads by OS shown in https://twitter.com/di_codes/status/1359936102413594628 looks odd: Linux shows up at 100–300M downloads a day, so the Platforms chart is basically zero for everything else.

I suspect this is the age-old “how many downloads are CI systems vs users” question.

Has anyone had any success attempting to track this?

One idea that could be implemented in a future pip, at least on POSIXy systems like Linux, would be to have pip indicate in the download request whether it is running in an interactive session (presence of a controlling terminal, detected via isatty?) or an automated one (more likely a continuous integration system or automated build system).
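As a rough sketch of that idea (this is not pip’s actual behavior, just an illustration of the controlling-terminal check), the classification could look something like:

```python
import sys

# Hypothetical sketch: classify a pip invocation as "interactive" or
# "automated" before attaching that label to the download request.
def session_kind():
    # A terminal on both stdin and stdout suggests a human at a shell;
    # CI runners and build systems usually run without one.
    if sys.stdin.isatty() and sys.stdout.isatty():
        return "interactive"
    return "automated"
```

Redirected pipes, cron jobs, and CI runners would all report “automated”, which is why this would only ever be a heuristic.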

It’d be great to be able to highlight when automation users are not using a local PyPI cache and really should be… and to figure out how to make that the norm in common CI setups, if CI is indeed the cause (still just a hypothesis).
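As a concrete illustration of what “use a local cache” means in practice, a CI runner can be pointed at a caching proxy such as devpi with a one-line pip configuration (the host name below is hypothetical):

```ini
# pip.conf (pip.ini on Windows) on the CI runner -- host name is hypothetical;
# devpi exposes a PyPI-mirroring index under /root/pypi/+simple/.
[global]
index-url = http://pypi-cache.internal:3141/root/pypi/+simple/
```

The proxy then fetches each distribution from PyPI once and serves every subsequent CI run from its local cache.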


There’s a feature request for this here: Feature request: store "installed from CI" in the BigTable · Issue #9 · pypa/linehaul-cloud-function · GitHub


This thread has a few good solutions:

“Draft PEP: PyPI cost solutions: CI, mirrors, containers, and caching to scale”
https://groups.google.com/g/pypa-dev/c/Pdnoi8UeFZ8/m/f7QommElBAAJ


Can we distinguish downloads from AWS, Azure, and GCP too?

See this ranking PyPI Download Stats

boto3, botocore, and s3transfer relate to AWS, and some of the others are dependencies of theirs.

It would be heavy-handed, and probably unnecessary after a few weeks’ notice, but would it work to block known major CI IP ranges until they send a THISISFROMCI header, set whenever the corresponding env var (or similar) is true?

A redirect to a “how to save yourself tons of ingress bandwidth and save PyPI millions” guide could solve the announcement issue.

I would advise coordinating such things with vendors of proprietary repository solutions such as JFrog (Artifactory) or Sonatype (Nexus). Most users of those do not know exactly how the mirroring is performed, so we cannot really tell whether we (or rather the instances we use) do things the right way.

The problem with terminal (tty) detection is that more and more people are using pip through a wrapper (poetry, pipenv, pip-tools, etc.), so there would be a significant number of false negatives.

Most CI solutions come with predefined environment variables, e.g. Predefined environment variables reference | GitLab

Maybe you could fingerprint CI runs using them?
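A hedged sketch of such fingerprinting, using vendor-documented variable names (GitLab, GitHub Actions, Travis CI, CircleCI, Jenkins, Azure Pipelines):

```python
import os

# Map well-known predefined environment variables to the CI system that
# sets them; variable names are taken from each vendor's documentation.
CI_MARKERS = {
    "GITLAB_CI": "GitLab CI",
    "GITHUB_ACTIONS": "GitHub Actions",
    "TRAVIS": "Travis CI",
    "CIRCLECI": "CircleCI",
    "JENKINS_URL": "Jenkins",
    "TF_BUILD": "Azure Pipelines",
}

def detect_ci(environ=None):
    env = os.environ if environ is None else environ
    for var, name in CI_MARKERS.items():
        if env.get(var):
            return name
    # Many systems also set a generic CI=true even without a vendor marker.
    if env.get("CI"):
        return "unknown CI"
    return None
```

Self-hosted runners and less common services would still slip through, so this identifies known CI systems rather than ruling CI out.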

It’s not easy to distinguish, but if there was a recommended environment variable or setting we could use to tag requests as coming from Azure, I could probably get that used in a lot of areas.

Though if I could do that, I could just as easily switch those same users to an internal mirror. People seem to get really upset when we suggest we could change default settings though (normally people on here/Twitter, actual customers keep begging us to replace PyPI for them… but that’s off topic for this thread :wink:).

We’ve generally avoided trying to do things like this, and instead tried to focus on ways to expose this information to allow downstream users to filter the data how they see fit. The general reasoning is that while some people will want the data with X thing filtered out, others will want it with Y thing left in.


There are already Python packages for identifying whether one is running under some CI and which CI service is in use. The ones I know of are ci-info and ci-py.


The objective would be to strongly incentivize reading the TODO docs on how to proxy-cache by default (and then to specify THISISFROMCI=1|str so that {pip,} would send the correct HTTP header to PyPI for those egregious ASN IP range(s)).

This conversation also brings to mind a question I recently pondered.

pip sends a User-Agent string to PyPI.

  • Does PyPI use the UA string at all?
  • Is the format of this string defined?
  • Should it be definitively defined somewhere, e.g. in a PEP?

This whole thread is a duplicate of Differentiating organic vs automated installations · Issue #5499 · pypa/pip · GitHub.

This information is tracked by pip, but discarded by linehaul on PyPI (Add the CI varaible to the data structure by dstufft · Pull Request #46 · pypa/linehaul · GitHub added it to the data structures, but it is not being stored).

See pip/session.py at bbf8466088655d22cd46b286c8f0b8150754c1d9 · pypa/pip · GitHub and Feature request: store "installed from CI" in the BigTable · Issue #9 · pypa/linehaul-cloud-function · GitHub.
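For reference, pip’s detection is roughly the following (paraphrased from the linked session.py; details may differ across pip versions):

```python
import os

# pip marks the "ci" field in its user agent as true when any of these
# well-known variables is present in the environment.
CI_ENVIRONMENT_VARIABLES = ("BUILD_BUILDID", "BUILD_ID", "CI", "PIP_IS_CI")

def looks_like_ci():
    # Returns True if the environment suggests a CI system; pip reports
    # null when it cannot tell either way.
    return any(name in os.environ for name in CI_ENVIRONMENT_VARIABLES)
```

That is why the `"ci": null` field appears in the user agent shown below when pip runs outside any recognized CI environment.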

>>> from pip._internal.network.session import user_agent
>>> import json
>>> user_agent()
'pip/20.2.2 {"ci":null,"cpu":"x86_64","distro":{"libc":{"lib":"glibc","version":"2.32"},"name":"Fedora","version":"33"},"implementation":{"name":"CPython","version":"3.9.1"},"installer":{"name":"pip","version":"20.2.2"},"openssl_version":"OpenSSL 1.1.1i FIPS  8 Dec 2020","python":"3.9.1","setuptools_version":"49.1.3","system":{"name":"Linux","release":"5.10.14-200.fc33.x86_64"}}'
>>> print(json.dumps(json.loads(user_agent().split(" ", 1)[1]), indent=2))
{
  "ci": null,
  "cpu": "x86_64",
  "distro": {
    "libc": {
      "lib": "glibc",
      "version": "2.32"
    },
    "name": "Fedora",
    "version": "33"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.9.1"
  },
  "installer": {
    "name": "pip",
    "version": "20.2.2"
  },
  "openssl_version": "OpenSSL 1.1.1i FIPS  8 Dec 2020",
  "python": "3.9.1",
  "setuptools_version": "49.1.3",
  "system": {
    "name": "Linux",
    "release": "5.10.14-200.fc33.x86_64"
  }
}