pre-PEP: User-Agent schema for HTTP requests against remote package indices

Thank you for your thoughtful feedback. I hope to be able to persuade you that this is an appropriate proposal for python packaging.

So far you have entirely discussed the implementation of User-Agent information in pip. Yet PyPI statistics include data from uv, poetry, pdm, and other tools provided through user agent strings. I think a PEP on this topic should cover how the User-Agent is used in multiple tools and why the status quo is a problem if multiple tools are using the current system today. Has there been past discussion of people running into issues with the status quo?

tech debt in log parsing

The most direct response I have to this is comments in the linehaul source code lamenting the lack of direct information from non-pip clients:

This seems like a pretty direct testament to how little meaningful inferential power is available without a standardized User-Agent string that other tools can conform to.

In addition to inferential power, we also have implementation complexity. This comment notes the parsing complexity that results from having a list of fallible parsers instead of a uniform protocol:

I think the specific complaint of “hard to find bugs in production” that risks the steady availability of these metrics would be another strong argument for standardization.
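To make that contrast concrete, here is a rough sketch of a fallible-parser chain versus a single uniform parse. The patterns and field names here are invented for illustration; the real linehaul parsers are considerably more involved:

```python
import json
import re

# Status quo, roughly: the collector tries a list of fallible,
# client-specific parsers until one matches. (Illustrative patterns,
# not linehaul's actual implementation.)
def parse_pip(ua):
    m = re.match(r"pip/(?P<version>\S+) (?P<data>\{.*\})$", ua)
    if m is None:
        return None
    return {"installer": "pip", "version": m.group("version"),
            **json.loads(m.group("data"))}

def parse_uv(ua):
    m = re.match(r"uv/(?P<version>\S+)", ua)
    if m is None:
        return None
    return {"installer": "uv", "version": m.group("version")}

PARSERS = [parse_pip, parse_uv]  # ...and many more in practice

def parse_legacy(ua):
    for parser in PARSERS:
        result = parser(ua)
        if result is not None:
            return result
    return None  # silently lose all information about this client

# With a standardized schema, one parser would cover every conforming
# client, and an unrecognized client would be a loud error, not a
# silent data gap:
def parse_standard(ua):
    _name_and_version, _, payload = ua.partition(" ")
    return json.loads(payload)
```

Every new client added to the legacy chain is another parser to write, deploy, and debug in production; the standardized path is one `json.loads` for everyone.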

contextual inference

However, another important pattern develops across further comments: lamenting that even Python-based clients like urllib don’t know how to provide the right version of Python:

we don’t really know anything about it-- including whether or not the version of Python mentioned is the one they’re going to install it into or not.

This highlights an important point:
(a) that the inferential power of our telemetry is deeply tied to the specifics of the packaging process.
(b) that the Python standard library (the urllib client) is not sufficient to conform to the current requirements.

This is why a proposed change to the packaging library made sense to me, especially since I believe the packaging.markers.Environment dict is exactly the (packaging-specific) information we’d want to provide, independent of the python interpreter process actually making the network request.
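For concreteness, those marker variables can all be computed from the standard library; the sketch below builds a subset of the same dict that packaging.markers.default_environment() returns, and attaches it to a hypothetical structured User-Agent. The "installer"/"environment" layout and the overall wire format are invented for illustration, not part of any standard:

```python
import json
import os
import platform
import sys

def environment_payload():
    """A subset of the PEP 508 marker variables, computed from the
    stdlib much like packaging.markers.default_environment() does."""
    return {
        "implementation_name": sys.implementation.name,
        "os_name": os.name,
        "platform_machine": platform.machine(),
        "platform_python_implementation": platform.python_implementation(),
        "platform_system": platform.system(),
        "python_full_version": platform.python_version(),
        "python_version": ".".join(platform.python_version_tuple()[:2]),
        "sys_platform": sys.platform,
    }

# A hypothetical structured User-Agent value (illustrative layout only):
def user_agent(name, version):
    payload = {"installer": {"name": name, "version": version},
               "environment": environment_payload()}
    return f"{name}/{version} " + json.dumps(payload, sort_keys=True)
```

Crucially, a tool could populate this payload from the *target* interpreter it plans to install into, rather than from whichever interpreter happens to be making the network request.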

tying rustc to rust dependencies that use it

Right now the inferential power is quite low, and this is a result of both the lack of standardization and the fact that the data is collected without user consent: not letting the user override the behavior means they can’t give you better information! This brings us to how we can improve the reporting of rustc versions.

@emmatyping regarding:

It seems to me that environment information is not particularly stable and thus may not be a good fit for a long-lasting standard (I want to think more about this, however).

Currently, checking the rustc version on the PATH is a very lossy way to describe which versions of rust are available on the system or used for builds, which I suspect is the inferential property people are attempting to identify.

For example, this information should be contextual: if pip wants to build a rust-enabled wheel from an sdist, then it can send the version of rust it’s going to use to PyPI when fetching the sdist. Then queries for versions of rust would specifically relate to how often someone was using some version of rust to build a package.

This would be particularly useful for developers of cross-platform pyo3 wheels like my medusa-zip project (cosmicexplorer/medusa-zip on GitHub, a library/binary for parallel zip creation), which would then be able to declare compatibility for a range of versions, and would know when it was time to move off:

Crawling PATH does not achieve this.
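As a sketch of what "contextual" means here: the installer reports the compiler it has actually selected for the build it is about to perform, and reports nothing otherwise. All field names below are hypothetical, not a proposed standard:

```python
def build_context_payload(installer_name, installer_version, rustc_version=None):
    """rustc_version is the compiler the installer has actually selected
    for building this sdist; None means no build step will use rustc.
    (Illustrative field names, not a proposed standard.)"""
    payload = {"installer": {"name": installer_name,
                             "version": installer_version}}
    if rustc_version is not None:
        payload["rustc"] = {"version": rustc_version}
    return payload

# Fetching a pure-Python wheel: no toolchain information is sent at all.
wheel_fetch = build_context_payload("pip", "24.0")

# Fetching an sdist the installer will build with a rustc it selected:
sdist_fetch = build_context_payload("pip", "24.0", rustc_version="1.77.2")
```

The difference from crawling PATH is that the reported version is, by construction, the one used for the build the request is part of.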

remote execution and unique IDs

I work on the pants build tool and spack package manager. Both of these tools are organized around contextual graph relations, and they execute subprocesses in a hermetic environment. In particular, I know neither of these tools will have any rustc dependency on the PATH unless it’s a build-time dependency for rust code. Pants in particular has the capability for remote execution, which means pip wouldn’t even connect to PyPI from the same IP address as the user, let alone the same filesystem state.

Do we want to be able to provide a unique ID that correlates these requests to index servers, even across nodes? Pants demonstrates this can be done while respecting consent and security (Anonymous telemetry | Pantsbuild):

How we avoid exposing proprietary information

Innocuous data elements such as filenames, custom option names and custom goal names may reference proprietary information. E.g., path/to/my/secret/project/BUILD. To avoid accidentally exposing even so much as a secret name:

  • We don’t send the full command line, just the goals invoked.
  • Even then, we only send standard goal names, such as test or lint, and filter out custom goals.
  • We only send numerical error codes, not error messages or stack traces.
  • We don’t send config or environment variable values.

To be frank, this explicit discussion of how pants protects my proprietary data is the kind of guarantee that I expect to see from a program that fetches or builds code for me from remote sources. This is the kind of page I really want to see from a packaging ecosystem so I can write secure build pipelines for my clients.
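A minimal sketch of both mechanisms, loosely modeled on the Pants behavior quoted above: an opaque opt-in ID, plus an allowlist filter that drops anything potentially proprietary. The goal names and file layout here are invented for illustration:

```python
import uuid
from pathlib import Path

def telemetry_id(config_dir, enabled):
    """An opaque, random, persisted ID: sent only on explicit opt-in,
    and derived from nothing about the user or their machine."""
    if not enabled:
        return None  # no consent means no ID and no telemetry request
    id_file = Path(config_dir) / "telemetry_id"
    if not id_file.exists():
        id_file.write_text(uuid.uuid4().hex)
    return id_file.read_text()

STANDARD_GOALS = {"test", "lint", "fmt", "check", "package"}  # illustrative subset

def reportable_goals(argv):
    """Report only recognized standard goal names; custom goals, target
    paths, and option values are all filtered out."""
    return [arg for arg in argv if arg in STANDARD_GOALS]
```

The same shape would work for correlating package-index requests across remote-execution nodes: the ID identifies a consenting installation, not a person or a codebase.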

comparison to trusted publishing

And in fact, it is the kind of page PyPI has already: Security Model and Considerations - PyPI Docs

In addition to the requirements above, you can do the following to “ratchet down” the scope of your Trusted Publishing workflows:

  • Use per-job permissions: The permissions key can be defined on the workflow level or the job level; the job level is always more secure because it limits the number of jobs that receive elevated GITHUB_TOKEN credentials.

It’s recognized on this page that while security is never guaranteed, PyPI provides specific safeguards you can deploy to increase your safety for certain types of jobs. I would very much like to have similar guarantees for my CI jobs which download those trusted wheels, particularly ones which are performed in order to subsequently move them across a trust boundary.

comparison to zip confusion

Recently, uv had a parsing vulnerability: uv security advisory: ZIP payload obfuscation

This gives the attacker the ability to create a ZIP that extracts differently across installers: an installer that processes the central directory will receive one set of files, while an installer that processes only the local file entries will receive a different set of files.

I mention this not to call them out, but to highlight that the key concern that motivated the CVE designation was identifying “a zip that extracts differently based on which software processes it”. This was precisely my motivation for specifying in this protocol that a package repository may not discriminate based upon User-Agent value.

In fact, this led PyPI to tighten restrictions on wheel parsing (Preventing ZIP parser confusion attacks on Python package installers - The Python Package Index Blog) without going through the normal standards process, because security concerns are objective. I am describing, very specifically, an avenue by which something like surveillance or even zip confusion might be achieved programmatically, in a way that can’t be detected. There are other indexes besides PyPI to consider, and I honestly fail to understand why this is not directly analogous to the proactive response PyPI took against potential zip confusion.

alternate repos need standard metrics endpoints

I am trying to build systems that interoperate with PyPI, and with pip, and with uv, and with poetry, and with pex, and with pants, and with spack. I can’t build an alternate package repository to PyPI and get useful telemetry unless I can expect packaging tools to standardize it.

Astral recently announced a product that does just this:

And in doing so, specifically noted:

You won’t need to use pyx to use uv, and you won’t need to use uv to use pyx.

@charliermarsh I appreciated this statement! Would standardizing telemetry inputs (e.g. from pip and poetry) be useful for pyx?

I was definitely aware of uv before April 2024, so you could have had better PyPI metrics demonstrating its usage without patching upstream if a standard like this existed:

bigquery is not a standard

Here’s my final point.

Given that the metadata information right now is only available via google bigquery, it is impossible to rely on it as a resource the way we rely on PyPI itself. I cannot build servers that provide this information, and I cannot make interfaces to it from pip without entering into a financial relationship with the bigquery product. I do not fault PyPI for making the best decision with their limited resources, and I am thankful that google makes the free tier available. But I think the bigquery product team would agree that if I were to build an interface to query this info programmatically, it would necessarily require, for example, provisioning an API key.

Furthermore, the structure of the bigquery table (as far as I can tell) is not subject to any python packaging standard, yet is hosted on our standards docsite: Analyzing PyPI package downloads - Python Packaging User Guide. On the API docsite, we’re much more clear about it (BigQuery Datasets - PyPI Docs):

Download Statistics Table

Table name: bigquery-public-data.pypi.file_downloads

The download statistics table allows you to learn more about download patterns of packages hosted on PyPI.

Compare to:

Project Metadata Table

Table name: bigquery-public-data.pypi.distribution_metadata

We also have a table that provides access to distribution metadata as outlined by the core metadata specifications.

If we’re producing one CC-licensed dataset (the metadata) that conforms to our core specifications, it seems only natural to consider whether the other CC-licensed dataset we publish under the PyPI brand is also worth standardizing.

environment markers are half the battle

One idea that I found really thoughtful from PEP 777 – How to Re-invent the Wheel | peps.python.org was to specifically limit compression formats for wheels to protocols supported by the CPython standard library. This is a sort of bidirectional pressure, which allows for innovation as long as it’s made available to every user.

For what I believe was a similar rationale, one reason I particularly identified Dependency specifiers - Python Packaging User Guide as a sub-schema for this proposal is so that the time and effort we’ve spent standardizing them could also be available in bigquery, instead of having an alternate data schema:

The marker language is inspired by Python itself, chosen for the ability to safely evaluate it without running arbitrary code that could become a security vulnerability.

The environment marker language is also something we can evaluate in Python. I can’t implement bigquery myself.
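As a toy illustration of that safety property, here is an evaluator for a tiny subset of the marker language (version comparisons on a single variable only; real tools should use packaging.markers rather than anything like this):

```python
import operator
import re

OPS = {"==": operator.eq, "!=": operator.ne,
       ">=": operator.ge, "<=": operator.le}

def evaluate_marker(marker, env):
    """Evaluate e.g. "python_version >= '3.8'" against an environment
    dict, without eval() or any arbitrary code execution."""
    m = re.fullmatch(r"\s*(\w+)\s*(==|!=|>=|<=)\s*['\"]([^'\"]+)['\"]\s*",
                     marker)
    if m is None:
        raise ValueError(f"unsupported marker: {marker!r}")
    name, op, literal = m.groups()

    def as_version(value):
        # compare dotted numeric versions as integer tuples
        return tuple(int(part) for part in value.split("."))

    return OPS[op](as_version(env[name]), as_version(literal))
```

The point is not this particular toy, but that the marker grammar was deliberately designed so that any index, server, or query engine can evaluate it safely, which is exactly the property a standardized telemetry schema should inherit.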
