pre-PEP: User-Agent schema for HTTP requests against remote package indices

(This is my first attempt to propose a packaging standard in this forum. I am basing this off the instructions at PyPA Specifications — PyPA documentation. Those instructions seem to indicate that a PR against GitHub - pypa/packaging.python.org: Python Packaging User Guide should be provided at the same time, but I’m not seeing many examples of that being done for in-progress PEPs, so I am assuming this is the appropriate first stop for potential new PEPs. I also could not find a standard format expected for PEP proposals, so I’m basing the general structure off of William Woodruff’s pre-PEP at Pre-PEP: Trusted Publishing token exchange .)

Problem Statement

PyPI exposes a BigQuery dataset for package download statistics: Analyzing PyPI package downloads - Python Packaging User Guide. To the best of my knowledge, this is drawn from information provided in the User-Agent string of requests made to PyPI, which largely depends upon pip’s support for embedding that information. This leads to a few problems, which I realized upon identifying this behavior in pip recently (Enable overriding undocumented telemetry identifier with PIP_TELEMETRY_USER_AGENT_ID by cosmicexplorer · Pull Request #13560 · pypa/pip · GitHub):

  1. The information provided is very similar to the Environment Markers specification (Dependency specifiers - Python Packaging User Guide), but not actually standardized according to a PEP.
  2. This amounts to a form of telemetry which users are largely unaware of and cannot opt out of.
  3. Searching for and executing tools like rustc from the user’s PATH slows down pip resolves and constitutes a potential RCE vector.

More generally, making this information purely an implementation detail of pip means that indices besides PyPI itself are unable to take advantage of it. In my work at Twitter, we hosted an internal --find-links index, and then eventually an implementation of the Simple Repository API. While we certainly could have (and did) introduce our own internal telemetry for these purposes, it would make the job of tooling much easier if we could expect standard procedures for declaring the identity of whatever is fetching against our internal repo. Furthermore, pip’s implementation employs heuristics to identify whether it’s running in CI; it would be preferable if CI runs, e.g. from GitHub Actions, could instead intentionally announce themselves in a standardized way.

Finally, and most importantly, pip is not the only resolver in town, and it would benefit PyPI to be able to reliably collect detailed statistics from other resolvers, including poetry and uv. In addition to supporting non-PyPI backends, we should also look to support statistics collection from non-pip clients.

Existing Work

PIP_USER_AGENT_USER_DATA

There is an existing attempt to specialize this info in pip, via the PIP_USER_AGENT_USER_DATA environment variable. It is only obliquely documented (User Guide - pip documentation v25.3.dev0), and only modifies a single subkey of the json object encoded into User-Agent. This environment variable does not avoid the heuristics necessary to detect a CI run, and does not enable users to avoid telemetry altogether.

Environment Dictionary

The environment markers specification as implemented in packaging provides a TypedDict to codify the current environment, which demonstrates a strong overlap with much of the information currently provided to PyPI:
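
For reference, a paraphrased sketch of that TypedDict’s shape (abbreviated from the packaging source; consult packaging.markers for the authoritative definition):

from typing import TypedDict

class Environment(TypedDict, total=False):
    # Paraphrased from packaging.markers: every field is a plain string,
    # so the whole dict is trivially json-serializable.
    implementation_name: str             # e.g. "cpython"
    implementation_version: str          # e.g. "3.12.2"
    os_name: str                         # e.g. "posix"
    platform_machine: str                # e.g. "x86_64"
    platform_release: str                # e.g. "6.5.0-1016-azure"
    platform_system: str                 # e.g. "Linux"
    platform_version: str
    platform_python_implementation: str  # e.g. "CPython"
    python_full_version: str             # e.g. "3.12.2"
    python_version: str                  # e.g. "3.12"
    sys_platform: str                    # e.g. "linux"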

The environment markers specification is far more complex than the telemetry pip encodes in the User-Agent string, in order to support complex queries during a resolve, much like what we want to enable for PyPI’s BigQuery dataset. Therefore, basing this proposed standard off of the schema used for environment markers seems like the right path.

Linehaul Log Parser

The log parser that PyPI uses to generate statistics is located at linehaul-cloud-function/linehaul/ua/parser.py at 8b8ed9db4ed946722d011c7d5ffe08a03ab5f942 · pypi/linehaul-cloud-function · GitHub. Note that it has distinct implementations for different versions of pip (linehaul-cloud-function/linehaul/ua/parser.py at 8b8ed9db4ed946722d011c7d5ffe08a03ab5f942 · pypi/linehaul-cloud-function · GitHub), and employs complex optimizations (linehaul-cloud-function/linehaul/ua/impl.py at 8b8ed9db4ed946722d011c7d5ffe08a03ab5f942 · pypi/linehaul-cloud-function · GitHub) to identify which parser to use.

While the PyPI log parser will likely have to retain some degree of complexity to support older and nonconforming clients, standardizing the information sent would seem to make it easier to optimize their parser and avoid heuristic matching by version for pip fetches, which constitute the majority of PyPI requests.

Solution Statement

  • A typed specification should be produced for the information PyPI makes use of to generate download statistics.
    • It may be useful to provide a reference implementation of this in the packaging library.
  • The specification must include a mechanism for the user to disable or override statistics gathering.
    • In cases where a tool like pip is executed multiple times or in a loop, generating this info once may be preferable to regenerating it from scratch upon each execution.
  • The specification should likely incorporate an explicit marker for CI or other automated executions.
  • The specification should provide a way to disable or override the need to scan the user’s PATH for tools like rustc.
    • This addresses security concerns, and allows this information to be provided within hermetic build environments which do not provide executables within the same PATH.

Proposal

It would be most useful to provide this information in the packaging library, as it would be easy for pip to consume while avoiding pip-specific assumptions. In particular, I think explicitly incorporating the Environment dict would be appropriate, as that is an existing standardized and json-serializable representation of the running Python environment.

On top of that, the pip implementation makes use of several version strings for named resources, including libc, openssl, setuptools, and rustc (which are not provided if unavailable). These would make sense to provide in a separate dictionary from the Python environment, particularly if we want to enable overriding or disabling them independently.

Taking this approach to the data pip currently encodes would result in a json dict like the following:

{
  "python_environment": { <json of packaging.markers.Environment> },
  "versions": {
    "libc": { "name": "glibc", "version": "2.42" },
    "tls": { "name": "openssl", "version": "3.5.3" },
    "setuptools": { "version": "80.9.0" },
    "rustc": { "version": "1.90.0" }
  },
  "ci": false,
  "user_data": <arbitrary user data>
}
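
As a minimal sketch of how a client might assemble that payload, assuming packaging.markers.default_environment() as the source of the python_environment key (build_user_agent_payload is a hypothetical helper, and the versions argument is supplied by the caller rather than inferred):

import json
from packaging.markers import default_environment

def build_user_agent_payload(versions, ci=None, user_data=None):
    # default_environment() returns the standardized, json-serializable
    # dict of environment marker values for the running interpreter.
    return json.dumps({
        "python_environment": default_environment(),
        "versions": versions,  # e.g. {"rustc": {"version": "1.90.0"}}
        "ci": ci,
        "user_data": user_data,
    }, separators=(",", ":"))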

User-Agent Encoding and Declaring Support for this PEP

Right now, pip encodes this into a User-Agent string as pip/25.2 <json>, where “pip” is the client name and “25.2” is its version string. In order to produce a standard for high-quality data that PyPI’s statistics gathering can rely on, it seems appropriate to specify how the json dict is encoded into an HTTP request header.

However, specifying <client name>/<client version> isn’t quite enough info to clarify that the rest of the User-Agent string conforms to this PEP. To make log parsing easier, there are two potential approaches:

  1. Specifying this information in a separate header besides User-Agent.
  2. Providing a parenthesized (PEP NNN) suffix to the <client name>/<client version> prefix.

The User-Agent parser in the linehaul project (linehaul-cloud-function/linehaul/ua/parser.py at main · pypi/linehaul-cloud-function · GitHub) seems to be defined in terms of the User-Agent string, so employing a separate header seems like it would be much more difficult for PyPI to consume than sticking to the User-Agent. From MDN (User-Agent header - HTTP | MDN), the parenthesized suffix to User-Agent seems to be relatively standard, even if web browsers tend to use it for platform info and not to define the syntax of the User-Agent string itself.

Using an unambiguous (PEP NNN) suffix should be easy to scan for with a fast substring search before engaging a slower fallible parser, and would allow for other clients to provide equivalent information to pip without matching client name and version range.

So the overall proposed specification for HTTP requests becomes <client name>/<client version> (PEP NNN) <json>, written into the User-Agent HTTP header.
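
For illustration, with “PEP NNN” standing in for the not-yet-assigned number, a client-side encoder and the fast server-side conformance scan might look like the following sketch:

UA_MARKER = "(PEP NNN)"  # placeholder: the actual PEP number is not yet assigned

def encode_user_agent(client_name, client_version, payload_json):
    # <client name>/<client version> (PEP NNN) <json>
    return f"{client_name}/{client_version} {UA_MARKER} {payload_json}"

def claims_conformance(user_agent):
    # Cheap substring scan a server can run before engaging a slower,
    # fallible parser.
    return UA_MARKER in user_agent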

Overrides

Implementations of this standard should allow the user to either disable the User-Agent info altogether, or to override portions of it; both options turn off the data collection the client would otherwise perform. Differentiating “disable” from “override” is important, as it allows an implementation to check that a user-provided value conforms to the specification and raise an error if not. This paradigm enables a user to provide valuable statistics to PyPI even if they use e.g. a hermetic build environment which precludes their client inferring that info, and generally allows users to provide only as much information as they or their employer are comfortable providing to an external service.

This standard is intended to describe information which is independent of specifics such as a particular process execution model, so it’s unclear how best to incorporate an override mechanism. For example, a tool like pex which provides a library API for resolves may decide to provide overrides as a keyword argument to its resolve_multi() method. However, since PyPI download stats largely rely upon this information being provided from pip, it should be reasonable to use pip (and its one-shot process execution model) as the canonical representation.

With that in mind, we can describe a set of environment variables to modify the above representation. Conforming implementations need not use environment variables, but MUST enable the same overrides through some equivalent mechanism (a sketch of one possible interpretation follows the list):

  • PIP_USER_AGENT=disable: does not generate a User-Agent header.
    • This implies not generating the <client name>/<client version> prefix as well.
  • PIP_USER_AGENT=override=pip/25.2 { ... }: supplies a string to be used as the User-Agent instead of having the client infer it.
    • If override is used for any option, the client SHOULD ensure the result produces conforming output, or produce an error.
    • Tools like pex which wrap pip or other clients would be able to specify this to avoid having the client attempt to infer any information.
  • PIP_USER_AGENT_USER_DATA: this would remain the same as now.
    • This is the single element that is not inferred by pip, so it does not conform to the “disable”/“override” paradigm.
    • This is a string, and is not decoded as a json blob.
  • PIP_USER_AGENT_CI=override=true: this would provide a standardized way for CI runners to signal their environment.
    • This is also a json blob like the others, just one which corresponds to the json true or false.
    • If not provided, pip would still perform the heuristic inference it does now.
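
A sketch of one possible client-side interpretation of PIP_USER_AGENT under these semantics (the infer and validate callbacks stand in for the client’s normal generation logic and schema check, respectively):

import os

def resolve_user_agent(infer, validate):
    setting = os.environ.get("PIP_USER_AGENT")
    if setting == "disable":
        return None  # omit the User-Agent header entirely
    if setting is not None and setting.startswith("override="):
        provided = setting[len("override="):]
        validate(provided)  # SHOULD raise on non-conforming overrides
        return provided
    return infer()  # no override: fall back to normal client inference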

Individual Version Overrides

The above are the most important keys, and may be all we want to support for the first iteration of this standard. If we wanted to extend this standard to cover every key of the json dict, it might look like the following (with =disable supported for all keys; a decode/validate/re-encode sketch follows the list):

  • PIP_USER_AGENT_PYTHON_ENVIRONMENT=override=<json blob>: provide a json blob which is decoded, matched against packaging.markers.Environment, then re-encoded.
  • PIP_USER_AGENT_VERSIONS=override=<json blob>: provide a json blob for the versions dict which is decoded, matched against the known schema, then re-encoded.
  • PIP_USER_AGENT_VERSIONS_LIBC=override={"name": "glibc", "version": "2.42"}: provide a json blob for the libc resource in the versions dict, which is decoded, matched against the schema for the libc resource, then re-encoded.
  • PIP_USER_AGENT_VERSIONS_RUSTC=override={"version": "1.90.0"}: provide a json blob for the rustc resource in the versions dict, which is decoded, matched against the schema for the rustc resource, then re-encoded.
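
Each of these would follow the same decode/validate/re-encode shape; a sketch for the libc key (with the schema check deliberately simplified):

import json

def apply_libc_override(raw):
    blob = json.loads(raw)  # decode
    # Match against the known schema for the libc resource.
    if set(blob) != {"name", "version"}:
        raise ValueError(f"libc override does not conform: {raw!r}")
    return json.dumps(blob, separators=(",", ":"))  # re-encode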

The specificity in this section is much less important, and I expect this section will not make it into the final version of the PEP. I think it’s far more important to extend the existing *_USER_DATA variable to cover the entire User-Agent header. This is mostly provided as food for thought, since it relates to other potential future standards which codify dependencies outside the Python ecosystem.

Other Considerations

  • The implementation in the packaging library would be especially useful for generating this information out-of-band to provide with PIP_USER_AGENT=override=....
    • It would also be of use for servers implemented in Python to parse the User-Agent header, although perhaps a separate json schema might be more appropriate.
    • PEPs do not seem to have adopted a standard json schema representation yet, but that would be especially useful to consider for this one as it specifically relates to network calls.
  • It will be important to clarify that this information should not be used to discriminate against particular clients, as this would deter clients from providing these useful statistics.
  • The primary consumer of this info is PyPI, and the primary producer is pip; the main purpose of this proposal is to make telemetry gathering from pip → PyPI more secure and respectful of the user’s consent. Still, input from other Python package repos and other resolvers would be very welcome.

Conclusion

I am hoping for this PEP to be as minimal in scope as possible, and most importantly to incorporate PIP_USER_AGENT=disable, for performance reasons as well as to respect user consent.

The proposed semantics of =override are complex, and if they’re too difficult to agree on, I think they should be dropped entirely in favor of just PIP_USER_AGENT=disable (and retaining the existing semantics of PIP_USER_AGENT_USER_DATA). PIP_USER_AGENT_CI would be nice to have as well, but PIP_IS_CI is already available for specifically pip invocations that would like to signal this information to PyPI or other indices.

The main contributions of this PEP are expected to be:

  • to standardize the User-Agent schema,
  • to enable end user opt-out for any telemetry gathering.

Does this rise to the level of an interoperability standard? I’d expect this sort of opt-out to be on a per-tool basis, perhaps with some installers (e.g. in a linux redistribution?) deciding not to provide any upstream data.

More generally, if you really care about disabling analytics, you’d probably want to audit the source (or e.g. network packets) yourself, which implies locking to a specific version of a given tool.


1 Like

That’s a good point, and perhaps that allows simplifying this proposal by merely specifying the User-Agent schema. It was unclear to me how to incorporate PIP_USER_AGENT_USER_DATA, but it would certainly simplify this to avoid discussion of environment variables entirely. From checking linehaul (linehaul-cloud-function/linehaul/ua/datastructures.py at 8b8ed9db4ed946722d011c7d5ffe08a03ab5f942 · pypi/linehaul-cloud-function · GitHub), it seems PIP_USER_AGENT_USER_DATA is not consumed by PyPI statistics at all. That speaks to the utility of standardizing this schema (including a user_data field), but I think the environment variables can be removed.

I will note that in general, telemetry can’t be avoided unless it’s codified, and it’s not obvious to me how to separate opt-out signals from the data format itself. I think the way pants does anonymous telemetry is very well done (Anonymous telemetry | Pantsbuild) and I think it’s useful for this forum to incorporate the concern about data sharing.

But for now, I’ll revise this to remove the environment variables and focus on the User-Agent schema. Thanks for the quick feedback!

1 Like

I think specifying that the User-Agent need not be provided at all, or need not provide any info, or need not provide any individual field, would be a way to support clients which would like to limit information gathering. This would be in tandem with the separate requirement that the User-Agent field not be used by indices to discriminate.

1 Like

I would really like it if there were a standard “do-not-track” signal that tools could agree to support, so that using pip vs uv vs poetry doesn’t require individualized opt-outs. I now have some more developed thoughts in reply to this:

I’d expect this sort of opt-out to be on a per-tool basis

We have file formats like PEP 751 which specify which domains to resolve from, and pip reads from the .netrc file. Since this particularly regards the User-Agent header (and is closely related to authentication), I think limiting the amount of information sent to the remote server is within scope for a specification of the User-Agent header.

perhaps with some installers (e.g. in a linux redistribution?) deciding not to provide any upstream data.

Most python installers are not provided by the OS. CPython itself bundles pip via ensurepip, as per PEP 453 – Explicit bootstrapping of pip in Python installations | peps.python.org (ensurepip — Bootstrapping the pip installer — Python 3.13.7 documentation). PEP 453 even goes so far as to specify a bundled set of CA certificates (PEP 453 – Explicit bootstrapping of pip in Python installations | peps.python.org).

I think this is a matter of security. For example, the current pip implementation will provide not just the OpenSSL version number, but also the precise build date (this is what’s provided by _ssl.OPENSSL_VERSION). That information is made available from CPython as an underscored module, probably because it’s not intended to be broadcast over the internet.

The concern that spawned this proposal was finding that pip scans PATH for rustc and executes it every time it’s run, which is very surprising behavior. If we want to continue to incorporate that info into PyPI analytics, I think users really need the ability to say when it’s not allowed.

More generally, if you really care about disabling analytics, you’d probably want to audit the source (or e.g. network packets) yourself, which implies locking to a specific version of a given tool.

Sure, but this framing holds two assumptions:

  1. Python packaging tools are always wielded by experts in network security.
  2. The version of a Python packaging tool is always under the control of the end user executing it.

When executing pip inside of Twitter to download wheels to subsequently upload to our internal repository, I would have liked to avoid providing information that would identify that particular scenario. I wasn’t aware at the time that pip was sending this information to PyPI. I think this is something a lot of corporate users would be very interested in for security reasons, and it would be useful for Python packaging standards to fulfill that need without forking pip or other tools.

More generally, I think the adversarial relationship implied here about auditing the source isn’t really necessary. I think it should be plausible to make broad guarantees about what information the tool will send to a remote index by seeing whether the tool implements PEP NNN in its docs.

Issues with Forking

There are some issues with framing this as a matter of repackaging:

  • as mentioned above, most users do not rely on their OS to provide pip or other Python packaging tools.
  • not all executions are going to be especially sensitive.
  • requiring users to fork Python packaging tools to avoid telemetry means they also need to fork CPython (due to ensurepip), and generally creates an artificial impetus to avoid staying up to date with packaging standards.

I think requiring a fork (or a separate packaging tool) to override some behavior sounds like a textbook example of the kind of behavior that should be standardized.

3 Likes

Given the strictness with which PEP 453 requires downstream distributors to use pip specifically if at all possible (PEP 453 – Explicit bootstrapping of pip in Python installations | peps.python.org), and specifically says “do not remove the ensurepip module in Python 3.4 or later” (or asks them to explicitly document any removal), it would seem within scope at least to use the PEP process to describe the env var configuration for pip’s User-Agent, since PEP 453 requires them to make it available.

PEP 453 has a section on “security considerations” which notes that there are distinct security considerations from connecting to PyPI (PEP 453 – Explicit bootstrapping of pip in Python installations | peps.python.org), but does not specify or link to a description of those considerations. It would seem to make sense to codify the security considerations of connecting to PyPI, which is what the current iteration of this proposal attempts to achieve.

As an extension of PEP 453, I think specifying how to disable telemetry pip sends to PyPI seems within scope.

I think I have a couple of high level comments/questions about this idea:

  1. So far you have entirely discussed the implementation of User-Agent information in pip. Yet PyPI statistics include data from uv, poetry, pdm, and other tools provided through user agent strings. I think a PEP on this topic should cover how the User-Agent is used in multiple tools and why the status quo is a problem if multiple tools are using the current system today. Has there been past discussion of people running into issues with the status quo?
  2. The discussion of pip checking the Rust version indicates to me that different tools may wish to include different pieces of information and search for things in different ways. Standards should be weighed between consistency of usage/interoperability and flexibility/ability for the ecosystem to innovate. It seems to me that environment information is not particularly stable and thus may not be a good fit for a long-lasting standard (I want to think more about this, however).
2 Likes

Hey @cosmicexplorer, thanks for opening this thread!

On a high level:

  • I agree with @AA-Turner that the environment + opt-in/out semantics in this proposal are probably not the right fit for a PEP, since they concern the UX of a packaging tool/component and not its interoperability with other aspects of the packaging ecosystem.
  • I do on the other hand think that it would be good to establish a standard payload structure for installer UAs, and I agree with the rationale you brought up around simplifying linehaul and making it easier for third-party index implementations/ecosystem participants to rely on these UA structures :slightly_smiling_face:. I also agree with what @emmatyping wrote up about ensuring that any standard we put up here isn’t overly constrained – I think a good PEP here would ensure that the pip/pdm/uv/etc. maintainers could easily add new datapoints/facets in the future without having to enter a standards process.

Some individual points:

I think the performance point here is good, but IMO your motivation in this pre-PEP would be stronger without the RCE argument – with my security SME hat on, I don’t think there’s a realistic RCE risk with invoking rustc --version from the user’s pre-existing PATH. It’s true that rustc is arbitrary code in some sense (mainly in that it’s external to pip), but I don’t think it’s “remote” in the sense that RCE is typically used and indicates an attacker position that’s already stronger than needing to be triggered through pip (since any user build of Rust code would presumably invoke rustc).

(I also think that pip’s security model effectively implicitly defines all programs in the user’s PATH as trusted, since the execution contract with build backends is that they could (and do in practice) run arbitrary code and invoke arbitrary underlying native build systems.)

Could you say a bit more about the heuristics you’ve identified? The code in question is this, I believe:

My understanding of these environment variables is that they’re all explicitly documented to mean “this host is definitely a CI,” i.e. they’re not heuristics as such. Users can of course impersonate a CI using these variables, but that gets us down the rabbit hole of cryptographically verifying the authenticity of machines that pip runs on, and that seems like an exercise in frustration :slightly_smiling_face:

I don’t think it’s a hard-and-fast rule, but in general I believe it’s considered good to not have PEP identifiers become long-lived “markers” for standard conformance – FWIW it’s generally expected that packaging PEPs become living PyPA specs, at which point the language in the numbered PEP might become stale or deviate normatively from the living spec.

As an alternative here: have you considered an “intrinsic” encoding, i.e. one within the JSON payload itself? I think even { "v": "1", ... } would suffice, since I’d expect nothing to currently use the v key. A slightly longer version could be "linehaul": "v1.0.0" or similar, to make it clear that this is a standardization of the existing linehaul format and that the living spec might undergo Semver-style revisions.

(The downside to this is that it puts the version detection into the payload, but this is maybe fine operationally – linehaul is complicated and probably not optimal in terms of performance, but it’s also AFAIK not a significant drag on PyPI’s overall latency or uptime. Plus the payload needs to be decoded anyways.)
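
Concretely, a version-tagged payload would let the log parser dispatch on the payload itself, along these lines (a sketch only; parse_v1 and parse_legacy are hypothetical stand-ins, not linehaul’s actual functions):

import json

def parse_ua_payload(payload):
    blob = json.loads(payload)
    if blob.get("v") == "1":
        return parse_v1(blob)  # the standardized schema
    return parse_legacy(blob)  # per-client fallback parsers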

2 Likes

Right, there are multiple places in the pip internal and vendored code base that call processes on the PATH; for example, if you are on a musl Linux distro, getting the musl version is achieved by calling a sub-process: https://github.com/pypa/packaging/blob/25.0/src/packaging/_musllinux.py#L52

So even if you have --only-binary ":all:" enabled, you are not guaranteed that pip won’t call out to a subprocess to figure out the information it needs to filter wheels.

If performance is the primary motivator then I would be in favor of adding a cache that keys off something like the path, the size, and the mtime of the executable (where that makes sense, like rustc).

But all of this is pip internal design discussion, and, as seems to be the consensus here, not a question for interoperability.

I would be in favor of standardizing what information should be collected, especially if that was implemented in a new or existing library and then pip could vendor that reference implementation. But as already discussed, whether a user can change or disable what is added to the user agent string should be a tool UX choice.

4 Likes

Thank you for your thoughtful feedback. I hope to be able to persuade you that this is an appropriate proposal for python packaging.

So far you have entirely discussed the implementation of User-Agent information in pip. Yet PyPI statistics include data from uv, poetry, pdm, and other tools provided through user agent strings. I think a PEP on this topic should cover how the User-Agent is used in multiple tools and why the status quo is a problem if multiple tools are using the current system today. Has there been past discussion of people running into issues with the status quo?

tech debt in log parsing

The most direct response I have to this is comments in the linehaul source code lamenting the lack of direct information from non-pip clients:

This seems like a pretty direct testament to how little can be meaningfully inferred without a standardized User-Agent string that other tools can conform to.

In addition to inferential power, we also have implementation complexity. This notes the parsing complexity that results from having a list of fallible parsers instead of a uniform protocol:

I think the specific complaint of “hard to find bugs in production” that risks the steady availability of these metrics would be another strong argument for standardization.

contextual inference

However, there is another important pattern that develops with further comments: lamenting how even Python-based clients like urllib don’t know how to provide the right version of Python:

we don’t really know anything about it-- including whether or not the version of Python mentioned is the one they’re going to install it into or not.

This highlights two important points:
(a) the inferential power of our telemetry is deeply tied to the specifics of the packaging process.
(b) the Python standard library (the urllib client) is not sufficient to conform to the current requirements.

This is why a proposed change to the packaging library made sense to me, especially since I believe the packaging.markers.Environment dict is exactly the (packaging-specific) information we’d want to provide, independent of the python interpreter process actually making the network request.

tying rustc to rust dependencies that use it

Right now the inferential power is quite low, and this is a result of both the lack of standardization and the fact that the data is collected without user consent: not letting the user override the behavior means they can’t give you better information! This brings us to how we can improve rustc.

@emmatyping regarding:

It seems to me that environment information is not particularly stable and thus may not be a good fit for a long-lasting standard (I want to think more about this, however).

Currently, checking the rustc version on the PATH is a very lossy way to describe which versions of rust are available on the system or used for builds, which I suspect is the inferential property people are attempting to identify.

For example, this information should be contextual: if pip wants to build a rust-enabled wheel from an sdist, then it can send the version of rust it’s going to use to PyPI when fetching the sdist. Then queries for versions of rust would specifically relate to how often someone was using some version of rust to build a package.

This would be particularly useful for developers of cross-platform pyo3 wheels like my medusa-zip project (GitHub - cosmicexplorer/medusa-zip: A library/binary for parallel zip creation), which would then be able to declare compatibility for a range of versions, and would know when it was time to move off:

Crawling PATH does not achieve this.

remote execution and unique IDs

I work on the pants build tool and spack package manager. Both of these tools are organized around contextual graph relations, and they execute subprocesses in a hermetic environment. In particular, I know neither of these tools will have any rustc dependency on the PATH unless it’s a build-time dependency for rust code. Pants in particular has the capability for remote execution, which means pip wouldn’t even connect to PyPI from the same IP address as the user, let alone the same filesystem state.

Do we want to be able to provide a unique ID that correlates these requests to index servers, even across nodes? Pants demonstrates this can be done while respecting consent and security (Anonymous telemetry | Pantsbuild):

How we avoid exposing proprietary information

Innocuous data elements such as filenames, custom option names and custom goal names may reference proprietary information. E.g., path/to/my/secret/project/BUILD. To avoid accidentally exposing even so much as a secret name:

  • We don’t send the full command line, just the goals invoked.
  • Even then, we only send standard goal names, such as test or lint, and filter out custom goals.
  • We only send numerical error codes, not error messages or stack traces.
  • We don’t send config or environment variable values.

To be frank, this explicit discussion of how pants protects my proprietary data is the kind of guarantee that I expect to see from a program that fetches or builds code for me from remote sources. This is the kind of page I really want to see from a packaging ecosystem so I can write secure build pipelines for my clients.

comparison to trusted publishing

And in fact, it is the kind of page PyPI has already: Security Model and Considerations - PyPI Docs

In addition to the requirements above, you can do the following to “ratchet down” the scope of your Trusted Publishing workflows:

  • Use per-job permissions: The permissions key can be defined on the workflow level or the job level; the job level is always more secure because it limits the number of jobs that receive elevated GITHUB_TOKEN credentials.

It’s recognized on this page that while security is never guaranteed, PyPI provides specific safeguards you can deploy to increase your safety for certain types of jobs. I would very much like to have similar guarantees for my CI jobs which download those trusted wheels, particularly ones which are performed in order to subsequently move them across a trust boundary.

comparison to zip confusion

Recently, uv had a parsing vulnerability: uv security advisory: ZIP payload obfuscation

This gives the attacker the ability to create a ZIP that extracts differently across installers: an installer that processes the central directory will receive one set of files, while an installer that processes only the local file entries will receive a different set of files.

I mention this not to call them out, but to highlight that the key concern that motivated the CVE designation was identifying “a zip that extracts differently based on which software processes it”. This was precisely my motivation for specifying in this protocol that a package repository may not discriminate based upon User-Agent value.

In fact, this led PyPI to tighten up restrictions on wheel parsing (Preventing ZIP parser confusion attacks on Python package installers - The Python Package Index Blog), but without going through the normal standards process, because security concerns are objective. I am describing very specifically an avenue by which something like surveillance or even zip confusion might be achieved, but in a programmatic way that can’t be detected. There are other indexes besides PyPI to consider, and I honestly fail to understand why this is not directly analogous to the proactive response PyPI took for potential zip confusion.

alternate repos need standard metrics endpoints

I am trying to build systems that interoperate with PyPI, and with pip, and with uv, and with poetry, and with pex, and with pants, and with spack. I can’t build an alternate package repository to PyPI and get useful telemetry unless I can expect packaging tools to standardize it.

Astral recently announced a product that does just this:

And in doing so, specifically noted:

You won’t need to use pyx to use uv, and you won’t need to use uv to use pyx.

@charliermarsh I appreciated this statement! Would standardizing telemetry inputs (e.g. from pip and poetry) be useful for pyx?

I was definitely aware of uv before April 2024, so you could have had better PyPI metrics demonstrating its usage without patching upstream if a standard like this existed:

bigquery is not a standard

Here’s my final point.

Given that the metadata information right now is only available via google bigquery, it is impossible to rely on it as a resource like PyPI. I cannot build servers that provide this information, and I cannot make interfaces to it from pip without entering into a financial relationship with the bigquery product. I do not fault PyPI for making the best decision with their limited resources, and I am thankful that google makes the free tier available. But I think the bigquery product team would agree that if I were to build an interface to query this info programmatically, it would necessarily require e.g. providing an API key.

Furthermore, the structure of the bigquery table (as far as I can tell) is not subject to any python packaging standard, yet is hosted on our standards docsite: Analyzing PyPI package downloads - Python Packaging User Guide. On the API docsite, we’re much more clear about it (BigQuery Datasets - PyPI Docs):

Download Statistics Table

Table name: bigquery-public-data.pypi.file_downloads

The download statistics table allows you to learn more about download patterns of packages hosted on PyPI.

Compare to:

Project Metadata Table

Table name: bigquery-public-data.pypi.distribution_metadata

We also have a table that provides access to distribution metadata as outlined by the core metadata specifications.

If we’re producing one CC-licensed dataset (the metadata) that conforms to our core specifications, it seems only natural to consider whether the other CC-licensed dataset we publish under the PyPI brand is also worth standardizing.

environment markers are half the battle

One idea that I found really thoughtful from PEP 777 – How to Re-invent the Wheel | peps.python.org was to specifically limit compression formats for wheels to protocols supported by the CPython standard library. This is a sort of bidirectional pressure, which allows for innovation as long as it’s made available to every user.

For what I believe was a similar rationale, one reason I particularly identified Dependency specifiers - Python Packaging User Guide as a sub-schema for this proposal is so that the time and effort we’ve spent standardizing them could also be available in bigquery, instead of having an alternate data schema:

The marker language is inspired by Python itself, chosen for the ability to safely evaluate it without running arbitrary code that could become a security vulnerability.

The environment marker language is also something we can evaluate in Python. I can’t implement bigquery myself.

3 Likes

Want to say this was super thoughtful and I really appreciate your input here!

I think a good PEP here would ensure that the pip/pdm/uv/etc. maintainers could easily add new datapoints/facets in the future without having to enter a standards process.

brainstorming

I was thinking about .netrc as potential analogy, but it didn’t work. spack has a really interesting yaml format with pretty slick scope precedence mechanics (Configuration Files - Spack 1.1.0.dev0 documentation), but spack is a large and stable system as opposed to the more dynamic one that seems to be coming together here.

prior art: linehaul yaml test cases

One thing I noticed from the recent uv commit to linehaul is that there is already a schema they’re testing against (Add uv parser (#162) · pypi/linehaul-cloud-function@b10850c · GitHub):

uv ua parsing and yaml result
# uv >=0.1.22 format

# OSX Example
- ua: 'uv/0.1.22 {"installer":{"name":"uv","version":"0.1.22"},"python":"3.12.2","implementation":{"name":"CPython","version":"3.12.2"},"distro":{"name":"macOS","version":"14.4","id":null,"libc":null},"system":{"name":"Darwin","release":"23.2.0"},"cpu":"arm64","openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}'
  result:
    installer:
      name: uv
      version: '0.1.22'
    python: 3.12.2
    implementation:
      name: CPython
      version: 3.12.2
    distro:
      name: macOS
      version: 14.4
    system:
      name: Darwin
      release: 23.2.0
    cpu: arm64

# Linux (Ubuntu) Example
- ua: 'uv/0.1.22 {"installer":{"name":"uv","version":"0.1.22"},"python":"3.12.2","implementation":{"name":"CPython","version":"3.12.2"},"distro":{"name":"Ubuntu","version":"22.04","id":"jammy","libc":{"lib":"glibc","version":"2.35"}},"system":{"name":"Linux","release":"6.5.0-1016-azure"},"cpu":"x86_64","openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}'
  result:
    installer:
      name: uv
      version: '0.1.22'
    python: 3.12.2
    implementation:
      name: CPython
      version: 3.12.2
    distro:
      name: Ubuntu
      version: 22.04
      id: jammy
      libc:
        lib: glibc
        version: 2.35
    system:
      name: Linux
      release: 6.5.0-1016-azure
    cpu: x86_64
    ci: true

# Windows Example
- ua: 'uv/0.1.22 {"installer":{"name":"uv","version":"0.1.22"},"python":"3.12.2","implementation":{"name":"CPython","version":"3.12.2"},"distro":null,"system":{"name":"Windows","release":"2022Server"},"cpu":"AMD64","openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}'
  result:
    installer:
      name: uv
      version: '0.1.22'
    python: 3.12.2
    implementation:
      name: CPython
      version: 3.12.2
    system:
      name: Windows
      release: 2022Server
    cpu: AMD64
    ci: true

pip’s entry on the other hand is more spare (and notably does not include the ci flag):

pip ua parsing and yaml result
# Pip 6 Format
- ua: 'pip/18.0 {"cpu":"x86_64","distro":{"name":"macOS","version":"10.13.6"},"implementation":{"name":"PyPy","version":"6.0.0"},"installer":{"name":"pip","version":"18.0"},"openssl_version":"LibreSSL 2.6.2","python":"3.5.3","setuptools_version":"40.0.0","system":{"name":"Darwin","release":"17.7.0"}}'
  result:
    installer:
      name: pip
      version: '18.0'
    python: 3.5.3
    implementation:
      name: PyPy
      version: 6.0.0
    distro:
      name: macOS
      version: 10.13.6
    system:
      name: Darwin
      release: 17.7.0
    cpu: x86_64
    openssl_version: LibreSSL 2.6.2
    setuptools_version: 40.0.0

# Pip 1.4 Format
- ua: 'pip/1.4.1 CPython/2.7.14 Darwin/17.7.0'
  result:
    installer:
      name: pip
      version: 1.4.1
    python: 2.7.14
    implementation:
      name: CPython
      version: 2.7.14
    system:
      name: Darwin
      release: 17.7.0

I would really prefer to use packaging.markers.Environment instead here, which is unambiguously json-serializable and intended for this exact purpose. I think it is also probably the “right” thing to do to use our metadata formats for querying here – if they are inappropriate for that, then that signals an opportunity to improve the metadata format!

cc @radoering: I see PyPI/linehaul’s test case for poetry’s User-Agent is pretty spare (linehaul-cloud-function/tests/unit/ua/fixtures/poetry.yml at 8b8ed9db4ed946722d011c7d5ffe08a03ab5f942 · pypi/linehaul-cloud-function · GitHub); is there any other feature you think a spec like this should have?

poetry ua parsing and yaml result
# Poetry Format
- ua: 'poetry/1.1.11 CPython/3.9.2 Linux/5.10.16.3-microsoft-standard-WSL2'
  result:
    installer:
      name: poetry
      version: '1.1.11'
    implementation:
      name: CPython
      version: 3.9.2
    system:
      name: Linux
      release: 5.10.16.3-microsoft-standard-WSL2

on the “RCE vector”

@woodruffw:

with my security SME hat on, I don’t think there’s a realistic RCE risk with invoking rustc --version from the user’s pre-existing PATH

Thanks! I won’t make this claim again. :sweat_smile:

This was very poor communication on my part. I didn’t realize I had directly claimed that an RCE existed; I wanted someone to consider the issue because I’m not qualified to do so. Thanks for resolving it.

reframing the “opt-out”: trust boundaries

I am very concerned this could result in publishing proprietary data outside of a trust boundary, even for experienced users. (This is how I should have framed this initially.) And in particular, I think there’s a distinction to be made between:

  1. low-importance CI runs (upon every PR commit, do not modify global state)
  2. runs that populate some global state (release builds)

I would also say that (1) is very likely to constitute the majority of bandwidth and other resource load that PyPI experiences, while (2) is very likely to be distinguishable in other ways. We have the ci: <bool> flag, but this distinction between CI runs is also a distinction between trust boundaries. I think the information in (2) is also likely to be useful to project repos, however.

This is where I’m concerned about the risk I mentioned in the post just above, where a malicious project repo serves different inputs based on User-Agent, so only the release binary is modified.

reframing again: extensibility

But I can see that to do this right (as you said), it would need to be a dynamic format that allows tools to expand the schema. And this extensibility may get close enough to what I wanted from the opt-out.

I just really think we should have a statement like this (Anonymous telemetry | Pantsbuild):

How we avoid exposing proprietary information

I think that’s something we should strive for in these packaging standards. But maybe that goal will have to be tool-centric as well.

I’m going to drop the opt-out for now and try to focus on what kind of format would be effectively extensible here since that’s not an obvious problem.

on the pip security model

(I also think that pip’s security model effectively implicitly defines all programs in the user’s PATH as trusted, since the execution contract with build backends is that they could (and do in practice) run arbitrary code and invoke arbitrary underlying native build systems.)

I think communicating dependencies (particularly ABI) across build and packaging ecosystems is critically important. I applied to NGI Zero proposing several changes that would make cargo builds configurable by downstream users: https://circumstances.run/deck/@hipsterelectron/114610077000401178. Sandboxing is really good these days, but a lot of tools do not make themselves easy to execute as a subprocess (cargo especially). spack however does, and the re2 Rust crate uses my build script that deduplicates spack builds against the cargo dep graph. The Cargo.toml change is small:

I think representing dependencies in full allows swapping them out, or creating them from a separate process, or other things that avoid prescribing a specific process execution with global state. I think python is tantalizingly close to getting there, but I don’t know if it should. spack has a universal language for dependency relationships Spec Syntax - Spack 1.1.0.dev0 documentation, and it would make more sense if python could stitch together dependency relationships from tools like spack or meson (the way I demonstrated for cargo => spack), so that you could not just cryptographically sign a build that crosses ecosystems, you could also incorporate it into the resolve.

That’s why I’m so emphatic about metadata and dependencies. But that’s largely irrelevant to this PEP.

Could you say a bit more about the heuristics you’ve identified?

I assumed the function name looks_like_ci() was telling the truth, i.e. that the detection wasn’t a sure thing. If these environment variables are standard (and they indeed seem to be), then I think the only problem here is the function name.

I don’t think it’s a hard-and-fast rule, but in general I believe it’s considered good to not have PEP identifiers become long-lived “markers” for standard conformance

Thanks, this makes perfect sense.

As an alternative here: have you considered an “intrinsic” encoding, i.e. one within the JSON payload itself? I think even { "v": "1", ... } would suffice, since I’d expect nothing to currently use the v key. A slightly longer version could be "linehaul": "v1.0.0" or similar, to make it clear that this is a standardization of the existing linehaul format and that the living spec might undergo Semver-style revisions.

I will probably start off with exactly this format (I like acknowledging that it’s linehaul’s format too, because we don’t want to break that).

TODO items for now:

  • I’m not sure I like the boolean CI. I like "ci": "AZURE" more for example.
  • The matter of extensibility for resolvers is an interesting format question I will ponder.
  • The packaging library unfortunately has a very inefficient and extremely confusing implementation of environment marker evaluation. If I want to propose environment markers as a query mechanism, I think I should spend some time understanding that more.
    • I have a fork which fixes version and specifier performance which would be good to try to upstream here as well.

caching is strictly worse than a well-designed IPC API

If performance is the primary motivator then I would be in favor of adding a cache that keys off something like the path, the size, and the mtime of the executable (where that makes sense, like rustc).

My problem with that is that it’s so much more complex (and can become incorrect) than just letting the user provide the value in the first place. This is unfortunately an example of a huge category of error (not trying to insult you) that a lot of tooling exhibits: a tool has some very complex, fallible, perhaps heuristic process (scanning the PATH, scanning the classpath, etc.) that produces a single in-memory document, and it doesn’t offer a shortcut that avoids scanning the filesystem in the first place.

The mention of opt-out isn’t just personal choice, it’s about architecting a tool for performance when you embed it into a larger process. I agree that I was conflating a lot of things at once. I’m sorry for being unclear.

So even if you have --only-binary ":all:" enabled, you are not guaranteed that pip won’t call out to a subprocess to figure out the information it needs to filter wheels.

My proposed change TELEMETRY_USER_AGENT_ID did guarantee that, and provided wholly compatible output, by decomposing these filesystem search processes into distinct phases that someone else can cache. I think pip can and should maintain an internal cache of python index metadata, because that needs to be kept up to date. But the user agent ID not changing depending on context is also why it’s much less useful for making inferences; see two posts up.

phases

I want to really emphasize how important it is to try to explicitly phase the internal computations performed by a tool, so I’m providing another example.

Not me speaking here (Stu Hood giving a talk on our work at Twitter with parallel distributed scala compiles), but I was the reason this parallel scala compilation project worked, and one of the biggest issues by far was having to rip out the incredibly complex lazy (as in lazy computation) classfile loading process in the scala compiler, which intersperses I/O with computation and does many other things that harm performance. The easiest way to make your tool cacheable is to separate out the phases it performs so that a build tool or package manager or whatever can schedule, parallelize, and cache for you.

Here’s my documentation of this for CGO 2020: Getting Pants Performance for Free via Parallelism using Graal native-image. Literally taking every single JVM jit process and converting them into a cacheable AOT process was how we solved scala performance.

By analogy, taking the processes that currently occur monolithically and decomposing them into standard protocols is how we can solve performance.

package finding is different

uv has some great docs here: Resolver | uv

The slowest part of resolution in uv is loading package and version metadata, even if it’s cached.

This is what I have been doing in a pip branch: separating package finding from everything else and performing it as a distinct phase. I am finally at the point where parsing performance matters, so I’m actually working on a rust process that does package finding only and communicates over IPC to pip which is otherwise only a resolver. At some point, I think defining the package finding db and the inputs to a resolver could be a good PEP.

I think python packaging standards can solve every engineering problem. I gave a talk at packagingcon 2023 specifically about how great it was to work on standards with pip: Python Resolution Evolution: Decoupling Metadata from Downloads in Pip :: PackagingCon 2023 :: pretalx.

Conclusions

  • I think if pip’s going to perform subprocess executions in the background, that should be a separate subcommand. If it’s in packaging, that sounds like half a PEP already.
  • (I won’t mention this again, but) I think it looks really bad to have telemetry in pip that you can’t turn off. I think it reflects really poorly on the python community. I am trying to make the python community not look bad. It would be really cool if we could tell people “this is how we avoid exposing proprietary information” (Anonymous telemetry | Pantsbuild). I think tooling should try to help people avoid making mistakes.

Sorry, I think I missed something: how would this proposal prevent packaging, and therefore pip, from calling ld on musl Linux to work out the version of libc? That information is needed to filter wheels during the collecting phase.

If the telemetry format is standardised, then pip can accept the telemetry data as an input directly, instead of having to infer it at runtime.

As far as the OP goes, standardising the telemetry format seems like a good idea to me.

Having a standardised way to configure it in a build environment also wouldn’t be unreasonable, but I think it should be a separate proposal (if it happens at all).

1 Like

Firstly, this data is needed whether telemetry is used or not; secondly, how does that stop a subprocess from being run to get the data, regardless of how it’s passed to pip?

If pip is told the libc version through its config, it won’t need to look it up. That can technically be done without standardisation, so the flow is instead “If the telemetry reporting is standardised and tools make it configurable, then the tools may choose to use the relevant pieces of the standardised format for other parts of their configuration without having to design the configuration mechanism from scratch”

I do not think that we have specific requirements. Actually, we just set the user agent via requests-toolbelt so that it contains a sensible value. The example from the PyPI/linehaul’s test case is the result of calling requests_toolbelt.user_agent('poetry', __version__). I think we are fine with any scheme as long as it is simple enough to implement.
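
For illustration, that call just composes the tool name/version with interpreter and platform details (the output shape matches the linehaul fixture quoted above; exact values depend on the running environment):

from requests_toolbelt import user_agent

# Builds e.g. "poetry/1.1.11 CPython/3.9.2 Linux/5.10.16.3-microsoft-standard-WSL2"
print(user_agent("poetry", "1.1.11"))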

@cosmicexplorer if I could make a procedural observation/suggestion: a lot of your responses have focused on implementation details and potential areas for improvement. For example, you’ve noted that PATH lookups aren’t super generalizable or necessarily the most performant, which I agree with.

However, how a given installer chooses to populate the metadata it puts in the User-Agent is essentially an internal policy decision for that installer, i.e. beyond the scope of an interoperability PEP. In other words, IMO a pre-PEP discussion would be a lot easier to follow here if we separate the “what” of a potential standard linehaul format from the “how” of installers like pip, uv, poetry, etc. actually producing that format. That’s not to say the “how” isn’t important (IMO it is!), but that maybe we should have a separate discussion thread for that if the goal here is principally to define the schema itself.

(Hopefully the above isn’t read as a rebuke – it’s just me trying to understand where you want the focus of this thread to be, i.e. on pip internals or on a packaging interoperability PEP.)

4 Likes