(This is my first attempt to propose a packaging standard in this forum. I am basing this off the instructions at PyPA Specifications — PyPA documentation. Those instructions seem to indicate that a PR against GitHub - pypa/packaging.python.org: Python Packaging User Guide should be provided at the same time, but I’m not seeing many examples of that being done for in-progress PEPs, so I am assuming this is the appropriate first stop for potential new PEPs. I also could not find a standard format expected for PEP proposals, so I’m basing the general structure off of William Woodruff’s pre-PEP at Pre-PEP: Trusted Publishing token exchange .)
Problem Statement
PyPI exposes a BigQuery dataset for package download statistics: Analyzing PyPI package downloads - Python Packaging User Guide. To the best of my knowledge, this is drawn from information provided in the User-Agent string of requests made to PyPI, which largely depends upon support for such in pip. This leads to a few problems, which I realized upon identifying this in pip recently (Enable overriding undocumented telemetry identifier with PIP_TELEMETRY_USER_AGENT_ID by cosmicexplorer · Pull Request #13560 · pypa/pip · GitHub):
- The information provided is very similar to the Environment Markers specification (Dependency specifiers - Python Packaging User Guide), but not actually standardized according to a PEP.
- This amounts to a form of telemetry which users are largely unaware of and cannot opt out of.
- Searching for and executing tools like rustc from the user’s PATH slows down pip resolves and constitutes a potential RCE vector.
More generally, making this information purely an implementation detail of pip means that indices besides PyPI itself are unable to take advantage of it. In my work at Twitter, we hosted an internal --find-links, and then eventually a simple repository API. While we certainly could have (and did) introduce our own internal telemetry for these purposes, it would make the job of tooling much easier if we could expect standard procedures for declaring the identity of whatever is fetching against our internal repo. Furthermore, pip’s implementation employs heuristics to identify whether it’s running in CI. It would be preferable if CI runs (e.g. from GitHub Actions) could instead intentionally announce themselves in a standardized way.
Finally, and most importantly, pip is not the only resolver in town, and it would benefit PyPI to be able to reliably collect detailed statistics from other resolvers, including poetry and uv. In addition to supporting non-PyPI backends, we should also look to support statistics collection from non-pip clients.
Existing Work
PIP_USER_AGENT_USER_DATA
There is an existing attempt to specialize this info in pip, with the PIP_USER_AGENT_USER_DATA environment variable. This is only obliquely documented at User Guide - pip documentation v25.3.dev0, and only modifies a subkey of the json object encoded into User-Agent. This environment variable still does not avoid the heuristics necessary to detect a CI run, and does not enable users to avoid telemetry altogether.
Environment Dictionary
The environment markers specification as implemented in packaging provides a TypedDict to codify the current environment, which demonstrates a strong overlap with much of the information currently provided to PyPI.
The environment markers specification is far more complex than the telemetry pip encodes in the User-Agent string, in order to support complex queries during a resolve, much like we want to enable for PyPI’s BigQuery dataset. Therefore, basing this proposed standard on the schema used for environment markers seems like the right path.
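To make the overlap concrete, here is a minimal sketch that approximates the marker environment using only the standard library. The function names sketch_marker_environment and environment_json are hypothetical; the real reference lives in the packaging library, and this is only meant to show that the eleven marker keys are cheap to compute and serialize directly to json.

```python
import json
import os
import platform
import sys


def sketch_marker_environment() -> dict[str, str]:
    """Approximate the environment-marker dict (hypothetical helper, stdlib only)."""
    impl = sys.implementation
    v = impl.version
    impl_version = f"{v.major}.{v.minor}.{v.micro}"
    # Pre-releases get a suffix such as "a1" or "rc2".
    if v.releaselevel != "final":
        impl_version += v.releaselevel[0] + str(v.serial)
    return {
        "implementation_name": impl.name,
        "implementation_version": impl_version,
        "os_name": os.name,
        "platform_machine": platform.machine(),
        "platform_python_implementation": platform.python_implementation(),
        "platform_release": platform.release(),
        "platform_system": platform.system(),
        "platform_version": platform.version(),
        "python_full_version": platform.python_version(),
        "python_version": ".".join(platform.python_version_tuple()[:2]),
        "sys_platform": sys.platform,
    }


def environment_json() -> str:
    """Serialize the environment; every value is a plain string."""
    return json.dumps(sketch_marker_environment(), sort_keys=True)
```

Nothing in this sketch searches PATH or runs a subprocess, which is part of why the marker environment looks like a safe base for the standard.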
Linehaul Log Parser
The log parser that PyPI uses to generate statistics is located at linehaul-cloud-function/linehaul/ua/parser.py at 8b8ed9db4ed946722d011c7d5ffe08a03ab5f942 · pypi/linehaul-cloud-function · GitHub. Note that it has distinct implementations for different versions of pip (linehaul-cloud-function/linehaul/ua/parser.py at 8b8ed9db4ed946722d011c7d5ffe08a03ab5f942 · pypi/linehaul-cloud-function · GitHub), and performs complex optimization mechanisms (linehaul-cloud-function/linehaul/ua/impl.py at 8b8ed9db4ed946722d011c7d5ffe08a03ab5f942 · pypi/linehaul-cloud-function · GitHub) to identify which parser to use.
While the PyPI log parser will likely have to retain some degree of complexity to support older and nonconforming clients, standardizing the information sent would seem to make it easier to optimize their parser and avoid heuristic matching by version for pip fetches, which constitute the majority of PyPI requests.
Solution Statement
- A typed specification should be produced for the information PyPI makes use of to generate download statistics.
  - It may be useful to provide a reference implementation of this in the packaging library.
- The specification must include a mechanism for the user to disable or override statistics gathering.
- In cases where a tool like pip is executed multiple times or in a loop, generating this info once may be preferable to generating from scratch upon each execution.
- The specification should likely incorporate an explicit marker for CI or other automated executions.
- The specification should provide a way to disable or override the need to scan the user’s PATH for tools like rustc.
  - This addresses security concerns, and allows this information to be provided within hermetic build environments which do not provide executables within the same PATH.
Proposal
It would be most useful to provide this information in the packaging library, as it would be easy for pip to consume while avoiding pip-specific assumptions. In particular, I think explicitly incorporating the Environment dict would be appropriate, as that is an existing standardized and json-serializable representation of the running Python environment.
On top of that, the pip implementation makes use of several version strings for named resources, including libc, openssl, setuptools, and rustc (which are not provided if unavailable). These would make sense to provide in a separate dictionary from the Python environment, particularly if we want to enable overriding or disabling them independently.
Taking this approach to the data pip currently encodes would result in a json dict like the following:
{
"python_environment": { <json of packaging.markers.Environment> },
"versions": {
"libc": { "name": "glibc", "version": "2.42" },
"tls": { "name": "openssl", "version": "3.5.3" },
"setuptools": { "version": "80.9.0" },
"rustc": { "version": "1.90.0" }
},
"ci": false,
"user_data": <arbitrary user data>
}
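As a rough illustration of how a typed version of this dict might look, the sketch below uses TypedDict. The names ResourceVersion, ClientReport, build_report and encode_report are hypothetical and not part of packaging or pip. The compact, sorted json encoding is my own choice for the example, not something the proposal fixes.

```python
import json
from typing import Any, Optional, TypedDict


class ResourceVersion(TypedDict, total=False):
    # "name" only applies to resources with several providers (libc, tls).
    name: str
    version: str


class ClientReport(TypedDict):
    python_environment: dict[str, str]
    versions: dict[str, ResourceVersion]
    ci: bool
    user_data: Any


def build_report(
    python_environment: dict[str, str],
    versions: dict[str, Optional[ResourceVersion]],
    ci: bool,
    user_data: Any = None,
) -> ClientReport:
    """Assemble the report, leaving out resources that are unavailable."""
    present = {key: value for key, value in versions.items() if value is not None}
    return {
        "python_environment": dict(python_environment),
        "versions": present,
        "ci": ci,
        "user_data": user_data,
    }


def encode_report(report: ClientReport) -> str:
    """Encode deterministically so that identical reports compare equal."""
    return json.dumps(report, separators=(",", ":"), sort_keys=True)
```

Keeping versions separate from python_environment means either half can be overridden or dropped without touching the other.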
User-Agent Encoding and Declaring Support for this PEP
Right now, pip encodes this into a User-Agent string as pip/25.2 <json>, where “pip” is the client name and “25.2” is its version string. In order to produce a standard for high-quality data that PyPI’s statistics gathering can rely on, it seems appropriate to specify how the json dict is encoded into an HTTP request header.
However, specifying <client name>/<client version> isn’t quite enough info to clarify that the rest of the User-Agent string conforms to this PEP. To make log parsing easier, there are two potential approaches:
- Specifying this information in a separate header besides User-Agent.
- Providing a parenthesized (PEP NNN) suffix to the <client name>/<client version> prefix.
The User-Agent parser in the linehaul project (linehaul-cloud-function/linehaul/ua/parser.py at main · pypi/linehaul-cloud-function · GitHub) seems to be defined in terms of the User-Agent string, so employing a separate header seems like it would be much more difficult for PyPI to consume than sticking to the User-Agent. From MDN (User-Agent header - HTTP | MDN), the parenthesized suffix to User-Agent seems to be relatively standard, even if web browsers tend to use it for platform info and not to define the syntax of the User-Agent string itself.
Using an unambiguous (PEP NNN) suffix should be easy to scan for with a fast substring search before engaging a slower fallible parser, and would allow for other clients to provide equivalent information to pip without matching client name and version range.
So the overall proposed specification for HTTP requests becomes <client name>/<client version> (PEP NNN) <json>, written into the User-Agent HTTP header.
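A minimal sketch of both sides of this format follows, assuming the literal placeholder "(PEP NNN)" stands in for the eventual PEP number. The helpers format_user_agent and parse_user_agent are hypothetical names and are not taken from pip or linehaul. The parser does the cheap substring check first and only then attempts the fallible json decode.

```python
import json
from typing import Any, Optional

PEP_MARKER = "(PEP NNN)"  # placeholder until a PEP number is assigned


def format_user_agent(client_name: str, client_version: str, report: dict[str, Any]) -> str:
    """Produce '<client name>/<client version> (PEP NNN) <json>'."""
    payload = json.dumps(report, separators=(",", ":"), sort_keys=True)
    return f"{client_name}/{client_version} {PEP_MARKER} {payload}"


def parse_user_agent(user_agent: str) -> Optional[tuple[str, str, dict[str, Any]]]:
    """Return (name, version, report), or None if the header does not conform."""
    if PEP_MARKER not in user_agent:  # fast rejection before any real parsing
        return None
    prefix, _, payload = user_agent.partition(f" {PEP_MARKER} ")
    name, slash, version = prefix.partition("/")
    if not (slash and name and version and payload):
        return None
    try:
        report = json.loads(payload)
    except json.JSONDecodeError:
        return None
    if not isinstance(report, dict):
        return None
    return name, version, report
```

A server could fall back to its existing per-client heuristics whenever parse_user_agent returns None, so older and nonconforming clients keep working.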
Overrides
Implementations of this standard should allow the user to either disable the User-Agent info altogether, or to override portions of it, which disables data collection performed by the client. Differentiating “disable” vs “override” is important, as it would allow an implementation to ensure that the user-provided value conforms to the specification, and raise an error if not. This paradigm enables a user to provide valuable statistics to PyPI even if they use e.g. a hermetic build environment which precludes their client inferring that info, and generally allows for users to provide only as much information as they or their employer are comfortable providing to an external service.
This standard is intended to describe information which is independent of specifics such as a particular process execution model, so it’s unclear how best to incorporate an override mechanism. For example, a tool like pex which provides a library API for resolves may decide to provide overrides as a keyword argument to its resolve_multi() method. However, since PyPI download stats largely rely upon this information being provided from pip, it should be reasonable to use pip (and its one-shot process execution model) as the canonical representation.
With that in mind, we can describe a set of environment variables to modify the above representation. Conforming implementations need not use environment variables, but MUST enable equivalent overrides through some comparable mechanism:
- PIP_USER_AGENT=disable: does not generate a User-Agent header.
  - This implies not generating the <client name>/<client version> prefix as well.
- PIP_USER_AGENT=override=pip/25.2 { ... }: provides a string to be sent as User-Agent instead of having the client infer it.
  - If override is used for any option, the client SHOULD ensure the result produces conforming output, or produce an error.
  - Tools like pex which wrap pip or other clients would be able to specify this to avoid having the client attempt to infer any information.
- PIP_USER_AGENT_USER_DATA: this would remain the same as now.
  - This is the single element that is not inferred by pip, so it does not conform to the “disable”/“override” paradigm.
  - This is a string, and is not decoded as a json blob.
- PIP_USER_AGENT_CI=override=true: this would provide a standardized way for CI runners to signal their environment.
  - This is also a json blob like others, just one which corresponds to just the json true or false.
  - If not provided, pip would still perform the heuristic inference it does now.
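The "disable" versus "override" split can be sketched as a small parser over the proposed variable values. The functions parse_directive and resolve_ci are hypothetical, and the infer callback stands in for whatever heuristic the client already uses. Treating any other value as an error is my assumption about how strict a client would want to be.

```python
import json
from typing import Callable, Mapping, Optional


def parse_directive(raw: Optional[str]) -> tuple[str, Optional[str]]:
    """Classify a variable's value as infer, disable, or override."""
    if raw is None:
        return ("infer", None)  # variable unset: the client infers as it does today
    if raw == "disable":
        return ("disable", None)
    if raw.startswith("override="):
        return ("override", raw[len("override="):])
    raise ValueError(f"expected 'disable' or 'override=<value>', got {raw!r}")


def resolve_ci(environ: Mapping[str, str], infer: Callable[[], bool]) -> Optional[bool]:
    """Resolve the "ci" key from PIP_USER_AGENT_CI (name taken from the proposal)."""
    mode, value = parse_directive(environ.get("PIP_USER_AGENT_CI"))
    if mode == "infer":
        return infer()
    if mode == "disable":
        return None  # the key is simply not reported
    decoded = json.loads(value or "")  # invalid json raises a ValueError subclass
    if not isinstance(decoded, bool):
        raise ValueError("PIP_USER_AGENT_CI override must be json true or false")
    return decoded
```

For example, a CI runner would export PIP_USER_AGENT_CI=override=true once, and the client would never need to run its heuristics.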
Individual Version Overrides
The above are the most important keys, and may be all we want to support for the first iteration of this standard. If we wanted to extend this standard to cover every key of the json dict, it might look like the following (with =disable supported for all keys):
- PIP_USER_AGENT_PYTHON_ENVIRONMENT=override=<json blob>: provide a json blob which is decoded, matched against packaging.markers.Environment, then re-encoded.
- PIP_USER_AGENT_VERSIONS=override=<json blob>: provide a json blob for the versions dict which is decoded, matched against the known schema, then re-encoded.
- PIP_USER_AGENT_VERSIONS_LIBC=override={"name": "glibc", "version": "2.42"}: provide a json blob for the libc resource in the versions dict, which is decoded, matched against the schema for the libc resource, then re-encoded.
- PIP_USER_AGENT_VERSIONS_RUSTC=override={"version": "1.90.0"}: provide a json blob for the rustc resource in the versions dict, which is decoded, matched against the schema for the rustc resource, then re-encoded.
The specificity in this section is much less important, and I expect this section will not make it into the final version of the PEP. I think it’s far more important to extend the existing *_USER_DATA variable to cover the entire User-Agent header. This is mostly provided as food for thought, since it relates to other potential future standards which codify dependencies outside the Python ecosystem.
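For the sake of discussion, the "decode, match against the schema, re-encode" step for a single resource might look like the sketch below. The VERSION_SCHEMAS table and validate_resource_override function are hypothetical, and the per-resource key sets are inferred from the example dict earlier in this post rather than from any agreed schema.

```python
import json

# Assumed key sets per resource, based on the example json dict in this post.
VERSION_SCHEMAS: dict[str, frozenset[str]] = {
    "libc": frozenset({"name", "version"}),
    "tls": frozenset({"name", "version"}),
    "setuptools": frozenset({"version"}),
    "rustc": frozenset({"version"}),
}


def validate_resource_override(resource: str, blob: str) -> str:
    """Decode a json blob, check it against the resource's keys, and re-encode it."""
    if resource not in VERSION_SCHEMAS:
        raise ValueError(f"unknown resource: {resource!r}")
    decoded = json.loads(blob)
    if not isinstance(decoded, dict) or set(decoded) != VERSION_SCHEMAS[resource]:
        expected = sorted(VERSION_SCHEMAS[resource])
        raise ValueError(f"{resource} override must have exactly the keys {expected}")
    if not all(isinstance(value, str) for value in decoded.values()):
        raise ValueError(f"{resource} override values must be strings")
    # Re-encode so that whitespace and key order are normalized.
    return json.dumps(decoded, separators=(",", ":"), sort_keys=True)
```

Rejecting a nonconforming override up front, rather than forwarding it, is what separates "override" from simply replacing the header with an arbitrary string.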
Other Considerations
- The implementation in the packaging library would be especially useful for generating this information out-of-band to provide with PIP_USER_AGENT=override=....
  - It would also be of use for servers implemented in Python to parse the User-Agent header, although perhaps a separate json schema might be more appropriate.
  - PEPs do not seem to have adopted a standard json schema representation yet, but that would be especially useful to consider for this one as it specifically relates to network calls.
- It will be important to clarify that this information should not be used to discriminate against particular clients, as this would deter clients from providing these useful statistics.
- The primary consumer of this info is PyPI, and the primary producer is pip, and the main purpose of this is to make telemetry gathering from pip → pypi more secure and respect the user’s consent. Still, input from other Python package repos and other resolvers would be very welcome.
Conclusion
I am hoping for this PEP to be as minimal in scope as possible, and most importantly to incorporate PIP_USER_AGENT=disable, for performance reasons as well as to respect user consent.
The proposed semantics of =override are complex, and if they’re too difficult to agree on, I think they should be dropped entirely in favor of just PIP_USER_AGENT=disable (and retaining the existing semantics of PIP_USER_AGENT_USER_DATA). PIP_USER_AGENT_CI would be nice to have as well, but PIP_IS_CI is already available for specifically pip invocations that would like to signal this information to PyPI or other indices.
The main contributions of this PEP are expected to be:
- to standardize the User-Agent schema,
- to enable end user opt-out for any telemetry gathering.