Ideas for client side package provenance checks

I’ve seen several interesting ideas go by in the PEP 752/755 discussions, so I wanted to summarise the various possibilities in one place. I’m also going to include ideas from PEP 752 in this list, since some of the options use different pieces of that as their foundation.

If folks think it would be worthwhile, I’d be willing to make this summary an Informational PEP, similar to those the CPython core dev team have sometimes used to put different competing PEPs into a shared context (see PEP 607, “Reducing CPython’s Feature Delivery Latency”, for an example). The ideas are separable enough that I don’t believe it would be a good idea to pursue them all in a single standardisation proposal, even for the ideas that are mutually compatible. However, I don’t have any plans of my own to actually implement these ideas; I just wanted to summarise them in one place.

The recurring example I am going to use is from the azure- namespace on PyPI, comparing a hypothetical azure-enhanced-cli project (published by someone other than Microsoft, such as the some-publisher@domain.example used in the examples below) and azure-cli (an official Microsoft project published by the “azpycli” team).

This initial post is just aimed at describing the ideas, without attempting to assess their merits or go too far into the related technical details. I’ve only listed ideas that I think would be technically feasible.

Explicit provenance assertions

Explicit provenance assertions would use a new syntax like package-name from expected-source to declare an expected publisher for a component (other potential syntaxes for this were suggested, such as expected-source::package-name, but I personally prefer the keyword format).

For all of the ideas in this category, PEP 740 digital attestations could potentially be used to assert strong claims to the given discriminator without clients having to trust the index server.

While it likely makes the most sense to pick one provenance assertion mechanism and run with it, it would be possible to define marker characters that distinguished between the potential options at resolution time (@ for email addresses, trailing / for URLs, . for domain names, anything else is a user/org name).
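As a minimal sketch of how a resolver might classify those marker characters (the function name is mine; the rules are exactly as described above):

```python
# Hypothetical classifier for the proposed provenance discriminators:
# "@" means an email address, a trailing "/" means a URL, a "." means
# a domain name, and anything else is a repository user/org name.
def classify_provenance_source(source: str) -> str:
    if "@" in source:
        return "email"
    if source.endswith("/"):
        return "url"  # checked before "." so "github.com/Azure/" isn't a domain
    if "." in source:
        return "domain"
    return "account"
```

Note that the check order matters: URLs contain dots, so the trailing-slash test has to run before the domain test.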

Provenance assertions are NOT applicable to direct URL references (those should instead use hashes as a direct artifact level constraint, and otherwise rely on TLS certificates or SSH keys for domain validation).

Using email addresses

  • Valid example: pip install 'azure-cli from azpycli@microsoft.com'
  • Valid example: pip install 'azure-enhanced-cli from some-publisher@domain.example'
  • Failing example: pip install 'azure-enhanced-cli from azpycli@microsoft.com'

The idea behind using email addresses is that this is how the existing metadata standards identify package maintainers and authors.

Pursuing this idea may require tightening up how PyPI validates email address usage in new projects (or when changing the email address metadata on existing projects).

Using repository user and/or organisation names

  • Valid example: pip install 'azure-cli from microsoft'
  • Valid example: pip install 'azure-cli from azpycli'
  • Valid example: pip install 'azure-enhanced-cli from some-publisher'
  • Failing example: pip install 'azure-enhanced-cli from microsoft'

The idea behind using repository user and/or organisation names is that it avoids having to define a way for repositories to verify a publisher’s ownership of an external resource (whether that’s an email address or a domain name).

Pursuing this idea would need some form of enhancement to the repository JSON API standards to make the relevant repository account metadata available to API clients.

The main downside to this approach is that it doesn’t generalise nicely across different index servers (since the user/organisation names are repository dependent).

Using domain names

  • Valid example: pip install 'azure-cli from pyprojects.microsoft.com'
  • Valid example: pip install 'azure-enhanced-cli from some-publisher.domain.example'
  • Failing example: pip install 'azure-enhanced-cli from pyprojects.microsoft.com'

The idea behind using domain names is that they’re already a commonly validated external resource, and most major publishers of packages will already have one. If no external domain is specified for a project, then attempting to assert any specific provenance for that project will fail.

Pursuing this idea would need some mechanism to allow repositories to reliably assert that a particular user or organisation account is an authorised publisher for a given external domain, and/or for clients to check that a given project is authorised for publication (see the previous thread for some ideas on that front).
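One way a client-side check could work is via a well-known URL published on the claimed domain. The path and JSON payload shape below are invented for illustration; an implementation PEP would need to define the real ones.

```python
import json
import urllib.request

# Hypothetical well-known path; NOT a standardised endpoint.
WELL_KNOWN_PATH = "/.well-known/python-package-provenance"

def fetch_domain_claims(domain: str) -> dict:
    """Fetch the publisher's claimed project list from its own domain."""
    url = f"https://{domain}{WELL_KNOWN_PATH}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)

def domain_authorises_project(claims: dict, project: str) -> bool:
    """Check a parsed claims payload (assumed shape: {"projects": [...]})."""
    return project in claims.get("projects", ())
```

Because the claim is fetched from the publisher's domain over HTTPS, the client doesn't have to trust anything published by the repository server itself.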

PEP 708 (the tracking/alternate location metadata for projects) would also potentially be relevant to this approach.

Using HTTPS URLs

  • Valid example: pip install 'azure-cli from github.com/Azure/'
  • Valid example: pip install 'azure-enhanced-cli from domain.example/some-publisher/'
  • Failing example: pip install 'azure-enhanced-cli from github.com/Azure/'

Similar to the domain name idea, but allowing domains to be divided by URL paths, not just by subdomains.

Pursuing this idea would require some mechanism for verifying ownership of exact URLs rather than entire domains.

Implicit provenance constraints

All of the explicit provenance assertions share a common weakness: a naive user typing pip install azure-enhanced-cli doesn’t receive any indication that this is an unofficial third party project rather than an official Azure client package published by Microsoft.

Implicit provenance constraints would be designed to avoid that weakness.

The provenance checking mechanisms in this section would likely work best if a single form of explicit provenance assertion was defined as the one to be used for implicit provenance constraints. The examples below assume the use of domain names for this purpose (since they’re likely to be the easiest mechanism for clients to be able to validate independently of any specific repository).

Implicit provenance constraints would only be used when there are no artifact level constraints defined (if you have an artifact hash to check against, you no longer care where that artifact came from).

Trust on first use

  • Example prompt:
    pip install azure-cli
    'azure-cli 2.64.0' is published by 'microsoft.com', do you wish to trust this publisher for this project? [y/N]:
  • Example prompt:
    pip install azure-enhanced-cli
    'azure-enhanced-cli 0.9' is published by 'some-publisher.domain.example', do you wish to trust this publisher for this project? [y/N]:

Given a defined form of provenance checking, clients would be prompted to accept projects with unknown provenance the first time they’re encountered. While potentially viable for experienced users, this would get very noisy and confusing for new users, hence the other ideas in this section.

When granting approval, users would be able to choose between “trust this project with this provenance” and “trust all projects with this provenance” (probably by asking a second question after project level approval is given, as it would be hard to make a combined question both clear and concise).
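A minimal sketch of a client-side trust store supporting those two approval scopes (the class and storage shape are purely illustrative; a real client would persist this to disk):

```python
# Illustrative TOFU trust store with the two scopes described above:
# per-project trust and blanket "trust all projects" publisher trust.
class TrustStore:
    def __init__(self) -> None:
        self.trusted_publishers: set[str] = set()   # "trust all projects with this provenance"
        self.trusted_projects: dict[str, str] = {}  # project -> trusted publisher

    def is_trusted(self, project: str, publisher: str) -> bool:
        if publisher in self.trusted_publishers:
            return True
        return self.trusted_projects.get(project) == publisher

    def trust_project(self, project: str, publisher: str) -> None:
        self.trusted_projects[project] = publisher

    def trust_publisher(self, publisher: str) -> None:
        self.trusted_publishers.add(publisher)
```

The per-project mapping also means a later change of publisher for an already-trusted project would fail the `is_trusted` check and re-trigger the prompt.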

Sharing trusted provenance lists

Via some mechanism (most likely packaging entry points with a defined API), installed packages would be able to declare that they provided lists of trustworthy provenance details that mean (in the absence of a publishing account compromise) distribution packages with that provenance are unlikely to include malicious code.

This would allow organisations to publish lists of pre-approved provenance details as installable projects, rather than their users having to define all those trust rules individually.

Defining a default verified publisher list

TLS certificate checking relies heavily on the default CA lists shipped by browser and operating system vendors. Many Python projects directly or indirectly use the certifi certificate bundle (built from the cert bundles published by Mozilla) as the trust root for their TLS validation.

This is the first idea on this list where we start getting into the same territory as PEP 752 and PEP 755: given an implicit provenance constraint mechanism at the technical level, it would likely be feasible for the PSF to build a “verified publisher” policy mechanism on top of that, where, in return for providing the PSF with additional identifying and contact information, as well as paying an administrative fee, the PSF would:

  • perform sufficient checks to satisfy the PSF that the applicant is a legitimate publisher that isn’t aiming to distribute malware
  • add the publisher’s provenance details to an installable PyPI package that package management clients can use as a default

The administrative fee should be set high enough to cover both the actual administration costs, as well as the reputational risk if a verified publisher turns out to be less trustworthy than hoped. The fee could potentially be discounted and/or waived for PSF members and for other open source organisations, but the verification step should happen regardless.

Just as the trusted CA lists in browsers and operating systems evolve over time, so would the default verified publisher list.

Namespace prefix provenance constraints with open namespace grants

  • Valid example: pip install @azure/cli (authorised member of the azure- namespace)
  • Valid example: pip install azure-enhanced-cli (no prefix assertion specified)
  • Failing example: pip install @azure/enhanced-cli (NOT an authorised member of the azure- namespace)
  • Potentially valid example: pip install @azure-contrib/enhanced-cli (could be an authorised azure-contrib project)
  • Failing example: pip install @azure/contrib-enhanced-cli (still not an authorised member of the azure- namespace)

This idea builds on the PEP 752 concept of repository level open namespace prefix grants.

In this idea, assuming pyprojects.microsoft.com is the registered owner of the azure- prefix, then specifying @azure/cli is mostly equivalent to azure-cli from pyprojects.microsoft.com, but with some additional flexibility as indicated by the “potentially valid” @azure-contrib/enhanced-cli example.

If the asserted prefix has no owner, then the shorthand provenance assertion will always fail.

Otherwise, the namespace ownership will be checked against the implicit provenance constraints (that way an alert will still be given if namespace ownership changes), while the project API metadata will be checked to see if the namespace usage is authorised by the namespace owner. If the project is not an authorised member of the namespace, then the prefix assertion will fail. However, the package will still be installable if the prefix assertion syntax is not used.
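A sketch of expanding the proposed `@namespace/name` shorthand into an explicit provenance assertion; the lookup table is hardcoded here for illustration, whereas a real client would query the index (and also handle the `@azure-contrib` style of longer matching prefix):

```python
# Illustrative namespace-grant table; a real client would query the
# repository API for registered prefix owners.
NAMESPACE_OWNERS = {"azure": "pyprojects.microsoft.com"}

def expand_prefix_assertion(spec: str) -> str:
    """Expand e.g. "@azure/cli" into "azure-cli from pyprojects.microsoft.com"."""
    namespace, _, name = spec[1:].partition("/")
    owner = NAMESPACE_OWNERS.get(namespace)
    if owner is None:
        # An asserted prefix with no registered owner always fails.
        raise ValueError(f"no registered owner for namespace {namespace!r}")
    return f"{namespace}-{name} from {owner}"
```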

As discussed in PEP 752, it is expected that owners of namespace grants would have the ability to designate projects from third party publishers as authorised namespace members, even if those projects are not published directly by the namespace owner.

Registering new projects within a filtered namespace grant would still be permitted, but such projects would typically be unauthorised by default.

Project registration prevention with restricted namespace grants

  • Valid example: pip install enhanced-azure-cli (project name is outside the namespace grant)
  • Failing example: pip install azure-enhanced-cli (use of restricted prefix disallowed when registering project)

This idea covers the PEP 752 concept of repository level restricted namespace prefix grants.

The open vs restricted namespace grant distinction doesn’t affect the client side of things much at all (there’s a proposed API field to indicate the difference, but no real motivation for clients to ever check it).

Instead, it is more a process issue between the PSF and entities that have successfully become part of a verified publisher program, and would like to defensively register specific (most likely trademarked) prefixes to prevent future uploads to them without specific authorisation from the namespace grant owner.


Note: several of the ideas above were first suggested by other people, so if anyone has links handy, post them below and I’ll edit them into this initial post. Also feel free to mention other ideas that I missed when they came up in the discussion threads and I’ll see if it makes sense to add them to the overview.


With the hopefully-at-least-somewhat-neutral summary completed, my own current thoughts:

  • I think starting with prefix validation (as PEP 752 does) is a bad idea. I see potential value in the general notion of supporting namespace prefix grants on PyPI (especially given an explicit syntax for requesting only official packages within a namespace prefix), but without a way for users to indicate to PyPI clients which namespace owners they consider trustworthy, its potential value seems limited.
  • I really like the idea of using third party domain attestation as the root trust mechanism for publisher verification (specifically, defining a .well-known URL as @woodruffw suggested that clients can use to validate a claim like azure-cli from pyprojects.microsoft.com against the specified domain, rather than having to trust anything published by the repository server itself). Such a mechanism may also help solve the distribution problem for PEP 480 TUF signing keys.
  • there’s a valid concern that clients spidering out to check attestations against multiple domains could bring back the bad old days of find-links external repository metadata causing installation reliability problems before PEP 470 removed that feature. Implementation PEPs would need to address that concern (e.g. by having most packages continue to default to using PyPI as their sole publishing authority, and by leveraging PEP 740 to have PyPI host checkable metadata for cases where the third party attestation servers are unavailable). Having artifact level hashes pre-empt the need to check project level attestations is also aimed at avoiding this problem (since lock file users would only potentially face issues at lock file resolution time, not when installing from a lock that contains artifact hashes)
  • I like the general idea of “trust on first use” as the basis for adding new trusted publishers to a personal development environment, but I think any practical UX for that feature is going to need a way to provide a base set of trust rules, defined either by a specific organisation (for institutional use), or by the PSF (for new users in general). The differences between TLS usability (wide-spread) vs SSH usability (niche, only for highly technical users) is a major influence on my thinking here.
  • the procedural efforts needed for the PSF to be able to administer the restricted namespace grants proposed in PEP 752 feel like they could be better invested in a more general “verified publisher” program that goes beyond the basic metadata validation that PyPI is already doing. Such a program could also potentially have multiple tiers, such as “verified” (the PSF is aware of the publisher’s legal identity, is satisfied it is a genuine identification, and that the user won’t intentionally publish malware), and “high assurance” (the PSF is confident the publishing access for that account is appropriately controlled with a low risk of malicious compromise).

Is the intention that the new form(s) of requirement such as azure-cli from azpycli are usable anywhere? Because my concern would be that organisation names like azpycli aren’t immutable - Microsoft could easily reorganise and decide not to pay for that org any more and let it lapse, for example. If it was acceptable to have azure-cli from azpycli in package dependency metadata, a Microsoft organisation change could invalidate unrelated packages on PyPI.

Allowing them only in top-level user input would avoid this problem, but (a) it might be hard to precisely define what counts as “top-level user input”[1], and (b) it might not actually solve the problem of ensuring packages are trusted…


  1. does this include a requirements file, for example? ↩︎


A lot of the variants here still seem to require or imply some level of verification from PyPI, and even if resources are allocated to making that happen, I don’t think they will work well with multiple-index use.

This was pointed out in the 752/755 discussions already:

There are multiple signs that this part of the problem is in scope for organisations that would benefit from solving it. An index server that sidesteps the issue, by merging multiple index servers and preferring packages based on a configured priority order, has recently been open-sourced by developers at CERN, whose motivations are worth quoting:

Dependency confusion here was specifically in the context of multiple index servers. The full details and links to history surrounding that are in the full quoted post, and I do encourage others to go read it.


With that and many other problems in mind, I don’t think we should be building something that relies on PyPI admin review for this specific problem. Solutions that avoid it also face a lower adoption hurdle, since they don’t impose a new ongoing admin cost or an ongoing reliance on funding for something we may not want in the future. Still, even if money and time were magically not a concern, PyPI admins can’t know what might exist on all other indexes that any kind of package might clash with, either maliciously or unintentionally.

If the goal is making sure users don’t accidentally find themselves installing an unexpected (not even necessarily malicious) dependency, then I think the SSH model is the clearest fit here, especially if we pair it with a new extension to dependency specifiers that asserts “this came from this user on this index”, a user-configured store of trust, and a way to scope trust as “globally trusted”, “trusted for this project”, or “trusted for this dependency in this project” (the last of which is what would be used when the new dependency specifier is explicitly given).


That would be something that any implementation PEPs would have to spell out, but my own inclination would be to allow them anywhere, but point out that being overly strict with them in package dependencies can cause problems as org details change. (we already take a similarly relaxed approach when it comes to packages using overly strict dependency version pins, as sometimes those are valid for genuinely tightly coupled packages)

This is one of the reasons the domain name variant is at the top of my personal “that’s a potentially interesting idea” list: assuming it’s backed by a well-known query URL published on the claimed domain, then PSF involvement should only be needed for clients to turn implicit trust checks on by default (opt-in usage at an institutional level could take place in any org prepared to build out their own trusted publisher list).

None of the suggested ideas are trivial to implement in a useful fashion, though (and Paul’s comments in the publisher verification thread, about there already being several enhancements in progress in the package distribution trust management space, remain a genuine concern).

Sorry to resurrect an old thread, but wanted to add my support, some additional context and also some recent updates that relate here.

I agree this is a good idea and is largely why PEP 740 updated the Index API to include attestations and why we included the [[packages.attestation-identities]] field in PEP 751.

The threat we’re trying to avoid is a compromised release: a build that was created somewhere other than where the maintainer originally intended, and was published anyway (whether via a compromise of PyPI, a maintainer account takeover, a token compromise, etc.).

The goal is for any tool that produces a PEP 751 lockfile to include the PEP 740 attestation identity for a given file in the lockfile at generation time, and for any tool that consumes a PEP 751 lockfile to verify that there is a corresponding attestation with a matching identity at install-time.

Similarly to hash-checking, this requires trust on first use when the attestation identities are first included in the lockfile. Unlike hash checking, the attestation identity should not change from one release to the next, so each new release doesn’t need to be TOFU’d, just each project.
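The lock-then-verify flow described here could be sketched as follows. The dict shapes loosely mirror the PEP 751 `[[packages.attestation-identities]]` concept, but the specific keys in the example identities are illustrative rather than an exact rendering of the spec:

```python
def verify_locked_identity(locked_package: dict, observed: list[dict]) -> bool:
    """Install-time check: at least one observed attestation identity
    must match one recorded in the lockfile entry (if any were recorded)."""
    expected = locked_package.get("attestation-identities", [])
    if not expected:
        return True  # nothing was locked, so nothing to enforce yet
    return any(identity in expected for identity in observed)
```

A lock tool would populate `attestation-identities` at generation time (the TOFU step); an installer would then fail closed on any mismatch.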

Crucially, I believe this should happen transparently to the user and by default when possible, and without needing to understand new syntax.

Some inline replies:

I would argue that in a post-lockfile world, this is fine. Meaning that if the user wants some additional assurances, they should instead pip lock add azure-enhanced-cli (or whatever), and either a) review the generated lockfile manually or b) pass the lockfile through something that verifies it against some policy (e.g. “only permit dependencies that have attestations from these identities”), at which point they would determine that the attestation identity is something other than what they expected.

I think making assurances about the quality of the artifact itself (e.g. no malware) is probably out of scope for provenance attestations as they currently exist, as they’re really just focused on verifiably associating an artifact with either a publishing identity or a build identity. However, what you’re describing sounds a lot like cargo-vet which could be enabled by permitting third-party attestations: for example, as $organization, I could attest that I did <something> to review some given artifact, and I published those attestations, and if you trust me, you can take those attestations into account when evaluating what you’re consuming.

I think this just comes down to “how much do you trust PyPI?” (or your index of choice). If you trust the index enough to believe that the attestations it’s serving match what it verified at upload time against what the maintainer configured, trust on first use is fine. If you don’t trust PyPI, you’re going to want to get these out of band and manually collect/verify them anyways, and some base set of trust rules provided by the PSF probably won’t cut it for you. And if your organization has some policy on what these identities are, you just need a tool that can evaluate a lockfile against that policy.


I suspect someone will eventually create an auditing tool for pylock.toml that will encompass this. I have some ideas on what one might want to audit in “Lock files” (Open Source by Brett Cannon).


@dustin regarding TOFU and tooling: would it not make sense to just do that in installers like pip or uv? Trail of Bits published a prototype (I think) on GitHub [1] in their original blog post regarding attestations [2]. It seems to depend on a pretty major architecture change in pip (to allow plugins), but reads like a very logical conclusion on the “consumer” side.

1: GitHub - trailofbits/pip-plugin-pep740: an implementation of a pip plugin that verifies PEP 740 attestations before installing a package, and aborts the installation if verification fails.
2: “Attestations: A new generation of signatures on PyPI” - The Trail of Bits Blog

Yes, absolutely. I’d say there’s generally less of an appetite to add new features (even security features) to pip. As a result we were pursuing a plugin architecture for pip, to spare pip maintainers from having to maintain new features themselves, but that hasn’t gotten much traction either.

One of the additional challenges is that pip only supports vendoring pure-Python code, and some of the libraries currently needed for verification include native code.

I’m not sure on plans for adding security features like this to uv, maybe @woodruffw has some idea?

I can’t speak for other pip maintainers, but IMO the problem is that we don’t have the maintainer resources (or skills) to maintain new security features. We’re reluctant to add them because a bitrotted security feature is worse than no security feature at all…

To be clear here, “hasn’t got much traction” means “progress has stalled because the questions raised by pip maintainers[1] haven’t been addressed yet by the contributors”.

I think we’re at a bit of an impasse because I don’t see how a plugin architecture can work without some guarantees on the stability of parts of pip’s internals, and no-one has yet articulated what guarantees a plugin system needs, in order for us to evaluate whether we can commit to those guarantees. As a result, we keep getting bogged down in meta-discussions, rather than making progress.

Plugins could probably use native code, because they would be separately installed and as such wouldn’t encounter pip’s bootstrapping problem (that we need a single, universal wheel, that will run anywhere, before we can download anything from a package index).


  1. basically, me 🙂 ↩︎

Thinking some more about this, another approach might be to design some sort of PEP 517 style “installer hook” standard. That would convert the “meta discussions” into what would hopefully be a more productive standards debate, and would result in something that could be common between all installers (pip, uv, …), which could reduce the cost of reimplementing the same functionality for pip and uv. From pip’s point of view, having a standard would bypass all of the architectural questions, and we’d just be left with having to implement a well-defined interface.
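To make the idea concrete, a hypothetical hook interface might look something like the following. None of these names or signatures are standardised; this is purely a strawman of the shape such a standard could take:

```python
from typing import Protocol

class InstallerHook(Protocol):
    """Hypothetical installer hook interface (names are NOT standardised)."""

    def pre_install(self, project: str, version: str, artifact_path: str) -> None:
        """Called before each artifact is installed; raise to abort."""

class AttestationHook:
    """Example hook that rejects artifacts lacking a verified attestation."""

    def pre_install(self, project: str, version: str, artifact_path: str) -> None:
        if not self._has_verified_attestation(artifact_path):
            raise RuntimeError(f"no verified attestation for {project} {version}")

    def _has_verified_attestation(self, artifact_path: str) -> bool:
        # Real verification would live in the separately installed hook
        # package (e.g. building on pypi-attestations); stubbed out here.
        return False
```

Because hooks would be separately installed projects discovered through a standard interface, they could use native code freely and be shared between pip, uv, and other installers.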

Of course, that would need someone to write such a PEP, and I don’t know if anyone is interested in doing that.

Oh, to be clear, this is not me assigning blame on the maintainers! Entirely likely that it’s not the right idea or isn’t sufficiently motivated or just isn’t worth the effort.

Agreed, I’m not trying to assign blame either. In many ways, I’d love to see a plugin system for pip. I just don’t know what one would look like, and it’s frustrating when real world examples like this, and the others discussed in the pip issue, aren’t enough to pin down what such an interface would look like.

That’s one reason an “installer hook” standard might be helpful - it decouples the questions about what the consumers need from the concerns about how installers will support the interface.

Did not expect to get such insightful responses this quick; thank you both!

@pf_moore I am rather new to the pip maintainer discussions. I recently stumbled upon this question as I started rolling out PEP 740 compliant attestations for packages, and was confronted with the question of how a client would even use that information correctly.

Is my understanding right that it would be better to look into install hook standardization than trying to get the ball rolling again on the pip-specific plugin?

My quick and dirty solution for now is using pypi-attestations plus some deployment-method-specific mechanisms to ensure each wheel’s provenance is checked for validity and rule adherence before install.

This is probably not applicable to general users but works if you fully control the virtual environment lifecycle.

Is the inability to vendor cryptography reasonably in pip the primary reason this can’t be implemented in pip directly?

If so, it seems to me the more productive option would be figuring out how to ensure the necessary primitives are always available in Python (as part of the requirements for the language, not just CPython). pip would not be the only beneficiary here, and it’s one of the essential building blocks needed to safely bootstrap anything else.


Speaking as a former SRE: I would rather patch just the package than the python distribution to fix a security issue.

That’s just an idea I had while posting here. Creating a standard is probably harder than just agreeing a pip-specific solution, to be honest. The advantage of a standard is that it would clearly prevent using pip internals in a hook, focusing the discussion on whether that’s an acceptable limitation.

I’ve tried to make that point in the pip thread, but it’s really hard to get any clear answers when people are thinking in terms of in-process Python code being invoked from pip internal functions…

It’s the reason implementing it in pip is impossible, yes. But even if the functionality was available in the stdlib, I’d still have concerns around maintainability (as I mentioned above).

I’d rather this functionality was a plugin, because (a) it bypasses the issue of native code dependencies, and (b) it ensures that the code is maintained by subject matter experts with an interest in the relevant security issues.


Not trying to start a Rust Linux support flamewar, but unfortunately I don’t think the code in cryptography could ever be in the standard library because most of it is implemented in Rust at this point. Python can’t add Rust components because then people who want to use Python on systems where Rust can’t be installed would run into issues.

Maybe one day there will be a way to bypass that, IMO it’d be really neat if Rust was an option for new extension modules in CPython, for example.

I wasn’t necessarily advocating pulling in everything cryptography provides, and would personally prefer only pulling in the minimum reasonable primitives for bootstrapping a library with a shorter release/support cycle. While I like Rust, I think for cryptography the better option, which I would personally advocate for and be willing to contribute work on, would be including more of the formally verified work being done in HACL*, which happens to be more portable than Rust libraries as well. This should result in something we can be more confident won’t need frequent patches, while allowing the basics we need to get such a library installed reasonably.