Ideas for client side package provenance checks

I’ve seen several interesting ideas go by in the PEP 752/755 discussions, so I wanted to summarise the various possibilities in one place. I’m also going to include ideas from PEP 752 in this list, since some of the options use different pieces of that as their foundation.

If folks think it would be worthwhile, I’d be willing to make this summary an Informational PEP, similar to those the CPython core dev team have sometimes used to put different competing PEPs into a shared context (see PEP 607, “Reducing CPython’s Feature Delivery Latency”, for an example). The ideas are separable enough that I don’t believe it would be a good idea to pursue them all in a single standardisation proposal, even for the ideas that are mutually compatible. However, I don’t have any plans of my own to actually implement these ideas; I just wanted to summarise them in one place.

The recurring example I am going to use is from the azure- namespace on PyPI, comparing a hypothetical azure-enhanced-cli project (published by someone other than Microsoft, such as the some-publisher@domain.example used in the examples below) and azure-cli (an official Microsoft project published by the “azpycli” team).

This initial post is just aimed at describing the ideas, without attempting to assess their merits or go too far into the related technical details. I’ve only listed ideas that I think would be technically feasible.

Explicit provenance assertions

Explicit provenance assertions would use a new syntax like package-name from expected-source to declare an expected publisher for a component (other potential syntaxes for this were suggested, such as expected-source::package-name, but I personally prefer the keyword format).

For all of the ideas in this category, PEP 740 digital attestations could potentially be used to assert strong claims to the given discriminator without clients having to trust the index server.

While it likely makes the most sense to pick one provenance assertion mechanism and run with it, it would be possible to define marker characters that distinguished between the potential options at resolution time (@ for email addresses, trailing / for URLs, . for domain names, anything else is a user/org name).

Provenance assertions are NOT applicable to direct URL references (those should instead use hashes as a direct artifact level constraint, and otherwise rely on TLS certificates or SSH keys for domain validation).

Using email addresses

  • Valid example: pip install 'azure-cli from azpycli@microsoft.com'
  • Valid example: pip install 'azure-enhanced-cli from some-publisher@domain.example'
  • Failing example: pip install 'azure-enhanced-cli from azpycli@microsoft.com'

The idea behind using email addresses is that this is how the existing metadata standards identify package maintainers and authors.

Pursuing this idea may require tightening up how PyPI validates email address usage in new projects (or when changing the email address metadata on existing projects).

Using repository user and/or organisation names

  • Valid example: pip install 'azure-cli from microsoft'
  • Valid example: pip install 'azure-cli from azpycli'
  • Valid example: pip install 'azure-enhanced-cli from some-publisher'
  • Failing example: pip install 'azure-enhanced-cli from microsoft'

The idea behind using repository user and/or organisation names is that it avoids having to define a way for repositories to verify a publisher’s ownership of an external resource (whether that’s an email address or a domain name).

Pursuing this idea would need some form of enhancement to the repository JSON API standards to make the relevant repository account metadata available to API clients.

The main downside to this approach is that it doesn’t generalise nicely across different index servers (since the user/organisation names are repository dependent).
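One possible shape for such an API enhancement (every field name here is hypothetical, not part of any current standard) would be to add publisher account details to the PEP 691 style project response:

```json
{
  "meta": {"api-version": "1.x"},
  "name": "azure-cli",
  "publisher": {
    "account": "azpycli",
    "organization": "microsoft"
  }
}
```

A client resolving `azure-cli from azpycli` would then compare the asserted name against the `publisher` details reported by whichever repository it is talking to, which is also where the portability problem shows up: the same assertion means different things on different index servers.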

Using domain names

  • Valid example: pip install 'azure-cli from pyprojects.microsoft.com'
  • Valid example: pip install 'azure-enhanced-cli from some-publisher.domain.example'
  • Failing example: pip install 'azure-enhanced-cli from pyprojects.microsoft.com'

The idea behind using domain names is that they’re already a commonly validated external resource, and most major publishers of packages will already have one. If no external domain is specified for a project, then attempting to assert any specific provenance for that project will fail.

Pursuing this idea would need some mechanism to allow repositories to reliably assert that a particular user or organisation account is an authorised publisher for a given external domain and/or for clients to check that a given project is authorised for publication (see the previous thread for some ideas on that front)

PEP 708 (the tracking/alternate location metadata for projects) would also potentially be relevant to this approach.
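A sketch of what the client side of domain validation might look like, assuming a well-known URL convention along the lines later suggested in this thread. Both the well-known path and the payload shape are invented for illustration; nothing like this is standardised yet:

```python
# Hypothetical well-known path; no such standard currently exists.
WELL_KNOWN_PATH = "/.well-known/python-packages.json"

def well_known_url(domain: str) -> str:
    """URL a client might fetch to discover a domain's published projects."""
    return f"https://{domain}{WELL_KNOWN_PATH}"

def provenance_matches(project: str, fetched_payload: dict) -> bool:
    """Check a 'project from domain' assertion against the (already
    fetched) payload published on the claimed domain."""
    return project in fetched_payload.get("projects", [])
```

Under this scheme, checking `azure-cli from pyprojects.microsoft.com` means fetching the well-known URL on `pyprojects.microsoft.com` and confirming `azure-cli` appears in the response, without trusting anything the index server says about the domain.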

Using HTTPS URLs

  • Valid example: pip install 'azure-cli from github.com/Azure/'
  • Valid example: pip install 'azure-enhanced-cli from domain.example/some-publisher/'
  • Failing example: pip install 'azure-enhanced-cli from github.com/Azure/'

Similar to the domain name idea, but allowing domains to be divided by URL paths, not just by subdomains.

Pursuing this idea would require some mechanism for verifying ownership of exact URLs rather than entire domains.

Implicit provenance constraints

All of the explicit provenance assertions share a common weakness: a naive user typing pip install azure-enhanced-cli doesn’t receive any indication that this is an unofficial third party project rather than an official Azure client package published by Microsoft.

Implicit provenance constraints would be designed to avoid that weakness.

The provenance checking mechanisms in this section would likely work best if a single form of explicit provenance assertion was defined as the one to be used for implicit provenance constraints. The examples below assume the use of domain names for this purpose (since they’re likely to be the easiest mechanism for clients to be able to validate independently of any specific repository).

Implicit provenance constraints would only be used when there are no artifact level constraints defined (if you have an artifact hash to check against, you no longer care where that artifact came from).

Trust on first use

  • Example prompt:
    pip install azure-cli
    'azure-cli 2.64.0' is published by 'microsoft.com', do you wish to trust this publisher for this project? [y/N]:
  • Example prompt:
    pip install azure-enhanced-cli
    'azure-enhanced-cli 0.9' is published by 'some-publisher.domain.example', do you wish to trust this publisher for this project? [y/N]:

Given a defined form of provenance checking, clients would be prompted to accept projects with unknown provenance the first time they’re encountered. While potentially viable for experienced users, this would get very noisy and confusing for new users, hence the other ideas in this section.

When granting approval, users would be able to choose between “trust this project with this provenance” and “trust all projects with this provenance” (probably by asking a second question after project level approval is given, as it would be hard to make a combined question both clear and concise).
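The two approval scopes described above could be represented with a very small client-side trust store. This is purely an illustrative data structure, not a proposed on-disk format:

```python
class TrustStore:
    """Minimal sketch of a trust-on-first-use store supporting both
    per-project and blanket publisher approval."""

    def __init__(self) -> None:
        self._trusted_publishers: set[str] = set()           # "trust all projects with this provenance"
        self._trusted_pairs: set[tuple[str, str]] = set()    # "trust this project with this provenance"

    def trust_project(self, project: str, publisher: str) -> None:
        self._trusted_pairs.add((project, publisher))

    def trust_publisher(self, publisher: str) -> None:
        self._trusted_publishers.add(publisher)

    def is_trusted(self, project: str, publisher: str) -> bool:
        return (publisher in self._trusted_publishers
                or (project, publisher) in self._trusted_pairs)
```

An installer would consult `is_trusted()` before resolving a project with no artifact level constraints, and only fall back to prompting when it returns `False`.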

Sharing trusted provenance lists

Via some mechanism (most likely packaging entry points with a defined API), installed packages would be able to declare that they provided lists of trustworthy provenance details that mean (in the absence of a publishing account compromise) distribution packages with that provenance are unlikely to include malicious code.

This would allow organisations to publish lists of pre-approved provenance details as installable projects, rather than their users having to define all those trust rules individually.

Defining a default verified publisher list

TLS certificate checking relies heavily on the default CA lists shipped by browser and operating system vendors. Many Python projects directly or indirectly use the certifi certificate bundle (built from the cert bundles published by Mozilla) as the trust root for their TLS validation.

This is the first idea on this list where we start getting into the same territory as PEP 752 and PEP 755: given an implicit provenance constraint mechanism at the technical level, it would likely be feasible for the PSF to build a “verified publisher” policy mechanism on top of that, where, in return for providing the PSF with additional identifying and contact information, as well as paying an administrative fee, the PSF would:

  • perform sufficient checks to satisfy the PSF that the applicant is a legitimate publisher that isn’t aiming to distribute malware
  • add the publisher’s provenance details to an installable PyPI package that package management clients can use as a default

The administrative fee should be set high enough to cover both the actual administration costs, as well as the reputational risk if a verified publisher turns out to be less trustworthy than hoped. The fee could potentially be discounted and/or waived for PSF members and for other open source organisations, but the verification step should happen regardless.

Just as the trusted CA lists in browsers and operating systems evolve over time, so would the default verified publisher list.

Namespace prefix provenance constraints with open namespace grants

  • Valid example: pip install @azure/cli (authorised member of the azure- namespace)
  • Valid example: pip install azure-enhanced-cli (no prefix assertion specified)
  • Failing example: pip install @azure/enhanced-cli (NOT an authorised member of the azure- namespace)
  • Potentially valid example: pip install @azure-contrib/enhanced-cli (could be an authorised azure-contrib project)
  • Failing example: pip install @azure/contrib-enhanced-cli (still not an authorised member of the azure- namespace)

This idea builds on the PEP 752 concept of repository level open namespace prefix grants.

In this idea, assuming pyprojects.microsoft.com is the registered owner of the azure- prefix, then specifying @azure/cli is mostly equivalent to azure-cli from pyprojects.microsoft.com, but with some additional flexibility as indicated by the “potentially valid” @azure-contrib/enhanced-cli example.

If the asserted prefix has no owner, then the shorthand provenance assertion will always fail.

Otherwise, the namespace ownership will be checked against the implicit provenance constraints (that way an alert will still be given if namespace ownership changes), while the project API metadata will be checked to see if the namespace usage is authorized by the namespace owner. If the project is not an authorized member of the namespace, then the prefix assertion will fail. However, the package will still be installable if the prefix assertion syntax is not used.

As discussed in PEP 752, it is expected that owners of namespace grants would have the ability to designate projects from third party publishers as authorised namespace members, even if those projects are not published directly by the namespace owner.

Registering new projects within a filtered namespace grant would still be permitted, but such projects would typically be unauthorised by default.
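The expansion and checking logic for the prefix shorthand could be sketched as follows. The `grants` and `authorised` lookup tables stand in for repository API metadata that doesn’t exist yet, so treat all of this as illustrative:

```python
def expand_prefix_shorthand(spec: str) -> tuple[str, str]:
    """Expand '@namespace/rest' into (project_name, namespace_prefix)."""
    if not spec.startswith("@") or "/" not in spec:
        raise ValueError(f"not a prefix assertion: {spec!r}")
    namespace, _, rest = spec[1:].partition("/")
    return f"{namespace}-{rest}", f"{namespace}-"

def check_prefix_assertion(spec: str, grants: dict[str, str],
                           authorised: dict[str, set[str]]) -> str:
    """Return the namespace owner the project's provenance must match,
    raising if the assertion fails (grants/authorised are hypothetical
    stand-ins for repository API metadata)."""
    project, prefix = expand_prefix_shorthand(spec)
    owner = grants.get(prefix)
    if owner is None:
        raise LookupError(f"no owner registered for prefix {prefix!r}")
    if project not in authorised.get(prefix, set()):
        raise LookupError(f"{project!r} is not an authorised member of {prefix!r}")
    return owner
```

With `grants = {"azure-": "pyprojects.microsoft.com"}` and `azure-cli` authorised, `check_prefix_assertion("@azure/cli", ...)` succeeds and hands back the owner to check against the implicit provenance constraints, while `@azure/enhanced-cli` fails the membership check even though plain `pip install azure-enhanced-cli` would still work.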

Project registration prevention with restricted namespace grants

  • Valid example: pip install enhanced-azure-cli (project name is outside the namespace grant)
  • Failing example: pip install azure-enhanced-cli (use of restricted prefix disallowed when registering project)

This idea covers the PEP 752 concept of repository level restricted namespace prefix grants.

The open vs restricted namespace grant distinction doesn’t affect the client side of things much at all (there’s a proposed API field to indicate the difference, but no real motivation for clients to ever check it).

Instead, it is more a process issue between the PSF and entities that have successfully become part of a verified publisher program, and would like to defensively register specific (most likely trademarked) prefixes to prevent future uploads to them without specific authorisation from the namespace grant owner.


Note: several of the ideas above were first suggested by other people, so if anyone has links handy, post them below and I’ll edit them into this initial post. Also feel free to mention other ideas that I missed when they came up in the discussion threads and I’ll see if it makes sense to add them to the overview.


With the hopefully-at-least-somewhat-neutral summary completed, my own current thoughts:

  • I think starting with prefix validation (as PEP 752 does) is a bad idea. I see potential value in the general notion of supporting namespace prefix grants on PyPI (especially given an explicit syntax for requesting only official packages within a namespace prefix), but without a way for users to indicate to PyPI clients which namespace owners they consider trustworthy, its potential value seems limited.
  • I really like the idea of using third party domain attestation as the root trust mechanism for publisher verification (specifically, defining a .well-known URL as @woodruffw suggested that clients can use to validate a claim like azure-cli from pyprojects.microsoft.com against the specified domain, rather than having to trust anything published by the repository server itself). Such a mechanism may also help solve the distribution problem for PEP 480 TUF signing keys.
  • There’s a valid concern that clients spidering out to check attestations against multiple domains could bring back the bad old days of find-links external repository metadata causing installation reliability problems before PEP 470 removed that feature. Implementation PEPs would need to address that concern (e.g. by having most packages continue to default to using PyPI as their sole publishing authority, and by leveraging PEP 740 to have PyPI host checkable metadata for cases where the third party attestation servers are unavailable). Having artifact level hashes pre-empt the need to check project level attestations is also aimed at avoiding this problem (since lock file users would only potentially face issues at lock file resolution time, not when installing from a lock that contains artifact hashes).
  • I like the general idea of “trust on first use” as the basis for adding new trusted publishers to a personal development environment, but I think any practical UX for that feature is going to need a way to provide a base set of trust rules, defined either by a specific organisation (for institutional use), or by the PSF (for new users in general). The difference between TLS usability (widespread) and SSH usability (niche, only for highly technical users) is a major influence on my thinking here.
  • The procedural efforts needed for the PSF to be able to administer the restricted namespace grants proposed in PEP 752 feel like they could be better invested in a more general “verified publisher” program that goes beyond the basic metadata validation that PyPI is already doing. Such a program could also potentially have multiple tiers, such as “verified” (the PSF is aware of the publisher’s legal identity, is satisfied it is a genuine identification, and that the user won’t intentionally publish malware), and “high assurance” (the PSF is confident the publishing access for that account is appropriately controlled with a low risk of malicious compromise).

Is the intention that the new form(s) of requirement such as azure-cli from azpycli are usable anywhere? Because my concern would be that organisation names like azpycli aren’t immutable - Microsoft could easily reorganise and decide not to pay for that org any more and let it lapse, for example. If it was acceptable to have azure-cli from azpycli in package dependency metadata, a Microsoft organisation change could invalidate unrelated packages on PyPI.

Allowing them only in top-level user input would avoid this problem, but (a) it might be hard to precisely define what counts as “top-level user input”[1], and (b) it might not actually solve the problem of ensuring packages are trusted…


  1. does this include a requirements file, for example?


A lot of the variants here still seem to require or imply some level of verification from PyPI, and even if resources are allocated to making that happen, I don’t think these will work well with multiple indexes in use.

This was pointed out in the 752/755 discussions already:

There are multiple things pointing to this part of the problem being in scope for organizations that would benefit from it. An index server that sidesteps this issue by merging multiple index servers and preferring packages based on the configured index order has recently been open-sourced by developers at CERN, whose motivations are quotable as:

Dependency confusion here was specifically in the context of multiple index servers. The full details and links to history surrounding that are in the full quoted post, and I do encourage others to go read it.


With that and many other problems in mind, I don’t think we should be building something that relies on PyPI admin review for this specific problem. There’s a benefit here in that such solutions also face less of a hurdle, as they don’t imply an ongoing new admin cost or an ongoing additional reliance on funding for a solution we may not want in the future. Still, even if money and time were magically not a concern, PyPI admins can’t know what might exist on all other indexes that any kind of package might clash with, either maliciously or unintentionally.

If the goal is making sure users don’t accidentally find themselves installing an unexpected (not even necessarily malicious) dependency, then I think the SSH model is the clearest here, especially if we pair it with a new extension to dependency specifiers that asserts “this came from this user on this index”, a way to have a user-configured store of trust, and the ability to scope trust as “globally trusted”, “trusted for this project”, or “trusted for this dep in this project” (the last of which is what is used when the new dependency specifier is explicitly used)


That would be something that any implementation PEPs would have to spell out, but my own inclination would be to allow them anywhere, but point out that being overly strict with them in package dependencies can cause problems as org details change. (we already take a similarly relaxed approach when it comes to packages using overly strict dependency version pins, as sometimes those are valid for genuinely tightly coupled packages)

This is one of the reasons the domain name variant is at the top of my personal “that’s a potentially interesting idea” list: assuming it’s backed by a well-known query URL published on the claimed domain, then PSF involvement should only be needed for clients to turn implicit trust checks on by default (opt-in usage at an institutional level could take place in any org prepared to build out their own trusted publisher list).

None of the suggested ideas are trivial to implement in a useful fashion, though (and Paul’s comments in the publisher verification thread about there already being several enhancements in progress in the package distribution trust management space remain a genuine concern).