Pre-PEP: Exposing Trusted Publisher provenance on PyPI

Hello all! I’m opening this up as a pre-PEP discussion thread, with the goal of getting some additional attention on this problem and the proposed solution before creating an actual draft PEP.

TL;DR: “We can bootstrap cryptographic provenance on top of Trusted Publishing, thereby giving us provenance for a large number of PyPI packages (by total downloads). Users will be able to retrieve this provenance, allowing them to verify that a package originates from a particular source repository or CI system.”

Disclaimer

I am a contributor to PyPI, but not a maintainer. These are my opinions, do not reflect others’, do not reflect official positions, etc. etc.

Problem statement

As of 2024, there is no good way to deliver provenance for packages on PyPI. In other words: while PyPI itself offers transport security and strong hashes for downloaded distributions, there is no way to verify that a particular package came from a particular source repository or other signing identity.

Doing this poses significant usability and operational challenges: previous attempts (like PGP signatures) have relied on publishers to maintain (and correctly rotate) long-lived signing keys, on users to retrieve those keys, and on an external ecosystem of online keyservers for key distribution. Even with all of that, users still had to establish trust in specific key-identity pairings, since any PGP key can claim to represent any arbitrary human identity. In practice, this meant that only a tiny minority of PyPI packages were ever signed with PGP (both by package count and by overall downloads), and that an even smaller minority of users actually verified those signatures.

Solution statement

A workable, lasting solution to this problem needs to sidestep the operational and usability issues that come with PGP and other manual identity binding layers.

We have a new technique for this available on PyPI, as of April 2023: Trusted Publishing. Under the hood, Trusted Publishing uses OpenID Connect to associate a PyPI project with a workflow that is trusted to publish the project, such as a GitHub Actions workflow on the project’s associated GitHub repository. The resulting association is cryptographically bound, meaning that no other user or GitHub repository can impersonate the Trusted Publisher. As a result, a Trusted Publisher can publish directly to PyPI without manual API token configuration.

Because Trusted Publishing is OIDC under the hood, any Trusted Publishing workflow can also become a provenance-generating workflow (with Sigstore) with no additional user configuration required.
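
For a concrete sense of the binding involved, here is a heavily abbreviated sketch of the identity claims in a GitHub Actions OIDC token (the values are hypothetical, and only a few of the claims are shown). Trusted Publishing matches these claims against the project’s publisher configuration, and Sigstore embeds them into the short-lived signing certificate:

```python
# Abbreviated/hypothetical claims from a GitHub Actions OIDC token.
# PyPI checks these against the project's Trusted Publisher configuration;
# Sigstore binds them into the short-lived signing certificate.
claims = {
    "iss": "https://token.actions.githubusercontent.com",
    "aud": "pypi",  # or "sigstore", when requesting a signing credential
    "repository": "pypa/sampleproject",
    "workflow_ref": "pypa/sampleproject/.github/workflows/publish.yml@refs/tags/v1.2.3",
}
```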

So, the actual solution statement: for PyPI packages that are currently published with Trusted Publishing, we provide zero-configuration provenance without user effort by reusing the Trusted Publisher identity as a Sigstore code-signing identity.

In less jargon: a repository named github.com/pypa/sampleproject that uses Trusted Publishing to upload to PyPI will also upload provenance that downstream users can verify to establish that each uploaded package genuinely comes from pypa/sampleproject’s CI.

Components

Changes to gh-action-pypi-publish (and other publishing workflows that use Trusted Publishing)

To make this work, publishing workflows like gh-action-pypi-publish will need to re-use their pre-existing id-token: write (or equivalent) permissions to obtain an OIDC credential with aud: sigstore. That credential will then be bound to a short-lived signing key via Sigstore’s “keyless signing” mechanism, allowing the workflow to sign each of the distributions to be uploaded. All of this can be abstracted behind sigstore-python, which is a mature Sigstore implementation designed (in part) for exactly this purpose.

In effect: gh-action-pypi-publish will produce {dist}.sigstore.json for each dist given to it. This requires no additional user configuration or interaction, since the permissions needed to produce {dist}.sigstore.json are the same permissions needed to upload with Trusted Publishing.
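
As a rough sketch of what this could look like inside the action, using sigstore-python’s high-level signing API (the names below reflect my understanding of recent sigstore-python releases and may drift; treat this as illustrative rather than a drop-in implementation):

```python
from pathlib import Path

from sigstore.oidc import IdentityToken, detect_credential
from sigstore.sign import SigningContext

# Reuse the ambient OIDC credential that `id-token: write` already grants
# to the Trusted Publishing workflow; no extra user configuration needed.
identity = IdentityToken(detect_credential())

with SigningContext.production().signer(identity) as signer:
    for dist in sorted(Path("dist").iterdir()):
        bundle = signer.sign_artifact(dist.read_bytes())
        # Emit the "sidecar" bundle next to the distribution itself.
        dist.with_name(dist.name + ".sigstore.json").write_text(bundle.to_json())
```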

Changes to twine (and other uploading clients)

Once publishing workflows like gh-action-pypi-publish begin producing {dist}.sigstore.json for each dist, uploading clients (like twine) will need to become aware of these “sidecar” artifacts and include them with each uploaded distribution.

In effect: similar to how PGP signatures were handled ({dist}.asc), clients like twine will need to detect {dist}.sigstore.json for each dist and upload its contents as associated metadata.
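
A minimal sketch of that detection logic (the pairing convention is the one described above; the field it would feed is the hypothetical provenance upload field proposed in the PyPI section below):

```python
from pathlib import Path

def collect_uploads(dist_dir: str = "dist"):
    """Yield (distribution, sidecar-contents-or-None) pairs for upload."""
    for dist in sorted(Path(dist_dir).glob("*")):
        if dist.name.endswith(".sigstore.json"):
            continue  # skip the sidecars themselves
        sidecar = dist.with_name(dist.name + ".sigstore.json")
        provenance = sidecar.read_text() if sidecar.exists() else None
        yield dist, provenance
```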

Changes to PyPI

PyPI requires two sets of changes for this to work:

  • Producing side: the upload endpoint will need to accept a provenance or similar POST field, containing the contents of {dist}.sigstore.json as mentioned above. This field should have at least the following semantics:

    • It MUST be present if the uploader is a Trusted Publisher, and MUST NOT be present otherwise.
      • This will require a deprecation/onboarding period, since there are existing Trusted Publishing workflows that will not immediately upgrade to the latest version of gh-action-pypi-publish.
    • It MUST be a valid Sigstore bundle (i.e. signature, signing certificate, and other metadata needed for a Sigstore verification)
    • The signature MUST be valid for the given dist, which PyPI can verify by using sigstore-python’s verification APIs with the Trusted Publisher for dist as the expected signing identity (see the sketch after this list).
  • Consuming side: PyPI will need to decide how to expose the Trusted Publisher signatures uploaded to it. Some (not mutually exclusive) options:

    • Make {dist}.sigstore.json available via an additional data- attribute on the PEP 503 Simple Index
    • Make {dist}.sigstore.json available via additional attributes in the PEP 691 Simple JSON Index
    • Expose Trusted Publishing status (and associated verified signatures) on each release view in the Web UI, similar to what NPM does
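
For the producing side, the signature check referenced in the list above might look roughly like this with sigstore-python’s verification API (the identity and issuer values are illustrative; in practice PyPI would derive them from the project’s Trusted Publisher configuration):

```python
from sigstore.models import Bundle
from sigstore.verify import Verifier
from sigstore.verify.policy import Identity

def check_upload(dist_bytes: bytes, bundle_json: str) -> None:
    """Raise unless the bundle verifies against the project's configured
    Trusted Publisher identity (the values below are illustrative)."""
    bundle = Bundle.from_json(bundle_json)
    policy = Identity(
        identity=(
            "https://github.com/pypa/sampleproject"
            "/.github/workflows/publish.yml@refs/heads/main"
        ),
        issuer="https://token.actions.githubusercontent.com",
    )
    # Checks the signature, certificate chain, transparency log entry,
    # and signing identity; raises a verification error on any mismatch.
    Verifier.production().verify_artifact(dist_bytes, bundle, policy)
```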

PEP items

Not all of the above falls into the scope of a PEP, so I’ve broken out the specific things (possibly incomplete!) that I believe need to be standardized or included under a PEP here:

  • Changes to the upload endpoint: I’m not sure if this requires a PEP (since the current endpoint isn’t specified by one), but if so: the addition of a provenance (or similar) POST field.
  • Changes to the PEP 503 and PEP 691 indices: both index formats should reflect (1) the expected Trusted Publisher identity for the uploaded release, and (2) the Trusted Publisher signature for the release (a hypothetical shape is sketched below).
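
For concreteness, a hypothetical PEP 691 file entry carrying both pieces might look something like the following (every provenance-related key name here is a placeholder I’ve invented for illustration, not a proposed spec):

```python
# Hypothetical PEP 691 file entry; the "provenance" keys are placeholders.
file_entry = {
    "filename": "sampleproject-1.2.3-py3-none-any.whl",
    "hashes": {"sha256": "..."},
    "provenance": {
        "publisher": {  # (1) the expected Trusted Publisher identity
            "kind": "github",
            "repository": "pypa/sampleproject",
            "workflow": "publish.yml",
        },
        "sigstore": "...",  # (2) the uploaded Sigstore bundle, or a URL to it
    },
}
```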

Other considerations

  • As of this writing, PyPI’s Trusted Publishing is limited to GitHub Actions. This covers a plurality (if not majority) of actively maintained projects, but is too narrow a supported platform base to confidently build a stable, long-term code-signing scheme for PyPI on top of. Consequently, everything proposed above is blocked until PyPI supports at least one (and ideally more than one) additional Trusted Publisher (e.g. GitLab, Google Cloud Build, etc.).
  • Because this bootstraps on top of Trusted Publishing, this work will be unable to provide signatures for packages that aren’t uploaded with Trusted Publishing. This is a tradeoff made for expedience and operational reasons: starting with Trusted Publishing avoids many of the hard PKI problems that code signing otherwise needs to handle, and allows PyPI to distribute signatures for a large percent of overall PyPI downloads (since many of the current top projects already use Trusted Publishing).
  • This proposal isn’t meant to be the “final state” of codesigning on PyPI. Instead, it’s meant to be an early building block for later improvements, such as PyPI emitting counter-attestations/counter-signatures for each uploaded package.
  • Because this proposal has PyPI distribute both the signatures and the identities needed to verify them, it isn’t intended to protect against threat models where PyPI itself is malicious. This is similarly done for expedience/operational reasons: PyPI is already the center of trust, and attempting to reduce that trust requires separate techniques (like TUF or mandatory asset transparency) that aren’t immediately practical to integrate.

What happens after this?

Everything above focuses on making codesigning as “no-touch” as possible, and giving PyPI the ability to verify (and redistribute) the signatures uploaded to it.

From there, there are a lot of things that could be done to further adopt codesigning in the Python ecosystem. These are outside of the immediate scope of the ideas above, but I think are worth discussing in this thread (as they’ll certainly inform the more concrete design decisions we propose):

  • How does this interact with lockfiles (and upcoming lockfile standard proposals)? Being able to lock the identity associated with a PyPI project name is useful from a security perspective, since it reduces trust in PyPI itself.
  • How do we integrate this into pip and/or other installing clients? pip has vendoring constraints that make cryptographic dependencies a challenge, due to their transitive native-code requirements.
  • Similar to the first point: long term, how do we reduce the amount of trust placed in PyPI? This technical proposal doesn’t increase the amount of trust, but doesn’t decrease it either.
  • How do we extend this to other publishing workflows, i.e. ones that won’t (or can’t) be moved to CI providers that support Trusted Publishing?
  • Long term stability: how can we build this in a forwards-compatible way, ensuring that a different provenance technique can be inserted if Sigstore becomes unmaintained or otherwise inappropriate for PyPI’s needs?

CCing a few people who I know are interested in this design and conversation: @dustin @sethmlarson @dstufft @EWDurbin @miketheman

14 Likes

I’m assuming the following question falls into this category, but I’ll ask. How does {name}.sigstore.json relate to the {name}.intoto.jsonl that is produced by the SLSA generator? For example, Flask uploads this intoto.jsonl file to the release page during the publishing workflow.

1 Like

So, the actual solution statement: for PyPI packages that are
currently published with Trusted Publishing, we provide
zero-configuration provenance without user effort by reusing
the Trusted Publisher identity as a Sigstore code-signing identity.

In less jargon: a repository named github.com/pypa/sampleproject
that uses Trusted Publishing to upload to PyPI will also upload
provenance that downstream users can verify to establish that each
uploaded package genuinely comes from pypa/sampleproject’s CI.

Restating just so I’m sure I understand the trust model: Instead of
having to trust that the package was published by someone/something
which has access to credentials for a PyPI account maintaining that
project’s releases, the user can instead trust that it was published
by someone/something which has access to credentials for a GitHub
account maintaining that project’s source code?

While I can understand the utility of being able to link a published
artifact on PyPI to a source repository in a cryptographically
strong manner, isn’t this based on an assumption that PyPI
credentials are less trustworthy or more prone to compromise than
GitHub credentials? Does this actually create trust, or merely
shuffle it around? From my personal perspective, I trust PyPI more
than I trust GitHub. After all, one is governed by a non-profit
organization beholden to its members, the other run by a for-profit
company accountable only to its shareholders.

  • As of this writing, PyPI’s Trusted Publishing is limited to
    GitHub Actions. This covers a plurality (if not majority) of
    actively maintained projects, but is too narrow a supported
    platform base to confidently build a stable, long-term
    code-signing scheme for PyPI on top of. Consequently,
    everything proposed above is blocked until PyPI supports at
    least one (and ideally more than one) additional Trusted
    Publisher (e.g. GitLab, Google Cloud Build, etc.).

It seems to me that, instead of gating this proposal on giving
people the option to alternatively trust another self-serving*
commercial enterprise like Google LLC or GitLab Inc. instead of
Microsoft Corporation, it would make more sense for the next step to
be working out how someone who maintains their own code hosting, CI
and identity provider systems can safely add them to PyPI
(presumably in a self-service fashion in order to not inundate the
PyPI admins with these requests).

*This is not meant to be derisive, companies are by definition
focused on their own profit.

5 Likes

Thank you for asking this! This connects to a philosophical/taxonomical question that @sethmlarson and I have been discussing: ultimately, what properties would we like to be verifiable from a PyPI package?

With this current proposal, the statement a third-party user can verify is “package sampleproject was published from publish.yml @ pypa/sampleproject.” But that doesn’t actually assert that sampleproject was built on pypa/sampleproject’s CI; it could have been uploaded or retrieved from somewhere else.

Another statement a third-party user might want to verify is “package sampleproject was produced by build.yml @ pypa/gh-action-build, where build.yml is a hermetic reusable workflow”. For that statement, {name}.intoto.jsonl (or its equivalent) will be what we want, I think.

TL;DR: the two are currently orthogonal, but your read is right that they can be unified in the future. I think that’s a good direction to go in, but I’ve left it out here to keep the scope (IMO) tractable.

Sort of: Trusted Publishing doesn’t override or replace trust in a project’s PyPI owners (since those owners can, at any time, choose to upload using a provisioned API token instead). I think the right way to think of it is as an alternative authentication mechanism, one that’s more misuse-resistant (since there are no manually provisioned or shared tokens, and everything is automatically scoped and self-expiring).

In the status quo, an external user (i.e. someone browsing PyPI) has the same visibility with both API token releases and Trusted Publishing releases: they can see the list of people (currently) associated with the project on PyPI, but they don’t know which of those people initiated the actual release. With this proposal they still wouldn’t know which, but they would be able to verifiably link the release back to the CI workflow that produced it (which, depending on the CI provider, might disclose who published it, but this is no different from the status quo).

I don’t think it’s based on the assumption that PyPI is less trustworthy, but instead on the following:

  • Secure secret distribution is hard and error-prone: most users scope their API tokens correctly most of the time, but the small percentage of errors they make are (1) still bad and time/resource intensive, and (2) can be avoided through misuse-resistant designs like Trusted Publishing.
  • Whether or not GitHub (or any other CI/CD provider) is more trustworthy, those that PyPI intends to support for Trusted Publishing are more institutionally capable of maintaining OIDC IdPs, PKIs, etc. This is not because the PyPI administrators are technically incapable, but because doing those things reliably requires a full staff of FTEs and SREs. Because GitHub et al. are already in trusted positions (see next point), we can basically get free institutional security properties from them.
  • GitHub and other CI providers are de facto trusted parties, for better or worse: a plurality (if not majority) of PyPI packages are already published from GitHub, meaning that a compromised GitHub already has access to tokens and other key materials that keep PyPI packages secure.

This will potentially be a controversial point, but I think there’s very limited value in allowing arbitrary Trusted Publisher integrations against PyPI: for small self-hosted services, there’s no meaningful security distinction between running an OIDC PKI (with online signing materials for short-lived OIDC credentials) and continuing to use an ordinary PyPI API token. In other words: the benefit of Trusted Publishing comes from institutional PKI management; making small self-hosters run a PKI is probably going to be even more misuse-prone than API tokens are :slightly_smiling_face:

That being said, I think we do need to come up with some kind of provenance or attestation solution for people who can’t or won’t use a Trusted Publisher. I’ve left that out of this proposal because it’s a significantly harder task; I’m trying to solve the plurality/majority case first (which, for better or worse, is tied to commercial code-hosts).

1 Like

I agree, at least in the sense that I don’t think that trusted publishing is something we should assume is an “obvious” thing to use - at least in its current form. For example, trusted publishing as I understand it relies at least to some level on automatically triggered CI actions that do the build and release. While I’m a happy user of github, I have never been comfortable with the idea of automating releases to the “push a button and it happens” level[1]. Local builds with a manual twine upload are my preference, for reasons that I don’t feel I should need to justify to anyone (as it’s my project, and maintenance workflows are my business).

And I’m mildly concerned that we may see institutional users for whom provenance is important (or possibly even legally mandated) pushing projects towards trusted publishing as a result, with no regard for the workflow preferences of project maintainers.


  1. Although I’ll be fair and say that a lot of my reservations are around things like “push a tag to trigger a release” workflows, which I dislike because it’s not how I use tags, and the fact that I have no experience or examples of triggering mechanisms that do feel comfortable to me. ↩︎

4 Likes

This will potentially be a controversial point, but I think
there’s very limited value in allowing arbitrary Trusted Publisher
integrations against PyPI: for small self-hosted services, there’s
no meaningful security distinction between running an OIDC
PKI (with online signing materials for short-lived OIDC
credentials) and continuing to use an ordinary PyPI API token. In
other words: the benefit of Trusted Publishing comes from
institutional PKI management; making small self-hosters run a PKI
is probably going to be even more misuse-prone than API tokens are
:slightly_smiling_face:

That being said, I think we do need to come up with some kind of
provenance or attestation solution for people who can’t or won’t
use a Trusted Publisher. I’ve left that out of this proposal
because it’s a significantly harder task; I’m trying to solve the
plurality/majority case first (which, for better or worse, is tied
to commercial code-hosts).

I’m coming from participation in large communities publishing
hundreds of different projects to PyPI with thousands of releases
over more than a decade, sufficiently sized to have members
collaboratively operating fully open source code hosting and CI/CD
infrastructure instead of relying on commercial solutions. At least
having some idea of how these sorts of communities might participate
in the proposed trust model would be appreciated.

Granted, we’re perfectly happy using OpenPGP signatures. In addition
to making packages available on PyPI, we also publish them along
with their associated signatures on our own sites, cautioning
downstream consumers and redistributors to check those directly
since PyPI no longer supplies them. In our case, the primary
audience for those signatures is curated distributions (e.g. Debian)
who can set the expected sdist signing key in package update scripts
and rely on that to verify that subsequent release tarballs were
published by systems with access to the corresponding automation
signing key, cross-signed by release managers and infrastructure
sysadmins.

3 Likes

Yes, that’s correct (although NB that this is true for any CI publishing workflow, including ones that use API tokens – the thing that sets Trusted Publishing apart is that it removes the long-lived credential, making it harder for users to mis-configure or leak their CI secrets).

I appreciate you saying this explicitly. Doing a local upload is your business, and this work is not intended to preclude that kind of workflow – Trusted Publishing is a refinement over API tokens for the “happy path” of CI-based publishing, not a replacement for API tokens.

If people are interested, I would be happy to have a focus session (online or IRL) on figuring out a good model for non-Trusted Publisher attestations. But I think this work has value independent of that conversation, and does not infringe on either the possibilities there or on non-CI publishing workflows :slightly_smiling_face:

3 Likes

I’m also in a situation where for $work reasons we are actively discouraging OIDC in favour of our own internal automation. It would be fine for us to set up some form of handshake to provide provenance guarantees, even if it’s a bit of work, as we’d do it once and be done. Even an OpenPGP signature wouldn’t be impossible.

My main concern here is that the existence of “trusted publisher” information that doesn’t include the ones we’ve published will cause users to think there’s something wrong with ours since they don’t have it. The primary practical value of publisher provenance is to detect when something changes,[1] mainly that it’s usually there but is missing for the release you’re about to install. Provided we can communicate that clearly, and that packages that never usually have it aren’t inherently less trustworthy, I think this is great (and eventually we’ll be able to set up ways to trust non-OIDC workflows).


  1. This is why I publicise when the Authenticode certificate used for Python releases changes. ↩︎

6 Likes

Echoing this point… while it may be outside the scope of a PEP, the way this information is presented in the PyPI UI is important. An empty attestation field for a project which has never used attestation can be harmful, whereas an empty field for a project which has used it at least once is useful information.

6 Likes

To my understanding: the size of community you’re describing would be sufficiently large to potentially warrant a Trusted Publishing integration. The point of my comment wasn’t to exclude large communities, but to say that self-hosted code forges with 1-10 active users would probably not benefit from running an OIDC PKI versus configuring API tokens for themselves.

However, I think that’s a separate issue. In the medium term the next step there is probably for PyPI to decide what the sufficient conditions for Trusted Publishing integration are; whatever those end up being, this proposal will piggyback off of them (since, per the original comment, I don’t think it’s justifiable to do this with only GitHub being supported).

Fully agreed. I think it’s impossible to guarantee that users understand that this is additive and not subtractive w/r/t non-Trusted Publisher workflows, but I’ll try my best to communicate this in both the design and its public explanations/documentation (presuming broad consensus here).

As a practical matter, everything proposed above will probably have very little impact on the average user: pip won’t be able to verify provenance until a suitable technique for vendoring the necessary cryptographic dependencies is found, and the absence of a standard Python lockfile means that users won’t (immediately) be able to make policy decisions around Trusted Publisher changes.

In terms of visible user-facing changes, I think the only immediate one would be some small changes to the PyPI project UI. One of the ideas there is to mitigate metadata confusion by having some kind of designator that indicates that the project’s specified repository is the same one as in the Trusted Publisher. But I think that’ll look more like a positive signal (little green checkmark on the metadata box?) than a negative signal (big scary warning on non-TP-published projects).

3 Likes

Exposing my total ignorance here, as a pip maintainer I have no idea what “verifying provenance” would even look like in the pip user interface. If I were doing a simple pip install requests, how precisely do you imagine the current behaviour would change in a “verifying provenance” context? I’m assuming that such verification would be triggered by some sort of --verify-provenance command line flag, but beyond that, what would happen?

I’m interested because I’d like to start considering what my opinion of such a feature would be. We have had a similar discussion some time ago around the pip-audit tool, which was proposed (but rejected) as a native pip subcommand. And while we’re clearly a long way from needing provenance verification in pip yet, I’d like to get enough of a feel for what this is all about to have an informed opinion when that time does come.

3 Likes

In the medium term the next step there is probably for PyPI to
decide what the sufficient conditions for Trusted Publishing
integration are; whatever those end up being, this proposal will
piggyback off of them (since, per the original comment, I don’t
think it’s justifiable to do this with only GitHub being
supported).

Agreed, I looked and didn’t find any clear specification on how I
or other interested parties could propose a change to the warehouse
repo to integrate a “forge” like ours as a trusted publisher. What
are the expected capabilities? Are there policy concerns which need
to be satisfied? These are the sorts of things it would be great to
see addressed without expecting someone to reverse-engineer the
current GitHub integration in the codebase.

Presumably these are questions which need to be answered and work
done in the process of adding a second trusted publisher integration
anyway, so transparently exposing that decision-making process and
the details of the required development in a discoverable way would
be much appreciated. The current “Internals and Technical Details”
in the PyPI docs is a good start, but it’s focused more on how it
works rather than what needed to be done to make it work.

3 Likes

This needs much more thought and discussion than I’m about to summarize, but here’s the very rough, handwavey idea of what I envision for the distant future:

  • For ordinary pip install invocations, nothing changes from the user’s perspective.
  • When --require-provenance or similar is passed in, pip install will require that every retrieved distribution has some kind of valid digitally signed provenance.
  • When given a lockfile containing provenance information (this doesn’t exist yet!), pip install will behave as if --require-provenance was passed, analogous to the way pip requires hashes if any requirement has hashes, even if --require-hashes is not explicitly specified (a hypothetical lockfile shape is sketched below).
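
To make the lockfile bullet concrete, here is a purely hypothetical sketch of what a provenance-carrying lock entry might contain (no such format exists yet; every key name here is invented for illustration):

```python
# Purely hypothetical lock entry; nothing here is standardized.
locked_requirement = {
    "name": "sampleproject",
    "version": "1.2.3",
    "hashes": {"sha256": "..."},
    "provenance": {
        # Pinning the identity (not just the hash) means a future release
        # signed by a different publisher would fail verification even if
        # PyPI itself served it -- which is what reduces trust in PyPI.
        "identity": "https://github.com/pypa/sampleproject/.github/workflows/publish.yml",
        "issuer": "https://token.actions.githubusercontent.com",
    },
}
```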

Beyond that, it would probably make sense to determine whether and what policy flexibility should be exposed on pip’s side (if any). However, that conversation will probably depend on a clearer picture of how we decide to handle non-Trusted Publishing provenance/attestation statements.

3 Likes

That sounds like a recipe for packages deep in the dependency tree being subjected to massive pressure to provide provenance, and is something I’d be very strongly against pip being a part of. If you want to enforce that sort of high-integrity environment, use a tool that’s built for it, not a general tool like pip. The pip maintainers don’t have the expertise to judge the issues this would raise - for example, should build dependencies be required to have provenance data? What about source trees from a local directory? How would they provide provenance? I wouldn’t know how to answer such questions, so I’d be unable to even triage an issue on the pip tracker that claimed any given behaviour was “necessary”, or “a bug”…

Please don’t feel that you need to answer these questions now. What you’ve said is precisely what I asked for, and gives me enough to let me form my opinions. Consider the above as just a “sneak peek” of the sorts of concerns I’ll have when this does come up for discussion :wink:

4 Likes

Fair response, and I appreciate the follow-on questions. I don’t have good answers yet, and your point about pip being a general tool without the maintainer specialization needed to differentiate failure modes here is an excellent one!

As another thing to think about: @dstufft and I have talked a bit about whether a provenance-based solution can ever provide this level of “universal” coverage, or whether PyPI/Python packaging would be better off limiting this to “specialty” cases and pursuing a Golang-style Checksum Database instead. These would be complementary techniques, but with this approach pip would probably only care about the Checksum Database (since, like hashes, it could make universal statements about validity without a whole bunch of user-derived ambiguity).

But that’s an entire can of worms, and is unconnected to the proposal here. Just something to think about as well :slightly_smiling_face:

2 Likes

I’ve talked to Will about this already, so I believe he already knows my general thinking, but just to get it in this thread.

I think that supporting this kind of provenance information does provide something useful. Specifically, it lets any random end user cryptographically verify that an artifact was published by a particular workflow, running in a particular repository, at a particular time (assuming you trust the trusted publisher, who holds the root keys that allow this claim to be made).

That’s a pretty useful thing to be able to introspect about an artifact when you’re investigating where a particular artifact on PyPI came from, and even without any support for anything but uploading these provenance attestations, I think there’s a net value gain.

These sorts of things can also allow us to start presenting some verified information in the UI and API for PyPI (which is something that might make sense to expose in something like pip show). This would have to be done carefully to avoid the “HTTPS padlock” problem (e.g. don’t act like a package is secure/safe just because it has provenance information), but also to avoid “punishing” projects that don’t have that information available.

I’m not a UI designer, so take this with a grain of salt, but you could imagine a “Provenance” tab showing up on PyPI that, without verifiable provenance information, just says something like “No provenance information is available”, but when it is available starts to list things we know about the project that we can attest to, like “was published from X repo, using Y workflow, at Z commit”.

However, where I think it starts to break down is that a boolean check for “has provenance information” isn’t a particularly useful security control in my opinion. I have no rights to publish for Django on PyPI, but I can easily publish a package with verifiable provenance information from github.com/dstufft/django, and any sort of boolean check like that would be perfectly happy to accept it, because it has no way of knowing whether that’s the right repo/workflow/etc.

This is the thing that made a lot of the early internet signing schemes pretty useless too-- without knowing what the right provenance is (or key in a signing scheme), you can’t verify anything except that provenance (or a signature) exists, which doesn’t tell you much. The provenance information does have some benefits in that it has metadata that comes from a handful of trusted, well known entities that can make presenting that verified metadata to users easier to do.

So I’m pretty down on the idea of introducing boolean checks for provenance (like a hypothetical --require-provenance), but displaying verified data is pretty useful I think as long as it’s presented in a good way.

If we could get to a world where we do have a secure way to know what the provenance information should be (including if a particular package should have it or not), then I think it would be a useful thing to add to pip (or another client).

There’s a possibility of some kind of TOFU setup that acts sort of like SSH: the first time you install a package, it records what the provenance “source” is (possibly using lockfiles as the data store?), and then verifies against that in the future.

I’m personally pretty down on TOFU schemes. I’ve been using OpenSSH for something like 20+ years now and I don’t think I’ve ever done anything but blindly accept a new server key, and I suspect most people are the same. I’m very wary about introducing “alert fatigue”, where every legitimate change (say, when I moved packaging from dstufft/packaging to pypa/packaging) triggers a big warning that users have to accept to continue their install. Eventually users get trained to just press the “keep going” button and end up destroying the effectiveness of that security control.

All of that to say, I’m +1 on making it possible to emit provenance information (and honestly, I’m OK with allowing it from more than just trusted publishers -- though there’s a related problem where PyPI couldn’t display “verified” provenance information if the source of that information isn’t itself trusted by PyPI -- but we could still allow it to be published if end users wanted to consume it and make choices about it), but -1 on trying to make that information more meaningful to the tooling beyond “here’s some information we know about this artifact, as attested to by GitHub (or whoever)” unless we can solve the problem of knowing what that trusted information should be.

10 Likes

I would be quite happy with even just a green checkmark next to the repository URL on the PyPI project page to indicate a package was indeed built from said repository. Not with any security aspect in mind, but only to verify that the wheel I am about to install does, in fact, correspond to the code I am seeing on GitHub.

6 Likes

This is probably the best option, I certainly wouldn’t be upset about it (no matter which “hat” I’m wearing).

I’d also be happy if a link to the specific build [logs] or commit were exposed in its own section, but on the same level as other metadata. All we’re really able to definitively state is that a particular CI build published the package, so that’s probably all we should say.

2 Likes

Just to clarify: the hypothetical --require-provenance assumed a trusted identity mapping, i.e. some backing verification procedure other than the boolean of “yep, it has provenance.” I completely agree that a boolean check for provenance itself is 0% useful (and with the consequent points about alert fatigue/TOFU designs having limited value) :slightly_smiling_face:

How we get that trusted mapping is an open research question – with Trusted Publishers we have a strong “honest” link between distribution artifacts and repositories (in the sense that an attacker can’t lie about the Trusted Publisher identity), but we don’t have a reason to trust that link. Even Golang’s sumdb struggles to do better than TOFU here – the checksum database is only kept honest after the initial go.sum is created, since there’s no gossiping between packaging clients. That isn’t a problem we need to solve immediately, but it’s something that absolutely needs a solution once/if we try to move this further into general clients like pip.

3 Likes