PEP 740: Index support for digital attestations

woodruffw · January 29, 2024, 9:59pm

Draft PEP: PEP 740 – Index support for digital attestations | peps.python.org

Other context: Pre-PEP: Exposing Trusted Publisher provenance on PyPI

To summarize the rationale and motivation:

Previous efforts to host digital signatures on the index have been largely ad-hoc and not subject to any constraints or invariants other than “there might be a .asc file adjacent to the distribution”. This proposes a structured attestation storage and presentation scheme, and also provides for stronger invariants between release files (if a release file comes with an attestation, all other files in the release must have similarly typed attestations).
This PEP is intentionally agnostic towards the set of attestation formats, prescribing only that they need to be:
1. Uniquely identified with human-readable identifiers
2. Verifiable by the index itself
This is done to prevent compatibility or longevity risks: the expectation with this PEP is that, upon acceptance, PyPA will standardize one or more attestation formats as part of the PyPA Specifications, which will then form the initial set of attestation formats accepted by PyPI.

Summary of the proposed changes:

When uploading release files, each file may be accompanied by an attestations JSON blob that contains key-value pairs of attestation-type, attestation-object. A contrived example of this is provided in the draft PEP.
The simple index (PEP 503) and simple JSON API (PEP 691) will both serve these uploaded attestations, as part of a larger “provenance” object that also contains Trusted Publisher metadata. A contrived example of this is also provided in the draft PEP.

I look forward to all feedback here! And thanks, in advance, to everybody who comments below.

CC @dstufft (as sponsor/delegate) and @sethmlarson (as SDIR)

brettcannon · January 29, 2024, 11:45pm

I didn’t see a mention of a metadata version bump, but e.g. PEP 700 – Additional Fields for the Simple API for Package Indexes | peps.python.org led to a minor bump. Otherwise the subject matter is outside my area of expertise to comment on its content.

woodruffw · January 29, 2024, 11:51pm

Thank you for catching that! I think this does require a minor bump, since it’ll be an additional field.

(I’ll batch this along with other feedback.)

kknechtel · January 30, 2024, 6:19am

I don’t think I understand the background motivation for this. Surely anyone who gained access to hijack a package and release a malicious version, would also be able to forge the attestation?

woodruffw · January 30, 2024, 3:13pm

Not necessarily: the attestations in question are digitally signed, meaning that any attacker seeking to forge them would also need to possess the appropriate private key material (historically something like a PGP private key, but this proposal is intentionally agnostic so that the index can try newer, more modern schemes like Sigstore).

There are more powerful adversary scenarios in which the attacker also possesses the private key material, e.g. an attacker with access to your keyring (again, assuming self-held keys). But that’s a “game over” scenario, versus more common (and weaker) adversaries. In other words: the idea is to raise the ecosystem’s “baseline” attacker sophistication from relatively unsophisticated (opportunistic theft of API tokens) to relatively sophisticated (theft of key material or identity).

Separately, another motivation for this proposal is provenance: Python distributions currently contain unauthenticated metadata about their source repository, etc. Some of the attestations we have in mind would build on top of Trusted Publishing (the thread linked above has more context on that), which would effectively allow the index (and downstream users) to verify both the metadata’s authenticity and that the package actually comes from the repository that claims to publish it.

How that actually works requires a bit of in-depth explaining on how Trusted Publishing works, which I’m happy to do in this thread or elsewhere if there’s interest. But as a “black box,” you can think of it as “packages can be bound to the repository that publishes them in a public, cryptographically verifiable way.”

Do you think the PEP would benefit from additional language around the background motivation? I tried to keep it somewhat brief to avoid getting into the details of different attestation formats, but I can definitely include some of the above if you think it helps motivate the ideas better.

adamsilkey · January 30, 2024, 3:43pm

I definitely do, especially given the question was already asked.

woodruffw · January 30, 2024, 4:53pm

Thanks; I’ve opened PEP 740: initial feedback by woodruffw · Pull Request #3637 · python/peps · GitHub for the feedback so far.

dustin · February 21, 2024, 10:36pm

I reviewed the PEP; my feedback is below, with quotes from the PEP:

Consistent release attestations: if a file belonging to a release has a set of digital attestations, then all of the other files belonging to that release should also have the same types of attestations.

I see the goal here but I think this is overly restrictive, especially in the early stages here where users might not have the ability to generate attestations for all files in the release. I think it also complicates (or prevents) adding attestations to releases after an initial upload, as it assumes artifacts + attestations will always be uploaded in tandem. The imagined use case here is for allowing third-party attestations.

I think ultimately this consistency would be checked at verification-time by installers when evaluating a policy (which, at the simplest level, should just reject artifacts without attestations) and so we don’t need an index to enforce this across the board. I think it could be an optional feature that projects could enable if they wanted (to further restrict what they are able to publish) but it shouldn’t be expected of all users.

Each attestation value MUST be verifiable by the index. If the index fails to verify any attestation in attestations, it MUST reject the upload.

I think this PEP should probably go into a lot more detail about what “verifiable by the index” means in this context. What specific steps should the index take to verify the attestation?

The JSON object SHALL have one or more keys, each identifying an attestation format known to the index. If any key does not identify an attestation format known to the index, the index MUST reject the upload.

I’m concerned about the index having to manage a quantity of different attestation formats, and have PyPI become the arbiter of essentially a namespace for the attestation formats, which are otherwise unstandardized.

I suspect it might be easier for everyone if we say attestations are a consistent format (like an RFC 8785 JSON document) with specific fields across all of them (like a name and digest), and otherwise allow the attestations to take any form that falls within those restrictions.

When data-provenance is true, the index MUST serve a provenance object at the same URL, but with .provenance appended to it. For example, if HolyGrail-1.0.tar.gz exists and has associated attestations, those attestations would be located within the provenance object hosted at HolyGrail-1.0.tar.gz.provenance.

There is a bit of an unofficial policy that we only serve files that are immutable (artifacts and now .metadata for those artifacts) and only data served via API can be mutable (yanked status, vulnerabilities, etc).

Given that I think we want to make it possible for provenance objects to be mutable (i.e., to accept additional attestations after the initial upload has happened), I think that means we shouldn’t require serving a .provenance file from the index, and find a different way to provide provenance via the Simple API (maybe repeated data-provenance attributes with links to immutable attestation files?)

alex_Gaynor · February 22, 2024, 1:46am

Is it actually valuable to specify the purely mechanical elements of where in a JSON payload a signature goes, without defining the broader threat model, how installers are expected to process data, or anything else?

woodruffw · February 22, 2024, 10:43pm

Makes sense; I can loosen the language here!

I’ll add this to the PEP, but to sketch here:

1. The attestation must carry a digital signature; the index must verify that signature as part of verifying the attestation.
2. The attestation itself must be “consistent” with the distribution that it’s attesting to, i.e. must bind the distribution’s name and the distribution’s content (via a strong hash).

For (1), this verification implies that PyPI possesses the public key material (or equivalent, e.g. machine identity) needed to verify the signature. This is assumed as part of the PEP, since Trusted Publishing provides that material for the machine identity case (and future work for direct key usage is left open as a possibility).
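As a rough illustration of the consistency check in (2), here is a minimal Python sketch. It is not the PEP's verification procedure: the function name and parameters are hypothetical, SHA-256 is assumed as the "strong hash", and the signature check from (1) is left out entirely because it needs key material and a cryptography library.

```python
import hashlib


def is_consistent(attested_name: str, attested_sha256: str,
                  filename: str, content: bytes) -> bool:
    """Return True if the attestation's claims match the uploaded file.

    Hypothetical helper: only the name/digest binding is checked here.
    Signature verification would have to happen before this step.
    """
    actual_sha256 = hashlib.sha256(content).hexdigest()
    return attested_name == filename and attested_sha256 == actual_sha256
```

An index following this model would reject the upload whenever the check returns False, since the attestation would then be vouching for some other file.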

This is a fair point – the key/identifying format thing is a hack that I wasn’t super happy with. I think that, rather than having a whole bunch of different attestation formats and a bespoke namespace for distinguishing them, we can assert the following:

Every attestation is over just the distribution name and its cryptographic digest, in some canonical format (e.g. RFC 8785 JSON)
The verification materials (signature, etc.) for the attestation are encoded in a JSON bundle format that supports both X.509 certificates and bare keys (Sigstore’s bundle format meets this requirement, but we could pare it down to a smaller format).
The meaning of a given attestation is defined at the policy layer, rather than encoded in a bespoke namespace here. In practice, this means that a “publish” attestation will be identified by the fact that it’s signed by the Trusted Publisher identity.
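To make the first point concrete, here is a hedged sketch of what a canonical payload over just the name and digest could look like. The field names `distribution` and `digest` and the function name are illustrative assumptions. The `json.dumps` settings only approximate RFC 8785 for simple payloads like this one (ASCII strings, no numbers); this is not a full canonicalization implementation.

```python
import hashlib
import json


def attestation_payload(distribution: str, content: bytes) -> bytes:
    """Build a deterministic JSON payload binding a name to a SHA-256 digest."""
    body = {
        "distribution": distribution,
        "digest": hashlib.sha256(content).hexdigest(),
    }
    # Sorted keys and no insignificant whitespace give a stable byte string,
    # so signer and verifier can each reconstruct exactly the same bytes.
    return json.dumps(
        body, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
```

Because the payload can be rebuilt from the file alone, a verifier never has to trust a payload supplied by the uploader; it only needs the signature and the verification materials.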

Makes sense to me! I’ll try to find a cite for this, but I think repeated data-provenance on the same HTML element isn’t valid in HTML5 (each data-* attribute needs a unique name per-element). We could number them or similar (data-provenance-0, …), but that feels pretty hacky.

How do you feel about only serving provenance via the Simple JSON API? That would sidestep the format woes, but I’m not sure if maintaining parallels between the two APIs is important here (vulnerabilities, etc. only exist in the JSON API, so there’s some precedent).

I think there’s some value in a purely mechanical PEP here: on one level, this PEP can be seen as a modern replacement for the previous practice of YOLOing PGP-signature-shaped text blobs onto the index. Under that thought, the goal for this PEP is not to define a concrete threat model, but just to expose tools for putting a new type of thing on the index (where that type of thing happens to eventually be a building block for a new set of security properties for the index).

(As-is, I think it’s hard to form a coherent threat model around “same-sourced” packages and attestations where the index is the sole source of trust: the other missing pieces here are a standard lockfile format and additional transparency mechanisms for the index itself.)

dustin · February 22, 2024, 11:00pm

Ah, I think you’re right. We could make data-provenance some sort of array (I think a JSON array would be valid?).

I think we should try to maintain parity. The JSON API isn’t standardized, so an installer like pip wouldn’t want to integrate against it, which would prevent verification downstream (also, hence https://discuss.python.org/t/draft-pep-adding-vulnerability-data-to-the-simple-api-for-package-indexes/)

pf_moore · February 22, 2024, 11:34pm

The JSON form of the simple index (which is what I think @woodruffw was referring to) is standardised (PEP 691). I don’t really have an opinion on whether it’s OK to have the two forms diverge over this data, though.

dustin · February 22, 2024, 11:53pm

Ah, you’re right, I missed that @woodruffw is conflating the unstandardized JSON API (which has vulnerability data) with the PEP 691 API (which doesn’t).

woodruffw · February 23, 2024, 12:01am

Yep, sorry for the confusion here! I thought the vulnerability data was in both the non-standard and PEP 691 APIs.

Stuffing JSON into a data attribute seems reasonable to me then (although it might need to be additionally encoded to escape quotes). The only risk from there is that it might end up being quite large, since the provenance will include an X.509 certificate for machine identities.
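For reference, the quote-escaping concern can be sketched with the Python standard library alone. The function name and the sample object are hypothetical; only the `data-provenance` attribute name comes from the discussion above.

```python
import html
import json


def provenance_attr(provenance: dict) -> str:
    """Render a JSON object as an HTML-safe data-provenance attribute."""
    encoded = json.dumps(provenance, separators=(",", ":"))
    # quote=True escapes double and single quotes as well as &, <, >,
    # so the JSON cannot terminate the attribute value early.
    return f'data-provenance="{html.escape(encoded, quote=True)}"'
```

A client would reverse this with `html.unescape` (HTML parsers do so automatically when reading attribute values) and then `json.loads`.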

woodruffw · February 26, 2024, 10:48pm

I’ve opened a draft PR with some of the feedback above here: PEP 740: Feedback, round 2 by woodruffw · Pull Request #3692 · python/peps · GitHub

(That doesn’t include the attestation verification steps yet. Once I get the general OK on these changes, I’ll add the verification section.)

woodruffw · April 24, 2024, 6:02pm

We’ve merged some feedback above, which I’ll summarize below:

We’ve increased the level of detail in the PEP around individual types and data layouts, including precise layouts for attestation objects (which encapsulate a digital signature for each release file) and provenance objects (which encapsulate attestation objects along with their verification materials).
We’ve resolved the challenge of embedding a large provenance object in the simple index by instead embedding the provenance object’s SHA256 hash, which can then be discovered via a derive-able URL.
We’ve added additional context to the security implications, including a brief discussion of cryptographic agility (via versioning).
We’ve added notes on future extensions to enable signing with identities other than Trusted Publishers, e.g. for signing with maintainer-held private keys.
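The hash-embedding approach in the second point can be sketched roughly as follows. This assumes, following the `.provenance` convention quoted earlier in the thread, that the URL is derived by appending `.provenance` to the file URL; the function names and the example host are hypothetical, and the PEP itself is the authority on the real layout.

```python
import hashlib


def provenance_reference(file_url: str, provenance_bytes: bytes) -> tuple[str, str]:
    """Index side: derive the provenance URL and the digest to embed."""
    digest = hashlib.sha256(provenance_bytes).hexdigest()
    return f"{file_url}.provenance", digest


def check_provenance(fetched: bytes, expected_digest: str) -> bool:
    """Client side: confirm the fetched provenance matches the embedded digest."""
    return hashlib.sha256(fetched).hexdigest() == expected_digest
```

For example, `provenance_reference("https://files.example.org/HolyGrail-1.0.tar.gz", blob)` yields the adjacent `.provenance` URL plus a 64-character hex digest, which is far smaller than the provenance object it stands in for.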

woodruffw · April 24, 2024, 8:28pm

The feedback above, while merged today, has not changed since 3 weeks ago. So I’m hereby requesting @dstufft’s approval of this PEP.

woodruffw · May 1, 2024, 6:56pm

Update: @dstufft and I had a call earlier today to talk through some of the specifics in the PEP, and he pointed out that the current approach of embedding the provenance JSON into each file listing in the simple JSON API may scale poorly if (1) the attestations are large, (2) there are a lot of attestations per file, (3) there are a lot of files listed, or (4) all of the above.

So, I’ve done a bit of informal analysis using the numbers he gave me:

First, a typical attestation will be approximately 5.3KB of JSON. This number comes from the example attestation we built for initial testing purposes.
Initially, we expect to see 1 attestation per file per release per project, corresponding to the “publish” attestation that gets verified against the Trusted Publisher. Conservatively, we’ll estimate that PyPI may eventually host 3 attestations per file (one “publish”, one “build”, and one “third-party” attestation).
The current average number of files per project is ~21 [1].

Given those numbers, we might reasonably expect a future average project to have 60-70 attestations, or ~318 KB (60 × 5.3 KB) of attestation JSON in its PEP 691 “project detail” endpoint. That’s a lot of JSON to push down the pipe, especially since we expect an installing client like pip to potentially only need/access a small fraction of all releases and their attestations.
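The back-of-the-envelope arithmetic can be reproduced as below. The inputs are the figures quoted in this post; the function name is made up for illustration.

```python
def estimate_attestation_json(files_per_project: int,
                              attestations_per_file: int,
                              kb_per_attestation: float) -> tuple[int, float]:
    """Return (attestation count, total KB) for an average project."""
    count = files_per_project * attestations_per_file
    return count, count * kb_per_attestation


count, total_kb = estimate_attestation_json(21, 3, 5.3)
print(f"{count} attestations, about {total_kb:.1f} KB")  # 63 attestations, about 333.9 KB
print(f"low end: about {60 * 5.3:.1f} KB")               # low end: about 318.0 KB
```

So the ~318 KB figure corresponds to the low end (60) of the 60-70 range; the straight 21 × 3 product lands a little higher.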

Given the above, I’m going to change the PEP so that the suggested JSON API change does not embed the entire provenance object. Instead, the JSON API will behave like the simple index API and embed the digest of the provenance object, which can then be retrieved on-demand from an adjacent .provenance URL.

I’ll make the PR for that in a bit, along with a new appendix section summarizing the numbers above as rationale.

[1] Queried by Donald.

woodruffw · May 16, 2024, 3:45pm

I’ve been thinking some more about aspects of PEP 740’s design, and I wanted to document some of them here to solicit feedback.

Right now, the attestation “payload” in PEP 740 is a fixed, canonicalized JSON body. This is relatively simple to implement and has some desirable misuse-resistant properties (by binding payload reconstruction and signature validation into a single step), but also comes with downsides:

1. The current attestation payload includes the distribution filename (e.g. foo-1.2.3.tar.gz) to ensure domain separation. However, distribution filenames are nontrivial to normalize (see e.g. PEP 625 for sdist names), and even when normalized (for parsing purposes) are still malleable (e.g. due to multiple spellings of PEP 440 version qualifiers). Bottom line: including the distribution filename as-is from the build backend is potentially risky, since build backends have a decent degree of freedom in filename structure. That means more normalization work for the signing step that needs to be kept synchronized with the larger packaging ecosystem.
2. The current attestation payload is “bespoke,” in the sense that it isn’t an in-toto statement (and intentionally does not allow unbounded metadata, to ensure that it can be reproduced from just the distribution filename + digest). This is simple, but it also means that different kinds of attestations are not easily encoded in the format itself: PyPI and future downstream consumers will need to make contextual decisions to determine the “kind” of attestation(s) attached to a distribution, which is not ideal.

For (1), the solution is potentially just more extensive structuring and normalization: we could use packaging during attestation payload generation to parse distribution filenames and reject invalid ones, and then decompose them into structured data rather than strings. For example, the current attestation payload:

{"digest":"some-hash","distribution":"foo-1.2.3.tar.gz"}

could become (roughly):

{"digest":"some-hash","distribution":{"name":"foo","version":"1.2.3","type":"sdist"}}

(This would not affect attestation size at all, since it becomes a hash like the previous format.)
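A minimal sketch of that decomposition for sdists, using only the standard library, might look like the following. It is a simplified stand-in: a real implementation would more likely lean on `packaging.utils.parse_sdist_filename`, which also validates the version against PEP 440, whereas this sketch only splits the name from the version and applies PEP 503 style name normalization. The function name is hypothetical.

```python
import json
import re


def structured_payload(digest: str, filename: str) -> str:
    """Turn an sdist filename into a structured, canonical-ish payload."""
    if not filename.endswith(".tar.gz"):
        raise ValueError(f"not an sdist filename: {filename!r}")
    stem = filename[: -len(".tar.gz")]
    name, sep, version = stem.rpartition("-")
    if not (sep and name and version):
        raise ValueError(f"cannot split name and version: {filename!r}")
    # PEP 503 style normalization: collapse runs of -, _, . and lowercase.
    normalized = re.sub(r"[-_.]+", "-", name).lower()
    body = {
        "digest": digest,
        "distribution": {"name": normalized, "version": version, "type": "sdist"},
    }
    return json.dumps(body, sort_keys=True, separators=(",", ":"))
```

With sorted keys, `structured_payload("some-hash", "foo-1.2.3.tar.gz")` produces the same fields as the rough example above, with the inner keys in alphabetical order.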

For (2), a more general solution is probably best: we want some way to encode attestation intent (and associated metadata), and the current format (and expectation of exact hash consistency) is too strict for that.

We’ve made a lot of progress on implementing DSSE + in-toto support in sigstore-python and, given that, I’m tempted to revisit the feasibility of using in-toto statements (along with appropriate predicates, like the release predicate) for the attestation payload. This will change verification from a “reconstruct the payload and verify” model to a “verify the given payload and check it for consistency” model, but I think that’s an acceptable tradeoff.

The main “con” of this approach is attestation size: the attestation will now contain a full JSON payload. We can prevent that payload from becoming unbounded by only allowing certain predicates (plus limiting acceptance size on PyPI itself, as a backstop), but it’ll still be larger than a single signature. On the other hand, the size of the attestation is mostly dominated by the X.509 certificate and other verification materials anyway, so there’s an argument that a few dozen bytes of extra JSON doesn’t matter all that much.

I’m curious what people think about this – I’ll also be at PyCon to discuss IRL, for those who’ll be there.

CC @facutuesca @sethmlarson @dstufft

woodruffw · June 5, 2024, 9:04pm

I’m back from PyCon and some travel, and I did some more thinking about the canonicalization/normalization and attestation “kind” problems, and have tweaked my open PR to accommodate both:

The PEP now more strongly asserts the normalization of distribution filenames, saying that they must be fully normalized and consistent with the living specs for their respective type (sdist and wheel). In practice, this means that there will always be a single normal form for a distribution filename, making it suitable for use in the attestation.
The attestation payload is now an in-toto statement and is signed over using DSSE, rather than a fixed payload + bare signature. This makes the attestation object itself slightly larger, but gives us the flexibility we need to encode different “kinds” of attestations without requiring a major future revision.
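For readers unfamiliar with the two formats, here is a hedged sketch of what "an in-toto statement signed over using DSSE" involves at the byte level. The statement layout and the pre-authentication encoding (PAE) follow the public in-toto Statement v1 and DSSE specifications; the predicate type URL in the usage note is a placeholder, not the one the PEP defines, and the actual signing step is omitted.

```python
import hashlib
import json


def in_toto_statement(filename: str, content: bytes,
                      predicate_type: str, predicate=None) -> dict:
    """Build an in-toto v1 Statement whose subject is one distribution file."""
    return {
        "_type": "https://in-toto.io/Statement/v1",
        "subject": [
            {"name": filename,
             "digest": {"sha256": hashlib.sha256(content).hexdigest()}}
        ],
        "predicateType": predicate_type,
        "predicate": predicate or {},
    }


def dsse_pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: the bytes that actually get signed."""
    type_bytes = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (
        len(type_bytes), type_bytes, len(payload), payload
    )
```

A signer would serialize the statement with `json.dumps(...).encode()`, wrap it with `dsse_pae("application/vnd.in-toto+json", payload)`, and sign the result. The verifier checks the signature over the same encoding and then compares the statement's subject name and digest against the file, which is the "verify the given payload and check it for consistency" model. A placeholder predicate type such as `"https://example.org/attestations/publish/v1"` can be passed when experimenting.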