I’ve been thinking some more about aspects of PEP 740’s design, and I wanted to document some of them here to solicit feedback.
Right now, the attestation “payload” in PEP 740 is a fixed, canonicalized JSON body. This is relatively simple to implement and has some desirable misuse-resistant properties (by binding payload reconstruction and signature validation into a single step), but also comes with downsides:
- The current attestation payload includes the distribution filename (e.g. `foo-1.2.3.tar.gz`) to ensure domain separation. However, distribution filenames are nontrivial to normalize (see e.g. PEP 625 for sdist names), and even when normalized (for parsing purposes) are still malleable (e.g. due to multiple spellings of PEP 440 version qualifiers). Bottom line: including the distribution filename as is from the build backend is potentially risky, since build backends have a decent degree of freedom in filename structure. That means more normalization work for the signing step that needs to be kept synchronized with the larger packaging ecosystem.
- The current attestation payload is “bespoke,” in the sense that it isn’t an in-toto statement (and intentionally does not allow unbounded metadata, to ensure that it can be reproduced from just the distribution filename + digest). This is simple, but it also means that different kinds of attestations are not easily encoded in the format itself: PyPI and future downstream consumers will need to make contextual decisions to determine the “kind” of attestation(s) attached to a distribution, which is not ideal.
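To make the malleability concern concrete, here is a minimal sketch of the "reconstruct the payload, then verify" model. It assumes a simple sorted-key, whitespace-free JSON encoding as a stand-in for the PEP's canonicalization, and the helper names (`canonical_payload`, `payload_hash`) are hypothetical, not part of any real API.

```python
import hashlib
import json


def canonical_payload(filename: str, digest: str) -> bytes:
    """Serialize the fixed payload deterministically (sorted keys, no whitespace).

    Assumption: this compact encoding stands in for the real canonicalization.
    """
    body = {"digest": digest, "distribution": filename}
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def payload_hash(filename: str, digest: str) -> str:
    """SHA-256 of the canonical payload, i.e. what a signature would cover."""
    return hashlib.sha256(canonical_payload(filename, digest)).hexdigest()


# PEP 440 treats "1.0rc1" and "1.0.rc1" as the same version, but as raw
# filename strings they reconstruct to different payloads.
signed = payload_hash("foo-1.0rc1.tar.gz", "some-hash")
rebuilt = payload_hash("foo-1.0.rc1.tar.gz", "some-hash")
print(signed == rebuilt)  # False
```

Because the verifier rebuilds the payload from the filename it sees, any respelling between signing and verification would make an otherwise valid signature fail to verify.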
For (1), the solution is potentially just more extensive structuring and normalization: we could use `packaging` during attestation payload generation to parse distribution filenames and reject invalid ones, and then decompose them into structured data rather than strings. For example, the current attestation payload:
{"digest":"some-hash","distribution":"foo-1.2.3.tar.gz"}
could become (roughly):
{"digest":"some-hash","distribution":{"name":"foo","version":"1.2.3","type":"sdist"}}
(This would not affect attestation size at all, since the payload is still reduced to a hash, just as in the current format.)
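A rough sketch of what that structuring step could look like for sdists, using `parse_sdist_filename` from `packaging.utils`. The `structured_payload` helper and the exact key layout are assumptions for illustration only; wheels would need similar handling (e.g. via `parse_wheel_filename`) plus a decision about how to represent their tags.

```python
import json

from packaging.utils import InvalidSdistFilename, parse_sdist_filename


def structured_payload(filename: str, digest: str) -> bytes:
    """Parse an sdist filename and emit a structured, normalized payload.

    Raises InvalidSdistFilename if the filename cannot be parsed.
    """
    name, version = parse_sdist_filename(filename)
    body = {
        "digest": digest,
        "distribution": {
            "name": name,  # normalized project name
            "version": str(version),  # normalized PEP 440 spelling
            "type": "sdist",
        },
    }
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


# Different spellings of the same release collapse to one payload.
a = structured_payload("foo-1.0rc1.tar.gz", "some-hash")
b = structured_payload("Foo-1.0.rc1.tar.gz", "some-hash")
print(a == b)  # True

# Filenames that do not parse are rejected up front.
try:
    structured_payload("foo.tar.gz", "some-hash")
except InvalidSdistFilename as exc:
    print(f"rejected: {exc}")
```

The tradeoff is the one noted above: the signing step now depends on `packaging`'s parsing rules, so signer and verifier need to agree on them.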
For (2), a more general solution is probably best: we want some way to encode attestation intent (and associated metadata), and the current format (and expectation of exact hash consistency) is too strict for that.
We’ve made a lot of progress on implementing DSSE + in-toto support in sigstore-python and, given that, I’m tempted to revisit the feasibility of using in-toto statements (along with appropriate predicates, like the release predicate) for the attestation payload. This will change verification from a “reconstruct the payload and verify” model to a “verify the given payload and check it for consistency” model, but I think that’s an acceptable tradeoff.
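As a sketch of the "verify the given payload and check it for consistency" model: assume the signature over the payload has already been verified elsewhere (that step is out of scope here), and what remains is checking that the statement actually describes the distribution in hand. The field names follow the in-toto Statement v1 layout; the `check_statement` helper, the exception type, and the predicate URI in the allowlist are hypothetical placeholders.

```python
import json

STATEMENT_TYPE = "https://in-toto.io/Statement/v1"
# Hypothetical allowlist with a placeholder URI; a real index would list the
# predicate types it has chosen to accept.
ALLOWED_PREDICATES = {"https://example.com/predicates/release/v1"}


class InconsistentStatement(Exception):
    """The (already signature-verified) payload does not match expectations."""


def check_statement(payload: bytes, filename: str, sha256_hex: str) -> dict:
    """Check a verified in-toto statement against a distribution's name and digest."""
    statement = json.loads(payload)
    if not isinstance(statement, dict) or statement.get("_type") != STATEMENT_TYPE:
        raise InconsistentStatement("not an in-toto v1 statement")
    if statement.get("predicateType") not in ALLOWED_PREDICATES:
        raise InconsistentStatement("predicate type is not allowed")
    for subject in statement.get("subject", []):
        digest = subject.get("digest", {}).get("sha256")
        if subject.get("name") == filename and digest == sha256_hex:
            return statement
    raise InconsistentStatement("no subject matches the distribution")


example = json.dumps(
    {
        "_type": STATEMENT_TYPE,
        "subject": [{"name": "foo-1.2.3.tar.gz", "digest": {"sha256": "abc123"}}],
        "predicateType": "https://example.com/predicates/release/v1",
        "predicate": {},
    }
).encode("utf-8")

statement = check_statement(example, "foo-1.2.3.tar.gz", "abc123")
print(statement["predicateType"])
```

Unlike the reconstruction model, nothing forces a verifier to run these checks after the signature passes, which is the misuse-resistance being traded away.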
The main “con” of this approach is attestation size: the attestation will now contain a full JSON payload. We can prevent that payload from becoming unbounded by only allowing certain predicates (plus limiting acceptance size on PyPI itself, as a backstop), but it’ll still be larger than a single signature. On the other hand, the size of the attestation is mostly dominated by the X.509 certificate and other verification materials anyways, so there’s an argument that a few dozen bytes of extra JSON doesn’t matter all that much.
I’m curious what people think about this – I’ll also be at PyCon to discuss IRL, for those who’ll be there.