Previous efforts to host digital signatures on the index have been largely ad-hoc and not subject to any constraints or invariants other than “there might be a .asc file adjacent to the distribution”. This PEP proposes a structured attestation storage and presentation scheme, and also provides for stronger invariants between release files (if a release file comes with an attestation, all other files in the release must have similarly typed attestations).
This PEP is intentionally agnostic towards the set of attestation formats, prescribing only that they need to be:
Uniquely identified with human-readable identifiers
Verifiable by the index itself
This is done to prevent compatibility or longevity risks: the expectation with this PEP is that, upon acceptance, PyPA will standardize one or more attestation formats as part of the PyPA Specifications, which will then form the initial set of attestation formats accepted by PyPI.
Summary of the proposed changes:
When uploading release files, each file may be accompanied by an attestations JSON blob whose key-value pairs map an attestation type to an attestation object. A contrived example of this is provided in the draft PEP.
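To make the shape of that blob concrete, here is a minimal sketch. The attestation-type identifiers and the fields inside each object are placeholders invented for illustration; the draft PEP deliberately does not fix any concrete format.

```python
import json

# Hypothetical attestation-type identifiers mapped to placeholder objects.
# Neither the identifiers nor the inner fields come from the draft PEP.
attestations = {
    "example-format-a": {
        "signature": "<base64 signature>",
        "certificate": "<base64 certificate>",
    },
    "example-format-b": {"signature": "<base64 signature>"},
}

# The blob accompanying a release file is the JSON serialization of the mapping.
attestations_blob = json.dumps(attestations)

# Decoding yields a JSON object keyed by attestation type.
decoded = json.loads(attestations_blob)
attestation_types = sorted(decoded)
```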
The simple index (PEP 503) and simple JSON API (PEP 691) will both serve these uploaded attestations, as part of a larger “provenance” object that also contains Trusted Publisher metadata. A contrived example of this is also provided in the draft PEP.
I look forward to all feedback here! And thanks, in advance, to everybody who comments below.
Not necessarily: the attestations in question are digitally signed, meaning that any attacker seeking to forge them would also need to possess the appropriate private key material (historically something like a PGP private key, but this proposal is intentionally agnostic so that the index can try newer, more modern schemes like Sigstore).
There are more powerful adversary scenarios in which the attacker also possesses the private key material, e.g. an attacker with access to your keyring (again, assuming self-held keys). But that’s a “game over” scenario, versus more common (and weaker) adversaries. In other words: the idea is to raise the ecosystem’s “baseline” attacker sophistication from relatively unsophisticated (opportunistic theft of API tokens) to relatively sophisticated (theft of key material or identity).
Separately, another motivation for this proposal is provenance: Python distributions currently contain unauthenticated metadata about their source repository, etc. Some of the attestations we have in mind would build on top of Trusted Publishing (the thread linked above has more context on that), which would effectively allow the index (and downstream users) to verify both the metadata’s authenticity and that the package actually comes from the repository that claims to publish it.
How that actually works requires a somewhat in-depth explanation of how Trusted Publishing works, which I’m happy to give in this thread or elsewhere if there’s interest. But as a “black box,” you can think of it as “packages can be bound to the repository that publishes them in a public, cryptographically verifiable way.”
Do you think the PEP would benefit from additional language around the background motivation? I tried to keep it somewhat brief to avoid getting into the details of different attestation formats, but I can definitely include some of the above if you think it helps motivate the ideas better.
I reviewed the PEP; my feedback is below, with quotes from the PEP:
Consistent release attestations: if a file belonging to a release has a set of digital attestations, then all of the other files belonging to that release should also have the same types of attestations.
I see the goal here but I think this is overly restrictive, especially in the early stages here where users might not have the ability to generate attestations for all files in the release. I think it also complicates (or prevents) adding attestations to releases after an initial upload, as it assumes artifacts + attestations will always be uploaded in tandem. The imagined use case here is for allowing third-party attestations.
I think ultimately this consistency would be checked at verification-time by installers when evaluating a policy (which, at the simplest level, should just reject artifacts without attestations) and so we don’t need an index to enforce this across the board. I think it could be an optional feature that projects could enable if they wanted (to further restrict what they are able to publish) but it shouldn’t be expected of all users.
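Wherever the check ends up living (index or installer), the invariant itself is small. A rough sketch, assuming a release is represented as a mapping from filename to the set of attestation-type identifiers attached to that file (this representation and the function name are mine, not the PEP’s):

```python
def release_is_consistent(release: dict[str, set[str]]) -> bool:
    """Return True if every file in the release has the same attestation types.

    `release` maps a filename to the set of attestation-type identifiers
    attached to it; an empty set means the file has no attestations.
    """
    type_sets = list(release.values())
    # An empty release is trivially consistent: all() over nothing is True.
    return all(types == type_sets[0] for types in type_sets)


# Hypothetical filenames and attestation types, for illustration only.
consistent = release_is_consistent(
    {"HolyGrail-1.0.tar.gz": {"publish"}, "HolyGrail-1.0-py3-none-any.whl": {"publish"}}
)
inconsistent = release_is_consistent(
    {"HolyGrail-1.0.tar.gz": {"publish"}, "HolyGrail-1.0-py3-none-any.whl": set()}
)
```

Note that the second case is exactly the “attestation added to only one file” situation that strict index-side enforcement would reject.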
Each attestation value MUST be verifiable by the index. If the index fails to verify any attestation in attestations, it MUST reject the upload.
I think this PEP should probably go into a lot more detail about what “verifiable by the index” means in this context. What specific steps should the index take to verify the attestation?
The JSON object SHALL have one or more keys, each identifying an attestation format known to the index. If any key does not identify an attestation format known to the index, the index MUST reject the upload.
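Taken together, the two quoted requirements amount to something like the following upload-time check. This is only a sketch: the verifier registry, the exception name, and the placeholder verifier are assumptions, and the placeholder performs no real cryptographic verification.

```python
from typing import Callable


class UploadRejected(Exception):
    """Raised when the attestations blob fails validation (hypothetical name)."""


# A verifier takes the attestation object and returns True if it verifies.
Verifier = Callable[[dict], bool]


def validate_attestations(attestations: dict, verifiers: dict[str, Verifier]) -> None:
    # The JSON object must have one or more keys.
    if not isinstance(attestations, dict) or not attestations:
        raise UploadRejected("attestations must be a JSON object with at least one key")
    for fmt, obj in attestations.items():
        verifier = verifiers.get(fmt)
        # Any key that does not identify a known format rejects the upload.
        if verifier is None:
            raise UploadRejected(f"unknown attestation format: {fmt!r}")
        # Any attestation that fails verification rejects the upload.
        if not verifier(obj):
            raise UploadRejected(f"attestation failed verification: {fmt!r}")


# Placeholder verifier standing in for real signature checking.
known_formats: dict[str, Verifier] = {
    "example-format": lambda obj: obj.get("ok") is True,
}
validate_attestations({"example-format": {"ok": True}}, known_formats)  # accepted
```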
I’m concerned about the index having to manage a large number of different attestation formats, and about PyPI becoming the arbiter of what is essentially a namespace for attestation formats that are otherwise unstandardized.
I suspect it might be easier for everyone if we say attestations are a consistent format (like an RFC 8785 JSON document) with specific fields across all of them (like a name and digest), and otherwise allow the attestations to take any form that falls within those restrictions.
When data-provenance is true, the index MUST serve a provenance object at the same URL, but with .provenance appended to it. For example, if HolyGrail-1.0.tar.gz exists and has associated attestations, those attestations would be located within the provenance object hosted at HolyGrail-1.0.tar.gz.provenance.
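For reference, deriving that URL on the client side could look like the sketch below. The host is made up, and dropping the `#sha256=...` fragment is my assumption (the fragment describes the distribution file, not the provenance object); the quoted text only says that `.provenance` is appended.

```python
from urllib.parse import urlsplit


def provenance_url(file_url: str) -> str:
    """Append `.provenance` to the path of a distribution file URL."""
    parts = urlsplit(file_url)
    # Assumption: the hash fragment belongs to the file itself, so drop it.
    return parts._replace(path=parts.path + ".provenance", fragment="").geturl()


# Hypothetical file URL as it might appear in a simple index page.
url = provenance_url(
    "https://files.example.org/packages/HolyGrail-1.0.tar.gz#sha256=abc123"
)
```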
There is a bit of an unofficial policy that we only serve files that are immutable (artifacts and now .metadata for those artifacts) and only data served via API can be mutable (yanked status, vulnerabilities, etc).
Given that I think we want to make it possible for provenance objects to be mutable (i.e., to accept additional attestations after the initial upload has happened), I think that means we shouldn’t require serving a .provenance file from the index, and should instead find a different way to provide provenance via the Simple API (maybe repeated data-provenance attributes with links to immutable attestation files?).
Is it actually valuable to specify the purely mechanical elements of where in a JSON payload a signature goes, without defining the broader threat model, how installers are expected to process data, or anything else?
1. The attestation must be digitally signed; the index must verify that signature as part of verifying the attestation.
2. The attestation itself must be “consistent” with the distribution that it’s attesting to, i.e. it must bind the distribution’s name and the distribution’s content (via a strong hash).
For (1), this verification implies that PyPI possesses the public key material (or equivalent, e.g. machine identity) needed to verify the signature. This is assumed as part of the PEP, since Trusted Publishing provides that material for the machine identity case (and future work for direct key usage is left open as a possibility).
This is a fair point: the key/identifying format thing is a hack that I wasn’t super happy with. I think that, rather than having a whole bunch of different attestation formats and a bespoke namespace for distinguishing them, we can assert the following:
Every attestation is over just the distribution name and its cryptographic digest, in some canonical format (e.g. RFC 8785 JSON)
The verification materials (signature, etc.) for the attestation are encoded in a JSON bundle format that supports both X.509 certificates and bare keys (Sigstore’s bundle format meets this requirement, but we could pare it down to a smaller format).
The meaning of a given attestation is defined at the policy layer, rather than encoded in a bespoke namespace here. In practice, this means that a “publish” attestation will be identified by the fact that it’s signed by the Trusted Publisher identity.
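As a rough illustration of the first point, the signed statement could be as small as the sketch below. The field names (`name`, `digest`, `sha256`) are hypothetical, and `json.dumps` with sorted keys and compact separators only approximates RFC 8785 for simple objects like this one; it is not a full canonicalization implementation. Signing and the bundle format are out of scope here.

```python
import hashlib
import json


def make_statement(name: str, content: bytes) -> bytes:
    """Build the bytes that would be signed: distribution name plus digest."""
    statement = {
        "name": name,
        "digest": {"sha256": hashlib.sha256(content).hexdigest()},
    }
    # Sorted keys and no insignificant whitespace give a stable byte string
    # for this simple object (an approximation of RFC 8785, not the real thing).
    return json.dumps(
        statement, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def statement_matches(statement: bytes, name: str, content: bytes) -> bool:
    """Check that a statement binds exactly this name and this content."""
    return statement == make_statement(name, content)


# Hypothetical distribution name and contents.
stmt = make_statement("HolyGrail", b"distribution bytes")
```

Because the statement is the same for every attestation, what a given attestation “means” is determined entirely by who signed it, which is the policy-layer point above.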
Makes sense to me! I’ll try to find a cite for this, but I think repeated data-provenance on the same HTML element isn’t valid in HTML5 (each data-* attribute needs a unique name per-element). We could number them or similar (data-provenance-0, …), but that feels pretty hacky.
How do you feel about only serving provenance via the Simple JSON API? That would sidestep the format woes, but I’m not sure if maintaining parallels between the two APIs is important here (vulnerabilities, etc. only exist in the JSON API, so there’s some precedent).
I think there’s some value in a purely mechanical PEP here: on one level, this PEP can be seen as a modern replacement for the previous practice of YOLOing PGP-signature-shaped text blobs onto the index. Under that thought, the goal for this PEP is not to define a concrete threat model, but just to expose tools for putting a new type of thing on the index (where that type of thing happens to eventually be a building block for a new set of security properties for the index).
(As-is, I think it’s hard to form a coherent threat model around “same-sourced” packages and attestations where the index is the sole source of trust: the other missing pieces here are a standard lockfile format and additional transparency mechanisms for the index itself.)
The JSON form of the simple index (which is what I think @woodruffw was referring to) is standardised (PEP 691). I don’t really have an opinion on whether it’s OK to have the two forms diverge over this data, though.
Yep, sorry for the confusion here! I thought the vulnerability data was in both the non-standard and PEP 691 APIs.
Stuffing JSON into a data attribute seems reasonable to me then (although it might need to be additionally encoded to escape quotes). The only risk from there is that it might end up being quite large, since the provenance will include an X.509 certificate for machine identities.
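A quick sketch of what the escaping would involve, using only the standard library. The `data-provenance` payload shape and the example values are made up for illustration; the point is just that HTML-escaping the serialized JSON round-trips cleanly through an HTML parser.

```python
import html
import json
from html.parser import HTMLParser


def anchor_with_provenance(href: str, text: str, provenance: dict) -> str:
    """Render an anchor whose data-provenance attribute holds escaped JSON."""
    payload = json.dumps(provenance, separators=(",", ":"))
    return (
        f'<a href="{html.escape(href, quote=True)}" '
        f'data-provenance="{html.escape(payload, quote=True)}">'
        f"{html.escape(text)}</a>"
    )


class ProvenanceExtractor(HTMLParser):
    """Collect decoded data-provenance payloads from anchors."""

    def __init__(self) -> None:
        super().__init__()
        self.payloads: list[dict] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser has already unescaped attribute values at this point.
        for key, value in attrs:
            if tag == "a" and key == "data-provenance" and value is not None:
                self.payloads.append(json.loads(value))


# Hypothetical provenance payload with characters that need escaping.
provenance = {"publisher": "example", "note": 'quotes " and <angles> & more'}
anchor = anchor_with_provenance("HolyGrail-1.0.tar.gz", "HolyGrail-1.0.tar.gz", provenance)

extractor = ProvenanceExtractor()
extractor.feed(anchor)
extractor.close()
```

The size concern still applies: a certificate-bearing payload would be repeated once per file on the project page.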