PEP 694: Upload 2.0 API for Python Package Repositories

That section you’re quoting describes the status quo API; the later sections describe what this PEP proposes.

1 Like

For additional related data, the question would be whether it’s a large binary blob that needs to be uploaded, or whether it’s metadata that can be sent in the initial request.

If it’s metadata, then it can just be added as an extra key on the request that creates a file upload, which would be sent just prior to uploading the file itself.

If it’s a big binary blob that needs to be uploaded independently, then we would need to add a way to upload related files that aren’t distributions.

My bad!

The sizes here should be approximately 1.5 KB per certificate and ~150 bytes per signature (base64-encoded). Do you think that’s small enough to make sense as extra keys?

Edit: For context, sigstore’s dogfooded signatures and certificates can be seen here (with their sizes): Release 0.6.2 · sigstore/sigstore-python · GitHub

What’s the relation between certificates, signatures, and files?

1 file = 1 signature = 1 certificate? Or something else?

Yep, each file has exactly one signature and certificate.

Edit: In principle, a file can be signed many times and have correspondingly many certificates. But I think the data model can be 1-1 for PyPI’s purposes. Or maybe not, and that’s perhaps something we should discuss…

That’s probably fine to live as metadata I think.

It’s intended that the entire METADATA file can be submitted as, uh, metadata. I don’t have exact figures readily available for how big those are, but the largest part of them is typically the project README, and the average project README on PyPI is just under 4 KB, while the largest one is 7.2 MB (yikes).

2 Likes

Gotcha! In that case, here are some proposals for the keys:

For the 1-1 case (if we go that route):

"sigstore": {"signature": "<base64 sig>", "certificate": "<PEM-encoded cert>"}

For the many case (if we go that route):

"sigstore": [{"signature": "<base64 sig>", "certificate": "<PEM-encoded cert>"}]

Those look fine to me, though I don’t think I’d add them to the PEP yet unless PyPI supports them already and I’ve just missed it.

That’s mostly because I think the question of what PyPI + Sigstore together means deserves its own consideration, and a look at the whole proposal of what that would entail. But I think that scheme would be fine for adding it to this proposed upload API, and it suggests that it wouldn’t be a large blocker to eventually adding sigstore.

3 Likes

Nope, PyPI doesn’t support them yet!

Makes sense. I think the plan is to rehydrate PEP 480 with some of those details.

3 Likes

I have a few comments about the document.

(1) I suggest adding something about the length of Upload-Token. For example the tus Internet Draft says:

A conforming implementation MUST be able to handle a Upload-Token field value of at least 128 octets.

(2) The Upload-Offset of 1001 in the example given in the document is incorrect.

As an example, if uploading a 100,000 byte file in 1000 byte chunks, and this chunk represents bytes 1001 through 2000, you would send headers like:

Upload-Offset is zero based, so the first 1000 bytes are at offsets 0 through 999, so byte 1001 should be at offset 1000 not 1001.
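To make the zero-based arithmetic concrete, here’s a quick sketch (the numbers match the PEP’s example; the code is just mine):

    # 100,000 byte file uploaded in 1,000 byte chunks, zero-based offsets.
    file_size = 100_000
    chunk_size = 1_000

    for offset in range(0, file_size, chunk_size):
        length = min(chunk_size, file_size - offset)
        # The chunk covering bytes 1001 through 2000 (counting from 1) is
        # the one sent with Upload-Offset: 1000, not 1001.
        print(f"Upload-Offset: {offset} ({length} bytes)")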

(3) The document doesn’t really describe what is at the draft URL, so it is hard for someone implementing a service to know what to provide, or for a client to know what to expect. There is the question of which files are available to download from that draft URL: (a) only the files from this session, (b) all the files for this package, (c) all draft files including ones from other packages, or (d) none of the above. More important is what API is implemented at that URL. Should the document specify that the repository at the draft URL must implement PEP 503? Or PEP 691?

(4) Are the upload, draft, and publish keys all 3 required in the urls key? Can a compliant server omit the draft key? The wording in the document currently says (emphasis added):

For the urls key, there are currently three keys that may appear:

3 Likes

Thanks!

That’s an easy add, sounds good.

Oops yea, I’ll fix that.

It’s intended to be a simple repository API that contains at least the files from the session. It is not required to have any additional files from that project or any other projects, but it may if an implementation wants to. So it’s PEP 503 + PEP 691, etc., depending on which repository specs/versions the repository is emitting.

Upload and publish would be mandatory; there’s an open question about whether draft should be mandatory or not. It’s simpler for end users if they can rely on it always existing, but if any non-PyPI implementations of this would rather not have to support it, I think making it optional would work too.

That sounds reasonable. I suggest adding a sentence or two to the PEP stating that.

The proposed API requires much more from service implementations than the current legacy upload API, so I suggest it is worth considering ways to reduce the amount of work required of services to be compliant with the PEP. Here are a few features I imagine could be made optional:

  1. Chunked uploads, resumable uploads, and cancelling an in-progress upload (those all go together, so I’d expect an implementation to either support all or none).
  2. Deleting an uploaded file - I’m not sure how often users will get into a situation where they need to delete a file from a session.
  3. Draft repository URL.

The implementation costs of 2 and 3 are relatively small compared to 1. So if anything is to be made optional in order to ease service implementations, I’d consider making chunked uploads, resumable uploads, and cancelling uploads optional. I realize the value of it for extremely large files, but as stated in the PEP “it is recommended that most clients should choose to upload each file as a single chunk as that requires fewer requests and typically has better performance.”

I think it is likely that other server implementations will host both the legacy upload API and this new one under the same root repository URL. It won’t be a problem for the service to distinguish legacy upload requests from upload 2.0 requests. But how will the client know which API the service supports? As a user, suppose I upgraded to a new version of twine that supports upload 2.0, and I have configured my ~/.pypirc with an index server repository URL pointing at an Artifactory or CodeArtifact repository. How will the client know which API to use? I like Florian’s proposal that the client first make an application/vnd.pypi.upload.v2+json request using the new API and, if that fails, fall back to the legacy API.
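Here’s a rough sketch of what that fallback could look like on the client side, assuming the server rejects the new content type with an error status (the exact status codes aren’t pinned down anywhere, so this is illustrative only):

    import json
    import requests

    UPLOAD_V2_CONTENT_TYPE = "application/vnd.pypi.upload.v2+json"

    def upload(repository_url, v2_payload, legacy_form_fields, files):
        # Try the Upload 2.0 API first.
        resp = requests.post(
            repository_url,
            data=json.dumps(v2_payload),
            headers={"Content-Type": UPLOAD_V2_CONTENT_TYPE},
        )
        if resp.ok:
            return resp
        # The server didn't accept it; fall back to the legacy form upload.
        return requests.post(repository_url, data=legacy_form_fields, files=files)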

2 Likes

And one other thought (sorry for multiple replies in a row). Would it be reasonable to specify a minimum chunk size and/or a maximum number of chunks? (The minimum chunk size would not apply to the final chunk). If the PEP specified a minimum chunk size then server implementations are free to reject smaller chunks (though the server may choose to allow smaller chunks) and compliant clients would only send upload chunks of the specified size or larger.

As a point of reference, S3 has limits for multipart uploads, including a minimum part size of 5 MiB and maximum of 10,000 parts. This is relevant both as a point of comparison and as a potential limitation should a server implementation choose to implement chunked file upload using S3 as a backend.
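As a back-of-the-envelope illustration of what those S3 limits imply for a client picking a chunk size (the helper is just a sketch, not anything from the PEP):

    import math

    MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum part size: 5 MiB
    MAX_PARTS = 10_000               # S3 maximum number of parts

    def pick_chunk_size(file_size: int) -> int:
        # Smallest chunk size that stays within the part-count limit,
        # but never below the minimum part size.
        return max(MIN_PART_SIZE, math.ceil(file_size / MAX_PARTS))

    # Anything up to ~48.8 GiB can use the 5 MiB minimum; a 100 GiB
    # artifact needs roughly 10.2 MiB chunks.
    print(pick_chunk_size(100 * 1024**3))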

It should be possible to host this at the same URL as the existing upload API using a few mechanisms to detect it:

  • The existing upload uses some URL parameters the new one does not, so you can dispatch based on that.
  • The existing upload uses the form upload content type, while the new one uses a custom content type, so you can dispatch based on that.
  • The payloads for the two uploads look entirely different, so worst case you can dispatch based on that.

So, it should be possible to have both APIs living at the same URL.
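For what it’s worth, the content type dispatch could be as small as this (framework details elided; just a sketch):

    UPLOAD_V2_CONTENT_TYPE = "application/vnd.pypi.upload.v2+json"

    def dispatch_upload(content_type: str) -> str:
        # The legacy API always posts multipart/form-data, while the new API
        # uses its own vendor content type, so the two never collide.
        if content_type.startswith(UPLOAD_V2_CONTENT_TYPE):
            return "upload-2.0"
        if content_type.startswith("multipart/form-data"):
            return "legacy"
        raise ValueError(f"unsupported content type: {content_type}")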

It actually should be possible, if the URLs are the same, to have the client do that proposed fallback as things stand now, but I think I’ll give it deeper thought on how to have clients auto detect the difference and know which API to use.

I’m hesitant to put specific requirements on minimum chunk size or maximum chunk size, if only because I don’t know that it’s possible to pick specific limits that are guaranteed to work across all possible implementations now and into the future.

What I think could be a good value add here is to take a look at the major object stores and how they handle multipart uploads and try to come up with a set of recommendations for what a client should limit itself to for maximum compatibility.

Yes, I completely agree.

The question of how the client detects which API to use is relevant regardless of whether or not the service implements both upload APIs at the same URL (and regardless of whether or not the service even supports both upload APIs). When a user puts a repository URL in their ~/.pypirc file (or whatever configuration is used by the client) the client will need to detect which API to use (unless the client requires the user to configure which API version to use or the client only supports one upload API version).

I believe the proposed fallback approach is sufficient. But if after giving it deeper thought you conclude that having clients use the fallback approach is insufficient or inappropriate then you may want to update the PEP to define a way for the server to advertise what version of the upload API it supports.

I agree with the hesitation to specify a minimum chunk size in the PEP. I’d consider updating the PEP to include:

  1. Server implementations may enforce implementation-specific minimum and maximum chunk sizes (and you may want to document how a server should respond if it receives a chunk size outside of those bounds).
  2. Recommendations for minimum and maximum chunk sizes for maximum compatibility.
  3. A recommendation for clients to allow a user to override chunk size (in the unfortunate case that the client’s default chunk sizes are not supported by a server).

Yes. Another approach to associating Sigstore certificates with distributions would be to use the “custom” TUF targets metadata in PEPs 458/480.

This might also be a good opportunity to deprecate GPG signatures?

2 Likes

Done in GPG Signature support removed from PyPI

2 Likes

It’s been a while and the PEP is still in draft. I’m curious what the status is of this PEP and its implementation.

I think I’ve caught up on this DPO thread, PEP 694 (as written today), and @alanbato’s PR[1].

I think this is a fantastic feature that would solve many of the problems with test.pypi.org, and maybe even allow us to decommission that service.

The last update to the PR is 2 years old and the last request for review was 5 months ago, so I’m wondering what the status is of both the PEP and the PR. I’d like to help if I can.

In the meantime, I have some questions. I hope maybe this will reboot the initiative, and I apologize in advance if I’ve missed any important details or misunderstood what is being proposed, and also for the length of this reply.

I wonder if “staged” releases is a better term than “draft”? Maybe it’s a distinction without a difference, but ISTM that the feature is really about incrementally preparing a new release, rather than (in general) creating preliminary versions that will undergo revisions [2]. Perhaps a minor point[3].

The PR asks:

Some people think it shouldn’t be guessable by regular users, while others want it to be determinable by maintainers in order to automate their workflows. How can we achieve both? Should it be calculated, or should it be stored?

I think it is important that draft releases aren’t easily guessable. The use case is for releases that are timed to product announcements. I would really like to stage[4] all the files for the new release, but not generally expose any of the files or even the presence of new versions until the announcement goes out. I think it’s enough to rely on security-through-obscurity, rather than truly locking down drafts through some sort of authentication token in the request.

“Authentication token” - which reminds me, although I don’t think it’s explicit, the simple index and files for a draft release are both accessible with the standard unauthenticated access mechanisms, and draft releases would likely also be cached in the CDN. I think both consequences are fine, as long as the draft hash token is not easily guessed.

Q1: I think the fact that draft files will be cached in the CDN is okay, with at worst some duplication perhaps? I’m not sure how all that works, so I just wanted to ask if that effect has been considered?

Q2: In the “Create an Upload Session” response, I don’t think the PEP specifies the format of the urls:draft value, and the PR I think currently creates an easily guessable hash from the package name and version. I think unguessability is an important use case to support and would like to see a resolution which preserves security-through-obscurity for drafts.
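For what it’s worth, the “stored” option seems straightforward: the server mints a random token when the session is created and keeps it alongside the session, rather than deriving it from the name and version. A minimal sketch (nothing here is in the PEP or the PR):

    import secrets

    def new_draft_token() -> str:
        # 32 random bytes, URL-safe base64 encoded: not guessable or
        # derivable from the package name and version, so it must be stored
        # with the session and handed back to the uploader.
        return secrets.token_urlsafe(32)

    print(new_draft_token())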

If I understand correctly, I think the root simple repository index doesn’t have access to the draft hash, so wouldn’t include any responses related to drafts. That seems fine except in one corner case: where a package is getting uploaded via 694 for the first time. I am not sure a) whether an initial draft upload session exposes the package name to the root index; b) if not, whether that’s a use case we should support. It might be rare enough not to matter.

I think the Session Completion section of the PEP is missing the treatment of another corner case. Say I create a session, then complete/publish it without uploading any files. I think that should explicitly be described as an error. An upload session MUST include at least one file (wheel or sdist).

What if I want to stage new releases for more than one package, and then publish them all at the same time? Technically, I think I can do this just by creating multiple sessions, and managing any tokens I need to do the individual uploads, and subsequent publishing steps. The other thing I’d want is for pip install to see all the drafts, so that I can test that everything works as expected before I push all the big “publish” buttons. I guess the use of multiple --index-url/--extra-index-url flags would be enough to support this use case on the client end? Has anyone thought about this use case or tried it out with the PRs?

I don’t love that I have to use --index-url or --extra-index-url to get client-side access to draft releases, but maybe it’s okay. Alternatively, something like a --draft flag (which can be specified multiple times) to resolve the session hashes and “see” the drafts would be better? That’s a client tool UX question and I don’t think it specifically impacts the protocol.

From the PEP

[…] for hashing the uploaded files, the serial chunks requirement means that the server can maintain hashing state between requests, update it for each request, then write that file back to storage. Unfortunately this isn’t actually possible to do with Python’s hashlib[…]

The downside is that there is no ability to parallelize the upload of a single file because each chunk has to be submitted serially.

While tus2 (IIUC) requires chunks to be uploaded serially, isn’t it possible for the server to preserve the chunks, and thus the hashes for each chunk, until the upload is complete? It makes the implementation more complicated, but is it valid within the tus2 protocol to defer constituting the complete uploaded file until Upload-Incomplete is omitted?
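To make the hashlib constraint concrete: within a single process the hashing state carries across chunks just fine, but the in-progress state can’t be serialized between requests, which is what the quoted passage is getting at. A small sketch:

    import hashlib

    # Incremental hashing works while the hash object stays in memory:
    hasher = hashlib.sha256()
    for chunk in (b"first chunk", b"second chunk", b"final chunk"):
        hasher.update(chunk)
    print(hasher.hexdigest())

    # But the partially-updated object can't be pickled and stashed between
    # requests, so a server would instead have to keep the raw chunks (or
    # re-read what it has already written) and hash everything once the
    # final chunk arrives.
    # import pickle; pickle.dumps(hasher)  # raises TypeError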

I think I saw a suggestion to allow file upload deletion to be optional, but I’d argue that this should be mandatory. It’s not entirely rare to encounter a build problem, say in one of your dozen wheel files, that only surfaces once all files are staged and testing against the draft is performed. A major benefit of the whole draft feature is that you can perform that testing before publishing! If you had to delete the whole session and re-upload everything, rather than just respin the offending file and replace it, it wouldn’t be as useful. However, I may be misunderstanding the protocol, so if it’s possible to just start re-uploading a file without deleting it, and it gets replaced, then the actual DELETE step may be largely unnecessary.

The PR uses URLs like <pkg>/release/0.23.0--draft--<some-hash>/ which doesn’t look great. Was <pkg>/release/0.23.0?draft=<some-hash> considered?

The visual demo in the PR was very helpful! The picture that describes the UI for publishing a release has a red box below it that says “Delete release” with an admonition “You will not be able to re-upload a new distribution of the same type with the same version number”. That seems incorrect - i.e. until a draft is released, the publisher should have free rein to delete individual files or the entire draft release and start over from scratch, right? The whole point being that drafts eliminate the need to artificially bump version numbers on broken releases, up until the release is published.

I think those are all my questions so far!


  1. though I haven’t reviewed all the details of the code changes ↩︎

  2. although revisions to individual files in the stage can happen if a file upload to a stage goes awry ↩︎

  3. and I might use the terms interchangeably below ↩︎

  4. er, um, draft ↩︎

1 Like