PEP 694: Upload 2.0 API for Python Package Repositories

dstufft · June 28, 2022, 12:42am

Riffing off the work done in PEP 691, I’ve gone ahead and used a similar pattern to write a new upload API for PyPI, this time documented as a standard PEP rather than implementation defined like the existing upload API.

The new API is a bit more complicated (it would be pretty hard to get less complicated than the old API), but with it we unlock a host of new features:

We move to an asynchronous API, which allows servers to respond quickly to requests and lets any extended processing of a file happen out of band, while giving clients the ability to poll to monitor the status.
We provide an opt in mechanism (on the client side) to support resumable file uploads, so that if something happens and your network connection is interrupted, you can ask the server how much data they got and send only the remaining data.
We provide a mechanism to send the file in (sequential) chunks, allowing clients to work in situations where some middlebox may be arbitrarily restricting how much data, or how long an individual request may take.
We provide the ability to upload multiple files, in separate HTTP requests, but “commit” them together as one atomic action, making either all of them or none of them available in the repository.
We provide the ability for users to upload files, hold off on publishing them, and test installing and using their artifacts before committing to them, using an ephemeral URL.
Allow the client to “pre-validate” that the upload will be OK, by sending the metadata and additional information prior to actually uploading any bytes on the wire.
Provide better mechanisms for the server to communicate warnings and errors back to the client throughout the process.
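
To make the resumable and chunked upload items above more concrete, here is a minimal in-memory sketch. `FakeUploadServer`, `received_offset`, `append`, and `resume_upload` are hypothetical names invented for illustration and are not the PEP's wire protocol; the sketch only models asking the server how much data it already has and then sending the remainder in sequential chunks.

```python
class FakeUploadServer:
    """Hypothetical stand-in for a server holding one partially uploaded file."""

    def __init__(self) -> None:
        self._data = bytearray()

    def received_offset(self) -> int:
        # Models "ask the server how much data they got".
        return len(self._data)

    def append(self, offset: int, chunk: bytes) -> None:
        # Chunks are sequential: each one must start where the last one ended.
        if offset != len(self._data):
            raise ValueError("chunk does not start at the current offset")
        self._data.extend(chunk)

    def contents(self) -> bytes:
        return bytes(self._data)


def resume_upload(server: FakeUploadServer, payload: bytes, chunk_size: int) -> int:
    """Send whatever the server is missing; return the number of chunks sent."""
    offset = server.received_offset()
    sent = 0
    while offset < len(payload):
        chunk = payload[offset:offset + chunk_size]
        server.append(offset, chunk)
        offset += len(chunk)
        sent += 1
    return sent


server = FakeUploadServer()
payload = b"x" * 10
server.append(0, payload[:4])  # the connection drops after the first 4 bytes
chunks_sent = resume_upload(server, payload, chunk_size=4)  # sends only bytes 4..9
```

A real client would do the same thing over HTTP; the point is only that the client never resends bytes the server already holds.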

The most basic flow to replicate the existing semantics involves making a number of new HTTP requests:

1. POST to create an upload session.
2. POST to create a file in that session.
3. POST to actually upload the file.
4. POST to complete the session and publish the artifacts.
5. GET to query the status, assuming the server didn’t respond with a 201 to step 4.

Steps 2/3 can be completed multiple times for each file, and you may also delete and overwrite files within the session at will prior to completing it.
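
The flow above, including the "all or nothing" publish, can be modelled with a small in-memory sketch. Everything here (`FakeSession`, its method names, the `status` strings) is a hypothetical illustration rather than the endpoints or payloads the PEP defines; it only shows that staged files stay invisible until the session is completed.

```python
class FakeSession:
    """Hypothetical model of one upload session (step 1 creates it)."""

    def __init__(self, published):
        self._published = published  # the repository's publicly visible files
        self._staged = {}            # filename -> bytes, or None until uploaded
        self.status = "pending"      # what a status query (step 5) would report

    def create_file(self, filename):
        # Step 2: declare a file in the session.
        self._staged[filename] = None

    def upload(self, filename, data):
        # Step 3: send the bytes for a file declared in step 2.
        if filename not in self._staged:
            raise KeyError(filename)
        self._staged[filename] = data

    def delete_file(self, filename):
        # Files may be removed (or overwritten) until the session completes.
        del self._staged[filename]

    def complete(self):
        # Step 4: publish every staged file at once, or nothing at all.
        if any(data is None for data in self._staged.values()):
            raise RuntimeError("a declared file has no content yet")
        self._published.update(self._staged)
        self.status = "published"


published = {}
session = FakeSession(published)                        # step 1
for name in ("example-1.0.tar.gz", "example-1.0-py3-none-any.whl"):
    session.create_file(name)                           # step 2
    session.upload(name, b"placeholder bytes")          # step 3
visible_before = list(published)                        # nothing is public yet
session.complete()                                      # step 4
visible_after = list(published)
```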

Unlike PEP 691, this does not attempt to match the semantics or live at the same URLs as the existing, legacy API. This represents an entirely new API, starting at version 2. However, it uses the same versioning mechanism as PEP 691 (but this is largely academic, as there is currently only the single version, and only JSON serialization).

You can see PEP 694 online once it’s rendered or see the original PR that added it.

In either case, discussions should go to this thread!

CAM-Gerlach · June 28, 2022, 3:06am

Just a small comment: as a package author, it is hard to express how excited and relieved I am for this particular feature. I try to use TestPyPI on first upload, but after that it’s such a hassle; being able to upload and test them from “real” PyPI without actually irrevocably committing the release will hugely streamline my release workflow.

fschulze · June 28, 2022, 4:58am

Great proposal!

I’m not sure I understand the reasoning for this part:

Unlike PEP 691, this PEP does not change the existing 1.0 API in any way, so servers will be required to host the new API described in this PEP at a different endpoint than the existing upload API.

Can’t we discern this with the Content-Type? If the client uses application/vnd.pypi.upload.v2+json we use the new API, if it is multipart/form-data we use the old one. The client can retry with the old one if the new one got a failure. Am I missing something there?

Regards,
Florian Schulze

njs · June 28, 2022, 7:18am

Have you considered using tus or similar for the upload protocol? idk how much benefit we would get from the existing implementations, but possibly some, and at a quick skim the protocol seems pretty reasonable with a clean extension negotiation mechanism for future-proofing. Plus it’s always nice to just … not have to think about these things

jack1142 · June 28, 2022, 8:40am

Could there be a way to list current pending sessions? I feel like communicating this data between different CI stages could sometimes be inconvenient. In a “simple” setup where you have a job that all of the build jobs depend on, communicating this probably isn’t a problem (at least on GH Actions, I don’t know how it looks with other platforms) but I imagine this doesn’t cover all use cases. One such case could be using multiple CI providers, where communicating data between them would be more difficult (not impossible though).

oscarbenjamin · June 28, 2022, 11:04am

Agreed. This is what I’ve been waiting for before fully automating everything. I just want to be able to have one last look at those files before they go live and I don’t want to have to worry about anyone accidentally triggering a public release.

Thanks to everyone for the good work!

pradyunsg · June 28, 2022, 12:02pm

This is exactly what the PEP is based on, and there’s a discussion about the differences compared to tus in there.

fungi · June 28, 2022, 12:05pm

This will be great when package maintainers start to rely on it,
especially the ability to not make a release ready for use until
you’ve uploaded all the archives you intend to support for it.

One of the biggest pain points for our projects is that many of our
dependencies have significant delays (hours or even days) between
uploading some parts of their releases. Libfoo will upload an sdist
and maybe a binary wheel for x86/64 for a new release, so our CI
jobs want to start testing with that; but AArch64 platforms end up
needing to build those releases from sdist, if they even have
sufficient information to do so, until official wheels for that are
uploaded. As a result, we often introduce artificial delays to test
a new release of these dependencies on any platform until all the
wheels we’ll need are present on PyPI, or we use local automation to
build unofficial stand-ins for the missing wheels ourselves before
proceeding.

dstufft · June 28, 2022, 12:37pm

I view /upload/?:action=file_upload&protocol_version=1 and /upload/ as different endpoints. Keeping the same endpoint would then require keeping the query parameters as well, which I don’t want to do. On PyPI the existing one lives at https://upload.pypi.org/legacy/, and the new one will likely live just at https://upload.pypi.org/, but another project could keep the path the same, and just dispatch based on whether the query params are there or not.
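
For a repository that does keep a single path, the dispatch described above could look roughly like this sketch. The function name `pick_upload_api` and the host `repo.example` are made up for illustration; only the `:action=file_upload` and `protocol_version=1` query parameters come from the existing upload API.

```python
from urllib.parse import parse_qs, urlsplit


def pick_upload_api(url: str) -> str:
    """Return "legacy" when the legacy query parameters are present, else "v2"."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs maps each parameter name to a list of its values.
    if query.get(":action") == ["file_upload"]:
        return "legacy"
    return "v2"


legacy = pick_upload_api(
    "https://repo.example/upload/?:action=file_upload&protocol_version=1"
)
modern = pick_upload_api("https://repo.example/upload/")
```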

Yea. There’s tus1 and tus2, tus2 is an internet rfc draft that takes what they learned from tus1 and attempts to make it something that could be supported by anyone. The actual process of uploading the file itself is more or less 1:1 tus2, except we constrain it further by declaring it must be a POST, and we say what the status code return types are.

The only deviations from tus2 are:

The rfc draft has an extra header to opt into it, that is only expected to be used while the rfc is still a draft.
The rfc draft uses a 104 status code, designed to enable clients to determine if an arbitrary server supports resumable uploads or not. We don’t use it because we know out of band that these servers do, and most of the http libraries and frameworks in Python make using 100 status codes hard or impossible.

You don’t need to list the current pending sessions, because if you attempt to create a second upload session for the same name+version as an already pending uploading session, the server is required to return the existing session rather than create a new one.

I chose this, because otherwise there is a race condition.

If you have 5 jobs running simultaneously that are all trying to upload a file, each doing the query → create if no pending → upload process, then you can have multiple of them seeing there are no pending sessions prior to any of them creating a new one. It would force you to have one job always run first to create the initial session, and likely would require the client to add a command to initiate a session.

With having the initial session creation actually be “get or create”, rather than just “create”, we push that responsibility off onto the upload server (who can likely use something like database transactions to enforce it), and now something like twine can just do the exact same thing for each upload.
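
As a rough illustration of why “get or create” removes the race, here is a sketch where a lock stands in for the database transaction mentioned above. `SessionStore` and its session identifiers are hypothetical; a real server would do this in its own storage layer.

```python
import threading
import uuid


class SessionStore:
    """Hypothetical server-side store of pending sessions per (name, version)."""

    def __init__(self):
        self._lock = threading.Lock()  # stand-in for a database transaction
        self._pending = {}

    def get_or_create(self, name, version):
        key = (name, version)
        with self._lock:
            # The check and the creation happen atomically, so only one
            # caller can ever create the session for a given key.
            if key not in self._pending:
                self._pending[key] = uuid.uuid4().hex
            return self._pending[key]


store = SessionStore()
results = []


def job():
    # Every CI job does exactly the same thing; none has to "go first".
    results.append(store.get_or_create("example", "1.0"))


threads = [threading.Thread(target=job) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

All five jobs end up holding the same session identifier, without any of them having to query for pending sessions beforehand.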

That’s the hope! I think PyPI will likely have sessions valid for 7d before they get reaped, but if folks think that’s not long enough, we could make it longer. We just want some reasonably short expiration to prevent a bunch of dangling sessions from just sitting there.
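
Reaping could be as simple as the following sketch. The 7 day lifetime is the tentative figure mentioned above, not something fixed by the PEP, and `reap_expired` and the session bookkeeping are hypothetical.

```python
from datetime import datetime, timedelta, timezone

SESSION_LIFETIME = timedelta(days=7)  # assumption: the tentative "7d" figure


def reap_expired(sessions, now):
    """Return only the sessions that are younger than SESSION_LIFETIME."""
    return {
        session_id: created
        for session_id, created in sessions.items()
        if now - created < SESSION_LIFETIME
    }


now = datetime(2022, 7, 10, tzinfo=timezone.utc)
sessions = {
    "old": now - timedelta(days=8),    # dangling session, gets reaped
    "fresh": now - timedelta(days=1),  # still within its lifetime
}
remaining = reap_expired(sessions, now)
```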

dstufft · June 28, 2022, 12:46pm

One thing I’d love to hear from other implementations is whether the draft url is something they want to support or not. This is the thing where you set up a url that someone can pass to --extra-index-url (or maybe --index-url) to install with the files that are staged in the session available.

It wouldn’t be hard to make that particular feature optional in the PEP, and clients can gate whether or not it’s available for their use by if that key exists in the urls dictionary or not. The PEP currently makes it mandatory, but it may not be as useful on a repository that natively allows re-uploading files with different content or that already has some other mechanism for draft releases.
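
If the feature did become optional, client-side gating might look like this sketch. Only the urls dictionary and its draft key come from the PEP; the function name, the surrounding response shape, and the example URL are assumptions made for illustration.

```python
def draft_index_args(session_response: dict) -> list:
    """Build extra pip arguments only if the server advertises a draft URL."""
    draft_url = session_response.get("urls", {}).get("draft")
    if draft_url is None:
        # The repository does not offer draft URLs; nothing to add.
        return []
    return ["--extra-index-url", draft_url]


with_draft = draft_index_args({"urls": {"draft": "https://repo.example/draft/abc/"}})
without_draft = draft_index_args({"urls": {}})
```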

So really I leave that up to other implementations, if they’d prefer the draft url functionality to be optional, we can do that. If everyone seems happy supporting it, then we can leave it in.

jack1142 · June 28, 2022, 12:52pm

Donald Stufft:

You don’t need to list the current pending sessions, because if you attempt to create a second upload session for the same name+version as an already pending uploading session, the server is required to return the existing session rather than create a new one.

I chose this, because otherwise there is a race condition.

If you have 5 jobs running simultaneously that are all trying to upload a file, each doing the query → create if no pending → upload process, then you can have multiple of them seeing there are no pending sessions prior to any of them creating a new one. It would force you to have one job always run first to create the initial session, and likely would require the client to add a command to initiate a session.

With having the initial session creation actually be “get or create”, rather than just “create”, we push that responsibility off onto the upload server (who can likely use something like database transactions to enforce it), and now something like twine can just do the exact same thing for each upload.

Oh yeah, this is fine, I didn’t notice this changed since I read the original draft, sorry about that, I’ll have to read through the whole diff.

dstufft · June 28, 2022, 1:00pm

No problem! That was one of the things that some of the initial feedback when I was circulating the PR prior to posting it called out too. The initial draft didn’t support multiple independent CI jobs very well at all since, at the time, it required knowing all of the filenames, hashes, and sizes of all the files up front.

The biggest differences between the initial draft and now (besides shuffling around some endpoints):

Support for draft urls
Sessions don’t know up front what files they will hold, so clients can add/remove arbitrary files over time to them.
Get Or Create semantics for creating sessions
Explicit “complete session” request, rather than autocompleting when all of the files are uploaded (since the second item removes our ability to know what “all” the files even are).

tiran · June 28, 2022, 1:17pm

Does the new API also support downloads from a session? IMHO it would be a useful feature. It would allow packagers to upload the sdist to PyPI and then use the sdist from the session to generate the other artifacts. For example, PyCA cryptography used to create binary wheels that way.

dstufft · June 28, 2022, 1:26pm

The “draft” key in the urls dictionary is a URL that implements the simple repository api, but with the staged files in the session available at it.

The PEP doesn’t specify whether that draft URL must be a full copy of the repository (and thus able to be used with --index-url), or whether it can be a limited copy that contains just the files added in the session (and thus, must be used with --extra-index-url).

So, you could implement the old pyca/cryptography workflow using something like (hypothetical commands):

$ twine upload --draft dist/example-1.0.tar.gz
$ pip wheel --no-deps --extra-index-url $(twine draft-url example 1.0) -w wheels/ example==1.0
$ twine upload --draft wheels/*
$ twine publish example 1.0

The PEP doesn’t state whether or not you can download the file from the other URLs in the session (like the one that you’re POSTing to, to upload it). If folks think it should, we can add that, but all of the use cases I can think of for downloading the files that are in the session are likely best handled by the draft url.

wkoorn · June 28, 2022, 9:01pm

This looks great - and exciting! I’m enjoying the pace of these PEPs rolling out, thanks!

As I understand, besides one minor deviation, this PEP is completely based on tus / its successor (draft-tus-httpbis-resumable-uploads-protocol-01 - tus - Resumable Uploads Protocol).
That sounds like something quite fundamental, and in my opinion, deserves to be (way) higher up in the PEP, instead of down below in the FAQ.

And using it as reference could maybe reduce the size of this PEP? Similar to how PEP-691 referenced external documentation (Mozilla) on content-negotiation, instead of copying over everything.

CAM-Gerlach · June 28, 2022, 10:52pm

One issue with that is that tus2 is currently an Internet Draft, which may evolve or expire before it potentially becomes an RFC, and from there potentially an Internet Standard. Internet Drafts also contain express warnings that they should not be relied upon for anything permanent or for implementing other standards.

To note, @dstufft , is there any notional plan on synchronizing the upload API with any changes to tus2 as it evolves toward standardization, to remain a fully interoperable and conformant implementation?

dstufft · June 28, 2022, 11:20pm

There’re no specific plans, because it largely depends on what the changes are. I didn’t just reference the tus2 spec because who knows what changes will happen to it between now and then, or if it will even continue to exist.

If tus2 ultimately fails and never becomes an RFC, then being based on it isn’t a useful property for people to know, because it’s ultimately just an application specific protocol built on top of HTTP for us.

If tus2 ultimately becomes a real RFC, then either it will be wholly compatible with what we’ve done, and we can just update the specification to point to tus2 or it will not, and we’ll require a PEP to figure out if we want to become compatible with tus2, and if so, how we manage that change.

In any case, in 2/3 of those options we want to keep the specification defined in our own PEP/specifications at least until we see what happens with tus2 and what changes it may make.

That’s also why the fact it’s using tus2 is buried in a FAQ. At this point in time, it is an application specific protocol that happens to look almost exactly like tus2, with the hopes someday we can say it is just tus2.

EpicWink · June 30, 2022, 11:16pm

The PEP says that to get the status of (HEAD) or cancel (DELETE) existing file uploads, you need the upload token. What if your upload app crashes and you lose your upload token? How do you then restart the upload for a file?

dstufft · July 7, 2022, 3:48pm

Sorry for the delay in a response to this, I’ve had competing priorities lately.

I’m going to update the PEP to allow DELETE to be called without the upload token, thanks for noticing this!

woodruffw · July 7, 2022, 9:01pm

This PEP is very exciting!

Since we’re currently working on integration between Sigstore and PyPI, this PEP might be the right place to ensure that we’re able to upload the appropriate signing artifacts (signature and certificate) to PyPI, and associate them with the correct distribution under the release.

The PEP currently says:

The file itself is then sent as a application/octet-stream part with the name of content, and if there is a PGP signature attached, then it will be included as a application/octet-stream part with the name of gpg_signature.

Do you think it makes sense to allow two more names here, e.g. sigstore_signature and sigstore_certificate (or similar) for this purpose? Or is there another location (or separate PEP) that’s more appropriate?

cc @dustin for thoughts as well.