PEP 694: Upload 2.0 API for Python Package Repositories

barry · July 10, 2024, 11:50pm

I think I’ve caught up on this DPO thread, PEP 694 (as written today), and @alanbato 's PR^[1].

I think this is a fantastic feature that would solve many of the problems with test.pypi.org, and maybe even allow us to decommission that service.

The last update to the PR is 2 years old and the last request for review was 5 months ago, so I’m wondering what the status is of both the PEP and the PR. I’d like to help if I can.

In the meantime, I have some questions. I hope maybe this will reboot the initiative, and I apologize in advance if I’ve missed any important details or misunderstood what is being proposed, and also for the length of this reply.

I wonder if “staged” releases is a better term than “draft”? Maybe it’s a distinction without a difference, but ISTM that the feature is really about incrementally preparing a new release, rather than (in general) creating preliminary versions that will undergo revisions^[2]. Perhaps a minor point^[3].

The PR asks:

Some people think it shouldn’t be guessable by regular users, while others want it to be determinable by maintainers in order to automate their workflows. How can we achieve both? Should it be calculated, or should it be stored?

I think it is important that draft releases aren’t easily guessable. The use case is for releases that are timed to product announcements. I would really like to stage^[4] all the files for the new release, but not generally expose any of the files or even the presence of new versions until the announcement goes out. I think it’s enough to secure-through-obscurity, rather than truly lock down drafts through some sort of authentication token in the request.

“Authentication token” - which reminds me, although I don’t think it’s explicit, the simple index and files for a draft release are both accessible with the standard unauthenticated access mechanisms, and draft releases would likely also be cached in the CDN. I think both consequences are fine, as long as the draft hash token is not easily guessed.

Q1: I think the fact that draft files will be cached in the CDN is okay, with at worst some duplication perhaps? I’m not sure how all that works, so I just wanted to ask whether that effect has been considered.

Q2: In the “Create an Upload Session” response, I don’t think the PEP specifies the format of the urls:draft value, and the PR I think currently creates an easily guessable hash from the package name and version. I think unguessability is an important use case to support and would like to see a resolution which preserves security-through-obscurity for drafts.
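
To make the guessability concern concrete, here is a minimal sketch contrasting a token calculated from the package name and version with a randomly generated one. Both function names are hypothetical, and neither is taken from the PEP or the PR; the first only approximates the kind of derivation I believe the PR currently uses.

```python
import hashlib
import secrets


def guessable_draft_token(name: str, version: str) -> str:
    """Calculated token: anyone who knows the name and version can recompute it.

    Illustrative only; the exact derivation used by the PR may differ.
    """
    return hashlib.sha256(f"{name}-{version}".encode()).hexdigest()[:16]


def unguessable_draft_token(nbytes: int = 32) -> str:
    """Random token: it has to be stored server-side and handed back to the
    maintainer (for example in the session creation response), because it
    cannot be recomputed from public information."""
    return secrets.token_urlsafe(nbytes)


if __name__ == "__main__":
    # Same inputs always produce the same calculated token.
    print(guessable_draft_token("example-pkg", "0.23.0"))
    # Each call produces a fresh random token.
    print(unguessable_draft_token())
```

The trade-off is the one the PR describes: the calculated form needs no storage but is predictable, while the random form is unpredictable but must be stored and returned to the maintainer for automation.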

If I understand correctly, I think the root simple repository index doesn’t have access to the draft hash, so wouldn’t include any responses related to drafts. That seems fine except in one corner case: where a package is getting uploaded via 694 for the first time. I am not sure a) whether an initial draft upload session exposes the package name to the root index; b) if not, whether that’s a use case we should support. It might be rare enough not to matter.

I think the Session Completion section of the PEP is missing the treatment of another corner case. Say I create a session, then complete/publish it without uploading any files. I think that should explicitly be described as an error. An upload session MUST include at least one file (wheel or sdist).
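
As a sketch of the rule I’m proposing, assuming a hypothetical in-memory `UploadSession` model (none of these names come from the PEP or from Warehouse), publishing an empty session would be rejected like this:

```python
from dataclasses import dataclass, field


class EmptySessionError(Exception):
    """Raised when a session is published without any uploaded files."""


@dataclass
class UploadSession:
    # Hypothetical model of a draft upload session.
    name: str
    version: str
    files: dict[str, bytes] = field(default_factory=dict)
    published: bool = False

    def add_file(self, filename: str, content: bytes) -> None:
        self.files[filename] = content

    def publish(self) -> None:
        # Proposed rule: a session MUST contain at least one wheel or sdist.
        if not self.files:
            raise EmptySessionError(
                f"cannot publish {self.name} {self.version}: no files uploaded"
            )
        self.published = True
```

A real server would presumably map this to an HTTP error response; which status code is appropriate is something the PEP would need to decide.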

What if I want to stage new releases for more than one package, and then publish them all at the same time? Technically, I think I can do this just by creating multiple sessions, and managing any tokens I need to do the individual uploads, and subsequent publishing steps. The other thing I’d want is for pip install to see all the drafts, so that I can test that everything works as expected before I push all the big “publish” buttons. I guess the use of multiple --index-url/--extra-index-url flags would be enough to support this use case on the client end? Has anyone thought about this use case or tried it out with the PRs?
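
For illustration, this is roughly the pip invocation I have in mind, built in Python so it can be scripted. The `--index-url` and `--extra-index-url` flags are real pip options; the helper name and all of the URLs are made-up placeholders, since the PEP does not define what a draft index URL looks like.

```python
import shlex


def pip_install_args(requirements, primary_index, draft_indexes):
    """Build a pip command line that consults several draft indexes at once."""
    args = ["pip", "install", "--index-url", primary_index]
    for url in draft_indexes:
        # One --extra-index-url per staged package/session.
        args += ["--extra-index-url", url]
    args += list(requirements)
    return args


if __name__ == "__main__":
    args = pip_install_args(
        ["pkg-a==2.0", "pkg-b==3.1"],
        "https://pypi.example/simple/",  # placeholder URL
        [
            "https://pypi.example/drafts/TOKEN-A/simple/",  # placeholder URL
            "https://pypi.example/drafts/TOKEN-B/simple/",  # placeholder URL
        ],
    )
    print(shlex.join(args))
```

I haven’t tried this against the PRs, so whether pip resolves across several draft indexes the way I’d hope is still an open question.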

I don’t love that I have to use --index-url or --extra-index-url to get client-side access to draft releases, but maybe it’s okay. Alternatively, something like a --draft flag (which can be specified multiple times) to resolve the session hashes and “see” the drafts would be better? That’s a client tool UX question and I don’t think it specifically impacts the protocol.

From the PEP

[…] for hashing the uploaded files, the serial chunks requirement means that the server can maintain hashing state between requests, update it for each request, then write that file back to storage. Unfortunately this isn’t actually possible to do with Python’s hashlib[…]

The downside is that there is no ability to parallelize the upload of a single file because each chunk has to be submitted serially.

While tus2 (IIUC) requires serialization of chunk uploads, isn’t it possible for the server to preserve the chunks, and thus the hashes for each chunk, until the upload is complete? It makes the implementation more complicated, but is it valid within the tus2 protocol to defer constituting the complete uploaded file until Upload-Incomplete is omitted?
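
Here is a minimal sketch of the deferred approach I’m asking about, assuming a hypothetical in-memory `ChunkedUpload` class. The server keeps each chunk as it arrives and only computes the file hash once the final chunk is in, so no hashlib state has to survive between requests. Whether this is valid within the protocol is exactly my question; the sketch only shows that it is mechanically simple.

```python
import hashlib


class ChunkedUpload:
    """Hypothetical server-side record of one file being uploaded in chunks."""

    def __init__(self) -> None:
        self._chunks: list[bytes] = []
        self.complete = False

    def append(self, offset: int, data: bytes, incomplete: bool) -> None:
        """Store a chunk; `incomplete` plays the role of Upload-Incomplete."""
        if self.complete:
            raise ValueError("upload already complete")
        expected = sum(len(chunk) for chunk in self._chunks)
        if offset != expected:
            # Chunks still arrive serially; reject gaps or overlaps.
            raise ValueError(f"expected offset {expected}, got {offset}")
        self._chunks.append(data)
        self.complete = not incomplete

    def finalize(self) -> str:
        """Hash the preserved chunks in order once the upload is complete."""
        if not self.complete:
            raise ValueError("upload is not complete yet")
        digest = hashlib.sha256()
        for chunk in self._chunks:
            digest.update(chunk)
        return digest.hexdigest()
```

A real implementation would keep the chunks in object storage rather than in memory, which is where the extra complexity I mentioned comes in.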

I think I saw a suggestion to allow file upload deletion to be optional, but I’d argue that this should be mandatory. It’s not entirely rare to encounter a build problem, say in one of your dozen wheel files, that only surfaces once all files are staged, and testing against the draft is performed. A major benefit of the whole draft feature is that you can perform that testing before publishing! If you’d have to delete the whole session and re-upload everything, rather than just respin the offending file and replace it, it wouldn’t be as useful. However I may be misunderstanding the protocol, so if it’s possible to just start re-uploading a file without deleting it, and it gets replaced, then the actual DELETE step may be largely unnecessary.

The PR uses URLs like <pkg>/release/0.23.0--draft--<some-hash>/ which doesn’t look great. Was <pkg>/release/0.23.0?draft=<some-hash> considered?
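
To compare the two URL shapes, here is a small sketch that pulls the version and draft token out of each using only `urllib.parse`. The function names and the `pypi.example` host are placeholders of mine, not anything from the PR.

```python
from urllib.parse import parse_qs, urlsplit


def parse_path_style(url: str):
    """Parse <pkg>/release/<version>--draft--<hash>/ into (version, token)."""
    last = urlsplit(url).path.rstrip("/").rsplit("/", 1)[-1]
    version, sep, token = last.partition("--draft--")
    return version, (token if sep else None)


def parse_query_style(url: str):
    """Parse <pkg>/release/<version>?draft=<hash> into (version, token)."""
    parts = urlsplit(url)
    version = parts.path.rstrip("/").rsplit("/", 1)[-1]
    token = parse_qs(parts.query).get("draft", [None])[0]
    return version, token


if __name__ == "__main__":
    print(parse_path_style("https://pypi.example/pkg/release/0.23.0--draft--abc123/"))
    # ('0.23.0', 'abc123')
    print(parse_query_style("https://pypi.example/pkg/release/0.23.0?draft=abc123"))
    # ('0.23.0', 'abc123')
```

With the query form the version segment stays a plain version string and the token is handled by standard query parsing, which is part of why it looks cleaner to me.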

The visual demo in the PR was very helpful! The picture that describes the UI for publishing a release has a red box below it that says “Delete release” with an admonition “You will not be able to re-upload a new distribution of the same type with the same version number”. That seems incorrect - i.e. until a draft is released, the publisher should have free rein to delete individual files or the entire draft release and start over from scratch, right? The whole point being that drafts eliminate the need to artificially bump version numbers on broken releases, up until the release is published.

I think those are all my questions so far!

[1] though I haven’t reviewed all the details of the code changes
[2] although revisions to individual files in the stage can happen if a file upload to a stage goes awry
[3] and I might use the terms interchangeably below
[4] er, um, draft