PEP 694: Upload 2.0 API for Python Package Repositories

That section you’re quoting describes the status quo API; the later sections describe what this PEP proposes.


For additional related data, the question would be whether it’s a large binary blob that needs to be uploaded, or whether it’s metadata that can be sent in the initial request.

If it’s metadata, then it can just be added as an extra key to the request that creates a file upload, which is sent just prior to uploading the file itself.

If it’s a big binary blob that needs to be uploaded independently, then we would need to add a way to upload related files that aren’t distributions.

My bad!

The sizes here should be approximately 1.5 KB per certificate and ~150 bytes per signature (base64-encoded). Do you think that’s small enough to make sense as extra keys?

Edit: For context, sigstore’s dogfooded signatures and certificates can be seen here (with their sizes): Release 0.6.2 · sigstore/sigstore-python · GitHub

What’s the relation between certificates, signatures, and files?

1 file = 1 signature = 1 certificate? Or something else?

Yep, each file has exactly one signature and certificate.

Edit: In principle, a file can be signed many times and have correspondingly many certificates. But I think the data model can be 1-1 for PyPI’s purposes. Or maybe not, and that’s perhaps something we should discuss…

That’s probably fine to live as metadata I think.

It’s intended that the entire METADATA can be submitted as, uh, metadata. While I don’t have the exact sizes readily available, the largest part of it is typically the project README: the average project README on PyPI is just under 4 KB, and the largest one is 7.2 MB (yikes).


Gotcha! In that case, here are some proposals for the keys:

For the 1-1 case (if we go that route):

"sigstore": {"signature": "<base64 sig>", "certificate": "<PEM-encoded cert>"}

For the many case (if we go that route):

"sigstore": [{"signature": "<base64 sig>", "certificate": "<PEM-encoded cert>"}]
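To make the shape concrete, here’s a rough sketch of how a client might embed the proposed key in the request body that creates a file upload. Everything besides the `sigstore` structure (the surrounding field names, the filename, the hash) is illustrative, since none of this is in the PEP yet:

```python
import json

# Hypothetical upload-2.0 "create a file upload" request body; the
# fields around "sigstore" are made up for illustration only.
request_body = {
    "filename": "example-1.0.0-py3-none-any.whl",
    "hashes": {"sha256": "<hex digest>"},
    # The "many signatures" form: a list of signature/certificate pairs.
    "sigstore": [
        {"signature": "<base64 sig>", "certificate": "<PEM-encoded cert>"},
    ],
}

payload = json.dumps(request_body)
```

The 1-1 form would simply replace the list with a single object.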

Those look fine to me, though I don’t think I’d add them to the PEP yet unless PyPI supports them already and I’ve just missed it.

That’s mostly because I think the question of what PyPI + Sigstore together means deserves its own consideration, and a look at the whole proposal of what that means. But I think that scheme would be fine for how to add it to this proposed upload API, and it suggests that it wouldn’t be a large blocker to eventually adding sigstore.


Nope, PyPI doesn’t support them yet!

Makes sense. I think the plan is to rehydrate PEP 480 with some of those details.


I have a few comments about the document.

(1) I suggest adding something about the length of Upload-Token. For example the tus Internet Draft says:

A conforming implementation MUST be able to handle a Upload-Token field value of at least 128 octets.

(2) The Upload-Offset of 1001 in the example given in the document is incorrect.

As an example, if uploading a 100,000 byte file in 1000 byte chunks, and this chunk represents bytes 1001 through 2000, you would send headers like:

Upload-Offset is zero-based: the first 1000 bytes occupy offsets 0 through 999, so byte 1001 is at offset 1000, not 1001.
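A quick helper makes the zero-based arithmetic explicit (the function here is made up for illustration; only the Upload-Offset semantics come from the discussion above):

```python
def upload_offset(chunk_index: int, chunk_size: int) -> int:
    """Zero-based byte offset of the first byte of a given chunk.

    chunk_index is zero-based too: chunk 0 covers offsets
    0 through chunk_size - 1.
    """
    return chunk_index * chunk_size

# 100,000-byte file in 1000-byte chunks: the chunk carrying bytes
# 1001 through 2000 (counting from 1) is chunk index 1, at offset 1000.
assert upload_offset(1, 1000) == 1000
assert upload_offset(0, 1000) == 0
```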

(3) The document doesn’t really describe what is at the draft URL, so it is hard for someone implementing a service to know what to provide, or for a client to know what to expect. There is the question of which files are available to download from that draft URL: (a) only the files from this session, (b) all the files for this package, (c) all draft files, including ones from other packages, or (d) none of the above. More important is what API is implemented at that URL. Should the document specify that the repository at the draft URL must implement PEP 503? Or PEP 691?

(4) Are the upload, draft, and publish keys all 3 required in the urls key? Can a compliant server omit the draft key? The wording in the document currently says (emphasis added):

For the urls key, there are currently three keys that may appear:


Thanks!

That’s an easy add, sounds good.

Oops yea, I’ll fix that.

It’s intended to be a simple repository API that contains at least the files from the session. It is not required to have any additional files from that project or any other projects, but it may if an implementation wants to. So it’s PEP 503 + PEP 691, etc, depending on what repository specs/version the repository is emitting.

Upload and publish would be mandatory; there’s an open question about whether draft should be mandatory or not. It’s simpler for end users if they can rely on it always existing, but if any non-PyPI implementations of this would rather not have to support it, I think making it optional would work too.

That sounds reasonable. I suggest adding a sentence or two to the PEP stating that.

The proposed API requires much more from service implementations than the current legacy upload API, so I suggest it is worth considering ways to reduce the amount of work required of services to be compliant with the PEP. Here are a few features I imagine could be made optional:

  1. Chunked uploads, resumable uploads, and cancelling an in-progress upload (those all go together, so I’d expect an implementation to support either all or none).
  2. Deleting an uploaded file - I’m not sure how often users will get into a situation where they need to delete a file from a session.
  3. Draft repository URL.

The implementation costs of 2 and 3 are relatively small compared to 1. So if anything is to be made optional in order to ease service implementations, I’d consider making chunked uploads, resumable uploads, and cancelling uploads optional. I realize the value of this for extremely large files, but as stated in the PEP, “it is recommended that most clients should choose to upload each file as a single chunk as that requires fewer requests and typically has better performance.”

I think it is likely that other server implementations will host both the legacy upload API and this new one under the same root repository URL. It won’t be a problem for the service to distinguish legacy upload requests from upload 2.0 requests. But how will the client know which API the service supports? As a user, suppose I upgraded to a new version of twine that supports upload 2.0 and I have configured my ~/.pypirc with an index server repository URL pointing at an Artifactory or CodeArtifact repository. How will the client know which API to use? I like Florian’s proposal that the client first make an application/vnd.pypi.upload.v2+json request using the new API and, if that fails, fall back on the legacy API.


And one other thought (sorry for multiple replies in a row). Would it be reasonable to specify a minimum chunk size and/or a maximum number of chunks? (The minimum chunk size would not apply to the final chunk). If the PEP specified a minimum chunk size then server implementations are free to reject smaller chunks (though the server may choose to allow smaller chunks) and compliant clients would only send upload chunks of the specified size or larger.

As a point of reference, S3 has limits for multipart uploads, including a minimum part size of 5 MiB and maximum of 10,000 parts. This is relevant both as a point of comparison and as a potential limitation should a server implementation choose to implement chunked file upload using S3 as a backend.
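Those two S3 limits interact: at the minimum part size, a 10,000-part upload tops out around 48.8 GiB, so an S3-backed server would need clients to use bigger chunks for anything larger. The limits below are S3’s documented multipart limits; the arithmetic is just illustrative:

```python
MIN_PART = 5 * 1024 * 1024   # S3 minimum part size: 5 MiB
MAX_PARTS = 10_000           # S3 maximum number of parts

# Largest file that fits in 10,000 parts at the minimum part size:
max_bytes = MIN_PART * MAX_PARTS
print(max_bytes / 2**30)  # ≈ 48.8 GiB
```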

It should be possible to host this at the same URL as the existing upload API using a few mechanisms to detect it:

  • The existing upload uses some url parameters the new one does not, so you can dispatch based on that.
  • The existing upload uses form upload content type, the new one uses a custom content type, so you can dispatch based on that.
  • The payloads for the two uploads look entirely different, so worst case you can dispatch based on that.

So, it should be possible to have both APIs living at the same URL.
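A server-side dispatcher along those lines might look like this (the content types come from the discussion above; the function and handler names are a hypothetical sketch):

```python
def dispatch_upload(content_type: str) -> str:
    """Route a request to the legacy or upload-2.0 handler by content type.

    The legacy upload API posts multipart form data, while the proposed
    API uses a custom JSON content type, so the Content-Type header alone
    is enough to tell them apart. Handler names are illustrative.
    """
    if content_type.startswith("application/vnd.pypi.upload.v2+json"):
        return "upload-2.0"
    if content_type.startswith("multipart/form-data"):
        return "legacy"
    return "unknown"
```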

As things stand now, it actually should be possible, if the URLs are the same, for the client to do that proposed fallback, but I’ll give deeper thought to how clients can auto-detect the difference and know which API to use.

I’m hesitant to put specific requirements on minimum chunk size or maximum chunk size, if only because I don’t know that it’s possible to pick specific limits that are guaranteed to work across all possible implementations now and into the future.

What I think could be a good value add here is to take a look at the major object stores and how they handle multipart uploads and try to come up with a set of recommendations for what a client should limit itself to for maximum compatibility.

Yes, I completely agree.

The question of how the client detects which API to use is relevant regardless of whether or not the service implements both upload APIs at the same URL (and regardless of whether or not the service even supports both upload APIs). When a user puts a repository URL in their ~/.pypirc file (or whatever configuration is used by the client) the client will need to detect which API to use (unless the client requires the user to configure which API version to use or the client only supports one upload API version).

I believe the proposed fallback approach is sufficient. But if after giving it deeper thought you conclude that having clients use the fallback approach is insufficient or inappropriate then you may want to update the PEP to define a way for the server to advertise what version of the upload API it supports.
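The fallback behavior the client would implement can be sketched as follows, where `post` stands in for an actual HTTP POST and returns a status code (all names here are hypothetical; only the content type comes from the proposal):

```python
def choose_upload_api(post) -> str:
    """Try the upload-2.0 content type first; fall back to the legacy
    API if the server rejects it (e.g. 415 Unsupported Media Type).

    `post` is a callable taking a content type and returning the HTTP
    status code of the attempted request.
    """
    status = post("application/vnd.pypi.upload.v2+json")
    if 200 <= status < 300:
        return "upload-2.0"
    return "legacy"

# A server that rejects the v2 content type routes the client to legacy:
assert choose_upload_api(lambda content_type: 415) == "legacy"
assert choose_upload_api(lambda content_type: 200) == "upload-2.0"
```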

I agree with the hesitation to specify a minimum chunk size in the PEP. I’d consider updating the PEP to include:

  1. Server implementations may enforce implementation-specific minimum and maximum chunk sizes (and you may want to document how a server should respond if it receives a chunk size outside of those bounds).
  2. Recommendations for minimum and maximum chunk sizes for maximum compatibility.
  3. A recommendation for clients to allow a user to override chunk size (in the unfortunate case that the client’s default chunk sizes are not supported by a server).