The use of an HTML representation for Python package indices predates efforts to standardize Python packaging. Consequently, the HTML representation standardized with PEP 503 represents a formalization of existing practices (particularly those of PyPI), rather than a design.
The HTML representation serves the Python packaging ecosystem admirably, but is also subject to a handful of technical and social limitations (elaborated in the PEP) that result in it being (1) cumbersome to add features to in a backwards-compatible and performant manner, and (2) de facto frozen outside of PyPI itself.
Consequently, the PEP proposes “freezing” the HTML representation, i.e. explicitly discouraging the addition of new features to the HTML representation in future Simple Repository API PEPs. This does not deprecate the HTML representation, and the PEP does not discourage installers or indices from continuing to use it.
Summary of proposed changes
The HTML representation of the simple repository API is frozen for the purposes of Python packaging standards processes. Future Python packaging PEPs SHOULD NOT modify the HTML representation of the simple repository API, and MUST instead modify the JSON representation.
As always, I look forward to the community’s feedback on this proposal.
-1 from me, I would like to see the optional upload-time field be allowed in the HTML serialization (currently the specifications say it is JSON only).
The field is critical for lots of client use cases, HTML only mirrors could then pick it up from PyPI, and other tools that generate HTML can choose to implement it, or not.
After that I would be happy to see the HTML specification frozen, and even eventually discouraged.
I don’t feel strongly one way or another on the future of HTML or exposing upload-time in that HTML-- but are you aware of any mirrors that actually work by mirroring the actual HTML that PyPI serves?
My expectation is that pretty much all mirrors are implemented by generating their own HTML, particularly since PyPI uses absolute URLs to the files, so if you want your mirror to mirror the files as well, you need to fix those URLs somehow. It’s probably easier to just generate the HTML than to try and take PyPI’s HTML and mutate it to point to your own URLs.
You’d also have to make sure that you’re mirroring the metadata files (which won’t happen with any sort of generic website mirroring tool) or you’d also have to make sure that you’re dropping the data-core-metadata attribute.
Regardless of whether anyone mirrors PyPI by mirroring the actual HTML, mirrors that generate their own HTML could choose to implement upload-time if it were added, so I don’t think it affects your position much one way or the other. I’m mostly just curious because it’s come up a couple of times and I haven’t actually seen anyone do that in a long time (a decade ago that was the case though).
I guess one question I’d have is whether we know of any non-PyPI indexes that have expressed a willingness to implement upload-time? One of the assertions that PEP 833 makes is that uptake of new features among non PyPI indexes is limited or non-existent. If that’s true are they going to implement upload-time, or is it going to be yet another feature that gets ignored? If it’s not true and there are non PyPI indexes that are adopting new features regularly, then that is useful information to know for determining whether 833 is a good idea or not.
I mentioned it in the earlier thread, but to answer my own question a bit. I suspect there will be a larger motivation to implement upload-time given there are user facing features that require that data, whereas the other features are not required for any user facing feature.
That being said, I have no idea if larger is still effectively zero or not!
I’d also say that adding upload-time to the HTML representation is such a trivially easy addition, that if someone felt motivated to write such a PEP, I’d personally have no problem with it. I expect it’d get approved unless there was a hidden contingent of people who were adamant the HTML representation shouldn’t get even one more feature that refrained from posting in that other thread.
I also wouldn’t be opposed to a PEP that made a last change to the HTML representation for upload-time. But I agree with Donald’s analysis that doing so is unlikely to move the needle on adoption, given that third-party indices generally need to rebuild the representation anyways, and we haven’t seen any evidence of them doing that for other pieces of metadata.
I am generally in favour of the PEP. As it points out, individual PEPs can already make the case that they aren’t worth including in the legacy HTML API, so it’s just shifting the burden of argument from PEPs that want to leave the legacy API alone to those that still want to change it.
I would be enthusiastically in favour of the idea if the PEP (or, more appropriately, a parallel technical PEP) proposed standardisation of JSON-only index servers that didn’t serve HTML files at all, and were just as amenable to static file hosting as HTML-only servers currently are.
(Strawman idea: allow the JSON version of the file to be requested by appending .json to the URL in addition to via HTTP content negotiation on the legacy HTML URL)
The PEP does call this limitation out already, but only in the context of hypothetical future removal of the HTML API, not in the context of making freezing the HTML API more palatable by flipping which one we treat as the mandatory interface and which one we treat as the optional alternative.
There’s nothing special about HTML that makes it able to be served statically and JSON cannot.
You can do a JSON-only index server today with static file hosting with nearly zero configuration-- the only required configuration would be informing your web server that it needs to add the application/vnd.pypi.simple.v1+json content type for .json (or whatever extension you use). In the previous thread I showed a demo that was hosted entirely in S3 (without any special dynamic logic from CloudFront or anything).
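To make the "one piece of configuration" point concrete, here is a minimal sketch using Python's standard-library http.server as a stand-in for a real static host. The handler name, the `serve` helper, the directory and the port are all assumptions for illustration; the only thing taken from the discussion is the content type for .json files.

```python
import functools
import http.server

# Content type for the JSON serialization, as named in the thread.
PYPI_JSON = "application/vnd.pypi.simple.v1+json"


class JsonIndexHandler(http.server.SimpleHTTPRequestHandler):
    """Static file handler whose only customization is the .json content type."""

    # Keep the stdlib defaults and override just the .json mapping.
    extensions_map = {
        **http.server.SimpleHTTPRequestHandler.extensions_map,
        ".json": PYPI_JSON,
    }


def serve(directory=".", port=8000):
    """Serve `directory` on localhost (hypothetical helper, blocks forever)."""
    handler = functools.partial(JsonIndexHandler, directory=directory)
    with http.server.ThreadingHTTPServer(("127.0.0.1", port), handler) as httpd:
        httpd.serve_forever()


# Example (not run here): serve("path/to/static/index", port=8000)
```

This sketch ignores the Accept header entirely, and it does not deal with directory indexes, so a client would need to be pointed at the JSON file's URL directly.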
Conneg is only required if you want to support both HTML and JSON at the same URL, if you only want to support one of them you don’t need to use conneg, you just return the one that you support and ignore the Accept headers, which is exactly what people are doing today to serve HTML.
Conneg also doesn’t mean that HTML has to be the default, the client gives you a list of priorities, and the server can do… anything it wants with that information. It can give the client the form the client prefers or it can give the client the form it prefers, or it can ignore the Accept header and return anything at all that it likes.
On PyPI the default, absent an Accept header that weights JSON higher, is text/html just due to old clients that didn’t send an Accept header at all, but that could easily be swapped around on PyPI (or any specific server) without any ecosystem coordination.
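As a rough illustration of "the server can do anything it wants with that information", here is a simplified sketch of server-side selection with a configurable default. This is not PyPI's actual implementation; the function names are hypothetical, wildcards such as `*/*` are deliberately not matched, and the Accept parsing is far less thorough than a real HTTP library's.

```python
JSON_V1 = "application/vnd.pypi.simple.v1+json"
HTML_V1 = "application/vnd.pypi.simple.v1+html"
LEGACY_HTML = "text/html"

# Content types this hypothetical server is able to produce.
SUPPORTED = (JSON_V1, HTML_V1, LEGACY_HTML)


def parse_accept(header):
    """Split an Accept header into (media_type, q) pairs (simplified)."""
    prefs = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # a missing q parameter means full weight
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not wanted"
        prefs.append((media_type, q))
    return prefs


def choose_content_type(accept_header, default=LEGACY_HTML):
    """Pick the supported type the client weights highest, else the default."""
    if not accept_header:
        return default
    best, best_q = None, 0.0
    for media_type, q in parse_accept(accept_header):
        if media_type in SUPPORTED and q > best_q:
            best, best_q = media_type, q
    return best or default
```

Swapping the default is a one-argument change, e.g. `choose_content_type(header, default=JSON_V1)`, which is the sense in which no ecosystem coordination is needed.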
The only thing that’s special about HTML is:
Pretty much every webserver in existence knows that .html should be served with a text/html content type by default, so you don’t have to inform them of the content type.
@brettcannon proposed adding application/json as an alias for application/vnd.pypi.simple.v1+json similarly to how we have one for text/html which would reduce this to some degree since many servers also understand that.
Most servers understand index.html as the directory listing by default, not as many understand index.json, so you might need to tell them to use that.
Any generic web server that can generate an auto index from the file system can probably generate something close enough to PEP 503 that installers will accept it.
I don’t think our current specification page makes this at all clear. I know my interpretation of the status quo was that defaulting to HTML was still part of the specification, so if that already isn’t the intent, this PEP provides a good opportunity to make the actual intent fully explicit (that is, spelling out that statically served JSON files is a permitted index server implementation)
Yeah, the current spec has the “HTML Serialisation” section inside the “Base API” section. It would be hard to interpret that as meaning HTML is optional. IMO, the whole specification page should be updated to properly describe the current index API (at the moment, it’s very visible that it was created by pasting together PEPs 503, 658 and 691).
In fact, I’m inclined to say that if we approve PEP 833, then one key requirement should be that the spec page gets updated to reflect the new position: JSON and HTML forms of the index are served from the base URL, both are optional (do we say that JSON is preferred?), if both are served then content negotiation is the way to serve them from the same URL, but indexes can use 2 distinct base URLs, each serving one format, if they prefer.
Leaving the spec as is while accepting PEP 833 would just make things confusing for users.
That may be the case. I definitely think anything we do to clear up people’s misconceptions around what PEP 691 actually means would be great!
From the goals of PEP 691:
Enable zero configuration discovery. Clients of the simple API MUST be able to gracefully determine whether a target repository supports this PEP without relying on any form of out of band communication (configuration, prior knowledge, etc). Individual clients MAY choose to require configuration to enable the use of this API, however.
Enable clients to drop support for “legacy” HTML parsing. While it is expected that most clients will keep supporting HTML-only repositories for a while, if not forever, it should be possible for a client to choose to support only the new API formats and no longer invoke an HTML parser.
Enable repositories to drop support for “legacy” HTML formats. Similar to clients, it is expected that most repositories will continue to support HTML responses for a long time, or forever. It should be possible for a repository to choose to only support the new formats.
I’m not sure how better to word that TBH. A lot of this is just baked into the nature of how HTTP works.
The fundamental nature of HTTP is that, given a HTTP request, the server is free to return any content type it wants. There is nothing in HTTP that ever mandates a server returns any given content type ever.
Server driven content negotiation is just a standard that allows clients to indicate, via the Accept header, what content types they would prefer (and in what order their preference is). That indication is more accurately thought of as a hint that the client is sending to the server, rather than any sort of requirement that the server has to listen to it.
That means that the conneg approach chosen by PEP 691 inherently supports a JSON only index, by just having the server ignore the Accept header and always send the JSON content types-- which is exactly how a HTML only index works.
That’s not something that PEP 691 itself had to enable though, it’s just a fundamental part of how HTTP works. Obviously there seem to be a lot of misconceptions around what that means, but it’s nothing we did that enables that (other than by choosing to lean into the HTTP spec for handling this).
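From the client side, the consequence is that the Accept header is only a hint and the response's Content-Type is what decides how to parse. A minimal sketch of that dispatch follows; the function and class names are hypothetical, the weights in `ACCEPT` are only an example, and the HTML branch collects nothing but anchor hrefs.

```python
import json
from html.parser import HTMLParser

JSON_V1 = "application/vnd.pypi.simple.v1+json"
HTML_TYPES = ("application/vnd.pypi.simple.v1+html", "text/html")

# Example Accept header: prefer JSON, but still accept HTML (weights illustrative).
ACCEPT = ", ".join([JSON_V1, HTML_TYPES[0] + ";q=0.2", HTML_TYPES[1] + ";q=0.01"])


class _LinkCollector(HTMLParser):
    """Collect the href of every anchor tag on a project page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def file_urls(content_type, body):
    """Return file URLs from a project page, based on what the server sent."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == JSON_V1:
        return [entry["url"] for entry in json.loads(body)["files"]]
    if media_type in HTML_TYPES:
        parser = _LinkCollector()
        parser.feed(body)
        return parser.hrefs
    raise ValueError("unsupported content type: " + media_type)
```

A JSON-only server and an HTML-only server both work with this client without either of them ever inspecting `ACCEPT`.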
Oh yea. I’m not sure that I’ve ever actually looked at the “actual” specification for index servers on packaging.python.org (it is probably not a surprise that I have the spec pretty well memorized).
Like most things, it’s not that the PEP didn’t state its aims clearly, it’s that the translation from the PEP to the specification document lost a lot of important context. To be blunt, we’re pretty bad at managing the transition from PEP to “living specification”. Most PEP authors (myself included!) treat it as a necessary but annoying bit of busy-work, rather than a key part of the process.
From a technical point of view it’s definitely possible, and you’re right the spec doesn’t say index servers can’t just serve JSON unconditionally. However, it was written long enough ago that it heavily implies that servers that only serve JSON will face significant client compatibility issues.
In that context, this PEP is essentially explicit acknowledgement that a modern server could already skip the HTML entirely, and users of even vaguely recent versions of actively maintained clients wouldn’t even notice the HTML pages were missing.
While PyPI can’t reasonably make that assumption (yet), many folks operating private index servers can actively enforce minimum versions on their clients, so JSON-only servers are likely to be pretty viable now.
I only have second-hand information: Astral have spoken to companies that would like to use upload-time on internal indexes that can only serve static responses (no Accept-based JSON negotiation) and are therefore HTML-only (I assume for historical reasons). Since pip and uv don’t support reading it from HTML, it’s pointless. Take that for what you will.
I’d prefer it if the following text in the PEP was reworded:
Future Python packaging PEPs SHOULD NOT modify the HTML representation of the simple repository API, and MUST instead modify the JSON representation.
The intention is fine, but I think the wording is confusing. I’d prefer to see it worded as:
Future Python packaging PEPs MUST target the JSON representation as the primary form of the simple repository API. They SHOULD NOT make changes to the HTML representation.
Also, the following statement isn’t quite accurate:
One functional consequence of this freeze is that future changes to the simple repository API will be versioned as they are currently, but that only the JSON representation will receive changes to its versioning marker.
In fact, the HTML representation could receive new versions, because PEPs can (with sufficient justification) ignore the recommendation to not update the HTML representation. If that happens, the HTML version definitely should be updated.
Finally, it might be worth extending the “How to teach this” section to add a discussion of how we publicise the fact that once the PEP is accepted, the JSON form of the index is (at least in some sense) preferred. If we want this change to have any real impact[1], we need to get the message across that the JSON form is now the preferred representation, and suppliers of index services and/or software should understand that supporting HTML only means they are deliberately offering a “bare minimum” implementation. One option here is simply to say that user pressure will make that happen naturally, but honestly, I don’t think that’s realistic[2].
[1] Other than making our lives marginally easier when writing standards.
[2] Particularly as most of the benefits are just that the JSON version works “better” in some sense, and you have to experience that sort of benefit to appreciate it.
I believe OpenDev (OpenStack Infra) just rewrites the URLs
While technically correct, what you’re seeing there is simply Apache mod_proxy coupled with mod_cache to minimize the number of requests made by CI test nodes to PyPI over the Internet in the various public cloud regions where we have donated resources. For those curious, the Apache configuration for that proxy is publicly available.
(Ignore the “wheel” rewrites below there, that’s for a separate auxiliary index mixed in via --extra-index-url to backfill platform-specific wheels on old releases of projects that only supplied sdists, we mostly don’t use it any more and have deprecated but not removed it yet.)
Once upon a time we did mirror PyPI through a variety of means, mostly with bandersnatch once it existed, but the storage cost of maintaining a complete mirror became overwhelming when AI-oriented projects started uploading massive nightly dataset builds and the size of the mirror exploded by an order of magnitude or more in a matter of a few months.
We’ve had a few users ask us for hacks/workarounds for this, yeah. One of the frustrating things here is that it’s very hard to actually get in contact with these third-party indices – nobody seems to be willing to use their support contracts to actually file feature requests, because they’ve already assumed that it’ll never get done. So there’s a definite lack of quality signals here, and the third-party index services do not appear to be particularly interested in fixing that (either here or on previous PEPs, from experience).
Thanks, that seems clearer to me.
I can update this to something like “should receive,” but IMO it’s pretty well covered by the use of “SHOULD NOT” instead of “MUST NOT” w/r/t HTML representation changes.
(Or, I think I could probably just remove that paragraph entirely, since it’s kind of in-the-weeds and anything it says imprecisely is already implied more precisely by the rest of the PEP.)
Agreed, I can make that change. I meant the “will make appropriate changes to the living standard” item to imply that, but I can also make it more explicit that I’ll make changes to the living PEP to nudge consumers towards the JSON index if they want newer features (but without deprecating the HTML index, since this PEP isn’t trying to do that).
There’s no mention of updating the HTML format to allow upload-time, in rejected ideas or otherwise. I maintain this field is more important than any of the other reasoning in the PEP, the schema for data fields is defined in HTML, and this is a purely optional field so it is backwards compatible.
So I would want a rationalization of why upload-time specifically should not be added, so we can point users to it when they complain that resolver and installers can’t support cooldowns for HTML simple indexes.
And I will note that this may lead to some clients using the HTTP Last-Modified header in lieu of this missing data field, as this is a real request users are asking of uv.
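For anyone unfamiliar with what a "cooldown" needs from the index, here is a rough sketch of the check an installer might do against the JSON representation. It assumes the project page has already been parsed into a dict with a "files" list whose entries may carry an ISO 8601 "upload-time" string in UTC; the function names and the policy of skipping files without the field are assumptions, not how pip or uv actually behave.

```python
from datetime import datetime, timedelta, timezone


def parse_upload_time(value):
    """Parse an ISO 8601 timestamp, tolerating a trailing "Z" on older Pythons."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def files_past_cooldown(project_page, cooldown, now=None):
    """Return filenames uploaded at least `cooldown` ago.

    Files without "upload-time" are skipped here; that is a policy
    assumption for this sketch, and a real tool might warn or fall back.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - cooldown
    eligible = []
    for entry in project_page.get("files", []):
        stamp = entry.get("upload-time")
        if stamp is None:
            continue
        if parse_upload_time(stamp) <= cutoff:
            eligible.append(entry["filename"])
    return eligible
```

With an HTML-only index there is no field to feed into `parse_upload_time`, which is where the temptation to fall back on Last-Modified comes from.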