PEP 691 is proposed. This PEP addresses the use of HTML for the “Simple Repository API” that has been in place for many years and has served us well. We’d like to move to a more tool-parsable format and potentially allow future PEPs to enhance it for different use cases.
The first main use case is for pip. We would like to change as little as possible here, outside the few things outlined in the PEP. The main goal is to keep the response as brief as possible, with people using PEP 658 to obtain more metadata about a package.
The link to the Discourse thread in the PEP is invalid. It looks like it’s still the placeholder from the template. (https://discuss.python.org/t/AAAAAA/999999).
My concern with using HTTP headers rather than separate URLs is that it prevents basic/static clients from supporting the JSON API. Currently you can throw some files in a folder with a few index.html files and serve them with Apache, nginx etc., or even put them in GitHub Pages, where there isn’t necessarily control over the headers. There’s no reason you couldn’t do the same with a JSON API, since the metadata is entirely static after the packages have been uploaded. A tool like my simple503 could be used to generate those JSON files automatically.
I realise this is probably a niche use case, but I do think a simple API should be simple for both the server and the client.
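To make the static-hosting point concrete, here’s a minimal sketch of generating PEP 691-style static JSON files from a directory of distributions. The directory layout and the download URL scheme are my own assumptions, not something the PEP or simple503 prescribes:

```python
import hashlib
import json
from pathlib import Path

def write_static_index(packages_dir: Path, out_dir: Path) -> None:
    """Generate static PEP 691-style JSON index files.

    Assumes distribution files laid out as packages_dir/<project>/<file>;
    the download URL scheme below is an assumption too.
    """
    projects = []
    for project_dir in sorted(p for p in packages_dir.iterdir() if p.is_dir()):
        files = []
        for dist in sorted(project_dir.iterdir()):
            files.append({
                "filename": dist.name,
                "url": f"/packages/{project_dir.name}/{dist.name}",
                "hashes": {"sha256": hashlib.sha256(dist.read_bytes()).hexdigest()},
            })
        # Per-project detail page: /simple/<project>/index.json
        project_out = out_dir / project_dir.name
        project_out.mkdir(parents=True, exist_ok=True)
        (project_out / "index.json").write_text(json.dumps({
            "meta": {"api-version": "1.0"},
            "name": project_dir.name,
            "files": files,
        }))
        projects.append({"name": project_dir.name})
    # Root project list: /simple/index.json
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "index.json").write_text(json.dumps({
        "meta": {"api-version": "1.0"},
        "projects": projects,
    }))
```

The resulting files can then be served by any static server, with the caveats about Content-Type discussed later in this thread.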
One value that would be useful, which exists in the current PyPI JSON API but not in this spec or the metadata, is the upload time.
It would be useful to standardize this feature so when resolvers are doing an initial solve it’s possible to offer the feature “don’t get packages uploaded more than 5 years ago” or “prioritize based on recent upload time”.
While the spec is only about reproducing the Simple Repository API in JSON and allows for future enhancements, my concern is that because the spec states “Additional keys may be added to any dictionary objects in the API responses”, implementations may add their own custom upload-time key before a standard name and format is agreed.
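For illustration, if such a key were standardized, a resolver-side filter could look like the sketch below. The `upload-time` key name and ISO 8601 format are purely hypothetical here, which is exactly the kind of detail that would need agreeing:

```python
from datetime import datetime, timedelta, timezone

def filter_recent(files: list, max_age_days: int) -> list:
    """Keep only files uploaded within the last max_age_days.

    Assumes a hypothetical standardized "upload-time" key holding an
    ISO 8601 timestamp with a UTC offset; files without the key are
    kept conservatively, since we can't tell how old they are.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    kept = []
    for f in files:
        stamp = f.get("upload-time")
        if stamp is None or datetime.fromisoformat(stamp) >= cutoff:
            kept.append(f)
    return kept
```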
Is name a project name (canonical) or a package name? I believe it’s the latter, and if it’s not, then it would be great to include the package name as well.
Second: do you think it’s worth including ETag and Last-Modified in the specification as well, or is that server-implementation specific?
I need to think about the content negotiation approach some more. My initial reaction was a massive -1, but having read the FAQ entry, the argument about zero-configuration discovery is reasonable. However:
I think the PEP needs to address the fact that this means that static servers are no longer supported. I know of many cases where people put indexes on GitHub Pages, S3, or similar. What will those people do under PEP 691? This may be a big enough issue to prevent content negotiation as a solution, in spite of the benefits suggested in the FAQ[1].
TUF support requires that we have distinct URLs (if I read the PEP correctly). So what, we require indexes to provide both? At the very least the FAQ should be explicit here, that content negotiation doesn’t support TUF, but rather than reject content negotiation, the decision was made to bolt on explicit URLs as well (and the FAQ should explain why that was deemed sufficient).
I think that allowing servers to serve JSON when no “Accept” header is given is a bad idea, as it means existing clients will fail because they get back invalid HTML. To fix that they may as well send an “Accept” header – there’s no scenario I can imagine where a client would ever want to read the URL with no “Accept” header and see JSON.
Also, a minor point, but allowing indexes to serve JSON when no header is given makes “View source” in browsers useless (I assume the JSON won’t be pretty-printed, and I assume browsers don’t send an “Accept” header). I use “View source” as a quick way of checking things like requires_python metadata.
The zero-configuration goal is simply stated as a requirement, without justification. I think there’s an argument that it isn’t necessary, and servers could reasonably have separate URLs for PEP 503 and PEP 691. Clients would then control when they switch over to PEP 691 just by changing the default URL for PyPI (and users could opt in or out by setting an alternative URL).
I’d also say that there’s a subtext in the PEP that PyPI are hoping to drop support for PEP 503 in favour of PEP 691. I do not think that should happen without its own, independent, PEP and migration plan that goes into a lot more detail[2]. In fact, I think that PEP 691 should explicitly state that it is intended to exist alongside PEP 503, and major index providers like PyPI must develop their own migration and compatibility plan before dropping PEP 503 support in favour of PEP 691.
This was done through a discussion between pip and bandersnatch maintainers, who are the first two potential users for the new API.
Well, there’s also PyPI. Are you assuming that everyone realises PyPI is represented because the PEP authors are Warehouse maintainers? What about other index providers (who are often both producers and consumers of the index API, because they offer mirroring)? There’s devpi and Artifactory that come to mind, and I think Azure has an index server as well. Have they been consulted? The whole point about having a PEP is so that clients can avoid doing stuff that’s PyPI-specific, so I’d like to see buy-in from other index providers.
In pip’s tracker, we get a lot of people saying they can’t host a proxying index server. Which makes me wonder how they host their local packages, if not via a statically-delivered index? ↩︎
In particular, I think it’s naive to assume that all access to the simple index goes via pip. I’m fairly confident that there are a lot of tools and scripts out there that parse PEP 503 data. Working out how to ensure that all those scripts won’t get broken should be part of any plan. ↩︎
Thank you. I cannot believe that in 2022, pypi.org still only provides the legacy HTML API. I expected that someone would have addressed this issue earlier. Well, better late than never!
One value that would be useful, which exists in the current PyPI JSON API but not in this spec or the metadata, is the upload time.
This isn’t a replacement for the current JSON API, but rather for the current HTML based simple API (which is primarily meant for installers like pip).
Information like the upload time isn’t particularly useful for installers but is certainly metadata that could be useful to other consumers. Adding something like that to what currently is PyPI’s JSON API (pypi.org/pypi/*/json – we need to come up with better names :P) could make sense but is separate from this PEP IMO.
There’s no way such a client would work though, if (for example) an index server implementation only supports the JSON API. The only option there is somehow presenting a not-silent error, which… realistically, this is going to result in.
I’m imagining that indexes that would end up implementing this behaviour (the JSON-by-default model) would be doing so only if they know that the only clients they’ll have are compatible with this. That’s quite a while away but shutting out that possibility doesn’t seem like a particularly worthwhile exercise TBH.
Not allowing this would mean that JSON-supporting clients would need to unconditionally advertise that they support JSON – what they’re going to do anyway. That isn’t the worst thing though, so…
Firefox will present JSON content as a navigable tree, similar to how the existing PyPI JSON API looks.
It’s not useful to installers, but it is useful to resolvers, which pip also is.
My concern about the PyPI JSON API is that it is not standardized and therefore not implemented by other index software such as Artifactory. Also, because the PyPI JSON API isn’t standardized, pip’s resolver can’t take advantage of it.
This JSON-based replacement API already adds fields over the HTML API, such as dist-info-metadata-available, so it would seem that now is the moment to standardize any other very useful field. Upload time is useful enough that it was added to the PyPI JSON API, but there is no standardized way to get this information beyond that.
Yes. If a server doesn’t support PEP 503, a PEP 503 client can’t handle it. That’s not surprising, although it’s the reason I think we should require servers to have a good transition plan before dropping PEP 503 support.
But my point is why would a client that is compatible with PEP 691 not be sending an “Accept” header? It’s a requirement in the PEP. So the only clients not sending an “Accept” header can be assumed to not know about PEP 691, and therefore be PEP 503 only. Hence why HTML should be the default.
I guess if we ever, in the far future, declare PEP 503 officially desupported, we could at that stage switch the default to be PEP 691 JSON. But that would require a new PEP (a meta-PEP? I don’t think we’ve ever desupported a specification like that in the past).
Precisely. They have to according to the spec.
Dammit, I knew I should have checked. I used to use Chrome (which didn’t, as far as I recall) but switched a little while ago, and haven’t had occasion to do that since. In an attempt to recover my dignity I will note that Ctrl-F to find text doesn’t work on the JSON view (“Filter JSON” does, but it’s slightly different). But whatever, it was a minor point anyway.
This isn’t exactly true TBH, but it’s not not true either.
It roughly depends on what capabilities your static server has.
S3 lets you set a custom content type to whatever you want, but it has no support for content negotiation. So you can easily make an index that is either HTML or the new JSON type, but you can’t make a single endpoint that does both transparently. You can however make two endpoints if you want, a JSON and a HTML one, and then use client configuration to select between them. You could also slap a cloudfront distribution in front of S3 and use a lambda@edge task to handle content negotiation.
GitHub Pages only understands a set number of content types, text/html being one of them, but not the custom content types. You’d not be able to support the new JSON API on GitHub Pages unless they added support for it. That being said, we could add application/json as an alias for application/vnd.pypi.simple+json (the non-versioned content type), and then you could also host it on GitHub Pages with the same caveats as S3 (a single endpoint can ONLY be HTML OR JSON).
Apache has built in support for this, so can fully support it with just a static directory + configuration.
Other servers will vary, but at a minimum they can support multiple URLs.
It’s probably worthwhile to spell it out in the PEP, but since a repository isn’t required to support both JSON and HTML at the same URL and can make either the default, there’s no reason why a particular repository has to restrict itself to a single endpoint. It can simply rely on configuration, having users pick either the JSON or the HTML endpoint depending on the capabilities of their client. Content negotiation degrades gracefully here.
TUF requires distinctly named targets; they’re not URLs, they’re string keys of “files”, but it’s application specific what those string keys actually refer to. The PEP mentions this, but we can just have two different targets in TUF, one for HTML and one for JSON, and have the target key include the content type. It doesn’t have to match the actual URL being requested. If I remember correctly, the TUF stuff already diverges the target key from the URL, because I think instead of /simple/requests/ it does /simple/requests.html.
I had a comment on one of the earlier drafts that didn’t (yet) bubble into the PEP.
Content Negotiation is a standard HTTP feature, and when it can’t find an acceptable content type to serve (either because the client asked for none, or because it asked for only ones the server didn’t understand) it allows servers to either error (with a HTTP 406) or to serve a default content type.
I think the PEP should generally leave it up to the individual repository to decide what to do when content negotiation fails. Personally I’m a fan of having no Accept header fail, because that scenario is primarily going to be automated clients and that will encourage them to add an appropriate Accept header. If we let the repository pick though, we’re more flexible and can possibly cover more edge cases.
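A rough sketch of the server-side choice being described, assuming the repository supports the content types from this PEP. Real negotiation would also honour q-weights, which this deliberately ignores:

```python
from typing import Optional

# Content types this hypothetical repository can produce, most preferred first.
SUPPORTED = [
    "application/vnd.pypi.simple.v1+json",
    "application/vnd.pypi.simple.v1+html",
    "text/html",
]

def negotiate(accept_header: Optional[str],
              default: Optional[str] = None) -> Optional[str]:
    """Pick the first acceptable content type from an Accept header.

    Returns `default` when nothing matches; a None default models the
    repository choosing to answer with an HTTP 406, while a non-None
    default models serving a fallback content type instead.
    """
    if accept_header is None:
        return default
    offered = [part.split(";")[0].strip() for part in accept_header.split(",")]
    for content_type in offered:
        if content_type in SUPPORTED:
            return content_type
        if content_type == "*/*":
            return SUPPORTED[0]
    return default
```

With `default=None` a missing or unsatisfiable Accept header fails, as I’d prefer; a repository could instead pass `default="text/html"` to keep legacy clients working.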
Browsers do send an Accept header. Most websites are using content negotiation without people ever noticing.
The Edge browser’s default Accept header effectively says: I prefer HTML, then XHTML, then XML, then some image types, and finally I’ll accept absolutely anything.
Most modern browsers will have a similar default accept header.
So that would mean that as long as a repository supported text/html, browsers would get that (well, technically the individual repository implementation gets to pick what it serves, since content negotiation is the client asking nicely and then the server picking). If the repository no longer supported text/html, the browser would get a JSON response.
JSON normally isn’t pretty-printed by default, but most browsers have a lightweight addon you can install to get it. For instance, I use JSON Lite on Edge.
We should expand upon that for sure. Roughly speaking, the justification is that it removes the requirement to coordinate between client and server, or to have end users care about whether a URL is JSON or HTML.
Take pip, for instance: of course pip can manage its own default pretty easily. That’s not a problem.
However, let’s say I have a private repository and it supports JSON. How do I tell pip about this?
Well if I just do --index-url https://example.com/my-json-repo/, pip has no way to know if that is JSON or HTML, so it’s going to have to assume it’s HTML. So pip would have to do something like --json-index-url https://example.com/my-json-repo/.
Now you could say “can’t pip just request the URL regardless, and dispatch off of whatever the returned Content-Type is?”, and yes, you definitely can. Which is basically exactly what content negotiation is.
In fact, pip is already using content negotiation: when it makes an HTTP request it passes an Accept: text/html header, so if you wanted to do “dispatch based on the returned content type, but have separate URLs for JSON vs HTML”, pip would have to update its Accept header to list both the HTML and the JSON content type.
There is no plan to have PyPI drop support for HTML responses. The PEP already states:
Similar to clients, it is expected that most repositories will continue to support HTML responses for a long time, or forever. It should be possible for a repository to choose to only support the new formats.
Which is primarily intended to give random private repositories the right to not bother implementing HTML responses if they don’t want to, but the PEP expects that most repositories are not going to drop support for HTML anytime soon, maybe never.
We don’t have a reasonable way to require people to make a migration plan. Like we can certainly put some words in a PEP to say that, but we have no recourse to force them to do so. I prefer to give people the tools they need to handle it, and assume that they’ll make reasonable choices for their situation. For PyPI we wouldn’t drop text/html support without a PEP.
I just want to highlight this particular thing to emphasize it.
PEP 691 does not require that a repository use content negotiation to select between JSON and HTML, and in fact it would not be possible for us to require that unless we required a repository to support both JSON and HTML forever.
PEP 691 (mostly) does not require clients directly use content negotiation anymore than they already are. A client could implement PEP 691 by having a --html-repo-url that only sends Accept: text/html (essentially what pip is doing today for --index-url) and a --json-repo-url that only sends Accept: application/vnd.pypi.simple.v1+json.
The only case where PEP 691 would require clients to implement content negotiation is if all of these are true:
They don’t send an Accept header already.
The repository is implemented expecting clients to pick using content negotiation.
The repository has either chosen to return a HTTP 406 error in light of (1) or has returned a default content type that the client wasn’t expecting.
Otherwise, clients and repositories are free to implement this using separate URLs and out-of-band configuration, and things will just work. This is one of the benefits of using content negotiation, which is a foundational part of how HTTP works all across the internet.
In addition, even if a client or repository chooses to use out of band negotiation (vs content negotiation), this PEP still benefits them, since it defines the content type and JSON responses.
It’s the project name, same as what’s on the simple API today.
There is no new metadata in this PEP.
dist-info-metadata-available is a translation of PEP 658, which added a data-dist-info-metadata attribute to the simple HTML.
The only new capability/metadata added by this PEP to the content of the API is the ability to have multiple hashes instead of HTML’s ability to have a single hash. Which was explicitly called out in the PEP with rationale.
We don’t want to really allow scope creep here, this is a pretty reasonably sized chunk of stuff to discuss, and additional metadata will just make it more difficult.
Fake Edit: That’s slightly a lie; technically the project name being in the API response isn’t part of PEP 503, but it has existed on PyPI (inside the title tag and an h1 tag) for as long as the simple API has existed. So you could call that technically a new piece of required metadata.
I think that’s just part of HTTP, and it’s generally just assumed to be something that can exist as a consequence of our use of HTTP.
Content Negotiation is also part of HTTP, but we’re specifying it here because we’re looking to use it in a specific way, to solve a specific problem.
I have to admit the use of the Accept header here makes me nervous.
As a developer, the only API I frequently interact with that uses Accept is the GitHub API – and it’s pretty frustrating. I mostly explore REST APIs directly in my browser – but with APIs that vary on the Accept header I can’t do that any more.
I can switch to curl in my laptop, but that’s harder on my phone (I do a lot of programming research on my phone).
More importantly, I can’t create links to API examples and easily share them with other developers or save them to my notes.
Any chance of offering a ?accept=application/vnd.pypi.simple.v1+html query string option as well?
I think that’s compatible with the overall goals - and it’s not yet mentioned as a considered alternative in the section about content negotiation at the end of the PEP. And it would let me link to examples.
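A sketch of how the suggested (and entirely hypothetical) `?accept=` parameter could layer over header negotiation on the server side. Note one wrinkle: the `+` in the content type would need to be percent-encoded as `%2B` in a query string, since `+` otherwise decodes to a space:

```python
from urllib.parse import parse_qs, urlsplit

def effective_accept(url: str, accept_header: str) -> str:
    """Return the content type to negotiate with.

    A hypothetical ?accept=... query parameter, when present, overrides
    the Accept header; otherwise the header is used unchanged. The `+`
    in the content type must arrive percent-encoded as %2B, because
    parse_qs decodes a literal `+` to a space.
    """
    query = parse_qs(urlsplit(url).query)
    override = query.get("accept")
    return override[0] if override else accept_header
```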
My second problem with Accept headers is more obscure but could be relevant here: if you serve content that changes based on the Accept header, you need to worry about whether the wrong version of the content might be cached by an intermediary. Especially if you plan to use a caching CDN such as Cloudflare to help with distribution.
The HTTP Vary header is designed to address this – but infuriatingly, Cloudflare only supports the Vary header for images and CORS headers. They deliberately do not support Vary: Accept, which makes the Accept header unsafe to use with anything cached behind Cloudflare!
I have bugged them about this before. Maybe the PSF has enough clout to encourage them to change their mind on this one?
Just chiming in here to say that I don’t expect PyPI to drop support for PEP 503 anytime soon, and it’s not a pressing need to remove support or a maintenance burden. The pressing need is to provide a non-HTML API that we can build upon.
The PEP says:
Similar to clients, it is expected that most repositories will continue to support HTML responses for a long time, or forever.
But if there’s something specific in the PEP that gives you the inclination that PyPI is hoping to drop support, please raise that because it’s not the intention.
Edit: I missed that @dstufft said basically the same thing in his response.
We would probably have to word it correctly to make it optional, things like Apache or what have you aren’t going to understand that and will just ignore it. But with that caveat it seems fine to me.
We would probably want to strongly recommend that clients don’t rely on it, and state that it’s primarily intended for browser based API exploration.
PyPI would probably just strip it and copy the data into the Accept header at the CDN edge, so for our purposes it wouldn’t look any different than the Accept header past the edge.
Wow that is ridiculous. Breaking standard HTTP features is incredibly weird for a CDN.
Anyway, it should be easy to work around. Even if we don’t add the query param, you could work around it by using a Cloudflare Worker at the edge to modify the request, turning the Accept header into a query string. If we do add it, then you have a built-in mechanism for it.
Ah, thanks, I see now. Yes, that is definitely worth making clear in the PEP. I’d go so far as to say that the PEP should present the JSON API initially as a new format for the simple API, with no mention of content negotiation at all. It can then state that servers that want to offer both APIs MAY serve both types under the same URL, using content negotiation, or they may serve them under distinct URLs, requiring the client to choose which URL (and hence API) to use.
That also makes it much clearer why allowing an index to serve JSON by default is reasonable.
Presumably the server is required to return a Content-Type header, as that appears to be part of the content negotiation protocol. But again, it would be good to be explicit, and note in the PEP that this is required. Presumably therefore, if a client requests data from an index and gets back something without a Content-Type, it should therefore assume it’s seeing a PEP 503 response (and hence it’s HTML).
But I do think the PEP should spell out some of the implications, as not everyone reading it will know the details. And more to the point, people writing a simple script to query a package index won’t necessarily know that the protocol is more complex than requests.get(url), so we should be explicit. Maybe it would be worth adding some example client code to the PEP, showing how to query an index, and handling the content negotiation, would be useful (I know I’d find it helpful when updating my various scripts!)
I’m honestly more interested in having what the client should do specified. I’m honestly struggling to think through the various combinations that could arise (client expecting to see PEP 503, getting PEP 691; client calling a PEP 503 server as if it were 691; client asking for HTML explicitly but the server can only return JSON, with two cases where either the server returns 406 or the server returns HTML…). A robust client interaction seems like it could be way harder than PEP 503. Or maybe I’m over-thinking things. Again, having the algorithm explicitly spelled out in the PEP would help a lot[1].
Well, it turns the “requirement to co-ordinate” into a requirement to implement the HTTP content-negotiation protocol. That’s still co-ordinating, and I think it’s naïve to assume readers will know how to do that any better than custom negotiation.
Maybe it would be enough for the paragraph you quoted to be strengthened. It certainly felt to me as if the message was “JSON is the new, better way, HTML is legacy”. Maybe it would be enough to say “The JSON API should not be seen as a replacement for the HTML API - indexes which currently serve HTML can add JSON support, and may (with suitable deprecation) drop HTML, but they should not replace their existing HTML API with JSON. New indexes are recommended to implement both APIs for maximum compatibility.”[2]
Actually, this feels complex enough that it’s bordering on needing a library that implements it so that clients don’t have to keep reinventing it. ↩︎
This wording strongly reflects my bias as the writer of clients. Index maintainers will clearly have a different view (“why must I provide two APIs?”). ↩︎
Yea, while I think we should leave it up to the repository to make those specific decisions, I think it would be reasonable to spell out the implications of those decisions, and to provide a recommended set of choices as a sane default for people.
For what it’s worth, this is roughly what it looks like on the client side if you don’t care about working with every repository (e.g. you’re happy to hard code either HTML or JSON).
import requests

# You can technically omit this completely if the server you're talking to
# returns a default content-type that you find acceptable, however it's
# recommended to explicitly define what you're trying to accept.
headers = {
    "Accept": "application/vnd.pypi.simple.v1+json",
    # This is if you wanted HTML
    # "Accept": "application/vnd.pypi.simple.v1+html",
    # This is if you wanted HTML, but via the legacy compat name,
    # which is what pip currently does.
    # "Accept": "text/html",
}

resp = requests.get("https://pypi.org/simple/", headers=headers)

# If the server does not support the content type you requested, AND
# it has chosen to return a HTTP 406 error instead of a default response
# then this will raise an exception for the 406 error.
resp.raise_for_status()

# Check to see if the content-type we got back matches what we asked for.
# This is actually optional, but since the server could have returned something
# other than what we asked for, it's better to check up front rather than wait
# for it to error out when we try to process it.
#
# Note: Check against the content type you actually expected.
content_type, *_ = resp.headers.get("content-type", "").split(";")
if content_type != "application/vnd.pypi.simple.v1+json":
    raise Exception(f"Invalid content type: {content_type}")

# Do something with the data
print(resp.json())
If you wanted to implement that so you supported any of the content types (E.g. how pip would implement this in the future), that would look something like:
import requests

content_types = [
    "application/vnd.pypi.simple.v1+json",
    "application/vnd.pypi.simple.v1+html",
    "text/html",  # For legacy compatibility
]

resp = requests.get(
    "https://pypi.org/simple/",
    headers={"Accept": ", ".join(content_types)},
)

# If the server does not support any of the content types you requested,
# AND it has chosen to return a HTTP 406 error instead of a default
# response then this will raise an exception for the 406 error.
resp.raise_for_status()

# Dispatch based on the content type:
content_type, *_ = resp.headers.get("content-type", "").split(";")
if content_type == "application/vnd.pypi.simple.v1+json":
    data = resp.json()
elif content_type in {"application/vnd.pypi.simple.v1+html", "text/html"}:
    data = parse_html(resp)
else:
    raise Exception(f"Invalid content type: {content_type}")

# Do something with the data
print(data)
That’s really the extent of it from the client’s POV. You ask for the content type(s) you support, get a response, check if you got what you wanted, then do something with it.
What about if you’re pointed at a PEP 503 only repository? That would return HTML, but would not necessarily set a content type. So raising an exception on no content type is wrong. Should the line getting the content type be resp.headers.get("content-type", "text/html") to cover that case? Is that sufficient?
This is what I mean about it being tricky to get right…
def _ensure_html_header(response: Response) -> None:
    """Check the Content-Type header to ensure the response contains HTML.

    Raises `_NotHTML` if the content type is not text/html.
    """
    content_type = response.headers.get("Content-Type", "")
    if not content_type.lower().startswith("text/html"):
        raise _NotHTML(content_type, response.request.method)


def _get_html_response(url: str, session: PipSession) -> Response:
    """Access an HTML page with GET, and return the response.

    This consists of three parts:

    1. If the URL looks suspiciously like an archive, send a HEAD first to
       check the Content-Type is HTML, to avoid downloading a large file.
       Raise `_NotHTTP` if the content type cannot be determined, or
       `_NotHTML` if it is not HTML.
    2. Actually perform the request. Raise HTTP exceptions on network failures.
    3. Check the Content-Type header to make sure we got HTML, and raise
       `_NotHTML` otherwise.
    """
    if is_archive_file(Link(url).filename):
        _ensure_html_response(url, session=session)

    logger.debug("Getting page %s", redact_auth_from_url(url))

    resp = session.get(
        url,
        headers={
            "Accept": "text/html",
            # We don't want to blindly returned cached data for
            # /simple/, because authors generally expecting that
            # twine upload && pip install will function, but if
            # they've done a pip install in the last ~10 minutes
            # it won't. Thus by setting this to zero we will not
            # blindly use any cached data, however the benefit of
            # using max-age=0 instead of no-cache, is that we will
            # still support conditional requests, so we will still
            # minimize traffic sent in cases where the page hasn't
            # changed at all, we will just always incur the round
            # trip for the conditional GET now instead of only
            # once per 10 minutes.
            # For more information, please see pypa/pip#5670.
            "Cache-Control": "max-age=0",
        },
    )
    raise_for_status(resp)

    # The check for archives above only works if the url ends with
    # something that looks like an archive. However that is not a
    # requirement of an url. Unless we issue a HEAD request on every
    # url we cannot know ahead of time for sure if something is HTML
    # or not. However we can check after we've downloaded it.
    _ensure_html_header(resp)

    return resp
IIRC, pip has refused to work with something that didn’t return a text/html content-type for… maybe forever? I believe a similar check has been there for as long as I can remember, unless my memory is bad.
Technically the Content-Type header on a response is not a mandatory header, it is a SHOULD header. Which allows omitting, but says things are probably going to break if you don’t include it.
Given pip + the SHOULD language, I don’t think we particularly need to worry about it, but if I were to worry about it, I wouldn’t change the code I posted. Without a content-type you can’t know how to interpret the response, so it’s either an error (if you choose to explicitly error out) or you can keep trucking and see if you fail somewhere along the way.
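For completeness, a lenient client along those lines might classify responses like this. Treating a missing Content-Type as legacy HTML is an assumption on my part (that PEP 503-only servers predate the JSON content types), not something the PEP mandates; a strict client would raise instead of defaulting:

```python
from typing import Optional

def classify_response(content_type_header: Optional[str]) -> str:
    """Classify a simple-API response by its Content-Type header.

    A missing header is assumed to mean a legacy PEP 503 HTML server
    (the "keep trucking" option); parameters like charset are ignored.
    Raises ValueError for content types we don't know how to parse.
    """
    content_type = (content_type_header or "text/html").split(";")[0].strip().lower()
    if content_type == "application/vnd.pypi.simple.v1+json":
        return "json"
    if content_type in ("application/vnd.pypi.simple.v1+html", "text/html"):
        return "html"
    raise ValueError(f"Unhandled content type: {content_type}")
```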