PEP 691: JSON-based Simple API for Python Package Indexes

I don’t have time right this moment to look at the other comment; I just wanted to say that I knew the cgi module was being deprecated, but I used this method anyway because doing it with the email module is somewhat noisier, and I felt it distracted from the actual meat and potatoes of the example code that showed the overall flow.

If folks really think it’s super useful to have

import email.message

def parse_header(header):
    # Stands in for the deprecated cgi.parse_header(): returns the
    # content type plus a dict of its parameters, e.g.
    # parse_header("text/html; charset=utf-8")
    # -> ("text/html", {"charset": "utf-8"})
    m = email.message.Message()
    m["content-type"] = header
    ct, *params_raw = m.get_params()
    return ct[0], dict(params_raw)

At the top of the code, I can add it. It just felt like noise to me.

FYI I’m going to wait until I’m told the PEP is done and ready on my feedback before I dive back into it.

2 Likes

Does anyone have any other thoughts? Concerns? Anything :slight_smile: I think we’ve covered most of the concerns people have had, but if we haven’t, I’d love to figure them out and get them handled.

3 Likes

I don’t want to sound too pessimistic (the work on this is very much appreciated), but some push back can be healthy. Although I agree with taking an incremental approach, I’m not sure this is really a step forward. The complexity introduced by this PEP has a cost, and if the reward is not significant, I don’t think it is worth it.

My main question(s) would be: for whom is this PEP, who will benefit from this? and when? (now, nearby future, or far future)

Although JSON in general might be nicer to parse, in its current state the HTML Simple API is much easier to implement than the JSON API proposed in this PEP will be behind content negotiation, with all the possible alternative responses and errors. And the complex logic of HTML, as stated under “Abstract”, is not really present in the Simple API (html.parser works fine).

I also get that this is maybe not about an improvement for today, but instead an intermediate step for future improvements. But in that case, I don’t think this PEP is laying out a solid foundation for that.

In my opinion the real value of this PEP will manifest only after tools (client and server) drop HTML support. And before that eventually happens, a better solution should already be around.

This is a lot of complaining from my side without providing any solutions. After thinking about this for a while, I can’t think of great alternatives, at least not without (partly) dropping the zero-configuration requirement.

Which leads me to think that maybe we shouldn’t continue with this PEP at all…

3 Likes

It’s fine-ish. As someone who has implemented code to handle the HTML-based Simple API, I can say html.parser is not exactly a robust parser. It’s fine for simple things, but there’s no guarantee it will succeed on valid HTML.

Plus it’s way easier to find libraries to consume JSON than HTML in other languages these days (and that is important for tooling purposes).

That’s typically not how we evolve standards, because it makes switching harder. By keeping the overall data model the same and changing only the parsing step, this PEP makes the change happen at the edges of your code rather than at the logic level (e.g. it’s more like changing how you encode/decode strings than like switching to integers for everything).

3 Likes

Of course! I welcome people to pick apart these proposals :slight_smile:

In the very short term, I suspect nobody will benefit, since the very short term will be all cost (the cost of having to implement this thing) and no benefit (it’s expected that everyone will continue to maintain their existing HTML parsing solutions, and the data will be largely 1:1).

In the longer term, we have a couple of benefits:

  • People implementing repositories and clients that implement this API can, on their own time schedule, start dropping support for HTML responses. The expectation is that PyPI and pip will likely maintain theirs for quite some time just due to their positions within the ecosystem, but projects without those constraints are enabled to be much more aggressive in dropping support.
    • This includes brand new projects, which may decide never to implement the HTML content type at all.
  • It unlocks the ability to start adding new features that are no longer constrained by the limitations of HTML.

There are a couple of things here that I don’t agree with.

The first is that I don’t think content negotiation is actually harder than the current situation. Content negotiation is a foundational part of how HTTP works, and every client has to be prepared to cope with it in every request.

To expand on that, there is not actually a way to make an HTTP request that doesn’t, at its core, boil down to content negotiation. So currently, when you make an HTTP request to a Simple API, you can either include an Accept: text/html header or not.

If you do not, then the server is, by nature of HTTP, welcome to choose any content type it wants, or return an error. If you do send an Accept header, again the server is free to use that information in guiding what it will return, or it can ignore it and return whatever it wants if it doesn’t support that.

The important bit here is that this is fundamentally just content negotiation, whether you’re not including the Accept header (which tells the server that you’re happy with whatever representation it gives you) or whether you are (which tells the server you prefer text/html).

In both cases, you may not get the content type that you expect; there is no way in HTTP to mandate that you only get the correct content type, and you have to be ready to cope, in some fashion, with the fact that you may not. Now granted, in practice most servers will return the content type that you expect, and in the cases they don’t, you can just assume they did; at some point you will hit a point where the assumptions you made about the response content don’t hold and you’ll get some random error.

But that’s all mostly true with this PEP too, you can just assume that the server sent you the content type you expected.

You can also just not send an Accept header at all and assume the server will send you something you expect, which matches the simplest possible client implementation today. The only difference is that there is a greater chance the server won’t send you what you expect (previously it should only have returned text/html, but now it could return other content types as well), so it’s really recommended that you at least include an Accept header.

I will go back to my example code. Here’s the absolute simplest code that will more or less reliably do what you want in most cases with the existing API:

import requests

resp = requests.get("https://pypi.org/simple/")
resp.raise_for_status()

data = parse_html(resp)

Here’s the same absolute simplest code with the changes in the PEP, assuming that you’re handling the most complex case possible, of supporting both HTML and JSON:

import requests

resp = requests.get(
    "https://pypi.org/simple/",
    headers={"Accept": "application/vnd.pypi.simple.v1+json, application/vnd.pypi.simple.v1+html, text/html"},
)
resp.raise_for_status()

if "application/vnd.pypi.simple.v1+json" in resp.headers.get("content-type", ""):
    data = parse_json(resp)
else:
    data = parse_html(resp)

This isn’t as robust as the example code in the PEP, but it’s as robust as the existing code was (it’s actually technically slightly more robust!). It makes an HTTP request, then assumes that the content type is something it understands; if not, it will error out at some point.

But if you look at these two things, the additional complexity caused by content negotiation is… an extra dictionary being passed to requests.get(), and an extra conditional on the response. That’s hardly what I’d call a lot of extra complexity, and in fact it matches what pip itself does today (other than the addition of the application/vnd.pypi.simple.* types; pip’s conditional just raises an error if the response isn’t text/html).

On the server side there is some additional complexity in parsing and selecting the content type to respond with, but all of the major web frameworks that I could find support it, as do some of the static file servers (some don’t).

The other statement here is that the complex logic of HTML isn’t present in the simple API, but that’s not actually true IMO, because of these two lines from PEP 503:

URL must respond with a valid HTML5 page
There may be any other HTML elements on the API pages as long as the required anchor elements exist.

That means that a fully PEP 503 conformant client MUST be prepared to accept a response body that contains literally any valid HTML5 content, regardless of what that content is. Now, in practice it’s highly unusual to put something in your simple response that html.parser can’t parse, so you can most likely ignore that requirement of the PEP without any ill effect, but doing so means that you’re deviating from the PEP.
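For concreteness, here is a minimal sketch (my own illustration, not code from the PEP or from any real client) of the kind of html.parser-based anchor extraction being discussed. It is enough for well-behaved index pages, but it makes no promises about arbitrary valid HTML5:

```python
from html.parser import HTMLParser

class SimpleIndexParser(HTMLParser):
    """Collect (href, text) pairs from anchor tags on a simple index page."""

    def __init__(self):
        super().__init__()
        self.links = []      # completed (href, text) pairs
        self._href = None    # href of the anchor currently being parsed
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None

parser = SimpleIndexParser()
parser.feed('<a href="/simple/pip/">pip</a><br/>\n<a href="/simple/numpy/">numpy</a>')
print(parser.links)  # [('/simple/pip/', 'pip'), ('/simple/numpy/', 'numpy')]
```

This works fine on pages shaped like the ones PyPI actually serves today; the point above is that PEP 503 technically permits pages this approach could mishandle.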

Here again, I don’t agree with this conclusion.

I think this does represent an intermediate step for future improvements, because a major blocker to improvements right now is trying to fit things into the capabilities of HTML. For example, something we would like to do is add all of the dependencies for a project to the response, but there isn’t really a good way to serialize a list of data into an HTML attribute besides something like embedding JSON inside the attribute.

An important aspect of this PEP is in this line:

Future versions of the API may add things that can only be represented in a subset of the available serializations of that version.

This gives us full permission to effectively freeze the HTML API in place, never adding another feature to it, while we start adding new features to the JSON API, freeing us from having to worry about how we can encode something that we want to add into HTML.

Certainly, some of the value in this PEP will not manifest until after clients or repositories start dropping support for HTML, though even in the interim it makes things like “just” using html.parser a little more palatable. And, as mentioned above, this PEP does allow us to start improving the API with new features right away.

I do want to challenge the idea of “a better solution should already be around”. I don’t think that the data model of the simple API is actually a problem for its intended use case, and I think it serves it well. There are things that we would like to add that are tough to express in HTML, but I think the fundamental shape of the data is… fine?

I don’t really see us needing to replace this API in the future unless the state of the art drastically changes in some way that I don’t think it’s possible for us to see right now.

Certainly, this API isn’t well structured as a general-purpose API for interacting with PyPI, but that’s not its goal and never should be. The amount of traffic we get on this API is massive, and it deserves an API that is specialized for its use cases; a general-purpose API will never be that.

6 Likes

Just for kicks, here’s an implementation of this for Warehouse that should be fully featured. It’s not able to be landed yet since it needs tests and such, but manual testing has it working fine: Implement a PoC for PEP 691 by dstufft · Pull Request #11485 · pypa/warehouse · GitHub.

Might try to throw something together for pip as well here in a bit.

3 Likes

Here’s the same thing for pip.

2 Likes

OK, and I tested both of these locally, both with Warehouse serving both content types and with Warehouse’s HTML support commented out altogether. My Warehouse isn’t set up to serve files, so fetching files 404’d, but it got to that point just fine.

Most of the changed lines in the pip PR are just removing the word “html”; maybe I should have left them in to make it more obvious what the actual required changes are.

2 Likes

I’ve also got my proxy index working with manual tests: Comparing master...json-api · EpicWink/proxpi · GitHub


Any users who use curl without explicitly setting Accept will likely start getting JSON responses and breaking their scripts, because curl sets Accept: */* by default. The solution would be to require that JSON be chosen only if its quality is strictly greater than HTML’s, but then Accept: ...+json, ...+html (i.e. without setting quality) would always return HTML.


I did a benchmark of the response body size difference between the HTML and JSON APIs. On average, the JSON response was 1.91x as large (i.e. 91% bigger).

Individual packages (click to expand)
Project HTML size (kB) JSON size (kB) JSON size ratio
babel 8.5 18.4 2.16
cython 498.7 883.0 1.77
flask 11.7 25.0 2.14
gitpython 25.3 50.7 2.0
jinja2 14.2 30.7 2.16
keras-preprocessing 4.3 8.7 2.02
mako 10.1 22.8 2.26
markdown 14.1 31.0 2.2
markupsafe 85.7 153.5 1.79
pillow 478.8 924.8 1.93
pyjwt 16.7 38.3 2.29
pyopengl 9.7 22.5 2.32
pyopengl-accelerate 26.0 52.9 2.03
pyqt5 27.7 51.9 1.87
pyqt5-qt5 0.8 1.5 1.88
pyqt5-sip 51.1 98.4 1.93
pywavelets 68.4 129.6 1.89
pyyaml 69.0 133.0 1.93
pygments 24.4 53.9 2.21
qtpy 10.2 22.1 2.17
sqlalchemy 504.1 842.8 1.67
send2trash 4.0 8.7 2.17
shapely 132.2 259.8 1.97
sphinx 62.9 138.2 2.2
werkzeug 22.1 46.8 2.12
absl-py 6.3 14.5 2.3
alabaster 4.9 11.1 2.27
alembic 22.0 45.1 2.05
argon2-cffi 39.5 79.5 2.01
astunparse 3.1 6.8 2.19
attrs 8.6 17.5 2.03
azure-common 10.2 22.2 2.18
azure-core 13.7 29.8 2.18
azure-cosmos 8.4 17.9 2.13
azure-identity 13.6 28.7 2.11
azure-keyvault-secrets 5.2 10.4 2.0
azure-storage-blob 15.2 31.2 2.05
backcall 0.7 1.4 2.0
bleach 14.5 30.1 2.08
boto3 346.4 745.5 2.15
botocore 460.8 986.6 2.14
build 7.2 13.8 1.92
cachetools 12.0 25.8 2.15
certifi 13.3 29.0 2.18
cffi 231.6 483.1 2.09
charset-normalizer 12.1 23.2 1.92
click 14.9 32.8 2.2
cloudpickle 11.4 24.4 2.14
colorama 12.1 27.9 2.31
coverage 533.9 984.8 1.84
cryptography 353.3 672.5 1.9
cycler 1.1 2.1 1.91
databricks-cli 13.0 28.2 2.17
debugpy 246.0 421.3 1.71
decorator 10.5 22.2 2.11
defusedxml 4.2 8.2 1.95
deprecation 4.1 8.9 2.17
docker 18.2 37.5 2.06
docutils 10.0 20.0 2.0
entrypoints 2.0 3.8 1.9
flaky 8.3 18.2 2.19
flatbuffers 1.7 3.5 2.06
floto 0.1 0.1 1.0
gast 4.6 9.4 2.04
gitdb 4.4 9.3 2.11
glfw 40.5 75.6 1.87
google-auth 45.5 85.9 1.89
google-auth-oauthlib 5.4 10.6 1.96
google-pasta 4.8 10.4 2.17
greenlet 159.6 301.4 1.89
grpcio 852.9 1705.8 2.0
gunicorn 16.0 35.6 2.23
h5py 80.5 155.1 1.93
idna 6.2 13.8 2.23
imageio 18.8 40.1 2.13
imagesize 3.0 5.9 1.97
imgaug 2.4 5.3 2.21
imgviz 10.9 25.0 2.29
importlib-metadata 38.0 70.3 1.85
importlib-resources 20.6 37.9 1.84
iniconfig 1.3 2.7 2.08
ipykernel 32.2 66.4 2.06
ipyparallel 15.0 30.5 2.03
ipython 54.8 119.1 2.17
ipython-genutils 0.9 1.8 2.0
ipywidgets 36.5 79.5 2.18
isodate 2.6 5.9 2.27
itsdangerous 6.0 12.4 2.07
jedi 10.0 19.9 1.99
jmespath 5.4 11.9 2.2
joblib 28.0 63.9 2.28
jsonschema 18.1 39.3 2.17
jupyter-client 22.3 44.0 1.97
jupyter-core 13.0 26.4 2.03
jupyterlab-pygments 2.4 4.5 1.88
jupyterlab-widgets 16.1 32.0 1.99
keras 12.9 29.6 2.29
kiwisolver 75.8 133.9 1.77
labelme 24.1 57.4 2.38
libclang 7.1 13.8 1.94
majora 1.2 2.2 1.83
marshmallow 54.5 116.0 2.13
marshmallow-dataclass 18.3 36.4 1.99
marshmallow-oneofschema 5.0 9.6 1.92
marshmallow-union 1.9 3.7 1.95
matplotlib 241.3 445.6 1.85
matplotlib-inline 1.7 3.1 1.82
mistune 14.8 29.3 1.98
mlflow 18.7 40.3 2.16
msal 10.8 24.8 2.3
msal-extensions 3.4 7.1 2.09
msrest 22.3 50.7 2.27
mypy-extensions 1.7 3.5 2.06
nbclient 10.6 20.7 1.95
nbconvert 19.3 38.9 2.02
nbformat 8.1 16.5 2.04
nest-asyncio 11.6 22.9 1.97
networkx 30.0 67.8 2.26
nose 2.8 6.5 2.32
notebook 29.5 63.6 2.16
numpy 504.2 907.6 1.8
oauthlib 8.6 19.1 2.22
opencv-python 285.8 518.4 1.81
opencv-python-headless 242.7 428.2 1.76
opt-einsum 3.0 6.3 2.1
packaging 14.3 28.7 2.01
pandas 272.8 516.2 1.89
pandocfilters 2.6 5.6 2.15
parso 8.5 18.2 2.14
pep517 4.2 9.4 2.24
pexpect 4.0 9.0 2.25
pickleshare 3.3 7.2 2.18
pip 36.6 76.4 2.09
pluggy 7.0 13.6 1.94
portalocker 8.6 18.9 2.2
prometheus-client 9.4 19.2 2.04
prometheus-flask-exporter 12.1 25.0 2.07
prompt-toolkit 42.6 90.3 2.12
protobuf 315.6 614.4 1.95
psutil 198.9 408.2 2.05
ptyprocess 2.2 4.8 2.18
py 12.9 28.7 2.22
pyasn1 46.8 109.6 2.34
pyasn1-modules 40.0 88.4 2.21
pycocotools 1.0 2.1 2.1
pycparser 3.4 7.7 2.26
pymap3d 11.9 25.6 2.15
pyparsing 43.3 93.8 2.17
pyproj 123.2 230.2 1.87
pyrsistent 24.3 48.6 2.0
pytest 49.1 100.2 2.04
pytest-cov 11.9 24.1 2.03
python-dateutil 9.2 18.4 2.0
python-editor 2.4 5.1 2.12
python-json-logger 4.2 8.6 2.05
pytz 95.0 228.2 2.4
pyzmq 291.6 556.2 1.91
qtconsole 12.8 27.7 2.16
querystring-parser 1.4 2.8 2.0
requests 32.6 71.5 2.19
requests-oauthlib 6.4 13.1 2.05
rsa 11.2 25.5 2.28
s3transfer 10.2 22.2 2.18
scikit-image 105.4 185.9 1.76
scikit-learn 243.3 448.9 1.85
scipy 286.8 516.0 1.8
sentry-sdk 46.1 101.6 2.2
setuptools 208.1 441.1 2.12
six 7.0 15.6 2.23
sklearn 0.3 0.4 1.33
smmap 3.8 7.7 2.03
snowballstemmer 2.3 4.9 2.13
sphinxcontrib-applehelp 1.3 2.3 1.77
sphinxcontrib-devhelp 1.3 2.3 1.77
sphinxcontrib-htmlhelp 2.0 3.8 1.9
sphinxcontrib-jsmath 1.0 1.6 1.6
sphinxcontrib-qthelp 1.6 3.0 1.88
sphinxcontrib-serializinghtml 2.6 4.7 1.81
sqlparse 6.3 13.7 2.17
tabulate 4.5 10.2 2.27
tensorboard 21.2 40.0 1.89
tensorboard-data-server 5.1 8.9 1.75
tensorboard-plugin-wit 1.5 2.7 1.8
tensorflow 171.6 338.0 1.97
tensorflow-estimator 6.9 13.4 1.94
tensorflow-io-gcs-filesystem 40.3 65.5 1.63
termcolor 1.1 2.3 2.09
terminado 9.3 19.0 2.04
testpath 3.1 6.6 2.13
threadpoolctl 2.8 5.2 1.86
tifffile 35.4 72.8 2.06
toml 3.1 7.0 2.26
tomli 8.0 16.4 2.05
tornado 49.6 97.0 1.96
tqdm 49.4 104.1 2.11
traitlets 11.4 23.9 2.1
typeguard 15.1 31.3 2.07
typing-extensions 8.2 16.8 2.05
typing-inspect 4.0 8.4 2.1
urllib3 21.7 43.4 2.0
wcwidth 4.4 9.8 2.23
webencodings 1.1 2.4 2.18
websocket-client 17.5 36.5 2.09
wheel 19.4 40.6 2.09
widgetsnbextension 38.5 79.9 2.08
wrapt 182.5 308.7 1.69
xmltodict 4.6 10.3 2.24
zipp 12.1 24.9 2.06

Edit: bad benchmark; see the comments below for the correct run

The way I’ve implemented this in Warehouse: it essentially starts with a priority list that is hard-coded in Warehouse, then takes the list from the client and effectively sorts the server’s list using the client’s priority values. Then it takes the first item.

This works because, with a stable sort, items with equal preference retain their ordering. So clients get to express their priority, but within the same priority level, the server’s initial ordering controls the outcome.

For compatibility reasons, Warehouse prefers text/html over +html over +json, absent any signal from the client that it prefers JSON over HTML. I would like to have the server itself prefer JSON over HTML, but I believe the chances of breakage are much higher in that situation. It might be a good intermediate step at some point in the future, if we ever decide we want to push people more directly towards JSON.
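As a rough illustration (my own sketch, not Warehouse’s actual code, and with only toy Accept-header parsing), the stable-sort approach looks something like this:

```python
# Hypothetical sketch of the stable-sort negotiation described above.
SERVER_PRIORITY = [  # server's own preference, most compatible first
    "text/html",
    "application/vnd.pypi.simple.v1+html",
    "application/vnd.pypi.simple.v1+json",
]

def negotiate(accept: str) -> str:
    # Parse "type;q=0.5, type2" into {type: quality}; quality defaults to 1.0.
    prefs = {}
    for item in accept.split(","):
        media, _, params = item.strip().partition(";")
        q = 1.0
        for param in params.split(";"):
            param = param.strip()
            if param.startswith("q="):
                q = float(param[2:])
        prefs[media.strip()] = q
    # Keep only types the client accepts (directly or via */*)...
    offered = [ct for ct in SERVER_PRIORITY
               if prefs.get(ct, prefs.get("*/*", 0)) > 0]
    # ...sorted by client quality. Python's sort is stable, so types with
    # equal quality keep the server's own ordering.
    offered.sort(key=lambda ct: prefs.get(ct, prefs.get("*/*", 0)),
                 reverse=True)
    return offered[0] if offered else SERVER_PRIORITY[0]

print(negotiate("application/vnd.pypi.simple.v1+json, text/html"))
# -> text/html (equal quality, so the server's preference wins)
```

A client that actually wants JSON under this scheme has to say so explicitly, e.g. by sending a lower q for text/html, which is exactly the behavior being debated below.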

Does this take into account any compression? Or is it the decompressed size?

Oh I see:

This is what your HTML output looks like per file:

    <a href="nose-1.3.7.tar.gz#sha256=f1bffef9cbc82628f6e7d7b40d7e255aefaa1adb6a1b1d26c69a8b79e6208a98">nose-1.3.7.tar.gz</a><br />

(plus a new line)

This is what it looks like for JSON:

{"filename":"nose-1.3.7.tar.gz","hashes":{"sha256":"f1bffef9cbc82628f6e7d7b40d7e255aefaa1adb6a1b1d26c69a8b79e6208a98"},"url":"https://files.pythonhosted.org/packages/58/a5/0dc93c3ec33f4e281849523a5a913fa1eea9a3068acfa754d44d88107a44/nose-1.3.7.tar.gz#sha256=f1bffef9cbc82628f6e7d7b40d7e255aefaa1adb6a1b1d26c69a8b79e6208a98"}

That’s 132 bytes per file for HTML vs 324 bytes per file for JSON (for nose-1.3.7.tar.gz) or a 2.45 ratio.

You should be able to drop the #sha256=... from the url; it’s not required in JSON, since that’s what the hashes key is for. That should save 72 bytes per file.

That takes us to 132:252, or 1.9 ratio.

The URLs are another big difference: nose-1.3.7.tar.gz vs https://files.pythonhosted.org/packages/58/a5/0dc93c3ec33f4e281849523a5a913fa1eea9a3068acfa754d44d88107a44/nose-1.3.7.tar.gz is an extra 107 bytes per file (and I think that may be a bug in the PR; I assume you want to serve the file locally for caching?).

The PEP does specify that URLs are to be interpreted as they would be for HTML, which allows relative URLs to work, so the same URL should work for both.
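The standard library demonstrates the resolution rule; a relative file URL in the JSON response resolves against the page’s own URL, the same way a relative href does in HTML:

```python
from urllib.parse import urljoin

# Resolve a relative filename against the project page's URL, the same
# way a browser would resolve a relative <a href>.
page_url = "https://pypi.org/simple/nose/"
print(urljoin(page_url, "nose-1.3.7.tar.gz"))
# -> https://pypi.org/simple/nose/nose-1.3.7.tar.gz
```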

If we remove the 107 bytes, that brings us to 132:145, or a 1.1 ratio. The remaining 13 bytes per file of difference is largely noise between having to specify “filename” and “hashes” as keys versus not having spaces and newlines. Compression should erase most of that.
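To see why compression should erase most of the remaining difference: the JSON overhead is mostly the repeated key names, which is exactly the kind of redundancy gzip removes. A quick, artificial check (made-up filenames and a fake digest, so the absolute numbers are illustrative only):

```python
import gzip
import json

# Build 100 JSON file entries whose repeated keys ("filename", "hashes",
# "url") mimic the per-file overhead discussed above.
rows = [
    {
        "filename": f"pkg-1.0.{i}.tar.gz",
        "hashes": {"sha256": "ab" * 32},  # fake 64-char digest
        "url": f"pkg-1.0.{i}.tar.gz",
    }
    for i in range(100)
]
raw = json.dumps({"files": rows}, separators=(",", ":")).encode()
packed = gzip.compress(raw)
print(len(raw), len(packed))  # the repeated structure compresses away
```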

Yup, thanks for pointing that out. I’ll fix the response and re-run the benchmark. This is why you don’t code at 3am.


Warehouse’s preference for text/html when qualities are equal is what I’m talking about: if a client requests Accept: text/html, ...+json, the server will always respond with HTML. With Warehouse’s ordering, clients must either specify quality, or both not specify text/html and specify ...+json before ...+html. Not to mention that content negotiation doesn’t seem to care about order.

I would prefer if the PEP said to default to assuming text/html when qualities are equal (and nonzero), and that clients should always set quality.


By latest version, I’m assuming that means the latest version the server knows about?

We currently leave it up to each server to decide what to do. This was to give each implementor the most flexibility to decide what makes the most sense for them.

We pick the most compatible possible option in Warehouse because there is only one version of Warehouse, so people can’t select different behaviors by different versions. I think it’s fine for other implementations to do something different.

Yes.

2 Likes

Fixed benchmark result: the JSON response is 1.05x (± 0.04) as large (i.e. 5% bigger)

Individual packages (click to expand)
Project HTML size (kB) JSON size (kB) JSON size ratio
babel 8.5 9.1 1.07
cython 498.7 518.7 1.04
flask 11.7 12.4 1.06
gitpython 25.3 26.6 1.05
jinja2 14.2 15.2 1.07
keras-preprocessing 4.3 4.5 1.05
mako 10.1 10.9 1.08
markdown 14.1 15.1 1.07
markupsafe 85.7 88.6 1.03
pillow 478.8 505.2 1.06
pyjwt 16.7 18.2 1.09
pyopengl 9.7 10.6 1.09
pyopengl-accelerate 26.0 27.9 1.07
pyqt5 27.7 29.2 1.05
pyqt5-qt5 0.8 0.8 1.0
pyqt5-sip 51.1 53.6 1.05
pywavelets 68.4 72.3 1.06
pyyaml 69.0 72.7 1.05
pygments 24.4 26.3 1.08
qtpy 10.2 10.9 1.07
sqlalchemy 504.1 519.4 1.03
send2trash 4.0 4.3 1.07
shapely 132.2 140.4 1.06
sphinx 62.9 67.5 1.07
werkzeug 22.1 23.6 1.07
absl-py 6.7 7.2 1.07
alabaster 4.9 5.3 1.08
alembic 22.0 23.3 1.06
argon2-cffi 39.5 42.2 1.07
astunparse 3.1 3.3 1.06
attrs 8.6 9.1 1.06
azure-common 10.2 11.0 1.08
azure-core 14.1 15.1 1.07
azure-cosmos 8.4 9.0 1.07
azure-identity 13.6 14.6 1.07
azure-keyvault-secrets 5.2 5.5 1.06
azure-storage-blob 15.2 16.2 1.07
backcall 0.7 0.7 1.0
bleach 14.5 15.4 1.06
boto3 346.7 372.2 1.07
botocore 461.0 495.6 1.08
build 7.2 7.5 1.04
cachetools 12.0 12.8 1.07
certifi 13.3 14.3 1.08
cffi 231.6 249.8 1.08
charset-normalizer 12.1 12.5 1.03
click 14.9 16.0 1.07
cloudpickle 11.4 12.2 1.07
colorama 12.1 13.2 1.09
coverage 533.9 555.8 1.04
cryptography 353.3 372.7 1.05
cycler 1.1 1.1 1.0
databricks-cli 13.0 14.1 1.08
debugpy 246.0 253.4 1.03
decorator 10.5 11.2 1.07
defusedxml 4.2 4.4 1.05
deprecation 4.1 4.4 1.07
docker 18.2 19.3 1.06
docutils 10.0 10.5 1.05
entrypoints 2.0 2.0 1.0
flaky 8.3 8.9 1.07
flatbuffers 1.7 1.7 1.0
floto 0.1 0.1 1.0
gast 4.6 4.8 1.04
gitdb 4.4 4.6 1.05
glfw 40.5 42.9 1.06
google-auth 45.5 47.5 1.04
google-auth-oauthlib 5.4 5.7 1.06
google-pasta 4.8 5.2 1.08
greenlet 159.6 168.0 1.05
grpcio 852.9 910.8 1.07
gunicorn 16.0 17.2 1.07
h5py 80.5 85.2 1.06
idna 6.2 6.6 1.06
imageio 18.8 20.0 1.06
imagesize 3.0 3.1 1.03
imgaug 2.4 2.5 1.04
imgviz 10.9 11.7 1.07
importlib-metadata 38.0 39.3 1.03
importlib-resources 20.6 21.3 1.03
iniconfig 1.3 1.3 1.0
ipykernel 32.2 33.9 1.05
ipyparallel 15.0 15.8 1.05
ipython 54.8 58.7 1.07
ipython-genutils 0.9 0.9 1.0
ipywidgets 36.5 39.5 1.08
isodate 2.6 2.8 1.08
itsdangerous 6.0 6.3 1.05
jedi 10.0 10.5 1.05
jmespath 5.4 5.7 1.06
joblib 28.0 30.5 1.09
jsonschema 18.4 19.8 1.08
jupyter-client 22.3 23.4 1.05
jupyter-core 13.0 13.7 1.05
jupyterlab-pygments 2.4 2.4 1.0
jupyterlab-widgets 16.1 17.0 1.06
keras 12.9 14.0 1.09
kiwisolver 75.8 78.3 1.03
labelme 24.1 26.5 1.1
libclang 7.1 7.5 1.06
majora 1.2 1.1 0.92
marshmallow 54.5 58.5 1.07
marshmallow-dataclass 18.3 19.3 1.05
marshmallow-oneofschema 5.0 5.2 1.04
marshmallow-union 1.9 2.0 1.05
matplotlib 241.3 252.6 1.05
matplotlib-inline 1.7 1.7 1.0
mistune 14.8 15.8 1.07
mlflow 18.7 19.9 1.06
msal 10.8 11.7 1.08
msal-extensions 3.4 3.6 1.06
msrest 22.3 24.3 1.09
mypy-extensions 1.7 1.8 1.06
nbclient 10.6 10.9 1.03
nbconvert 19.3 20.2 1.05
nbformat 8.1 8.5 1.05
nest-asyncio 11.6 12.1 1.04
networkx 30.0 32.5 1.08
nose 2.8 3.0 1.07
notebook 29.5 31.6 1.07
numpy 504.2 522.9 1.04
oauthlib 8.6 9.3 1.08
opencv-python 285.8 299.2 1.05
opencv-python-headless 242.7 253.2 1.04
opt-einsum 3.0 3.2 1.07
packaging 14.3 15.1 1.06
pandas 272.8 286.3 1.05
pandocfilters 2.6 2.8 1.08
parso 8.5 9.1 1.07
pep517 4.2 4.5 1.07
pexpect 4.0 4.3 1.07
pickleshare 3.3 3.5 1.06
pip 36.6 38.8 1.06
pluggy 7.0 7.3 1.04
portalocker 8.6 9.3 1.08
prometheus-client 9.4 9.9 1.05
prometheus-flask-exporter 12.1 13.0 1.07
prompt-toolkit 42.6 45.7 1.07
protobuf 315.6 335.2 1.06
psutil 198.9 212.5 1.07
ptyprocess 2.2 2.3 1.05
py 12.9 13.9 1.08
pyasn1 46.8 51.3 1.1
pyasn1-modules 40.0 43.5 1.09
pycocotools 1.0 1.0 1.0
pycparser 3.4 3.7 1.09
pymap3d 11.9 12.6 1.06
pyparsing 43.3 46.6 1.08
pyproj 123.2 128.8 1.05
pyrsistent 24.3 25.7 1.06
pytest 49.1 51.8 1.05
pytest-cov 11.9 12.6 1.06
python-dateutil 9.2 9.7 1.05
python-editor 2.4 2.5 1.04
python-json-logger 4.2 4.4 1.05
pytz 95.0 104.6 1.1
pyzmq 291.6 305.7 1.05
qtconsole 12.8 13.7 1.07
querystring-parser 1.4 1.4 1.0
requests 32.6 35.2 1.08
requests-oauthlib 6.4 6.8 1.06
rsa 11.2 12.1 1.08
s3transfer 10.2 11.0 1.08
scikit-image 105.4 109.3 1.04
scikit-learn 243.3 255.2 1.05
scipy 286.8 296.9 1.04
sentry-sdk 46.1 50.0 1.08
setuptools 208.1 222.3 1.07
six 7.0 7.5 1.07
sklearn 0.3 0.2 0.67
smmap 3.8 4.0 1.05
snowballstemmer 2.3 2.5 1.09
sphinxcontrib-applehelp 1.3 1.3 1.0
sphinxcontrib-devhelp 1.3 1.3 1.0
sphinxcontrib-htmlhelp 2.0 2.1 1.05
sphinxcontrib-jsmath 1.0 0.9 0.9
sphinxcontrib-qthelp 1.6 1.6 1.0
sphinxcontrib-serializinghtml 2.6 2.6 1.0
sqlparse 6.3 6.7 1.06
tabulate 4.5 4.8 1.07
tensorboard 21.2 22.2 1.05
tensorboard-data-server 5.1 5.2 1.02
tensorboard-plugin-wit 1.5 1.5 1.0
tensorflow 171.6 183.3 1.07
tensorflow-estimator 6.9 7.3 1.06
tensorflow-io-gcs-filesystem 40.3 40.9 1.01
termcolor 1.1 1.1 1.0
terminado 9.3 9.7 1.04
testpath 3.1 3.3 1.06
threadpoolctl 2.8 2.8 1.0
tifffile 35.4 37.3 1.05
toml 3.1 3.3 1.06
tomli 8.0 8.3 1.04
tornado 49.6 51.9 1.05
tqdm 49.4 52.6 1.06
traitlets 11.4 12.1 1.06
typeguard 15.1 15.9 1.05
typing-extensions 8.2 8.7 1.06
typing-inspect 4.0 4.2 1.05
urllib3 21.7 22.8 1.05
wcwidth 4.4 4.7 1.07
webencodings 1.1 1.2 1.09
websocket-client 17.5 18.7 1.07
wheel 19.4 20.7 1.07
widgetsnbextension 38.5 41.4 1.08
wrapt 182.5 187.8 1.03
xmltodict 4.6 4.9 1.07
zipp 12.1 12.6 1.04

Benchmark with (gzip) compression result: the JSON response is 0.97x (± 0.05) as large (i.e. 3% smaller)

Individual packages (click to expand)
Project HTML size (kB) JSON size (kB) JSON size ratio
babel 2.7 2.6 0.96
cython 98.1 97.9 1.0
flask 3.5 3.5 1.0
gitpython 6.9 6.9 1.0
jinja2 4.4 4.3 0.98
keras-preprocessing 1.3 1.3 1.0
mako 3.4 3.3 0.97
markdown 4.5 4.4 0.98
markupsafe 17.9 17.8 0.99
pillow 112.6 112.3 1.0
pyjwt 5.5 5.4 0.98
pyopengl 3.3 3.3 1.0
pyopengl-accelerate 6.9 6.8 0.99
pyqt5 6.5 6.5 1.0
pyqt5-qt5 0.4 0.4 1.0
pyqt5-sip 12.2 12.1 0.99
pywavelets 16.0 15.9 0.99
pyyaml 16.6 16.5 0.99
pygments 7.5 7.5 1.0
qtpy 3.2 3.1 0.97
sqlalchemy 86.9 86.7 1.0
send2trash 1.4 1.3 0.93
shapely 32.6 32.5 1.0
sphinx 18.9 18.8 0.99
werkzeug 6.4 6.3 0.98
absl-py 2.3 2.3 1.0
alabaster 1.7 1.6 0.94
alembic 6.1 6.0 0.98
argon2-cffi 10.4 10.3 0.99
astunparse 1.1 1.1 1.0
attrs 2.5 2.4 0.96
azure-common 3.1 3.1 1.0
azure-core 4.3 4.3 1.0
azure-cosmos 2.6 2.5 0.96
azure-identity 4.0 3.9 0.97
azure-keyvault-secrets 1.5 1.5 1.0
azure-storage-blob 4.3 4.2 0.98
backcall 0.4 0.3 0.75
bleach 4.1 4.1 1.0
boto3 97.8 97.8 1.0
botocore 128.2 128.6 1.0
build 2.0 1.9 0.95
cachetools 3.6 3.5 0.97
certifi 4.2 4.1 0.98
cffi 62.2 62.1 1.0
charset-normalizer 3.1 3.0 0.97
click 4.7 4.6 0.98
cloudpickle 3.4 3.4 1.0
colorama 4.1 4.1 1.0
coverage 114.8 114.5 1.0
cryptography 80.2 79.9 1.0
cycler 0.5 0.4 0.8
databricks-cli 3.9 3.9 1.0
debugpy 45.3 45.1 1.0
decorator 3.2 3.2 1.0
defusedxml 1.3 1.2 0.92
deprecation 1.4 1.3 0.93
docker 5.0 5.0 1.0
docutils 2.7 2.7 1.0
entrypoints 0.7 0.6 0.86
flaky 2.7 2.6 0.96
flatbuffers 0.6 0.6 1.0
floto 0.1 0.1 1.0
gast 1.4 1.4 1.0
gitdb 1.5 1.4 0.93
glfw 9.1 9.0 0.99
google-auth 10.6 10.6 1.0
google-auth-oauthlib 1.5 1.5 1.0
google-pasta 1.6 1.5 0.94
greenlet 36.2 36.1 1.0
grpcio 210.6 209.9 1.0
gunicorn 5.0 5.0 1.0
h5py 19.2 19.1 0.99
idna 2.1 2.1 1.0
imageio 5.7 5.7 1.0
imagesize 1.0 0.9 0.9
imgaug 0.9 0.9 1.0
imgviz 3.6 3.6 1.0
importlib-metadata 8.6 8.5 0.99
importlib-resources 4.7 4.6 0.98
iniconfig 0.6 0.5 0.83
ipykernel 8.8 8.8 1.0
ipyparallel 4.2 4.1 0.98
ipython 16.4 16.3 0.99
ipython-genutils 0.4 0.4 1.0
ipywidgets 10.8 10.7 0.99
isodate 1.0 0.9 0.9
itsdangerous 1.9 1.8 0.95
jedi 2.8 2.7 0.96
jmespath 1.8 1.8 1.0
joblib 9.0 9.0 1.0
jsonschema 5.6 5.5 0.98
jupyter-client 5.8 5.7 0.98
jupyter-core 3.6 3.6 1.0
jupyterlab-pygments 0.8 0.7 0.87
jupyterlab-widgets 4.2 4.1 0.98
keras 4.3 4.2 0.98
kiwisolver 15.6 15.5 0.99
labelme 8.2 8.2 1.0
libclang 1.9 1.9 1.0
majora 0.5 0.4 0.8
marshmallow 15.4 15.3 0.99
marshmallow-dataclass 4.8 4.7 0.98
marshmallow-oneofschema 1.4 1.4 1.0
marshmallow-union 0.7 0.6 0.86
matplotlib 52.0 51.8 1.0
matplotlib-inline 0.6 0.5 0.83
mistune 3.8 3.8 1.0
mlflow 5.6 5.5 0.98
msal 3.6 3.5 0.97
msal-extensions 1.1 1.1 1.0
msrest 7.1 7.0 0.99
mypy-extensions 0.7 0.6 0.86
nbclient 2.9 2.8 0.97
nbconvert 5.2 5.1 0.98
nbformat 2.4 2.4 1.0
nest-asyncio 3.0 3.0 1.0
networkx 9.6 9.6 1.0
nose 1.1 1.0 0.91
notebook 8.7 8.6 0.99
numpy 103.1 102.8 1.0
oauthlib 2.8 2.7 0.96
opencv-python 59.8 59.7 1.0
opencv-python-headless 48.0 48.0 1.0
opt-einsum 1.1 1.0 0.91
packaging 3.9 3.8 0.97
pandas 61.7 61.5 1.0
pandocfilters 1.0 0.9 0.9
parso 2.6 2.5 0.96
pep517 1.5 1.4 0.93
pexpect 1.4 1.4 1.0
pickleshare 1.2 1.1 0.92
pip 10.3 10.2 0.99
pluggy 1.9 1.9 1.0
portalocker 2.7 2.7 1.0
prometheus-client 2.7 2.6 0.96
prometheus-flask-exporter 3.4 3.3 0.97
prompt-toolkit 12.0 12.0 1.0
protobuf 75.0 74.8 1.0
psutil 52.1 51.9 1.0
ptyprocess 0.8 0.8 1.0
py 4.2 4.1 0.98
pyasn1 15.4 15.3 0.99
pyasn1-modules 12.0 12.0 1.0
pycocotools 0.5 0.4 0.8
pycparser 1.3 1.2 0.92
pymap3d 3.6 3.5 0.97
pyparsing 12.8 12.7 0.99
pyproj 27.7 27.5 0.99
pyrsistent 6.6 6.6 1.0
pytest 13.1 13.0 0.99
pytest-cov 3.3 3.3 1.0
python-dateutil 2.6 2.5 0.96
python-editor 0.9 0.8 0.89
python-json-logger 1.4 1.3 0.93
pytz 32.6 32.5 1.0
pyzmq 67.5 67.4 1.0
qtconsole 3.9 3.9 1.0
querystring-parser 0.6 0.5 0.83
requests 9.8 9.8 1.0
requests-oauthlib 1.9 1.9 1.0
rsa 3.9 3.8 0.97
s3transfer 3.1 3.1 1.0
scikit-image 21.2 21.1 1.0
scikit-learn 52.4 52.3 1.0
scipy 58.9 58.7 1.0
sentry-sdk 13.8 13.8 1.0
setuptools 58.0 57.8 1.0
six 2.3 2.3 1.0
sklearn 0.2 0.2 1.0
smmap 1.2 1.1 0.92
snowballstemmer 0.8 0.8 1.0
sphinxcontrib-applehelp 0.5 0.5 1.0
sphinxcontrib-devhelp 0.5 0.5 1.0
sphinxcontrib-htmlhelp 0.7 0.6 0.86
sphinxcontrib-jsmath 0.4 0.4 1.0
sphinxcontrib-qthelp 0.6 0.5 0.83
sphinxcontrib-serializinghtml 0.8 0.7 0.87
sqlparse 2.1 2.0 0.95
tabulate 1.6 1.5 0.94
tensorboard 5.0 4.9 0.98
tensorboard-data-server 1.2 1.2 1.0
tensorboard-plugin-wit 0.5 0.5 1.0
tensorflow 41.3 41.1 1.0
tensorflow-estimator 1.8 1.8 1.0
tensorflow-io-gcs-filesystem 6.9 6.8 0.99
termcolor 0.5 0.4 0.8
terminado 2.7 2.7 1.0
testpath 1.1 1.0 0.91
threadpoolctl 0.9 0.8 0.89
tifffile 9.8 9.7 0.99
toml 1.2 1.1 0.92
tomli 2.3 2.3 1.0
tornado 12.4 12.3 0.99
tqdm 14.0 13.9 0.99
traitlets 3.4 3.3 0.97
typeguard 4.2 4.2 1.0
typing-extensions 2.3 2.3 1.0
typing-inspect 1.3 1.2 0.92
urllib3 5.7 5.6 0.98
wcwidth 1.5 1.5 1.0
webencodings 0.5 0.5 1.0
websocket-client 5.0 4.9 0.98
wheel 5.5 5.4 0.98
widgetsnbextension 10.4 10.3 0.99
wrapt 33.0 32.9 1.0
xmltodict 1.6 1.6 1.0
zipp 3.4 3.4 1.0
2 Likes

Great to see some work on this, many thanks for the initiative!

Looking at the Project List specification, 2 questions arise:

  • Was it intentional to drop the un-normalized (real) project name from the list? This information was available in the HTML serialization.
  • Is the url field only there to be consistent with PEP-503 (1.0)? It otherwise seems redundant, because according to the spec the url can be deduced from the name.

It makes the information self-contained. Otherwise you would have to pass around the JSON and the URL to be able to construct/extract all relevant data instead of just the JSON payload.
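
For reference, the deduction mentioned above relies on PEP 503 name normalization. A minimal sketch of how a client could derive a project URL from an un-normalized name (the base URL and helper names here are illustrative, not from the PEP):

```python
import re
from urllib.parse import urljoin

def normalize(name: str) -> str:
    # PEP 503 normalization: runs of "-", "_", "." collapse to a
    # single "-", and the result is lowercased.
    return re.sub(r"[-_.]+", "-", name).lower()

def project_url(index_base: str, name: str) -> str:
    # Illustrative helper: project pages live at <base>/<normalized-name>/.
    return urljoin(index_base, normalize(name) + "/")

print(project_url("https://pypi.org/simple/", "Django_Rest.Framework"))
# https://pypi.org/simple/django-rest-framework/
```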

I went and double-checked PEP 503, and it’s unclear in this area. It states that anchor text must be the “name” of the project.

It’s been a while since I last looked closely at the project list response on /simple/, and TBH I had assumed it was the normalized name I referenced in PEP 503, though upon closer examination I see that in practice it’s actually the unnormalized name.

So no, it wasn’t actually intentional.

However, normalized name makes much more sense for the key in the JSON response, so I’m not going to remove that.

I’m also hesitant to add that key. Currently the /simple/ response on PyPI is 20M uncompressed and 3M compressed. The current PEP 691 changes that to 18M and 2.9M. Adding in a name key changes that to 27M and 4.5M [1]. It doesn’t feel worth it to me to add that unless someone feels strongly about it.

It is somewhat redundant, and I thought about removing it. I ultimately didn’t for two reasons:

  1. This makes it an easier diff between the two formats, so integrating with existing projects is simpler.
  2. I want to leave our options open for adding extra information to each project in the future. It felt odd to make the structure an empty dictionary like {"projects": {"$name": {}}}, and adding the URL there was the easiest way to resolve that.

Honestly though, I didn’t spend a ton of time thinking about the project list. It’s not really used by any installers anymore, so from an installer POV it’s largely a vestigial URL. If there are projects out there currently using it that need something like the unnormalized name, then I’m open to changes to it.

You still have to pass around the URL (just like you have to do with HTML), because URLs may be relative to the URL that you fetched the response from. HTML allows that, and PEP 691 explicitly says that relative URLs are resolved as if the response were HTML (we just don’t have a base url meta tag like HTML does).
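
That resolution rule can be sketched with the standard library; the response URL and file URLs below are made-up examples, not real endpoints:

```python
from urllib.parse import urljoin

# The URL the JSON response was fetched from acts as the base,
# exactly as it would for a relative href in HTML (absent a <base> tag).
response_url = "https://example.com/simple/demo-project/"  # hypothetical index page

# A relative URL resolves against the response URL.
print(urljoin(response_url, "../../packages/demo-1.0.tar.gz"))
# https://example.com/packages/demo-1.0.tar.gz

# An absolute URL passes through unchanged.
print(urljoin(response_url, "https://files.example.net/demo-1.0.tar.gz"))
# https://files.example.net/demo-1.0.tar.gz
```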

I think that’s a positive thing, since it allows API responses to be mirrored byte for byte, which will end up being important for TUF integration[2].


  1. Data generated using pep691.py · GitHub ↩︎

  2. Saying this now reminds me that the status quo for PEP 503 is that mirrors cannot byte for byte copy PEP 503 from PyPI for the same reason, since URLs are allowed to be absolute, and PyPI uses that to point files to a different domain, mirrors have to rewrite /simple/$project/ to point to different URLs in the filename. This is actually a whole other problem that we’ll have to resolve somehow. ↩︎

2 Likes

It’s been about a month since I posted the last update to the PR. The feedback on this PR in that time hasn’t really raised any major concerns that I think the PEP doesn’t already address, and overall, I think that any concerns folks did have, the PEP has ended up addressing. We also have two proof-of-concept PRs that I wrote that are more or less ready to land once tests are written, other than the Warehouse PR, which also needs some VCL written. There is also a draft PR for proxpi by @EpicWink that appears to be functional, and maybe even ready to land if this PEP gets accepted, and @brettcannon has indicated he could implement this for mousebender.

We’ve also got some good data from @EpicWink that suggests that it doesn’t meaningfully affect response size (5% bigger without compression, 3% smaller with), and while it’s not as big of a deal since installers don’t really use that page, this does actually make /simple/ smaller for both uncompressed and compressed.

I think the only real open questions that have come up are:

  • My question about some of the recommendations, but that’s a non-normative section so we can update it at any time, and I suspect we might want to once we have real world experience, so I think that’s fine.
  • The recent question about the unnormalized name being available. I think we can leave that out for now; we can always add that key later if we decide it’s useful enough, since adding keys is backwards compatible but removing them is not.

Given all of that, I’m going to ask @brettcannon to go ahead and pronounce on this PEP, unless someone has some concern or objection that they’ve not yet raised.

4 Likes

I object! I’ve always wanted to say that :stuck_out_tongue_winking_eye:. Here (in this post) is some general feedback I’ve gathered.
I also do still have a major issue I want to discuss (not this post), trying my best to get that finished up as soon as possible!


Abstract

However, due to limited time constraints, that effort has not gained much if any traction beyond people thinking that it would be nice to do it.

This was a bit awkward/unpleasant to read. Maybe add commas around “if any” and remove the last word “it”?


Both the terms “canonicalized name” and “normalized name” are used, would it maybe be better to choose one of the two? Could be confusing to use both.


Project Detail

This URL must respond with a JSON encoded dictionary that has two keys, name, which represents the normalized name of the project and files. The files key is a list of dictionaries, each one representing an individual file.

Shouldn’t it be “three keys”? The metadata key was not mentioned. Although the metadata field is not mandatory, I think it should at least be mentioned here.
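
For context, here is a trimmed sketch of the project detail shape PEP 691 describes, parsed in Python. The field values are invented, and only the required top-level keys (meta, name, files) plus a minimal file entry are shown:

```python
import json

# An invented example payload following the PEP 691 project detail shape.
payload = json.loads("""
{
  "meta": {"api-version": "1.0"},
  "name": "demo-project",
  "files": [
    {
      "filename": "demo_project-1.0-py3-none-any.whl",
      "url": "demo_project-1.0-py3-none-any.whl",
      "hashes": {"sha256": "0000000000000000000000000000000000000000000000000000000000000000"}
    }
  ]
}
""")

print(sorted(payload))  # ['files', 'meta', 'name']
```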


TUF Support - PEP 458

“But I believe that”

Has this now been confirmed? If so, could we replace “I believe that” with something more factual?


TUF Support - PEP 458
and
Doesn’t TUF support require having different URLs for each representation?

These two sections are largely duplicate text. In my opinion, they can either be reduced in size, or the FAQ section can be removed entirely.


Appendix 1: Survey of use cases to cover
This listing is described by the following phrase:

This is how they use the Simple + JSON APIs today:

Nitpicking a bit here :sweat_smile:, but pip lists “Full metadata (data-dist-info-metadata)” (PEP-658), although that isn’t the case right now: use data-dist-info-metadata (PEP 658) to decouple resolution from downloading by cosmicexplorer · Pull Request #11111 · pypa/pip · GitHub


I don’t fully understand how the following two quotes reconcile?

“All serializations version numbers SHOULD be kept in sync”
and
“since 1.0 will likely be the only HTML version to exist”


I feel like the points from this message have not yet been properly addressed: PEP 691: JSON-based Simple API for Python Package Indexes - #25 by layday