PEP 691: JSON-based Simple API for Python Package Indexes

dstufft · June 28, 2022, 12:17am

I don’t have strong opinions either way. I included it because, in practice, PyPI’s implementation of /simple/foo/ included it, and it was the normalized name. It’s a relatively small amount of data, so I wasn’t too concerned.

davidism · June 28, 2022, 1:09pm

In PEP 694: Upload 2.0 API for Python Package Repositories, it states that unlike this PEP, the new upload API uses a new endpoint.

Unlike PEP 691, this PEP does not change the existing 1.0 API in any way, so servers will be required to host the new API described in this PEP at a different endpoint than the existing upload API.

Why was the upload API chosen to be a separate endpoint, while the JSON API focuses on a single endpoint differentiated by headers (with the option of course to point at different endpoints per format)? It’s definitely more natural to use a different endpoint for new upload semantics, but that same argument could be applied to the new download API as well.

dstufft · June 28, 2022, 1:16pm

Because the URL structure, and the semantics of what those URLs are, of the simple API hasn’t changed at all in PEP 691. It was just creating a different representation of the same data.
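To illustrate "same URL, different representation": a client asks for the same /simple/<project>/ URL and only varies the Accept header. The sketch below builds such a request with the standard library without sending it. The `build_request` helper is hypothetical, and the header value follows the JSON-first, HTML-fallback pattern discussed for this PEP.

```python
import urllib.request

# Prefer the v1 JSON representation, fall back to v1 HTML, then plain text/html.
ACCEPT = ", ".join([
    "application/vnd.pypi.simple.v1+json",
    "application/vnd.pypi.simple.v1+html;q=0.2",
    "text/html;q=0.01",
])


def build_request(index_url: str, project: str) -> urllib.request.Request:
    """Build (but do not send) a request for one project page.

    Hypothetical helper: the URL is the same whichever format is wanted;
    only the Accept header expresses the preference.
    """
    url = f"{index_url.rstrip('/')}/{project}/"
    return urllib.request.Request(url, headers={"Accept": ACCEPT})


req = build_request("https://pypi.org/simple", "requests")
print(req.full_url)
print(req.get_header("Accept"))
```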

For the Upload API the semantics of the URLs are drastically different to the point that they have almost nothing in common with each other.

dstufft · July 13, 2022, 11:45pm

Just to close the loop here since there were some concerns with static mirrors.

I have working configurations for both Apache and Nginx for bandersnatch.

Assuming you have Apache configured to have mod_negotiation enabled and to allow .htaccess, you can implement basic support for PEP 691 by writing index.html, index.v1_html, and index.v1_json files for all of the URLs, and dropping a top level .htaccess that looks like:

Options -Indexes +Multiviews

DirectoryIndex index

AddType application/vnd.pypi.simple.v1+json v1_json
AddType application/vnd.pypi.simple.v1+html v1_html

That doesn’t support the latest version (Apache doesn’t make it easy to separate the returned content type from the content type specified in the Accept header) or the ?format= query param. I think both of those could be supported using mod_rewrite, but I didn’t have time to dig into it further.

A weird artifact of the Apache configuration is that it doesn’t have any option to configure a server-side preference, so in cases where multiple content types are equally preferred by the client, it will return whichever response is the smallest.
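For the Apache setup, the mirror has to put three files in every directory. A minimal sketch of that step follows; it is not bandersnatch's actual code. The `write_index_files` helper and the sample data are assumptions, and it relies on the plain text/html response and the v1+html response having the same body.

```python
import json
import tempfile
from pathlib import Path


def write_index_files(directory: Path, html: str, data: dict) -> list:
    """Write the three per-URL files that the .htaccess above maps to content types.

    Hypothetical helper, not taken from bandersnatch.
    """
    directory.mkdir(parents=True, exist_ok=True)
    # The legacy text/html page and the v1+html page share the same body.
    (directory / "index.html").write_text(html, encoding="utf-8")
    (directory / "index.v1_html").write_text(html, encoding="utf-8")
    (directory / "index.v1_json").write_text(json.dumps(data), encoding="utf-8")
    return sorted(p.name for p in directory.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    written = write_index_files(
        Path(tmp) / "simple" / "foo",
        "<!DOCTYPE html><html><body></body></html>",
        {"meta": {"api-version": "1.0"}, "name": "foo", "files": []},
    )
print(written)
```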

The Nginx configuration is a little more complex, to see the whole thing you’re best off looking at the bandersnatch issue, but the important parts are:

http {

    # ...

    map $http_accept $mirror_suffix {
        default ".html";

        "~*application/vnd\.pypi\.simple\.latest\+json" ".v1_json";
        "~*application/vnd\.pypi\.simple\.latest\+html" ".v1_html";

        "~*application/vnd\.pypi\.simple\.v1\+json" ".v1_json";
        "~*application/vnd\.pypi\.simple\.v1\+html" ".v1_html";

        "~*text/html" ".html";
    }

    map $arg_format $mirror_suffix_via_url {
        "application/vnd.pypi.simple.latest+json" ".v1_json";
        "application/vnd.pypi.simple.latest+html" ".v1_html";

        "application/vnd.pypi.simple.v1+json" ".v1_json";
        "application/vnd.pypi.simple.v1+html" ".v1_html";

        "text/html" ".html";
    }

    server {

        # ...

        location /simple/ {
            index index$mirror_suffix_via_url index$mirror_suffix;

            types {
                application/vnd.pypi.simple.v1+json v1_json;
                application/vnd.pypi.simple.v1+html v1_html;
                text/html html;
            }
        }
    }
}

This doesn’t actually implement content negotiation (conneg): Nginx is not parsing the Accept header and running the full negotiation algorithm recommended by the RFC. Instead, it does a regex match against the Accept header (and a basic string comparison against the ?format= parameter) and maps the result to a file extension that gets used in the index directive.

In practice, this should be fine. The main downside is it won’t let clients express a relative preference between the content types in their Accept header (they can specify it, nginx just won’t pay attention to it). The RFCs don’t require the server to take the relative client preferences into account, so it’s valid not to do that, it’s just somewhat better if you do.
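For comparison, here is a rough sketch of what honoring the client's relative preferences would involve. It is a simplification and not a complete implementation of the RFC algorithm: it reads only the `q` parameter, handles `*/*` but not `type/*` wildcards, and the function names are made up for illustration.

```python
from typing import List, Optional, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Return (media_type, q) pairs from an Accept header; q defaults to 1.0."""
    entries = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        entries.append((media_type, q))
    return entries


def negotiate(header: str, offered: List[str]) -> Optional[str]:
    """Pick the offered type the client weights highest.

    Ties go to whichever type appears first in `offered`, which acts as the
    server-side preference. Returns None if nothing is acceptable.
    """
    weights = dict(parse_accept(header))
    best, best_q = None, 0.0
    for media_type in offered:
        q = weights.get(media_type, weights.get("*/*", 0.0))
        if q > best_q:
            best, best_q = media_type, q
    return best


OFFERED = [
    "application/vnd.pypi.simple.v1+json",
    "application/vnd.pypi.simple.v1+html",
    "text/html",
]
# The client lists JSON but weights it low; HTML has the default weight of 1.0.
choice = negotiate("application/vnd.pypi.simple.v1+json;q=0.1, text/html", OFFERED)
print(choice)
```

With this header the weighted approach picks text/html, whereas a first-regex-match approach like the Nginx map above would pick the JSON file because the v1+json pattern is listed earlier.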

Unlike the Apache example, the Nginx example allows setting the default value you want when there is no Accept header, or the Accept header doesn’t contain one of the specified content types, which is controlled by the default value in the first map. It also allows the server to express a preference between the content types, controlled by putting the preferred, non-default, option higher in the map.

In addition, the nginx example also supports:

The latest version, which will return the correct Content-Type.
The ?format= query string, which correctly overrides the Accept header.
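The selection logic of the two maps can be approximated in a few lines of Python, which may help when reasoning about which file a given request ends up with. This is only an emulation under stated assumptions: `pick_index_file` is a made-up name, `format_arg` stands in for the raw `?format=` value, and the fallback from one index candidate to the next is reduced to "use the format suffix if it matched".

```python
import re

# Same order as the first nginx map: the first matching pattern wins.
ACCEPT_PATTERNS = [
    (r"application/vnd\.pypi\.simple\.latest\+json", ".v1_json"),
    (r"application/vnd\.pypi\.simple\.latest\+html", ".v1_html"),
    (r"application/vnd\.pypi\.simple\.v1\+json", ".v1_json"),
    (r"application/vnd\.pypi\.simple\.v1\+html", ".v1_html"),
    (r"text/html", ".html"),
]
DEFAULT_SUFFIX = ".html"

# Same entries as the second nginx map, keyed by the ?format= value.
FORMAT_SUFFIXES = {
    "application/vnd.pypi.simple.latest+json": ".v1_json",
    "application/vnd.pypi.simple.latest+html": ".v1_html",
    "application/vnd.pypi.simple.v1+json": ".v1_json",
    "application/vnd.pypi.simple.v1+html": ".v1_html",
    "text/html": ".html",
}


def pick_index_file(accept: str = "", format_arg: str = "") -> str:
    """Approximate which index file the nginx config above would serve."""
    # A recognized ?format= value takes priority over the Accept header.
    suffix = FORMAT_SUFFIXES.get(format_arg.lower())
    if suffix is not None:
        return "index" + suffix
    # Otherwise scan the Accept header with case-insensitive regexes ("~*").
    for pattern, candidate in ACCEPT_PATTERNS:
        if re.search(pattern, accept, re.IGNORECASE):
            return "index" + candidate
    return "index" + DEFAULT_SUFFIX


print(pick_index_file("application/vnd.pypi.simple.latest+json"))
print(pick_index_file("text/html", "application/vnd.pypi.simple.v1+json"))
print(pick_index_file("image/png"))
```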

Those two web servers probably cover the bulk of all static mirrors out there, and of course (as mentioned in the PEP), if someone is in a situation where they cannot use conneg, the PEP still supports using independent URLs for different versions, and selecting html or json by configuring your index url in the client.

pf_moore · July 22, 2022, 12:22pm

I wish I’d thought of this while the discussion was ongoing, but the one remaining place where (as far as I can tell) the PyPI XML-RPC API is needed is for determining what packages have changed in a given period [1]. This is useful for incremental mirroring of metadata and similar types of operation. Caching isn’t much help here, as for many index/JSON responses, the header is bigger than the body - it’s the number of requests that is the bottleneck in my experience, not the volume of data. It’s only when downloading files that I see significant savings from caching [2].

If the top-level index were to return a “last modified” timestamp for each project, as well as the name, this would allow consumers to avoid requesting unchanged data without needing to do a HEAD request.
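A sketch of how a consumer might use such a field, purely as an illustration of the idea: the `last-modified` key and its ISO 8601 format are hypothetical and not part of PEP 691, and `changed_projects` is a made-up helper.

```python
from datetime import datetime, timezone


def changed_projects(index: dict, since: datetime) -> list:
    """Return names of projects that need to be re-fetched.

    Assumes each entry in "projects" may carry a HYPOTHETICAL "last-modified"
    ISO 8601 timestamp. Entries without it are treated as changed, so the
    code still works against a server that only sends "name".
    """
    changed = []
    for project in index.get("projects", []):
        stamp = project.get("last-modified")  # hypothetical key, not in PEP 691
        if stamp is None or datetime.fromisoformat(stamp) > since:
            changed.append(project["name"])
    return changed


index = {
    "meta": {"api-version": "1.0"},
    "projects": [
        {"name": "alpha", "last-modified": "2022-07-20T10:00:00+00:00"},
        {"name": "beta", "last-modified": "2022-07-22T09:30:00+00:00"},
        {"name": "gamma"},  # no timestamp: fetch to be safe
    ],
}
last_sync = datetime(2022, 7, 21, tzinfo=timezone.utc)
to_fetch = changed_projects(index, last_sync)
print(to_fetch)
```

One request for the top-level index would then replace a HEAD request per project.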

I’m not quite sure from @brettcannon’s comment what he expects to happen with possible extensions like this (or the url example he mentioned). Would they be candidates for inclusion in a “version 2” of the spec (with the expectation that it would be a while before the new version happens), or is the fact that the PEP allows servers to send extra data beyond what the spec requires intended to let people experiment with such extensions? If it’s the latter, I might raise a feature request on Warehouse (I’d love to see the back of the XML-RPC interface in my code).

[1] And frankly, it’s a bit clumsy even for that…
[2] This may not be true if you have low bandwidth of course.

dstufft · July 22, 2022, 2:06pm

The shape of the JSON response is made to enable future PEPs to add additional “stuff” to the response without having to make a whole new v2 API. We generally wouldn’t want to just add features to the simple api in Warehouse unless it’s truly something Warehouse specific.

But new PEPs can add new data just by specifying the new data itself, similar to how we add new fields to the core metadata.
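On the client side, this works as long as consumers read the keys they know and ignore the rest. A minimal sketch, assuming a project page shaped like the PEP 691 JSON response; the `parse_project_page` helper and the extra `x-example-extra` keys are invented for the example.

```python
import json


def parse_project_page(body: str) -> list:
    """Return (filename, url) pairs, ignoring any keys this client does not know."""
    data = json.loads(body)
    major = data["meta"]["api-version"].split(".")[0]
    if major != "1":
        # Only a major version bump is treated as incompatible.
        raise ValueError(f"unsupported api-version: {data['meta']['api-version']}")
    return [(f["filename"], f["url"]) for f in data["files"]]


body = json.dumps({
    "meta": {"api-version": "1.1"},
    "name": "foo",
    "x-example-extra": True,  # made-up key standing in for a future addition
    "files": [
        {
            "filename": "foo-1.0.tar.gz",
            "url": "https://example.invalid/foo-1.0.tar.gz",
            "hashes": {"sha256": "abc123"},
            "x-example-extra": "ignored",  # made-up key, also ignored
        }
    ],
})
files = parse_project_page(body)
print(files)
```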

brettcannon · July 22, 2022, 7:12pm

I expect way more minor version bumps than major version bumps. So adding a “last modified” would probably be a 1.1 thing, and maybe not even a version number change (e.g. look at all the changes to the HTML API that didn’t bump the minor version since they were discoverable independent of what was in the original spec).

cooperlees · September 1, 2022, 4:04am

Bandersnatch 6.0.0 now supports this PEP.

user138234 · September 21, 2022, 9:55am

Just a quick heads-up: our local mirror was also affected by this change, as I reported here: “[solved] PIP fetches from files.pythonhosted.org despite local mirror was specified”. This was difficult to debug, as accessing the mirror via curl worked as expected, but downloads via pip were still redirected to files.pythonhosted.org even though the index was fetched correctly. I didn’t find this thread with my search keywords, so I thought reporting back here might help users find it in the future.

brettcannon · November 22, 2022, 1:14am

I have an update for mousebender (GitHub - brettcannon/mousebender: A package for installing Python packages) that changes its mousebender.simple module to convert HTML-based Simple repository API responses to the equivalent JSON-based ones. Right now I’m just waiting to see whether PEP 700: Additional Fields for the Simple API for Package Indexes gets accepted, or for the end of the month, whichever comes first, before doing a new release. After that I will add some niceties that should make it easier to not care about whether a server responded with HTML or JSON.

brettcannon · December 4, 2022, 12:05am

Just released mousebender 2022.0.0 with PEP 691 support via converting HTML-based Simple Repository API data to JSON-based data. We have some ideas on how to make this fairly transparent to consumers of a Simple Repository API response such that they don’t have to care if it’s HTML or JSON as they will just get the JSON in the end.
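To give a feel for what such a conversion involves, here is a small standard-library sketch. It is not mousebender's API or implementation: the `html_to_json` function and `_AnchorCollector` class are hypothetical, and only the filename, URL, hash fragment, and `data-requires-python` attribute are handled.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class _AnchorCollector(HTMLParser):
    """Collect one file entry per <a> element of a project page."""

    def __init__(self):
        super().__init__()
        self.files = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = {"attrs": dict(attrs), "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag != "a" or self._current is None:
            return
        attrs = self._current["attrs"]
        # The hash travels as a URL fragment such as "#sha256=<digest>".
        url, fragment = urldefrag(attrs.get("href") or "")
        hashes = {}
        if "=" in fragment:
            algorithm, _, digest = fragment.partition("=")
            hashes[algorithm] = digest
        entry = {
            "filename": self._current["text"].strip(),
            "url": url,
            "hashes": hashes,
        }
        if attrs.get("data-requires-python") is not None:
            entry["requires-python"] = attrs["data-requires-python"]
        self.files.append(entry)
        self._current = None


def html_to_json(name: str, html: str) -> dict:
    """Convert an HTML project page to a JSON-shaped dict (hypothetical helper)."""
    parser = _AnchorCollector()
    parser.feed(html)
    parser.close()
    return {"meta": {"api-version": "1.0"}, "name": name, "files": parser.files}


page = (
    '<html><body><a href="https://example.invalid/foo-1.0.tar.gz#sha256=abc123" '
    'data-requires-python="&gt;=3.7">foo-1.0.tar.gz</a></body></html>'
)
result = html_to_json("foo", page)
print(result)
```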