PEP RFC: Python Package Index (Warehouse) JSON API v1

nchepanov · June 10, 2021, 5:55pm

Greetings!

@cooperlees @sumanah and I would like to propose a PEP that formalizes the existing JSON API.

The PEP introduces a JSON Schema and includes changes to the API URL structure.

The draft: peps/pep-9999.rst at warehouse_json_api · nchepanov/peps · GitHub
JSON Schema: peps/json_api.schema.json at warehouse_json_api · nchepanov/peps · GitHub

Non-goals

The following is not part of this proposal, but is likely to warrant subsequent PEPs:

Adding properties that aren’t already returned by the legacy JSON API endpoints
Removing properties that are already returned by the legacy JSON API endpoints
Adding discovery endpoints
Adding pagination capabilities
Adding authentication
Adding writeable endpoints
Supporting TUF (PEP 458): This version of the JSON API is not protected by TUF, and so should not be used for dependency resolution.
Deprecating XMLRPC API: The PEP lays out the foundation for the future deprecation of the XMLRPC API.

Proposed API structure changes

$root/pypi/$project_name/json          -> $root/api/v1/project/$project_name/latest
$root/pypi/$project_name/$version/json -> $root/api/v1/project/$project_name/$version
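
As a rough illustration of the mapping above, here is a minimal Python sketch that builds both the legacy and the proposed URL for a project or release. The function name `json_api_urls` and the percent-encoding of path segments are assumptions made for illustration; the draft itself only gives the two URL patterns.

```python
from typing import Optional, Tuple
from urllib.parse import quote


def json_api_urls(
    root: str, project_name: str, version: Optional[str] = None
) -> Tuple[str, str]:
    """Return (legacy_url, proposed_url) for a project or one of its releases.

    Hypothetical helper: only the URL patterns come from the proposal.
    """
    root = root.rstrip("/")
    # Percent-encode each path segment (an assumption, not part of the draft).
    name = quote(project_name, safe="")
    if version is None:
        legacy = f"{root}/pypi/{name}/json"
        proposed = f"{root}/api/v1/project/{name}/latest"
    else:
        ver = quote(version, safe="")
        legacy = f"{root}/pypi/{name}/{ver}/json"
        proposed = f"{root}/api/v1/project/{name}/{ver}"
    return legacy, proposed
```

For example, `json_api_urls("https://pypi.org", "black")` yields the legacy `/pypi/black/json` URL alongside the proposed `/api/v1/project/black/latest` one.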

Help needed

Should X-PyPI-Last-Serial header be part of the spec?
I’d appreciate it if someone with a better understanding of the domain could verify, for every property:
- is it nullable?
- is it required?
- is it deprecated?

Relevant background:

https://discuss.python.org/t/pep-for-the-python-package-index-json-api/5717
https://github.com/pypa/packaging-problems/issues/367
https://github.com/devpi/devpi/issues/801

domdfcoding · June 10, 2021, 6:07pm

I assume the existing URLs will redirect to the new scheme? I don’t see it mentioned in the PEP, although it is probably a warehouse implementation detail and out of scope.

pf_moore · June 10, 2021, 6:35pm

As a migration/transition point, can the spec be explicit that on PyPI the serial numbers will not be reset when moving to the new API, so that tools that use the last serial number to detect changes when calling the existing API will be able to change URLs transparently, and will not have to re-fetch data they already have just because all of the serial numbers have changed?

I like this, but am I reading the spec right and it’ll always be the same as the last_serial value in the JSON response? I can imagine using this to issue a HEAD request and skip requesting the body if there’s no serial number change. I don’t do that at the moment, but I certainly could (and given that I’m downloading a few thousand responses in a single run, it could be a worthwhile saving).
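
A minimal sketch of that HEAD-request idea, in Python with only the standard library. It assumes the index answers HEAD requests with an `X-PyPI-Last-Serial` header whose value matches `last_serial` in the body, which is exactly the point that would need confirming in the spec. The names `needs_refetch` and `head_headers` are hypothetical.

```python
import urllib.request
from typing import Mapping, Optional


def needs_refetch(headers: Mapping[str, str], known_serial: Optional[int]) -> bool:
    """Decide from response headers whether the JSON body must be downloaded."""
    raw = headers.get("X-PyPI-Last-Serial")
    if raw is None or known_serial is None:
        return True  # nothing to compare against, so fetch to be safe
    try:
        return int(raw) != known_serial
    except ValueError:
        return True  # unparseable header value, so fetch to be safe


def head_headers(url: str):
    """Issue a HEAD request and return the response headers (network I/O)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.headers
```

A client would call `needs_refetch(head_headers(url), cached_serial)` and only issue the GET when it returns `True`.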

I can try and do that for you - if I get a chance this weekend I’ll take a look. I don’t work on warehouse itself, so I can’t check against the DB schema or anything like that, though.

dustin · June 10, 2021, 7:54pm

Thanks for working on this!

I took a quick pass through the PEP (and found some typos, PR here: https://github.com/nchepanov/peps/pull/1). My main issue with this PEP is that it doesn’t really do what it says it’s going to do. In the “Motivation” section, it says:

This PEP aims to lock in the existing standard as a guarantee for consumers

But then later, one of the goals is:

Declare legacy JSON API endpoints deprecated

And instead this PEP describes how the new JSON API will work.

Given that all the indexes you mentioned will probably continue to support the “legacy” API for a long time, I think we actually need two PEPs:

A PEP that defines the existing, legacy JSON API as a standard, that people can continue to use
A PEP that declares that standard deprecated and provides the new standard JSON API

With regard to the new standard, I’d like to see some of the comments that @dstufft raised around hypermedia-based APIs addressed as well (e.g. here and here). I think this is a really important part of the new API and it needs to be included from the very start for any proposal for a new API standard.

I agree, some discussion of how these APIs should be deprecated/discontinued in a way that won’t break downstream consumers should be included here.

As an implementation detail of PyPI, these are baked into our DB and won’t change when they are surfaced via any API. I’m curious though – does anyone know if any third-party private indexes provide this field as well? Or just mirrors?

domdfcoding · June 10, 2021, 9:02pm

Devpi includes X-Pypi-Last-Serial for mirrors, identical to the value sent by PyPI for the mirrored page. It also includes X-Devpi-Serial for both mirrors and private indexes.
(This is for the simple API, as devpi has no JSON api yet).

EpicWink · June 10, 2021, 11:12pm

If you wanted to include the JSON schema in the PEP, you can put it in an HTML collapsible

.. raw:: html

   <details>
   <summary><a>JSON schema</a></summary>

.. code-block:: json

   {}

.. raw:: html

   </details>

I would suggest converting the schema to YAML for readability

Also, there’s an opportunity to have an OpenAPI schema

cooperlees · June 11, 2021, 2:23pm

And instead this PEP describes how the new JSON API will work.

Let’s tune the wording here then. This PEP’s goal is to make minimal changes to start the deprecation of the old endpoints (which we can leave un-versioned and unchanged for a defined period, with warnings to callers) and to introduce the same API with better namespacing, versioning and a specified data format, so that we have a record of what it offers and mirrors/other indexes can implement a JSON endpoint too (per index).

The data/schema changes here touch very little of the actual JSON content that is offered today. Because of this, I don’t really call this a new API. It’s more of a long overdue cleanup and a definition of what the “JSON” API actually offers.

How do Nikita and I move forward here? Is it the shared, preferred view among those who make the call here that we make two PEPs for this? I feel the first PEP will not be beneficial due to:

Non friendly namespacing makes it hard to implement elsewhere outside of pypi/warehouse
The fact the majority want to move to a new or extended version of this api

Can we just modify this PEP to talk about the legacy {URL}/pypi/PKG_NAME/json and it being available for a period until we make /api mature and get all main callers using it? I’m happy to try to identify the callers via logs and reach out / do PRs to move them to the new versioned URL too.

njs · June 16, 2021, 12:36am

The requires-dist metadata is just a trap right now, IIUC? Probably that should also be marked [DEPRECATED]?

I also agree with @pradyunsg’s comment here that including the full list of releases in every version-specific page seems gratuitous. Is there any use case at all?

I get the argument that you want to avoid changes now to avoid scope creep, but then you’re changing the URLs, which already breaks every existing consumer… If you weren’t changing anything, that would make sense to me, or if you were doing all the obvious cleanups that would make sense to me, but doing only some of the obvious cleanups doesn’t make sense.

(Personally I don’t think it’s critical to come up with a PEP for the existing API unless we’re recommending that other projects like devpi implement it. As long as there’s just a single implementation at Warehouse, then the Warehouse docs are good enough.)

pradyunsg · June 18, 2021, 10:29am

IIUC, that’s one of the goals. It’s also that folks didn’t wanna make changes to the existing JSON API without first documenting it in a standard, and going through the whole process to hold it at the same footing as the simple API.

nchepanov · July 6, 2021, 9:53pm

I can see now that if any progress is to be made, it’s best to limit the scope of this PEP to the existing API, avoid making any changes to it and focus on formalizing required / optional fields.

Please see the changes I made to the draft.

A few questions for whoever has the expertise:

Can someone more knowledgeable please describe the logic used to determine the “latest installable version” when using pypi/$project_name/json API? Or point out where it happens in the warehouse codebase?

I see that it’s typically the latest non-yanked, non-pre-release version? However in some cases e.g. https://pypi.org/project/black/#history where all versions are pre-releases it returns the latest pre-released version instead.
Per @njs’s request, I’d like to clarify whether requires_dist is deprecated. It appears that many projects don’t have this populated.

This would be very helpful. It appears that most of the fields are defined as nullable which makes all properties optional. Would you be able to confirm this for me please?

It’s much easier to work with it when it’s in a separate file

The JSON validators I found require the JSON Schema file to be JSON. I hope the schema will be used for validation purposes and not just read by humans. I don’t believe that readability is a concern; there are a number of JSON Schema visualization tools too.

Why is it useful or necessary? Is it more expressive, or more “accepted”? Couldn’t an OpenAPI schema be constructed 1-1 from a JSON Schema?

nchepanov · July 6, 2021, 10:02pm

I’m unable to make edits to my original post (or simply can’t find the right button).

EpicWink · July 6, 2021, 10:48pm

OpenAPI specs allow client projects to test their interaction with an API with an official schema without having to make requests to the API. They also allow for easy visualisation of the API using tools like SwaggerUI.

OpenAPI is different from JSON Schema in that they’re specifications for different things: JSON Schema for JSON data validation, OpenAPI for HTTP API specification. Usually you use JSON Schema to provide a specification for the request and response JSON bodies inside OpenAPI.
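
To make the relationship concrete, here is a small Python sketch of an OpenAPI 3.1 document that embeds a JSON Schema for the response body of the legacy project endpoint. The schema fragment is purely illustrative (it only names `last_serial` and `info.requires_dist`, two fields mentioned in this thread) and is not the schema from the draft; `resolve_ref` is a hypothetical helper.

```python
import json

# Illustrative JSON Schema fragment, NOT the draft's json_api.schema.json.
PROJECT_SCHEMA = {
    "type": "object",
    "properties": {
        "last_serial": {"type": "integer"},
        "info": {
            "type": "object",
            "properties": {
                "requires_dist": {
                    "type": ["array", "null"],
                    "items": {"type": "string"},
                },
            },
        },
    },
}

# OpenAPI describes the HTTP side and points at the JSON Schema for the body.
OPENAPI_DOC = {
    "openapi": "3.1.0",
    "info": {"title": "Legacy JSON API (sketch)", "version": "1"},
    "paths": {
        "/pypi/{project_name}/json": {
            "get": {
                "parameters": [
                    {
                        "name": "project_name",
                        "in": "path",
                        "required": True,
                        "schema": {"type": "string"},
                    }
                ],
                "responses": {
                    "200": {
                        "description": "Project metadata",
                        "content": {
                            "application/json": {
                                "schema": {"$ref": "#/components/schemas/Project"}
                            }
                        },
                    }
                },
            }
        }
    },
    "components": {"schemas": {"Project": PROJECT_SCHEMA}},
}


def resolve_ref(doc: dict, ref: str) -> dict:
    """Follow a local '#/a/b/c' reference inside the document."""
    node = doc
    for part in ref.lstrip("#/").split("/"):
        node = node[part]
    return node


if __name__ == "__main__":
    print(json.dumps(OPENAPI_DOC, indent=2))
```

The JSON Schema stays usable on its own for validation, while the OpenAPI wrapper adds the path, parameters and status codes.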

nchepanov · July 7, 2021, 5:22pm

Thank you for the explanation! Makes sense to me, unless I hear objections - shouldn’t be too hard to add.

nchepanov · July 19, 2021, 5:40pm

The draft is awaiting feedback.

pradyunsg · July 21, 2021, 11:44am

Looks like the draft has been updated to basically describe the current API as-is.

After re-reading this thread quite a few times, I’m confused about what we’re trying to do here and why we’re doing it. I’m also feeling like we’re spending the little availability we all have, toward the wrong thing.

I disagree.

Is documenting the current JSON API an absolute blocker for defining a new API? Why?
Do we really need the current JSON API to be an interoperability standard?
- If we just wanna deprecate the current JSON API, why are we spending energy on standardising all of its ~~horrible behaviours~~ quirks in detail?
Do we want to remove the current JSON API from PyPI at some point in the future? Especially, if we have a better designed alternative + migration period?

I think we all agree that the current API has a lot of shortcomings. I think it should definitely NOT be an interoperability standard – the only reason other tooling in the ecosystem mimics our current JSON API is because there’s no alternative to that (and no one’s gonna write an XMLRPC API). Our energy is likely better spent solving the lack of a standard by building a better alternative for what we call the PyPI JSON API today.

Documenting / describing / tweaking / improving the current API is not gonna help us with the design of the new API. The new API has to be substantially different – it’s gonna have to account for a whole bunch of additional concerns that the current API doesn’t – and almost none of it is gonna be informed by the current API (not the URL structure, not most of the content schema/semantics, no X- HTTP headers other than maybe one).

“We documented a thing so that you can continue to depend on it and implement it in other tooling” followed by “We want you to move to the new alternative we created that has more constraints, because that first thing is deprecated with an EoL far away” – to me, that sounds like exactly what we should do if we wanna progress as slowly as possible – I don’t think it’s the approach we should take here. It gives none of our users a compelling reason to migrate, we have to do additional work, and we also need to keep multiple APIs alive longer.

From @nchepanov’s draft:

XXX: What is the state of requires_dist property, is it required?

IIUC, it is populated from the upload API form’s contents:

warehouse/warehouse/forklift/legacy.py at 011cdf2ba1c8a043f779e5101b6c9c80cae7afbe · pypi/warehouse · GitHub for the warehouse side of things.
twine/twine/package.py at fc20d94040cb8f289ccd988928b5a784d643de37 · pypa/twine · GitHub for the twine side of things.

To describe that as I understand it – requires_dist’s value is based on what the upload tool provided in the upload request, NOT on the contents of the files that have been uploaded to PyPI. Thus, if the upload tool doesn’t provide this information, this attribute is gonna be null. It is also null if the package has no dependencies.

For users, this means that you can’t rely on the dependency information in the API being correct, and the last two sentences describe a very annoying quirk that makes even the available information difficult to use.
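
A small Python sketch of what that quirk means for a consumer, assuming the response layout of the existing API (`info.requires_dist` being either a list of requirement strings or null). The helper name `declared_dependencies` is hypothetical.

```python
from typing import Any, List, Mapping, Optional


def declared_dependencies(response: Mapping[str, Any]) -> Optional[List[str]]:
    """Return the requirement strings from info.requires_dist.

    None means "unknown": per the description above, a null value cannot
    distinguish "no dependencies" from "the upload tool sent nothing".
    """
    info = response.get("info") or {}
    value = info.get("requires_dist")
    if value is None:
        return None  # ambiguous, so do NOT treat this as "no dependencies"
    return list(value)
```

A caller that needs reliable dependency data would have to treat `None` as "inspect the uploaded files instead" rather than as an empty list.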

pf_moore · July 21, 2021, 12:29pm

This is, I think, one of the biggest reasons not to try to preserve the existing API (except as a legacy interface for existing tools, most of which probably ignore all of the problematic fields, so that’s yet one more reason not to waste effort documenting/standardising them).

I basically agree with what @pradyunsg says here. I’m an enthusiastic user of the existing JSON API, but I’ve no interest in trawling over its quirks and trying to determine what it all means - I’d much rather be trying to create something even better - collecting use cases, working out where the existing API doesn’t give people what they need, etc. Yes, there will be an element of looking at where the existing API doesn’t do what people want, but I’d rather we focus on what people need, and not on what we need to avoid…

For me, the biggest way this whole exercise could fail is if we end up developing a new API that people can’t use because it doesn’t solve their problems, so they have to stick with the old API, consequently entrenching it even further, and making it harder for us to move forward.

pradyunsg · July 21, 2021, 12:34pm

If there’s no non-pre-release version, it’ll use the latest pre-release. Otherwise, it’ll use the latest non-prerelease. It puts any yanked versions last / ignores them. This is basically mimicking how pip would pick the “latest” package version.

warehouse/warehouse/legacy/api/json.py at 011cdf2ba1c8a043f779e5101b6c9c80cae7afbe · pypi/warehouse · GitHub looks like the relevant chunk of code.
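
A minimal Python sketch of the selection rule as described above, not a copy of the Warehouse code. It assumes each release already carries a comparable `rank` (a hypothetical stand-in for proper PEP 440 ordering, higher meaning newer) plus `is_prerelease` and `yanked` flags. It reads "puts any yanked versions last" literally, so a yanked release is only returned when every release is yanked.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Release:
    version: str
    rank: int  # hypothetical precomputed ordering; higher means newer
    is_prerelease: bool = False
    yanked: bool = False


def pick_latest(releases: Iterable[Release]) -> Optional[Release]:
    """Pick the "latest" release: non-yanked first, then final releases
    before pre-releases, then newest first within each group."""
    ranked = sorted(releases, key=lambda r: r.rank, reverse=True)
    # Python's sort is stable, so newest-first order survives within groups.
    ranked.sort(key=lambda r: (r.yanked, r.is_prerelease))
    return ranked[0] if ranked else None
```

With only pre-releases available (as in the black example earlier in the thread), the newest pre-release is returned; once a final release exists, it wins over any newer pre-release.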

uranusjr · July 22, 2021, 1:42pm

More accurately, this version selection logic is specified in PEP 440 (the pre-release part) and PEP 592 (the yanked part). So a PEP should refer to those PEPs instead of pip’s behaviour.

p.s. pip’s implementation actually contains bugs, but I believe that the buggy parts do not exist in Warehouse, and the parts implemented by both are standards-compliant.

btskinn · August 18, 2021, 4:54pm

Relevant to the overall discussion of the JSON API, I would like to make a note here about my currently open PR implementing different flavors of latest endpoints.

Obviously, I would expect this PR to be on hold until the overall path forward is resolved, but I’d request it be kept in mind as part of the conversation regarding a standardized API. Multiple people have expressed interest on the PR and associated issue, so it seems sufficiently not-YAGNI to my mind.

cooperlees · April 29, 2022, 9:26pm

Howdy all,

Long time. I’m going to try and get this moving again. I’m at PyCon US 2022 and in the Packaging summit, so might try and start a discussion.

I guess the main thing to work out now is:
a) Do we want a PEP on the current API at all? It seems people are divided on this.
b) If not, do we want to just describe the new API (which we did try many many years ago) via example?

Nikita has moved on from Bloomberg and no longer has cycles for this, so I am currently backporting the PEP from his repo to mine so we can hopefully progress.

I’m in the PyPA Discourse + around at the sprints. Who’s also down to try to make some progress here?