PEP 691: JSON-based Simple API for Python Package Indexes

I don’t have time right this moment to look at the other comment; I just wanted to say that I knew the cgi module was being deprecated, but I used this method anyway because doing it with the email module is somewhat noisier, and I felt it distracted from the actual meat and potatoes of the example code that showed the overall flow.

If folks really think it’s super useful to have

import email.message

def parse_header(header):
    # Stands in for the deprecated cgi.parse_header(): returns the
    # content type plus a dict of its parameters, e.g.
    # parse_header("text/html; charset=utf-8")
    # -> ("text/html", {"charset": "utf-8"})
    m = email.message.Message()
    m["content-type"] = header
    ct, *params_raw = m.get_params()
    return ct[0], dict(params_raw)

At the top of the code, I can add it. It just felt like noise to me.

FYI I’m going to wait until I’m told the PEP is done and ready on my feedback before I dive back into it.

2 Likes

Does anyone have any other thoughts? Concerns? Anything :slight_smile: I think we’ve covered most of the concerns people have had, but if we haven’t, I’d love to figure them out and get them handled.

3 Likes

I don’t want to sound too pessimistic (the work on this is very much appreciated), but some push back can be healthy. Although I agree with taking an incremental approach, I’m not sure this is really a step forward. The complexity introduced by this PEP has a cost, and if the reward is not significant, I don’t think it is worth it.

My main question(s) would be: for whom is this PEP, who will benefit from this? and when? (now, nearby future, or far future)

Although JSON in general might be nicer to parse, in its current state the HTML Simple API is much easier to implement than the JSON API proposed in this PEP will be behind content negotiation, with all the possible alternative responses and errors. And the complex logic of HTML, as stated under “Abstract”, is not really present in the Simple API (html.parser works fine).

I also get that this is maybe not about an improvement for today, but instead an intermediate step for future improvements. But in that case, I don’t think this PEP is laying out a solid foundation for that.

In my opinion the real value of this PEP will manifest only after tools (client and server) drop HTML support. And before that eventually happens, a better solution should already be around.

This is a lot of complaining from my side without providing any solutions. After thinking about this for a while, I can’t think of great alternatives, at least not without (partly) dropping the zero-configuration requirement.

Which leads me to think that maybe we shouldn’t continue with this PEP at all…

3 Likes

It’s fine-ish. As someone who has implemented code to handle the HTML-based Simple API, I can say html.parser is not exactly a robust parser. It’s fine for simple things, but there’s no guarantee it will succeed on valid HTML.

Plus it’s way easier to find libraries to consume JSON than HTML in other languages these days (and that is important for tooling purposes).

That’s typically not how we evolve standards, because it makes switching harder. By keeping the overall data model the same and changing only the parsing step, this PEP makes the change happen at the edges of your code rather than at the logic level (e.g. it’s more like changing how you encode/decode strings than like switching to integers for everything).

3 Likes

Of course! I welcome people to pick apart these proposals :slight_smile:

In the very short term, I suspect nobody will benefit, since the very short term will be all cost (the cost of having to implement this thing) and no benefit (it’s expected that everyone will continue to maintain their existing HTML parsing solutions, and the data will be largely 1:1).

In the longer term, we have a couple of benefits:

  • People implementing repositories and clients that implement this API can, on their own time schedule, start dropping support for HTML responses. The expectation is that PyPI and pip will likely maintain theirs for quite some time just due to their positions within the ecosystem, but projects without those constraints are enabled to be much more aggressive in dropping support.
    • This includes brand new projects, which may decide never to implement the HTML content type at all.
  • It unlocks the ability to start adding new features that are no longer constrained by the limitations of HTML.

There are a couple of things here that I don’t agree with.

The first is that I don’t think content negotiation is actually harder than the current situation. Content negotiation is a foundational part of how HTTP works, and every client has to be prepared to cope with it in every request.

To expand on that, there is not actually a way to make an HTTP request that doesn’t, at its core, boil down to content negotiation. So currently, when you make an HTTP request to a Simple API, you can either include an Accept: text/html header or not.

If you do not, then the server is, by nature of HTTP, welcome to choose any content type it wants, or return an error. If you do send an Accept header, again the server is free to use that information in guiding what it will return, or it can ignore it and return whatever it wants if it doesn’t support that.

The important bit here is that this is fundamentally just content negotiation, whether you’re not including the Accept header (which tells the server that you’re happy with whatever representation it gives you) or whether you are (which tells the server you prefer text/html).

In both cases, you may not get the content type that you expect; there is no way in HTTP to mandate that you only get the correct content type, and you have to be ready to cope, in some fashion, with the fact that you may not. Now granted, in practice most servers will return the content type that you expect, and in the cases they don’t, you can just assume they did; at some point you will hit a point where the assumptions you made about the response content don’t hold and you’ll get some random error.

But that’s all mostly true with this PEP too, you can just assume that the server sent you the content type you expected.

You can also just not send an Accept header at all and assume the server will send you something you expect, which matches the simplest possible client implementation today. The only difference is that there is a greater chance the server won’t send you what you expect (previously it should only have returned text/html, but now it could return other content types as well), so it’s really recommended that you at least include an Accept header.

I will go back to my example code. Here’s the absolute simplest code that will more or less reliably do what you want in most cases with the existing API:

import requests

resp = requests.get("https://pypi.org/simple/")
resp.raise_for_status()

data = parse_html(resp)

Here’s the same absolute simplest code with the changes in the PEP, assuming that you’re handling the most complex case possible, of supporting both HTML and JSON:

import requests

resp = requests.get(
    "https://pypi.org/simple/",
    headers={"Accept": "application/vnd.pypi.simple.v1+json, application/vnd.pypi.simple.v1+html, text/html"},
)
resp.raise_for_status()

if "application/vnd.pypi.simple.v1+json" in resp.headers.get("content-type", ""):
    data = parse_json(resp)
else:
    data = parse_html(resp)

This isn’t as robust as the example code in the PEP, but it’s as robust as the existing code was (it’s actually technically slightly more robust!). It makes an HTTP request, then assumes that the content type is something it understands; if not, it will error out at some point.

But if you look at these two things, the additional complexity caused by content negotiation is… an extra dictionary being passed to requests.get(), and an extra conditional on the response. That’s hardly what I’d call a lot of extra complexity, and in fact it matches what pip itself does today (other than the addition of the application/vnd.pypi.simple.* types; pip’s conditional just raises an error if the response isn’t text/html).

On the server side there is some additional complexity in parsing and selecting the content type to respond with, but all of the major web frameworks that I could find support it, as do some of the static file servers (some don’t).

The other statement here is that the complex logic of HTML isn’t present in the simple API, but that’s not actually true IMO, because of these two lines from PEP 503:

URL must respond with a valid HTML5 page
There may be any other HTML elements on the API pages as long as the required anchor elements exist.

That means that a fully PEP 503 conformant client MUST be prepared to accept a response body that contains literally any valid HTML5 content, regardless of what that content is. Now, in practice it’s highly unusual to put something in your simple response that html.parser can’t parse, so you can most likely ignore that requirement of the PEP without any ill effect, but doing so means that you’re deviating from the PEP.
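For concreteness, here is a minimal sketch (my own illustration, not code from the PEP or from any real client) of the kind of html.parser-based anchor extraction being discussed. It is enough for well-behaved index pages, but it makes no promises about arbitrary valid HTML5:

```python
from html.parser import HTMLParser

class SimpleIndexParser(HTMLParser):
    """Collect (href, text) pairs from anchor tags on a simple index page."""

    def __init__(self):
        super().__init__()
        self.links = []      # completed (href, text) pairs
        self._href = None    # href of the anchor currently being parsed
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None

parser = SimpleIndexParser()
parser.feed('<a href="/simple/pip/">pip</a><br/>\n<a href="/simple/numpy/">numpy</a>')
print(parser.links)  # [('/simple/pip/', 'pip'), ('/simple/numpy/', 'numpy')]
```

This works fine on pages shaped like the ones PyPI actually serves today; the point above is that PEP 503 technically permits pages this approach could mishandle.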

Here again, I don’t agree with this conclusion.

I think this does represent an intermediate step for future improvements, because a major blocker to improvements right now is trying to fit things into the capabilities of HTML. For example, something we would like to do is add all of the dependencies for a project to the response, but there isn’t really a good way to serialize a list of data into an HTML attribute besides something like embedding JSON inside the attribute.

An important aspect of this PEP is in this line:

Future versions of the API may add things that can only be represented in a subset of the available serializations of that version.

This gives us full permission to effectively freeze the HTML API in place, never adding another feature to it, while we start adding new features to the JSON API, freeing us from having to worry about how we can encode something that we want to add into HTML.

Certainly, some of the value in this PEP will not manifest until after clients or repositories start dropping support for HTML, though even in the interim it makes things like “just” using html.parser a little more palatable. And, as mentioned above, this PEP does allow us to start improving the API with new features right away.

I do want to challenge the idea of “a better solution should already be around”. I don’t think that the data model of the simple API is actually a problem for its intended use case, and I think it serves it well. There are things that we would like to add that are tough to express in HTML, but I think the fundamental shape of the data is… fine?

I don’t really see us needing to replace this API in the future unless the state of the art drastically changes in some way that I don’t think it’s possible for us to see right now.

Certainly, this API isn’t well structured as a general-purpose API for interacting with PyPI, but that’s not its goal and never should be. The amount of traffic we get on this API is massive, and it deserves an API that is specialized for its use cases; a general-purpose API will never be that.

6 Likes

Just for kicks, here’s an implementation of this for Warehouse that should be fully featured. It’s not able to be landed yet since it needs tests and such, but manual testing has it working fine: Implement a PoC for PEP 691 by dstufft · Pull Request #11485 · pypa/warehouse · GitHub.

Might try to throw something together for pip as well here in a bit.

3 Likes

Here’s the same thing for pip.

2 Likes

OK, and I tested both of these locally, both with Warehouse serving both content types and with Warehouse’s HTML support commented out altogether. My Warehouse isn’t set up to serve files, so fetching files 404’d, but it got to that point just fine.

Most of the changed lines in the pip PR are just removing the word “html”; maybe I should have left them in to make it more obvious what the actual required changes are.

2 Likes

I’ve also got my proxy index working with manual tests: Comparing master...json-api · EpicWink/proxpi · GitHub


Any users who use curl without explicitly setting Accept will likely start getting JSON responses and breaking their scripts, because curl sets Accept: */* by default. The solution would be to require that JSON be chosen only if its quality is strictly greater than HTML’s, but then Accept: ...+json, ...+html (i.e. without setting quality) would always return HTML.


I did a benchmark of the response body size difference between the HTML and JSON APIs. On average, the JSON response was 1.91x as large (i.e. 91% bigger).

Individual packages (click to expand)
Project HTML size (kB) JSON size (kB) JSON size ratio
babel 8.5 18.4 2.16
cython 498.7 883.0 1.77
flask 11.7 25.0 2.14
gitpython 25.3 50.7 2.0
jinja2 14.2 30.7 2.16
keras-preprocessing 4.3 8.7 2.02
mako 10.1 22.8 2.26
markdown 14.1 31.0 2.2
markupsafe 85.7 153.5 1.79
pillow 478.8 924.8 1.93
pyjwt 16.7 38.3 2.29
pyopengl 9.7 22.5 2.32
pyopengl-accelerate 26.0 52.9 2.03
pyqt5 27.7 51.9 1.87
pyqt5-qt5 0.8 1.5 1.88
pyqt5-sip 51.1 98.4 1.93
pywavelets 68.4 129.6 1.89
pyyaml 69.0 133.0 1.93
pygments 24.4 53.9 2.21
qtpy 10.2 22.1 2.17
sqlalchemy 504.1 842.8 1.67
send2trash 4.0 8.7 2.17
shapely 132.2 259.8 1.97
sphinx 62.9 138.2 2.2
werkzeug 22.1 46.8 2.12
absl-py 6.3 14.5 2.3
alabaster 4.9 11.1 2.27
alembic 22.0 45.1 2.05
argon2-cffi 39.5 79.5 2.01
astunparse 3.1 6.8 2.19
attrs 8.6 17.5 2.03
azure-common 10.2 22.2 2.18
azure-core 13.7 29.8 2.18
azure-cosmos 8.4 17.9 2.13
azure-identity 13.6 28.7 2.11
azure-keyvault-secrets 5.2 10.4 2.0
azure-storage-blob 15.2 31.2 2.05
backcall 0.7 1.4 2.0
bleach 14.5 30.1 2.08
boto3 346.4 745.5 2.15
botocore 460.8 986.6 2.14
build 7.2 13.8 1.92
cachetools 12.0 25.8 2.15
certifi 13.3 29.0 2.18
cffi 231.6 483.1 2.09
charset-normalizer 12.1 23.2 1.92
click 14.9 32.8 2.2
cloudpickle 11.4 24.4 2.14
colorama 12.1 27.9 2.31
coverage 533.9 984.8 1.84
cryptography 353.3 672.5 1.9
cycler 1.1 2.1 1.91
databricks-cli 13.0 28.2 2.17
debugpy 246.0 421.3 1.71
decorator 10.5 22.2 2.11
defusedxml 4.2 8.2 1.95
deprecation 4.1 8.9 2.17
docker 18.2 37.5 2.06
docutils 10.0 20.0 2.0
entrypoints 2.0 3.8 1.9
flaky 8.3 18.2 2.19
flatbuffers 1.7 3.5 2.06
floto 0.1 0.1 1.0
gast 4.6 9.4 2.04
gitdb 4.4 9.3 2.11
glfw 40.5 75.6 1.87
google-auth 45.5 85.9 1.89
google-auth-oauthlib 5.4 10.6 1.96
google-pasta 4.8 10.4 2.17
greenlet 159.6 301.4 1.89
grpcio 852.9 1705.8 2.0
gunicorn 16.0 35.6 2.23
h5py 80.5 155.1 1.93
idna 6.2 13.8 2.23
imageio 18.8 40.1 2.13
imagesize 3.0 5.9 1.97
imgaug 2.4 5.3 2.21
imgviz 10.9 25.0 2.29
importlib-metadata 38.0 70.3 1.85
importlib-resources 20.6 37.9 1.84
iniconfig 1.3 2.7 2.08
ipykernel 32.2 66.4 2.06
ipyparallel 15.0 30.5 2.03
ipython 54.8 119.1 2.17
ipython-genutils 0.9 1.8 2.0
ipywidgets 36.5 79.5 2.18
isodate 2.6 5.9 2.27
itsdangerous 6.0 12.4 2.07
jedi 10.0 19.9 1.99
jmespath 5.4 11.9 2.2
joblib 28.0 63.9 2.28
jsonschema 18.1 39.3 2.17
jupyter-client 22.3 44.0 1.97
jupyter-core 13.0 26.4 2.03
jupyterlab-pygments 2.4 4.5 1.88
jupyterlab-widgets 16.1 32.0 1.99
keras 12.9 29.6 2.29
kiwisolver 75.8 133.9 1.77
labelme 24.1 57.4 2.38
libclang 7.1 13.8 1.94
majora 1.2 2.2 1.83
marshmallow 54.5 116.0 2.13
marshmallow-dataclass 18.3 36.4 1.99
marshmallow-oneofschema 5.0 9.6 1.92
marshmallow-union 1.9 3.7 1.95
matplotlib 241.3 445.6 1.85
matplotlib-inline 1.7 3.1 1.82
mistune 14.8 29.3 1.98
mlflow 18.7 40.3 2.16
msal 10.8 24.8 2.3
msal-extensions 3.4 7.1 2.09
msrest 22.3 50.7 2.27
mypy-extensions 1.7 3.5 2.06
nbclient 10.6 20.7 1.95
nbconvert 19.3 38.9 2.02
nbformat 8.1 16.5 2.04
nest-asyncio 11.6 22.9 1.97
networkx 30.0 67.8 2.26
nose 2.8 6.5 2.32
notebook 29.5 63.6 2.16
numpy 504.2 907.6 1.8
oauthlib 8.6 19.1 2.22
opencv-python 285.8 518.4 1.81
opencv-python-headless 242.7 428.2 1.76
opt-einsum 3.0 6.3 2.1
packaging 14.3 28.7 2.01
pandas 272.8 516.2 1.89
pandocfilters 2.6 5.6 2.15
parso 8.5 18.2 2.14
pep517 4.2 9.4 2.24
pexpect 4.0 9.0 2.25
pickleshare 3.3 7.2 2.18
pip 36.6 76.4 2.09
pluggy 7.0 13.6 1.94
portalocker 8.6 18.9 2.2
prometheus-client 9.4 19.2 2.04
prometheus-flask-exporter 12.1 25.0 2.07
prompt-toolkit 42.6 90.3 2.12
protobuf 315.6 614.4 1.95
psutil 198.9 408.2 2.05
ptyprocess 2.2 4.8 2.18
py 12.9 28.7 2.22
pyasn1 46.8 109.6 2.34
pyasn1-modules 40.0 88.4 2.21
pycocotools 1.0 2.1 2.1
pycparser 3.4 7.7 2.26
pymap3d 11.9 25.6 2.15
pyparsing 43.3 93.8 2.17
pyproj 123.2 230.2 1.87
pyrsistent 24.3 48.6 2.0
pytest 49.1 100.2 2.04
pytest-cov 11.9 24.1 2.03
python-dateutil 9.2 18.4 2.0
python-editor 2.4 5.1 2.12
python-json-logger 4.2 8.6 2.05
pytz 95.0 228.2 2.4
pyzmq 291.6 556.2 1.91
qtconsole 12.8 27.7 2.16
querystring-parser 1.4 2.8 2.0
requests 32.6 71.5 2.19
requests-oauthlib 6.4 13.1 2.05
rsa 11.2 25.5 2.28
s3transfer 10.2 22.2 2.18
scikit-image 105.4 185.9 1.76
scikit-learn 243.3 448.9 1.85
scipy 286.8 516.0 1.8
sentry-sdk 46.1 101.6 2.2
setuptools 208.1 441.1 2.12
six 7.0 15.6 2.23
sklearn 0.3 0.4 1.33
smmap 3.8 7.7 2.03
snowballstemmer 2.3 4.9 2.13
sphinxcontrib-applehelp 1.3 2.3 1.77
sphinxcontrib-devhelp 1.3 2.3 1.77
sphinxcontrib-htmlhelp 2.0 3.8 1.9
sphinxcontrib-jsmath 1.0 1.6 1.6
sphinxcontrib-qthelp 1.6 3.0 1.88
sphinxcontrib-serializinghtml 2.6 4.7 1.81
sqlparse 6.3 13.7 2.17
tabulate 4.5 10.2 2.27
tensorboard 21.2 40.0 1.89
tensorboard-data-server 5.1 8.9 1.75
tensorboard-plugin-wit 1.5 2.7 1.8
tensorflow 171.6 338.0 1.97
tensorflow-estimator 6.9 13.4 1.94
tensorflow-io-gcs-filesystem 40.3 65.5 1.63
termcolor 1.1 2.3 2.09
terminado 9.3 19.0 2.04
testpath 3.1 6.6 2.13
threadpoolctl 2.8 5.2 1.86
tifffile 35.4 72.8 2.06
toml 3.1 7.0 2.26
tomli 8.0 16.4 2.05
tornado 49.6 97.0 1.96
tqdm 49.4 104.1 2.11
traitlets 11.4 23.9 2.1
typeguard 15.1 31.3 2.07
typing-extensions 8.2 16.8 2.05
typing-inspect 4.0 8.4 2.1
urllib3 21.7 43.4 2.0
wcwidth 4.4 9.8 2.23
webencodings 1.1 2.4 2.18
websocket-client 17.5 36.5 2.09
wheel 19.4 40.6 2.09
widgetsnbextension 38.5 79.9 2.08
wrapt 182.5 308.7 1.69
xmltodict 4.6 10.3 2.24
zipp 12.1 24.9 2.06

Edit: bad benchmark; see the comments below for the correct run

The way I’ve implemented this in Warehouse: it essentially starts with a priority list that is hard-coded in Warehouse, then takes the list from the client and effectively sorts the server’s list using the client’s priority values. Then it takes the first item.

This works because, with a stable sort, items with equal preference retain their ordering. So clients get to express their priority, but within the same priority level, the server’s initial ordering controls the outcome.

For compatibility reasons, Warehouse prefers text/html over +html over +json, absent any signal from the client that it prefers JSON over HTML. I would like to have the server itself prefer JSON over HTML, but I believe the chances of breakage are much higher in that situation. It might be a good intermediate step at some point in the future, if we ever decide we want to push people more directly towards JSON.
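As a rough illustration (my own sketch, not Warehouse’s actual code, and with only toy Accept-header parsing), the stable-sort approach looks something like this:

```python
# Hypothetical sketch of the stable-sort negotiation described above.
SERVER_PRIORITY = [  # server's own preference, most compatible first
    "text/html",
    "application/vnd.pypi.simple.v1+html",
    "application/vnd.pypi.simple.v1+json",
]

def negotiate(accept: str) -> str:
    # Parse "type;q=0.5, type2" into {type: quality}; quality defaults to 1.0.
    prefs = {}
    for item in accept.split(","):
        media, _, params = item.strip().partition(";")
        q = 1.0
        for param in params.split(";"):
            param = param.strip()
            if param.startswith("q="):
                q = float(param[2:])
        prefs[media.strip()] = q
    # Keep only types the client accepts (directly or via */*)...
    offered = [ct for ct in SERVER_PRIORITY
               if prefs.get(ct, prefs.get("*/*", 0)) > 0]
    # ...sorted by client quality. Python's sort is stable, so types with
    # equal quality keep the server's own ordering.
    offered.sort(key=lambda ct: prefs.get(ct, prefs.get("*/*", 0)),
                 reverse=True)
    return offered[0] if offered else SERVER_PRIORITY[0]

print(negotiate("application/vnd.pypi.simple.v1+json, text/html"))
# -> text/html (equal quality, so the server's preference wins)
```

A client that actually wants JSON under this scheme has to say so explicitly, e.g. by sending a lower q for text/html, which is exactly the behavior being debated below.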

Does this take into account any compression? Or is it the decompressed size?

Oh I see:

This is what your HTML output looks like per file:

    <a href="nose-1.3.7.tar.gz#sha256=f1bffef9cbc82628f6e7d7b40d7e255aefaa1adb6a1b1d26c69a8b79e6208a98">nose-1.3.7.tar.gz</a><br />

(plus a new line)

This is what it looks like for JSON:

{"filename":"nose-1.3.7.tar.gz","hashes":{"sha256":"f1bffef9cbc82628f6e7d7b40d7e255aefaa1adb6a1b1d26c69a8b79e6208a98"},"url":"https://files.pythonhosted.org/packages/58/a5/0dc93c3ec33f4e281849523a5a913fa1eea9a3068acfa754d44d88107a44/nose-1.3.7.tar.gz#sha256=f1bffef9cbc82628f6e7d7b40d7e255aefaa1adb6a1b1d26c69a8b79e6208a98"}

That’s 132 bytes per file for HTML vs 324 bytes per file for JSON (for nose-1.3.7.tar.gz) or a 2.45 ratio.

You should be able to drop the #sha256=... from the url; it’s not required in JSON, since that’s what the hashes key is for. That should save 72 bytes per file.

That takes us to 132:252, or 1.9 ratio.

The URLs are another big difference: nose-1.3.7.tar.gz vs https://files.pythonhosted.org/packages/58/a5/0dc93c3ec33f4e281849523a5a913fa1eea9a3068acfa754d44d88107a44/nose-1.3.7.tar.gz is an extra 107 bytes per file (and I think that may be a bug in the PR; I assume you want to serve the file locally for caching?).

The PEP does specify that URLs are to be interpreted as they would be for HTML, which allows relative URLs to work, so the same URL should work for both.
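The standard library demonstrates the resolution rule; a relative file URL in the JSON response resolves against the page’s own URL, the same way a relative href does in HTML:

```python
from urllib.parse import urljoin

# Resolve a relative filename against the project page's URL, the same
# way a browser would resolve a relative <a href>.
page_url = "https://pypi.org/simple/nose/"
print(urljoin(page_url, "nose-1.3.7.tar.gz"))
# -> https://pypi.org/simple/nose/nose-1.3.7.tar.gz
```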

If we remove the 107 bytes, that brings us to 132:145, or a 1.1 ratio. The remaining 13 bytes per file of difference is largely noise between having to specify “filename” and “hashes” as keys versus not having spaces and newlines. Compression should erase most of that.
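To see why compression should erase most of the remaining difference: the JSON overhead is mostly the repeated key names, which is exactly the kind of redundancy gzip removes. A quick, artificial check (made-up filenames and a fake digest, so the absolute numbers are illustrative only):

```python
import gzip
import json

# Build 100 JSON file entries whose repeated keys ("filename", "hashes",
# "url") mimic the per-file overhead discussed above.
rows = [
    {
        "filename": f"pkg-1.0.{i}.tar.gz",
        "hashes": {"sha256": "ab" * 32},  # fake 64-char digest
        "url": f"pkg-1.0.{i}.tar.gz",
    }
    for i in range(100)
]
raw = json.dumps({"files": rows}, separators=(",", ":")).encode()
packed = gzip.compress(raw)
print(len(raw), len(packed))  # the repeated structure compresses away
```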

Yup, thanks for pointing that out. I’ll fix the response and re-run the benchmark. This is why you don’t code at 3am.


Warehouse’s preference for text/html when qualities are equal is what I’m talking about: if a client requests Accept: text/html, ...+json, the server will always respond with HTML. With Warehouse’s ordering, clients must either specify quality, or both not specify text/html and specify ...+json before ...+html. Not to mention that content negotiation doesn’t seem to care about order.

I would prefer if the PEP said to default to assuming text/html when qualities are equal (and nonzero), and that clients should always set quality.


By latest version, I’m assuming that means the latest version the server knows about?

We currently leave it up to each server to decide what to do. This was to give each implementor the most flexibility to decide what makes the most sense for them.

We pick the most compatible possible option in Warehouse because there is only one version of Warehouse, so people can’t select different behaviors by different versions. I think it’s fine for other implementations to do something different.

Yes.

2 Likes

Fixed benchmark result: the JSON response is 1.05x (± 0.04) as large (i.e. 5% bigger)

Individual packages (click to expand)
Project HTML size (kB) JSON size (kB) JSON size ratio
babel 8.5 9.1 1.07
cython 498.7 518.7 1.04
flask 11.7 12.4 1.06
gitpython 25.3 26.6 1.05
jinja2 14.2 15.2 1.07
keras-preprocessing 4.3 4.5 1.05
mako 10.1 10.9 1.08
markdown 14.1 15.1 1.07
markupsafe 85.7 88.6 1.03
pillow 478.8 505.2 1.06
pyjwt 16.7 18.2 1.09
pyopengl 9.7 10.6 1.09
pyopengl-accelerate 26.0 27.9 1.07
pyqt5 27.7 29.2 1.05
pyqt5-qt5 0.8 0.8 1.0
pyqt5-sip 51.1 53.6 1.05
pywavelets 68.4 72.3 1.06
pyyaml 69.0 72.7 1.05
pygments 24.4 26.3 1.08
qtpy 10.2 10.9 1.07
sqlalchemy 504.1 519.4 1.03
send2trash 4.0 4.3 1.07
shapely 132.2 140.4 1.06
sphinx 62.9 67.5 1.07
werkzeug 22.1 23.6 1.07
absl-py 6.7 7.2 1.07
alabaster 4.9 5.3 1.08
alembic 22.0 23.3 1.06
argon2-cffi 39.5 42.2 1.07
astunparse 3.1 3.3 1.06
attrs 8.6 9.1 1.06
azure-common 10.2 11.0 1.08
azure-core 14.1 15.1 1.07
azure-cosmos 8.4 9.0 1.07
azure-identity 13.6 14.6 1.07
azure-keyvault-secrets 5.2 5.5 1.06
azure-storage-blob 15.2 16.2 1.07
backcall 0.7 0.7 1.0
bleach 14.5 15.4 1.06
boto3 346.7 372.2 1.07
botocore 461.0 495.6 1.08
build 7.2 7.5 1.04
cachetools 12.0 12.8 1.07
certifi 13.3 14.3 1.08
cffi 231.6 249.8 1.08
charset-normalizer 12.1 12.5 1.03
click 14.9 16.0 1.07
cloudpickle 11.4 12.2 1.07
colorama 12.1 13.2 1.09
coverage 533.9 555.8 1.04
cryptography 353.3 372.7 1.05
cycler 1.1 1.1 1.0
databricks-cli 13.0 14.1 1.08
debugpy 246.0 253.4 1.03
decorator 10.5 11.2 1.07
defusedxml 4.2 4.4 1.05
deprecation 4.1 4.4 1.07
docker 18.2 19.3 1.06
docutils 10.0 10.5 1.05
entrypoints 2.0 2.0 1.0
flaky 8.3 8.9 1.07
flatbuffers 1.7 1.7 1.0
floto 0.1 0.1 1.0
gast 4.6 4.8 1.04
gitdb 4.4 4.6 1.05
glfw 40.5 42.9 1.06
google-auth 45.5 47.5 1.04
google-auth-oauthlib 5.4 5.7 1.06
google-pasta 4.8 5.2 1.08
greenlet 159.6 168.0 1.05
grpcio 852.9 910.8 1.07
gunicorn 16.0 17.2 1.07
h5py 80.5 85.2 1.06
idna 6.2 6.6 1.06
imageio 18.8 20.0 1.06
imagesize 3.0 3.1 1.03
imgaug 2.4 2.5 1.04
imgviz 10.9 11.7 1.07
importlib-metadata 38.0 39.3 1.03
importlib-resources 20.6 21.3 1.03
iniconfig 1.3 1.3 1.0
ipykernel 32.2 33.9 1.05
ipyparallel 15.0 15.8 1.05
ipython 54.8 58.7 1.07
ipython-genutils 0.9 0.9 1.0
ipywidgets 36.5 39.5 1.08
isodate 2.6 2.8 1.08
itsdangerous 6.0 6.3 1.05
jedi 10.0 10.5 1.05
jmespath 5.4 5.7 1.06
joblib 28.0 30.5 1.09
jsonschema 18.4 19.8 1.08
jupyter-client 22.3 23.4 1.05
jupyter-core 13.0 13.7 1.05
jupyterlab-pygments 2.4 2.4 1.0
jupyterlab-widgets 16.1 17.0 1.06
keras 12.9 14.0 1.09
kiwisolver 75.8 78.3 1.03
labelme 24.1 26.5 1.1
libclang 7.1 7.5 1.06
majora 1.2 1.1 0.92
marshmallow 54.5 58.5 1.07
marshmallow-dataclass 18.3 19.3 1.05
marshmallow-oneofschema 5.0 5.2 1.04
marshmallow-union 1.9 2.0 1.05
matplotlib 241.3 252.6 1.05
matplotlib-inline 1.7 1.7 1.0
mistune 14.8 15.8 1.07
mlflow 18.7 19.9 1.06
msal 10.8 11.7 1.08
msal-extensions 3.4 3.6 1.06
msrest 22.3 24.3 1.09
mypy-extensions 1.7 1.8 1.06
nbclient 10.6 10.9 1.03
nbconvert 19.3 20.2 1.05
nbformat 8.1 8.5 1.05
nest-asyncio 11.6 12.1 1.04
networkx 30.0 32.5 1.08
nose 2.8 3.0 1.07
notebook 29.5 31.6 1.07
numpy 504.2 522.9 1.04
oauthlib 8.6 9.3 1.08
opencv-python 285.8 299.2 1.05
opencv-python-headless 242.7 253.2 1.04
opt-einsum 3.0 3.2 1.07
packaging 14.3 15.1 1.06
pandas 272.8 286.3 1.05
pandocfilters 2.6 2.8 1.08
parso 8.5 9.1 1.07
pep517 4.2 4.5 1.07
pexpect 4.0 4.3 1.07
pickleshare 3.3 3.5 1.06
pip 36.6 38.8 1.06
pluggy 7.0 7.3 1.04
portalocker 8.6 9.3 1.08
prometheus-client 9.4 9.9 1.05
prometheus-flask-exporter 12.1 13.0 1.07
prompt-toolkit 42.6 45.7 1.07
protobuf 315.6 335.2 1.06
psutil 198.9 212.5 1.07
ptyprocess 2.2 2.3 1.05
py 12.9 13.9 1.08
pyasn1 46.8 51.3 1.1
pyasn1-modules 40.0 43.5 1.09
pycocotools 1.0 1.0 1.0
pycparser 3.4 3.7 1.09
pymap3d 11.9 12.6 1.06
pyparsing 43.3 46.6 1.08
pyproj 123.2 128.8 1.05
pyrsistent 24.3 25.7 1.06
pytest 49.1 51.8 1.05
pytest-cov 11.9 12.6 1.06
python-dateutil 9.2 9.7 1.05
python-editor 2.4 2.5 1.04
python-json-logger 4.2 4.4 1.05
pytz 95.0 104.6 1.1
pyzmq 291.6 305.7 1.05
qtconsole 12.8 13.7 1.07
querystring-parser 1.4 1.4 1.0
requests 32.6 35.2 1.08
requests-oauthlib 6.4 6.8 1.06
rsa 11.2 12.1 1.08
s3transfer 10.2 11.0 1.08
scikit-image 105.4 109.3 1.04
scikit-learn 243.3 255.2 1.05
scipy 286.8 296.9 1.04
sentry-sdk 46.1 50.0 1.08
setuptools 208.1 222.3 1.07
six 7.0 7.5 1.07
sklearn 0.3 0.2 0.67
smmap 3.8 4.0 1.05
snowballstemmer 2.3 2.5 1.09
sphinxcontrib-applehelp 1.3 1.3 1.0
sphinxcontrib-devhelp 1.3 1.3 1.0
sphinxcontrib-htmlhelp 2.0 2.1 1.05
sphinxcontrib-jsmath 1.0 0.9 0.9
sphinxcontrib-qthelp 1.6 1.6 1.0
sphinxcontrib-serializinghtml 2.6 2.6 1.0
sqlparse 6.3 6.7 1.06
tabulate 4.5 4.8 1.07
tensorboard 21.2 22.2 1.05
tensorboard-data-server 5.1 5.2 1.02
tensorboard-plugin-wit 1.5 1.5 1.0
tensorflow 171.6 183.3 1.07
tensorflow-estimator 6.9 7.3 1.06
tensorflow-io-gcs-filesystem 40.3 40.9 1.01
termcolor 1.1 1.1 1.0
terminado 9.3 9.7 1.04
testpath 3.1 3.3 1.06
threadpoolctl 2.8 2.8 1.0
tifffile 35.4 37.3 1.05
toml 3.1 3.3 1.06
tomli 8.0 8.3 1.04
tornado 49.6 51.9 1.05
tqdm 49.4 52.6 1.06
traitlets 11.4 12.1 1.06
typeguard 15.1 15.9 1.05
typing-extensions 8.2 8.7 1.06
typing-inspect 4.0 4.2 1.05
urllib3 21.7 22.8 1.05
wcwidth 4.4 4.7 1.07
webencodings 1.1 1.2 1.09
websocket-client 17.5 18.7 1.07
wheel 19.4 20.7 1.07
widgetsnbextension 38.5 41.4 1.08
wrapt 182.5 187.8 1.03
xmltodict 4.6 4.9 1.07
zipp 12.1 12.6 1.04

Benchmark with (gzip) compression result: the JSON response is 0.97x (± 0.05) as large (i.e. 3% smaller)

Individual packages (click to expand)
Project HTML size (kB) JSON size (kB) JSON size ratio
babel 2.7 2.6 0.96
cython 98.1 97.9 1.0
flask 3.5 3.5 1.0
gitpython 6.9 6.9 1.0
jinja2 4.4 4.3 0.98
keras-preprocessing 1.3 1.3 1.0
mako 3.4 3.3 0.97
markdown 4.5 4.4 0.98
markupsafe 17.9 17.8 0.99
pillow 112.6 112.3 1.0
pyjwt 5.5 5.4 0.98
pyopengl 3.3 3.3 1.0
pyopengl-accelerate 6.9 6.8 0.99
pyqt5 6.5 6.5 1.0
pyqt5-qt5 0.4 0.4 1.0
pyqt5-sip 12.2 12.1 0.99
pywavelets 16.0 15.9 0.99
pyyaml 16.6 16.5 0.99
pygments 7.5 7.5 1.0
qtpy 3.2 3.1 0.97
sqlalchemy 86.9 86.7 1.0
send2trash 1.4 1.3 0.93
shapely 32.6 32.5 1.0
sphinx 18.9 18.8 0.99
werkzeug 6.4 6.3 0.98
absl-py 2.3 2.3 1.0
alabaster 1.7 1.6 0.94
alembic 6.1 6.0 0.98
argon2-cffi 10.4 10.3 0.99
astunparse 1.1 1.1 1.0
attrs 2.5 2.4 0.96
azure-common 3.1 3.1 1.0
azure-core 4.3 4.3 1.0
azure-cosmos 2.6 2.5 0.96
azure-identity 4.0 3.9 0.97
azure-keyvault-secrets 1.5 1.5 1.0
azure-storage-blob 4.3 4.2 0.98
backcall 0.4 0.3 0.75
bleach 4.1 4.1 1.0
boto3 97.8 97.8 1.0
botocore 128.2 128.6 1.0
build 2.0 1.9 0.95
cachetools 3.6 3.5 0.97
certifi 4.2 4.1 0.98
cffi 62.2 62.1 1.0
charset-normalizer 3.1 3.0 0.97
click 4.7 4.6 0.98
cloudpickle 3.4 3.4 1.0
colorama 4.1 4.1 1.0
coverage 114.8 114.5 1.0
cryptography 80.2 79.9 1.0
cycler 0.5 0.4 0.8
databricks-cli 3.9 3.9 1.0
debugpy 45.3 45.1 1.0
decorator 3.2 3.2 1.0
defusedxml 1.3 1.2 0.92
deprecation 1.4 1.3 0.93
docker 5.0 5.0 1.0
docutils 2.7 2.7 1.0
entrypoints 0.7 0.6 0.86
flaky 2.7 2.6 0.96
flatbuffers 0.6 0.6 1.0
floto 0.1 0.1 1.0
gast 1.4 1.4 1.0
gitdb 1.5 1.4 0.93
glfw 9.1 9.0 0.99
google-auth 10.6 10.6 1.0
google-auth-oauthlib 1.5 1.5 1.0
google-pasta 1.6 1.5 0.94
greenlet 36.2 36.1 1.0
grpcio 210.6 209.9 1.0
gunicorn 5.0 5.0 1.0
h5py 19.2 19.1 0.99
idna 2.1 2.1 1.0
imageio 5.7 5.7 1.0
imagesize 1.0 0.9 0.9
imgaug 0.9 0.9 1.0
imgviz 3.6 3.6 1.0
importlib-metadata 8.6 8.5 0.99
importlib-resources 4.7 4.6 0.98
iniconfig 0.6 0.5 0.83
ipykernel 8.8 8.8 1.0
ipyparallel 4.2 4.1 0.98
ipython 16.4 16.3 0.99
ipython-genutils 0.4 0.4 1.0
ipywidgets 10.8 10.7 0.99
isodate 1.0 0.9 0.9
itsdangerous 1.9 1.8 0.95
jedi 2.8 2.7 0.96
jmespath 1.8 1.8 1.0
joblib 9.0 9.0 1.0
jsonschema 5.6 5.5 0.98
jupyter-client 5.8 5.7 0.98
jupyter-core 3.6 3.6 1.0
jupyterlab-pygments 0.8 0.7 0.87
jupyterlab-widgets 4.2 4.1 0.98
keras 4.3 4.2 0.98
kiwisolver 15.6 15.5 0.99
labelme 8.2 8.2 1.0
libclang 1.9 1.9 1.0
majora 0.5 0.4 0.8
marshmallow 15.4 15.3 0.99
marshmallow-dataclass 4.8 4.7 0.98
marshmallow-oneofschema 1.4 1.4 1.0
marshmallow-union 0.7 0.6 0.86
matplotlib 52.0 51.8 1.0
matplotlib-inline 0.6 0.5 0.83
mistune 3.8 3.8 1.0
mlflow 5.6 5.5 0.98
msal 3.6 3.5 0.97
msal-extensions 1.1 1.1 1.0
msrest 7.1 7.0 0.99
mypy-extensions 0.7 0.6 0.86
nbclient 2.9 2.8 0.97
nbconvert 5.2 5.1 0.98
nbformat 2.4 2.4 1.0
nest-asyncio 3.0 3.0 1.0
networkx 9.6 9.6 1.0
nose 1.1 1.0 0.91
notebook 8.7 8.6 0.99
numpy 103.1 102.8 1.0
oauthlib 2.8 2.7 0.96
opencv-python 59.8 59.7 1.0
opencv-python-headless 48.0 48.0 1.0
opt-einsum 1.1 1.0 0.91
packaging 3.9 3.8 0.97
pandas 61.7 61.5 1.0
pandocfilters 1.0 0.9 0.9
parso 2.6 2.5 0.96
pep517 1.5 1.4 0.93
pexpect 1.4 1.4 1.0
pickleshare 1.2 1.1 0.92
pip 10.3 10.2 0.99
pluggy 1.9 1.9 1.0
portalocker 2.7 2.7 1.0
prometheus-client 2.7 2.6 0.96
prometheus-flask-exporter 3.4 3.3 0.97
prompt-toolkit 12.0 12.0 1.0
protobuf 75.0 74.8 1.0
psutil 52.1 51.9 1.0
ptyprocess 0.8 0.8 1.0
py 4.2 4.1 0.98
pyasn1 15.4 15.3 0.99
pyasn1-modules 12.0 12.0 1.0
pycocotools 0.5 0.4 0.8
pycparser 1.3 1.2 0.92
pymap3d 3.6 3.5 0.97
pyparsing 12.8 12.7 0.99
pyproj 27.7 27.5 0.99
pyrsistent 6.6 6.6 1.0
pytest 13.1 13.0 0.99
pytest-cov 3.3 3.3 1.0
python-dateutil 2.6 2.5 0.96
python-editor 0.9 0.8 0.89
python-json-logger 1.4 1.3 0.93
pytz 32.6 32.5 1.0
pyzmq 67.5 67.4 1.0
qtconsole 3.9 3.9 1.0
querystring-parser 0.6 0.5 0.83
requests 9.8 9.8 1.0
requests-oauthlib 1.9 1.9 1.0
rsa 3.9 3.8 0.97
s3transfer 3.1 3.1 1.0
scikit-image 21.2 21.1 1.0
scikit-learn 52.4 52.3 1.0
scipy 58.9 58.7 1.0
sentry-sdk 13.8 13.8 1.0
setuptools 58.0 57.8 1.0
six 2.3 2.3 1.0
sklearn 0.2 0.2 1.0
smmap 1.2 1.1 0.92
snowballstemmer 0.8 0.8 1.0
sphinxcontrib-applehelp 0.5 0.5 1.0
sphinxcontrib-devhelp 0.5 0.5 1.0
sphinxcontrib-htmlhelp 0.7 0.6 0.86
sphinxcontrib-jsmath 0.4 0.4 1.0
sphinxcontrib-qthelp 0.6 0.5 0.83
sphinxcontrib-serializinghtml 0.8 0.7 0.87
sqlparse 2.1 2.0 0.95
tabulate 1.6 1.5 0.94
tensorboard 5.0 4.9 0.98
tensorboard-data-server 1.2 1.2 1.0
tensorboard-plugin-wit 0.5 0.5 1.0
tensorflow 41.3 41.1 1.0
tensorflow-estimator 1.8 1.8 1.0
tensorflow-io-gcs-filesystem 6.9 6.8 0.99
termcolor 0.5 0.4 0.8
terminado 2.7 2.7 1.0
testpath 1.1 1.0 0.91
threadpoolctl 0.9 0.8 0.89
tifffile 9.8 9.7 0.99
toml 1.2 1.1 0.92
tomli 2.3 2.3 1.0
tornado 12.4 12.3 0.99
tqdm 14.0 13.9 0.99
traitlets 3.4 3.3 0.97
typeguard 4.2 4.2 1.0
typing-extensions 2.3 2.3 1.0
typing-inspect 1.3 1.2 0.92
urllib3 5.7 5.6 0.98
wcwidth 1.5 1.5 1.0
webencodings 0.5 0.5 1.0
websocket-client 5.0 4.9 0.98
wheel 5.5 5.4 0.98
widgetsnbextension 10.4 10.3 0.99
wrapt 33.0 32.9 1.0
xmltodict 1.6 1.6 1.0
zipp 3.4 3.4 1.0
2 Likes

Great to see some work on this, many thanks for the initiative!

Looking at the Project List specification, 2 questions arise:

  • Was it intentional to drop the un-normalized (real) project name from the list? This information was available in the HTML serialization.
  • Is the url field only there to be consistent with PEP-503 (1.0)? It otherwise seems redundant, because according to the spec the url can be deduced from the name.

It makes the information self-contained. Otherwise you would have to pass around the JSON and the URL to be able to construct/extract all relevant data instead of just the JSON payload.
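
For reference, the deduction mentioned above relies on PEP 503 name normalization. A minimal sketch of how a client could derive a project URL from an un-normalized name (the base URL and helper names here are illustrative, not from the PEP):

```python
import re
from urllib.parse import urljoin

def normalize(name: str) -> str:
    # PEP 503 normalization: runs of "-", "_", "." collapse to a
    # single "-", and the result is lowercased.
    return re.sub(r"[-_.]+", "-", name).lower()

def project_url(index_base: str, name: str) -> str:
    # Illustrative helper: project pages live at <base>/<normalized-name>/.
    return urljoin(index_base, normalize(name) + "/")

print(project_url("https://pypi.org/simple/", "Django_Rest.Framework"))
# https://pypi.org/simple/django-rest-framework/
```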

I went and double-checked PEP 503, and it’s unclear in this area. It states that anchor text must be the “name” of the project.

It’s been a while since I last looked closely at the project list response on /simple/, and TBH I had assumed it was the normalized name I referenced in PEP 503, though upon closer examination I see that in practice it’s actually the unnormalized name.

So no, it wasn’t actually intentional.

However, normalized name makes much more sense for the key in the JSON response, so I’m not going to remove that.

I’m also hesitant to add that key. Currently the /simple/ response on PyPI is 20M uncompressed and 3M compressed. The current PEP 691 changes that to 18M and 2.9M. Adding in a name key changes that to 27M and 4.5M [1]. It doesn’t feel worth it to me to add that unless someone feels strongly about it.

It is somewhat redundant, and I thought about removing it. I ultimately didn’t for two reasons:

  1. This makes it an easier diff between the two formats, so integrating with existing projects is simpler.
  2. I want to leave our options open for adding extra information to each project in the future. It felt odd to make the structure an empty dictionary like {"projects": {"$name": {}}}, and adding the URL there was the easiest way to resolve that.

Honestly though, I didn’t spend a ton of time thinking about the project list. It’s not really used by any installers anymore, so from an installer POV it’s largely a vestigial URL. If there are projects out there currently using it that need something like the unnormalized name, then I’m open to changes to it.

You still have to pass around the URL (just like you have to do with HTML), because URLs may be relative to the URL that you fetched the response from. HTML allows that, and PEP 691 explicitly says that relative URLs are resolved as if the response were HTML (we just don’t have a base url meta tag like HTML does).
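
That resolution rule can be sketched with the standard library; the response URL and file URLs below are made-up examples, not real endpoints:

```python
from urllib.parse import urljoin

# The URL the JSON response was fetched from acts as the base,
# exactly as it would for a relative href in HTML (absent a <base> tag).
response_url = "https://example.com/simple/demo-project/"  # hypothetical index page

# A relative URL resolves against the response URL.
print(urljoin(response_url, "../../packages/demo-1.0.tar.gz"))
# https://example.com/packages/demo-1.0.tar.gz

# An absolute URL passes through unchanged.
print(urljoin(response_url, "https://files.example.net/demo-1.0.tar.gz"))
# https://files.example.net/demo-1.0.tar.gz
```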

I think that’s a positive thing, since it allows API responses to be mirrored byte for byte, which will end up being important for TUF integration[2].


  1. Data generated using pep691.py · GitHub ↩︎

  2. Saying this now reminds me that the status quo for PEP 503 is that mirrors cannot byte for byte copy PEP 503 from PyPI for the same reason, since URLs are allowed to be absolute, and PyPI uses that to point files to a different domain, mirrors have to rewrite /simple/$project/ to point to different URLs in the filename. This is actually a whole other problem that we’ll have to resolve somehow. ↩︎

2 Likes

It’s been about a month since I posted the last update to the PR. The feedback on this PR in that time hasn’t really raised any major concerns that I think the PEP doesn’t already address, and overall, I think that any concerns folks did have, the PEP has ended up addressing. We also have two proof-of-concept PRs that I wrote that are more or less ready to land once tests are written, other than the Warehouse PR, which also needs some VCL written. There is also a draft PR for proxpi by @EpicWink that appears to be functional, and maybe even ready to land if this PEP gets accepted, and @brettcannon has indicated he could implement this for mousebender.

We’ve also got some good data from @EpicWink that suggests that it doesn’t meaningfully affect response size (5% bigger without compression, 3% smaller with), and while it’s not as big of a deal since installers don’t really use that page, this does actually make /simple/ smaller for both uncompressed and compressed.

I think the only real open questions that have come up are:

  • My question about some of the recommendations, but that’s a non-normative section so we can update it at any time, and I suspect we might want to once we have real world experience, so I think that’s fine.
  • The recent question about the unnormalized name being available. I think we can leave that out for now; we can always add that key later if we decide it’s useful enough, since adding keys is backwards compatible but removing them is not.

Given all of that, I’m going to ask @brettcannon to go ahead and pronounce on this PEP, unless someone has some concern or objection that they’ve not yet raised.

4 Likes

I object! I’ve always wanted to say that :stuck_out_tongue_winking_eye:. Here (in this post) is some general feedback I’ve gathered.
I also do still have a major issue I want to discuss (not this post), trying my best to get that finished up as soon as possible!


Abstract

However, due to limited time constraints, that effort has not gained much if any traction beyond people thinking that it would be nice to do it.

This was a bit awkward/unpleasant to read. Maybe add commas around “if any” and remove the last word “it”?


Both the terms “canonicalized name” and “normalized name” are used, would it maybe be better to choose one of the two? Could be confusing to use both.


Project Detail

This URL must respond with a JSON encoded dictionary that has two keys, name, which represents the normalized name of the project and files. The files key is a list of dictionaries, each one representing an individual file.

Shouldn’t it be “three keys”? The metadata key was not mentioned. Although the metadata field is not mandatory, I think it should at least be mentioned here.
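
For context, here is a trimmed sketch of the project detail shape PEP 691 describes, parsed in Python. The field values are invented, and only the required top-level keys (meta, name, files) plus a minimal file entry are shown:

```python
import json

# An invented example payload following the PEP 691 project detail shape.
payload = json.loads("""
{
  "meta": {"api-version": "1.0"},
  "name": "demo-project",
  "files": [
    {
      "filename": "demo_project-1.0-py3-none-any.whl",
      "url": "demo_project-1.0-py3-none-any.whl",
      "hashes": {"sha256": "0000000000000000000000000000000000000000000000000000000000000000"}
    }
  ]
}
""")

print(sorted(payload))  # ['files', 'meta', 'name']
```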


TUF Support - PEP 458

“But I believe that”

Has this now been confirmed? If so, could we replace “I believe that” with something more factual?


TUF Support - PEP 458
and
Doesn’t TUF support require having different URLs for each representation?

These two sections are largely duplicate text. In my opinion, they can either be reduced in size, or the FAQ section can be removed entirely.


Appendix 1: Survey of use cases to cover
This listing is described by the following phrase:

This is how they use the Simple + JSON APIs today:

Nitpicking a bit here :sweat_smile:, but pip lists “Full metadata (data-dist-info-metadata)” (PEP-658), although that isn’t the case right now: use data-dist-info-metadata (PEP 658) to decouple resolution from downloading by cosmicexplorer · Pull Request #11111 · pypa/pip · GitHub


I don’t fully understand how the following two quotes reconcile?

“All serializations version numbers SHOULD be kept in sync”
and
“since 1.0 will likely be the only HTML version to exist”


I feel like the points from this message have not yet been properly addressed: PEP 691: JSON-based Simple API for Python Package Indexes - #25 by layday