The way I’ve implemented this in Warehouse is that it starts with a priority list that is hard coded in Warehouse, then takes the list from the client and effectively sorts it using the priority values. Then it takes the first item.
This ends up working because when you’re using a stable sort, items with equal preference will retain their ordering, so this ends up letting clients express their priority, but within the same priority level, the server’s initial priority ends up controlling the outcome.
For compatibility reasons, Warehouse prefers text/html over +html over +json, absent any signal from the client that it prefers JSON over HTML. I would like to have the server itself prefer JSON over HTML, but I believe the chances of breakage are much higher in that situation. It might still be a good intermediate step at some point in the future if we ever decide we want to push people towards JSON more directly.
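A minimal sketch of the stable-sort selection described above. This is not Warehouse's actual code: `SERVER_PRIORITY`, `parse_accept` and `select_content_type` are names made up for illustration, and the Accept parsing is deliberately simplistic (it only understands `*/*` as a wildcard).

```python
# Server-side order of preference, most preferred first (hypothetical constant).
SERVER_PRIORITY = [
    "text/html",
    "application/vnd.pypi.simple.v1+html",
    "application/vnd.pypi.simple.v1+json",
]


def parse_accept(header):
    """Return {media_type: quality} for a simple Accept header."""
    prefs = {}
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0  # a missing q parameter means q=1
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        prefs[media_type] = quality
    return prefs


def select_content_type(accept_header):
    """Pick a content type; ties in client quality fall back to server order."""
    prefs = parse_accept(accept_header)
    wildcard = prefs.get("*/*", 0.0)
    # Start from the server's order and drop anything the client rejects.
    candidates = [ct for ct in SERVER_PRIORITY if prefs.get(ct, wildcard) > 0]
    # list.sort is stable, so equal qualities keep the server's order.
    candidates.sort(key=lambda ct: prefs.get(ct, wildcard), reverse=True)
    return candidates[0] if candidates else None
```

With this sketch, a client that lists both text/html and the JSON type without quality values gets HTML, while one that sends `text/html;q=0.1` alongside the JSON type gets JSON.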
Does this take into account any compression? Or is it the decompressed size?
That’s 132 bytes per file for HTML vs 324 bytes per file for JSON (for nose-1.3.7.tar.gz) or a 2.45 ratio.
You should be able to drop the #sha256=... from the url, that’s not required in JSON, that’s what the hashes key is for. That should save 72 bytes per file.
That takes us to 132:252, or 1.9 ratio.
The URLs are another big difference, nose-1.3.7.tar.gz vs https://files.pythonhosted.org/packages/58/a5/0dc93c3ec33f4e281849523a5a913fa1eea9a3068acfa754d44d88107a44/nose-1.3.7.tar.gz is an extra 107 bytes per file (and I think it may be a bug in the PR, I assume you want to serve the file locally for caching?).
The PEP does specify that URLs are to be interpreted as it would for HTML, which allows relative URLs to work, so the same URL should work for both.
If we remove the 107 bytes, that brings us to 132:145, or a 1.1 ratio. The remaining 13 bytes per file difference is going to largely be noise between having to specify “filename” and “hashes” as keys and not having spaces and newlines. Compression should erase most of that.
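The arithmetic above can be checked with a few lines of Python. The byte counts are the ones quoted in this thread; the constant and function names are only for illustration.

```python
# Per-file sizes quoted in the thread for nose-1.3.7.tar.gz.
HTML_BYTES = 132
JSON_BYTES = 324

# "#sha256=" followed by the 64 hex digits of a SHA-256 digest.
HASH_FRAGMENT_BYTES = len("#sha256=") + 64

# Everything in the absolute URL that precedes the bare filename.
ABSOLUTE_PREFIX = (
    "https://files.pythonhosted.org/packages/58/a5/"
    "0dc93c3ec33f4e281849523a5a913fa1eea9a3068acfa754d44d88107a44/"
)


def ratio(json_bytes, html_bytes=HTML_BYTES):
    """JSON size relative to HTML size, rounded to two decimals."""
    return round(json_bytes / html_bytes, 2)


without_fragment = JSON_BYTES - HASH_FRAGMENT_BYTES
without_prefix = without_fragment - len(ABSOLUTE_PREFIX)

print(ratio(JSON_BYTES), ratio(without_fragment), ratio(without_prefix))
```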
Yup, thanks for pointing that out. I’ll fix the response and re-run the benchmark. This is why you don’t code at 3am.
Warehouse’s preference for text/html when equal is what I’m talking about: if a client requests Accept: text/html, ...+json the server will always respond with HTML. With Warehouse’s ordering, clients must either specify quality, or both not specify text/html and specify ...+json before …+html. Not to mention content negotiation doesn’t seem to care about order.
I would prefer if the PEP said to default to assuming text/html when qualities are equal (and nonzero), and that clients should always set quality.
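For illustration, a client that always sets quality could build its header like this. `build_accept` is a hypothetical helper, not part of any existing client, and the quality values are arbitrary.

```python
def build_accept(preferences):
    """Build an Accept header from (content_type, quality) pairs."""
    return ", ".join(f"{content_type};q={quality:g}" for content_type, quality in preferences)


header = build_accept(
    [
        ("application/vnd.pypi.simple.v1+json", 1.0),
        ("application/vnd.pypi.simple.v1+html", 0.2),
        ("text/html", 0.01),
    ]
)
print(header)
```

Because every entry carries an explicit q value, the outcome no longer depends on how a server breaks ties.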
By latest version, I’m assuming that that means the latest version the server knows about?
We currently leave it up to each server to decide what to do. This was to give each implementor the most flexibility to decide what makes the most sense for them.
We pick the most compatible possible option in Warehouse because there is only one version of Warehouse, so people can’t select different behaviors by different versions. I think it’s fine for other implementations to do something different.
Was it intentional to drop the un-normalized (real) project name from the list? This information was available in the HTML serialization.
Is the url field only there to be consistent with PEP 503 (1.0)? It otherwise seems redundant, because according to the spec the url can be deduced from the name.
It makes the information self-contained. Otherwise you would have to pass around the JSON and the URL to be able to construct/extract all relevant data instead of just the JSON payload.
I went and double-checked PEP 503, and it’s unclear in this area. It states that anchor text must be the “name” of the project.
It’s been a while since I had looked closely at the project list response on /simple/, and I had assumed that it was the normalized name I referenced in PEP 503 TBH, though upon closer examination I see that it’s actually the unnormalized name in practice.
So no, it wasn’t actually intentional.
However, normalized name makes much more sense for the key in the JSON response, so I’m not going to remove that.
I’m also hesitant to add that key. Currently the /simple/ response on PyPI is 20M uncompressed and 3M compressed. The current PEP 691 changes that to 18M and 2.9M. Adding in a name key changes that to 27M and 4.5M [1]. It doesn’t feel worth it to me to add that unless someone feels strongly about it.
It is somewhat redundant, and I thought about removing it. I ultimately didn’t for two reasons:
This makes it an easier diff between the two formats, so integrating with existing projects is simpler.
I want to leave our options open for adding extra information to each project in the future. It felt odd to make the structure be an empty dictionary like {"projects": {"$name": {}}}, and adding the URL there was the easiest thing to do to resolve that.
Honestly though, I didn’t spend a ton of time thinking about the project list, it’s not really used by any installers anymore, so from an installer POV, it’s largely a vestigial URL. If there’s projects out there using it currently who need something like unnormalized name then I’m open to changes to it.
You still have to pass around the URL (just like you have to do with HTML), because the URLs are able to be relative to the URL that you fetched the response from. HTML allows that, and PEP 691 explicitly says that relative URLs are resolved as if it were HTML (we just don’t have a base url meta tag like HTML does).
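A small sketch of why the fetched URL still has to travel with the payload: relative URLs are resolved against it, the same way a browser would for HTML. The host `mirror.example` and the helper name `resolve_file_url` are made up for the example.

```python
from urllib.parse import urljoin


def resolve_file_url(response_url, file_url):
    """Resolve a file URL from a response against the URL that was fetched."""
    return urljoin(response_url, file_url)


fetched_from = "https://mirror.example/simple/nose/"

# A bare filename resolves next to the page that was fetched.
print(resolve_file_url(fetched_from, "nose-1.3.7.tar.gz"))
# A relative path can climb out of /simple/.
print(resolve_file_url(fetched_from, "../../files/nose-1.3.7.tar.gz"))
# An absolute URL is returned unchanged.
print(resolve_file_url(fetched_from, "https://files.pythonhosted.org/packages/nose-1.3.7.tar.gz"))
```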
I think that’s a positive thing, since it allows API responses to be mirrored byte for byte, which will end up being important for TUF integration[2].
[2] Saying this now reminds me that the status quo for PEP 503 is that mirrors cannot byte for byte copy PEP 503 responses from PyPI, for the same reason: URLs are allowed to be absolute, and PyPI uses that to point files to a different domain, so mirrors have to rewrite /simple/$project/ to point to different URLs for the files. This is actually a whole other problem that we’ll have to resolve somehow.
It’s been about a month since I posted the last update to the PR. The feedback on this PR in that time hasn’t really raised any major concerns that I think the PEP doesn’t already address, and overall, I think that any concerns folks did have, the PEP has ended up addressing. We also have two proof of concept PRs that I wrote that are more or less ready to land after writing tests for them, other than the Warehouse PR which also needs some VCL written. There is also a draft PR for proxpi by @EpicWink that appears to be functional, and maybe even ready to land if this PEP gets accepted, and @brettcannon has indicated he could implement this for mousebender.
We’ve also got some good data from @EpicWink that suggests that it doesn’t meaningfully affect response size (5% bigger without compression, 3% smaller with), and while it’s not as big of a deal since installers don’t really use that page, this does actually make /simple/ smaller for both uncompressed and compressed.
I think the only real open questions that have come up are:
My question about some of the recommendations, but that’s a non-normative section so we can update it at any time, and I suspect we might want to once we have real world experience, so I think that’s fine.
The recent question about the unnormalized name being available. I think we can leave that out for now, we can always add in that key later if we decide it’s useful enough since adding keys is backwards compatible but removing them is not.
Given all of that, I’m going to ask @brettcannon to go ahead and pronounce on this PEP, unless someone has some concern or objection that they’ve not yet raised.
I object! I’ve always wanted to say that. Hereby (this post) some general feedback I’ve gathered.
I also do still have a major issue I want to discuss (not this post), trying my best to get that finished up as soon as possible!
This URL must respond with a JSON encoded dictionary that has two keys, name, which represents the normalized name of the project and files. The files key is a list of dictionaries, each one representing an individual file.
Shouldn’t it be “three keys”? The metadata key was not mentioned. Although the metadata field is not mandatory, I think it should at least be mentioned here.
I agree with @domdfcoding that the to-be-deprecated cgi should probably not be used in the code example. Examples will get copied, and will be used. It might be unfortunate that the alternative is more verbose, but if that is the reality of the situation, so be it…
According to the RFC “If no Accept header field is present, then it is assumed that the client accepts all media types.”. Meaning no accept header is present is equal to Accept: */*. So a server must never return a 406 when presented with a missing Accept header. Agree?
Yea, the original PEP didn’t have the meta key, and I just forgot to update two to three in that spot. Fixed in the PEP.
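To make the three top-level keys concrete, here is a sketch of a project detail response built in Python. The values are placeholders (the hash is of dummy bytes, not of a real nose release), and the relative `url` is an illustrative choice.

```python
import hashlib
import json

# Dummy content standing in for a real distribution file.
payload = b"not a real sdist"

detail = {
    "meta": {"api-version": "1.0"},
    "name": "nose",  # normalized project name
    "files": [
        {
            "filename": "nose-1.3.7.tar.gz",
            "url": "nose-1.3.7.tar.gz",
            "hashes": {"sha256": hashlib.sha256(payload).hexdigest()},
        }
    ],
}

body = json.dumps(detail, sort_keys=True)
print(sorted(json.loads(body)))
```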
Yes, updated the PEP.
Dropped the FAQ.
Slight reword to mention that it’s how they use today, or plan to in the near future.
Since it was talking about Content-Types, I meant the content type for version 1, if we make a v2 content type we’re unlikely to ever produce that for HTML.
I’ve updated the PEP to be clearer that versions should be kept in sync across serializations, within a major version, but across major versions do not have that same recommendation. I’ve also clarified that 1.x will likely be the only version of HTML to exist, instead of 1.0.
I’m curious what points you think haven’t been addressed? I see 4 points in that post:
Clarification of whether requesting v1 means 1.x or 1.0, which the PEP states:
Since only major versions should be disruptive to clients attempting to
understand one of these API responses, only the major version will be included
in the content type
What constitutes a backwards compatible change, which the PEP gives rough guidelines under the “Versioning” section, but explicitly calls out that it is intentionally vague because it is hard to fully express the full set of changes that may or may not be compatible. Future PEPs can decide whether it’s a Major or Minor version bump, and can justify that on their own merits.
Being explicit with the latest version, which the PEP already incorporated that suggestion.
The recommendation not to add the +html content-type, and only rely on text/html. I don’t agree with “just” sticking with text/html, so I purposely kept the new content type for HTML. I’ve updated the PEP with an explicit FAQ about it.
If there’s something else you didn’t think was addressed I’m not seeing it, 3/4 of that post directly resulted in updates to the PEP, and the remaining one I disagreed with, but I’ve added a FAQ section for it now.
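On the major versus minor point above, one possible client-side policy can be sketched as follows. `check_api_version` and its return values are hypothetical, and the policy itself (fail on a newer major, tolerate a newer minor) is an assumption about how a client might choose to behave, not something this thread prescribes.

```python
def check_api_version(api_version, supported_major=1, supported_minor=0):
    """Classify a response's api-version string, e.g. "1.0"."""
    major, minor = (int(part) for part in api_version.split("."))
    if major > supported_major:
        # A newer major version may be disruptive, so stop here.
        return "unsupported"
    if major == supported_major and minor > supported_minor:
        # A newer minor version should still be understandable.
        return "newer-minor"
    return "ok"


print(check_api_version("1.0"), check_api_version("1.3"), check_api_version("2.0"))
```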
I’ve updated the example, I think it makes it minorly less clear, but it’s not a big deal either way. I’m less worried about the verbosity and more worried that parsing a header isn’t an interesting part of the client request flow, so dedicating more lines of code than needed to it just adds extra noise that makes it harder to understand what’s going on.
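For reference, one way to parse a Content-Type header without the cgi module is the standard library's `email.message.Message`. This is only a sketch of that approach, not necessarily the exact code now in the PEP, and `parse_content_type` is a name invented here.

```python
from email.message import Message


def parse_content_type(value):
    """Split a Content-Type header into (media_type, params)."""
    msg = Message()
    msg["Content-Type"] = value
    # get_params() returns the media type first, then the parameters.
    params = dict(msg.get_params()[1:])
    return msg.get_content_type(), params


print(parse_content_type("application/vnd.pypi.simple.v1+json; charset=utf-8"))
```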
A missing Accept header is functionally equivalent to Accept: */* yes, so a server should not respond with a 406.
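A rough sketch of that behaviour on the server side, assuming a plain dict of request headers and ignoring quality values entirely. `SUPPORTED` and `negotiate` are illustrative names, and the order of `SUPPORTED` is arbitrary.

```python
# Content types this hypothetical server can produce, in its preferred order.
SUPPORTED = (
    "application/vnd.pypi.simple.v1+json",
    "text/html",
)


def negotiate(headers):
    """Return (status, content_type) for a request's headers."""
    accept = headers.get("Accept")
    if accept is None:
        # No Accept header means the client accepts everything.
        accept = "*/*"
    requested = {part.split(";")[0].strip().lower() for part in accept.split(",")}
    for content_type in SUPPORTED:
        if content_type in requested or "*/*" in requested:
            return 200, content_type
    # Only a real mismatch ends up here, never a missing header.
    return 406, None
```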
@wkoorn (an aside, you have a rather confusing username, especially for this forum — would you consider changing it?)
The grammar / purely readability points can be directly proposed as a PR to the text on the PEPs repo, and one of the editors will review. I would then edit your comments above to focus on the substantive challenges/questions to the text. Edit: Donald posted a response seconds before I posted this! Comment is rendered moot.
I agree that it can distract a bit from the actual topic at hand. I’m open to alternatives (if there are any), as long as that doesn’t include promoting deprecated modules.
If the server does not support any of the content types in the Accept header or if the client did not provide an Accept header at all, then they are able to choose between 3 different options for how to respond:
It now treats a missing Accept header (== Accept: */*) the same as an Accept mismatch. And this would be wrong for option b:
b. Return a HTTP 406 Not Acceptable response to indicate that none of the requested content types were available, and the server was unable or unwilling to select a default content type to respond with.
(full disclosure: I am a colleague of Wouter, though I post this independently)
One of the biggest issues that I see with the PEP is that it claims to represent a sufficiently small change to the underlying data-model that it does not warrant a version increment. I fully support the notion of making the minimal change from which later improvements can be built-out, but I don’t see sufficient justification of why the new API shouldn’t just be called v2 (i.e. application/vnd.pypi.simple.v2+json) if any (breaking) changes are introduced.
As a case in point, the project “list” being converted to a dictionary fundamentally changes the underlying data-model. If I wish to have a type which represents v1 data, should I choose a (sorted) list of projects, or an (un-ordered, as per JSON spec) dictionary of them, keyed by the normalized project name? My personal preference would be towards preserving the non-normalized name and order (since it is easy to normalize, and to construct a dictionary if I want one). I could also imagine the order playing a more important role in the future: for example, I believe it would be easy to add pagination and ordering (by last update) to the project list in a future PEP.
To be concrete about this, I propose that the data-model be explicitly stated in the PEP, as I believe this will help to show breaking changes to the data model more clearly and make it easy to know what is serialization implementation detail (esp. in the case of HTML). I put forward an example of a SimpleIndexFile type, even if in practice the API wasn’t incremented when new features were added:
```python
import dataclasses
import typing

import packaging.specifiers  # third-party "packaging" distribution


@dataclasses.dataclass
class SimpleIndexFile_Version1p0:
    url: str
    gpg_sig: typing.Optional[bool]
    requires_python: typing.Optional[packaging.specifiers.SpecifierSet]


@dataclasses.dataclass
class SimpleIndexFile_Version1p1(SimpleIndexFile_Version1p0):
    yanked: typing.Optional[str]  # PEP 592


@dataclasses.dataclass
class SimpleIndexFile_Version1p2(SimpleIndexFile_Version1p1):
    dist_info_metadata: typing.Optional[str]  # PEP 658


@dataclasses.dataclass
class SimpleIndexFile_Version2p0:
    filename: str
    url: str
    hashes: typing.Dict[str, str]
    requires_python: typing.Optional[packaging.specifiers.SpecifierSet]
    ...
```
(note: this is a bit simplified, since it doesn’t deal with the nested type definitions which would be necessary to document the datamodel properly)
For the same reason of data model breakage, the “latest” concept, which goes on to be discouraged (at least, this is how I read “It is recommended however, …”), seems like an unnecessary complication. If you know which metadata you are interested in using from a client implementation perspective, you already know which versions you support and so don’t need the “latest” concept at all. Since the concept of “latest” is entirely optional and client/request-side (the server can respond with whatever it likes), the latest concept is something that can be added later on if necessary, I believe.
To summarise, the list of proposals that I would be interested to have feedback on:
Document the datamodel in the PEP (either as Python types, or as a JSON schema)
“Project list” becomes a (sorted) list again (if you remove the URL to be part of the metadata definition, then this can represent a 40% reduction in compressed size compared to today; see pep691.py · GitHub)
The JSON response is called v1.1 if new concepts are introduced but no old ones removed, or v2 if breaking concepts are introduced to the data-model (as per the dataclass definition).
The unnormalized name is included (either as the “name” concept, or in some new key) in the project list if the JSON response continues to be called 1.x. For 2.0 it is totally reasonable to remove it from the project list/dictionary (the non-normalized name is actually more useful in the project detail page, but adding this is a proposal that can be easily made after the PEP, since it would be additive)
Consider whether dropping the “latest” concept from PEP is reasonable (and whether it indeed can be later proposed in a subsequent PEP if necessary)
To be fair, the data-model is already fairly explicit in both PEP 691 and (to a much lesser extent) PEP 503. The problem with the way it is structured in the PEPs though is that it is harder to see breaking data-model changes when they are written as bulleted prose (code is easier to comprehend in this regard, though I accept that this is subjective).
Speaking from experience, the only change in the data model in terms of how you may represent it as a Python class is you can have multiple hashes compared to under the HTML representation only having one. That’s pretty minor and had I thought things through I probably would have been more flexible in how it was represented in mousebender.
But another way to look at it is it is already a major version change: from version html to version json, both starting at a minor version of 1. In the end I don’t think it really matters since this PEP asks you to explicitly opt into the JSON format, so there isn’t really any confusion on the consumer side of what you’re getting.
But I will also fully admit I am known for not liking SemVer, so I have a bias to begin with.
FYI I plan to give folks up to a week to file any feedback/objections, until this Friday, June 17, at which point I will consider the PEP ready for me to review (when I have time).
I don’t see the PEP as requiring serializing the information exactly the same between different serialization formats. That obviously can’t happen because not every format supports the same data types or the same constraints, and I think things will generally be more positive of a user experience if each serialization format is free to serialize the same data in whatever form makes the most sense for that serialization format.
The question to me is whether the two serialization formats are serializing the same data or not, not whether the line format takes the exact same shape or not. In other words, it’s not intended that you can swap between serialization formats by just blindly swapping between html.parse, json.loads(), or a hypothetical other serialization format. In some cases, you may be able to do that because two serialization formats are similar enough, but that’s hardly a requirement of this PEP.
When I look at what is being serialized between the HTML serialization format and the JSON serialization format, the data is the same data being serialized, the only difference is HTML and JSON are representing that data in whatever way makes the most sense for their respective formats.
The only real differences in terms of the data that is modeled after PEP 691 are:
There is allowed to be multiple hashes
The PEP allows more featureful serialization formats to have data that doesn’t exist in the less featureful serializations. Hashes exist in both, the JSON just supports more than one.
We mandate normalized name in the /simple/ index.
I consider PEP 503 ambiguous here. It says “name”, which could be either the unnormalized or the normalized name. I went and looked at what is being done in practice: as currently implemented, Warehouse (the main implementation behind PyPI) displays the unnormalized name, but our fallback mirror uses the normalized name, so from PyPI you could get either currently. Other implementations seem to be using the normalized name.
So we’re adding an extra feature to the JSON serialization, and we’re making an ambiguous statement in PEP 503 less ambiguous by specifying which of two options you should pick. Neither of which is a major change to the underlying data model IMO.
The question of pagination or similar I think is a red herring. Outputting an unordered collection doesn’t mean that the input to that response has to be unordered. There’s nothing stopping us from paginating a dict response, and just saying that the response is paginated by some key.
And while it is easy to go from a sorted list, to a dict, it’s just as easy to go from a dict to a sorted list, so that’s not really a useful concern.
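Both directions are a one-liner in Python. The sample data below is invented, and the `{"projects": {name: {"url": ...}}}` shape follows the structure discussed earlier in this thread.

```python
projects = {
    "projects": {
        "nose": {"url": "/simple/nose/"},
        "attrs": {"url": "/simple/attrs/"},
    }
}

# Dict to sorted list of normalized names.
as_sorted_list = sorted(projects["projects"])

# Sorted list back to a dict keyed by name.
back_to_dict = {name: projects["projects"][name] for name in as_sorted_list}

print(as_sorted_list)
```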
I’m not married to the latest version. I added it because I thought it would be useful for people who want to specify that they want a specific serialization format, but they don’t care about the specific version they get. This would be most useful for people who are just manually exploring the API.
I’m not sure that I see a ton of value here, but it can be added if people want it. Like I said though, I don’t think the data model has to match the serialization on the wire, it’s more that the same data is being serialized.
PEP 503 does not make any claims about the ordering of items on either response, and implementors are free to put them in any order they want. So while they’re technically sorted, by nature of the fact HTML requires them to be in some order, their order has no meaning and can change at any time, including on every page load.
On PyPI they are ordered by normalized name because it’s convenient to have the page have a deterministic output since it makes debugging the CDN easier, and normalized name is just the field we happened to pick.
That’s an implementation detail of PyPI though, the underlying data is best thought of currently as a set. I chose to make it a dict in JSON because I felt that was a more natural way to express that data in JSON.
I’m torn on removing the url completely. PEP 503 does explicitly say the url is a required part of the content, and it would be required for the historical purpose of that API… but that historical purpose is kind of not really useful anymore, so the page overall is a bit of a vestigial appendage on the API, something we preserved mostly to keep backwards compatibility with clients using pre PEP 503 normalized URLs on servers that can’t reliably redirect to the normalized name.
I’d point out that removing the url is where all your savings come from in that case, putting the name in is always an increase in the response size. For instance, if I have a structure that uses a dict mapping normalized name to an empty dict (so the same as PEP 691 is now, but dropping url) I get even smaller than the list with no url (9M Uncompressed / 1.8M Compressed vs 7.2M Uncompressed / 1.7M Compressed).
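Anyone wanting to reproduce this kind of comparison can do so with a few lines. The sketch below uses synthetic project names rather than PyPI's real list, so its numbers will not match the ones quoted here, and the three shapes are my guess at the structures being compared.

```python
import gzip
import json

# Synthetic stand-ins for real project names.
names = [f"project-{i}" for i in range(1000)]


def sizes(obj):
    """Return (uncompressed, gzip-compressed) byte counts of compact JSON."""
    raw = json.dumps(obj, separators=(",", ":")).encode("utf-8")
    return len(raw), len(gzip.compress(raw))


dict_with_url = {"projects": {n: {"url": f"/simple/{n}/"} for n in names}}
dict_without_url = {"projects": {n: {} for n in names}}
list_with_name = {"projects": [{"name": n} for n in names]}

for label, obj in [
    ("dict with url", dict_with_url),
    ("dict without url", dict_without_url),
    ("list with name", list_with_name),
]:
    print(label, *sizes(obj))
```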
There’s really three questions here:
Should we represent the projects on the index page, in JSON, as a list or a map.
Should we include information on the non-normalized name, the normalized name, or both.
An important thing to note here is that, given PEP 503 never mandated non-normalized names, they might not even be available for some implementations of PEP 503. I am aware of at least one implementation (internal to a company) where that is the case, so I don’t think it’s even possible to mandate non-normalized names. So we could either leave it ambiguous or mandate normalized names, since you can always go from an unknown type of name to a normalized one.
Should we include the URL.
Each of those questions impacts the total response size in some way, I personally still feel comfortable with the decisions made in the PEP regarding those three questions (map, normalized, yes). I’m struggling to think of a use case for the API as it exists in PEP 503 that those decisions don’t cover.
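On the point above that any name can be turned into a normalized one, this is the normalization rule from PEP 503, shown with an invented example name.

```python
import re


def normalize(name):
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


print(normalize("Django_REST.framework"))
```

Normalizing is idempotent, so a server that only has normalized names and one that has the original spelling end up with the same key.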
I don’t think v2 is appropriate here, as the underlying data model is fully backwards compatible. You could argue for v1.1, but I don’t think it’s a useful distinction, nothing new is being added to HTML, just JSON is being added.
As said above, PEP 503 doesn’t specify what kind of name should be used on the project index, and in practice both types of names are in use, so from my POV, the non-normalized name was never guaranteed in PEP 503.
I don’t feel strongly about it either way. It can definitely be added later since it’s just another content type, if folks don’t think it’s useful it’s easy enough to strike it.
Ironically enough, the only thing that caused pip any problems when I implemented that, was yanked, and that’s because pip’s internal representation matches HTML exactly, yanked=None means it’s not yanked, yanked=Any means it’s yanked, and yanked=str means it’s yanked with a reason. That didn’t require changing their data model though, just adding some extra deserialization logic after the json.loads().
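That extra deserialization step could look roughly like the following. It is a sketch, not pip's actual code: in the JSON the `yanked` key is either a boolean or a non-empty reason string, and I am assuming an HTML-style target of `None` for not yanked, an empty string for yanked without a reason, and the reason string otherwise. `yanked_reason_from_json` is a made-up name.

```python
from typing import Any, Dict, Optional


def yanked_reason_from_json(file_entry: Dict[str, Any]) -> Optional[str]:
    """Map the JSON "yanked" value onto an Optional[str] reason."""
    value = file_entry.get("yanked", False)
    if not value:
        return None  # missing or false: not yanked
    if isinstance(value, str):
        return value  # yanked with a reason
    return ""  # true: yanked, no reason given


print(yanked_reason_from_json({"yanked": "broken metadata"}))
```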
Having hashes not attached to the URL will likely require some minor Python class changes.