PEP 819: JSON Package Metadata

I seem to be mistaken! I suppose we could! I expected this to be met with too much push-back about breaking the PEP 566 way of things originally but it seems that may not be the case.

Awesome! I originally wrote PEP 819 to be conservative and minimally breaking, but I think this would be much more valuable. I presume we’d want to do the same for the WHEEL file as well.


I think there definitely is an appetite for this now. We’ve gotten a lot more experience working with and extending the email format and learned a lot about the limitations of the format from that. Also people seem broadly in favor of JSON metadata, in this thread at least!

I think I originally was going to mention PEP 426, and say that it was withdrawn and superseded by PEP 566, but decided to focus on PEP 566 since the PEP was meant to be a minimal change on the current standards. I do think PEP 426 provides useful context, so I’ll make sure to mention it in my next PEP draft.


The email-but-not-really-email format is terrible. I’ve had to implement it in pyproject-metadata and packaging. It’s so bad that I’d take anything in JSON as an improvement. It doesn’t properly define what to do with multiline strings other than the description, special characters are a mess (did you know some people put form feeds in licenses?), Unicode handling is non-standard, and it even collides with bugs in the email module of CPython <3.12.4. The email module is itself full of TODOs and warnings. I also don’t like that the keys are case-insensitive and there’s no clear “correct” capitalization. Nor is there a clear “correct” indentation.
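To make the case-insensitivity point concrete, here is a minimal sketch using only the standard library `email` module. The METADATA content is made up for illustration, and this is not how any particular tool parses metadata.

```python
from email import message_from_string

# A tiny, made-up METADATA document in the email-style format.
RAW = """\
Metadata-Version: 2.1
Name: example-pkg
Version: 1.0
classifier: Programming Language :: Python :: 3
Classifier: Typing :: Typed
"""

msg = message_from_string(RAW)

# Header lookup ignores case, so no spelling of a key is the "correct" one.
same_value = msg["name"] == msg["NAME"] == "example-pkg"

# Plain indexing returns only the first matching header...
first_classifier = msg["Classifier"]

# ...so multi-use fields need get_all(), which also matches case-insensitively.
classifiers = msg.get_all("Classifier")
```

Both differently-cased `classifier` headers end up in `classifiers`, which is the kind of implicit behavior a JSON format with exact key names would avoid.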

I think we could change build --metadata pretty easily. packaging.metadata, from @brettcannon, currently just stores metadata according to PEP 566. It (as of packaging 26.0rc1) can write out to email format, but it cannot write to JSON format. It originally could, following the rules in PEP 566, but that was removed precisely so that something like this could be done in the future. pyproject-metadata can produce JSON following the PEP 566 rules, FWIW; though since there’s no way to use that JSON currently, I think it could be modified to match whatever is decided on without much fuss. I’m personally supportive of whatever is decided on here.

There are already a lot of special cases when converting metadata; personally, my preference would be to have the most accurate representation of data in both JSON (and internal) form, then have rules to convert to the “classic” PEP 566; that would eventually go away in theory, and we’d be left with proper datatypes and hopefully fewer special cases in the long run. But I’d be happy with anything that moves toward getting rid of the email module when working with metadata!

I agree! I think all data types should be natural, or it should follow PEP 566; doing it partway seems like the worst of both worlds (but still better than the status quo!).

Very happy to see this moving along!


I thought it also used different field names (pluralised) in the RawMetadata class? I’d be inclined to suggest that those should be unified with any new JSON metadata naming.

For being a nearly 45-year-old standard, email is incredibly complex! Despite Zawinski’s Law, even email apps barely get it right. :laughing:


The packaging.metadata data model uses pluralized names but that is an implementation detail, not exposed to any serialized format. packaging/src/packaging/metadata.py at aad7845069d2b1651860fe42fd3bcc25ccace1a8 · pypa/packaging · GitHub

As Henry said, packaging has no JSON serialization of core metadata.

However, I agree with your previous suggestion that we use a similar data model and standardize it.


I really agree that describing everything in a standard data format and then describing how to translate that into JSON is the way to go here.

I’m going to say “I” here to take any blame, but others like Donald helped bring packaging.metadata to life, so I also don’t want to deny credit to anyone.

I viewed RawMetadata as a compromise for those that wanted help parsing the email metadata but didn’t want enriched objects from ‘packaging’. To me, RawMetadata is to support Metadata; the fact that you could pass RawMetadata to json.dumps is happenstance. The key API is Metadata which doesn’t have serialization support, and so matching PEP 566 wasn’t a goal nor seemed necessary.

As for why those specific differences, the project_urls type change just made sense to me; if you read that field you are going to have to parse it, and parsing is already lazy. As for the plurals, I find it’s a type hint to know when something is going to be a container.


Is there an opportunity in this to preserve some of the information which is present in pyproject.toml but then can’t be accurately reconstructed from wheel metadata (optional-dependencies) or is difficult to rebuild (authors/maintainers)?

I think there might be, but it can’t be done while staying fully in line with the current metadata, since doing so would preserve the same difficulties.

In general, I’d rather parse the pyproject.toml format for data, including authors, maintainers, etc. I would expect that to be a popular preference. Is there a way to get that with some new fields? I’m not seeing it discussed, but I hope that’s just a matter of some (laudable!) cautiousness about making too many changes at once.

I think this is out of scope for this PEP. I’d like to keep very focused on encoding what we have already rather than also extending that.

Yes, I’d like to make this PEP as focused as reasonably possible. With the current discussion it seems that direction is “define an abstract data model for existing metadata files, then define serializations for email and JSON”.
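As a rough illustration of “one abstract data model, two serializations”, here is a minimal sketch. The `ProjectMetadata` class, its field names, and the JSON key names are all hypothetical and heavily reduced; nothing here is taken from the PEP text.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ProjectMetadata:
    """Hypothetical, heavily reduced data model; field names are assumptions."""

    name: str
    version: str
    keywords: list[str] = field(default_factory=list)
    project_urls: dict[str, str] = field(default_factory=dict)


def to_email(meta: ProjectMetadata) -> str:
    """Render the model as email-style headers (simplified, no escaping)."""
    lines = [f"Name: {meta.name}", f"Version: {meta.version}"]
    if meta.keywords:
        lines.append("Keywords: " + ",".join(meta.keywords))
    # The mapping is flattened into repeated "Label, URL" headers.
    for label, url in meta.project_urls.items():
        lines.append(f"Project-URL: {label}, {url}")
    return "\n".join(lines) + "\n"


def to_json(meta: ProjectMetadata) -> str:
    """Render the same model with native JSON arrays and objects."""
    return json.dumps(asdict(meta), indent=2)


meta = ProjectMetadata(
    name="example-pkg",
    version="1.0",
    keywords=["packaging", "json"],
    project_urls={"Homepage": "https://example.com"},
)
```

The point of the split is that the model holds lists and mappings directly, and only the email serializer has to know about comma-joined strings and repeated headers.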

I’d add that pyproject.toml and core metadata are rather different. One is an input to build frontends, the other an input for installers. pyproject.toml can have a lot of dynamic keys and not contain a lot of the data that ends up in core metadata.


Yeah, that’s why I thought defining a model slightly less like the email format would fit in. But I admit to being unclear on how much freedom we have here.

For example, could authors be defined as an array of name, email objects? A dynamically sourced backend – maybe one which uses git commit metadata – might not have the data in that format.
So what is the rule? It needs to map exactly to the core metadata fields? That’s maximally safe but it does feel like a shame to be stuck with that constraint.
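For a sense of what an “array of name, email objects” could look like, here is a minimal sketch that splits an `Author-email`-style value with the standard library. The helper name and the output key names are assumptions, and real-world values (names containing commas, entries without an address) would need more care than this.

```python
from email.utils import getaddresses


def split_author_email(value: str) -> list[dict[str, str]]:
    """Split an Author-email style header value into name/email objects."""
    return [
        {"name": name, "email": addr}
        for name, addr in getaddresses([value])
    ]


authors = split_author_email(
    "Ada Lovelace <ada@example.com>, Grace Hopper <grace@example.com>"
)
```

A backend that only has free-form author strings could not always fill both keys, which is exactly the tension raised above.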


That example certainly feels like it should be in scope to me.


One minor point here. With the release of TOML 1.1, we’re just having a debate over which TOML version our standards require. While JSON appears to be rock-solid stable at this point, it’s probably worth explicitly stating in this PEP what version of JSON it refers to.

It would be a shame if some standards committee produced a new version of JSON that (for example) included comment support, and we had to patch up our standards because we never expected that to happen[1].


  1. “Our weapons are fear, surprise, and a relentless devotion to evolving standards!” :slightly_smiling_face:


Just noting that people have mostly stuck to the lowest common denominator, but things like comments or trailing commas being forbidden are a consistent source of friction, which efforts such as JSON5 try to fix. Trailing commas are allowed in JavaScript since ES2009, and even the JSON standard (ECMA-404) itself allows them since its initial edition in 2013. But compare the railroad diagrams for object/array between JSON.org & ECMA-404:2013 – a visual reminder that the lowest common denominator cannot assume these improvements yet.

In other words, JSON has the same kind of concerns that we have with TOML, just several orders of magnitude larger. People have (mostly?[1]) resigned themselves to not care or talk about it.


  1. Definite bias warning on this: this is strictly my impression based on my not-very-web-related corner of the ecosystem. Googling reveals many people that are very passionate about this.


RFC 8259 is probably an acceptable reference/anchor point for a stable definition of JSON. I believe that standard is entirely compatible with ECMA-404 as well, so that could also serve as the reference point.

(To my read, neither the RFC nor the ECMA railroad grammar permit trailing commas. ECMAScript does, but that’s because it’s in the context of a larger language grammar rather than data exchange.)
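As a quick sanity check of that reading, here is a minimal sketch showing how the standard library `json` module treats trailing commas and comments. The helper name is made up, and this only shows what one parser does, not what the RFC text says.

```python
import json


def parses_with_stdlib(text: str) -> bool:
    """Return True if the stdlib json module accepts the text."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


plain_ok = parses_with_stdlib('{"name": "example"}')
trailing_comma_in_object = parses_with_stdlib('{"name": "example",}')
trailing_comma_in_array = parses_with_stdlib("[1, 2,]")
comment = parses_with_stdlib('{"name": "example"} // built by hand')

# Caveat: by default the stdlib also accepts NaN and Infinity literals,
# which RFC 8259 does not allow, so "stdlib accepts it" is not the same
# thing as "valid per the RFC".
nan_accepted = parses_with_stdlib("[NaN]")
```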


+1, on reading the BNF I agree.

I think trailing commas and comments are features used largely by human authors. Given package metadata is almost entirely machine written, I think these aren’t really of concern?

I believe the json module uses RFC 7159 as its reference. RFC 8259, which William referenced, is, as I understand it, RFC 7159 with more errata resolved.

Perhaps it would work to specify the format as “the standard used by the json module in the oldest supported version of CPython”?


That would be harder for tools written in other languages, such as uv, to track.


I just read these and confirmed that the errata are all copyedits and don’t impact the technical elements of the spec.

I think we can just say RFC 8259 and be done with it?

And we can advance a documentation update for the json module, which I think is worth doing anyway.


I’m not sure I see how, given it would refer to an RFC that is at least 5 years old, if not older. However:

Based on this I think selecting RFC 8259 is reasonable for this PEP.


Sorry, if the PEP refers to the standard, you’re right. I thought you meant that the PEP should say “parse JSON the way the Python stdlib module does”.

My misunderstanding.


You can blame me too :wink:

The discussion is still online, so if folks want to see the whole thing it’s mostly here, but a key idea behind it was that Metadata and RawMetadata weren’t just de-serialized METADATA; they were data structures, independent of any serialization format, that represented the metadata in its “logical” form.

The Project-URL key is where I think this is most obvious. I don’t think anyone would reasonably argue that the “logical” form of Project-URL is a list of strings like Foo, https://example.com, but rather that was a mechanism used to serialize a key → value into the METADATA format. So having Metadata and RawMetadata not constrained by the serialization format, makes the API better/easier to use, and means new serialization formats can be written to feel “native”.
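A minimal sketch of that key → value view: the function below (a made-up helper, not the packaging API) turns the serialized `Label, URL` strings into a mapping.

```python
def parse_project_urls(entries: list[str]) -> dict[str, str]:
    """Turn serialized "Label, URL" strings into a label -> URL mapping."""
    urls: dict[str, str] = {}
    for entry in entries:
        # Split on the first comma only, so commas inside the URL survive.
        label, _, url = entry.partition(",")
        urls[label.strip()] = url.strip()
    return urls


project_urls = parse_project_urls(
    [
        "Foo, https://example.com",
        "Docs, https://example.com/docs?a=1,2",
    ]
)
```

In a JSON serialization the mapping could be written out directly as an object, with no string splitting on the reading side.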


Some other thoughts.

The PEP currently documents conversion for METADATA and WHEEL files. I would recommend dropping that and documenting the actual format being used (or rely on the JSON schema). Trying to convert existing METADATA files into a METADATA.json file cannot reliably be done (but you can get a reasonable approximation of it in the common case), and documenting things as a “conversion” suggests that is something that is reasonable to do.

Specifically, I think the goal should be that tools should be natively producing a METADATA.json when the artifact is built, not trying to convert METADATA to METADATA.json.

Of course tooling can provide a conversion process, but since that process is, by its nature, somewhat heuristic-based, I don’t think we should try to standardize it.

If we do keep the conversion section, the PEP currently states

Conversion from the current email format for core metadata to JSON should follow the process described in PEP 566, with the following modification: the Project-URL entries should be converted into an object with keys containing the labels and values containing the URLs from the original email value.

However, PEP 819 says to split keywords on commas and PEP 566 says to split them on whitespace.
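The two rules give different results as soon as a keyword contains a space. A minimal sketch with a made-up `Keywords` value:

```python
# A made-up Keywords value with one multi-word keyword.
raw = "packaging, core metadata, json"

# Comma rule: split on commas, then trim surrounding whitespace.
by_comma = [keyword.strip() for keyword in raw.split(",")]

# Whitespace rule: split on runs of whitespace, commas stay attached.
by_whitespace = raw.split()
```

Here `by_comma` keeps `core metadata` together, while `by_whitespace` breaks it in two and leaves stray commas on the pieces.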

I would also recommend (as a few others suggested) not feeling constrained by PEP 566. PEP 566 doesn’t specify a METADATA.json; it tells you how you might store the data that is contained in METADATA inside of a data structure that doesn’t allow repeated keys.

I said it in the packaging.metadata thread, but I still agree with it: “Which I think would be a shame, when defining how to serialize our metadata into a new serialization format, we should strive to produce an output that feels like it belongs in that serialization format. I think it’s about as lame to define a JSON standard that feels like it’s just email headers, but with different syntax, as it is to write Python, but like you’re a Java programmer.”

I also think that you’re missing some security considerations, particularly around duplicate keys, since it’s pretty well known that different parsers treat duplicate keys differently (some are first-wins, some are last-wins) and you can end up with some kind of confused deputy attack. See An Exploration & Remediation of JSON Interoperability… | Bishop Fox for more.
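To show what this looks like in practice, here is a minimal sketch with the standard library `json` module: by default the last duplicate silently wins, and an `object_pairs_hook` can be used to reject duplicates instead. The hook function is a hypothetical helper, not something the PEP specifies.

```python
import json

# The same key appears twice with different values.
DOC = '{"name": "good", "name": "evil"}'

# Default behaviour in the stdlib: the last occurrence silently wins.
last_wins = json.loads(DOC)


def reject_duplicates(pairs):
    """object_pairs_hook that raises instead of silently picking a winner."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen


def load_strict(text: str):
    """Parse JSON, refusing objects that repeat a key at any nesting level."""
    return json.loads(text, object_pairs_hook=reject_duplicates)
```

A parser in another language might keep the first value instead, so two tools can disagree about the same file unless duplicates are rejected outright.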

Another thing that came up when designing packaging.metadata is when de-serialization encountered unknown fields. That library provides guard rails for that, but the PEP should probably at least mention it as a thing to think about, if not flat out give a SHOULD on what to do with them.
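One possible shape for such a guard rail, as a minimal sketch: the field set below is a hypothetical subset, and whether unknown fields should be kept, warned about, or rejected is exactly the open question.

```python
# Hypothetical subset of recognised field names, for illustration only.
KNOWN_FIELDS = frozenset({"metadata_version", "name", "version"})


def split_unknown(data: dict) -> tuple[dict, dict]:
    """Separate recognised fields from everything else, without dropping any."""
    known = {key: value for key, value in data.items() if key in KNOWN_FIELDS}
    unknown = {key: value for key, value in data.items() if key not in KNOWN_FIELDS}
    return known, unknown


known, unknown = split_unknown(
    {"metadata_version": "2.4", "name": "example-pkg", "x_future_field": True}
)
```

Keeping the unknown fields separate lets a tool decide later whether to warn, fail, or pass them through.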

I’d also consider whether we should define any serialization rules for format (maybe as a SHOULD?), I’m thinking things like should keys be sorted, should the JSON file be written in a “compressed” format or a “pretty” format, etc. There’s not a security benefit to it, but if we at least recommend a deterministic serialization, then it helps get us closer to deterministic packaging (not building itself ofc), which feels like a nice win for very little cost?
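A minimal sketch of what a deterministic “compressed” output could look like with the standard library. The specific choices here (sorted keys, compact separators, no ASCII escaping, trailing newline) are assumptions for illustration, not a proposal from the PEP.

```python
import json


def dump_canonical(data) -> str:
    """Serialize so that equal data always produces identical text."""
    return (
        json.dumps(
            data,
            sort_keys=True,          # key order no longer depends on insertion order
            separators=(",", ":"),   # compact form, no optional whitespace
            ensure_ascii=False,      # write non-ASCII characters as-is
        )
        + "\n"
    )


first = dump_canonical({"version": "1.0", "name": "example-pkg"})
second = dump_canonical({"name": "example-pkg", "version": "1.0"})
```

Both calls produce the same text, so two builds that compute the same metadata would write the same file contents regardless of dict ordering.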

In the packaging.metadata thread I linked the original thought was to treat pyproject.toml and METADATA as the “same”, and I thought it was wrong to do that, and I still do.

Stealing my comments from there

I’m personally kind of sketch on the idea of using these APIs for pyproject.toml. My concern is that while some of the data there is the same, I think that conceptually pyproject.toml is a different (but related) thing altogether, which I think is further supported by the fact that they’re two different specifications on packaging.python.org and those specifications have different rules for validation.

For instance, pyproject.toml allows version to be specified as a dynamic value, but nowhere else is that actually allowed, and for practical reasons, if you’ve got a set of bytes that are a pyproject.toml, you don’t have a good way to even know if it contains this metadata without first parsing it, since the [project] table is optional. So, my assertion here is that pyproject.toml is a different thing from METADATA and friends, one is the input, and one is the output, but they should re-use a lot of the same primitives (Version, etc).

The authors/maintainers field here is interesting because while pyproject.toml is an easier to understand data format, it’s actually less powerful than what the core metadata allows, which is fine for a restricted input format, but that means that pyproject.toml can’t represent otherwise valid metadata.

I swear I’m not trying to sound like a broken record here, but I really do think that we’re going to save ourselves a lot of pain if we don’t try to cram pyproject.toml into the same box as the core metadata. It was intended for a different purpose than the core metadata, and it has different rules and it can’t even fully represent everything that core metadata can (which is fine! It’s designed as a more restrictive input format, not as a replacement for the core metadata).

And this entire comment.
