PEP 819: JSON Package Metadata

I’m happy to announce PEP 819: JSON Package Metadata. I was motivated to work on this PEP by issues I’ve run into building tooling around Python packages at work, and by discussions with others working on open source Python packaging tools about their pain points.

Abstract

This PEP proposes introducing JSON-encoded core metadata and wheel file format metadata files in Python packages. Python package metadata (“core metadata”) was first defined in PEP 241 to use RFC 822 email headers to encode information about packages. This was reasonable in 2001; email messages were the only widely used, standardized text format that had a parser in the standard library. However, issues with handling different encodings, differing handling of line breaks, and other differences between implementations have caused numerous packaging bugs. Using the JSON format for encoding metadata files would eliminate a wide range of these potential issues.

The full PEP text is published: PEP 819 – JSON Package Metadata | peps.python.org

Interested to see what folks think!

19 Likes

Sounds awesome! I agree that the standard JSON schema should be served from a sensible URL within packaging.python.org.

1 Like

Thanks for opening this @emmatyping! I’m a big fan of enabling a less ambiguous transfer format for Python packaging metadata.

You preempted my main request, which was for an official JSON schema! I agree with @CarrotManMatt that PPO would be the right place for it. I also think it’d be great to have it made available through SchemaStore, which can automatically pull external URLs (e.g. one on PPO) and ensure that IDEs and other tools are aware of METADATA.json’s expected structure.

One other minor thing that came to mind: this is currently a non-issue because the standard metadata doesn’t contain any integer fields, but we may want to stipulate that any future integer fields are encoded as strings (rather than as naive JSON integers) to avoid any sizing/precision issues across different JSON parsers. However, since that has literally zero impact on anything current, I think it could also be deferred; I just wanted to raise it as a source of parser differentials in JSON that occur in practice :slightly_smiling_face:

(The other one being duplicate keys, although I’m not sure how to handle that one in practice. Maybe it’s sufficient to say that the last duplicate wins, which is how both Python and serde-json behave?)

8 Likes

Great idea! I was not familiar with SchemaStore, but once the JSON Schema is finalized it makes a lot of sense to include METADATA.json’s and WHEEL.json’s schemas there.

I’m inclined to defer this to a future PEP, for whenever someone actually wishes to encode data as an integer. I’ll note this in the PEP text when I make my first batch of updates.

This is a good point; I do think specifying behavior here is a good idea. My first preference would be last-duplicate-wins, though an alternative is to treat any duplicates as an error. I think last-wins makes the most sense: it’s the behavior of most JSON parsers, and ECMA-262 specifies it as the behavior of JSON.parse (ECMAScript 2015 Language Specification – ECMA-262 6th Edition).
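For what it’s worth, the standard library already behaves this way:

```python
import json

# Python's json.loads keeps the last occurrence of a duplicate key,
# matching the JSON.parse behavior specified by ECMA-262.
doc = '{"name": "first", "name": "second"}'
print(json.loads(doc)["name"])  # second
```

(Tools that want to reject duplicates instead could use the `object_pairs_hook` parameter to inspect all key/value pairs.)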

3 Likes

Generally, I like this! Two relatively minor points.

  1. You mention as a security implication that the JSON payload may be designed to consume excessive memory or resources. I’m not sure what you consider “excessive” here, but the bulk of the metadata file will almost certainly be the description field, which is the project readme, and can be pretty large (some projects include their whole changelog in the README). Is there some other risk you’re thinking of?
  2. Can we not say that it’s mandatory for producers to create the new JSON metadata files? Obviously if a build backend doesn’t support PEP 819, it won’t create those files. But if it does support the PEP, then that pretty much means that it will create them[1]. So let’s be explicit. If we make it a requirement, that makes the transition process a bit easier - new wheels and sdists are guaranteed to have JSON metadata.

  1. What else would support mean? And having a way to opt out of JSON metadata seems pointless. ↩︎

7 Likes

Thanks Paul! Glad to hear that.

Let me preface this by saying that, practically speaking, I think this is unlikely to be a concern in the vast majority of cases, and it will only be a concern if a malicious payload is crafted. Unlike email, JSON has an integer type. As @woodruffw pointed out, this is the source of a lot of compatibility issues:

A warning about slow int() parsing was added to the json module’s documentation when CVE-2020-10735 came out (gh-95778: CVE-2020-10735: Prevent DoS by very large int() (#96499) · python/cpython@511ca94).

Thinking about the compatibility angle of this, as well as the security angle, I think it may be better to stipulate that integers (and probably floats? my understanding is there is some deviation from the standards here too) should be parsed as strings when deserializing from JSON, which you can already do by setting parse_int and parse_float when calling json.load[s]. If a future core metadata field required an integer or float value, it could always be parsed lazily later, reducing the chance of a DoS. I didn’t propose this initially because I wanted to leave the door open for integer metadata values, but I think they are more trouble than they are worth.
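To sketch what I mean (the field name here is hypothetical, purely for illustration):

```python
import json

# Deserializing all numeric values as strings sidesteps integer-size
# and float-precision differences between parsers, and avoids the slow
# str-to-int conversion flagged by CVE-2020-10735 for huge values.
payload = '{"some-future-count": 12345678901234567890123456789}'
data = json.loads(payload, parse_int=str, parse_float=str)
print(data["some-future-count"])  # the digits, kept as a string
```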

We certainly can and I agree we should. I’ll be explicit about this in my next update to the PEP text.

4 Likes

I’m mildly against having special rules for parsing JSON when it’s metadata.

How about we say in the security considerations that maliciously crafted metadata files could introduce issues by using integer values, but as no valid metadata fields use integers, this is not expected to be an issue for valid data. We can add that tools that want to protect against malicious data can use the parse_int and parse_float options to avoid the risk.

6 Likes

I agree that not special casing integers or floats in the specification is the way to go here. We should just recommend that tools that parse the metadata apply the necessary precautions to counteract malicious payloads.

I think a mention about overly long descriptions would also be helpful in the security considerations section.

1 Like

Yeah, this seems very reasonable to me :slightly_smiling_face:

(My concern was less about malicious inputs, and more about ambiguous ones: Python has arbitrary-precision integers so it’s compatible with the arbitrary-precision definition of integers in JSON’s grammar, but a lot of other languages assume that JSON integers fit within a union { u64, f64 } and will silently truncate/mangle/etc. them if that assumption is violated. So there’s a risk that Python packaging tools written in languages other than Python can do funky things with JSON integers, and IMO the standards should ideally preempt that. But this is such a minor issue!)
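A quick illustration of that differential:

```python
import json

# Python round-trips arbitrary-precision JSON integers losslessly...
big = 2**64 + 1  # one past the u64 range
assert json.loads(json.dumps(big)) == big  # exact in Python
# ...but a parser that coerces JSON numbers to f64 would mangle it:
print(float(big) == big)  # False: f64 cannot represent 2**64 + 1
```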

2 Likes

That’s understandable, I think we may need to add some rules to resolve ambiguity if we ever have integer values in metadata, but I’m fine deferring that until we practically need it.

I’ve updated the security section in my PR updating the PEP text to reflect this: PEP 819: Disambiguate JSON parsing and make new files required by emmatyping · Pull Request #4773 · python/peps. I also removed the portion about floats because I think they only pose a potential compatibility concern, not a security issue.

3 Likes

Two things. One, I totally support this!

Two, if this PEP gets accepted then packaging.metadata should get a function (or possibly a method) that takes the JSON object and produces the email headers string. The reason I suggest a function is to help make it clear which format is preferred.
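A rough sketch of the kind of thing I mean (the name and the naive field handling here are hypothetical, not an existing packaging.metadata API, and special-cased fields like keywords and project URLs would need real handling):

```python
def json_to_email_headers(metadata: dict) -> str:
    """Serialize a JSON-style metadata dict to the legacy email-header format."""
    lines = []
    for key, value in metadata.items():
        if key == "description":
            continue  # the description becomes the message body
        # e.g. "metadata_version" -> "Metadata-Version"
        header = "-".join(part.capitalize() for part in key.split("_"))
        values = value if isinstance(value, list) else [value]
        lines.extend(f"{header}: {v}" for v in values)
    body = metadata.get("description")
    return "\n".join(lines) + (f"\n\n{body}" if body else "\n")

print(json_to_email_headers({"metadata_version": "2.4", "name": "example"}))
```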

8 Likes

Thanks for looking into this!

A few quick notes from proofreading:

The semantic contents of the METADATA and METADATA.json files MUST be equivalent if METADATA.json is present. Installers MAY verify this information. Public package indexes SHOULD verify the files are semantically equivalent.

Does the same hold for PKG-INFO and METADATA.json?

The message body, if present, should be set to the value of the description key.

Perhaps I’m reading this wrong but it seems to be written the other way around; i.e. shouldn’t it say that the description key should be set to the message body?

Please see the next section for more information on backwards compatibility caveats to that change.

I’d recommend against “next” in case things get moved around. Better to explicitly say “the backwards compatibility section”.

3 Likes

Thanks for pushing this forward!

The Project-URL field should be converted into a JSON object with keys containing the labels and values containing the URLs from the original email value.

Since this deviates from PEP 566, would it make sense to name this field project_urls instead of project_url?

Specifically, I worry about backward compatibility of the pip install --report output, which exposes metadata in PEP 566 JSON format, so with project_url (singular) as a list. If this PEP is accepted, I suspect that at some point pip will want to use it (and its future iterations) in the installation report. In that perspective, having a new field instead of a type change to an existing field would help with the transition.
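Concretely, the shape change I’m describing (the URLs here are placeholders):

```python
# PEP 566 serializes each Project-URL value as a "label, URL" string in
# a list, while the proposal turns the field into a label-to-URL mapping.
pep_566_shape = {
    "project_url": [
        "Homepage, https://example.org",
        "Tracker, https://example.org/issues",
    ]
}
proposed_shape = {
    "project_url": {
        "Homepage": "https://example.org",
        "Tracker": "https://example.org/issues",
    }
}
# A consumer of the old shape has to split each entry itself:
converted = dict(s.split(", ", 1) for s in pep_566_shape["project_url"])
assert converted == proposed_shape["project_url"]
```

A tool expecting the old list-of-strings shape would choke (or worse, silently misbehave) when handed the dict under the same key.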

I assume other tools relying on the PEP 566 JSON serialization would have similar concerns.

Interestingly I notice that the new pypa/build --metadata flag already produces a project_urls dictionary, as it relies on packaging.metadata.parse_email. I also see it pluralizes license_files. I don’t know if there are other differences.

As far as I can see build does not claim to implement any specific spec in its metadata JSON output, however.

But it might be a good opportunity to try and align these things if possible.

6 Likes

Yes. I can add something to that effect in the next update to the PEP.

Yes, good catch!

I worry that if we have both project_url and project_urls we’ll have a lot of headaches around tools silently ignoring project_urls, which seems pretty bad. I’d rather have this fail loudly so people can search for an error rather than have some tool miss the project url metadata and cause confusion.

1 Like

Maybe.

Thinking further about what build --metadata now outputs, and looking closer at what packaging.metadata.parse_email does (it pluralizes most list fields, has the project_urls dict, and splits keywords into a list), we would end up with 3 dict representations of metadata:

  • PEP 566 JSON
  • What packaging.metadata.parse_email() and thus build --metadata return
  • PEP 819 JSON, which is PEP 566 JSON with one exception for the type of project_url

This gives me the feeling that this little PEP 819 exception is unnecessary and it could either stick to PEP 566 or go all the way to aligning with packaging’s RawMetadata dict.

2 Likes

“Unnecessary” is probably not the right word here, apologies. What I want to get at is that, as a way to make the PEP 566 JSON more ergonomic, only making project_url a dict is insufficient. The pluralization of list field names and parsing keywords into a list also seem relevant to that end.

So it feels to me like introducing a breaking change to PEP 566 JSON while solving only parts of the issues with the format.

4 Likes

Well, the output of build --metadata does not follow any standard, and was released 2 days ago. I’m sure there are other tools that have their own mappings of core metadata to JSON. But I don’t think those need to affect how we define JSON-encoded core metadata in the standards process. We already have PEP 566; let’s build off that standard and make it the best it can be.

I think it would be unfortunate if we introduced JSON metadata but didn’t use its features, like nested mappings, to make parsing nicer, and instead still required further custom parsing. Custom parsing for fields like Project-URL is exactly the type of thing I’d like to move away from by introducing JSON metadata. We could go all the way to RawMetadata’s naming, and I did consider this, but I’d much rather make one update to an existing standard than change things completely based on what a tool that isn’t standardized does. I do personally prefer how RawMetadata names things, but we cannot reasonably adopt it at this point.

1 Like

While I agree pluralization might be nice, as stated above I think it would be too much of a change to the standard. Regarding Keywords, PEP 566 already indicates it should be parsed as a list, which PEP 819 mirrors in step 4. I think this indicates that Project-URL ought to be a dictionary by similar logic.

So really the only change introduced on top of PEP 566 is the encoding of Project-URL. Pluralization of the fields could be done, but would break things for a lot more users.

I’d be interested to hear what @dustin thinks, as the original author of PEP 566.

In one way, I agree with this comment, but in another I’m becoming quite uncomfortable with the number of subtly different mappings that are being discussed.

PEP 566 is the current standard for mapping core metadata to JSON. The mapping rules weren’t transferred to the packaging standards documentation, but that doesn’t alter the fact that they were defined in an accepted PEP. I think we should ask the packaging and build projects why they chose not to follow PEP 566, and what it would take for them to switch to an updated standard mapping. I emphatically do not think it’s reasonable to introduce an updated mapping in PEP 819 if we can’t get existing projects to follow it. It looks like the build feature is only 2 days old, but packaging.metadata has been around for 2 years now, so there would certainly need to be some sort of transition mechanism, but we should be looking for unification here, not yet another form for people to learn.

Agreed. If JSON metadata (more accurately, I guess, structured datatypes for metadata) is to be the standard going forward, we should use the capabilities it gives us. But we need to focus on the data model rather than the representation, and recognise that we’re changing that model, not just layering parsing rules on top of the old, flat model.

Why not? How is it any harder to define core metadata in terms of strings, lists and dictionaries, with lowercase names and plurals where appropriate, plus a set of mapping rules to translate that to an email representation, than to define core metadata in a “flat” form modelled on what email format allows, with a set of mapping rules to translate to JSON and/or strings, lists and dictionaries?

Unless we use the PEP 566 rules, we’re changing the model, so let’s do it properly now, rather than creating the need for a second transition, when we’re fully using JSON metadata and tying it to a flat underlying data model no longer makes sense.

Looking at this another way, I think that if we limit the scope of PEP 819 to just the storage of metadata in JSON files, we’re pretty much compelled to use the existing standard for how to map metadata to JSON. But I’m starting to think we’d be better expanding the scope of PEP 819 to actually define the core metadata in a structured form (dictionaries, lists and strings), let the JSON format fall naturally out of how that form is serialised, and write serialisation rules for translating the new metadata format to the legacy email form. That way, when someone wants to add a new metadata item in the future, they can just use structured data naturally, get JSON serialisation for free, and it’s 100% clear that they need to define how to map the structured form to email.

7 Likes

I think the idea of JSON metadata is great, but we’ve been here before (sort of) with the withdrawn PEP 426. As far as I recall, there wasn’t anything technically wrong with what was proposed there (broadly speaking), but it basically died for social/political reasons. Maybe there’s a greater appetite now for this … I know I spent a lot of time on `distlib` working with the PEP 426 format, back in the day, but that came to nothing.

The PEP text doesn’t seem to refer to PEP 426 at all; is there any particular reason for that? Just curious.