Metadata format: metadata is not a plain mapping of strings

Almost all build backends (distutils, setuptools, flit, poetry, etc.) assume metadata to be a plain mapping of strings, when in reality RFC822 is much more complex. E.g., values in fields like Description can’t contain EOLs, yet in reality they do; the “email” body must contain only ASCII chars, yet in reality it can contain UTF-8 chars (a bug missed after the switch from Python 2 to Python 3?); etc.

Even if you use email to serialize supposedly RFC822-compliant metadata (actually, compliant with the latest standard, which seems to be RFC2822), so that str(email.message.Message) or str(email.message.EmailMessage) produces only ASCII characters with non-ASCII strings encoded in quoted-printable or base64, things won’t go as smoothly as you’d expect (a short probe follows the list below):

  • Message encodes non-ASCII header values in base64.
  • Message.set_payload(string) passes content as is.
  • Message.set_payload(string, charset="utf-8") encodes content in base64.
  • EmailMessage passes non-ASCII header values as is.
  • EmailMessage.set_payload(string) passes content as is.
  • EmailMessage.set_payload(string, charset="utf-8") encodes content in base64.
  • EmailMessage.set_content(string) encodes content in quoted-printable.
  • EmailMessage.set_content(string, charset="utf-8") encodes content in quoted-printable.
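
One way to see these differences is to feed the same non-ASCII, multiline data to both APIs and compare the serialized output. This is only a rough probe (the field name and values are made up, and the exact choice of encoding can differ between Python versions):

from email.message import EmailMessage, Message

SUMMARY = "Résumé of the package"          # non-ASCII header value
BODY = "Première ligne\nDeuxième ligne\n"  # non-ASCII, multiline body

legacy = Message()
legacy["Summary"] = SUMMARY
legacy.set_payload(BODY, charset="utf-8")  # also adds Content-Type/CTE headers

modern = EmailMessage()
modern["Summary"] = SUMMARY
modern.set_content(BODY)                   # uses the default email policy

# Compare how each API serializes the same header value and body.
print(legacy.as_string())
print(modern.as_string())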

The only valid solution seems to be Message.set_payload(string, charset="utf-8"), but Message.add_header() accepts multiline strings without raising an exception, keeping the user ignorant of invalid input (against The Zen of Python). To make Message raise exceptions, it must be created with a special policy: Message(policy = email.policy.strict). But that’s not all: to read decoded content you must call Message.get_payload(decode = True).
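
For the reading side, here is a minimal, self-contained sketch; the metadata document below is made up for illustration, and with the strict policy any parsing defect raises an exception instead of being collected silently:

import email
from email import policy

raw = (
    "Metadata-Version: 2.1\r\n"
    "Name: example-project\r\n"
    "Content-Type: text/plain; charset=\"utf-8\"\r\n"
    "Content-Transfer-Encoding: base64\r\n"
    "\r\n"
    "TG9uZyBkZXNjcmlwdGlvbgo=\r\n"
)
msg = email.message_from_string(raw, policy=policy.strict)

# get_payload(decode=True) undoes the Content-Transfer-Encoding and
# returns bytes (or None for multipart messages).
print(msg.get_payload(decode=True).decode("utf-8"))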

While the described issues can be fixed, it’s likely that there are many more. One example: the standard defines that only CRLF is used for EOLs, but people mention that some software doesn’t work with CRLF, while software that works with CRLF also works with LF (well, those are email clients, so this may not hold for Python’s metadata). Another example: header values may contain parameters, which after parsing become part of the value (Summary: Title; charset="utf-8"). With wider acceptance of custom build backends, especially in-tree build backends, there’s bound to be more software that generates invalid metadata due to underestimating the complexity of RFC822: if a package manager as sizable as poetry, a project with lots of people working on it, generates metadata from format strings, there are likely to be many more packages with invalid metadata.

Ideally, it’d be best to switch to a completely new format: in 2020, when adoption of Unicode/UTF-8 continues to rise and even Python itself may switch to a UTF-8-only mode, it’s weird that such a noob-friendly programming language as Python uses such a complicated (not just complex) and error-prone format just to pass a mapping of strings… But since there’s little hope of that, the next best solution is this:

  • write a dedicated PEP as the only way to inform all interested parties about how broken the current RFC822-based IO logic is, which will hopefully result in the use of proper tooling.
  • leave things as they are for affected distros/package managers and introduce metadata v3.0 based on the existing format (a major version bump is needed to reflect the significance of the changes).
  • provide a working example of valid metadata IO in the related PEP(s), or better yet, add a dedicated package/logic to the stdlib/packaging/whatnot.
  • apart from dedicated logic for metadata IO, there should probably be some logic to verify sdists and wheels, like auditwheel, but cross-platform and probably without checking dependencies of platlib binaries to keep the implementation simpler (maybe as part of packaging, or complementary to it?).

In case you have second thoughts about the existing format, TOML will most likely make the most sense as a new format from a consistency POV, even though it is more limiting in comparison to JSON (e.g., homogeneous arrays, but supporting more native Python types out of the box). Though if metadata is switched to TOML, it will make sense for PEP 621 to use the same field names in pyproject.toml as defined in the core metadata, and possibly switch the names to either lowercase-hyphen or snake_case, which in turn may trigger a desire for a PEP with modernised aliases for the core metadata…

CodeMeta covers many of the Python package metadata fields. The codemeta schema defines datatypes.

https://github.com/codemeta/codemeta :

CodeMeta contributors are creating a minimal metadata schema for science software and code, in JSON and XML. The goal of CodeMeta is to create a concept vocabulary that can be used to standardize the exchange of software metadata across repositories and organizations. CodeMeta started by comparing the software metadata used across multiple repositories, which resulted in the CodeMeta Metadata Crosswalk. That crosswalk was then used to generate a set of software metadata concepts, which were arranged into a JSON-LD context for serialization.

See https://codemeta.github.io for a visualization of the crosswalk table and guides for users and developers.

https://pypi.org/project/CodeMetaPy/ :

To integrate this, add the following to your project’s setup.py:

try:
    from codemeta.codemeta import CodeMetaCommand
    cmdclass = {
        'codemeta': CodeMetaCommand,
    }
except ImportError:
    cmdclass = {}

And in your setup() call add the parameter:

cmdclass=cmdclass
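
Putting the two snippets together, a complete setup.py might look like this (the project name and version are placeholders):

import setuptools

# CodeMetaPy is optional; fall back to no extra command if it is missing.
try:
    from codemeta.codemeta import CodeMetaCommand
    cmdclass = {'codemeta': CodeMetaCommand}
except ImportError:
    cmdclass = {}

setuptools.setup(
    name='example-project',
    version='0.1.0',
    cmdclass=cmdclass,
)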

An example codemeta JSON-LD record:
https://github.com/codemeta/codemeta/blob/master/examples/schema-org-codemeta.json

Repeated properties are represented as lists (of strs or dicts) in JSON-LD.

SHACL and/or JSON Schema are sufficient for data validation.

JSON-LD makes sense because we’re describing a graph of things with relations and properties thereof.

Truth be told, I don’t even know what to say… It seems like CodeMetaPy can be useful, but IMO the problem is with the format itself, or at least with its generation and interpretation, which can/should be handled by the built-in email package (people just have to use it, and use it right). E.g., CodeMetaPy uses importlib_metadata, which uses email.message_from_string() with the defaults (policy set to email.policy.compat32, etc.), which results in multiline strings, encoded fields (?), non-ASCII chars, etc. being passed through as-is.

P.S. It seems that CodeMetaPy assumes that the Author field can contain only a person’s name, but it can also be the name of an organization, which will then be split into something like "givenName": "Qt", "familyName": "for Python Team".

TOML will most likely make the most sense as a new format from a consistency POV, even though it is more limiting in comparison to JSON (e.g., homogeneous arrays [snip]

Hey, I dropped that restriction last year.

https://github.com/toml-lang/toml/pull/676


I’m not entirely sure what you’re proposing here. You’ve made a number of comments, but I’m not sure what anyone can do about them as they stand…

Could you break down your comments into some specific actions and proposals?


As I have said, because almost everyone assumes that the current format is more or less a plain mapping of key-value pairs, incorrect dist metadata is generated by many build backends. To change this, it is enough to write a class based on email.Message configured to produce and consume only valid RFC822-compatible documents (in the form of strings), but because such a class might go unnoticed, it may be desirable to use the power of a PEP to inform the world about the importance of using it for metadata IO. Considering that the PEP for metadata v2.2 is in the works (not sure if this change is worth a major or a minor version bump), you could use this moment to include a reference to the yet-to-be-written class, as well as an explanation of why it is important to use it.

The biggest change will be caused by the handling of multiline strings and will require an update of the metadata (Description and License): instead of “escaping” CRLF by appending 7 spaces followed by a pipe (“|”) char, strings will have to be forcefully encoded in base64 (if len(str.splitlines()) > 1: str = str.to_base64(); a sketch of this follows below). I’m not completely sure this is the right solution, even though it seems like the only valid one.
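
As a sketch of that check (to_base64() above is pseudo-code, base64.b64encode is what would actually be used, and the function name here is made up):

import base64

def encode_field_value(value: str) -> str:
    """Leave single-line values as-is; base64-encode multiline values so the
    serialized field stays on one line and contains only ASCII."""
    if len(value.splitlines()) > 1:
        return base64.b64encode(value.encode("utf-8")).decode("ascii")
    return value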

Oh, and such a class will have to work in a legacy mode (email.policy.compat32 & Co) with metadata <=2.1, although I’m not sure if “escaped” CRLF in Description should be interpreted (as defined in the PEPs, in case someone actually implemented this) or passed untouched (as seems to be done now).

This sounds like you’re proposing either:

  • A change to the existing standards
  • A new library for handling package metadata

… or possibly both. If so, are you planning on writing either of these yourself, or are you just intending to flag up a potential issue? Because to my knowledge, the current standards and handling in tools aren’t resulting in any actual bugs - so addressing theoretical problems is relatively low priority, and pretty much everyone in Python’s packaging community is snowed under with the work we already have.

You’re obviously interested in improving Python’s packaging ecosystem - which is much appreciated! - but I think you need to be aware that you won’t get very far describing issues in terms like “it is enough to write…”, “you could use this moment…” and “such class will have to…”. To make progress with your ideas, you’ll have to either implement them yourself, or be much more precise in what you expect others to do, and why you feel that it would be of practical benefit to the ecosystem, improving real world use cases rather than just theoretical situations.

Another way to solve this is to have more universally used code manage the generation of the metadata. With PEP 621 I’m planning to construct such a chunk of code so that PEP 621, PKG-INFO, and METADATA can be managed from a single object. That should help with consistency and data validation.

IIUC, the issue @8day is pointing out here is that our specs say that METADATA and friends use the RFC 822 format, but in fact RFC 822 has a bunch of complexities that all our tools ignore.

One option would be to fix all our tools to implement the full RFC 822/2822 format. But maybe a simpler option would be to fix our specs to document the simpler parsing and escaping rules that our tooling uses? E.g. even if RFC 822 says that you can’t use utf-8, there’s no reason we have to follow that.


I’m strongly in favor of dropping the RFC 822 reference and documenting what we’re really doing. Maybe then we can move to all doing it the same way with the shared code Brett alludes to coming in the future. :slight_smile:

-Fred


I am not sure about tools generating metadata, but metadata parsers (at least both setuptools and importlib.metadata) use the stdlib email.parser, which should be compatible with RFC 822 and friends (documentation says it implements RFC 5322), and everyone seems to be happy with the result. I wonder what the disconnect here is—do tools currently generate non-compliant metadata that email.parser can somehow parse (which sounds like a stdlib bug), or do tools somehow do something to work around the discrepancy?

Considering the required changes – both.

Mostly I wanted to just raise this issue.

First of all, I’m not good at explaining things, not to mention that I have a relatively poor command of English. Second, while I could write some code, I don’t have the required level of understanding of RFC822 or the related standards to make an educated decision. Also, there are too many possible solutions, some of which are unlikely to be accepted (like a new format), so before anything can be done, it must at least be decided how this problem should be solved. E.g., we have roughly these options:

  • stick to RFC822-compliant metadata (no multiline fields) and use a special subclass of email.Message with a layer of logic to encode/decode multiline and non-ASCII strings to/from base64.
  • stick to RFC822-compliant metadata (no multiline fields) and:
    • use email.message.Message(policy = email.policy.strict) for metadata IO to raise exceptions for multiline strings in fields, etc.
    • store Description only in the body of the message, which will become possible with metadata v2.2 because the sdist will no longer have to store Description as an RFC822 header (see the sketch after this list).
    • instead of Description-Content-Type, use Content-Type.
    • allow use of fields required by RFC822, like the Content-Transfer-Encoding: base64 needed to transfer non-ASCII strings.
    • because only Description had any parameters, we could forbid the use of parameters to simplify the format.
    • transfer non-ASCII strings in base64 or quoted-printable.
    • use email.message.Message().get_payload(decode=True) when reading the description.
  • stick to the Core metadata specifications in the Python Packaging User Guide (multiline Description with custom EOL formatting a.k.a. “folding”), but it’s likely that no software uses the described folding of multiline strings.
  • stick to the de-facto standard used by setuptools (mostly plain key-value pairs with raw multiline, non-ASCII, etc. values in any field).
  • use a new format, like the one in v2.0, but that’s unlikely.
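
A rough write-side sketch of the second option (all field names and values below are illustrative only):

from email.message import Message

msg = Message()
msg["Metadata-Version"] = "2.2"
msg["Name"] = "example-project"
msg["Summary"] = "One-line summary"
msg.add_header("Content-Type", "text/markdown")   # instead of Description-Content-Type
# The long description goes into the body; set_payload() with a charset adds
# the charset parameter and a base64 Content-Transfer-Encoding header.
msg.set_payload("Long description\nwith ünïcöde and several lines\n",
                charset="utf-8")
print(msg.as_string())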

BTW, the “Unicode and email” article on Wikipedia contains some information about Unicode in email, and it seems that the document itself can’t be in Unicode, only parts of it (email addresses, etc.).

Both. email.parser parses such files thanks to the default policy email.policy.compat32, which passes field values without “unfolding” them, thus passing through multiline strings, Unicode (it just assumes ASCII, but Unicode happens to work), etc. Here’s a related URL from a note in importlib_metadata._compat.py2_message_from_string: Issue 25545: email parsing docs: clarify that only ASCII strings are supported - Python tracker.
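
A small, self-contained demonstration of that pass-through behaviour (the field values are made up; compat32 is already the default and is only spelled out here for clarity):

import email
from email import policy

raw = (
    "Metadata-Version: 2.1\n"
    "Summary: first line\n"
    " second line\n"           # folded continuation line
    "Author: Ünïcode Näme\n"
    "\n"
)
msg = email.message_from_string(raw, policy=policy.compat32)
print(repr(msg["Summary"]))    # the embedded newline and leading space survive
print(repr(msg["Author"]))     # non-ASCII passes through untouched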


Anyway, the problem is that if someone wants to write a generator or interpreter of metadata, they must choose whether to be RFC822-compliant, follow the Core metadata specifications in the Python Packaging User Guide, or emulate setuptools. That’s the only real complaint I have.

Thanks for clarifying - I think with this and the comments from others, I now understand your point. Apologies for being slow.

I agree that having a self-contained and precise spec would be an improvement. At the moment, I think we probably rely far too heavily on a general principle of “don’t try to do anything too clever”. But I am somewhat reluctant to get into a situation where we have to replicate substantial chunks of the RFC822 and associated specs.

Maybe, given that we’re converging on TOML as the standard packaging format, we should switch the metadata spec to use that format. That would be a big change, and should probably wait until the stdlib includes a TOML library. Or we could switch to JSON.

I’m less sure how far we should go in the shorter term, but I’m happy to leave that question to others who are more likely to have time to do the work.

Again, thanks for bringing this up, and for your patience explaining.

Below are some additional notes that may be of interest.

There’s one more thing: the metadata in WHEEL is also RFC822-based. Seeing as distribution names are limited to ASCII chars and there are no values with multiline strings, it’s unlikely there will be any problems, but for the sake of consistency it would make sense to use the same format as for distribution metadata. Additionally, seeing as there are plans to modify the wheel/sdist formats (something about *.dist-info), that may be a perfect time to switch WHEEL to a different format.

As a short-term, and potentially a longer-term, solution, it may be best to accept the metadata produced by setuptools and consumed by pip as the standard. Considering that the syntax of some metadata fields was relaxed when reality didn’t match expectations (the syntax of values is a completely different matter, but still), this shouldn’t be anything new, just an unexpected side effect of PEP 517. Thus, metadata will have to be re-defined as RFC822-based, as opposed to RFC822-compliant; parameters will have to be stripped from all fields except Description (not sure about stripping, maybe a warning would be better); the document will have to support both ASCII and UTF-8; and both CRLF and LF will have to be allowed as EOL chars. As for the handling of Description, it seems that this setup.py

import setuptools
setuptools.setup(
    name = 'xxx',
    version = '0.0.0-dev0',
    long_description = 'a\nb\nc\n\nx\n',
)

results in an sdist with this PKG-INFO (BTW, Metadata-Version for the sdist is 1.0, when in reality it’s 1.2, as it should be (?)…)

Metadata-Version: 1.0
Name: xxx
Version: 0.0.0.dev0
Summary: UNKNOWN
Home-page: UNKNOWN
Author: UNKNOWN
Author-email: UNKNOWN
License: UNKNOWN
Description: a
        b
        c
        
        x
        
Platform: UNKNOWN

and a wheel with this METADATA (even when installed from the sdist with pip, meaning pip in a way understands EOLs followed by 8 spaces)

Metadata-Version: 2.1
Name: xxx
Version: 0.0.0.dev0
Summary: UNKNOWN
Home-page: UNKNOWN
Author: UNKNOWN
Author-email: UNKNOWN
License: UNKNOWN
Platform: UNKNOWN

a
b
c

x

which shows that it may be possible to merge the two flavours of Description – the one at https://packaging.python.org/specifications/core-metadata/ and the one used by setuptools – which would result in a smaller impact on build tools. To do this, instead of following each EOL character with 7 spaces and a “|”, each EOL should be followed by 8 spaces, or both variants could be made valid (after all, there are 8 chars in total and 7 of them must be spaces, making it easy to support both versions; a rough sketch of such unfolding follows below). I suspect that the code in setuptools and pip responsible for that 8-space indentation should provide more hints about how this can be handled. Just wish I had figured this out sooner.
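
Purely as an illustration (no existing tool works exactly like this), an unfolding helper that accepts both continuation prefixes might look like:

def unfold_description(value: str) -> str:
    """Undo Description folding, accepting both the spec's continuation
    prefix (seven spaces plus "|") and the eight spaces setuptools emits."""
    lines = value.splitlines()
    if not lines:
        return value
    unfolded = [lines[0]]
    for line in lines[1:]:
        if line[:7] == " " * 7 and line[7:8] in ("|", " "):
            unfolded.append(line[8:])
        else:
            unfolded.append(line)
    return "\n".join(unfolded)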

But don’t quote me on all of this – I haven’t been sleeping well these past few days, so my mind is hazy…