Python metadata format specification and implementation

pyfisch · March 6, 2021, 11:14pm

The common format used in METADATA, PKG-INFO and WHEEL files is currently not well defined. PEP 241 written in 2001 referred to RFC 822, which specifies the email message format, for a description of the format.

The complexity of the email library has led many metadata generators to implement the format themselves. However, these implementations usually forget to check for line breaks, and because different writers are used for PKG-INFO and METADATA, the contents often differ. This sometimes results in invalid metadata files. Parsers, on the other hand, usually use the email package but don’t check for defects or logic errors in the metadata files, so invalid files are accepted and uploaded to PyPI.

Because of the earlier discussion in Core metadata email fields & Unicode with @takluyver and @uranusjr, I had a closer look at the metadata format and tried to come up with a solution to these issues. Python packages don’t make use of the complex features of email messages, so a replacement should be feasible, although some churn is inevitable if one wants to improve on the status quo, as a few published metadata files for popular packages are invalid (see below).

I’ve drafted a written specification of the format that is compatible with the metadata files already deployed on PyPI but does not depend on the email RFCs for the message syntax. In addition, I’ve implemented a parser and a serializer for the format using only the standard library. A dict-like API for accessing the metadata fields is currently missing, as is additional message validation. To test my implementation I collected metadata files from the top 4000 packages on PyPI. I can parse and re-serialize all files without problems, except those that contain errors and aren’t parsed correctly by the email package either.

Examples of invalid PKG-INFO files found on PyPI:

tendo-0.2.15: Each keyword is on its own line, without leading whitespace. This breaks the message, as each line should either be a “key: value” pair or, if line folding is used, start with a space to continue the previous value.
rstr-2.2.6: user put a long multi-line description in the Summary field. Same issue as above.
vaderSentiment-3.3.2: for an unknown reason, the description contains completely blank lines. A blank line without whitespace signals the end of the message header, so the remainder of the message is erroneously considered to be the payload.
additional errors in passlib-1.7.4 and win_inet_pton-1.1.0
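
A minimal sketch of how two of these failure modes look to the standard library's email parser. The sample inputs are invented stand-ins that imitate the problems described above, not copies of the real PKG-INFO files, and `parse_and_report` is a hypothetical helper name.

```python
import email

# Hypothetical stand-in for the tendo case: a keyword on its own, unindented line.
UNINDENTED = (
    "Metadata-Version: 1.1\n"
    "Name: example\n"
    "Keywords: one\n"
    "two\n"
    "License: MIT\n"
)

# Hypothetical stand-in for the vaderSentiment case: a blank line inside Description.
BLANK_LINE = (
    "Metadata-Version: 1.1\n"
    "Name: example\n"
    "Description: first paragraph\n"
    "\n"
    "        second paragraph\n"
    "License: MIT\n"
)


def parse_and_report(text):
    """Parse with the default (compat32) policy and list the defect class names."""
    msg = email.message_from_string(text)
    return msg, [type(defect).__name__ for defect in msg.defects]


msg1, defects1 = parse_and_report(UNINDENTED)
msg2, defects2 = parse_and_report(BLANK_LINE)

# In both cases the License header ends up in the payload instead of the headers.
print(defects1, msg1["License"])
print(defects2, msg2["License"])
```

The first input at least records a defect on the message, but nothing is raised unless the caller checks `msg.defects`. The second input produces no defect at all, because a blank line is a perfectly valid header/body separator.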

The METADATA files in the wheels for these packages were produced from the broken PKG-INFO files and, while syntactically valid, contain mangled or incomplete data from the PKG-INFO.

Does anyone know of METADATA/PKG-INFO files containing a Description field in the piped format from the standard?
Is it currently possible to block the upload of invalid METADATA and PKG-INFO files to PyPI?

westurner · March 7, 2021, 12:07am

IMO, any breaking or substantial change to metadata should be incorporated into a move to CodeMeta / Schema.org JSON-LD.

https://codemeta.github.io/

Other than backward-compatibility, there’s no reason to continue with this custom METADATA format that basically nothing else uses.

westurner · March 7, 2021, 12:16am

(edit) you have a perfectly valid case for fixing the parser for this format.

dustin · March 7, 2021, 12:28am

PEP 566 defines a canonical way to transform Python package metadata to JSON. Why not just use JSON?

westurner · March 7, 2021, 3:49am

The CodeMeta Project defines a more useful mapping of some distutils package metadata attributes to standardized https://schema.org RDFS properties.

Would it be better to just have codemeta JSON-LD in the package and in a <script type="application/ld+json">{"@context": "", ...}</script> in e.g. the warehouse templates, rather than to template existing metadata (also using a codemeta crosswalk) to RDFa or JSON-LD in pages on PyPI?

pyfisch · March 7, 2021, 11:04am

There seems to be some confusion. My goal is not to change the format Python package metadata is stored in, e.g. by switching to JSON. Neither is my goal to improve the metadata itself, e.g. by adopting the CodeMeta standard. Rather I want to specify the existing format and provide parsers and generators for this format to ensure interoperability between different tools.

Switching the package metadata format to JSON was proposed in PEP 426 but withdrawn because there is “no feasible migration plan”.
Actually I am working with the PEP 566 canonical, JSON-compatible representation in my implementation: METADATA files are parsed to this data structure and can be accessed with get_structured(). METADATA files are also written from the same structure.
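
For readers unfamiliar with that representation, here is a rough sketch of the PEP 566 transformation built on the standard library parser. It is not the `get_structured()` implementation mentioned above; `to_json_compatible` and the `MULTIPLE_USE` subset are assumptions made for illustration, and the PEP's rule for splitting Keywords into a list is left out.

```python
from email.parser import HeaderParser

# Assumed subset of the multiple-use fields from the core metadata spec.
MULTIPLE_USE = {"classifier", "requires-dist", "provides-extra", "project-url", "platform"}


def to_json_compatible(text):
    """Turn 'Key: value' metadata into a plain dict, roughly following PEP 566."""
    msg = HeaderParser().parsestr(text)
    result = {}
    for key in msg.keys():
        json_key = key.lower().replace("-", "_")
        if json_key in result:
            continue  # a repeated header was already collected via get_all()
        if key.lower() in MULTIPLE_USE:
            result[json_key] = msg.get_all(key)
        else:
            result[json_key] = msg[key]
    payload = msg.get_payload()
    if payload:
        # The message body, if present, becomes the description.
        result["description"] = payload
    return result


sample = (
    "Metadata-Version: 2.1\n"
    "Name: demo\n"
    "Classifier: Programming Language :: Python :: 3\n"
    "Classifier: License :: OSI Approved :: MIT License\n"
    "\n"
    "Long description.\n"
)
structured = to_json_compatible(sample)
```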

takluyver · March 8, 2021, 4:20pm

Thanks for tackling this!

To echo what @pyfisch already said, this is a “write down what we’ve already got” effort, not a new metadata format. Any move to change the metadata format runs into a fairly obvious problem: there are millions of releases already on PyPI with the existing metadata format, so any tool that consumes packages will still have to handle the existing format for a long time.

From the spec draft:

Key-value pairs are written as “Key: Value”, terminated by a newline.

Except when the newline doesn’t terminate them. They’re terminated by a newline not followed by whitespace, if I’m reading it correctly. Apart from the one at the end of the file…

replace the newlines and every whitespace following it with a single space character

Note that your implementation also collapses the whitespace before the newline - "foo\t\n bar" becomes "foo bar", whereas this spec implies it should be "foo\t bar". IDK which is right.

What happens if you have an all whitespace line ("foo\n \n bar")? Does it collapse to a single space, or do you get two spaces?

If the field name is Description and you are in a METADATA or WHEEL file use a different algorithm:

Is it possible to define the line continuation rules in a way that doesn’t depend on the filename, and ideally doesn’t depend on the field name either? My reading of the core metadata spec is that the Description field is not a special case, but it’s the only field where you’re likely to have extended text with blank lines and indentation, which requires the format with the | character. When I went looking for parsing code in pip & pkg_resources, I didn’t see it doing anything special for the Description field.

The list of key value pairs is terminated by two newlines.

Is this mandatory if there’s no ‘body’ following the ‘headers’? Is even a single trailing newline mandatory in that case? I think your implementation allows for no trailing newline.

The rest of the file is an optional multi-line payload which is used for descriptions…

We probably need to be more precise about this. It’s clear, I think, that if the headers do not include ‘Description’, and there is a ‘body’, then it is used as this field. If both are present, does one take precedence, or should a tool reading such a file error/warn? And should the body be transformed in any way if it’s used? E.g. do any of the rules for processing multi-line values apply?

Kudos for checking a sizeable sample of packages from PyPI, that’s an important thing to do. Do existing tools (like pip) give any warning or error on the few broken cases you found? Can you access some or all of the metadata through importlib.metadata? I just looked at a couple of their projects, and there were no obvious ‘pip won’t install it!’ issues, which implies that at least pip is fairly tolerant of bad metadata files.

For the files that parse OK, a good extension would be to check if something like ~~importlib.metadata~~ email.parser gives identical results for the originals and the files you have rewritten.
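
A sketch of what such a check could look like, assuming the original and the rewritten file are both available as strings. `email_view` and `same_for_email_parser` are hypothetical helper names.

```python
from email.parser import Parser


def email_view(text):
    """Return what email.parser sees: the (name, value) pairs and the body."""
    msg = Parser().parsestr(text)
    return msg.items(), msg.get_payload() or ""


def same_for_email_parser(original, rewritten):
    """True if both texts give identical headers and payload under email.parser."""
    return email_view(original) == email_view(rewritten)


original = "Name: demo\nVersion: 1.0\n"
with_separator = "Name: demo\nVersion: 1.0\n\n"
changed = "Name: demo\nVersion: 1.1\n"

print(same_for_email_parser(original, with_separator))
print(same_for_email_parser(original, changed))
```

With the default policy the header values are compared as stored, including any embedded line breaks from folding, so a rewriter that unfolds values would show up as a difference here.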

To be clear, I see the Python code as primarily there to validate the spec. As the stdlib email module can already read & write compatible metadata, I wouldn’t expect much demand for a new Python library. But one advantage to nailing down the spec is the possibility of confidently creating/parsing metadata from other languages.

brettcannon · March 8, 2021, 10:28pm

Initial thinking about this has started at Add a metadata API · Issue #383 · pypa/packaging · GitHub and is partially awaiting me having more time to do the next step, but if you beat me to it then great. But a key point of doing this was to get an object model that could be used to read/write the various metadata formats that we have and have it centralized in ‘packaging’ so we all agree on how things should function, so same motivation as you.

pyfisch · March 8, 2021, 11:12pm

Thanks for the thorough review of the spec draft.
You poked quite a few holes in my written description, so for the next draft I will write down a grammar of the metadata files in ABNF. Since ABNF is mainly used in IETF RFCs I am wondering if there is a different preferred grammar to specify formal syntax in PEPs and other Python standards?

RFC 5322 Section 3.2.2 says: “Runs of FWS, comment, or CFWS that occur between lexical tokens in a structured header field are semantically interpreted as a single space character.” FWS is folding whitespace, i.e. the whitespace at the end of a line, the line break \r\n, and the whitespace at the start of the next line. However, all our metadata fields are technically unstructured header fields since they don’t appear in the RFC. For these header fields RFC 5322 Section 2.2.3 states: “Unfolding is accomplished by simply removing any CRLF that is immediately followed by WSP.”

In the ~5000 metadata files I collected, line folding in fields excluding “Description” happens a total of 9 times. And all of them are cases of what I would call “accidental line folding” which happens if a user specifies a value in setup.py, setup.cfg or another configuration file across multiple lines but the tool doesn’t remove the line breaks. (It doesn’t add the required initial whitespace for line folding either. If there is whitespace in the configuration it works fine, otherwise you have a broken metadata file.)

So either retain all whitespace or replace it with a single space. Because the space is insignificant I would suggest implementations should remove it.
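
The two options can be sketched as follows. Both functions are illustrative only (the names are made up here), and the second one assumes that several consecutive folds, including whitespace-only lines, should collapse into one space, which is one possible answer to the "foo\n \n bar" question.

```python
import re


def unfold_rfc5322(value):
    """Remove every line break that is followed by a space or tab; keep all whitespace."""
    return re.sub(r"\r?\n(?=[ \t])", "", value)


def unfold_collapse(value):
    """Replace each run of folds, with the whitespace around them, by a single space."""
    return re.sub(r"(?:[ \t]*\r?\n[ \t]+)+", " ", value)


for raw in ("foo\t\n bar", "foo\n \n bar"):
    print(repr(unfold_rfc5322(raw)), repr(unfold_collapse(raw)))
```

The first function follows the RFC 5322 unfolding rule literally, so whitespace on both sides of the fold survives. The second gives the same result no matter how a writer happened to indent the continuation.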

I disagree: the description field is super-extra special. The quotes above describe how line breaks are only used for folding in the header fields and that they aren’t part of the value. It’s a wart of the specification that description needs special handling, but I don’t think I can avoid it.
(But you could split the parsing in two parts: One, parse the file into key-value pairs with values containing line breaks. Two, apply line folding and in the case of description, remove the 8 leading spaces.)

This is the function in wheel that removes leading whitespace from description: dedent_description. Meanwhile, distutils doesn’t do anything special with description, which means it has 8 leading spaces in each line. I still haven’t found any tool implementing the “7 spaces and a pipe” format; it isn’t something from email but specific to the Core Metadata standard.
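
To make the two indentation styles concrete, here is a small sketch of the second parsing step for Description. It is not the wheel function; `clean_description` is a hypothetical name, and the fallback for lines with neither prefix is an assumption.

```python
def clean_description(value):
    """Undo header-style indentation of a multi-line Description value."""
    first, *rest = value.split("\n")
    cleaned = [first]
    for line in rest:
        if line.startswith("       |"):
            # "7 spaces and a pipe" format from the core metadata spec.
            cleaned.append(line[8:])
        elif line.startswith(" " * 8):
            # Plain 8-space indentation, as described for distutils above.
            cleaned.append(line[8:])
        else:
            # Assumption: anything else just loses its leading whitespace.
            cleaned.append(line.lstrip(" \t"))
    return "\n".join(cleaned)


spaces = "Intro\n        para\n        \n            code"
pipes = "Intro\n       |para\n       |\n       |    code"
print(clean_description(spaces) == clean_description(pipes))
```

Only the first 8 characters are removed, so indentation inside the description (for example a code block) is kept.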

I treat it as an error if both are present, I will add it to the specification.

There is no error or warning in pip. If you just build your sdist and wheel, and then install them you won’t notice.

You can access the portion of the metadata that appears before the defect. importlib.metadata is based on email, which assumes you just forgot to add an extra newline to indicate the start of the message body and interprets the remainder of the message as the body.

I will check that when the specification is more complete. But I am very confident that it already works.

While the email module can write compatible metadata, you have to configure it right. (At least utf8 for non-ASCII names and refold_source="none" to avoid line folding, possibly more.) For one reason or another, tools actively avoid using email to generate metadata. Another motivation for directly implementing the specification and using this code is to ensure that metadata is actually written as specified. As we have seen, email’s parser is less than strict. There are some additional “features” in email messages, like encoded words, that aren’t used right now but could easily creep into metadata if there isn’t a strict implementation rejecting these metadata files.
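
A sketch of the kind of configuration meant here, compared with the unconfigured legacy API. The exact set of options is an assumption: `max_line_length=0` is added because, according to the email.policy documentation, `refold_source` only applies to values that came from a parser, not to values set in code.

```python
from email.message import EmailMessage, Message
from email.policy import EmailPolicy

# utf8=True writes non-ASCII text as-is instead of RFC 2047 encoded words.
# max_line_length=0 disables wrapping; refold_source="none" leaves parsed values alone.
METADATA_POLICY = EmailPolicy(utf8=True, refold_source="none", max_line_length=0)


def write_with_policy(fields):
    """Serialize (key, value) pairs with the configured policy."""
    msg = EmailMessage(policy=METADATA_POLICY)
    for key, value in fields:
        msg[key] = value
    return msg.as_string()


def write_legacy(fields):
    """Serialize the same pairs with the default compat32 behaviour."""
    msg = Message()
    for key, value in fields:
        msg[key] = value
    return msg.as_string()


long_summary = "word " * 30 + "end"
fields = [("Name", "demo"), ("Author", "José Example"), ("Summary", long_summary)]
configured = write_with_policy(fields)
legacy = write_legacy(fields)
print(configured)
print(legacy)
```

In the legacy output the author name turns into an encoded word, which is exactly the kind of email feature that could creep into metadata files unnoticed.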

I think demand for the new Python library will increase when I point tool maintainers to the invalid metadata files their software produces. Ideally a compliant metadata implementation should be part of packaging, which many tools already depend on.

pyfisch · March 8, 2021, 11:26pm

I’ve read this discussion. I’m happy to contribute the implementation and documentation for METADATA/PKG-INFO files once there is agreement on the approach. If the metadata is centralized in packaging this has a greater impact on the Python ecosystem than if I implement a metadata reader/writer separately.

westurner · March 9, 2021, 2:16am

From a recent ml thread edit: Mailman 3 Python standardization - Python-Dev - python.org :

2. Lexical analysis — Python 3.12.1 documentation

From 10. Full Grammar specification — Python 3.12.1 documentation :

The notation is a mixture of EBNF and PEG […]

Is there anything that says PEPs must also be EBNF/PEG?

test_grammar.py linked above may be of use

uranusjr · March 9, 2021, 2:53am

I don’t think there’s a hard rule. PEP 508 used the syntax of the Parsley library, which is pretty similar to (or just is?) EBNF. Honestly, IMO the differences are pretty negligible here anyway, so anything would do. EBNF symbols probably read slightly less foreign to Python developers.

Or just reject those values as standard violation? I think all Core Metadata fields except Description are meant to be single-line anyway, so we can just mandate that all non-super-extra-special fields must fit in one line and cannot contain any newline characters, and get rid of folding rules altogether.

takluyver · March 21, 2021, 10:53am

The examples in the core metadata spec for Author, Maintainer and License show multi-line values, so I think we should probably allow that. Or if we don’t want to allow it, we should update those examples.

takluyver · March 21, 2021, 11:44am

I’ve just looked into using the email module to write the metadata in Flit, and… I agree with you, it’s not ideal. The good news is that for tools writing metadata, it looks like all that’s needed (other than checking for control characters) is indenting following lines in header fields. Reading the files is a bit more complex.
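
A minimal sketch of such a writer, to show how little is involved. `format_field` is a hypothetical helper, not Flit's code, and the choice of an 8-space indent and of which control characters to reject are assumptions.

```python
import re

# Control characters other than tab and newline; a bare carriage return is rejected too.
_FORBIDDEN = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")
_KEY = re.compile(r"[A-Za-z][A-Za-z0-9-]*")


def format_field(key, value, indent="        "):
    """Format one header field, indenting every line after the first."""
    if not _KEY.fullmatch(key):
        raise ValueError(f"invalid field name: {key!r}")
    if _FORBIDDEN.search(value):
        raise ValueError(f"control character in value of {key}")
    return f"{key}: " + ("\n" + indent).join(value.split("\n")) + "\n"


print(format_field("License", "MIT\nsee LICENSE for details"), end="")
```

Because every continuation line starts with whitespace, the result stays a single folded field for an email-style reader, which avoids the unindented-continuation breakage discussed earlier in the thread.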

Should there be any maximum line length? I see references to a maximum line length in the email code, but it seems it will parse longer lines without complaint.

pyfisch · March 25, 2021, 6:34pm

I have finished the ABNF grammar for the metadata format. The grammar assumes that the files are opened with universal newlines, i.e. all newlines are a \n character. This grammar should answer your questions regarding the syntax, @takluyver.

document  = fields (LF / (LF LF payload))
fields    = field *(LF field)
field     = key ":" value
key       = ALPHA *(ALPHA / DIGIT / "-")
value     = *CHAR / obs-value
payload   = *(CHAR / LF) 

obs-value = *CHAR 1*(LF SP *CHAR)

HTAB      = %x09 ; horizontal tab
LF        = %x0A ; linefeed
SP        = %x20 ; space character
ALPHA     = %x41-5A / %x61-7A  ; A-Z / a-z
DIGIT     = %x30-39  ; 0-9
CHAR      = HTAB / SP / %x21-7e / %x80-D7FF / %xE000-10FFFF
    ; any Unicode character
    ; excluding ASCII control characters, line endings
    ; but including tab and space

That’s true. However I would disallow line-folding in header fields for new metadata files, so they should store everything in one line.

This seems to be an email thing. There is no reason to have a maximum (recommended) line length in metadata files.

brettcannon · March 25, 2021, 8:22pm

If I’m reading this right, does this mean tabs don’t imply continuation of the previous key’s value?

Are we getting to the point where, after we specify how to parse metadata 2.2, we should think about moving towards JSON for metadata as outlined in PEP 566 to get away from these issues?

fdrake · March 26, 2021, 5:21am

I’m strongly in favor of moving away from the kinda-RFC822 format, mostly because of the common sloppiness of implementation, and moving toward something where a widely-understood syntax provides ways to deal with things like encoding and multiple values in ways that are easy to work with.

Currently, projects regularly end up with a mix of setup.py, setup.cfg (“configparser syntax”; ugh!), and pyproject.toml. Poetry & flit look like they stick to pyproject.toml, but pipenv uses a separate toml-syntax file. I don’t even know how pip deals with encoding in the requirements.txt files.

While I’m not enamored of TOML syntax, adding another syntax to the mix seems… bad. If we do move to JSON-encodable metadata, let’s at least stick to TOML so we don’t raise the bar for packaging any higher.

-Fred

pyfisch · March 26, 2021, 10:58am

You’re right. Tabs do indicate continuation in the email RFCs.
The correct ABNF is: obs-value = *CHAR 1*(LF (SP / HTAB) *CHAR)
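
As a rough illustration, the grammar posted earlier, with this corrected obs-value rule, can be turned into a strict parser using regular expressions. This is a sketch and not the implementation discussed in this thread; `parse_document` is a hypothetical name, and stripping the whitespace around each value is a convenience that the grammar itself does not specify.

```python
import re

# CHAR from the grammar: tab, space, printable ASCII and non-ASCII, no surrogates.
CHAR = r"[\t\x20-\x7e\x80-\ud7ff\ue000-\U0010ffff]"
FIELD = re.compile(
    rf"(?P<key>[A-Za-z][A-Za-z0-9-]*):"
    rf"(?P<value>{CHAR}*(?:\n[ \t]{CHAR}*)*)"  # value, optionally folded (obs-value)
)
PAYLOAD = re.compile(rf"(?:{CHAR}|\n)*")


def parse_document(text):
    """Return (fields, payload); raise ValueError where the grammar is violated."""
    fields, pos = [], 0
    while True:
        match = FIELD.match(text, pos)
        if match is None:
            raise ValueError(f"expected 'Key: value' at offset {pos}")
        fields.append((match["key"], match["value"].strip(" \t")))
        pos = match.end()
        if text.startswith("\n\n", pos):
            payload = text[pos + 2:]
            if not PAYLOAD.fullmatch(payload):
                raise ValueError("control character in payload")
            return fields, payload
        if text[pos:] == "\n":
            return fields, None  # document ends after the last field
        if not text.startswith("\n", pos):
            raise ValueError(f"unexpected character or missing newline at offset {pos}")
        pos += 1


sample = (
    "Metadata-Version: 2.1\n"
    "Name: demo\n"
    "License: line one\n"
    "\tline two\n"
    "\n"
    "Long description.\n"
)
fields, payload = parse_document(sample)
print(fields, repr(payload))
```

Unlike the email parser, this rejects an unindented continuation line outright instead of silently treating the rest of the file as the body.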

brettcannon · March 26, 2021, 6:38pm

I mean I’m fine with the idea of using TOML as an interchange format, but I wasn’t trying to be controversial since there’s already a PEP explaining how to translate the current metadata format to JSON. But I also don’t know how performant the parsing of metadata for an installed package needs to be as compared to something like entry points.

uranusjr · March 26, 2021, 6:52pm

One big advantage of TOML here IMO is multiline strings. A long description in JSON format would be next to impossible to read.