Core metadata email fields & Unicode

takluyver · February 28, 2021, 1:09pm

PEP 621 defines authors and maintainers lists in the pyproject.toml file, and says of the transformation into core metadata:

If both email and name are provided, the value goes in Author-email/Maintainer-email as appropriate, with the format {name} <{email}>.

I’ve implemented this literally in Flit (PR) using Python string formatting. A contributor pointed out cases where it could go wrong, and pointed me to email.utils.formataddr() as a more careful way to achieve the same thing. However, I found that this produces odd looking results with non-ASCII names ('=?utf-8?q?Zo=C3=AB?= <zoe@example.com>'), and rejects non-ASCII email addresses altogether.
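To make the difference concrete, here is a minimal sketch comparing the literal string formatting with email.utils.formataddr(). The helper name naive_format and the sample names and addresses are made up for illustration; this is not Flit's actual code.

```python
from email.utils import formataddr


def naive_format(name: str, email: str) -> str:
    """Literal reading of the PEP 621 wording: '{name} <{email}>'."""
    return f"{name} <{email}>"


# A comma in the name is left unquoted, so a parser could read two addresses.
naive = naive_format("Doe, Jane", "jane@example.com")

# formataddr() quotes names that contain RFC special characters.
quoted = formataddr(("Doe, Jane", "jane@example.com"))

# A non-ASCII name is turned into an ASCII-only RFC 2047 encoded word.
encoded = formataddr(("Zoë", "zoe@example.com"))

# A non-ASCII address is rejected with a UnicodeError.
try:
    formataddr(("Zoë", "zoë@example.com"))
    rejected = False
except UnicodeError:
    rejected = True

print(naive)
print(quoted)
print(encoded)
print("non-ASCII address rejected:", rejected)
```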

The core metadata spec says that both Author-email and Maintainer-email “can contain a name and e-mail address in the legal forms for a RFC-822 From: header.” RFC 822 dates from 1982, and unsurprisingly, doesn’t appear to mention anything beyond ASCII (as far as I can see; I confess I haven’t read it all). There are newer standards for email which do allow non-ASCII characters.

This also goes for the core metadata format as a whole. PEP 241 (approaching its 20th birthday!) describes the format as “a single set of RFC-822 headers parseable by the rfc822.py module”, and I can’t see any changes to that in the subsequent PEPs. Do we take the email.parser stdlib module as the successor to rfc822.py? And is there a good summary of what that expects, without reading the various RFCs?

To sum up:

How should non-ASCII characters be represented in core metadata in general? Is it safe to write it as UTF-8, as Flit currently does? Or should it be escaped into a pure ASCII form?
Are there special rules for non-ASCII characters in the Author-/Maintainer-email fields?
Should we update the core metadata spec to clarify this?
Should the wording I quoted from PEP 621 mention quoting/escaping?

uranusjr · February 28, 2021, 3:11pm

For the email formatting specifically, I believe the {name} <{email}> description was intended only as a guide, not as the exact string to use when formatting the field. It’s probably a good idea to change it to say something like “formatted with the name-addr form described in RFC 2822 (using e.g. email.utils.formataddr())”.

As for the METADATA format as a whole, IIRC the issue has been brought up several times. The spec is largely based on RFC 822 because it’s convenient, but there are more and more holes as time progresses, as you’ve observed. email.parser with the compat32 policy is now the de facto standard for encoding issues, so it’s probably a good idea to try to rebase the packaging metadata to something we can have more control over going forward (and easier to grasp for everyone, since we don’t need most of the quirks in email specifications anyway). But that’s of course a lot of work nobody is particularly interested in doing.
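As a rough illustration of what "email.parser with the compat32 policy" means in practice, the sketch below parses a small, made-up METADATA document from a str. The field values are invented for the example, and it only covers text input; parsing raw bytes behaves differently for non-ASCII values.

```python
from email.parser import Parser
from email.policy import compat32

# Hypothetical METADATA content, written as plain UTF-8 text.
METADATA_TEXT = (
    "Metadata-Version: 2.1\n"
    "Name: example-pkg\n"
    "Version: 1.0\n"
    "Author-email: Zoë <zoe@example.com>\n"
    "\n"
    "Long description goes in the body.\n"
)

# Parse the headers and body using the legacy-compatible policy.
msg = Parser(policy=compat32).parsestr(METADATA_TEXT)

# Header values come back as plain strings when the input was a str.
print(msg["Name"])
print(msg["Author-email"])

# The message body holds the long description.
print(msg.get_payload())
```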

pyfisch · February 28, 2021, 8:24pm

I assume most tools (tested with pip) handle UTF-8 just fine and don’t need this garbling. Instead, I suggest quoting all names that contain problematic characters (e.g. ,<>" ) according to RFC 5322. Quoting is not necessary if the name just contains a non-ASCII character. Validating (internationalised) email addresses is rather complicated and probably not worth it.

To ensure that the metadata format isn’t corrupted one could check for each field that it doesn’t contain linebreaks or control characters. The user probably knows that they shouldn’t put newlines in their name, so an error message is sufficient.
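Such a check could look roughly like the sketch below. The function name check_single_line is hypothetical, and treating the Unicode line and paragraph separators as line breaks is an extra assumption on top of what is described above.

```python
import unicodedata

# Unicode categories that are rejected: control characters ("Cc", which
# includes \n, \r and \t) plus line/paragraph separators (an assumption).
FORBIDDEN_CATEGORIES = {"Cc", "Zl", "Zp"}


def check_single_line(field: str, value: str) -> str:
    """Return value unchanged, or raise ValueError if it could corrupt the file."""
    for ch in value:
        if unicodedata.category(ch) in FORBIDDEN_CATEGORIES:
            raise ValueError(
                f"{field} must not contain line breaks or control characters "
                f"(found {ch!r})"
            )
    return value


# Non-ASCII letters pass; an embedded newline does not.
print(check_single_line("Author", "Zoë"))
try:
    check_single_line("Author", "Eve\nName: something-else")
except ValueError as exc:
    print("rejected:", exc)
```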

So I would no longer suggest using email.utils.formataddr().

Is it practical to specify a strict variant of the METADATA/PKG-INFO/WHEEL files for programs producing them? These files would, among other things, be UTF-8 encoded, not use line folding, and always store the multi-line Description at the end of the file. Existing parsers can read these files without any changes. Many producers already do something similar and would need only minor changes. To enforce the new strict variant, PyPI should reject uploads that claim to use the strict variant (maybe indicated by a new minor version) but contain errors.
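A producer for such a strict variant might look something like this sketch. The function name write_strict_metadata and the exact rules it enforces are only one possible reading of the idea, not an agreed format.

```python
from typing import Iterable, Tuple


def write_strict_metadata(
    fields: Iterable[Tuple[str, str]], description: str = ""
) -> bytes:
    """Serialise fields one per line, description last, encoded as UTF-8."""
    lines = []
    for name, value in fields:
        # No line folding: every header value must fit on a single line.
        if "\n" in value or "\r" in value:
            raise ValueError(f"{name} must not contain line breaks")
        lines.append(f"{name}: {value}")
    text = "\n".join(lines) + "\n"
    if description:
        # The multi-line description goes in the body, after a blank line.
        text += "\n" + description
        if not description.endswith("\n"):
            text += "\n"
    return text.encode("utf-8")


data = write_strict_metadata(
    [("Metadata-Version", "2.1"), ("Name", "example-pkg"), ("Author", "Zoë")],
    description="Hello\nworld",
)
print(data.decode("utf-8"))
```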

pyfisch · March 2, 2021, 1:10pm

I had another look at the documentation and email.utils.formataddr() is deprecated. The recommended new way is to use email.headerregistry.Address. It doesn’t encode non-ASCII characters in names but correctly quotes names that need it according to the RFCs.

>>> import email.headerregistry
>>> str(email.headerregistry.Address("Zoë", addr_spec="zoe@example.org"))
'Zoë <zoe@example.org>'
>>> str(email.headerregistry.Address("<\"Hack\"er>man", addr_spec="hacker@example.org"))
'"<\\"Hack\\"er>man" <hacker@example.org>'
>>> str(email.headerregistry.Address("᚛ᚈᚑᚋ ᚄᚉᚑᚈᚈ᚜", addr_spec="tomscott@example.org"))
'᚛ᚈᚑᚋ ᚄᚉᚑᚈᚈ᚜ <tomscott@example.org>'
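Building on this, a backend could map the PEP 621 authors list onto the Author and Author-email fields roughly as sketched below. The helper name author_fields is hypothetical, and the handling of name-only and email-only entries follows my reading of PEP 621 rather than any existing tool.

```python
from email.headerregistry import Address
from typing import Dict, List, Tuple


def author_fields(authors: List[Dict[str, str]]) -> Tuple[str, str]:
    """Return (Author, Author-email) values for a PEP 621 authors list."""
    names = []
    emails = []
    for entry in authors:
        name = entry.get("name", "")
        email = entry.get("email")
        if email:
            # Address quotes display names containing special characters
            # and leaves non-ASCII names as they are.
            emails.append(str(Address(name, addr_spec=email)))
        elif name:
            # Entries without an email go in the plain Author field.
            names.append(name)
    return ", ".join(names), ", ".join(emails)


author, author_email = author_fields(
    [
        {"name": "Zoë", "email": "zoe@example.org"},
        {"name": "Doe, Jane", "email": "jane@example.org"},
        {"name": "Solo Author"},
        {"email": "team@example.org"},
    ]
)
print(author)
print(author_email)
```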

uranusjr · March 2, 2021, 5:53pm

Yeah, I think this is the way to go. The difficult part, however, would be to come up with a text that doesn’t require implementers to read any Email RFCs in their entirety. The spec can probably refer to portions of an email RFC for details, but the references need to be very specific (so the required reading can be kept minimal), and come with examples (so common implementations can just reach for a ready-made solution like the stdlib email).

pf_moore · March 2, 2021, 7:05pm

I’d also say that an important prerequisite is that someone would need to implement the new standard. At the moment, people can reach for the stdlib email package. If we change the spec, what are they going to use? Expecting everyone to write their own implementation is impractical.

I assume the answer is that it would go into packaging, but someone still needs to write the code.

brettcannon · March 2, 2021, 8:10pm

On my very long TODO list (but not that far from the top in terms of big projects) is to eventually create packaging.metadata that is capable of reading and writing pyproject.toml/PEP 621, PKG-INFO, and METADATA so all of this is centralized.

takluyver · March 6, 2021, 12:09pm

In the meantime, I might try to document the current expectations in the core metadata spec (namely, as @uranusjr pointed out, that it is the email header format as processed by email.parser with the compat32 policy). It’s certainly not ideal for a spec, but I think it’s better than nothing.

takluyver · March 6, 2021, 1:05pm

My PR for the core metadata spec: Describe the file format of core metadata (pypa/packaging.python.org, Pull Request #851)

domdfcoding · August 12, 2024, 2:56pm

For author-email, the spec still says it must follow “legal forms for a RFC-822 From: header” (i.e. ASCII only, unless you take a liberal interpretation and allow RFC-1342). I’ve noticed PyPI fails to correctly parse RFC-1342 encoded names (see fastapi · PyPI), but I think the spec needs clarifying before that can truly be considered a bug in warehouse.
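For reference, a consumer that wanted to tolerate such encoded names could decode them with the stdlib email.header helpers, along the lines of the sketch below. The function name decode_encoded_words is hypothetical, and this says nothing about how warehouse actually parses the field.

```python
from email.header import decode_header, make_header


def decode_encoded_words(value: str) -> str:
    """Decode any MIME encoded words in a header value into readable text."""
    return str(make_header(decode_header(value)))


# An encoded name of the kind produced by email.utils.formataddr().
decoded = decode_encoded_words("=?utf-8?q?Zo=C3=AB?= <zoe@example.com>")
print(decoded)

# Values without encoded words are returned unchanged.
plain = decode_encoded_words("Jane <jane@example.com>")
print(plain)
```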