Combining author/maintainer names & emails in core metadata

takluyver · February 17, 2025, 10:00am

Suppose pyproject.toml contains a list like this:

maintainers = [
  {name = "Person one", email = "person.one@example.org" },
  {name = "Person two"},
  {email = "person.three@example.org" },
]

(Everything I’m discussing applies equally to author lists, but I won’t keep repeating that.)

The relevant spec - based on PEP 621 - says:

If only name is provided, the value goes in Author or Maintainer as appropriate.

If only email is provided, the value goes in Author-email or Maintainer-email as appropriate.

If both email and name are provided, the value goes in Author-email or Maintainer-email as appropriate, with the format {name} <{email}>.

Multiple values should be separated by commas.

This implies that the resulting core metadata should look like:

Maintainer: Person two
Maintainer-email: Person one <person.one@example.org>, person.three@example.org

This is somewhat unintuitive, as neither field describes all the maintainers. And with both fields present, PyPI shows the names from the Maintainer field as a link to the emails from the other field - i.e. only the names without email addresses are visible (issue). This was brought to my attention by someone opening an issue on Flit.
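For reference, here is a minimal sketch of the mapping under the rules quoted above. The function name `pep621_fields` and its `kind` parameter are made up for illustration and do not come from any build backend. Real tools also quote names containing special characters, which this sketch skips.

```python
def pep621_fields(people, kind="Maintainer"):
    """Map an authors/maintainers list to core metadata per the quoted rules.

    Hypothetical helper: assumes every entry has a name, an email, or both.
    """
    names = []   # entries with only a name -> Author / Maintainer
    emails = []  # entries with an email    -> Author-email / Maintainer-email
    for person in people:
        name = person.get("name")
        email = person.get("email")
        if name and email:
            emails.append(f"{name} <{email}>")
        elif email:
            emails.append(email)
        elif name:
            names.append(name)

    fields = {}
    if names:
        fields[kind] = ", ".join(names)
    if emails:
        fields[f"{kind}-email"] = ", ".join(emails)
    return fields


maintainers = [
    {"name": "Person one", "email": "person.one@example.org"},
    {"name": "Person two"},
    {"email": "person.three@example.org"},
]

for key, value in pep621_fields(maintainers).items():
    print(f"{key}: {value}")
# Maintainer: Person two
# Maintainer-email: Person one <person.one@example.org>, person.three@example.org
```

Note that Person two appears only in Maintainer and the other two only in Maintainer-email, which is the disjoint split described above.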

For packages built following these rules, we could assume that the Maintainer & Maintainer-email fields are disjoint, and concatenating them gives the full list of maintainers. But packages built from setup.py or setup.cfg don’t follow this pattern - the corresponding input fields are passed through verbatim, which normally means a name in Maintainer & a bare email address in Maintainer-email.

What can we do about this? I’d suggest we change the instructions for interpreting pyproject.toml files, to something like this:

If all the people listed have an email address, fill the Author-email/Maintainer-email field, with name <email> where a name is provided, or a bare email address if not, in a comma separated list. Filling Author/Maintainer as well is not recommended, since the information is redundant.
If the list contains anyone without an email address, everyone in the list should be represented in the Author/Maintainer field - by name if available, by email if not - as a comma separated list. Anyone with an email address should also be in the Author-email/Maintainer-email field as above.
If people are listed but none specify email addresses, only the Author/Maintainer field should be filled.
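The three suggested rules could be sketched roughly as follows. This is one reading of the proposal, not spec text: `proposed_fields` is a hypothetical name, and the sketch assumes every entry has at least a name or an email and ignores quoting of special characters.

```python
def proposed_fields(people, kind="Maintainer"):
    """Sketch of the proposed mapping (hypothetical helper, not spec text)."""
    with_email = [p for p in people if p.get("email")]

    # Everyone with an email goes into the -email field, as "name <email>"
    # when a name is given, or as a bare address otherwise.
    email_entries = [
        f"{p['name']} <{p['email']}>" if p.get("name") else p["email"]
        for p in with_email
    ]

    fields = {}
    if len(with_email) < len(people):
        # At least one person has no email: list everyone in the plain
        # field, by name if available, by email if not.
        fields[kind] = ", ".join(p.get("name") or p["email"] for p in people)
    if email_entries:
        fields[f"{kind}-email"] = ", ".join(email_entries)
    return fields


mixed = [
    {"name": "Person one", "email": "person.one@example.org"},
    {"name": "Person two"},
    {"email": "person.three@example.org"},
]

for key, value in proposed_fields(mixed).items():
    print(f"{key}: {value}")
# Maintainer: Person one, Person two, person.three@example.org
# Maintainer-email: Person one <person.one@example.org>, person.three@example.org
```

With this reading, the plain field is a complete list whenever it is present, and the -email field is unchanged from what the current rules produce.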

I think this gives us as much of a structured list as possible in the -email metadata fields, with a minimum of compatibility concerns. I think it also means that how PyPI presents these fields becomes valid again. But on the down side, it would mean that packages built from pyproject.toml with mixed author/maintainer lists under the current spec are retrospectively wrong.

0cjs · February 17, 2025, 10:36am

I don’t have strong opinions about this, nor much experience with handling these fields in Python packaging. But I’m kind of feeling that the best option when building these fields would be to use either one or the other, but never both. That seems to me as if it would minimise confusion about which to display (and possibly parse).

(And here I’m using the Author: field as an example, because that’s where I happen to be having the issue, but it applies equally to Maintainer: of course.)

pyproject-build already seems to be doing this if all authors have an e-mail address: in my r8format package I have a pyproject.toml stating:

authors = [
    { name = 'Curt J. Sampson', email = 'cjs@cynic.net' },
    { name = 'Stuart Croy', email = 'stuartcroy@mac.com' },
]

and the resulting output in r8format.egg-info/PKG-INFO is:

...
Summary: Retrocomputing 8-bit file format manipulation tools
Author-email: "Curt J. Sampson" <cjs@cynic.net>, Stuart Croy <stuartcroy@mac.com>
Project-URL: homepage, https://github.com/mc68-net/r8format
...

(PyPI displays only the first of these entries; this is what prompted me to comment on the already existing pypi/warehouse issue #9400.)
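As an illustration of what a consumer could do with such a field, Python's standard library can split it into every (name, address) pair rather than just the first. This is only a sketch of one possible approach and says nothing about how PyPI actually parses the field.

```python
from email.utils import getaddresses

# The Author-email value from the PKG-INFO excerpt above.
author_email = '"Curt J. Sampson" <cjs@cynic.net>, Stuart Croy <stuartcroy@mac.com>'

# getaddresses() takes a list of header values and returns (name, address) pairs.
authors = getaddresses([author_email])

for name, address in authors:
    print(f"{name} -> {address}")
# Curt J. Sampson -> cjs@cynic.net
# Stuart Croy -> stuartcroy@mac.com
```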

However, it’s not clear to me what PyPI does (and what anybody should do) for entries in Author-email: without an e-mail address.

If we can manage it without too much disruption, I think the clearest option would be to deprecate Author-email: (which I am guessing was designed to hold just the e-mail address corresponding to the one name expected in Author:) and always put the comma-separated list of author names, each with or without an e-mail address in <...@...> format, into the Author: line. Eventually I imagine everybody will come around to dealing with that correctly (including making the e-mail address a link, if appropriate), and tools that still just display the field as-is will simply show the e-mail address as part of the “name.”

takluyver · February 17, 2025, 11:03am

In this case I think your metadata is fine - this looks like a related but distinct bug in PyPI.

I think that would be invalid metadata.

The first version of the metadata spec from 2001 already said that Author-email “can contain a name and e-mail address in the legal forms for a RFC 822 ‘From:’ header.” This field was originally required, while Author was optional.

I think automated tools stand a better chance of making sense of lists in the -email fields, because there has always been some structure to it, at least as a recommendation (it probably hasn’t always been enforced). That’s why I favour using that where possible.

0cjs · February 17, 2025, 11:51am

A good point, but the problem I see there is that the structure seems to be such that each entry in the list must include an e-mail address. (RFC822 says, in §C.3.2, “The “From” field must contain machine-usable addresses.”) Thus using Author-email: and adhering to RFC822 seems to have unfortunate consequences whichever route we take.

Use only Author-email: and deprecate Author:: we lose any authors that do not supply e-mail addresses.
Use both fields, and put only authors without an e-mail address in Author:: we now have two disjoint fields and someone who wants a list of all the authors has rather a bit of work to merge the two. (And the general case isn’t really automatically mergeable; consider Author: Curt J. Sampson and Author-email: Curt Sampson <cjs@...>, or Author: Curt J. Sampson and Author-email: <cjs@...>.)
Add all Author-email: entries also to Author:. For the second case above, does that mean our full author list is one entry, Curt J. Sampson <cjs@...>, or two entries, Curt J. Sampson, <cjs@...>?

Now it’s nice to say that, “moving forward, tools shouldn’t produce those sorts of things,” but even if some tools are doing the right thing to generate only data that makes possible an automatic merge algorithm, there surely will still be plenty of existing packages out there in the old “both fields” format, and probably new ones generated in that format by other tools and older versions of existing tools.

Thus, I feel (though I’m willing to be convinced otherwise) that the easiest way forward on the tools that need to parse these fields is to have the latest tools that generate them switch to generating a list of “name or RFC-822 name/e-mail” entries in just Author: which will work for everybody right now albeit possibly without automatic e-mail address extraction until they implement that.

And yes, this is a spec change. I don’t know how hard that would be to push through, but the overall amount of work required from the community (including everybody updating stuff on both sides, and dealing with the backwards-compatible new spec—albeit with the loss of automatic parsing of e-mail addresses on old systems, until they update) seems less, to me. Or perhaps it’s better to say, I don’t see any reasonable way ever to have a parser reliably generate a list of all contributors, with their e-mail address if they have one but without if they don’t, without a spec change.

Yes, but it’s kind of what led me down this twisting path. I am working on the assumption that PyPI’s current code, if there is no Author-email: field, would display either the Author: field or the first element of the Author: field as text, which would include the e-mail address. So the loss there would just be that it’s not there in clickable form.

domdfcoding · February 18, 2025, 10:24am

I vaguely remember there being discussion about these fields when PEP 621 was proposed, but that was four years ago.

Am I right that with your proposed changes Maintainer-email where present will be able to go straight into an email’s To: line and reach all intended maintainers, versus the current setup where some maintainers may be in one field but not the other? That sounds like a worthwhile benefit.

Did you consider making Maintainer and Maintainer-email multiple use, with each one containing one author’s name or RFC 822 format name and email, without duplication between fields, and a bump in version number? That would remove ambiguity as to the format being used, but downstream consumers like PyPI would have to change.

takluyver · February 18, 2025, 4:53pm

In fact I think my proposal doesn’t change the contents of Maintainer-email from the current recommendation - either way, this includes all maintainers with email addresses.

I’m effectively proposing a change to only the Maintainer (& Author) field:

Classic meaning: all maintainer names
PEP 621: maintainers without email addresses
My proposal: all maintainers, may be omitted if redundant with Maintainer-email

I vaguely thought about it, but metadata changes always face the obstacle of existing packages. Even if we make these fields multiple-use in metadata 2.5, any tool consuming metadata still needs to handle two decades worth of packages where these are single use fields with multiple names in them; the new format is just another possibility to deal with. So unless there’s a concrete benefit to that, I think it’s more valuable to build a consensus on how to use the existing format.

(That consensus could be to double down on PEP 621’s suggestion that Maintainer & Maintainer-email are disjoint lists, ‘fix’ PyPI to treat them as such, ‘fix’ setuptools to make metadata like that, and accept that lots of older packages are wrong. But that feels less likely to me.)

0cjs · February 18, 2025, 10:07pm

Well, that’s one definition of “all intended maintainers”: “only those maintainers who have e-mail addresses.”

The issue I have with that is: what if the system parsing the headers has a different purpose, say, to display the names of all maintainers? Now you have to do quite a bit of work, and in the end you still can’t know if you have the right list.

On the other hand, going from “single comma-separated list of all maintainers, with or without e-mail addresses” to “list of maintainers with e-mail addresses” is as simple as splitting the Maintainer: list by commas and removing all entries that are not an RFC822-valid e-mail address entry.
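A rough sketch of that split-and-filter step, under simplifying assumptions: the name `split_people` and the regular expression are illustrative only, the address check is far looser than real RFC 822 validation, and the naive comma split would break on quoted names that themselves contain commas.

```python
import re

# Matches either "Some Name <user@host>" (name optional) or a bare "user@host".
# Deliberately loose; this is not a full RFC 822 address grammar.
_ENTRY = re.compile(
    r"^(?:(?P<name>.*?)\s*<(?P<addr>[^<>\s]+@[^<>\s]+)>"
    r"|(?P<bare>[^<>\s]+@[^<>\s]+))$"
)


def split_people(field):
    """Split a combined comma-separated field into (name, email) pairs.

    Either element may be None. Hypothetical helper for illustration.
    """
    people = []
    for raw in field.split(","):
        entry = raw.strip()
        if not entry:
            continue
        match = _ENTRY.match(entry)
        if match is None:
            people.append((entry, None))                # name only
        elif match.group("bare"):
            people.append((None, match.group("bare")))  # bare address
        else:
            people.append((match.group("name") or None, match.group("addr")))
    return people


combined = "Person one <person.one@example.org>, Person two, person.three@example.org"
everyone = split_people(combined)
reachable = [(name, addr) for name, addr in everyone if addr]

print(everyone)
print(reachable)
```

Here `everyone` is the full list for display purposes, and `reachable` is the subset that could go into an e-mail To: line.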

Perhaps the core question we’re debating here is:

Is it better to produce information easily parsable by clients, even though some current simple auto-detection of e-mail addresses will break until those clients make a simple change; versus
Is it better to produce information that will cause a certain limited set of current clients to break less, but make it pretty much impossible for clients to extract correct information even from future packages?