The core metadata specification stores identity in an ambiguous manner, and this is reflected by ‘importlib.metadata’. This problem is then compounded by the specification mapping core metadata to ‘pyproject.toml’, initially approved as PEP 621. Summary of issues:
- In the core metadata, there are separate ‘Author’ and ‘Author-email’ fields and separate ‘Maintainer’ and ‘Maintainer-email’ fields.
- In the core metadata, the “Author-email” and “Maintainer-email” fields are inspired by but do not adhere to RFC 822/RFC 5322.
- The PEP 621 specification maps the separate ‘name’ and ‘email’ keys to a single core metadata key, either “Author-email” or “Maintainer-email”.
Currently importlib.metadata.PackageMetadata (implemented as importlib.metadata._adapters.Message) reflects the core metadata without imposing any structure on the fields, which are all strings. This should continue, but PackageMetadata should also provide a structured window into the metadata in an incremental, as-needed fashion. The current API would remain as a subset, so these changes are optional. The core metadata fields need not adhere to any specification, so PackageMetadata should provide structure without making any assumptions and with the smallest amount of transformation to the fields.
Now, on to why. In the good old days, everything was 100% python: your project would be 100% python, your build system 100% python, your build configuration 100% python, your distribution, testing, documentation… all 100% python. If you wanted to specify shared metadata such as package name, package version, package author, and so forth only once - a single “source of truth” - you’d dump all that into a python dictionary in, for example, ‘project_metadata.py’. Then you’d import ‘project_metadata’ into ‘setup.py’ to configure setuptools, into Sphinx for publishing, and even into your own project if necessary.
Nowadays if you use ‘pyproject.toml’ this metadata is exposed through importlib.metadata but you have to write a ton of overly-complicated parsing code (as you’ll see below), and that’s just not right, not for getting your own name into your own documentation.
Now, onto the problems.
Well, the core metadata is set in stone and that’s not going to change.
Notice that the core metadata provides separate name and email fields, and ‘pyproject.toml’ also specifies separate name and email fields, but they’re exposed by PackageMetadata as a single combined field that must be parsed. Ideally ‘pyproject.toml’ would map the “name” key to “Author” (or “Maintainer”) and the “email” key to “Author-email” (or “Maintainer-email”) so that zip(packageMetadata.get("Author"), packageMetadata.get("Author-email")) would retrieve the original identities. But that’s only half the problem, and the ‘pyproject.toml’ specification is unlikely to change. The other half of the problem is that all the authors (or maintainers) are joined into a comma-separated string. So you’d need parsing regardless, just less of it.
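To make the status quo concrete, here is a minimal sketch of reading identities today. The PKG-INFO fragment and the example.com addresses are made up for illustration, and email.message_from_string is only a stand-in for however the metadata file actually gets loaded. email.utils.getaddresses is the stdlib parser for RFC 5322-style address lists, so it only helps when the field happens to be well-formed, which the core metadata does not guarantee.

```python
import email
from email.utils import getaddresses

# Hypothetical PKG-INFO fragment: two authors, each declared with a
# separate name and email, end up in one combined "Author-email" field.
PKG_INFO = (
    "Metadata-Version: 2.1\n"
    "Name: example\n"
    "Author-email: Another person <a@example.com>, Yet Another name <b@example.com>\n"
    "\n"
)

metadata = email.message_from_string(PKG_INFO)

# The field comes back as a single string that has to be parsed again.
raw = metadata["Author-email"]

# Works here only because this made-up input is well-formed.
authors = getaddresses([raw])
print(authors)
```

Note that the “Author” field is absent entirely in this fragment, so metadata.get("Author") returns None even though both authors have names.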
Now that we know the specifications aren’t going to change, the only choice is for PackageMetadata to provide a thin veneer of structure. The key insight is that RFC 822/RFC 5322 requires that special characters such as “,” or “<” be quoted when used in text. It so happens that “,” is the identity delimiter for all four keys (“Author”, “Maintainer”, “Author-email”, “Maintainer-email”) and “<” is the name/email delimiter for “Author-email” and “Maintainer-email”. That’s the only assumption necessary. The core metadata is not required to adhere to any standards - as someone on #python pointed out, PKG-INFO could be hand-written and/or complete junk. Some metadata may be parsed incorrectly but the structured interface will be optional. If you know that your own package uses incompatible metadata, use your own parser. However when metadata can be anything, a line has to be drawn in the sand somewhere. I’ve tried to keep the assumptions behind that line as few as possible.
Now, the details.
When parsing, we need to be as flexible as possible. When splitting comma-separated fields, we ignore commas in double or single quoted strings. We also ignore commas not followed by a space, since python tooling seems to use ", ".join(...), and that’s common in prose too, so feel free to use unquoted commas in your name or email address. If a quoted string is unbalanced we treat the quote as any other character, so if you use a single quote that’s fine too.
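The splitting rules above can be tried out in isolation. This is a sketch that copies the entry regex from the patch further down; the helper name split_entries and the sample strings are mine, purely for illustration.

```python
import re

# Entry regex copied from the patch below (re.VERBOSE ignores the layout).
_entries = re.compile(r"""
( (?: (["']) (?:(?!\2|\\).|\\.)* \2   # a balanced quoted run
  |   (?!,\ ).                        # any char that does not begin the separator
  )+
)
(?:,\ |$)                             # comma-space separator, or end of string
""", re.VERBOSE)


def split_entries(s):
    """Hypothetical helper: return the list entries found in ``s``."""
    return [m.group(1) for m in _entries.finditer(s)]


# A plain comma-space separated list.
plain = split_entries("Another person, Yet Another name")

# A comma without a following space, and a comma inside quotes, are kept.
tricky = split_entries('Doe,Jane <j@example.com>, "Smith, John" <s@example.com>')

# A lone (unbalanced) quote is treated like any other character.
lone_quote = split_entries("O'Brien, Pat")

print(plain, tricky, lone_quote, sep="\n")
```

One consequence of this design worth knowing: two lone apostrophes in different entries would pair up and be read as one quoted run, so the entries between them would not be split.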
When splitting name-email entries, we split on the first unquoted “<” preceded by a run of whitespace. You can omit your name, email address or both (none of which is RFC 822/RFC 5322 compliant) and that’s fine. The strings “”, “<”, and " <" omit both name and email. The string “a <” omits email. The string “<a” omits name. Once again, if a quoted string is unbalanced we treat the quote as any other character. The string ’ a"b"c "d <end ’ splits into name: ’ a"b"c "d’ and email: 'end '. The string ’ a"b"c "d <en"d ’, on the other hand, becomes name: ’ a"b"c "d <en"d ', with no email. The closing “>” in an email is only removed if present, so feel free to omit it!
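The cases listed above can be checked directly. Again this is only a sketch: the regex is copied from the patch further down, while the helper name split_name_email is hypothetical and the inputs are assumed to be single-line strings.

```python
import re

# Name-email regex copied from the patch below.
_name_email = re.compile(r"""
^
  ( (?!<)                                # no name if the entry starts with '<'
    (?: (["']) (?:(?!\2|\\).|\\.)* \2    # a balanced quoted run
    |   (?!\ +<).                        # any char not starting "spaces then <"
    )+
  )?
  (?: \ *<
      (.+?)?                             # the email, None if empty
      >?                                 # optional closing '>'
  )?
$
""", re.VERBOSE)


def split_name_email(entry):
    """Hypothetical helper: return (name, email); either may be None."""
    m = _name_email.match(entry)
    # Group 2 is the quote character, so name and email are groups 1 and 3.
    return m.group(1), m.group(3)


for entry in ["", "<", " <", "a <", "<a",
              "Frankenstein <mary.shelley@unnamed>",
              ' a"b"c "d <end ',
              ' a"b"c "d <en"d ']:
    print(repr(entry), "->", split_name_email(entry))
```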
We do minimal merging or filtering of name-email entries. Entries which omit both name and email are dropped. Duplicate entries are dropped. That’s it. If the core metadata contains: “Author: Frankenstein\nAuthor-email: Frankenstein <mary.shelley@unnamed>” then these are considered two separate identities:
- name: Frankenstein
  email: None
- name: Frankenstein
  email: mary.shelley@unnamed
I consider any further merging out-of-scope. What if one author uses multiple email addresses, or a given email is shared among multiple authors? What if Mary Shelley, using the nom de plume “Frankenstein”, collaborated with (an actually different) Dr. Frankenstein? We don’t want to apply any transformations that run the risk of silently dropping data.
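The merging rule is small enough to show on its own. This sketch uses a stripped-down copy of the Ident dataclass from the patch below (without __iter__, and with Optional annotations) and builds the sets by hand rather than parsing anything.

```python
import dataclasses
from typing import Optional


# Frozen + eq makes instances hashable, so a set drops exact duplicates.
@dataclasses.dataclass(eq=True, frozen=True)
class Ident:
    name: Optional[str]
    email: Optional[str]


# What "Author: Frankenstein" would contribute.
from_author = {Ident("Frankenstein", None)}

# What "Author-email" would contribute; the repeated entry collapses.
from_author_email = {
    Ident("Frankenstein", "mary.shelley@unnamed"),
    Ident("Frankenstein", "mary.shelley@unnamed"),
}

# The union keeps the name-only and name+email identities separate.
merged = from_author | from_author_email
print(sorted(merged, key=lambda i: i.email or ""))
```

Because equality compares both fields, nothing is ever merged on name alone or email alone, which is exactly the "no silent data loss" property argued for above.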
And now finally, the code (diffed against Python 3.13.0a0):
diff --git a/Lib/importlib/metadata/_adapters.py b/Lib/importlib/metadata/_adapters.py
index 6aed69a308..5f75536b9a 100644
--- a/Lib/importlib/metadata/_adapters.py
+++ b/Lib/importlib/metadata/_adapters.py
@@ -3,6 +3,7 @@
 import re
 import textwrap
 import email.message
+import dataclasses
 from ._text import FoldedCase
@@ -15,6 +16,77 @@
     stacklevel=2,
 )
+# It looks like RFC5322 but it's much much worse. The only takeaway from
+# RFC5322 is that special characters such as "," and "<" must be quoted
+# when used as text.
+
+# Split an RFC5322-ish list:
+# 1. Alt 1: match single or double quotes and handle escape characters.
+# 2. Alt 2: match anything except ',' followed by a space. If quote
+#    characters are unbalanced, they will be matched here.
+# 3. Match the alternatives at least once, in any order...
+# 4. ... and capture them.
+# 5. Match the list separator, or end-of-string.
+# Result:
+#   group 1 (list entry): None or non-empty string.
+
+_entries = re.compile(r"""
+( (?: (["']) (?:(?!\2|\\).|\\.)* \2  # 1
+  |   (?!,\ ).                       # 2
+  )+                                 # 3
+)                                    # 4
+(?:,\ |$)                            # 5
+""", re.VERBOSE)
+
+# Split an RFC5322-ish name-email entry:
+# 01. Start at the beginning.
+# 02. If it starts with '<', skip this name-capturing regex.
+# 03. Alt 1: match single or double quotes and handle escape characters.
+# 04. Alt 2: match anything except one or more spaces followed by '<'. If
+#     quote characters are unbalanced, they will be matched here.
+# 05. Match the alternatives at least once, in any order...
+# 06. ... but optionally so the result will be 'None' rather than an empty
+#     string.
+# 07. If the name portion is missing there may not be whitespace before
+#     '<'.
+# 08. Capture everything after '<' with a non-greedy quantifier to allow
+#     for the next regex. Use '+','?' to force an empty string to become
+#     'None'.
+# 09. Strip the final '>', if it exists.
+# 10. Allow for missing email section.
+# 11. Finish at the end.
+# Result:
+#   group 1 (name): None or non-empty string.
+#   group 3 (email): None or non-empty string.
+
+_name_email = re.compile(r"""
+^                                        # 01
+  ( (?!<)                                # 02
+    (?: (["']) (?:(?!\2|\\).|\\.)* \2    # 03
+    |   (?!\ +<).                        # 04
+    )+                                   # 05
+  )?                                     # 06
+  (?: \ *<                               # 07
+      (.+?)?                             # 08
+      >?                                 # 09
+  )?                                     # 10
+$                                        # 11
+""", re.VERBOSE)
+
+
+@dataclasses.dataclass(eq=True, frozen=True)
+class Ident:
+    """
+    A container for identity attributes, used by the author or
+    maintainer fields.
+    """
+    name: str|None
+    email: str|None
+
+    def __iter__(self):
+        return (getattr(self, field.name) for
+                field in dataclasses.fields(self))
+
 class Message(email.message.Message):
     multiple_use_keys = set(
@@ -87,3 +159,33 @@ def transform(key):
             return tk, value
         return dict(map(transform, map(FoldedCase, self)))
+
+    def _parse_idents(self, s):
+        es = (i[0] for i in _entries.findall(s))
+        # Match objects cannot be sliced; groups() is (name, quote, email),
+        # so [::2] keeps the name and the email.
+        es = (_name_email.match(i).groups()[::2] for i in es)
+        es = {Ident(*i) for i in es if i != (None, None)}
+        return es
+
+    def _parse_names(self, s):
+        es = (i[0] for i in _entries.findall(s))
+        es = {Ident(i, None) for i in es}
+        return es
+
+    def _parse_names_idents(self, fn, fi):
+        sn = self.get(fn, "")
+        si = self.get(fi, "")
+        return self._parse_names(sn) | self._parse_idents(si)
+
+    @property
+    def authors(self):
+        """
+        Minimal parsing for "Author" and "Author-email" fields.
+        """
+        return self._parse_names_idents("Author", "Author-email")
+
+    @property
+    def maintainers(self):
+        """
+        Minimal parsing for "Maintainer" and "Maintainer-email" fields.
+        """
+        return self._parse_names_idents("Maintainer", "Maintainer-email")
diff --git a/Lib/importlib/metadata/_meta.py b/Lib/importlib/metadata/_meta.py
index c9a7ef906a..572f7030c0 100644
--- a/Lib/importlib/metadata/_meta.py
+++ b/Lib/importlib/metadata/_meta.py
@@ -1,6 +1,8 @@
 from typing import Protocol
 from typing import Any, Dict, Iterator, List, Optional, TypeVar, Union, overload
+from ._adapters import Ident
+
 _T = TypeVar("_T")
@@ -43,6 +45,18 @@ def json(self) -> Dict[str, Union[str, List[str]]]:
         A JSON-compatible form of the metadata.
         """
+    @property
+    def authors(self) -> set[Ident]:
+        """
+        Minimal parsing for "Author" and "Author-email" fields.
+        """
+
+    @property
+    def maintainers(self) -> set[Ident]:
+        """
+        Minimal parsing for "Maintainer" and "Maintainer-email" fields.
+        """
+
 class SimplePath(Protocol[_T]):
     """
Now for possible conflicts. The core metadata specification includes this example for the “Author” field, which presumably features a single author (or two authors on the first line, and a shared address and email on the second):
Author: C. Schultz, Universal Features Syndicate,
        Los Angeles, CA <cschultz@peanuts.example.com>
While for multiple authors, ‘pyproject.toml’ can produce this:
Author: Another person, Yet Another name
It is impossible to disambiguate individual authors consistently across the two.
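A quick way to see the conflict is to run both examples through the same comma-space split. This sketch makes two simplifications of my own: the folded “Author” value from the specification is joined onto one line, and since neither string contains quotes, a plain str.split(", ") stands in for the entry regex from the patch.

```python
# The specification's example, with its two folded lines joined into one
# (an assumption made here to keep the sketch simple).
spec_example = (
    "C. Schultz, Universal Features Syndicate, "
    "Los Angeles, CA <cschultz@peanuts.example.com>"
)

# What 'pyproject.toml' tooling can produce for two authors.
pyproject_example = "Another person, Yet Another name"

# Stand-in for the patch's entry regex; equivalent for unquoted input.
spec_entries = spec_example.split(", ")
pyproject_entries = pyproject_example.split(", ")

# One intended author (or two) comes out as four entries...
print(len(spec_entries), spec_entries)
# ...while two intended authors correctly come out as two.
print(len(pyproject_entries), pyproject_entries)
```

Any rule that keeps the first string together as one identity would also glue the two authors in the second string together, and vice versa.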