Structure for importlib metadata identities

The core metadata specification stores identity in an ambiguous manner, and this is reflected by ‘importlib.metadata’. The problem is compounded by the specification that maps ‘pyproject.toml’ to the core metadata, initially approved as PEP 621. Summary of issues:

  1. In the core metadata, the existence of separate ‘Author’ and ‘Author-email’ fields and separate ‘Maintainer’ and ‘Maintainer-email’ fields.
  2. In the core metadata, the “Author-email” and “Maintainer-email” fields are inspired by, but do not adhere to, RFC 822/RFC 5322.
  3. The PEP 621 specification maps the separate ‘name’ and ‘email’ keys to a single core metadata field, either “Author-email” or “Maintainer-email”.

Currently importlib.metadata.PackageMetadata (implemented as importlib.metadata._adapters.Message) reflects the core metadata without imposing any structure on the fields, which are all strings. This should continue, but PackageMetadata should also provide a structured window into the metadata in an incremental, as-needed fashion. The current API would remain as a subset, so these changes are optional. The core metadata fields need not adhere to any specification, so PackageMetadata should provide structure with as few assumptions as possible and the smallest amount of transformation to the fields.

Now, on to why. In the good old days, everything was 100% python: your project would be 100% python, your build system 100% python, your build configuration 100% python, your distribution, testing, documentation… all 100% python. If you wanted to specify shared metadata such as package name, package version, package author, and so forth only once - a single “source of truth” - you’d dump all that into a python dictionary in, for example, ‘project_metadata.py’. Then you’d import ‘project_metadata’ into ‘setup.py’ to configure setuptools, into Sphinx for publishing, and even into your own project if necessary.

Nowadays, if you use ‘pyproject.toml’, this metadata is exposed through importlib.metadata, but you have to write a ton of overly complicated parsing code (as you’ll see below), and that’s just not right - not for getting your own name into your own documentation.

Now, onto the problems.

Well, the core metadata is set in stone and that’s not going to change.

Notice that the core metadata provides separate name and email fields, and ‘pyproject.toml’ also specifies separate name and email fields, but they’re exposed by PackageMetadata as a single combined field that must be parsed. Ideally ‘pyproject.toml’ would map the “name” key to “Author” (or “Maintainer”) and the “email” key to “Author-email” (or “Maintainer-email”) so that zip(packageMetadata.get("Author"), packageMetadata.get("Author-email")) would retrieve the original identities. But that’s only half the problem, and the ‘pyproject.toml’ specification is unlikely to change. The other half of the problem is that all the authors (or maintainers) are joined into a comma-separated string, so you’d need parsing regardless - just less of it.

Now that we know the specifications aren’t going to change, the only choice is for PackageMetadata to provide a thin veneer of structure. The key insight is that RFC 822/RFC 5322 requires that special characters such as “,” or “<” be quoted when used in text. It so happens that “,” is the identity delimiter for all four keys (“Author”, “Maintainer”, “Author-email”, “Maintainer-email”) and “<” is the name/email delimiter for “Author-email” and “Maintainer-email”. That’s the only assumption necessary. The core metadata is not required to adhere to any standards - as someone on #python pointed out, PKG-INFO could be hand-written and/or complete junk. Some metadata may be parsed incorrectly, but the structured interface will be optional. If you know that your own package uses incompatible metadata, use your own parser. However, when metadata can be anything, a line has to be drawn in the sand somewhere. I’ve tried to draw that line as conservatively as possible.

Now, the details.

When parsing, we need to be as flexible as possible. When splitting comma-separated fields, we ignore commas inside double- or single-quoted strings. We also ignore commas not followed by a space, since python tooling seems to use ", ".join(...), and that’s common in prose too, so feel free to use unquoted commas in your name or email address. If a quoted string is unbalanced we treat the quote as any other character, so if you use a single quote that’s fine too.
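
To make the splitting rules concrete, here is a small demonstration (a sketch only: it copies the ‘_entries’ regex from the diff further below, so the two must be kept in sync):

import re

_entries = re.compile(r"""
( (?: (["']) (?:(?!\2|\\).|\\.)* \2     # a quoted string, honoring escapes
  |   (?!,\ ).                          # any character not starting ", "
  )+
)
(?:,\ |$)                               # an entry ends at ", " or end-of-string
""", re.VERBOSE)

def split_entries(s):
    # findall yields (entry, quote) tuples; keep only the entry.
    return [m[0] for m in _entries.findall(s)]

print(split_entries('"Doe, John", Jane'))  # ['"Doe, John"', 'Jane'] - quoted comma kept
print(split_entries('Doe,John, Jane'))     # ['Doe,John', 'Jane'] - no space, no split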

When splitting name-email entries, we split on the first unquoted “<” preceded by a run of whitespace (or on a leading “<”). You can omit your name, your email address, or both (none of which is RFC 822/RFC 5322 compliant) and that’s fine. The strings “”, “<”, and " <" omit both name and email. The string “a <” omits the email. The string “<a” omits the name. Once again, if a quoted string is unbalanced we treat the quote as any other character. The string ’ a"b"c "d <end ’ splits into name ’ a"b"c "d’ and email 'end ', while the string ’ a"b"c "d <en"d ’ becomes name ’ a"b"c "d <en"d ' with no email. The closing “>” in an email is only removed if present, so feel free to omit it!
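
The same can be demonstrated for name-email entries (again a sketch that duplicates the ‘_name_email’ regex from the diff below; groups 1 and 3 hold the name and email):

import re

_name_email = re.compile(r"""
^
  ( (?!<)                                   # a leading '<' means no name
    (?: (["']) (?:(?!\2|\\).|\\.)* \2       # a quoted string, honoring escapes
    |   (?!\ +<).                           # any char not starting spaces+'<'
    )+
  )?
  (?: \ *<                                  # optional email section
      (.+?)?                                # lazily capture the address
      >?                                    # drop the closing '>' if present
  )?
$
""", re.VERBOSE)

def split_ident(s):
    return _name_email.match(s).groups()[::2]   # (name, email)

print(split_ident('Pradyun Gedam <pradyun@example.com>'))  # ('Pradyun Gedam', 'pradyun@example.com')
print(split_ident('a <'))   # ('a', None)
print(split_ident('<a'))    # (None, 'a')
print(split_ident(' <'))    # (None, None)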

We do minimal merging or filtering of name-email entries. Entries which omit both name and email are dropped. Duplicate entries are dropped. That’s it. If the core metadata contains: “Author: Frankenstein\nAuthor-email: Frankenstein <mary.shelley@unnamed>” then these are considered two separate identities:

  • name: Frankenstein
    email: None
  • name: Frankenstein
    email: mary.shelley@unnamed

I consider any further merging out-of-scope. What if one author uses multiple email addresses, or a given email is shared among multiple authors? What if Mary Shelley, using the nom de plume “Frankenstein”, collaborated with (an actually different) Dr. Frankenstein? We don’t want to apply any transformations that risk silently dropping data.
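
The deduplication falls out of the set semantics of the frozen ‘Ident’ dataclass; a minimal sketch:

import dataclasses

@dataclasses.dataclass(eq=True, frozen=True)
class Ident:
    name: str | None
    email: str | None

# The two Frankenstein identities differ in their email attribute, so a
# set union keeps both; exact duplicates would collapse into one.
print({Ident("Frankenstein", None)} | {Ident("Frankenstein", "mary.shelley@unnamed")})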

And now finally, the code (diffed against Python 3.13.0a0):

diff --git a/Lib/importlib/metadata/_adapters.py b/Lib/importlib/metadata/_adapters.py
index 6aed69a308..5f75536b9a 100644
--- a/Lib/importlib/metadata/_adapters.py
+++ b/Lib/importlib/metadata/_adapters.py
@@ -3,6 +3,7 @@
 import re
 import textwrap
 import email.message
+import dataclasses
 
 from ._text import FoldedCase
 
@@ -15,6 +16,77 @@
     stacklevel=2,
 )
 
+# It looks like RFC5322 but it's much much worse. The only takeaway from
+# RFC5322 is that special characters such as "," and "<" must be quoted
+# when used as text.
+
+# Split an RFC5322-ish list:
+# 1. Alt 1: match single or double quotes and handle escape characters.
+# 2. Alt 2: match anything except ',' followed by a space. If quote
+#    characters are unbalanced, they will be matched here.
+# 3. Match the alternatives at least once, in any order...
+# 4. ... and capture them.
+# 5. Match the list separator, or end-of-string.
+# Result:
+#   group 1 (list entry): None or non-empty string.
+
+_entries = re.compile(r"""
+( (?: (["']) (?:(?!\2|\\).|\\.)* \2     # 1
+  |   (?!,\ ).                          # 2
+  )+                                    # 3
+)                                       # 4
+(?:,\ |$)                               # 5
+""", re.VERBOSE)
+
+# Split an RFC5322-ish name-email entry:
+# 01. Start at the beginning.
+# 02. If it starts with '<', skip this name-capturing regex.
+# 03. Alt 1: match single or double quotes and handle escape characters.
+# 04. Alt 2: match anything except one or more spaces followed by '<'. If
+#     quote characters are unbalanced, they will be matched here.
+# 05. Match the alternatives at least once, in any order...
+# 06. ... but optionally so the result will be 'None' rather than an empty
+#     string.
+# 07. If the name portion is missing there may not be whitespace before
+#     '<'.
+# 08. Capture everything after '<' with a non-greedy quantifier to allow
+#     for the next regex. Use '+','?' to force an empty string to become
+#     'None'.
+# 09. Strip the final '>', if it exists.
+# 10. Allow for missing email section.
+# 11. Finish at the end.
+# Result:
+#   group 1 (name):  None or non-empty string.
+#   group 3 (email): None or non-empty string.
+
+_name_email = re.compile(r"""
+^                                           # 01
+  ( (?!<)                                   # 02
+    (?: (["']) (?:(?!\2|\\).|\\.)* \2       # 03
+    |   (?!\ +<).                           # 04
+    )+                                      # 05
+  )?                                        # 06
+  (?: \ *<                                  # 07
+      (.+?)?                                # 08
+      >?                                    # 09
+  )?                                        # 10
+$                                           # 11
+""", re.VERBOSE)
+
+
+@dataclasses.dataclass(eq=True, frozen=True)
+class Ident:
+    """
+    A container for identity attributes, used by the author or
+    maintainer fields.
+    """
+    name: str|None
+    email: str|None
+
+    def __iter__(self):
+        return (getattr(self, field.name) for
+                field in dataclasses.fields(self))
+
 
 class Message(email.message.Message):
     multiple_use_keys = set(
@@ -87,3 +159,33 @@ def transform(key):
             return tk, value
 
         return dict(map(transform, map(FoldedCase, self)))
+
+    def _parse_idents(self, s):
+        es = (i[0] for i in _entries.findall(s))
+        es = (_name_email.match(i).groups()[::2] for i in es)
+        es = {Ident(*i) for i in es if i != (None, None)}
+        return es
+
+    def _parse_names(self, s):
+        es = (i[0] for i in _entries.findall(s))
+        es = {Ident(i, None) for i in es}
+        return es
+
+    def _parse_names_idents(self, fn, fi):
+        sn = self.get(fn, "")
+        si = self.get(fi, "")
+        return self._parse_names(sn) | self._parse_idents(si)
+
+    @property
+    def authors(self):
+        """
+        Minimal parsing for "Author" and "Author-email" fields.
+        """
+        return self._parse_names_idents("Author", "Author-email")
+
+    @property
+    def maintainers(self):
+        """
+        Minimal parsing for "Maintainer" and "Maintainer-email" fields.
+        """
+        return self._parse_names_idents("Maintainer", "Maintainer-email")
diff --git a/Lib/importlib/metadata/_meta.py b/Lib/importlib/metadata/_meta.py
index c9a7ef906a..572f7030c0 100644
--- a/Lib/importlib/metadata/_meta.py
+++ b/Lib/importlib/metadata/_meta.py
@@ -1,6 +1,8 @@
 from typing import Protocol
 from typing import Any, Dict, Iterator, List, Optional, TypeVar, Union, overload
 
+from ._adapters import Ident
+
 
 _T = TypeVar("_T")
 
@@ -43,6 +45,18 @@ def json(self) -> Dict[str, Union[str, List[str]]]:
         A JSON-compatible form of the metadata.
         """
 
+    @property
+    def authors(self) -> set[Ident]:
+        """
+        Minimal parsing for "Author" and "Author-email" fields.
+        """
+
+    @property
+    def maintainers(self) -> set[Ident]:
+        """
+        Minimal parsing for "Maintainer" and "Maintainer-email" fields.
+        """
+
 
 class SimplePath(Protocol[_T]):
     """

Now for possible conflicts. The core metadata specification includes this example for the “Author” field, which presumably features a single author (or two authors on the first line, and a shared address and email on the second):

Author: C. Schultz, Universal Features Syndicate,
        Los Angeles, CA <cschultz@peanuts.example.com>

While for multiple authors, ‘pyproject.toml’ can produce this:

Author: Another person, Yet Another name

It is impossible to disambiguate individual authors consistently across the two.

2 Likes

There’s an error in the proposal. RFC 5322 allows email addresses outside of angle brackets when the accompanying name is missing. This is optional, but it is the behavior chosen by the ‘pyproject.toml’ specification. I’ve updated the _name_email regex, and if there are no objections I’ll update the proposal description with examples and the code with comments. The symbol “@” now replaces “<” as the special character. As before, the regex requires preceding whitespace - just with a longer lookahead.

# ^ ( ( quote
#     | not: space* (quote | not:space-or-at)+ @ anything
#     )+
#   )?
#
# space* <? ( (quote | not:space-or-at)+ @ anything+? )? >? $

r = r"""
^
  ( (?: (["']) (?:(?!\2|\\).|\\.)* \2
    |   (?! \ *
            (?: (["']) (?:(?!\3|\\).|\\.)* \3
            |   [^ @]
            )+
            @ .
        ).
    )+
  )?
  \ * <?
  ( (?: (["']) (?:(?!\5|\\).|\\.)* \5
    |   [^ @]
    )+
    @ .+?
  )?
  >?
$
"""

Let’s build and examine the metadata of an example package, ‘test-author-package’. Populate ‘pyproject.toml’ with the illustrative metadata given by the PEP 621 specification, section “authors/maintainers” (slightly tweaked for clarity), as follows:

[project]
name = "test-author-package"
version = "1.0"
authors = [
  {name = "Pradyun Gedam", email = "pradyun@example.com"},
  {name = "Tzu-Ping Chung", email = "tzu-ping@example.com"},
  {name = "Another Person"},
  {email = "different.person@example.com"},
  {name = "Yet Another Name"},
]
maintainers = [
  {name = "Brett Cannon", email = "brett@python.org"}
]

[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

When built this will generate a ‘PKG-INFO’ file which contains the following core identity metadata:

Author: Another Person, Yet Another Name
Author-email: Pradyun Gedam <pradyun@example.com>, Tzu-Ping Chung <tzu-ping@example.com>, different.person@example.com
Maintainer-email: Brett Cannon <brett@python.org>

Previously, this metadata could only be retrieved as raw, unparsed strings:

...: from importlib.metadata import metadata
...: # None of the identity fields are marked as multiple-use (more
...: # than one entry), so we use '.get' rather than '.get_all'.
...: mdata = metadata("test-author-package")
...: author = mdata.get("Author")
...: maintainer = mdata.get("Maintainer")
...: author_email = mdata.get("Author-email")
...: maintainer_email = mdata.get("Maintainer-email")
...: 
...: print(f"Author:           {author!r}")
...: print(f"Maintainer:       {maintainer!r}")
...: print(f"Author-email:     {author_email!r}")
...: print(f"Maintainer-email: {maintainer_email!r}")
Author:           'Another Person, Yet Another Name'
Maintainer:       None
Author-email:     'Pradyun Gedam <pradyun@example.com>, Tzu-Ping Chung <tzu-ping@example.com>, different.person@example.com'
Maintainer-email: 'Brett Cannon <brett@python.org>'

Now in addition to the previous methods, we have two extra properties:

...: print("Authors (list of 'Ident' instances):")
...: print(mdata.authors)
...: print()
...: print("Maintainers (list of Ident instances):")
...: print(mdata.maintainers)
...: print()
...: print("Ident instances are iterable:")
...: for name, email in mdata.authors:
...:     print(f"  name: {name!s:20}email: {email!s:20}")
Authors (set of 'Ident' instances):
{Ident(name='Pradyun Gedam', email='pradyun@example.com'),
 Ident(name='Tzu-Ping Chung', email='tzu-ping@example.com'),
 Ident(name=None, email='different.person@example.com'),
 Ident(name='Another Person', email=None),
 Ident(name='Yet Another Name', email=None)}

Maintainers (set of Ident instances):
{Ident(name='Brett Cannon', email='brett@python.org')}

Ident instances are iterable:
  name: Pradyun Gedam       email: pradyun@example.com
  name: Tzu-Ping Chung      email: tzu-ping@example.com
  name: None                email: different.person@example.com
  name: Another Person      email: None
  name: Yet Another Name    email: None

That’s really about it.

Flagging here that https://github.com/python/cpython/pull/108585 was filed for this and there’s some discussion there about whether such a higher level abstraction should live in importlib or packaging.

@jaraco seems on board with the idea of having this be a part of importlib.metadata there.

I’ll note here that I’m uncomfortable with the direction that PR is taking. I agree with @brettcannon that if there’s a problem with the identity data in the core metadata, it should be addressed in the metadata spec, not by individual parsers coming up with their own rules.

1 Like

I’ve thought about this some more and still stand by my comment on the PR - I don’t think the metadata can be fixed, not without starting from scratch like PEP 426. This proposal and PR haven’t diverged structurally. The functionality is required and the only decision is where: as a third-party package or integrated into an existing package? A third-party package isolates opinionated code (more on this later) but complicates the packaging ecosystem and requires the user to pull in even more dependencies, while integration may lead to more work for packaging maintainers. If the choice really is contentious perhaps we should take a poll… but I’ll try to argue my case.

If I were to publish a third-party package, I would inherit my StructuredPackageMetadata class from importlib.metadata.PackageMetadata, and provide a top-level structured_metadata() function which wraps importlib.metadata.metadata() and patches the __class__ attribute. Regardless of where the code lives, there would be very little practical difference.
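
For illustration, the wrapper could look something like this (a sketch; the names are hypothetical, and it subclasses the concrete ‘_adapters.Message’ rather than the ‘PackageMetadata’ protocol, since that is what makes the __class__ patch legal):

import importlib.metadata
from importlib.metadata import _adapters  # private; a real package would vendor its own copy


class StructuredPackageMetadata(_adapters.Message):
    @property
    def authors(self):
        # parse self.get("Author") and self.get("Author-email") here
        raise NotImplementedError


def structured_metadata(distribution_name):
    md = importlib.metadata.metadata(distribution_name)
    # metadata() returns an _adapters.Message, so swapping in a
    # same-layout subclass is permitted.
    md.__class__ = StructuredPackageMetadata
    return md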

To address this at the level of a metadata specification, why not just codify my proposal as a new PEP? I’m not very concerned with low-level structure or format - I’ve taken several ambiguous fields and, by dictating how they’re parsed, specified how their contents should be interpreted. By making as few assumptions as possible, my proposed changes are practical building blocks for parsing nearly all historical and non-standard fields. All you really need is a tool to parse comma-separated strings and a tool to parse name-email strings, both in the style of RFC 822. The difficulty is doing so with as few assumptions as possible while also accepting malformed inputs.

Let me reiterate with an example - there is no single interpretation that unifies these real-world samples (notice that they all share the same ‘Metadata-Version’):

Metadata-Version: 2.1
Name: example_from_core_metadata_spec
Author: C. Schultz, Universal Features Syndicate,
        Los Angeles, CA <cschultz@peanuts.example.com>

Metadata-Version: 2.1
Name: wheel
Author-email: Daniel Holth <dholth@fastmail.fm>

Metadata-Version: 2.1
Name: importlib-metadata
Author: Jason R. Coombs
Author-email: jaraco@jaraco.com

Failing that, I’d say the simplest fix would be to amend the PEP 621 specification so the separate ‘name’ and ‘email’ keys map to separate core metadata keys. The change is small, and because ‘pyproject.toml’ is newer this has the fewest repercussions. While this brings the behavior of ‘pyproject.toml’ in line with ‘setup.cfg’, it complicates the packaging ecosystem. Packages that expect a name or email in one field and not another may break. It presents yet another permutation of the set of possible values returned by the metadata API. As an overwhelmed user I don’t want to handle metadata fields whose interpretation depends on the version of my packaging tools. Even with such a minor change users would still be required to parse comma-separated identity fields, and doing so correctly is harder than it may seem. Packages published before any changes to the PEP 621 specification won’t vanish, so you’ll still need my original proposal to parse their metadata. Even worse, the approach taken by ‘setup.cfg’ decouples name from email, which can be a massive problem and is probably why the PEP 621 specification diverges. If an interposed identity omits either name or email it becomes impossible to correctly reconstruct (zip) the identities, as the sketch below shows.
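
Here is the failure mode, with hypothetical values: three identities where the middle one omits its email.

names = "Alice, Bob, Carol".split(", ")
emails = "alice@example.com, carol@example.com".split(", ")

# zip silently pairs Carol's address with Bob - nothing in the two
# decoupled fields records where the gap was.
print(list(zip(names, emails)))
# [('Alice', 'alice@example.com'), ('Bob', 'carol@example.com')]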

My proposal mentions that if your own package uses incompatible metadata, you should use your own parser. I’ve taken care to make as few assumptions as possible when parsing, and every transformation is reversible, which means that my functions are useful tools when building your own parser. Let’s build some parsers using the examples above:

No parsing is required if the intended author is “C. Schultz, Universal Features Syndicate, Los Angeles, CA <cschultz@peanuts.example.com>”.

>>> from importlib_metadata import metadata
>>> authors = metadata("example_from_core_metadata_spec").get("Author")

I’ll avoid the possibility that the intended authors are (“C. Schultz”, “Los Angeles, CA”, “cschultz@peanuts.example.com”) and (“Universal Features Syndicate”, “Los Angeles, CA”, “cschultz@peanuts.example.com”).

Luckily the ‘wheel’ package is already in our preferred format:

>>> from importlib_metadata import metadata
>>> authors = metadata("example_from_core_metadata_spec").authors

Ironically, ‘importlib_metadata’ needs some finagling. I now understand that identity ordering is important so I’ll have to make some minor tweaks to support this workflow - returning a list rather than a set:

>>> from importlib_metadata import metadata, Ident
>>> authors = metadata("importlib_metadata").authors
>>> paired = zip(authors, authors[len(authors)//2:])
>>> authors = [Ident(i_name.name, i_email.email) for i_name, i_email in paired]

Or alternatively:

>>> from importlib_metadata import metadata, Ident
>>> md = metadata("importlib_metadata")
>>> paired = zip(md._parse_names(md.get("Author", "")), md._parse_idents(md.get("Author-email", "")))
>>> authors = [Ident(i_name.name, i_email.email) for i_name, i_email in paired]

While the core metadata specification doesn’t require valid email addresses, PEP 621 does. Should I support obfuscated emails in which “@” is substituted by “_at_” or some variation thereof? All the build tools I’ve tried refuse to build such packages, which I think is a great example of the toolchain leading the specification rather than vice versa. But it would be very simple to add such support:

Metadata-Version: 2.1
Name: obfuscated_email
Author: Cool Pseudonym
Author-email: cool_at_pseudonym_dot_com

>>> from importlib_metadata import metadata
>>> md = metadata("obfuscated_email")
>>> authors = md._parse_names_idents("Author", "Author-email", email_identifier="_at_")
>>> email = next(iter(authors)).email.replace("_at_", "@").replace("_dot_", ".")

To summarize:

  • The specification can only be fixed by starting from scratch.
  • Even when starting from scratch, opinionated parsing is required for historical packages.
  • Amending the specification would just fragment the packaging ecosystem even further, without tangible benefits.
  • My proposal provides the two necessary tools for parsing historical and custom identity metadata, so it is not that opinionated after all.
1 Like

I don’t understand what the actual issue is here. The current metadata has

  • Author: An unstructured string
  • Author-Email: One or more RFC-822 addresses

plus the same for Maintainer. There’s no possibility of parsing Author accurately in all cases, as it has no defined structure. Parsing Author-Email can be done using email.utils.parseaddr.

So if you want to propose a structured format, you have to propose a new metadata version. And even then, consumers have to be prepared to parse the older, unstructured forms.

If, on the other hand, you want some way of getting whatever sense you can out of existing data, that’s a pretty hard problem. You can likely do a reasonably good job on most data, but there will be edge cases, and you have to decide how to handle them. That can be a 3rd party library, which means it’s entirely optional (and as a result can afford to ignore the worst forms of bad data).

But to include parsing in a core library like importlib.metadata or packaging would be more of a problem. It could of course be a utility function (which could work to similar constraints as a 3rd party version) but the compatibility policies - particularly of a stdlib module! - would probably make it far harder to evolve than would be ideal. And making structured data the default return would probably be impossible, because you’d still have to support non-structured values for data that doesn’t conform to what you’d like it to.

So honestly, I think your realistic options are one or both of:

  1. Propose a new metadata version mandating a more structured format.
  2. Write a 3rd party library that consumes existing Author/Author-Email pairs and returns structured data.

But I very definitely don’t think trying to impose structured values on the existing email metadata is something that should be going into importlib.metadata at the moment.

2 Likes

For context, the parsing of “Author-email” or “Maintainer-email” is not as simple as you make it seem. All the identity fields need to split the comma-separated values, and you can’t just do field.split(", "). According to [Core metadata email fields & Unicode], email.utils.formataddr has issues with internationalization, so I wouldn’t be surprised if email.utils.parseaddr is similarly plagued. Besides, as I explain in the proposal and PR, the invocation of RFC 822 is more what you’d call “guidelines” than actual rules. And despite the specification, the actual metadata can be hand-written and completely noncompliant.
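
To illustrate: the stdlib helpers cope with a well-formed address list, but no generic tool recovers the spec’s own free-form “Author” example (a quick demonstration; the sample values are taken from earlier in this thread):

import email.utils

# A well-formed RFC 5322 list parses cleanly:
print(email.utils.getaddresses(
    ["Pradyun Gedam <pradyun@example.com>, different.person@example.com"]))
# [('Pradyun Gedam', 'pradyun@example.com'), ('', 'different.person@example.com')]

# But a naive split mangles the core metadata spec's "Author" example:
author = "C. Schultz, Universal Features Syndicate, Los Angeles, CA <cschultz@peanuts.example.com>"
print(author.split(", "))
# ['C. Schultz', 'Universal Features Syndicate', 'Los Angeles', 'CA <cschultz@peanuts.example.com>']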

I’m not sure I understand your objection. If I wanted to create a new metadata version, then I could include my structured parsing code in importlib.metadata even though it wouldn’t be compatible with previous versions of the metadata? How’s that different from not including any parsing code because it could be incompatible with a small percentage of edge cases?

but the compatibility policies - particularly of a stdlib module! - would probably make it far harder to evolve than would be ideal.

I can’t speak to this so it could be a good argument. But why not land my code in ‘packaging’, which is not part of the standard library?

We are roughly in agreement, and I have no inclination to propose a new metadata version given the history of Python’s metadata specifications. I think it will be well worth your while to read what @jaraco has to say, since it mirrors my feelings and, from what I gather, the general sentiment about Python packaging.

My main goal is to make it easier for package authors to get information about their own packages into Python. I think it’s a shame that Python will not be able to understand the metadata of its own packages without a third-party module.

The important point is that I’m not trying to make the structured interface the default interface. PackageMetadata.get and PackageMetadata.get_all remain the default interfaces. I’m trying to add an additional interface that provides the tools to work with a given package. There will be edge cases, but it remains the responsibility of the user to select an appropriate combination of tools suitable for the metadata of the given package.

My proposal was careful to point out that all changes are purely optional!

In the scheme of all things this is more a molehill than a mountain but there are unresolved issues before I feel comfortable committing to a third-party package.

1 Like

From my POV on the security team, I am very not thrilled about a heuristic-based parser going into the standard library.

If this goes in, please document it as “not for security sensitive purposes”. We’ve had to deal with too many CVEs recently because people treat our parsers as suitable for security work when they were never intended for it.

From a regular user POV, please also document the way to handle when the metadata can’t be parsed, which presumably will be to treat the string as a string. As Paul says, we’re a new metadata version and a decade away from being able to assume that the field is in a valid format. Anything purporting to be a parser should be very clear that it’s parsing values that are not required to follow any standard, and so fallback procedures are necessary.

3 Likes

We can always deprecate fields and add new fields if the existing fields are not spec’d thoroughly enough. We should not, in my opinion, have core APIs for working with metadata that assumes things that aren’t part of the spec.

5 Likes

Yes, I was being generous. In reality any code that wants to parse the existing metadata will have to do a lot of work to come up with something sane. It may be necessary to have some sort of “either structured or free text as a fallback” approach.

The reason I was trying to be generous here is that it seems even less reasonable to try to put something that complex into the stdlib before it’s been developed and proven as a standalone package (with the freedom to rapidly iterate until a good API is reached).

No, a new metadata version would allow reliable parsing of future identity data. You’d still have the problem of parsing legacy data.

I’m not trying to argue for a way to add structured data to the stdlib (or packaging). You’re the one that wants to do that, I’m just pointing out that having a new metadata version with well-defined syntax for identity data would at least give you a chance (an API that populates structured data if metadata version is >= 2.x, and provides unstructured strings for when compatibility with older metadata versions is needed). But I honestly don’t see anyone designing a good API for importlib.metadata from scratch, without developing it independently first.

You could. But why? Isn’t it easier, at least while you go through the process of iterating on the design and getting experience with actual use cases, to write a standalone library that takes a packaging.metadata.RawMetadata value and returns structured identity data from it? I don’t see the benefit in making it a requirement that it goes into an established library, much less the stdlib, yet.

But if the packaging maintainers are OK with your PR, then sure - that’s their call.

I did read that, and while I agree with @jaraco’s distaste for the current situation, I don’t agree with the idea that this means we should rush something into importlib.metadata. Like it or not, that module is part of the stdlib, and when APIs get added there, we end up having to live with them for years, even if they have flaws.

All I’m saying here is the same as I’d say for any proposal to add an untested API to the stdlib. It should be published on PyPI first, as a standalone package, and only once it’s proved itself and demonstrated that it’s popular and stable, should it be moved into the stdlib.

Huh? Python (the stdlib) can’t even parse a package version without the 3rd party packaging module. This is a deliberate choice by the core developers, to keep all packaging functionality[1] outside the stdlib. I really don’t think author information should be more privileged in this regard than versions, or dependency specifiers, or any other metadata values.

Quite honestly, if you’re not comfortable committing to a 3rd party package, why should the core devs be comfortable with you expecting the same code to go into the stdlib?


  1. except for the bare minimum (ensurepip) to bootstrap an installer ↩︎

2 Likes

From my POV on the security team, I am very not thrilled about a heuristic-based parser going into the standard library.

It is heuristic in the sense that I inspected metadata files to create the parser, not in the sense that different code paths are chosen based on the input.

If this goes in, please document it as “not for security sensitive purposes”. We’ve had to deal with too many CVEs recently because people treat our parsers as suitable for security work when they were never intended for it.

The parsers are two regular expressions. I assume the worst that can happen is an attack on CPU and memory resources, much like the warning for the JSON module. I’ve tried to make the regexes proof against catastrophic backtracking but I can’t guarantee this. I’m not familiar with security concerns and will 100% document this as “not for security sensitive purposes”.

From a regular user POV, please also document the way to handle when the metadata can’t be parsed, which presumably will be to treat the string as a string.

There are no inputs that will not be parsed.

1 Like

We can always deprecate fields and add new fields if the existing fields are not spec’d thoroughly enough.

Thanks for the suggestion! I had never considered deprecating a field to a JSON version of itself, and so now I have an alternative suggestion:

  1. Deprecate “Author”, “Author-email”, “Maintainer”, “Maintainer-email” fields.
  2. Introduce new “Authors” and “Maintainers” fields.
  3. Discuss how to format these fields.

Structurally, we’d need a list of identities, where each identity is in turn either another list or a mapping containing name, email, and possibly other values. Given a nested list, the contents of each identity would be defined by index, ignoring additional entries containing user-defined values. Given a mapping, the contents of each identity would be defined by key, ignoring additional keys containing user-defined values.

Either approach captures the user-defined free-form data formerly stored in the “Author” and “Maintainer” fields. These new metadata fields would be optional, but if specified at least name or email must be present. I prefer the mapping because it is self-documenting.

The simplest encoding, inspired by the existing format, is to delimit identities with ", " and identity lists with an as-yet unspecified delimiter. Any other characters are included, even extra spaces around the delimiters. I believe the core metadata spec requires unicode compatibility, so any unicode codepoint is allowed, but delimiters must be quoted. The only backslash escape sequence in quoted strings is for the quote character itself, e.g. “\'”. Since the quote character terminates the quoted string, the backslash character need not have an escape sequence. Unless added as another escape sequence, newlines must be directly embedded. This is not a multiline encoding.

A variation of this is to require all contents between delimiters to be quoted, allowing additional whitespace - possibly even newlines - around the delimiters.

To modify this encoding for a mapping rather than a list, a third key-value delimiter must be chosen, likely to be “:”. As before any codepoint is allowed aside from delimiters, even in the key.

At this point we’ve nearly arrived at my preferred encoding, JSON. Why not have the new fields just embed a JSON object? This comes with a lot of nice features, including plenty of escape sequences. JSON is a well-known format, which is invaluable, as we’ve seen the damage caused by obscure formats. Because a “JSON string may cause the decoder to consume considerable CPU and memory resources” we should fix the length of the new fields to some limit, say 4096 unicode codepoints? @steve.dower, do you have any additional recommendations? The main limitation is that JSON does not support literal multiline strings (newlines must be escaped as \n).
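
For instance (purely illustrative; neither the field name nor the schema is settled), a JSON-valued field could round-trip like this:

import json

# What a producer might write into PKG-INFO:
#   Authors: [{"name": "Pradyun Gedam", "email": "pradyun@example.com"}, {"name": "Another Person"}]
field = '[{"name": "Pradyun Gedam", "email": "pradyun@example.com"}, {"name": "Another Person"}]'

if len(field) > 4096:  # the length cap suggested above
    raise ValueError("field too long")
print(json.loads(field))
# [{'name': 'Pradyun Gedam', 'email': 'pradyun@example.com'}, {'name': 'Another Person'}]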

There is also TOML which does support multiline strings, but I don’t think that TOML will embed as nicely as JSON.

I also want to bring in @westurner who in other threads made some interesting suggestions regarding structured formats.

There’s no need to tag me again in this thread. I’ve made my contribution.

1 Like

I don’t have special insight into what would be acceptable here, but I think that something like this would be OK, and it provides us the ability to better align the core metadata with PEP 621, to have less of an impedance mismatch between them.

I would guess that smuggling JSON itself into the current RFC 822 structured METADATA files probably has a high bar, not to say that we can’t do that if it’s what makes sense, just that I think it’ll have a high bar, so you’ll want to make sure that there is a good justification for why we need to do that, rather than something else.

My recommendation would be to settle on a single representation and standardize on that. So instead of saying it can be either a list of lists or a list of dicts, pick one and say that this is what it is; it’s a lot easier to transform these when the METADATA is being created than when it’s being read later on.

Since a list of lists can be represented as a list of dicts, where the inner list entry is a key of the dict, I would suggest just focusing on that. That also has the benefit of aligning with PEP 621 authors and maintainers better.

An interesting tidbit here is that PEP 621 defined the authors and maintainers “name” field as not being able to contain a comma, so we don’t really need to deal with escaping; we can just define the format as being:

# Both name and email
Authors: Grace Hopper,ghopper@example.com
# Just Name
Authors: Charles Babbage
# Just Email
Authors: ,anonymous@example.com

RFC 822 natively supports lists of strings (through multiple-use fields), and we’ve already used a similar pattern in the core metadata for the project URLs, and this is easy to implement:

def name_email(inp: str) -> tuple[str, str]:
    # Split on the first comma only; pad so a missing part becomes "".
    parts = inp.split(",", 1) + [""]
    parts = [p.strip() for p in parts[:2]]
    return tuple(parts)
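
Running that sketch over the three example field values gives:

>>> name_email("Grace Hopper,ghopper@example.com")
('Grace Hopper', 'ghopper@example.com')
>>> name_email("Charles Babbage")
('Charles Babbage', '')
>>> name_email(",anonymous@example.com")
('', 'anonymous@example.com')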

Of course, embedding JSON is more flexible, since it means we don’t have to impose the “no comma” limitation on names and it makes it easier to add additional fields in the future.

I don’t think multiline strings or anything like that are a concern; METADATA is intended first and foremost to be produced and consumed by machines, so single-line strings with embedded \n characters are fine.

Also, to set your expectations: it’s good to discuss this idea here and make sure that it’s worth pursuing before proceeding. However, eventually someone will have to write and champion a PEP following our specification update process, and if you’re willing to put in that effort, that’s great!

As part of that, you’ll want to think about forwards and backwards compatibility, both for the new fields and for the old fields (should it be legal to use both old and new together? What are the semantics if so? Are there recommendations for tools to populate the new fields from the old fields? etc.).

Personally I’m +1 on doing what we can to move away from older, more problematic metadata fields and standardizing newer, easier to use fields, particularly when we can tie those in with existing newer standards so they all work well together. The idea of deprecating Author, Author-Email, Maintainer, and Maintainer-Email in favor of Authors and Maintainers seems sound to me, and I’m on the fence between doing the simple comma delimited solution above, or just chucking a json map in there.

1 Like

Agreed. I’d actually like to see us move away from the email-based format to using JSON directly. But that’s a broader question, and would impact far more than this proposal.

I’m slightly uncomfortable with embedding JSON in the email format; it feels vaguely wrong. But it may well be a good practical compromise. On the other hand, given that we already have the restriction that author names may not contain commas, I agree that the “name,email” form is easy to parse and multiple-use fields are already possible, so I’d prefer we stick with that as a more “message-native” approach.

3 Likes

At least with packaging.metadata, we could provide some method that produced JSON from Metadata instances. Without much thought, I think we would need to make sure we agree on the fields and file name (which we sort of have via PEP 566 – Metadata for Python Software Packages 2.1), and then encourage build back-ends to start including the JSON along with the METADATA/PKG-INFO file. Whether we ever drop METADATA we can probably worry about at a later date.

+1 from me on the comma-separated format unless we can agree on the JSON file now, in which case starting the transition with these new fields makes it feel less icky.

I’ve been bitten by the same issue that @orbisvicis is trying to address here. I’m in the process of migrating projects from setup.cfg to PEP 621 pyproject.toml, but doing so has broken assumptions about the meaning of “Author*” and “Maintainer*” fields in multiple projects (jaraco.packaging, jaraco.media, and probably more). I had not realized that I could not rely on metadata['Author'] to get an author’s name or metadata['Author-email'] to get their email.

As we think about creating a new set of fields, can we avoid the “two fields” problem? In particular, let’s not create “Authors” and “Maintainers”, but instead create “Contributor” or “Contact”, to represent not only authors and maintainers but maybe also testers or documenters or support team or sponsors, etc? This approach would avoid the awkwardness in documentation/conversation “here’s how Authors is defined, plus same for Maintainers” but also has the benefit of being extensible for arbitrary purposes (similar to Project-URLs). I’d suggest the field to be a semi-structured object with the following fields:

  • name
  • email
  • role(s) (e.g. “author”, “maintainer”, “tester”, [“author”, “maintainer”])
  • social (e.g. “mastodon:name@fosstodon.org”, “https://www.linkedin.com/in/name”, “github:username”)
  • … (perhaps for future expansion by the spec or maybe left open for unspecified key/value pairs)

By creating a new name (Contact vs. Authors), it avoids the awkward and inconsistent approach of using singular for some multi fields (Project-URL) but plural for others (Authors). Also, Contact is more generic while still simply covering the previous scope of Author/Maintainer.

I’m liking Contact over Contributor, as a contact is more generic and doesn’t constrain the possible roles.

I’d propose that all fields should be optional (yes, even name, email, and role), matching with the current expectation that an author might not be named or addressed and that both author and maintainer are optional. The spec might state a default role (probably “author”), although I’m thinking not. Let there be bare contacts whose role is unknown.

Here are some example values:

Contact: name=Grace Hopper; email=ghopper@example.com; role=author
Contact: name=Charles Babbage; role=author
Contact: email=anonymous@example.com; role=sponsor
Contact: name="C. Schultz"; org="Universal Features Syndicate"; city="Los Angeles, CA"; email=cschultz@peanuts.example.com; role=author
Contact: name="Bill S. Preston, Esq."; role=lead

It’s not obvious to me what the encapsulation should be (JSON, RFC 822-based, or something else), but ideally it will be something that can support arbitrary unicode characters, possibly including quotes and semicolons and newlines, in the field values. In particular, it shouldn’t limit the characters just because it’s inconvenient for the encoding format used.
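
For a feel of what consuming the example values above might look like, here is a deliberately naive sketch (hypothetical: it assumes “; ” as the field delimiter and optional double quotes, and it does not handle quoted semicolons or escapes - exactly the encapsulation questions left open above):

def parse_contact(value: str) -> dict[str, str]:
    fields = {}
    for part in value.split("; "):
        key, _, val = part.partition("=")
        fields[key.strip()] = val.strip().strip('"')
    return fields

print(parse_contact('name="C. Schultz"; org="Universal Features Syndicate"; '
                    'city="Los Angeles, CA"; email=cschultz@peanuts.example.com; role=author'))
# {'name': 'C. Schultz', 'org': 'Universal Features Syndicate',
#  'city': 'Los Angeles, CA', 'email': 'cschultz@peanuts.example.com', 'role': 'author'}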

If this approach sounds nearly suitable (@pf_moore , @dstufft), I’ll start work on a PEP accordingly.

3 Likes

I had suggested something along the same lines:

Things have changed since then, so I hope there are more chances for something like this to happen nowadays.

1 Like

This seems reasonable to me. I’d suggest that a PEP includes an example of how the new data can be displayed on PyPI, as that’s the most visible place where it will appear, just to ensure that there aren’t any pitfalls. For example, how would the “C. Schultz” record from your example appear on PyPI? Or something pathological that adds a 3-page “resume” field? Or something with name="Maintainer Team" and email=<a list of 40 email addresses each on a separate line>. Once you allow “arbitrary data”, displaying it can become a hard problem very fast…

Serialisation needs to be made clear (what you’re referring to as “encapsulation”, I think). The metadata spec doesn’t currently say a lot on this, simply

The standard file format for metadata (including in wheels and installed projects) is based on the format of email headers. However, email formats have been revised several times, and exactly which email RFC applies to packaging metadata is not specified. In the absence of a precise definition, the practical standard is set by what the standard library email.parser module can parse using the compat32 policy.

This is unfortunate, as the mandated format is not capable of storing arbitrary data for fields. This has generally not been an issue until now, because we’ve been careful to avoid allowing metadata values that might be problematic. But you want to allow arbitrary Unicode and punctuation. I’m uncomfortable with a hybrid “the value is an ASCII-encoded, single line JSON string” solution, as it will probably be quite difficult to read (and the readability of the existing format is a feature - it’s not critical, but it’s not something I’d give up casually). Also, it will make storing metadata in any other format much more complex.

For example, without special rules[1] the unofficial JSON-compatible format from PEP 566 would require a “contact” field whose value is a JSON string:

{
...
"contact": "[{\"name\": \"Grace Hopper\", \"email\": \"ghopper@example.com\", \"role\": \"author\"}, {\"name\": \"Ren\\u00e9e\"}]",
...
}

For reference, I worked out that monstrosity using

 print(json.dumps({"contact": json.dumps([{"name": "Grace Hopper", "email": "ghopper@example.com", "role": "author"}, {"name": "Renée"}])}))

As long as you are willing to cover these details, I’d say it’s worth writing a PEP - it would be much easier to discuss specifics when there’s a full proposal to start from.

(I’d also be OK if you wanted to include in the PEP, or write as a separate PEP, a proposal to change the official serialisation method to something more robust than the existing email format. But that’s a much bigger change, with a much wider impact, and I’d completely understand if you didn’t want to touch it with a bargepole :slightly_smiling_face:)


  1. and including official special rules for an unofficial format in an official spec is problematic… ↩︎