Almost all build backends (distutils, setuptools, flit, poetry, etc.) assume metadata to be a plain mapping of strings, when in reality RFC822 is much more complex. E.g., values in fields like Description can’t contain EOLs, but in reality they do; the “email” body must contain only ASCII chars, but in reality there can be UTF-8 chars (a missed bug after the switch from Python 2 to Python 3?), etc.
Even if you use email to serialize supposedly RFC822-compliant metadata (actually, compliant with the latest standard, which seems to be RFC2822), so that str(email.message.Message) or str(email.message.EmailMessage) produces only ASCII characters with non-ASCII strings encoded in quoted-printable or base64, things won’t go as smoothly as you’d expect:
- Message encodes non-ASCII header values in base64.
- Message.set_payload(string) passes content as is.
- Message.set_payload(string, charset="utf-8") encodes content in base64.
- EmailMessage passes non-ASCII header values as is.
- EmailMessage.set_payload(string) passes content as is.
- EmailMessage.set_payload(string, charset="utf-8") encodes content in base64.
- EmailMessage.set_content(string) encodes content in quoted-printable.
- EmailMessage.set_content(string, charset="utf-8") encodes content in quoted-printable.
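The combinations listed above can be inspected with a short script. This is only a sketch: the Summary field and the sample text are picked for illustration, and the transfer encoding the library chooses may depend on the Python version and on the content itself, so the script prints what it observes instead of assuming a result.

```python
from email.message import EmailMessage, Message

TEXT = "Café, naïve, 日本語\n"  # illustrative non-ASCII sample


def describe(label, msg):
    """Print the transfer encoding and whether str(msg) is pure ASCII."""
    serialized = str(msg)
    cte = msg.get("Content-Transfer-Encoding", "(none)")
    print(f"{label:<52} cte={cte!s:<18} ascii={serialized.isascii()}")
    return serialized


def make(cls, action):
    """Create a fresh message of the given class and apply one action to it."""
    msg = cls()
    action(msg)
    return msg


def set_header(msg):
    msg["Summary"] = TEXT.strip()


cases = [
    ("Message header", make(Message, set_header)),
    ("Message.set_payload(s)", make(Message, lambda m: m.set_payload(TEXT))),
    ("Message.set_payload(s, charset='utf-8')",
     make(Message, lambda m: m.set_payload(TEXT, charset="utf-8"))),
    ("EmailMessage header", make(EmailMessage, set_header)),
    ("EmailMessage.set_payload(s)",
     make(EmailMessage, lambda m: m.set_payload(TEXT))),
    ("EmailMessage.set_payload(s, charset='utf-8')",
     make(EmailMessage, lambda m: m.set_payload(TEXT, charset="utf-8"))),
    ("EmailMessage.set_content(s)",
     make(EmailMessage, lambda m: m.set_content(TEXT))),
    ("EmailMessage.set_content(s, charset='utf-8')",
     make(EmailMessage, lambda m: m.set_content(TEXT, charset="utf-8"))),
]

# Map each label to its serialized form for further inspection.
outputs = {label: describe(label, msg) for label, msg in cases}
```

Running it on the Python version you actually target is the point: the table it prints is the ground truth for that interpreter, whatever any list (including the one above) says.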
The only valid solution seems to be Message.set_payload(string, charset="utf-8"), but Message.add_header() accepts multiline strings w/o raising an exception, keeping the user ignorant of invalid input (against The Zen of Python). To make Message raise exceptions, it must be created with a special policy: Message(policy=email.policy.strict). But that’s not all: to read the decoded content you must call Message.get_payload(decode=True), which returns bytes that still have to be decoded as UTF-8.
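Put together, the approach described here might look like the sketch below. The helper names build_metadata and read_description are hypothetical, as are the sample field values; reading back uses Message.get_payload(decode=True), which is the decoding accessor available on Message.

```python
import email
import email.policy
from email.message import Message


def build_metadata(fields, description):
    """Serialize metadata with a strict policy and a base64-encoded UTF-8 body."""
    msg = Message(policy=email.policy.strict)
    for name, value in fields:
        # Under this policy a value containing CR or LF raises ValueError
        # instead of being stored silently.
        msg.add_header(name, value)
    msg.set_payload(description, charset="utf-8")
    return msg.as_string()


def read_description(text):
    """Parse serialized metadata and return the decoded body as str."""
    parsed = email.message_from_string(text)
    # get_payload(decode=True) undoes the transfer encoding and returns bytes.
    return parsed.get_payload(decode=True).decode("utf-8")


text = build_metadata(
    [("Metadata-Version", "2.1"), ("Name", "demo"), ("Version", "1.0")],
    "Описание: café\n",
)
print(text.isascii())
print(read_description(text))

# The default (compat32) policy stores a multiline value without complaint...
Message().add_header("Summary", "first line\nsecond line")

# ...while the strict policy rejects it.
try:
    build_metadata([("Summary", "first line\nsecond line")], "")
except ValueError as exc:
    print("rejected:", exc)
```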
While the described issues may be fixed, it’s likely that there are many more. One example is that the standard specifies CRLF as the only EOL, but people mention that some software doesn’t work with CRLF, while software that works with CRLF also works with LF (well, these are email clients, so this may not apply to Python’s metadata). Another example is that header values may contain parameters, and after parsing they become part of the value (Summary: Title; charset="utf-8"). With wider acceptance of custom build backends, esp. in-tree build backends, there’s bound to be more software that generates invalid metadata due to underestimating the complexity of RFC822: if a package manager as sizable as poetry, a project with lots of people working on it, generates metadata from format strings, there are likely to be many more packages with invalid metadata.
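Both pitfalls are easy to reproduce. In the sketch below, naive_metadata is a hypothetical serializer written for this illustration (it is not code from poetry or any other backend); it stands in for any tool that assembles metadata with string formatting.

```python
import email
from email.message import Message

# Parameters passed to add_header() are folded into the stored value.
msg = Message()
msg.add_header("Summary", "Title", charset="utf-8")
print(msg["Summary"])  # Title; charset="utf-8"


def naive_metadata(name, summary):
    """Hypothetical serializer that trusts its inputs to be single-line."""
    return f"Metadata-Version: 2.1\nName: {name}\nSummary: {summary}\n\n"


# A summary containing an EOL silently turns into an extra header on parsing.
parsed = email.message_from_string(
    naive_metadata("demo", "First line\nLicense: injected")
)
print(parsed["Summary"])  # First line
print(parsed["License"])  # injected
```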
Ideally, it’d be best to switch to a completely new format: in 2020, when adoption of Unicode/UTF-8 continues to rise and even Python itself may switch to UTF-8-only mode, it’s weird that a programming language as noob-friendly as Python uses such a complicated (not just complex) and error-prone format just to pass a mapping of strings… But since there’s little hope of that, the next best solution is this:
- write a dedicated PEP as the only way to inform all interested parties about how broken the current RFC822-based IO logic is, which will hopefully result in the use of proper tooling.
- leave things as is for affected distros/package managers and introduce metadata v3.0 based on the existing format (a major version bump is needed to reflect the significance of the changes).
- provide a working example of valid metadata IO in the related PEP(s), or better yet, add a dedicated package/logic to stdlib/packaging/whatnot.
- apart from dedicated logic for metadata IO, there probably should be some logic to verify sdists and wheels, like auditwheel, but cross-platform and probably w/o checking dependencies of platlib binaries to make the implementation simpler (maybe as part of packaging, or complementary to it?).
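As a rough idea of what the smallest piece of such verification logic could look like, here is a sketch of a metadata check. The function name check_metadata is hypothetical and the checks are deliberately minimal (required core metadata fields plus defects reported by the stdlib parser); a real tool would need to cover far more.

```python
import email
import email.policy

# Fields that core metadata requires in every version.
REQUIRED_FIELDS = ("Metadata-Version", "Name", "Version")


def check_metadata(text):
    """Return a list of human-readable problems found in a metadata document."""
    problems = []
    msg = email.message_from_string(text, policy=email.policy.compat32)
    for field in REQUIRED_FIELDS:
        if msg[field] is None:
            problems.append(f"missing required field: {field}")
    # The parser records structural problems as defects instead of raising.
    for defect in msg.defects:
        problems.append(f"parser defect: {type(defect).__name__}")
    return problems


print(check_metadata("Metadata-Version: 2.1\nName: demo\nVersion: 1.0\n\nBody\n"))
print(check_metadata("Metadata-Version: 2.1\nName: demo\n\n"))
```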
In case you have second thoughts about the existing format, TOML will most likely make the most sense as a new format from a consistency POV, even though it is more limiting than JSON in some respects (e.g., homogeneous arrays) while supporting more native Python types out of the box. Though if metadata is switched to TOML, it will make sense for PEP 621 to use the same field names in pyproject.toml as defined in core metadata, and possibly to switch names to either lowercase-hyphen or snake-case, which in turn may trigger a desire to provide a PEP with modernised aliases for core metadata…