Metadata format: issues with metadata fields

8day · November 23, 2020, 2:26pm

Some fields are explicit multiple-use fields, while others are implicit multiple-use fields whose single value holds a delimited list (CSV). E.g., Keywords, Requires-Python, Author-email and Maintainer-email are “implicit multiple-use fields”. Explicitly defining such fields as “compact” multiple-use fields would’ve made things more transparent, generic and readily automatable. E.g., this way Keywords wouldn’t need to be treated as a special case in the email-to-JSON conversion rule.

Supported-Platform seems to have the same purpose as a “platform tag” in a wheel, which makes it somewhat useless:

Binary distributions containing a PKG-INFO file will use the Supported-Platform field in their metadata to specify the OS and CPU for which the binary distribution was compiled.

Description field mustn’t contain EOLs/multiline strings.

PEP 345 states that

To support empty lines and lines with indentation with respect to the RFC 822 format, any CRLF character has to be suffixed by 7 spaces followed by a pipe ("|") char. As a result, the Description field is encoded into a folded field that can be interpreted by RFC822 parser [2].

In reality, RFC822 and its successors don’t mention anything like that as a way to create a “folded field”: the “7 spaces” could just as easily be one space, and the pipe char could be any printable (?) char. Also

This encoding implies that any occurrences of a CRLF followed by 7 spaces and a pipe char have to be replaced by a single CRLF when the field is unfolded using a RFC822 reader.

is wrong too: CRLF followed by whitespace is replaced by a single whitespace.

RFC 822 (Standard for the Format of ARPA Internet Text Messages) defines “long header fields” and what folding is:

Each header field can be viewed as a single, logical line of ASCII characters, comprising a field-name and a field-body. For convenience, the field-body portion of this conceptual entity can be split into a multiple-line representation; this is called "folding".

I.e., “folding” is a means of formatting raw data, rather than the text it represents. Later it defines “unfolding”:

Unfolding is accomplished by regarding CRLF immediately followed by a LWSP-char as equivalent to the LWSP-char.

I.e., “unfolding” results in replacing a CRLF, together with the >=1 whitespace chars that follow it, with a single whitespace char.
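To make the unfolding rule concrete, here is a minimal sketch of it in Python. It follows the strictest reading of the quoted RFC 822 sentence (only the CRLF is dropped and the whitespace after it is kept); a reader that also collapses the whitespace run would lose the line break just the same. The `unfold` helper and the sample Description value are assumptions made for illustration, not part of any packaging tool.

```python
import re


def unfold(raw_header: str) -> str:
    """Unfold a header: a CRLF followed by a space or tab is dropped (RFC 822 rule)."""
    # The lookahead keeps the LWSP-char itself; only the CRLF before it is removed.
    return re.sub(r"\r\n(?=[ \t])", "", raw_header)


# Hypothetical Description folded the way PEP 345 prescribes: CRLF, 7 spaces, "|".
folded = "Description: First line\r\n" + " " * 7 + "|Second line"
unfolded = unfold(folded)

# The line break is gone; only the padding and the pipe survive.
print(repr(unfolded))
```

Whatever marker follows the CRLF, a plain RFC 822 reader has no way to know that it was meant to stand for a line break.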

The only part of RFC822 that mentions anything remotely similar to preservation of EOLs is the section of RFC 822 (Standard for the Format of ARPA Internet Text Messages) that defines “structured field bodies”:

To aid in the creation and reading of structured fields, the free insertion of linear-white-space (which permits folding by inclusion of CRLFs) is allowed between lexical tokens. Rather than obscuring the syntax specifications for these structured fields with explicit syntax for this linear-white-space, the existence of another "lexical" analyzer is assumed. This analyzer does not apply for unstructured field bodies that are simply strings of text, as described above. The analyzer provides an interpretation of the unfolded text composing the body of the field as a sequence of lexical symbols.

From the last sentence it becomes clear that the “analyzer” provides an “interpretation of the unfolded text”, meaning that when the “analyzer” receives a field-body it is already unfolded; therefore EOLs of any kind (or only CRLF?) can’t be preserved.

RFC2822 contains only “minor” changes, like requiring that no line of a folded field consist entirely of whitespace, which don’t change the overall picture.

This is important for sdists, which must use v1.2, where the description can’t be stored as the email payload/content to preserve EOLs. That makes this a perfect moment to write a proper PEP for sdist (one seems to be already in the works).

Probably the best way to support multiline strings in “field-body” is to encode them in base64.
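A rough sketch of what that could look like, using only the standard library. Note that the core metadata specification defines no such encoding; the helper names `encode_field_body` and `decode_field_body` are hypothetical and only illustrate the round trip.

```python
import base64


def encode_field_body(text: str) -> str:
    """Encode arbitrary text as a single-line, ASCII-only field-body."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_field_body(body: str) -> str:
    """Reverse encode_field_body()."""
    return base64.b64decode(body.encode("ascii")).decode("utf-8")


# Hypothetical description with an empty line and an indented line.
description = "First paragraph.\r\n\r\n    indented line\r\n"
body = encode_field_body(description)

print(body)                                    # one line, no CR or LF inside
print(decode_field_body(body) == description)  # EOLs and indentation survive
```

The obvious trade-off is that the header stops being human-readable, which is something a PEP would have to weigh.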

Description-Content-Type only accepts UTF-8 as its charset, so what’s the point of specifying it explicitly? Is it a Py2k remnant?

Keywords should’ve been a multiple-use field. Maybe introduce a multiple-use field Keyword, which BTW wouldn’t conflict with the email-to-JSON conversion rule that treats Keywords as a multiple-use field, and would be in singular form like other multiple-use fields? Or treat fields with CSV as “compact” multiple-use fields, as described at the very beginning.

Isn’t Home-page in conflict with Project-URL: Home page, https://hope.page? Maybe define them to be interchangeable?

Download-URL – same issue as with Home-page.

Author and Maintainer should’ve been multiple-use fields because a project may have multiple authors/maintainers. The “Core metadata specifications” page of the Python Packaging User Guide doesn’t say anything, but judging by the METADATA generated by setuptools these fields can contain CSV.

Author-email and Maintainer-email must be compatible with the RFC822 From header, and therefore must be able to contain CSV (CSV (?) “target-list”: https://cr.yp.to/immhf/sender.html).

License faces a similar issue to Description, but can be stored only in a header.

Requires-Python should’ve been a multiple-use field. It faces exactly the same issue as Keywords.

There is no way to define “importable packages”. I think I’ve read somewhere about a notation like Provides-Dist: {dist}:{pkg}, but can’t find the source. E.g., ATM pkg_resources is effectively unrelated to setuptools, and its dist metadata, resources, etc. can’t be read with the help of importlib.metadata. It seems that ATM pip emulates this with *.dist-info/top_level.txt, which is likely to be a result of parsing *.dist-info/RECORD.

Obsoletes-Dist, but for “importable packages”.

It is legal to specify Provides-Extra: without referencing it in any Requires-Dist:.

If this is related to the “virtual” dist from Provides-Dist prior to v2.1, on which Provides-Extra seems to be based, then it implies that the mere fact of a “virtual” feature being mentioned in Provides-Extra must satisfy the requirement dist[virtual]. The problem is that, because of the complexity of “extra” in environment markers, package managers will be forced to check all environment markers to determine which extras are “virtual” (and not just by evaluating them… which is partially caused by branching), and that can be overwhelmingly complicated considering how concise that sentence is. It’d be nice if that sentence were expanded to explain the meaning of such unreferenced extras.

A new multiple-use field Extends-Package/Extends-Dist is needed to associate extensions with packages/dists, instead of requesting a Classifier each and every time some extensible distro rises to popularity. There might be other types of relations, but I guess most of them are about either extending or replacing packages. This would simplify finding extensions, and possibly revive interest in Keywords.

Requires-External is applicable only to wheels, according to

Each entry contains a string describing some dependency in the system that the distribution is to be used.

In the context of a wheel it specifies the run-time environment, but in the context of an sdist it will specify the build-time environment, which will make the two metadata files differ, thus requiring two separate sets of metadata definitions in pyproject.toml (e.g., for PEP 621). That being said, considering the use-case of this field, it makes as much sense to provide Requires-External for sdists as for wheels…

Overhaul the tagging of distros to make finding relevant ones much easier. Classifiers require too much typing, and are thus useless for CLI, and keywords don’t seem to be used at all (maybe internally by some packages). The problem is that, considering the role that classifiers play, they can’t be replaced by keywords (maybe split into separate keywords in a meaningful way (?))…

Project-URL – standard set of labels?

uranusjr · November 24, 2020, 6:01am

To me, most ideas¹ here boil down to: Yeah that sounds reasonable. The benefit is probably too minimal for most to bother, but feel free to write a PEP for that.

¹ Except the “implicit multiple use” one, which IMO is a wrong interpretation, and the Description point, which as you said is already in the works.

Also note that the canonical specification is on packaging.python.org, not the PEPs that proposed them.

That’s not how they work, from my understanding. All of these fields are single use, but that single value is in a format that can be interpreted as a collection of values. This is the same as the To: field in email; most (all?) email clients don’t put each address in its own To: field when sending to multiple recipients, because one single use of the field can contain multiple addresses. Requires-Python parses to a PEP 440 version specifier, which is a single value.
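The distinction can be sketched with the standard library's email package. The sample metadata below is made up for illustration; the point is only that Classifier appears several times, while Keywords and Author-email each appear once and carry a value that is split by field-specific rules.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical METADATA content, not taken from any real project.
raw = (
    "Metadata-Version: 2.1\n"
    "Name: example\n"
    "Keywords: packaging,metadata,email\n"
    "Author-email: A <a@example.com>, B <b@example.com>\n"
    "Classifier: Topic :: Software Development\n"
    "Classifier: Topic :: Utilities\n"
)
msg = message_from_string(raw)

# Multiple-use field: the header itself is repeated.
classifiers = msg.get_all("Classifier")

# Single-use fields whose one value encodes a collection.
keywords = [k.strip() for k in msg["Keywords"].split(",")]
authors = getaddresses(msg.get_all("Author-email"))

print(classifiers)
print(keywords)
print(authors)
```

Here `getaddresses` applies the RFC 822 address-list rules, which is why a generic "split on commas" rule would not be a safe replacement for every such field.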

I kind of agree that Keywords should probably have been made into a multiple use field. But again the benefit is too minimal.

This is a leftover from the now-defunct egg format. “Binary distributions containing a PKG-INFO” is not a standard thing anymore. It can probably be removed because it is no longer usable, but there’s really no harm in keeping it there either.

I believe the point is to make all RFC 1341 formats parsable, which makes things easier for tools: they can reuse an existing parser and just check that the encoding is indeed UTF-8. If charset had to be omitted, tools would need to write a custom parser.

Those fields accept free-form text because that’s what some people need, and changing them to multiple use would break that. You’d want Author-email and Maintainer-email if you’re looking for structured values (using RFC 822 email-list format instead of multiple use). PEP 621 tries to somehow resolve the confusion by providing an “official” way to serialise declarations into these fields, but it would be nice if the specification could describe them better as well.

I believe this is somewhat deliberate, to separate the ideas of distribution and package (which you conflate here). Distribution metadata describes a collection of files to be installed, or already installed on disk. Whether these files result in a Python package is not its business. The fact that the setuptools distribution installs pkg_resources/__init__.py (among other files) is described in the RECORD file (sdist has an equivalent to that but I can’t recall), and the fact that it becomes an importable thing is not a part of the distribution. IMO the separation is a good thing, since the definition of “importable” can change over time (see implicit namespace packages), and distribution formats do not need to change with that. Which files are considered importable packages, and how, is outside the scope of packaging, and should be maintained elsewhere.
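As an illustration of why "importable" is a matter of interpretation rather than metadata, here is a small heuristic that guesses top-level names from RECORD-style rows. The `top_level_names` function, the sample rows, the hashes and the version number are all hypothetical, and this is not how pip or setuptools is known to do it; it only shows that a tool has to apply its own rules to a plain file list.

```python
import csv
from pathlib import PurePosixPath


def top_level_names(record_text: str) -> set[str]:
    """Guess importable top-level names from RECORD rows (path,hash,size)."""
    names = set()
    for row in csv.reader(record_text.splitlines()):
        if not row:
            continue
        path = PurePosixPath(row[0])
        first = path.parts[0]
        # Skip the metadata and data directories, and anything outside site-packages.
        if first.endswith((".dist-info", ".data")) or first == "..":
            continue
        if len(path.parts) > 1:
            names.add(first)       # a directory: assume a package
        elif path.suffix == ".py":
            names.add(path.stem)   # a single-file module
    return names


# Hypothetical RECORD content; hashes and sizes are placeholders.
record = """\
pkg_resources/__init__.py,sha256=abc,123
setuptools/__init__.py,sha256=def,456
easy_install.py,sha256=ghi,789
setuptools-0.0.0.dist-info/METADATA,sha256=jkl,10
setuptools-0.0.0.dist-info/RECORD,,
"""

print(sorted(top_level_names(record)))
```

The rule "every directory is a package" is exactly the kind of assumption that changes over time (namespace packages, extension modules, .pth files), which supports keeping it out of the distribution format.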

That’s not how I interpret the text, and it seems to contradict your own following sentence, which mentions what it means in an sdist. Anyway, the specification also says there is no particular rule on the strings to be used, so anything goes.

Sounds like a good idea to me. Warehouse already has some heuristics for parsing the keys (mostly to put icons next to them). I’d imagine most would welcome this as well.

p.s. I find quoting text with ticks (`) very difficult to read. They are converted into <code> tags, which are also semantically wrong. Please consider using italics or blockquotes (>) instead in the future; both are widely used and accepted writing styles.

8day · November 24, 2020, 7:53am

But what about Extends-Dist? I have mentioned Extends-Package, which can make it seem somewhat related to Provides-Dist: {dist}:{pkg}, but there’s no reason for it to be: Extends-Dist can be a fully dist-specific field. Alternative solutions require people to prefix/suffix/format names of dists (e.g., {parent_dist}_{ext_type}_{ext_name}) for easier identification, which makes them almost unusable in CLI, and possibly to write logic that uses an external database listing all available extensions (if people want to see all available extensions to choose from). Obviously, this doesn’t have to happen right now, but possibly some time in the future.

Also, would it be possible to add a note about Supported-Platform being a leftover from the egg format? I left this out, but for a long time “Binary distributions containing a PKG-INFO” seemed like some copy-paste mistake.

uranusjr · November 24, 2020, 8:12am

I don’t really understand how the Extends-* fields should be used, or why. It would be nice if you could elaborate on the idea more, either as a PEP or something along those lines, to describe the rationale and how it can be used to solve problems.

Edit: Forgot to reply to the Supported-Platform part. A PR to the pypa/packaging.python.org GitHub repository is always welcome (although I would suggest waiting for some others to confirm that it is indeed a leftover and should be avoided).

8day · November 25, 2020, 4:59pm

Sorry, I guess I didn’t think this through… This would allow establishing connections between dists and avoid populating classifiers with software implementations (e.g., Framework :: Plone :: Addon and in a way Topic :: Desktop Environment :: Window Managers :: IceWM :: Themes) so that classifiers could be used for more abstract entities. Think of it as a way to support finding extensions for all/less popular frameworks, etc. w/o a need to update classifiers.

It turned out that you must specify the parent dist (and potentially the type of extension) no matter what, which is more convenient when using a GUI, but when using a CLI there’s basically no change (almost the same amount of text to type). An alternative solution suitable for both GUI (PyPI) and CLI (pip) would be to allow a wildcard character in searches: this would allow searching for dists like distX_exttypeY_extZ with distX_exttypeY_* and would not require passing any extra CLI args or using any GUI widgets, because dist names can’t contain the wildcard char * (the equivalent of re.fullmatch used with *). Note that to avoid finding unrelated dists, the match must be exact: xyz* must return dists starting with xyz and not the ones merely containing xyz. I.e., this is something to be asked on PyPA GitHub.
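A minimal sketch of such an anchored wildcard search, using `fnmatch.fnmatchcase` from the standard library, which matches the whole name the way `re.fullmatch` would. The `search` helper and the index of names are made up for illustration; neither pip nor PyPI is claimed to work this way.

```python
from fnmatch import fnmatchcase


def search(names, pattern):
    """Return the names that match the pattern as a whole (no substring hits)."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Hypothetical index of distribution names.
index = [
    "distX_exttypeY_extZ",
    "distX_exttypeY_extW",
    "other_distX_exttypeY_ext",  # contains the prefix, but does not start with it
    "distX",
]

print(search(index, "distX_exttypeY_*"))
```

Because the pattern is anchored at both ends, `other_distX_exttypeY_ext` is not returned, which is the "starting with, not containing" behaviour described above.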
