Package metadata - Description field

pf_moore · June 21, 2020, 12:05pm

The Description field in package metadata is defined here as being either encoded in the “message body” part of the RFC 822 formatted file, or in the headers in “folded field” format, with CRLF replaced by CRLF plus 7 spaces and a pipe character.

In practice, wheels appear to use the “message body” format, whereas sdists created with setuptools use a non-standard “header with extra lines indented by 8 spaces but no pipe” format. Flit ignores the long description in the sdist metadata, but uses the message body for wheels. To further complicate the matter, Python’s email.parser.Parser class doesn’t seem to do anything with the “7 spaces plus pipe” format (or for that matter, the 8-space indentation), so client tools have to manually fix the indentation.

The result is frankly a bit of a mess for client tools. And unfortunately, there’s not much we can do to retrospectively fix older sdists. (On the plus side, I suspect no-one actually cares much about sdist metadata, and a bit of malformed whitespace isn’t a disaster anyway).

In the interests of making it easier to write conformant clients, while not making a mountain out of a molehill, I suggest the following:

1. The core spec states that the “description in the message body” form is canonical, and the header form is retained for backward compatibility only. All new tools should only write the body form.
2. The “indented without a pipe” form of header is noted as being a permitted variant (simply because we can’t do much about the fact that it exists in the wild).
3. Clients reading metadata must prefer the “body” form, and are allowed to ignore the indentation rules. I don’t like this particular provision, but nor do I like requiring clients to implement complex rules to ensure correct round tripping of a format that we are deprecating. If, on the other hand, there’s a way to get the stdlib email package to do this parsing for us, then I’d be more than happy with the spec including sample code that showed how to write a fully-compliant parser with the email library (it may be possible to do this with policy objects, but I’ve no idea how).
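As an illustration of the reading rule proposed above, here is a minimal sketch of a lenient reader built on the stdlib email package. The function name `read_description` and the decision to strip up to 8 leading spaces are assumptions made for this example, not something the spec defines. It relies on the default (compat32) policy of `email.parser.Parser`, which keeps the line breaks of folded headers in the returned value.

```python
import email.parser
import re

# Matches either the spec's "7 spaces plus pipe" prefix or up to 8 leading
# spaces (the setuptools-style variant). Being this lenient is an assumption.
_CONTINUATION_PREFIX = re.compile(r"^(?: {7}\|| {1,8})")


def read_description(metadata_text):
    """Return the long description, preferring the message body form."""
    # The default (compat32) policy preserves newlines inside folded headers.
    msg = email.parser.Parser().parsestr(metadata_text)
    body = msg.get_payload()
    if isinstance(body, str) and body.strip():
        return body
    raw = msg.get("Description")
    if raw is None:
        return None
    # The first line carries no continuation prefix; the rest do.
    first, *rest = raw.splitlines() or [""]
    return "\n".join([first] + [_CONTINUATION_PREFIX.sub("", line) for line in rest])
```

The sketch does not attempt exact round tripping: a continuation line indented by fewer than 8 spaces simply loses whatever indentation it has.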

For all practical purposes, I suspect that most clients would simply read the record and ignore the indentation-preservation rules for the header form anyway, so we’d really just be codifying what’s actually happening.

What do people think? In particular, @jaraco how much effort would be involved in fixing setuptools’ sdist code? I suspect this might actually originate in distutils, so it might need new code to be written.

dholth · June 21, 2020, 1:30pm

The description-as-body format came from wheel. Here’s where it converts from the PKG-INFO description-as-field format. The description-as-body format is much easier to read in a text editor. https://github.com/pypa/wheel/blob/251a0939a0e09352ae930f480cadbabfc103d240/src/wheel/metadata.py#L86

I wonder about the efficiency of loading a very long description in this file, that also has to be read at runtime for things like the version number. In the future you could stop parsing the message after the \r\n\r\n body separator or you could move the description to a separate file.
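A rough sketch of the "stop at the body separator" idea, assuming the description is stored in the body form so that the first blank line really does end the headers. The helper name `read_headers_only` is made up for this example.

```python
import email.parser
import io


def read_headers_only(stream):
    """Read lines up to the first blank line and parse only those as headers."""
    header_lines = []
    for line in stream:
        if line in ("\n", "\r\n"):
            break  # blank line: the body (long description) starts here
        header_lines.append(line)
    return email.parser.HeaderParser().parsestr("".join(header_lines))


# Usage with an in-memory file; a real METADATA file object would work the same way.
stream = io.StringIO("Name: foo\nVersion: 1.0\n\nA very long description\n")
headers = read_headers_only(stream)
print(headers["Version"])
```

Only the header lines are ever handed to the parser, so the cost of reading the version number does not grow with the length of the description.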

jaraco · June 21, 2020, 4:00pm

Not to derail the topic too much, but it feels a little impractical to me to allow one field (Description) to have newlines, but to effectively prohibit newlines from any other metadata field. The “email message” format is not well-suited for general-purpose content serialization and that’s why the “multipart” content type was created. I’d like to acknowledge that moving the Description into the only content part is limiting and unsustainable.

That said, I know there have been other metadata formats proposed and they’ve stumbled onto other challenges, so at least for metadata 1.2, it makes sense to try to refine (tighten) the spec and limit the variance based on real-world examples from today.

For (3), I’d suggest we recommend that clients rely on importlib-metadata and that library should provide a best-in-class experience, including support for whatever multiline fields may be present (with or without help of the email package).

pf_moore wrote: “how much effort would be involved in fixing setuptools’ sdist code?”

I guess it depends on the scope and the approach. I looked briefly at setuptools’ code base, and it seems the metadata writer is already a backport of the metadata writer from Python 3.5, so it should be straightforward to change that (though it will now diverge from distutils).

jaraco · June 21, 2020, 4:01pm

Perhaps it would be better for sdists to simply get the same dist-info that wheels get.

dholth · June 21, 2020, 4:15pm

There’s also the JSON transformation of https://www.python.org/dev/peps/pep-0566/#json-compatible-metadata
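For readers unfamiliar with that transformation, the sketch below shows its general shape: keys are lower-cased with hyphens turned into underscores, multiple-use fields become lists, and a message body becomes the description. The `MULTIPLE_USE` set is transcribed by hand for this example and may be incomplete, the function name is hypothetical, and the PEP's special handling of Keywords is deliberately left out.

```python
import email.parser

# Assumed list of multiple-use fields, already in transformed (lower-case) form.
MULTIPLE_USE = {
    "platform", "supported_platform", "classifier", "requires_dist",
    "requires_external", "project_url", "provides_extra",
    "provides_dist", "obsoletes_dist",
}


def to_json_compatible(metadata_text):
    """Convert RFC 822 style metadata into a JSON-compatible dict."""
    msg = email.parser.HeaderParser().parsestr(metadata_text)
    result = {}
    for name, value in msg.items():
        key = name.lower().replace("-", "_")
        if key in MULTIPLE_USE:
            result.setdefault(key, []).append(value)
        else:
            result[key] = value
    body = msg.get_payload()
    if body:
        # A message body, when present, supplies the description.
        result["description"] = body
    return result
```

The returned dictionary contains only strings and lists of strings, so it can be passed straight to `json.dumps`.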

It would be neat if setuptools generated dist-info by itself. Setuptools accepts plugins to write any metadata into the egg-info directory. This is one reason why bdist_wheel generates dist-info by converting an egg-info directory. The difficulty might depend on whether these plugins are modifying PKG-INFO or just (more common?) writing extra files into the directory.

pf_moore · June 21, 2020, 5:08pm

Agreed, treating Description specially is both limiting and annoying. However, I don’t want to start a whole “next generation metadata” discussion at this point, so I’ll limit myself to saying that I agree, but for now all I’m interested in is tidying up the inconsistency between spec and practice with the current standard.

For me, that’s the ideal solution going forward for writers.

Yes, and that’s well-defined, so for my own purposes I’d tend to treat that form (or rather the Python data structure it represents) as canonical, with the RFC 822 format as a particular serialisation (that has some quirks around handling of multi-line data in line-oriented headers).

For readers, all I’m really interested in is allowing client code to be able to write standard-compliant code without needing complicated parsing. Sadly, the RFC 822 spec and the Python email module don’t really help much with preserving indentation in multi-line data, so clients have some work to do there. The pipe format was a way to handle that, but it’s not worked out because the main producer never actually used it. So we have to accept the reality, and allow the non-pipe format and permit readers some flexibility in how they handle the field.

Given that the metadata format is defined by the PyPA spec, and not by a PEP these days, I think what I’ll do is just propose a PR to the spec that allows the non-pipe form (for backward compatibility only) and gives some guidance on what clients reading the data can do.

More extensive revisions can be handled separately (although I won’t be proposing any myself, as this isn’t a big enough deal for me to want to spend time on it).

dholth · June 21, 2020, 5:46pm

No one is missing multi-line Medium-Description: or multipart/alternative text/html metadata with attached images.

steve.dower · June 21, 2020, 9:55pm

This all sounds good to me, though I didn’t think it was too hard to parse the file anyway, even with the different formats (startswith’s tuple form came to the rescue for me).
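To illustrate the `str.startswith` tuple form mentioned here, the following is a hand-rolled, line-oriented sketch that accepts both continuation prefixes without using the email package. It is not the parser referred to in the post; the names and the exact prefixes treated as continuations are assumptions for this example.

```python
# Both continuation styles seen in the wild: "7 spaces plus pipe" and 8 spaces.
CONTINUATION_PREFIXES = ("       |", "        ")


def parse_metadata_lines(text):
    """Return (fields, body), where fields is a list of (name, value) pairs."""
    fields = []
    lines = iter(text.splitlines())
    for line in lines:
        if not line:
            break  # first empty line: everything after it is the body
        if fields and line.startswith(CONTINUATION_PREFIXES):
            # startswith accepts a tuple and checks every prefix in it.
            name, value = fields[-1]
            fields[-1] = (name, value + "\n" + line[8:])
        else:
            name, _, value = line.partition(":")
            fields.append((name, value.strip()))
    body = "\n".join(lines)  # whatever the loop did not consume
    return fields, body
```

Both prefixes are 8 characters long, which is why a single `line[8:]` slice handles either form.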

I wouldn’t be too impacted by making the JSON form canonical, but it seems unnecessary given that we need to be able to parse the old format correctly anyway. A good implementation in packaging ought to be enough.

pf_moore · June 22, 2020, 11:49am

How hard it is basically comes down to “how much variation do you want to allow once you allow things that aren’t in the spec” combined with “how important is it to you to recover what the user actually specified in the original setup.py”. I have a bad habit of trying to be too liberal when there’s a gap in the spec.

Good point. I think I’ll focus on writing a packaging.metadata implementation, and we can tidy up the spec some other time.

dustin · June 22, 2020, 2:27pm

See “Add metadata validation” (pypa/packaging issue #147) and https://github.com/pypa/packaging/compare/master...di:metadata-validation?expand=1

This branch includes logic for metadata canonicalization as well as utilities to read metadata directly from source and built distributions (obviating the need for pkginfo).

I think an important part of this would also be externalizing the metadata validation that PyPI already does, so that PyPI can reuse packaging.metadata as well.

The thing that I got hung up on when I last had the time to work on this is how much of the WTForms validators should be “reimplemented” inside of packaging. They don’t lend themselves towards being easily reused, and they bring in a lot of extra “stuff” because they’re designed to be used with HTML forms.

I am planning to finish this in the next couple weeks to unblock work on setuptools and twine (I was planning to pair-program with @bernatgabor on this on Friday), but @pf_moore if you are interested in contributing to the branch in the meantime, I can move it to the pypa repo.

pf_moore · June 22, 2020, 2:49pm

That diff link doesn’t seem to work - can you fix it? I’m curious to see what you’ve done here. (I was looking at validation, but it didn’t go beyond the level of “is it a valid field name, is it a list or a string” in my mind - I’ve no idea what WTForms is or how it would fit in here…)

FWIW, the API I’m looking at is something like

meta = Metadata(name="foo", version="1.0")
meta.as_json()
meta.as_rfc822()

meta = Metadata()
meta["Description"] = "blah, blah, blah"
# defaultdict(list) style API for multi-use
meta["Classifier"].append("Something")

meta = Metadata.read_rfc822(...)

If you’re looking at validating what values fields can have, then I’m not even considering that yet. IMO, step one should be to have a reader/writer API that works in terms of strings. I don’t object to validation, but I prefer to take a “let’s walk before we run” approach (And I’d rather not have the design of the basic API be overshadowed by the complexities of validation).

dustin · June 22, 2020, 2:57pm

The diff link works for me, even in incognito – what are you seeing instead of a diff?

pf_moore · June 22, 2020, 3:00pm

Oh wait, sorry - it gives me an “Open a pull request” screen, which I assumed was wrong. The actual diff is off the bottom of my screen, so I missed it. Sorry. (Although is that really the only way to post a link to a diff between two branches? Seems clumsy!)

pf_moore · June 22, 2020, 3:02pm

Looking at what you’re doing, it looks like we’re working on independent aspects of a metadata API, so there’s no real clash, which is good.

FWIW, here’s my initial prototype implementation: https://gist.github.com/pfmoore/20f3654ca33f8b14f0fcb6dfa1a6b469

dustin · June 22, 2020, 3:40pm

It looks like there’s some overlap to me. You might be missing some of the files in that diff because GitHub (un)helpfully decided to minimize them. E.g. https://github.com/pypa/packaging/compare/master...di:metadata-validation?expand=1#diff-fd9d8caa5b7e0a1dbf31f15a58f1f87bR392-R422

pf_moore · June 22, 2020, 3:52pm

Oh wow, yes. Thanks Github

I think I’m going to continue with my implementation, at least for now, and we can work out how to merge the two APIs once we’re closer to complete. But FWIW, the key use case that prompted me to start writing this is that I want to be able to do:

metadata = Metadata(name="foo", version="1.0", requires_dist=["foo", "bar>2.0"])
content = metadata.as_rfc822() # Don't care about the method name

The rest is just what happened when I decided I didn’t want to maintain that code myself, and it should go into packaging

dustin · June 22, 2020, 4:27pm

I don’t see any real conflicts between the two. I might gently suggest passing a dict with the metadata instead of kwargs, but we could probably just make both work if we need to.

pf_moore · June 22, 2020, 6:24pm

No, nor do I, I just hadn’t looked very closely yet. I am fairly keen on the keyword argument approach, though. I definitely don’t want to use the punctuation-heavy {'name': 'foo', 'version': '1.0'} version for my use case (a lightweight syntax for building test package files). Yes, I could write a wrapper round the library version, but that seems counter-productive.

I used different classmethods to ensure the different constructor forms didn’t conflict. I like that style better than a bunch of keyword arguments each triggering different behaviour.
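The two constructor styles under discussion can coexist, as in the sketch below: keyword arguments for the lightweight case and a separate classmethod for dict input. This is not the code from the gist or from the branch; the class body, the `from_dict` name and the small `_HEADER_NAMES` table are all hypothetical, and the writer always emits the description in the body form.

```python
# Hypothetical mapping from keyword names to header names (far from complete).
_HEADER_NAMES = {
    "metadata_version": "Metadata-Version",
    "name": "Name",
    "version": "Version",
    "summary": "Summary",
    "classifier": "Classifier",
    "requires_dist": "Requires-Dist",
}


class Metadata:
    def __init__(self, **fields):
        self._fields = dict(fields)

    @classmethod
    def from_dict(cls, data):
        """Alternative constructor, so dict input needs no special keyword."""
        return cls(**data)

    def as_rfc822(self):
        lines = []
        for key, value in self._fields.items():
            if key == "description":
                continue  # written as the message body below
            header = _HEADER_NAMES[key]  # KeyError for fields not in the table
            values = value if isinstance(value, list) else [value]
            lines.extend(f"{header}: {item}" for item in values)
        text = "\n".join(lines) + "\n"
        description = self._fields.get("description")
        if description:
            text += "\n" + description
        return text
```

With this shape, `Metadata(name="foo", version="1.0", requires_dist=["foo", "bar>2.0"]).as_rfc822()` produces one `Requires-Dist` line per list entry, and `Metadata.from_dict({...})` gives the same result for callers that already hold a dict.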

dustin · June 22, 2020, 7:24pm

I see, that does seem a bit more ergonomic. I’m convinced.

Since all the validation logic can be added after the fact, perhaps I’ll try to split my branch in two and put the first half up today as a draft PR we can collaborate on?

ofek · June 22, 2020, 7:41pm

FYI GitHub’s pull request form is triggered by the expand=1. Just remove it: https://github.com/pypa/packaging/compare/master...di:metadata-validation