Python metadata format specification and implementation

I suspect we’re in agreement here, then.

Performance doesn’t seem terribly important on this. I must admit I’m not sure why it came up.

Agreed. JSON really isn’t intended for humans even if cleanly indented.

-Fred


FYI, we are getting close to an API for metadata in pypa/packaging issue #383, “Add a metadata API”. One question the API will need to answer, though, is how to decode and encode the data for files. Since none of the packaging specs provide a way to know the encoding up front, I suspect we will just assume UTF-8.

I did some work on that a few weeks back. From what I recall, the reality is that metadata was typically encoded with what looks like the user’s default encoding (often not UTF-8, although for many cases you can’t tell).

I have a database of metadata for all wheels on PyPI, so I can do some research. Probably not until the weekend, though.

One thing I would like is for the API not to fail with decoding errors. Falling back to something like bytes would be better, IMO. Alternatively, the email module (which is what you use to read metadata) can return “undecodable” values, as email.header.Header objects. Header objects aren’t very usable, but they are at least safe to process.
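To illustrate the second option, here is a minimal sketch of how the stdlib `email` module behaves under its default `compat32` policy. The sample bytes are made up for illustration. A field containing raw non-ASCII bytes comes back as an `email.header.Header` instead of raising:

```python
from email import message_from_bytes
from email.header import Header

# Hypothetical METADATA content; the Author field holds a Latin-1 byte (0xE9),
# which is not valid UTF-8 on its own.
raw = b"Metadata-Version: 2.1\nName: demo\nAuthor: Jos\xe9\n\n"

# message_from_bytes uses the legacy compat32 policy unless told otherwise.
msg = message_from_bytes(raw)

name = msg["Name"]      # plain ASCII value: returned as a str
author = msg["Author"]  # contains raw 8-bit data: returned as a Header

print(type(name).__name__, type(author).__name__)
```

Callers would still need an `isinstance(value, Header)` check before treating a field as text, which is the usability cost mentioned above.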

OK, that took less time than I thought :slightly_smiling_face: The 2039243 metadata files I have from PyPI break down as follows:

ascii: 1684799 (83%)
utf-8: 354416 (17%)
other: 28 (negligible)

So I guess UTF-8 is a reasonable choice. I must have got a skewed picture because those 28 caused my code to break and I spent way too long trying to work out what had gone wrong :wink:
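A breakdown like this can be sketched roughly as follows. This is not necessarily how the numbers above were produced; `classify_encoding` is a hypothetical helper that simply tries strict ASCII first, then strict UTF-8, and labels everything else "other":

```python
from collections import Counter


def classify_encoding(raw: bytes) -> str:
    """Label raw metadata bytes as 'ascii', 'utf-8', or 'other'."""
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"


# Made-up sample inputs standing in for real METADATA files.
samples = [
    b"Author: Jose",                       # pure ASCII
    "Author: Jos\u00e9".encode("utf-8"),   # valid UTF-8, not ASCII
    "Author: Jos\u00e9".encode("latin-1"), # neither ASCII nor valid UTF-8
]
counts = Counter(classify_encoding(s) for s in samples)
```

Because ASCII is a strict subset of UTF-8, the "ascii" files are also valid UTF-8, so assuming UTF-8 covers both of the large buckets.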

We have no standard that says metadata must be in UTF-8 but some quick experiments suggest that current setuptools and wheel use UTF-8, so I think the 28 are probably historical outliers. However, my data is all from wheels, so I can’t say for certain the proportions hold true for sdists.

What do people think about an amendment to the metadata spec to require that whenever tools serialise metadata to bytes, they must use UTF-8? I’m inclined to think that’s a clarification and doesn’t need a PEP, but I don’t know if that’s something I can say, and in any case I’d rather not make such a statement without consensus…


Thanks for checking that! We will definitely assume UTF-8 then.

+1 to updating the spec to say UTF-8 w/o a PEP. Based on your analysis I think the community has already reached consensus for us. :grin:


I know that in a few places in our code we try UTF-8 first and then fall back to a “default” encoding, since invalid UTF-8 is easy to detect and very few encodings remap the 0-127 ASCII range.
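A minimal sketch of that try-then-fall-back pattern, assuming Latin-1 as the “default” (the actual fallback used in those code paths isn't stated here, and `decode_metadata` is a hypothetical name):

```python
def decode_metadata(raw: bytes, fallback: str = "latin-1") -> str:
    """Decode as strict UTF-8, falling back to another encoding on failure."""
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: Latin-1 as the fallback. It maps every byte to a code
        # point, so this branch cannot raise, though the text may be wrong
        # if the file was really written in some other 8-bit encoding.
        return raw.decode(fallback)


utf8_text = decode_metadata("Author: Jos\u00e9".encode("utf-8"))
legacy_text = decode_metadata("Author: Jos\u00e9".encode("latin-1"))
```

The ASCII portion of a file comes out the same either way, which is why the fallback tends to be good enough for field names and most values.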