Encoding of files in the .dist-info directory

I’m writing some code that reads package distributions and returns their metadata. I was somewhat surprised to find that I couldn’t locate anything that defines the encoding to be used for the METADATA file in a wheel.

importlib.metadata assumes UTF-8 for metadata files in installed .dist-info directories, and I honestly can’t imagine wanting it to be anything other than UTF-8, but I think we should be explicit and mandate that somewhere.
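For concreteness, here is a minimal sketch of the kind of reader I have in mind. It assumes the wheel contains exactly one top-level `*.dist-info/METADATA` entry, and the function name `read_wheel_metadata` is just an illustration, not an existing API. It decodes strictly as UTF-8, so a file in any other encoding fails loudly rather than being silently misread.

```python
import io
import zipfile
from email.parser import Parser


def read_wheel_metadata(wheel_file):
    """Return the parsed METADATA of a wheel, decoding it as UTF-8.

    `wheel_file` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(wheel_file) as zf:
        # Assumption: exactly one top-level *.dist-info/METADATA entry.
        names = [
            name for name in zf.namelist()
            if name.count("/") == 1 and name.endswith(".dist-info/METADATA")
        ]
        if len(names) != 1:
            raise ValueError(f"expected one METADATA file, found {len(names)}")
        # Strict decoding: non-UTF-8 bytes raise UnicodeDecodeError.
        text = zf.read(names[0]).decode("utf-8")
    # METADATA uses an email-header-like format, so the stdlib parser works.
    return Parser().parsestr(text)
```

The result behaves like a mapping, so `read_wheel_metadata(path)["Name"]` gives the project name.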

Broadening the scope somewhat, I propose that we mandate that all files in .dist-info directories must be UTF-8 encoded text files. If we make that a blanket rule, we don’t have to worry about it in future. (JSON and TOML count as “UTF-8 encoded text” so I don’t see there being an issue if we want structured formats). I believe this is simply codifying existing practice, so I hope it’s not controversial.
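As a rough illustration of what the blanket rule would mean in practice, a tool could check an installed .dist-info directory like this. This is only a sketch: the helper name `non_utf8_files` is hypothetical, and it checks nothing beyond "the bytes decode as UTF-8".

```python
from pathlib import Path


def non_utf8_files(dist_info_dir):
    """List files under a .dist-info directory that are not valid UTF-8.

    Paths are returned relative to `dist_info_dir`, using forward slashes.
    """
    root = Path(dist_info_dir)
    bad = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            # Strict decode; this covers METADATA, RECORD, JSON files, etc.
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path.relative_to(root).as_posix())
    return bad
```

An empty list means every file in the directory already satisfies the proposed rule.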

So, to be explicit, I would like to change the “Binary distribution format” and “Recording installed projects” specifications in the Python Packaging User Guide to state that all files in the .dist-info directory must contain UTF-8 encoded text.

Does anyone have any objection to this? Does anyone feel this needs a PEP?

I’m sure some parser behind some company firewall is parsing it as ASCII, the Windows locale encoding, etc. Since it’s kind of a breaking change, maybe let’s put it in a new PEP, though I’m not expecting it to be controversial. Or maybe it doesn’t need a PEP, but there should be a big red warning somewhere and everyone should be notified about it.

I don’t think this needs a PEP.

If someone has a custom non-public parser, they’re probably also capable of debugging the situation. Sure, they might have different priorities or whatever, but that’s what you opt for when you roll out your own. :upside_down_face:

For such users, having this be documented in a PEP doesn’t really change anything for them – they won’t notice until we make a change somewhere in the tooling, which will end up referencing the spec update anyway.

IIRC I raised this issue specifically with pip and was told not to worry about it. pip already opens all these files as UTF-8 (some via pkg_resources), so anything that works with it should already be compliant.

In light of the above, I’d suggest just going ahead and amending the existing PEPs (as a clarification).
