I’m writing some code that reads package distributions and returns their metadata. I was somewhat surprised to find that I couldn’t locate anything that defines the encoding to be used for the METADATA file in a wheel.
importlib.metadata assumes UTF-8 for metadata files in installed .dist-info directories, and I honestly can’t imagine wanting it to be anything other than UTF-8, but I think we should be explicit and mandate that somewhere.
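To illustrate the existing behaviour: `PathDistribution` reads METADATA from a `.dist-info` directory with `encoding="utf-8"` internally, so non-ASCII fields round-trip correctly today even though nothing mandates it (the demo `.dist-info` below is made up):

```python
# Sketch: importlib.metadata decodes installed METADATA as UTF-8.
import pathlib
import tempfile
from importlib.metadata import PathDistribution

# Build a minimal throwaway .dist-info directory (names are hypothetical).
tmp = pathlib.Path(tempfile.mkdtemp())
dist_info = tmp / "demo-1.0.dist-info"
dist_info.mkdir()
(dist_info / "METADATA").write_text(
    "Metadata-Version: 2.1\nName: demo\nVersion: 1.0\nAuthor: Zoë\n",
    encoding="utf-8",
)

dist = PathDistribution(dist_info)
print(dist.metadata["Author"])  # the non-ASCII name survives the round trip
```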
Broadening the scope somewhat, I propose that we mandate that all files in .dist-info directories must be UTF-8 encoded text files. If we make that a blanket rule, we don’t have to worry about it in future. (JSON and TOML count as “UTF-8 encoded text” so I don’t see there being an issue if we want structured formats). I believe this is simply codifying existing practice, so I hope it’s not controversial.
I’m sure some parser behind a company firewall somewhere is reading these files as ASCII, a Windows locale encoding, etc. Since this is technically a breaking change, maybe it should go in a new PEP, though I’m not expecting it to be controversial. Or maybe it doesn’t need a PEP, but there should be a big red warning somewhere and everyone should be notified about it.
If someone has a custom non-public parser, they’re probably also capable of debugging the situation. Sure, they might have different priorities, but that’s what you sign up for when you roll your own.
For such users, having this be documented in a PEP doesn’t really change anything for them – they won’t notice until we make a change somewhere in the tooling, which will end up referencing the spec update anyway.
IIRC I raised this issue with pip specifically and was told not to worry about it. pip already opens all these files as UTF-8 (some via pkg_resources), so anything that works with pip should already be compliant.