In warehouse#10072, flit is attempting to apply PEP 503 normalization rules to distribution filenames. This is a shift from the status quo, in which Setuptools allowed `.` in metadata file names and distribution artifacts, and it means that Warehouse is ill-equipped to handle the divergence.
It’s not obvious from the PEP why normalization is necessary.
When I read that PEP, it was my understanding that normalized names are meant primarily to:
- force a collision of names that vary only by case or
- allow distribution names to be referenced by the normalized form within the API.
And this approach was fine because it did not impose constraints on a project using those allowed characters in package names. Projects like backports.ssl_match_hostname could continue to use dots in the package name and the user experience would match the external experience. The names with the dots would appear in metadata filenames, project names, and distribution artifacts.
Since then, both the sdist and wheel specifications have evolved to include a modified form of PEP 503 normalization, in spite of the status quo using a less aggressive form of normalization.
If this normalization is enforced or the tools begin honoring the specification, it’s going to lead to a situation where projects with dots in the names become second-class citizens. The names, in addition to being lowercased, will also appear in user interfaces (pip install logs, PyPI files listing, …) with dots replaced with underscores. This inconsistency in presentation will inevitably lead to confusion (is it zope.interface or zope_interface?) and will incentivize projects (current and prospective) not to use a dot in the name. Such an experience will also provide an additional reason to avoid namespace packages (which inherently have a dot in the name).
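To make the difference concrete, here is a minimal sketch of the two forms. The first function follows the rule as written in PEP 503; the second assumes the modified rule in the filename specifications uses the same pattern but joins with `_`. Both function names are made up for illustration.

```python
import re


def pep503_normalize(name: str) -> str:
    """Rule from PEP 503: runs of '-', '_' and '.' become a single '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def filename_escape(name: str) -> str:
    """Assumed filename form: same pattern, but runs become a single '_'."""
    return re.sub(r"[-_.]+", "_", name).lower()


for project in ("zope.interface", "backports.ssl_match_hostname"):
    # Print the name as the author wrote it, then the two derived forms.
    print(project, pep503_normalize(project), filename_escape(project))
```

In both forms the dot is gone, so the name a user sees in a filename or an index URL no longer matches the name the project chose.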
This same disadvantage was at play when `-` was normalized to `_`. As an early user of setuptools, I remember being confused by the swapping of `-` for `_`. I wanted to use `-` because I wanted to create a separation, whereas `_`, to me as a Python programmer, felt like a combining of two tokens. But I saw that Setuptools would produce egg-info with `_` in it, so I was unsure if I was using `-` incorrectly. Because of this confusion, I avoided using `-` or `_` in my project names.
When I dug into it, I learned there was a rationale behind the mangling of `-` to `_` in metadata filenames: to allow `-` to be used reliably as a separator for fields in the filename. I understood the reasoning behind it and so became more accepting of `-` in a project name.
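The separator rationale is easy to see with a naive field split. This is only a sketch: `split_wheel_filename` is a hypothetical helper, the filenames are invented, and real tools parse more carefully than this.

```python
def split_wheel_filename(filename: str) -> list[str]:
    """Naively split a wheel filename into its dash-separated fields."""
    stem = filename.removesuffix(".whl")  # requires Python 3.9+
    return stem.split("-")


# With '-' escaped to '_' in the name, the fields line up as expected.
escaped = split_wheel_filename("my_project-1.0-py3-none-any.whl")

# With '-' left in the name, the name itself is split across two fields.
unescaped = split_wheel_filename("my-project-1.0-py3-none-any.whl")

print(escaped)
print(unescaped)
```

In the second case the first field is `my` rather than the project name, and there is one field too many, which is exactly the ambiguity the mangling avoids. A dot in the name causes no such problem for this split.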
I’d like to minimize the amount of mutation that happens to a name as a project passes through the packaging ecosystem.
One thing that bugs me about the PEP 503 normalization is that I don’t even understand the rationale behind normalizing `.` to `_` (“dot normalization”). I read through the discussion on the PEP, but saw no justification, only the declaration. Does anyone know why the dot was included in those normalization rules (@dstufft)?
By my understanding, there are no outstanding issues with including the dot in any of the places where it currently appears (metadata filename, distribution artifacts).
To that end, I propose considering the following options:
- Remove the dot from normalization rules entirely, allowing package names to ultimately differ only by `.` versus `_`. This change would require a transition from the current expectation in the repository API and would likely cause a lot of disruption, but would ultimately allow for simpler normalization rules.
- Limit the scope of PEP 503 to normalization for repository APIs (as declared). Advise implementations not to mangle/normalize names except in internal implementations or where the design requires it (normalizing `-` to `_` in metadata filenames to allow `-` as the separator, lowercasing to avoid variable behavior based on file system case sensitivity).
- Use a different separator that’s not part of a valid distribution name. Since `.`, `-`, and `_` are explicitly allowed in the distribution name (per PEP 426), use another separator (a space or …) in metadata filenames and artifacts. Avoid normalization altogether or limit it to lowercasing for package metadata and distribution artifacts.
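As a rough illustration of the second option, a filename escape limited to what the design requires might look like the sketch below. `minimal_escape` is a hypothetical name and not part of any existing tool or specification.

```python
def minimal_escape(name: str) -> str:
    """Hypothetical minimal escape for filenames.

    Replaces only '-' (needed as the field separator) and lowercases
    (to avoid case-sensitivity differences between file systems).
    Dots and underscores are left exactly as the project wrote them.
    """
    return name.replace("-", "_").lower()


for project in ("zope.interface", "My-Project", "backports.ssl_match_hostname"):
    print(project, minimal_escape(project))
```

Under this sketch, zope.interface stays zope.interface in filenames, and only names containing `-` or uppercase letters change at all.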
Users already have a highly constrained space of characters for distribution names. Can we come up with a solution that doesn’t encumber the few non-alphanumeric characters that are available?