Revisiting distribution name normalization

In warehouse#10072, flit is attempting to apply PEP 503 normalization rules to distribution filenames. This is a shift from the status quo, where Setuptools allowed . in metadata filenames and distribution artifacts, and Warehouse is ill-equipped to handle the divergence.

It’s not obvious from the PEP why normalization is necessary.

When I read that PEP, my understanding was that normalized names are meant primarily to:

  • force a collision of names that vary only by case or ., -, or _.
  • allow distribution names to be referenced by the normalized form within the API.

And this approach was fine because it did not impose constraints on a project using those allowed characters in package names. Projects like zope.interface and backports.ssl_match_hostname could continue to use dots in the package name and the user experience would match the external experience. The names with the dots would appear in metadata filenames, project names, and distribution artifacts.
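For reference, the PEP 503 rule under discussion can be sketched in a few lines. The regular expression is the one given in the PEP; the example names are only illustrations.

```python
import re


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503.

    Runs of "-", "_" and "." collapse to a single "-", and the
    result is lowercased.
    """
    return re.sub(r"[-_.]+", "-", name).lower()


# Names that differ only by case or by ".", "-", "_" collide.
names = ["zope.interface", "Zope_Interface", "zope-interface"]
normalized = {normalize(n) for n in names}
print(normalized)  # {'zope-interface'}
```

Under this reading, the normalized form is only a lookup key; the spelling the project chose is not altered by the function itself.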

Since then, both the sdist and wheel specifications have evolved to include a modified form of PEP 503 normalization, in spite of the status quo using a less aggressive form of normalization.
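A minimal sketch of that modified form, assuming my reading of the current wheel filename rules is right (PEP 503 normalization followed by replacing - with _). The helper name `filename_escape` is hypothetical and not taken from any tool.

```python
import re


def filename_escape(name: str) -> str:
    """Sketch of the filename form of a distribution name.

    Assumption: runs of "-", "_" and "." become a single "_" and the
    result is lowercased, which is equivalent to PEP 503 normalization
    followed by replacing "-" with "_".
    """
    return re.sub(r"[-_.]+", "_", name).lower()


print(filename_escape("zope.interface"))  # zope_interface
```

This is the step where the dot disappears from the artifact name, which is the behaviour being questioned here.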

If this normalization is enforced or the tools begin honoring the specification, it’s going to lead to a situation where projects with dots in the names become second-class citizens. The names, in addition to being lowercased, will also appear in user interfaces (pip install logs, PyPI files listing, …) with dots replaced with underscores. This inconsistency in presentation will inevitably lead to confusion (is it zope.interface or zope_interface?) and will incentivize projects (current and prospective) not to use a dot in the name. Such an experience will also provide an additional reason to avoid namespace packages (which inherently have a dot in the name).

This same disadvantage was at play when - was normalized to _. As an early user of setuptools, I remember being confused by the swapping of - to _. I wanted to use - because I wanted to create a separation, whereas _, to me as a Python programmer, felt like a joining of two tokens. But I saw that Setuptools would produce egg-info with _ in it, so I was unsure if I was using - incorrectly. Because of this confusion, I avoided using - and _ in my project names.

When I dug into it, I learned there was a rationale behind the mangling of - to _ in metadata filenames: to allow - to be used reliably as a separator for fields in the filename. I understood the reasoning behind it and so became more accepting of - in a project name.
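To illustrate why - needs to be reserved as the field separator, here is a rough sketch of splitting a wheel filename into its fields. The function name and the example filenames are made up for illustration; the field order (name, version, optional build tag, Python tag, ABI tag, platform tag) follows the wheel filename convention.

```python
def split_wheel_filename(filename: str) -> dict:
    """Split a wheel filename into its "-"-separated fields."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        name, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected number of fields in {filename!r}")
    return {
        "name": name,
        "version": version,
        "build": build,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


# With "-" escaped to "_" in the name, the split is unambiguous.
ok = split_wheel_filename("zope_interface-5.4.0-cp39-cp39-manylinux1_x86_64.whl")

# With a literal "-" left in the name, the fields shift silently.
bad = split_wheel_filename("my-pkg-1.0-py3-none-any.whl")
```

In the second call, "my" is taken as the name and "pkg" as the version, which is the kind of misparse the escaping exists to prevent. A dot in the name causes no such problem for this split.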

I’d like to minimize the amount of mutation that happens to a name as a project passes through the packaging ecosystem.

One thing that bugs me about the PEP 503 normalization is I don’t even understand the rationale behind normalizing . to _ (“dot normalization”). I read through the discussion on the PEP, but saw no justification, only the declaration. Does anyone know why the dot was included in those normalization rules (@dstufft)?

By my understanding, there are no outstanding issues with including the dot in any of the places where it currently appears (metadata filename, distribution artifacts).

To that end, I propose to consider:

  • Remove the dot from normalization rules entirely, allowing packages to ultimately vary by . and -/_. This change would require a transition from the current expectation in the repository API and would likely cause a lot of disruption, but would ultimately allow for simpler normalization rules.
  • Limit the scope of PEP 503 to normalization for repository APIs (as declared). Advise implementations not to mangle/normalize names except in internal implementations or where the design requires it (normalizing - to _ in metadata filenames to allow - as the separator, lowercasing to avoid variable behavior based on file system sensitivity).
  • Use a different separator that’s not part of a valid distribution name. Since ., -, and _ are explicitly allowed in the distribution name (per PEP 426), use another separator (@ or ! or + or , or space or …) in metadata filenames and artifacts. Avoid normalization altogether or limit it to lowercasing for package metadata and distribution artifacts.

Users already have a highly constrained space of characters for distribution names. Can we come up with a solution that doesn’t encumber the few non-alphanumeric characters that are available?

Why?

I’d consider that to be a bug in the UI. Any time a package name is reported to the user, the display form of the name should be used. I will concede that there are a lot of such bugs around, but I suspect that’s mostly because most people don’t really care that much about such issues. But I would completely support anyone who wanted to work through the various tools and make them consistently report package names in the form used by the package (in its Name metadata entry). It’s probably a lot of work, but UX is important, so that isn’t an argument for not doing it. (It may be an argument for not having the resources to do it, of course…)

I agree that presentation should be consistent. However, consistency shouldn’t be blind - it should be clear where display names are used, and where normalised names are used. So, for example, normalised names should be used (consistently!) in wheel and sdist names, and in package index pages. Display names should be used (again, consistently) whenever a project name is reported to the user.
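As a minimal sketch of that split (the class and method names are hypothetical, not taken from pip or Warehouse): look projects up by normalised name, but report whatever was recorded in the Name metadata.

```python
import re


def normalize(name: str) -> str:
    """PEP 503 normalisation."""
    return re.sub(r"[-_.]+", "-", name).lower()


class ProjectIndex:
    """Toy index keyed by normalised name, storing the display name."""

    def __init__(self) -> None:
        self._display_names: dict[str, str] = {}

    def register(self, name: str) -> None:
        # The normalised form is the key; the name as written by the
        # project is kept untouched as the value.
        self._display_names[normalize(name)] = name

    def display_name(self, requested: str) -> str:
        # Any spelling the user types resolves to the same project,
        # and the project's own spelling is what gets reported.
        return self._display_names[normalize(requested)]


index = ProjectIndex()
index.register("zope.interface")
print(index.display_name("Zope_Interface"))  # zope.interface
```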

Maybe the normalisation rules should have allowed a dot. But then the question is, are zope_interface and zope.interface two different packages? If so, then there’s a serious risk of confusable name attacks. If they aren’t, then what is the normalised form for them both? Why is a dot more reasonable than an underscore? Wouldn’t making the normalised form a dot, simply make names with underscores feel like second class citizens?

And of course, that’s history in any case. The normalisation rules chose to not allow a dot, so the current reality is that foo.bar and foo-bar represent the same package, which will be recorded in a package index as foo_bar. Changing those rules now would be a major disruption, and would definitely involve a PEP, and an extensive transition process, at the minimum.

There should be no mutation. The correct form of the package name must always be recorded in the Name metadata. For me, “mutation” means that the correct name of the project is no longer recoverable, but I don’t think that’s what you mean. However, I’m not sure what you do mean, beyond “some tools don’t display the project name correctly”.

The key issue is that normalisation implies uniqueness¹. Two package names with the same normalised form must be the same package. And conversely, two package names that normalise differently must be different packages.

So “including dots when normalising” changes the set of valid and distinct package names, and that is an issue.
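A small sketch of what that means in practice. `normalize` is the PEP 503 rule; `normalize_keep_dots` is a hypothetical variant that leaves dots alone, included only to show how the set of distinct names would change.

```python
import re
from collections import defaultdict


def normalize(name: str) -> str:
    """PEP 503 normalisation: ".", "-" and "_" are all equivalent."""
    return re.sub(r"[-_.]+", "-", name).lower()


def normalize_keep_dots(name: str) -> str:
    """Hypothetical variant: only "-" and "_" are equivalent."""
    return re.sub(r"[-_]+", "-", name).lower()


def find_collisions(names, rule):
    """Group names by normalised form; return groups with more than one spelling."""
    groups = defaultdict(set)
    for name in names:
        groups[rule(name)].add(name)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


names = ["zope.interface", "zope_interface", "QtPy"]

# Under PEP 503 the two zope spellings are the same package.
same = find_collisions(names, normalize)

# Under the hypothetical rule they become two distinct packages.
distinct = find_collisions(names, normalize_keep_dots)
```

Here `same` reports one colliding group and `distinct` is empty, so the two spellings could be registered by different people under the second rule.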

As you say, this would involve significant disruption. If you feel it’s worth it, I suggest you raise a PEP. I’d be against such a PEP myself, for reasons I’ve mentioned above.

We’d still need normalisation rules for sdist filenames, wheel filenames and dist-info directory names. And maybe other places, too. Why would we have different rules? Having a single normalisation rule is the only reasonable approach to take.

Again, this would be a huge disruption. A PEP would be needed (and any such PEP would have to describe the transition process).

To be absolutely clear, I do have a lot of sympathy with the idea that projects should be allowed to make their own choices around the format of the project name. And I definitely think that tools should work a lot harder to respect those choices, and report the correct name for a project (call it the “display name”, if you want). It’s basic politeness, IMO. But I don’t think any of that impacts normalisation, and I don’t think changing the normalisation rules will fix the issue (speaking as a pip maintainer, I know we display normalised names to the user when we should use some sort of “display name”, and no matter what the normalisation rules are, that’s wrong).

¹ You might argue that this isn’t what normalisation should be for. The term “normalisation” means different things in different contexts, and maybe as a mathematician I am putting too much stock in the idea of a canonical form representing all equivalent names. But regardless of that, the reality is that tools do make the assumption that the normalised form of a name is canonical, so even if it’s not necessarily the interpretation, it is the de facto behaviour.

Just sharing my own personal experience as a Python package user, developer and maintainer in both the PyPI and conda ecosystems: consistent display of a package’s display name vs. its normalized name is certainly important.

However, particularly in the large scientific Python community where both users and package authors are typically less well versed in the details of Python packaging but require large and diverse dependency stacks across different package managers to do their work, package names not following a consistent, standardized convention has posed no end of practical problems, well beyond mere aesthetics. I’ve lost count of the number of times colleagues (and I) have wasted time and effort trying to remember whether the import, PyPI package or Conda package name did or didn’t contain a _, - or ., or was UpperCamelCase or lowercase (since each can be different).

Within the Conda ecosystem, names are generally normalized to lowercase, no dot, - as separators (though for common cases, auto-generated metapackages exist as aliases for _ vs -), same as Linux and other package managers, and I’ve found it to be much easier and more consistent to recall package names than on PyPI. And in many cases (e.g. QtPy, a top-200 PyPI download package I maintain that sees heavy use on conda as well), the normalized name (qtpy) is actually the import name, not the project name in the metadata (QtPy) that someone long-forgotten set nearly a decade ago, when packaging conventions and knowledge were not as established as they are now.

Certainly, I don’t suggest requiring existing projects change or normalize their names, but at least as both a package user and author, normalizing user-provided names more aggressively on input, rather than less, to reduce the chance of package name confusion over aesthetic differences and the amount that users need to recall and worry about such things, is preferable to always having the display name aesthetically match whatever I (or the original author, who’s long since moved on) typed into the name field many years ago (though of course, tools still can and should display that name to users).

To add, as a package user, I’d rather work with a package with a consistent name following standard conventions that was easy to remember, than one with oddball aesthetics. As a package author, I’d much rather minimize the frustration and maximize the ease with which users install and update my package than impose particular aesthetic sensibilities, and there being an established standard to follow is much preferable to having to Google and bikeshed over how I should capitalize and punctuate the name that I will be stuck with. In fact, more normalization would, if anything, give me more confidence if I really did want to use less conventional punctuation or capitalization, as I would be more confident that users would still find my package and not one benignly or maliciously similar.

Finally, the risks of dependency confusion, typosquatting and infrastructure attacks are not merely theoretical: they have already caused major trouble for npm, there have been attacks on PyPI, and they are only likely to increase. In my view, opening the door to a whole new class of such attacks, never mind a greatly increased chance of benign developer confusion and wasted effort, is simply not worth it for a small amount of additional “creativity” (or as many would see it, the lack of a consistent convention) in package naming.

A previous related discussion: Clarify naming of .dist-info directories

This has significant risk of typo-squatting, deviates from all of the existing tooling/user expectations and is an exceedingly difficult change to communicate about.

I think this is not something that we should do. There are certainly people who specify a dependency on zope.interface as zope-interface / zope_interface, and it is not clear to me if there’s any way to transition without just breaking them, or requiring a significant effort put toward transitioning.

For display names, sure. I’m fine with that.

For normalised names that are used as part of the transport and distribution mechanisms (eg: file names of distribution files, or dist-info folders), that doesn’t seem like a good idea. Expanding this scope now means that we effectively create a surface for typosquatting attacks and such.

I’d go so far as to say that build tooling should record the original user intent as the metadata value. And that any tooling that wishes to display names should display those names as-is, with no modifications.

This does not mean that we need to relax normalisation rules. This also means that file names should follow the PEP 503 normalisation rules. Basically, I think we should be doing what the standards say we should do, in this case.

I’m +1 with requiring PyPI / pip etc to use metadata names as-is for presentation (they already do, I believe, and we should treat situations where they don’t as bugs).

I’m a strong -1 on changing what the normalisation rules are.

Can we move this forwards? There are PRs open on Warehouse and on the spec which depend on the outcome of this discussion, and there will be a change to make in Flit either way as well.

Obviously this isn’t a vote, but it seems like most people interested in this topic are in favour of making normalisation more consistent (the position @pf_moore and @pradyunsg have set out) over preserving as much of the distribution name as possible in wheel filenames (@jaraco’s argument). I’m pointing this out partly so that if there are more preservation-over-normalisation fans who have been quiet so far, they can speak up.

I think the biggest argument against normalisation is that tools consuming wheels won’t be able to assume normalised names, because of all the existing wheels which aren’t named that way. I’m still in favour of making normalisation consistent going forwards everywhere a distribution name gets embedded in a filename, though.
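As a rough sketch of what that means for a consuming tool (the function name and filenames are hypothetical): rather than assuming the filename already holds the normalised name, normalise both sides before comparing. This assumes the name field itself contains no literal -, which holds for wheels whose tools already escaped - to _.

```python
import re


def normalize(name: str) -> str:
    """PEP 503 normalisation."""
    return re.sub(r"[-_.]+", "-", name).lower()


def wheel_is_for_project(filename: str, project: str) -> bool:
    """Check whether a wheel filename belongs to a project.

    The first "-"-separated field is the distribution name. It may be
    in an older, less normalised form, so both names are normalised
    before they are compared.
    """
    dist_name = filename.split("-", 1)[0]
    return normalize(dist_name) == normalize(project)


# An existing wheel that kept the dot and a newer one that did not
# both match the same project.
old_style = wheel_is_for_project("zope.interface-5.4.0-py3-none-any.whl", "zope-interface")
new_style = wheel_is_for_project("zope_interface-5.4.0-py3-none-any.whl", "zope.interface")
```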

What precisely is needed? Presumably a change to the wheel spec is the key here. The standard PyPA processes apply here, and they say

If a change being considered this way has the potential to affect software interoperability, then it must be escalated to the distutils-sig mailing list for discussion, where it will be either approved as a text-only change, or else directed to the PEP process for specification updates.

But it doesn’t say how the list will approve such a change, so if we don’t get consensus here, what happens next? I’m reluctant to suggest that the question goes to the relevant PEP delegate, as that’s me and I’ve expressed my opinion here, so obviously that’s what I’m going to approve…

I guess there’s always a PyPA vote as a fallback option, though.

I guess my immediate aim is to prod @jaraco and anyone else who prefers the name preservation option, to see whether they want to continue to make that case, or accept PEP 503 normalisation (even if grudgingly). At the moment, I feel like the discussion is in limbo - there’s a rough consensus, but not a decisive one, so I’m holding off on e.g. fixing Flit to lowercase the name. I hope we can reach an accepted consensus without a formal vote.

I hadn’t spotted that note. Do you think it’s appropriate to replace ‘distutils-sig’ with a pointer to this forum? There hasn’t been any significant discussion on distutils-sig since some posts on the ‘archive this list’ thread in May, so it seems pretty clear that this is where packaging discussions now happen in practice. But I’m happy to send a message to distutils-sig about it if you like.

Yes, absolutely. IMO, everywhere that we refer to distutils-sig should be redirected to Discourse these days. Someone just needs to update the document, as far as I’m concerned (not that I have any more authority than anyone else). I don’t know where the source of that document is stored - maybe someone who does can sort it out?

I’m happy to get to it at some point, but if anyone else wants to update it sooner, the source of that page is here: pypa.io/specifications.rst at main · pypa/pypa.io · GitHub

To help drive towards a decisive consensus, I’m for normalization.

FYI, I opened an issue and a pull request to update this across the site.

I apologize that I kicked off this conversation (twice it seems), but haven’t had the time to follow up. I have a lot I’d like to share/communicate around this issue but just have too many other things going on right now, so I’ll concede the status quo, which I’ll summarize:

  • PEP 503 normalized names should be used for package indexes.
  • For distribution filenames and metadata filenames, the alternate spec should be used.
  • Setuptools should adapt to honor this spec (I welcome a bug report or pull request).

In the future, I hope to revive this effort, but I fear it may be too late, as Flit and Setuptools will have adopted the name mangling and it will be a more difficult transition to support.

Thanks Jason for letting us move on. 🙂

I think you’re right that change is likely to be a very tough sell later, once the specs and the relevant implementations are in agreement.

I’ve been investigating this today, and it doesn’t need any changes in setuptools to “fix” it. The dist-info name is generated in wheel:

It should be a +1/-1 patch to fix (outside of any tests that this might need).

Filed Follow PEP 503's normalisation for dist-info folder name · Issue #440 · pypa/wheel · GitHub for it.
