Clarify naming of .dist-info directories

jaraco · November 22, 2020, 10:19am

I disagree. The name as it appears in the metadata is currently satisfactory, and I would strongly oppose storing a normalized version. Instead, normalization for any given purpose should suit that purpose; it should be an implementation detail, applied for that purpose and hidden otherwise. Introducing a display name would dramatically complicate the user’s (packager and consumer) experience.

uranusjr · November 22, 2020, 10:29am

Preferring normalisation in .dist-info has practical benefits to tools, since all the variants are in fact the same package. This is especially important for wheels, since the name you get from the wheel’s file name already went through the lossy normalisation process. If the .dist-info directory name is unnormalised, tools would need to iterate through the entire list of the zip content to pick out the .dist-info directory that looks like it. If the .dist-info name is guaranteed (or at least preferred) to be normalised, on the other hand, implementers can provide a happy path that uses ZipFile.getinfo() to locate METADATA directly, which is both quicker and more reliable.
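As a rough sketch of the happy path described above: the helper names `build_demo_wheel` and `locate_metadata` are hypothetical, and the archive is a toy stand-in for a real wheel. The direct lookup only succeeds when the .dist-info directory name matches the name taken from the wheel file name; otherwise the code has to fall back to scanning every entry.

```python
import io
import zipfile


def build_demo_wheel(dist_info_dir: str) -> bytes:
    """Build a tiny in-memory archive shaped like a wheel (illustration only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("foo_bar/__init__.py", "")
        zf.writestr(f"{dist_info_dir}/METADATA", "Name: foo.bar\nVersion: 1.0\n")
    return buf.getvalue()


def locate_metadata(wheel_bytes: bytes, name: str, version: str) -> str:
    """Return the archive path of METADATA, given name/version from the wheel file name."""
    expected = f"{name}-{version}.dist-info/METADATA"
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as zf:
        try:
            # Happy path: a direct lookup, which only works if the directory
            # name matches the (normalised) name from the wheel file name.
            return zf.getinfo(expected).filename
        except KeyError:
            pass
        # Fallback: scan every entry for a top-level *.dist-info/METADATA.
        candidates = [
            entry
            for entry in zf.namelist()
            if entry.count("/") == 1 and entry.endswith(".dist-info/METADATA")
        ]
    if len(candidates) != 1:
        raise ValueError(
            f"expected exactly one .dist-info directory, found {len(candidates)}"
        )
    return candidates[0]
```

With a normalised directory the `getinfo()` call answers immediately; with `foo.bar-1.0.dist-info` the same request falls through to the scan, which is the slower and more error-prone route.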

It may sound unintuitive, but encouraging unnormalised names actually makes users feel obliged to not use separators (not the other way around) for this reason. Since the name variants ultimately refer to the same package, there are situations when tools only have access to the normalised version (e.g. Simple API and wheels). Finding a directory on a filesystem (or in a wheel) is like random access, where “unnormalisation” requires non-trivial logic, and inevitably some implementers would implement some things wrong sometimes, no matter how clearly the standard spells out the rules. Users see that some tools fail to handle these special characters, and therefore start to avoid them. It would save everyone energy to prefer the normalised version as keys, since tools are more likely to get things right this way, giving users more confidence in using those characters.

I can understand that the author of a package would want the package to be referred to by its original name. And that information is currently available, in METADATA. The pipx issue is not really tied to .dist-info normalisation IMO, since it does have access to the unnormalised name, and only chooses not to use it. And like the user obligation situation, pipx does not want to do that partly because matching jaraco.clipboard-1.0.dist-info to pipx install jaraco-clipboard (which package users would expect to work, no matter what name the package author prefers) is non-trivial. IMO tools may actually be more willing to read METADATA if we can make that file more reliably found.
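To illustrate why that matching takes some care, here is a minimal sketch. The function names are hypothetical; the comparison assumes PEP 503-style normalisation (runs of `-`, `_` and `.` collapsed to `-`, then lowercased) and assumes the version part of the directory name contains no hyphen.

```python
import re


def canonicalize(name: str) -> str:
    """PEP 503-style normalisation: collapse runs of -, _ and . into one hyphen."""
    return re.sub(r"[-_.]+", "-", name).lower()


def dist_info_matches(dir_name: str, requested: str) -> bool:
    """Check whether a .dist-info directory belongs to the requested project."""
    suffix = ".dist-info"
    if not dir_name.endswith(suffix):
        return False
    stem = dir_name[: -len(suffix)]
    # Split on the LAST hyphen: an unnormalised name may itself contain
    # hyphens, while the version part is assumed not to.
    name, sep, _version = stem.rpartition("-")
    if not sep:
        return False
    # Both sides must be normalised before comparing; comparing the raw
    # strings would wrongly report a mismatch.
    return canonicalize(name) == canonicalize(requested)
```

For example, `dist_info_matches("jaraco.clipboard-1.0.dist-info", "jaraco-clipboard")` is true only because both names are normalised first. A tool without that step has to apply this logic to every directory it inspects.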

pf_moore · November 22, 2020, 10:50am

OK, so .dist-info is an implementation detail, and normalising is fine. I’m missing your point here. Is it just that “pipx doesn’t use the metadata Name field”? If so, then that’s something to discuss on the pipx tracker.

A comment in the definition of the Name metadata field that it should be the preferred form for interacting with the user, maybe?

I’d be -1 on any proposal for anywhere that tools use the name to be left unnormalised. It makes the tools harder to write, and ends up with a tendency in the code to repeatedly normalise, “just in case”. This typically makes it harder to present the unnormalised name to the user, as you very quickly lose track of what’s normalised and what isn’t. With hindsight, it would be a lot better in pip if we’d used “name” consistently for the normalised name, and “display_name” for the unnormalised name. But it’s probably too late for that.
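The “name” / “display_name” convention could look something like the sketch below. This is not pip’s actual code; `ProjectName` and `from_display` are hypothetical names, and the normalisation is assumed to be the PEP 503 one. The point is that normalisation happens once, at construction, and both forms travel together afterwards.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProjectName:
    # Normalised form: used for lookups, comparisons and dictionary keys.
    name: str
    # Form given by the author or user: used only when talking to the user,
    # and excluded from equality and hashing.
    display_name: str = field(compare=False)

    @classmethod
    def from_display(cls, display_name: str) -> "ProjectName":
        """Normalise once, here, so callers never need to re-normalise."""
        normalised = re.sub(r"[-_.]+", "-", display_name).lower()
        return cls(name=normalised, display_name=display_name)
```

Two instances built from `Foo.Bar` and `foo_bar` compare equal and hash identically, yet each still remembers the spelling it was created with for messages.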

jaraco · November 22, 2020, 5:32pm

Early in my (re-)exploration of this issue, I would have disagreed, but after verifying that these names are the same package, I agree, and I’ve updated the importlib_metadata package to further capture that expectation.

I don’t think this suggestion satisfies the optimal user experience. Consider, for example:

$ pip uninstall foo.bar
WARNING: Skipping foo.bar as it is not installed.

In this case, there is no foo_bar-1.0.0.dist-info file to resolve foo_bar -> foo.bar. All you have is what the user specified. If the error message said:

WARNING: Skipping foo_bar as it is not installed.

because pip intends to use that name to look up in a dist-info specification, it’s still going to surprise the user and incentivize them to use foo_bar as the name and discourage others from using a dot.

But then consider:

$ pip install foo.bar
ERROR: Could not find a version that satisfies the requirement foo.bar
ERROR: No matching distribution found for foo.bar

In this case, if we follow the direction above, we would expect pip to emit something like:

$ pip install foo.bar
ERROR: Could not find a version that satisfies the requirement foo_bar
ERROR: No matching distribution found for foo-bar

The first error message indicates that (dist-info normalized) foo_bar doesn’t exist locally, so it tried to install (PEP 503 normalized) foo-bar and didn’t find it. Normalizing early doesn’t help here because normalization is a domain-specific behavior. I think of normalizing as congruent with encoding, where you want to use the canonical (indicated) name wherever possible and only normalize as close as possible to the interface that demands the normalization.

All for a package whose author intentionally named it foo.bar because it implements the Python package foo.bar.
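The two normalizations in the messages above can be sketched as follows. The function names are hypothetical, and the dist-info form is an assumption that follows the foo_bar spelling used in this thread (PEP 503 normalization with the hyphen swapped for an underscore), not a quotation of any specification.

```python
import re


def pep503_name(name: str) -> str:
    """Index (Simple API) form: runs of -, _ and . become one hyphen, lowercased."""
    return re.sub(r"[-_.]+", "-", name).lower()


def dist_info_name(name: str) -> str:
    """Assumed dist-info form: the PEP 503 form with "-" replaced by "_"."""
    return pep503_name(name).replace("-", "_")


def lookup_keys(requested: str) -> dict:
    """Keep the user's spelling for messages; derive each key only where needed."""
    return {
        "shown_to_user": requested,
        "installed_lookup": dist_info_name(requested),
        "index_lookup": pep503_name(requested),
    }
```

Calling `lookup_keys("foo.bar")` yields three different strings for the same project, which is why echoing a lookup key back in an error message shows the user a name they never typed.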

The current behavior in pip seems correct to me (use the user’s indicated spec and don’t normalize until necessary). It’s the behavior in pipx that I want to discourage.

pf_moore · November 22, 2020, 10:45pm

OK. So we come back around to the fact that you feel that pipx has a bug. I’m not going to disagree with you on that, but I don’t think it’s particularly relevant here (unless pipx is claiming that the dist-info name is the “correct” name for a project, but if that’s the case, then they are wrong).