Clarify naming of .dist-info directories

This was first raised in pipxproject/pipx#528. I’m copying some of the comments I left there so people don’t need to read through the whole thread, which contains tangentially related things.

The only mentions I can find regarding .dist-info names are in PEP 376 and the PyPA Specifications. All other specifications (including the wheel specification) only refer to one of these. PEP 376 defines the directory name as

name + '-' + version + '.dist-info'

but does not otherwise say what values can be used for either name or version. The PyPA specification expands on this:

This directory is named as {name}-{version}.dist-info, with name and Distribution versions fields corresponding to Core metadata specifications. The name field must be in normalized form (see PEP 503 for the definition of normalization).

The problem is, none of the widespread tools producing .dist-info directories actually do this. Instead, both the name and version parts have their dashes replaced by underscores (presumably to avoid ambiguity, since the dash is used to separate name and version), and that’s it. The dots are not replaced, and the name is not lower-cased (both mandated by PEP 503). Existing tools inspecting installed packages (pkg_resources and importlib.metadata) also use these rules to discover packages.
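To make the mismatch concrete, here is a minimal sketch comparing the two behaviours. The regular expression is the normalisation given in PEP 503; the function names `spec_dirname` and `observed_dirname` are hypothetical, and `observed_dirname` only approximates what the post says existing tools do (dashes replaced by underscores, nothing else).

```python
import re


def pep503_normalize(name: str) -> str:
    # Normalisation as defined in PEP 503: collapse runs of -, _ and . into
    # a single dash, then lower-case.
    return re.sub(r"[-_.]+", "-", name).lower()


def spec_dirname(name: str, version: str) -> str:
    # Hypothetical helper: a literal reading of the specification text.
    return f"{pep503_normalize(name)}-{version}.dist-info"


def observed_dirname(name: str, version: str) -> str:
    # Hypothetical helper: the behaviour described for existing tools, where
    # only dashes are replaced and case and dots are left alone.
    return f"{name.replace('-', '_')}-{version.replace('-', '_')}.dist-info"


for project in ("Django", "jaraco.clipboard", "backports-abc"):
    print(spec_dirname(project, "1.0"), "vs", observed_dirname(project, "1.0"))
```

Note that the literal reading yields a name part containing dashes, which is exactly the parsing ambiguity the underscore replacement avoids.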

Since it is not realistic to fix all the tools out there to follow the specification (for more than one reason), I intend to propose a pull request to the PyPA specification to define the rules as follows instead, to match reality:

  • In the name part, any run of dash (-) and underscore (_) characters should be replaced by a single underscore (_), and any run of dot (.) characters by a single dot (.). This is similar to the normalisation rule in PEP 503, but with two differences. First, the underscore character is used instead of the dash, to avoid ambiguity when parsing the directory name. Second, the dot character is treated differently for backward compatibility reasons.
  • The version part should always use the normalised form according to rules defined in PEP 440. This means that the version part never contains a dash (-) character, again eliminating ambiguity for parsers.
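The name rule proposed above could be sketched as follows. This is only an illustration of the wording in the first bullet, not code from any existing tool; the function name `escape_dist_info_name` is hypothetical, and the version part is left out because it would be normalised separately according to PEP 440.

```python
import re


def escape_dist_info_name(name: str) -> str:
    # Runs of dashes and underscores become a single underscore, so the name
    # part never contains the dash that separates it from the version.
    name = re.sub(r"[-_]+", "_", name)
    # Runs of dots become a single dot; dots are otherwise kept, and case is
    # left untouched.
    return re.sub(r"\.+", ".", name)
```

Under this sketch, `backports-abc` becomes `backports_abc`, while `jaraco.clipboard` and `Django` are left as they are.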

Would the library that would normally take care of this be https://github.com/pradyunsg/installer?

The dist-info directory is also used in wheels, so the responsibility of naming the dist-info directory correctly needs to fall on wheel generating tools. Wheel installers can perform sanity checks, but shouldn’t need to implement the normalisation logic, since a valid wheel should already contain a properly formatted dist-info directory that the installer can copy.

Currently, there are not a lot of wheel installers, so we could easily control the behavior there. What about getting the installers to normalize the name and raise a warning? Everything would still work and I think people would gradually start fixing the normalization.
The main issue I see with your proposal is that there is no mechanism in place to get people to adopt the correct behavior.

Also, perhaps we should update PEP 376 to ask for normalization?

PR on GitHub: https://github.com/pypa/packaging.python.org/pull/781

I reworded the proposal in the PR to try to convey that

  1. We acknowledge the existing tools don’t do normalisation consistently, and allow that to happen.
  2. We encourage tools to use PEP 503 normalisation.
  3. Tools inspecting .dist-info directories should take this backwards compatibility issue into consideration, and perform additional normalisation so distributions can be found as expected.
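As a rough sketch of the third point, a reader could compare names after normalising both sides rather than matching directory names literally. The helper names `canonical` and `find_dist_info` are hypothetical, and the sketch assumes the name part of the directory never contains a dash (as described earlier in the thread), so the first dash separates name from version.

```python
import re
from typing import Iterable, List


def canonical(name: str) -> str:
    # PEP 503 normalisation, used here only as a comparison key.
    return re.sub(r"[-_.]+", "-", name).lower()


def find_dist_info(requested: str, dirnames: Iterable[str]) -> List[str]:
    """Return the .dist-info directory names that belong to `requested`."""
    target = canonical(requested)
    matches = []
    for dirname in dirnames:
        if not dirname.endswith(".dist-info"):
            continue
        stem = dirname[: -len(".dist-info")]
        # Assumption: the first dash separates the name from the version.
        name, sep, _version = stem.partition("-")
        if sep and canonical(name) == target:
            matches.append(dirname)
    return matches
```

With this approach a query for `jaraco-clipboard`, `jaraco.clipboard` or `Jaraco_Clipboard` finds the same directory, whichever spelling the wheel builder wrote to disk.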

There are literally millions of installed distributions out there, and the non-normalised ones include some extremely high-profile packages, including Django, Flask, etc. It would be extremely disruptive to show a warning, even if everything “works”. And as I mentioned above, wheel installers are not responsible for naming .dist-info directories, the wheel builders are, and the migration would be extremely slow and painful unless we mass-patch wheels on PyPI, both existing and new ones (because people publishing packages tend not to upgrade the build tools all the time). The workaround to make the status quo work, however, is both straightforward enough and proven to work (pip implements it), so it would be far easier to patch the .dist-info readers, which are even fewer than wheel builders and installers: there are really only two with any significant usage, and a distant third in distlib.

Yeah, I don’t think we can change anything without a long deprecation period. virtualenv too processes wheels for example and uses the .dist-info for it. And then we probably have multiple usages within the enterprise system where we don’t know about it. It’s unrealistic to change any behavior without a long deprecation period.

I haven’t taken the time to read through all of the relevant PEPs but I would hope these two principles could be honored:

  1. As much as possible, users should have control over the names of their packages. Any constraints on the original (user-chosen) names of packages should be for good intrinsic reasons and not for convenience of the implementation. Think of allowed characters in filenames as a model.
  2. If normalization is applied for safety or consistency (such as in dist-info filenames), that’s fine, but it shouldn’t preclude the user interfaces from being able to use the original names. That is, if - is normalized to _ (or vice versa), the user interfaces should support querying by the original name. The normalization should be an internal implementation detail.

This behavior was in fact honored in the repository spec. Normalized names are used internally, but PyPI presents the original names in the UI and pip presents the original names when installing the packages.

I believe pipx is providing a degraded experience by presenting the “normalized” form to the user when installing/referencing a package.

I believe the dist-info spec is misguided by specifying the PEP 503 normalized form and the approach proposed by uranusjr seems sane to me.

I think before accepting pypa/packaging.python.org#781, we should have a proof-of-concept implementation in importlib_metadata to ensure that the implementation of such a spec doesn’t cause massive complexity or performance degradation.

Edit: I’ve figured out how the test suite works (it requires CPython’s test suite to run, which I did not install to save disk space). PR filed: python/importlib_metadata#253 (Fix dot handling in dist-info directory name)


Original reply:

I was going to submit a patch today:

diff --git a/importlib_metadata/__init__.py b/importlib_metadata/__init__.py
index 7031323..9155ee0 100644
--- a/importlib_metadata/__init__.py
+++ b/importlib_metadata/__init__.py
@@ -473,7 +473,7 @@ class FastPath:
         for child in self.children():
             n_low = child.lower()
             if (n_low in name.exact_matches
-                    or n_low.startswith(name.prefix)
+                    or n_low.replace('.', '_').startswith(name.prefix)
                     and n_low.endswith(name.suffixes)
                     # legacy case:
                     or self.is_egg(name) and n_low == 'egg-info'):

but couldn’t figure out how to run importlib-metadata’s unit tests.

In my tools (which are mostly just for my use) I tend to distinguish “name” (normalised, usable as, for example, a database primary key) and “display name” (not normalised, copied from original data).

I find that “display name” is lamentably inconsistent, though, so I’m not that convinced by arguments about “respect the developer’s choice for a name”. However, I do tend to think that the UI should work in terms of display names. I just feel that “any one of the different display names you encounter” is fine, and I don’t care too much about inconsistent display names.
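A minimal sketch of that split, assuming PEP 503 normalisation for the key: the class name `NameRegistry` and its methods are hypothetical and not taken from any of the tools discussed here. The first display name seen for a given key is the one kept, in line with “any one of the different display names you encounter”.

```python
import re
from typing import Dict, Optional


def canonical(name: str) -> str:
    # PEP 503 normalisation, used as the lookup key.
    return re.sub(r"[-_.]+", "-", name).lower()


class NameRegistry:
    """Maps a normalised name to one display name copied from original data."""

    def __init__(self) -> None:
        self._display: Dict[str, str] = {}

    def add(self, display_name: str) -> None:
        # Keep the first display name encountered for a given key.
        self._display.setdefault(canonical(display_name), display_name)

    def display_name(self, query: str) -> Optional[str]:
        # Any spelling of the name resolves to the stored display name.
        return self._display.get(canonical(query))
```

A user interface built on this could accept `jaraco-clipboard` as input and still show `jaraco.clipboard` if that was the spelling recorded first.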

Also discussed briefly here: PEP 625: File name of a Source Distribution

Very true, but then I still wonder what library should contain the code to calculate that directory name so everyone uses the same code to get consistent results? pypa/pyproject-hooks isn’t doing the actual construction of the wheel and pypa/wheel doesn’t have an API, so it should probably be somewhere else.

Historically pypa/packaging has not had anything directly related to building artifacts like this, so I don’t know if that’s the correct place for this sort of stuff either. But then again I don’t know if it’s worth trying to split “utilities for installing artifacts” and “utilities for building artifacts” (e.g. if you have code to read metadata, then wouldn’t that code also be able to write it, which then makes that code straddle that divide).

importlib.metadata currently holds logic to look for .dist-info in site-packages, and should be a good place to keep doing this due to its stdlib status. I’m not sure whether it’d be within its scope to also inspect wheels; if not, IMO it should go to the WIP wheel installer project (or something it depends on; but personally I feel that’d be too fine-grained, it’s better to keep code interacting with an existing wheel in one package).

That makes sense to me, but what about tools writing wheels? Where do they get code for calculating the name of the .dist-info directory they are expected to construct?

Not directly, but this would work:

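import packaging.utils
import packaging.version

def dist_info_name(project: str, version: packaging.version.Version) -> str:
    # Swap dashes for underscores so the only dash left separates name and version.
    name = packaging.utils.canonicalize_name(project).replace("-", "_")
    vers = str(version).replace("-", "_")
    return f"{name}-{vers}.dist-info"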

Maybe we can put this snippet in a PEP?

I don’t see why this couldn’t just go into packaging.utils - it’s got just as much reason to be there as canonicalize_name (which I hadn’t noticed was there - I’ve cut and pasted that code so many times :slightly_frowning_face:)

ACK. Assuming this PEP comes through, I’m on board for putting the function that Tzu-Ping has written here (or the appropriate equivalents) in packaging.utils, as suggested by Paul.

I’m fine putting it there as well.

Thanks for the PR to importlib-metadata and packaging docs. While I am fine accepting the importlib-metadata PR - it doesn’t add a lot of complexity, I still feel like it’s unnecessary complexity and ambiguity.

If I understand correctly, the purpose behind normalizing these names was to prevent collisions, so that backports.abc and backports-abc and backports_abc are equivalent names in a package index.

My preference would be to mangle names only when necessary and to honor the package’s original name wherever possible.

I believe it’s a bug in pipx that someone attempting to install jaraco.clipboard would ever see the normalized name jaraco-clipboard or jaraco_clipboard (yuck). In my opinion, pipx should store the name indicated by the user instead of aggressively normalizing it. And while it may be worthwhile for something like importlib.metadata to leniently accept normalized names, I’m a little reluctant to accept the patch because it will encourage tools like pipx to aggressively normalize the names rather than relying on the user’s specified name or the package’s proper name.

Is there a way for the guidance to say that, while the dist-info name might be normalized, we discourage the use of normalized names in user interfaces and instead encourage the use of the package’s indicated name or the name specified by the user?

I’d specifically like to avoid the situation where users feel obliged not to use separators at all and we ultimately end up in a world where the de facto standard is just [a-z].

Ideally we should have two metadata fields - Name (normalised) and Display-Name. Or maybe Name and Identifier (normalised name). But retro-fitting that would be extremely hard.