Change in PyPI upload behavior. Intentional, accidental, pebkac?

pf_moore · June 16, 2023, 11:13am

Personally, I find it much more difficult to have multiple normalisation rules depending on context. Ideally, there should be one rule, and the choice is “normalise, or don’t”. If I have to remember what rules apply in what context, I’ll make mistakes, introduce bugs, and get confused.

Also, normalisation (IMO) is all about one thing - taking a user-supplied value and identifying one representative form, out of all the forms that have been specified as equivalent, and using that as the canonical value^[1]. We cannot have a normalisation rule that gives different results for a.b and a_b, because those two names are the same name according to the rule about how names are compared. If we had wheels a.b-1.0-py3-none-any.whl and a_b-1.0-py3-none-any.whl in the same directory, and someone runs pip install a.b a_b (more realistically, the two come up somewhere in the dependency tree of a more sensible install command), which wheel is the “right” one? Given that the pip command I showed is by definition the same as pip install a.b (by the name comparison rules), how can we make the two do the same thing?
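To make the comparison rule concrete, here is a minimal sketch of the PEP 503 name normalisation, showing why a.b and a_b cannot be treated as different names. The helper name `same_project` is hypothetical and only there for illustration.

```python
import re


def normalize_name(name: str) -> str:
    """PEP 503 normalisation: runs of '-', '_' and '.' collapse to one '-', lowercased."""
    return re.sub(r"[-_.]+", "-", name).lower()


def same_project(a: str, b: str) -> bool:
    """Hypothetical helper: two spellings name the same project if they normalise equally."""
    return normalize_name(a) == normalize_name(b)


print(normalize_name("a.b"), normalize_name("a_b"))  # a-b a-b
print(same_project("a.b", "a_b"))  # True
```

Any rule that picked different wheels for the two spellings would contradict this comparison.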

To repeat - I absolutely support using unnormalised forms in project names as exposed to the user. But that’s different.

But you haven’t answered my question. Why do we need to keep that information in a wheel name? Nobody should be reading the wheel name to determine the official name of the project, they should look at the metadata in the wheel.
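As a rough sketch of what "look at the metadata in the wheel" can mean in practice, the following reads the Name field from the METADATA file inside a wheel using only the standard library. The function name `project_name_from_wheel` is made up for this example, and it assumes the wheel contains exactly one top-level .dist-info directory.

```python
import io
import zipfile
from email.parser import HeaderParser


def project_name_from_wheel(wheel) -> str:
    """Return the Name field from the wheel's *.dist-info/METADATA.

    `wheel` may be a filesystem path or a binary file-like object.
    Hypothetical helper; assumes a single top-level .dist-info directory.
    """
    with zipfile.ZipFile(wheel) as zf:
        metadata_path = next(
            n for n in zf.namelist()
            if n.count("/") == 1 and n.endswith(".dist-info/METADATA")
        )
        text = zf.read(metadata_path).decode("utf-8")
    # METADATA uses RFC 822 style headers, so the email parser can read it.
    return HeaderParser().parsestr(text)["Name"]


# Build a tiny in-memory archive whose filename-level name is normalised
# while the metadata keeps the spelling chosen by the author.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(
        "zope_interface-1.0.dist-info/METADATA",
        "Metadata-Version: 2.1\nName: zope.interface\nVersion: 1.0\n",
    )
buf.seek(0)
print(project_name_from_wheel(buf))  # zope.interface
```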

[1] As a mathematician, I view this as “picking a representative value from an equivalence class”.

abravalheri · June 16, 2023, 3:10pm

If we are talking specifically about needs, I don’t think we need to keep that information on the wheel name, but to some extent we also don’t have a need to replace . with _… That was the main point on the setuptools issue tracker, to not do transformations that we don’t need to. I guess those are different views: “normalise everything that we don’t need to keep” vs “keep everything that we don’t need to change”.

As far as I understand (and please correct me if I am wrong), the main points seem to be:

- PyPI (as a public package index) has very strong reasons for enforcing strict uniqueness checks (security reasons, competition between publishers that might confuse users, etc…). Therefore it is not viable to differentiate between “normal packages” and namespace packages on PyPI.
- pip, whose primary use case is to download from PyPI, prefers to rule out the possibility of treating namespace packages and “normal packages” as two different packages. This is compatible with PyPI and also helps users to fix unintentional typing errors and avoid downloading wrong/malicious packages.
- Having one normalisation rule to be applied everywhere would be simpler.
- There is some advantage in normalising the .dist-info directory (as pointed out by Pradyun), and if I understood correctly this would also help to optimise the checks for conflicting packages already installed (since .dist-info serves as a database).

If I understood correctly, although not strictly necessary, the idea is to rule out the coexistence of namespace packages and “normal” packages even in the private index scenario.

pf_moore · June 16, 2023, 3:27pm

IMO we do, as . and _ are equivalent when comparing project names, and we don’t want to have two possible wheel names for the same project (to avoid the “which is the correct one” question I pointed out before).

That’s not even remotely the idea. Namespace and normal packages (by which I mean import packages) are both perfectly acceptable and can coexist just fine. What isn’t possible is to use the import package name as the project (distribution) name for both a normal package foo_bar and a namespace package foo.bar, because they would both count as the same distribution name. That’s because of the comparison rules, not because of normalisation.

I think you’re confusing import package names and distribution package names. Or maybe just trying to extend a guideline for naming distribution packages the same as the underlying import packages too far. After all, there’s PIL and Pillow, and for that matter setuptools itself (and pkg_resources).

This is really no different from not allowing two modules XXX and xxx to exist, even on case sensitive filesystems where the .py files would have different names.

dstufft · June 16, 2023, 3:30pm

Pradyun Gedam:

My opinion on this is that we basically have 2 ways to go about this:

1. Push normalisation responsibilities to the tools that generate stuff, so that none of the tooling downstream of that need to cater to non-normalised names in the transport mechanisms.

2. Push normalisation responsibilities to(ward) the point of use, so that all of the tooling needs to handle normalisation. Notably, this may trigger bespoke errors at point of use, depending on whether the input is valid for the normalisation process, and may require bespoke mechanisms for each of the points where we exchange information for doing the normalisation (loops to search for files, creating normalised mappings in-memory in place of direct lookups on the filesystem, etc). We can choose to (a) set in stone what’s being done or (b) evolve it.

At the moment, I’m firmly in favour of 1 – I think it’s slightly disruptive but it is easier to reason about on the other side and will be easier to communicate and reason about for everyone involved.

The non-normalised names have a clear place where we should place them – in the METADATA’s Name key. That can serve as the display name. All other spots would benefit from being normalized “fully” (the existing re.sub) with the target symbol changing based on what’s relevant in the context (use _ for distribution filenames, - everywhere else).

(and, yes, we won’t get there fully but catering to a smaller and smaller set of non-normalised transport-time names is a good thing and will make certain types of analysis easier)
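The "same re.sub, different target symbol" idea could look roughly like the sketch below. The function names and the default tag are illustrative assumptions, and the version is assumed to be normalised already; this is not taken from any particular backend.

```python
import re


def normalize(name: str, sep: str = "-") -> str:
    """Collapse runs of '-', '_' and '.' into `sep` and lowercase the result."""
    return re.sub(r"[-_.]+", sep, name).lower()


def wheel_filename(name: str, version: str, tag: str = "py3-none-any") -> str:
    """Hypothetical helper: '_' is the target symbol for distribution filenames."""
    return f"{normalize(name, '_')}-{version}-{tag}.whl"


print(normalize("Zope.Interface"))            # zope-interface
print(wheel_filename("Zope.Interface", "1.0"))  # zope_interface-1.0-py3-none-any.whl
```

With this scheme a.b and a_b both produce a_b-1.0-py3-none-any.whl, so there is only one possible filename per project and version.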

I don’t think (1) is possible? Or rather, I don’t think it’s possible to remove (2) and rely on 1. Not unless you have tools hard failing when encountering non-normalized names.
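For illustration, point-of-use handling along the lines of option 2 might be sketched as an in-memory mapping keyed on the normalised name, built by scanning filenames rather than doing a direct lookup. `index_wheels` is a hypothetical name, and the sketch assumes the project name is the first '-'-separated component of each wheel filename.

```python
import re
from collections import defaultdict


def normalize_name(name: str) -> str:
    """PEP 503 normalisation."""
    return re.sub(r"[-_.]+", "-", name).lower()


def index_wheels(filenames):
    """Hypothetical helper: map normalised project name -> wheel filenames as found."""
    index = defaultdict(list)
    for filename in filenames:
        if not filename.endswith(".whl"):
            continue
        # Assumption: the name component itself contains no '-'.
        raw_name = filename.split("-", 1)[0]
        index[normalize_name(raw_name)].append(filename)
    return dict(index)


files = [
    "a.b-1.0-py3-none-any.whl",
    "a_b-1.0-py3-none-any.whl",
    "other-2.0-py3-none-any.whl",
]
index = index_wheels(files)
print(index[normalize_name("A_B")])
# ['a.b-1.0-py3-none-any.whl', 'a_b-1.0-py3-none-any.whl']
```

Both spellings land under the same key, which is exactly the ambiguity a consumer has to resolve when producers do not normalise.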

abravalheri · June 16, 2023, 3:43pm

Sorry Paul, I think I expressed myself badly (argh, Python nomenclature can get a bit messy sometimes). Let me paraphrase myself below:

As far as I understand (and please correct me if I am wrong), the main points seem to be:

- PyPI (as a public package index) has very strong reasons for enforcing strict uniqueness checks (security reasons, competition between publishers that might confuse users, etc…). Therefore it is not viable to differentiate between distributions named after “normal packages” and namespace packages on PyPI.
- pip, whose primary use case is to download from PyPI, prefers to rule out the possibility of treating distributions named after namespace packages and “normal packages” as two different distributions. This is compatible with PyPI and also helps users to fix unintentional typing errors and avoid downloading wrong/malicious distributions.
- Having one normalisation rule to be applied everywhere would be simpler.
- There is some advantage in normalising the .dist-info directory (as pointed out by Pradyun), and if I understood correctly this would also help to optimise the checks for conflicting distributions already installed (since .dist-info serves as a database).

If I understood correctly, although not strictly necessary, the idea is to rule out the coexistence of distributions named after namespace packages and “normal” packages even in the private index scenario.

dstufft · June 16, 2023, 3:47pm

Private indexes cannot treat zope.interface and zope-interface as different packages regardless of what happens with wheel filenames.

PEP 503 requires the normalized form of the name to be used in URLs when pip requests the Simple API for a given project from an index. So pip install zope.interface means pip does GET /simple/zope-interface/.

This was done to solve several real problems at the time, but in effect it means that an index server cannot treat names that normalize the same as different projects.
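A small sketch of that URL construction, assuming a generic index base URL; `simple_project_url` is a hypothetical name and this is not pip's actual code.

```python
import re
from urllib.parse import urljoin


def simple_project_url(index_url: str, requested_name: str) -> str:
    """Hypothetical helper: build the Simple API project URL from the normalised name."""
    normalized = re.sub(r"[-_.]+", "-", requested_name).lower()
    return urljoin(index_url.rstrip("/") + "/", normalized + "/")


print(simple_project_url("https://pypi.org/simple/", "zope.interface"))
# https://pypi.org/simple/zope-interface/
print(simple_project_url("https://pypi.org/simple/", "zope-interface"))
# https://pypi.org/simple/zope-interface/
```

Since both spellings resolve to the same URL, the index only ever sees one project.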

abravalheri · June 16, 2023, 4:06pm

Thanks for the clarification Donald.

Given that the differentiation between distributions named after namespace packages and “normal” packages is already ruled out in private indexes regardless, there is not much point in continuing to produce different files for them… If we drop it we can comply with the optimisations mentioned earlier.

I will go back and summarise these points in the setuptools issue I mentioned earlier. If anyone in the community would like to submit a PR I will try to review it (although specifically talking about .whl files, it might be implemented in pypa/wheel).

dstufft · June 16, 2023, 4:22pm

Ultimately what I care about is that the spec isn’t just some aspirational document of things we think it would be cool if they were followed. They should be documents that clearly define what is a hard requirement, what is highly recommended, and what is fully optional, so that implementers through the chain know what things they can depend on, and what is required of them.

When a change is made that restricts something that was valid to make it no longer valid, there’s always going to be a transitional period. However, there needs to be some plan to move that transitional period along and to move us into a state where people can actually depend on the things that we spell out as hard requirements.

When we don’t do that, and we interleave hard requirements with things that are in effect, fully optional, it makes the specs unable to be relied upon. It forces every implementer to sit there and carefully figure out what the real, de facto spec is, because it differs from the specs as written.

For filenames, so many projects being released to PyPI fail the requirement of the spec that we cannot actually enforce it. However, we have to enforce something, otherwise even the most basic of requirements like name cannot have - will regress.

The situation is already crummy with the spec and reality not matching up, in this case in an obvious way. This creates a scenario where again, the spec on paper and the de facto spec are different, because what PyPI accepts is different than what the spec says. We could “just” fix PyPI to be more permissive, but all that does really is change the de facto spec to be differently different from the real spec.

This isn’t just some hypothetical problem of purity, but it has real practical implications. Otherwise you end up with the mess that HTML is.

pf_moore · June 16, 2023, 4:37pm

Absolutely agreed. However, we cannot dictate when tools will implement particular standards, and as a result tools “later” in the chain have to be more lenient than we might otherwise like.

To give a concrete example here, none of this would be an issue if backends had been the ones to promptly enforce the change to the spec, rather than PyPI. It sucks that PyPI can’t be strict yet, and as a result we continue to get non-standard filenames being uploaded. But it sucks just as much that PyPI won’t accept metadata 2.2 yet, and so backends can’t produce it and installers can’t use it to optimise.

I agree that we need a transition plan. But I don’t see what’s so bad about relaxing PyPI’s requirements until backends catch up. That’s a transition plan, and all it relies on is people being patient with each other (and specifically with the extended timescales involved in volunteer open source projects).

We did have an issue with the way the current spec was created, in that it didn’t go through “due process” and as a result setuptools objected to what we ended up with (which meant they weren’t willing to implement it). Hopefully that’s resolved now, but if not, we should focus on getting a normalisation standard that we do all agree on.

dstufft · June 16, 2023, 8:51pm

Until this thread, there was no evidence that the backends were going to catch up. AFAICT setuptools had an ideological disagreement with that requirement, and so there was no “until backends catch up”, it was just going to be “relax requirements… forever”.

It’s still not clear that there is agreement from setuptools that they’re willing to implement normalization of filenames. At least one maintainer seems to still be hard against it in the issue tracker. Until that disagreement actually gets resolved, it feels very much like relaxing PyPI’s requirement is just allowing the spec to continue to diverge from reality, and potentially makes a final resolution more complicated because it adds yet another axis of preexisting behavior to consider.

It’s hard for me to express just how little I care about what normalization requirements we have, I just want there to be an actual agreed upon spec that we can implement, and the reality of the situation is that setuptools is a large enough constituent that if they’re unwilling to implement something, then we can’t consider it an agreed upon spec.

CAM-Gerlach · June 17, 2023, 3:25am

For anyone else who wants to either chime in or provide a different perspective, I tried to summarize the responses to Jason’s original question in a post on that issue.

mgorny · June 17, 2023, 5:09am

From the Gentoo (i.e. downstream packager) perspective, normalization makes things easier. Our package naming rules diverge from those for Python projects (and they’re over 20 years old, and changing them would be a major backwards compatibility hassle). The current normalization rules make it possible for a clean 1:1 mapping from Gentoo package names to PyPI filenames.

The fact that setuptools diverges is a hassle, but it’s a minor hassle because it simply implements the old specification. We need to support it anyway because of old package versions, so it’s a matter of having a switch to restore the old behavior. It’s somewhat inconvenient because packagers now have to remember “you have to disable normalization if it’s setuptools or old”.

If setuptools finally started normalizing, the rule would eventually be simplified to “you may need to disable normalization if it is an old package”.

If the specs were changed again today, things would get really messy for Gentoo. For a start, maintainers would have to remember to switch between 3 normalization schemes now. What’s worse, Gentoo package names can’t have a full stop character in them, so we won’t be able to do 1:1 normalization and instead we’d have to keep manually defining whether the - there converts to _ or to . (the problem is already there for the non-normalized case, but normalization gives us hope that it will eventually disappear).

CAM-Gerlach · June 18, 2023, 11:41am

It sounds like the Setuptools maintainers involved have now indicated they are okay with following the lead of other tools and the general consensus here per pypa/setuptools#3777.

Additionally, Flit >=3.9 (released a month ago) now normalizes sdist names following PEP 625 (i.e. same as wheels) per pypa/flit#628

Combined with pdm-pep517 being deprecated and no longer developed and pdm-backend having replaced it, which normalizes both sdist and wheel names, this leaves only Setuptools as the outlier, and it looks like that might change soon-ish. Therefore, it seems reasonable to update PEP 716 to reflect this new reality and mandate normalization, to be consistent with what existing tools do (or plan to, at least).

ofek · June 19, 2023, 3:02pm

You can use Hatchling if you want. There is also a strict-naming option that can be disabled independently for both source distributions and wheels.

barry · June 22, 2023, 4:04pm

I’m now seeing what might be a related bug due to the broken package name normalization in some situations. I don’t have a root cause yet, but it seems related. At the very least, PyPI needs to honor the package name in its UI.

barry · June 22, 2023, 4:19pm

For tracking purposes, I’ve opened a bug on PyPI.

barry · June 22, 2023, 4:45pm

@dustin responded over on the PyPI bug, with what I suspected would be the case. pdm-backend normalized my pyproject.toml name in the package’s metadata, and PyPI honors that. So it does sound like @ofek’s suggestion of using hatchling might be a temporary solution, although in my case, that will require pyproject.toml churn for the non-standard tool settings. Does pdm-backend need to support a similar option? @frostming

barry · June 22, 2023, 4:54pm

What if we want normalization of file names but not of the project name in the metadata? Or will that break the world?

barry · June 22, 2023, 4:55pm

github.com/pdm-project/pdm-backend

pdm-backend should not normalize names in project metadata

opened 04:55PM - 22 Jun 23 UTC

warsaw

In the [long thread on discuss.python.org](https://discuss.python.org/t/change-i…n-pypi-upload-behavior-intentional-accidental-pebkac/27707/1) several problems related to normalization of package names have been identified. Specifically, `pdm-backend`'s behavior of normalizing the package name in the project metadata causes PyPI to incorrectly display the package's intentional name and `pip install` instructions. Context is contained in the thread linked above. FWIW, I actually don't care about the file names - normalize away! I just want the metadata project name to match what I put in `pyproject.toml`!

ofek · June 22, 2023, 5:18pm

Hatchling does not normalize the name found within distribution metadata files so PyPI and other consumers have the raw text the user defines. Is that what you’re asking?