Change in PyPI upload behavior. Intentional, accidental, pebkac?

pf_moore · June 14, 2023, 3:01pm

The installed package includes a directory Jinja2-3.1.2.dist-info. According to the spec, it should be jinja2-3.1.2.dist-info (lower case). I imagine this is because the build backend isn’t normalising the name before creating the directory in the wheel. As I said, it’s not 100% explicit that they need to do this in the wheel, but I’m certain pip (and likely other installers) won’t normalise the name when installing, but will just take what’s in the wheel.

dstufft · June 14, 2023, 3:07pm

Oh, the behavior isn’t limited to the . character either; that just happens to be the character that trips up the PyPI bug that caused this to be noticed. Normalization also requires lower-casing names, so if we required normalization on PyPI we’d also break projects like Jinja2 or Django.
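For reference, a minimal sketch of the name normalization defined in PEP 503, which both collapses runs of -, _ and . and lower-cases the result. The function name normalize is just a label chosen for this illustration.

```python
import re


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    # Collapse every run of "-", "_" and "." into a single "-", then lower-case.
    return re.sub(r"[-_.]+", "-", name).lower()


for project in ("Jinja2", "Django", "flufl.i18n"):
    print(project, "->", normalize(project))
```

Jinja2 and Django change only through lower-casing, which is why a strict normalization requirement would affect them even though their names contain no punctuation.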

dstufft · June 14, 2023, 3:23pm

I’m working on a PEP that implements this (an earlier copy of it rolled back the backwards incompatible change, but I’ll be updating it to just relax the rules).

Again, I don’t particularly care what the filenames look like, as long as there’s a path to our destination without widespread breakage. So the requirement to normalize isn’t important to me, and the open questions on how to roll it out on PyPI require more work to figure out than I’m willing to invest in this particular problem.

I don’t like the solution of PyPI accepting non-normalized names while the spec says that they should be normalized, because I think it means in practice that the de facto standard is that normalization is not required. IME, the only standards that get followed universally are standards that some major component of the ecosystem enforces. PyPI is often in the role of enforcing that because it is able to do it only for new uploads, which is often more tenable than enforcing it for everything ever produced.

If someone feels strongly that our end state should be fully normalized filenames, then I hope they do the work to figure out how to get us there.

davidism · June 14, 2023, 3:34pm

I was already thinking of switching everything except MarkupSafe (which has a C extension) from setuptools to Flit, if that helps. And if I ever finish the Rust rewrite for MarkupSafe I’ll hopefully be able to use Maturin.

To be clear, if setuptools were to be fixed to normalize the file names, or if there was another simple backend that I could switch to easily, then I would be fine using that to be able to upload new releases to PyPI. But I’m sure other projects might not be able to do that so easily.

dstufft · June 14, 2023, 3:39pm

I don’t think this is a problem for individual projects to solve FWIW. It’s something the packaging tools need to solve.

abravalheri · June 14, 2023, 3:51pm

Please note that there was a problem the last time that the name normalisation rules were changed.

According to the discussion in Escaping versions for wheel, sdist, and .dist-info names, the main idea was that normalisation would happen for - into _, but I don’t think the conclusions of the initial discussion included the . character. It seems that the PR that was merged ended up accidentally increasing the scope of the changes. Of course I might be wrong, and I apologize if that is the case.

As far as I understood, one of the main reasons for the change in the normalisation was that installers use the file name to infer version and normalised project name without requiring extra queries to the package index.

This logic does not justify replacing . with -, because the heuristic for extracting parts of the file name involves splitting the string on the - character.
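A rough sketch of that heuristic, assuming the standard wheel filename layout of name, version, optional build tag, Python tag, ABI tag and platform tag joined by -. The helper split_wheel_filename and the filenames are made up for illustration and are not taken from any installer.

```python
def split_wheel_filename(filename: str) -> tuple[str, str]:
    """Return (name, version) by splitting a wheel filename on "-"."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    fields = filename[: -len(".whl")].split("-")
    # Five fields without a build tag, six with one.
    if len(fields) not in (5, 6):
        raise ValueError(f"unexpected number of fields in {filename!r}")
    return fields[0], fields[1]


# A "." inside the name does not disturb the split ...
print(split_wheel_filename("flufl.i18n-1.0-py3-none-any.whl"))
# ... but an unescaped "-" does: the name is cut short and the
# second half of the name is mistaken for the version.
print(split_wheel_filename("flufl-i18n-1.0-py3-none-any.whl"))
```

The first call yields the name flufl.i18n and version 1.0; the second silently yields flufl and i18n, which is the failure that escaping - is meant to prevent.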

Previously, the setuptools maintainers expressed a preference for not mangling file names unnecessarily. The need to replace the - character is understandable, but replacing . seems unnecessary. Ideally we should leave the file name as close as possible to the name the user intended for the distribution package. Under the hood we can use normalisation for comparisons (or is there a major downside in doing that?).

dstufft · June 14, 2023, 4:21pm

I think the problem with requiring non-normalized names (e.g. exactly what the user entered, minus escaping) is that there are a non trivial number of tools and projects out there that already implement forced normalization in the filename.

It’s basically the same problem with requiring normalized names.

In both cases we have widely deployed tools that assume one of the two mutually incompatible rules.

I think that realistically we have 3 options:

1. Roll forward, and figure out a path to getting everyone onto fully normalized names.
2. Roll backwards, and figure out a path to getting everyone onto fully non-normalized names.
3. Allow backends to emit whatever they want, and require normalization only on comparison.

I suspect that (3) is the most tenable:

Both (1) and (2) require breaking some subset of users.
There are a lot of projects out there ^[1] whose name on PyPI is not equal to their normalized name.
Anyone wanting to deal with arbitrary artifacts is going to have to deal with names that are normalized and names that are not normalized, unless they are able to restrict themselves to only artifacts produced after some flag day. So we’re not saving ourselves much effort by requiring normalization.
There are strong opinions on both “sides” of the normalization debate, and it’s unlikely that either side is going to be able to convince the other side.

So to me, the standards caring about the normalization (or lack of normalization) of names inside of a filename costs a lot, and gets us very little. It feels very much like a bikeshed that doesn’t actually matter, except that either choice breaks things.

So allowing anything lets each individual backend paint the bikeshed whatever color they want, at minimal cost.
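As a sketch of what "require normalization only on comparison" could look like for a consumer, assuming the PEP 503 rule is the one used for comparing names. The helper files_for_project and the listing are illustrative only.

```python
import re


def normalize(name: str) -> str:
    """PEP 503 normalization, used here only for comparisons."""
    return re.sub(r"[-_.]+", "-", name).lower()


def files_for_project(requested: str, filenames: list[str]) -> list[str]:
    """Select the files whose name field matches the requested project."""
    wanted = normalize(requested)
    # The name field is whatever the backend emitted; it is never rewritten,
    # only normalized at the moment it is compared.
    return [f for f in filenames if normalize(f.split("-")[0]) == wanted]


listing = [
    "Jinja2-3.1.2-py3-none-any.whl",  # backend that preserves the name
    "jinja2-3.1.2-py3-none-any.whl",  # backend that normalizes the name
    "Django-4.2.2-py3-none-any.whl",
]
print(files_for_project("jinja2", listing))
```

Both Jinja2 spellings are selected and the filenames themselves stay exactly as uploaded.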

[1] Looks like slightly above 75k projects on PyPI have a name that does not match their normalized name. I don’t have an easy way to query how many of them are using a backend that preserves their name vs. one that normalizes their name in filenames.

abravalheri · June 14, 2023, 4:26pm

Thank you very much Donald. I agree that (3) seems to be the best way forward.

Could we please take this opportunity to fix any accidental change of scope of the published normalisation rules for file names and return them to the original consensus (i.e. it is necessary to normalise -, but not .)? This is something that has been requested by setuptools maintainers.

If required I volunteer to submit a PR.

pf_moore · June 14, 2023, 4:40pm

Can I note that:

1. The spec for installed projects is explicit that installed dist-info directories must use normalised names. That definitely isn’t happening right now, which is almost certainly because installers just copy what’s in the wheel, and…
2. The wheel spec makes no mention of normalising the project name part of the dist-info directory name. Plus…
3. Wheel producing tools quite likely^[1] just make the dist-info directory follow the same rule as used for the wheel filename…
4. Which says to normalise, but which isn’t followed by setuptools and the older PDM backend, as we’ve seen here.

Your PEP needs to cover the whole of that trail, I think, or we’ll just end up with a different inconsistency. Sorry.

I don’t think requiring installers to rename the dist-info directory when installing will be practical. Firstly because they simply won’t do so, and secondly because the wheel spec pretty much says that unzipping is all you need to do.

[1] citation needed

encukou · June 15, 2023, 8:41am

Handling non-normalized names has an implementation cost and run-time cost. Yes, pip and PyPI do need to pay it. But any new tool, specialized enough to not require handling all of the historical backlog as-is, can consider its own trade-off: is it better to support non-normalized names, or require new releases (with modern tools) for the packages that don’t have them yet?
IMO, the long-term direction should be toward having normalized names where possible, even if we can’t fully reach the goal in our lifetimes.

Full disclosure: I work on pyproject-rpm-macros, adapting Python packages to another ecosystem, where we can afford to normalize everything (even if we currently don’t).

But they should be allowed to do it if they want, and then anything that needs a .dist-info file can just optimistically open it. Even if everyone does need the fallback of listing everything on PATH and comparing each item, more tools renaming dist-info would be a win.

And that, I think, gives us a way for the future. Allow backends to emit what they want, accept anything in PyPI, accept that setuptools will keep its old ways, but be clear about what the right way is. Provide some carrots to people doing things the right way, and inevitably some apologies to people who get breakage or unexpected changes, as in the original post.

“Normalized” (or canonical) means that you can compare with a byte-by-byte comparison. Changing the normalization rules would mean changing what names are considered equivalent.
We can allow non-normalized names, invent a new scheme for filenames, or start considering . and - distinct (which has wider implications, especially since some tools already normalize to -). I see no simple option here.

pf_moore · June 15, 2023, 10:11am

I agree, with the proviso that this is for anything intended primarily or solely for machine consumption. For user facing names, the project author should be able to define the preferred capitalisation/punctuation, and have that be respected in the UI.

Under the rule that names are compared in normalised form, then yes, certainly. Tools may do anything they want, as long as the result isn’t distinguishable after normalisation (and satisfies any additional constraints we might specify).

My point here was simply that because we currently have a requirement (which isn’t yet followed) that the installed dist-info directory must use a normalised name, we cannot as a practical issue allow unnormalised names in the dist-info name in the wheel, because it’s not realistic to expect that nobody will ever just unzip a wheel without renaming (particularly as the wheel spec suggests that doing so is valid!). Either the rule for installed distributions has to be dropped (which I don’t think is the right choice) or the content of wheels must be required to be normalised. For practical reasons - of course the current situation is theoretically implementable.

This seems fair to me. We should define rules that are strict, but which have the pretty common “but tools should be prepared to deal with unnormalised names, because we can’t force people to follow the standards” qualification. We have a pretty solid track record in packaging of dealing with this sort of thing, I don’t think it’s impossible to do so here as well - even if the timescales might be long.

Normalisation happens in a lot of contexts, and is defined independently, not just wheel filenames. Changing those rules would be a huge change.

But I’m confused why it matters so much to some people that the wheel filename needs to allow unnormalised forms. It’s important to remember that the wheel filename is first and foremost a machine readable representation of key metadata for the wheel, and only secondarily a human-readable name. Maybe that isn’t ideal, but there were good and practical reasons for it (notably the ability to determine compatibility without downloading/opening the file). If it helps, think of the project name part of the filename as somewhat like a “slug” used in blogging software as an identifier for a post - it’s recognisably derived from the title, but not intended to be a copy of, or a substitute for, it.

Can anyone give me a use case where it matters whether a wheel for the project Foo.Bar-_BAZ is named foo_bar_baz-1.0-py3-none-any.whl or Foo.Bar_BAZ-1.0-py3-none-any.whl? Neither form is exactly the same as the project name, both normalise to the same form as the project name. What reasonable process needs one rather than the other?

While this is true to an extent, I don’t think it’s necessary for PyPI to set itself up as a gatekeeper for standards as a result. The focus in PyPI should (like for any other component of the system) be user experience, and if that means being lenient as a practical choice so as to not block progress, then so be it. Someone else will, at some point, step up and request standard compliance of tools that are lagging. As long as we, as a community, agree that standard compliance is important (and if we don’t, then why are we bothering at all?) then things will move (maybe slowly) towards full compliance.

Yes, setuptools^[1] is a major factor in rollout of any standard. And their view has weight when it comes to setting standards. But once we have a standard, they shouldn’t be able to overturn it just by ignoring it. Which is why we should simply agree what we want to happen in a PEP. We shouldn’t need to have PyPI act as a compliance enforcer, and if we do, then the community consensus approach has frankly failed.

Having PyPI add a warning when non-compliant uploads are attempted, and later upgrading that to an error, but with both of those timed to match with how fast key producers implement the new standard, is fine. But that’s different.

[1] And pip and Warehouse.

barry · June 15, 2023, 11:22pm

It’s fine with me too, as long as pip install flufl.i18n continues to work. It would normalize the filename choice to flufl_i18n-*.

barry · June 15, 2023, 11:28pm

I think at least in the case of pdm-pep517 that’s still true, and it’s why I started this thread ;). Happily, that build backend is deprecated ^[1] and pdm-backend is normalizing and making PyPI happy.

[1] to some degree of notice to package authors, maybe warnings could be made more explicit

CAM-Gerlach · June 16, 2023, 12:02am

Yup, that was the goal of the package name normalization originally defined in PEP 503: FLUFL-i18n, flufl_i18n and flufl.i18n will all continue to work for pip install and other conforming tools just as they do today.

barry · June 16, 2023, 12:13am

That’s great, because really all of those spellings are used in one place or another to refer to the package which I call flufl.i18n.

My biggest gripe right now is that the PyPI UI advertises my project as flufl-i18n and worse, tells people to pip install flufl-i18n. That works but it’s not how I want my project to be advertised. As @pf_moore said earlier, where UI is involved, the tools should honor my preferences as package owner. All the rest is IMHO an implementation detail.

abravalheri · June 16, 2023, 9:41am

Paul Moore:

But I’m confused why it matters so much to some people that the wheel filename needs to allow unnormalised forms. It’s important to remember that the wheel filename is first and foremost a machine readable representation of key metadata for the wheel, and only secondarily a human-readable name. Maybe that isn’t ideal, but there were good and practical reasons for it (notably the ability to determine compatibility without downloading/opening the file). If it helps, think of the project name part of the filename as somewhat like a “slug” used in blogging software as an identifier for a post - it’s recognisably derived from the title, but not intended to be a copy of, or a substitute for, it.

Can anyone give me a use case where it matters whether a wheel for the project Foo.Bar-_BAZ is named foo_bar_baz-1.0-py3-none-any.whl or Foo.Bar_BAZ-1.0-py3-none-any.whl? Neither form is exactly the same as the project name, both normalise to the same form as the project name. What reasonable process needs one rather than the other?

I think that the main point is that . is commonly used to indicate namespaces, and normalising . to _ gets rid of this information.

At the end of the day, I believe no project is going to object to normalisation rules if we understand what the practical motivation behind them is. As I commented before, it is easy to understand that normalising - to _ unlocks an optimisation on the installer side, but what benefit do we get when we get rid of the possibility of optimising for inferring namespaces via file name^[1]? It would be nice to clarify it.

The normalisation rules for file names do not have to be the same as the normalisation rules for verifying uniqueness in the package index. It is not impossible to imagine a scenario where uniqueness_normalisation(x) = filename_normalisation(x).replace(".", "_") is calculated at upload time and added to the database, not having a major impact on the general performance of the package index. Also, private package indexes other than PyPI may allow for “regular packages” (e.g. a_b) to coexist together with a namespace package (e.g. a.b)^[2].

[1] Please note that I am not saying that there is not a good reason for that. I am just saying that I haven’t come across it yet (I might just have missed this discussion).
[2] In the case of PyPI, as a public package index, I understand that the uniqueness verification has to be stricter because anything looser can create a security breach. But maybe other private indexes do not need to enforce this restriction.
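A small sketch of the two-level scheme proposed in this post. The thread does not define filename_normalisation precisely, so the body below is an assumption: it escapes runs of - and _ to a single _, keeps . untouched, and lower-cases the result. Only the relation between the two functions comes from the post.

```python
import re


def filename_normalisation(name: str) -> str:
    """Hypothetical file name rule that escapes "-" but preserves "."."""
    # Assumption: lower-casing and collapsing runs of "-" and "_" are part
    # of this rule; the post only states that "." should be left alone.
    return re.sub(r"[-_]+", "_", name).lower()


def uniqueness_normalisation(name: str) -> str:
    """Stricter rule an index could compute at upload time, as in the post."""
    return filename_normalisation(name).replace(".", "_")


for name in ("a.b", "a_b"):
    print(name, filename_normalisation(name), uniqueness_normalisation(name))
```

Under these assumptions a.b and a_b keep distinct file names, while an index that checks uniqueness_normalisation would still treat them as the same project.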

ntessore · June 16, 2023, 10:08am

That would be great, if at all possible, and worth some breakage IMO. There is usually quite the difference between the man.eating_chicken plugin and the man_eating_chicken package.

pradyunsg · June 16, 2023, 10:38am

I understand that, but it’s still unclear to me why it matters what the name in the transport mechanism is here. Having a fully normalised name makes multiple things easier for tooling that needs to handle these files (eg: locating dist-info is a lookup instead of a loop on all files in the zip, having exactly 1 answer to what the name of the package will be in the distribution, being able to check if there’s a conflicting package already installed, and so on).
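To make the "lookup instead of a loop" point concrete, here is a sketch using the standard zipfile module on an in-memory archive. The helper names (normalize_for_filename, metadata_by_lookup, metadata_by_scan, make_wheel) are invented for this example, and the archive contains only a stub METADATA file rather than a real wheel.

```python
import io
import re
import zipfile
from typing import Optional


def normalize_for_filename(name: str) -> str:
    """Fully normalize a name, using "_" as the separator for file names."""
    return re.sub(r"[-_.]+", "_", name).lower()


def metadata_by_lookup(wheel: zipfile.ZipFile, name: str, version: str) -> Optional[str]:
    """Direct lookup: only succeeds if the backend normalized the directory."""
    path = f"{normalize_for_filename(name)}-{version}.dist-info/METADATA"
    try:
        wheel.getinfo(path)
    except KeyError:
        return None
    return path


def metadata_by_scan(wheel: zipfile.ZipFile, name: str) -> Optional[str]:
    """Fallback: loop over every entry and compare normalized names."""
    wanted = normalize_for_filename(name)
    for entry in wheel.namelist():
        directory, _, rest = entry.partition("/")
        if rest != "METADATA" or not directory.endswith(".dist-info"):
            continue
        found_name = directory[: -len(".dist-info")].split("-")[0]
        if normalize_for_filename(found_name) == wanted:
            return entry
    return None


def make_wheel(dist_info_dir: str) -> zipfile.ZipFile:
    """Build a stub archive in memory that holds a single METADATA file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(f"{dist_info_dir}/METADATA", "Name: Jinja2\nVersion: 3.1.2\n")
    buffer.seek(0)
    return zipfile.ZipFile(buffer)


wheel = make_wheel("Jinja2-3.1.2.dist-info")
print(metadata_by_lookup(wheel, "Jinja2", "3.1.2"))
print(metadata_by_scan(wheel, "Jinja2"))
```

With the unnormalized Jinja2-3.1.2.dist-info directory the direct lookup returns None and only the scan finds the file; with a normalized directory name the lookup alone would be enough.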

Having a dot isn’t going to allow you to be publishing to a “namespace” either. zope.interface and zope-interface will both be publishing to the same place on PyPI, and I’d argue it is better to have the wheel filename reflect the underlying reality of what names are considered equivalent rather than leaving it as a murky “normalise, but not fully, and with extra steps” bespoke rule that we’re basically inventing to cater to this desire.

pradyunsg · June 16, 2023, 10:49am

My opinion on this is that we basically have 2 ways to go about this:

1. Push normalisation responsibilities to the tools that generate stuff, so that none of the tooling downstream of that need to cater to non-normalised names in the transport mechanisms.
2. Push normalisation responsibilities to(ward) the point of use, so that all of the tooling needs to handle normalisation. Notably, this may trigger bespoke errors at point of use, depending on whether the input is valid for the normalisation process, and may require bespoke mechanisms for each of the points where we exchange information for doing the normalisation (loops to search for files, creating normalised mappings in-memory in place of direct lookups on the filesystem, etc). We can choose to (a) set in stone what’s being done or (b) evolve it.

At the moment, I’m firmly in favour of 1 – I think it’s slightly disruptive but it is easier to reason about on the other side and will be easier to communicate and reason about for everyone involved.

The non-normalised names have a clear place where we should place them – in the METADATA’s Name key. That can serve as the display name. All other spots would benefit from being normalized “fully” (the existing re.sub) with the target symbol changing based on what’s relevant in the context (use _ for distribution filenames, - everywhere else).

(and, yes, we won’t get there fully but catering to a smaller and smaller set of non-normalised transport-time names is a good thing and will make certain types of analysis easier)
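A sketch of the "existing re.sub with a context-dependent target symbol" idea, assuming the PEP 503 pattern is the re.sub being referred to. The separator parameter is an addition made for this illustration, and the project name is the example used earlier in the thread.

```python
import re


def normalize(name: str, separator: str = "-") -> str:
    """Fully normalize a name, joining the parts with the given separator."""
    return re.sub(r"[-_.]+", separator, name).lower()


display_name = "Foo.Bar-_BAZ"  # stays as written in the METADATA Name field

print(normalize(display_name))       # form used everywhere else
print(normalize(display_name, "_"))  # form used in distribution filenames
```

The display name is never rewritten; only the derived forms foo-bar-baz and foo_bar_baz change with the context in which they are used.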

pradyunsg · June 16, 2023, 11:05am

I’ll quickly note this before stepping away for a bit…

All normalised names are also valid non-normalised names and we have a restriction on what is a valid non-normalised name that can be provided as input to the normalisation process, so these two bullets are equivalent as I’m reading them. I might be missing something, or assuming something incorrectly though; so…

If 2 means that we can’t have pip and installer be valid package names, that’s the most disruptive option IMO.