Namespace support in PyPI

In terms of a transition strategy for how this could be implemented, the best parallels I’ve found for existing flat namespaces that adopted some sort of namespace strategy are NuGet (.NET packages) and npm (Node.js packages). The two communities took very different approaches. I’ll attempt to summarize them here.

NuGet’s adoption of namespaces was inspired by concerns about the integrity and trustworthiness of NuGet packages - trying to make it easier for package consumers to determine who had produced a given package and gauge whether the provider is trustworthy. They took an “in place transition” approach using a dot-delimited syntax, where all existing packages were grandfathered, even if they happened to sit smack-in-the-middle of a namespace claimed by some group. (It reminds me of Chinese Nail Houses.) They did indicate that those grandfathered packages would be delisted if they exhibited malicious behavior that took advantage of their status within that namespace, and as @dstufft mentioned, they also added visual indicators showing the verified identity of the package provider, distinct from the package namespace.

NPM took a different approach, describing their motivation as largely driven by having over 140,000 packages in their global namespace and devs having difficulty coming up with original yet meaningful package names. NPM introduced a completely new syntax for what they called “scoped modules”, using a @<namespace>/<packagename> syntax. This was a much more invasive change, as it required changes to both their package repository and the package installers. Each npm user received a namespace based on their username, and paying organizational customers could create an org namespace (for those that aren’t aware, npmjs.org has paid tiers of service). This syntax also enabled “private” packages - which reside within a reserved namespace but aren’t shared publicly.

Since NPM introduced scoped packages, they’ve been widely embraced by the NodeJS community. But the global namespace is still used as well. NPM also implemented a “deprecated” flag on their package repo to give package maintainers moving to scoped namespaces a way to signal to consumers what had changed.

To me, NuGet’s approach is simpler (and we all know that “simple is better than complex”), but NPM’s approach is more explicit (and we all know that “explicit is better than implicit”). The bottom line for me (as I think back on Russell’s keynote last week) is which approach best postures Python for another 25 years of growth, innovation, and collaboration?

I think a crucial difference between Python and JS here is that Python also has a global package namespace after installation, and JS doesn’t. For JS, if you let different people register @user1/foo and @user2/foo, then that’s fine; for Python, they probably both want to use the foo name at runtime, and thus couldn’t be installed in the same environment.

Given that we do have a global namespace at runtime, and are stuck with that (barring major changes to the Python language itself), it’s probably simpler if packaging also keeps a global namespace.
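That runtime constraint is easy to demonstrate. A quick illustrative sketch (the directory names and the OWNER marker are mine, purely for demonstration): if two independently published distributions both provide a top-level foo, whichever copy appears first on sys.path wins and the other is silently shadowed:

```python
# Illustrative sketch: two independently published "foo" packages
# cannot coexist at runtime, because Python resolves the import
# name "foo" to whichever copy appears first on sys.path.
import pathlib
import sys
import tempfile

tmp = pathlib.Path(tempfile.mkdtemp())
for owner in ("user1", "user2"):
    pkg_dir = tmp / owner
    pkg_dir.mkdir()
    (pkg_dir / "foo.py").write_text(f"OWNER = {owner!r}\n")

# Pretend both distributions are "installed" side by side.
sys.path[:0] = [str(tmp / "user1"), str(tmp / "user2")]

import foo  # noqa: E402

print(foo.OWNER)  # user1's copy shadows user2's entirely
```

No amount of namespacing on the index side changes which foo the interpreter finds here.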


My sense is the more likely use case there is folks wanting to use @user2/foo as a “drop-in replacement” for @user1/foo, rather than wanting to use two different packages that both happen to be named foo. I’ll admit that’s a bit of a double-edged sword, as it could encourage forking over collaboration, but it could also enable more innovation across the ecosystem.

The other thing I’ve observed with JS (in at least a few cases) is that the move toward scoped modules seems to have encouraged more modular code with smaller individual packages than existed previously, as it became easy to split out ancillary functionality alongside the core package in the same namespace. (While this can be done in a flat namespace as well, the scoping provides a much stronger cue that the packages are developed and intended to work together.)

FWIW I think that the NPM style is likely a non-starter. I don’t think it’s worth the confusion users will have when the PyPI package name and the import package name don’t match. This isn’t a new thing, since they don’t have to match today, but the general recommendation is that they do; with an NPM-style namespace, however, they would effectively be required not to match.

From my POV, the NuGet style is the best match for PyPI going forward, as it doesn’t require any real changes anywhere but PyPI itself.

I’ve been doing some analysis of the top 5000 packages (by download count), and have found a few relatively clean namespaces (e.g. azure- contains nearly 200 packages, all but 2 of which are maintained by Microsoft), and others that are very polluted (I’m sure nobody will be surprised to find that there are over 300 packages in the top 5000 in the django- namespace, from many different contributors).
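One wrinkle with this kind of prefix analysis: PyPI normalizes names per PEP 503, so azure.storage, Azure_Storage, and azure-storage are all the same project. A small sketch of prefix matching with normalization applied (the in_namespace helper is mine, not a PyPI API):

```python
import re

def normalize(name: str) -> str:
    # PEP 503 normalization: runs of "-", "_", "." collapse to a
    # single "-", and the result is lowercased.
    return re.sub(r"[-_.]+", "-", name).lower()

def in_namespace(name: str, prefix: str) -> bool:
    # Treat a package as part of e.g. the "azure-" namespace if its
    # normalized name starts with the normalized prefix plus "-".
    return normalize(name).startswith(normalize(prefix) + "-")

print(normalize("Azure.Storage_Blob"))         # azure-storage-blob
print(in_namespace("azure_storage", "azure"))  # True
print(in_namespace("azurite", "azure"))        # False
```

Without normalization, a naive `startswith("azure-")` check would miss packages published with underscores or dots.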

Similarly, some groups have packages scattered across the global namespace with completely unrelated names. On this piece, I am a bit leery of the unintended consequences of introducing namespaces… will it encourage maintainers to relocate/consolidate existing packages under “their” namespace? And if so, how do we support that in a way that doesn’t break existing users? Do we implement some kind of package redirect config/support in PyPI? Or a deprecation warning like what NPM did? I agree that the NuGet approach is simpler in terms of limiting the impact, but there are definitely still devils in the details.

I did stumble across something in my research where somebody suggested a convention where, if the namespace and the package had the same name, you could use a shorthand reference of just name rather than @name/name. Unfortunately I can’t find that link now that I’m looking for it. I’m not sure it changes the argument against the NPM style here, but it struck me as an interesting way of keeping things simple from a user perspective while giving power to package devs.

This seems like a non-starter outside of niche packages that are only used in a single organization, because even if I want to use @user2/foo as a drop-in replacement for @user1/foo, then I probably can’t convince everyone else on PyPI to update their requirements to replace @user1/foo with @user2/foo.

You’d typically only see this in the case of abandoned packages, but it does also exist with “private” packages that aren’t shared publicly. I have a customer right now that does this because they have a slightly non-spec OAuth endpoint…so they drop in a replacement package that they forked from the upstream to accommodate that situation.

That sounds like they want to overlay their own packages on the existing names, not create whole new names.

I’ve seen scenarios for both. The issue with overwriting existing names (in a global namespace construct) is that it makes mirroring much more of a PITA, as you have to either maintain separate repos with the “internal” versions (and remember to point to those) or do other annoying things to mirror these packages but not those.

My point is that namespaces don’t help this case. If you need a drop-in replacement for the package foo, and calling it mycompany-foo doesn’t work, then calling it @mycompany/foo won’t work either.

What you call “polluted” is actually very common for open source projects. It’s clear that a vendor will aim for a more consistent approach, simply because their internal policies typically require this.

Please also take into account that in Python, the PyPI package name != the Python package name, i.e. a PyPI package is perfectly able to install a completely different Python package on PYTHONPATH.

Your example with drop-in PyPI packages for a particular Python package can easily be made to work because of this.
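For example (all names here are hypothetical), a fork published under a company-specific distribution name can still install under the original import name, which is exactly what makes a drop-in replacement work:

```toml
# pyproject.toml for a hypothetical drop-in fork: the distribution
# name on PyPI differs from the import name it installs.
[project]
name = "mycompany-oauthlib-patched"   # what you `pip install`
version = "1.0.0"

[tool.setuptools]
packages = ["oauthlib"]               # what you `import`
```

Consumers keep writing `import oauthlib`; only the install-time requirement changes.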

The only benefit I see from having prefixes reserved to vendors is making it easier for users of those PyPI packages to quickly identify the source of the package.

But then again: they can have the same ease of use by simply looking at the package maintainer field, so it’s only a very minor win.

The biggest advantage for me is that a maintainer can be at peace, confident that their package naming scheme on PyPI is guaranteed to be future-proof.

PyObjC, for example, publishes a number of packages under the pyobjc-framework-[name] prefix, each corresponding to a macOS framework called [name].framework. If Apple announces today a new framework called FooKit, there’s nothing stopping me from publishing pyobjc-framework-fookit and putting both PyObjC and its users in an awkward position. Yeah, they can publish their FooKit binding under pyobjc-framework-fookit-real, and it’s easy to identify the actual package by reading the maintainers list, but that’s just annoying for everybody.

Granted, nobody actually does this, but it’s nice to be able to not rely solely on good intent. I feel there are not really any downsides in this proposal, and those minor wins accumulate to make it worthwhile in the end.

One thing that seems to get missed, and that is a major need for namespaces, is allowing organizations to run internal package repositories. Without some method for a company to set up its own namespace, there is no way to prevent conflicts between company-internal packages and those on the public Python repositories.

Simple example: say there is a large company called largeco that has internal packages under the largeco package namespace.

So they create a package like largeco.coolutils and they have a ton of internal packages that use that package as a dependency.

Then at some point someone creates a totally different package called largeco.coolutils and publishes it to the public repositories.

This breaks all of largeco’s internal packages that use the largeco namespace. Of course, if the public package is malicious, it could do other things, many of which could be serious security issues.

So namespaces aren’t only about knowing the package source but also about being able to prevent package conflicts.


This would also provide a solution to people raising issues like pypa/pip#3454 and pypa/pipenv#2159, where the fundamental problem is pip does not have a way to prefer a package source.
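The failure mode behind those issues shows up right in the configuration (URLs here are hypothetical): extra-index-url adds a source but does not prefer it, so pip merges candidates from both indexes and installs whichever version sorts highest, public or internal:

```ini
# pip.conf -- hypothetical internal-mirror setup. Note that
# extra-index-url does NOT give the internal index priority: a
# higher-versioned public package with the same name still wins.
[global]
index-url = https://pypi.org/simple
extra-index-url = https://pypi.internal.largeco.example/simple
```

This is exactly why a public largeco.coolutils can clobber the internal one.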

The problem IMO, however, is how to work out a balanced policy. The general criterion mentioned above (one can manually apply if the entity has a significant number of packages) is likely not useful to companies wanting to reserve names for internal packages, but a liberal approach (e.g. allowing reservation of <name>-* if the entity owns the <name> package and/or username) would be very vulnerable to name-squatting. Maybe some compromise is possible? Say, automatically reserve <name>-internal-* for the owners of package <name>.
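As a sketch of what that compromise might look like (this is a hypothetical policy check of my own, not anything PyPI implements):

```python
import re

def normalize(name: str) -> str:
    # PEP 503-style name normalization.
    return re.sub(r"[-_.]+", "-", name).lower()

def may_upload(new_name: str, uploader: str, owners: dict) -> bool:
    """Hypothetical policy: <name>-internal-* is reserved for the
    owner of <name>; all other names are first come, first served.

    `owners` maps existing normalized project names to their owner.
    """
    n = normalize(new_name)
    match = re.match(r"(?P<base>.+?)-internal(?:-|$)", n)
    if match and match.group("base") in owners:
        return owners[match.group("base")] == uploader
    return n not in owners

owners = {"largeco": "largeco-inc"}
print(may_upload("largeco-internal-coolutils", "largeco-inc", owners))  # True
print(may_upload("largeco-internal-coolutils", "mallory", owners))      # False
print(may_upload("someother-package", "mallory", owners))               # True
```

Even a rule this simple needs the normalization step, or it could be bypassed with largeco_internal_coolutils.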

I proposed something similar about a year ago on pypa/warehouse; for more information, see my discussion and ideas here (a sort of draft idea)

So it sounds like we’ve identified three potential use cases for namespaces so far:

  1. Expanding the space of available package names to reduce conflicts and make it possible to publish forked packages without renaming everything.

    • Comment: IMO this doesn’t seem very promising right now, because we don’t have good ways to manage the resulting conflicts at the Python import level. Maybe it’s worth revisiting after we have a robust resolver and Conflicts metadata?
  2. Accurately signaling the origin of public packages. For example, if a package is called largeco-blah, end users might appreciate knowing whether the package is maintained by LargeCo Inc. or not.

    • Comment: this is essentially the same issue that classic trademark is trying to address – giving people accurate information about what they’re getting. We already have some relevant policies here – in particular, PEP 541 has mechanisms for handling trademark disputes – but they’re fairly ad hoc; this would be systematizing them. Some challenges include: how do we handle the tension between names that designate origin vs names that describe usage (e.g. pygithub is a package for working with github, so it’s an accurate descriptive usage, but it’s not maintained by GitHub Inc.)? How do we effectively communicate the difference to users? If PyPI is going to be in the business of promising to users that azure-storage comes from Microsoft, then how do the PyPI administrators figure out that they’re actually talking to Microsoft and not some scammer? (This is basically the same problem as Certificate Authorities have to solve, and it’s highly non-trivial.)
  3. Reserving portions of the namespace for private usage. Lots of organizations have internal packages; they definitely don’t want to accidentally get a public package that happens to use the same name, and they would prefer that no such public package exist (since it’s awkward to have unrelated packages where you can’t install both of them at the same time, and maybe their package will become public later).

    • Comment: This is essentially asking for PyPI to create a formal, blessed way to squat names. So the challenge would be to find a way to balance the public’s desire to keep names available to use and not be locked up by speculation or some opaque and unaccountable process, versus organizations’ desire to avoid accidental conflicts. One approach might be to carve out a specific namespace for this usage, e.g. prohibit packages on PyPI that start with private- and then document that everyone’s internal packages should use this. In the meantime, there are other options like using devpi (as noted up thread). This is clearly a common problem though, so at a minimum we should have some docs addressing it.

Thanks for the summary, @njs!

For those who haven’t been following it, here’s the GitHub issue about planning the rollout of the new pip resolver.

I believe @dustin is working on the PEP 541 process (and, towards that goal, on a user support ticket for PyPI). Perhaps he could speak more to how frequently we see trademark questions come up currently among those support requests?

Perhaps this could be on https://packaging.python.org – anyone want to take a stab at writing this up as a guide and improve https://packaging.python.org/guides/hosting-your-own-index/ along the way? People do want clearer and more discoverable recommendations for the intersection of private stuff and PyPI.

The typeshed project, which provides PEP 484 type stubs, is currently discussing distributing non-standard-library type stubs via PyPI (https://github.com/python/typeshed/issues/2491). Currently all stubs are vendored by the type checkers, but this approach doesn’t scale. Similar to DefinitelyTyped in the JavaScript world, we’d like users of a package foo to be able to install the corresponding stubs by just typing pip install types.foo or something similar.

But we’d need to ensure that people can’t squat these names, for security reasons. Unlike with other Python packages, people will just try to install the type stubs without vetting them first, and they should be able to. But without namespacing this would open a wide door for attackers. So for us, namespacing is essential.

From the descriptions above, I’d agree with that. It sounds like the problem statement they started with was very similar to ours, and so the solution ended up offering characteristics we consider desirable: genuinely opt-in (so the folks for whom the existing flat namespace is working well don’t need to care), and with a centrally administered approval process so you didn’t get a proliferation of vanity namespaces producing install time package conflicts.

Slightly related since it’s relevant for internal-only packages, pypi.org will never have a classifier that starts with "Private :: " and it rejects uploads with invalid classifiers. (PR w/ link to a tweet)