Namespace support in PyPI

Short thought:

I think that we don’t need to do anything but change PyPI here and make it possible for someone to claim a prefix. As far as pip, setuptools, etc. are concerned, nothing has changed. All we should change, IMO, is that on PyPI we can claim entire prefixes.

A concrete example is a project I’ve recently worked on for AWS where we have like a hundred packages named something like aws-cdk.something, and it would be awesome if we could claim anything that starts with aws-cdk.* on PyPI and prevent other users from using that namespace.

I do not think that pip or other projects would have to care at all about this. It is possible that some companies may whitelist entire namespaces, but I suspect that they would never require it. It would just be an optimization, plus allow some level of control so people can say “Oh I know everything under this prefix on PyPI is the same org, I don’t have to vet them individually”.

2 Likes

Seems reasonable, although I’d like it if there were some sort of rule that “you must have a certain number of packages with that prefix already uploaded to claim the prefix”, to deter pre-emptive name squatting. It’s hard enough to find good names that are unused on PyPI already.

Right, so I’ve mentioned before (though not on this thread) that I really like how NuGet has handled this, and basically it works like this:

  • If you register a namespace, any packages you own that use that namespace get a little icon that says “hey, this is part of this namespace”.
  • Existing packages that use a newly registered namespace can continue to exist, but do not get that namespace icon.
  • New packages that use that namespace are rejected unless the person trying to create them is a member of that namespace.
  • Namespaces can be flagged as “public”, which basically means the third item in this list doesn’t apply to them: anyone can continue to upload packages, but only the ones from the “official” namespace owners get the little icon thing. (A rough sketch of this logic follows the list.)
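
A minimal sketch of what that upload-time check could look like, assuming names are already normalized (the data model and function here are invented for illustration; none of this corresponds to real Warehouse internals):

```python
# Hypothetical NuGet-style namespace check at project-creation time.
# Assumes `name` is already PEP 503-normalized; nothing here reflects
# actual Warehouse code or data models.

reservations = {
    "aws-cdk-": {"owners": {"aws"}, "public": False},
    "azure-": {"owners": {"microsoft"}, "public": True},
}

def check_new_project(name: str, uploader: str) -> tuple[bool, bool]:
    """Return (allowed, gets_official_badge) for a brand-new project name."""
    for prefix, ns in reservations.items():
        if name.startswith(prefix):
            if uploader in ns["owners"]:
                return True, True    # namespace member: allowed, badged
            if ns["public"]:
                return True, False   # public namespace: allowed, no badge
            return False, False      # private namespace: rejected
    return True, False               # unreserved name: allowed, as today

# check_new_project("aws-cdk-s3", "someone-else")    -> (False, False)
# check_new_project("azure-widgets", "someone-else") -> (True, False)
```

Existing packages would be grandfathered, so a check like this would only ever run for names that don’t exist yet.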

For NuGet, getting a namespace requires applying for the namespace and getting it approved; they have some rough guidelines for the criteria they use:

  1. Does the package ID prefix properly and clearly identify the package owner?
  2. Has the owner already submitted a significant number of packages under the package ID prefix?
  3. Is the package ID prefix something common that should not belong to any individual owner or organization?
  4. Would not reserving the package ID prefix cause ambiguity and confusion for the community?
  5. Are the identifying properties of the packages that match the package ID prefix clear and consistent (especially the package author)?

They have one more thing about licenses, but that doesn’t really apply to us.

So given that it requires an application + approval, and that the criteria can roughly be boiled down to “is your use of this prefix notable enough for us to block all future people from using it?”, it’s a lot harder for there to be a land-grab kind of situation going on.

We could also expose this information in the APIs, for projects that want to take advantage of it (hypothetically, you could imagine pip search telling you whether something is part of a reserved namespace or not), but accepting that information should be completely optional.
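
Purely as a hypothetical illustration of what that could look like from a client’s side (the `namespace` key below is invented for this example; today’s JSON API has no such field):

```python
# Sketch: a client inspecting an invented "namespace" field in PyPI's
# JSON API. The endpoint is real; the "namespace" key is hypothetical.
import json
from urllib.request import urlopen

with urlopen("https://pypi.org/pypi/aws-cdk.core/json") as resp:
    info = json.load(resp)["info"]

ns = info.get("namespace")  # invented key -- would be None today
if ns:
    print(f"{info['name']} is part of the reserved namespace {ns!r}")
```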

For more information on how NuGet works, you can read https://docs.microsoft.com/en-us/nuget/reference/id-prefix-reservation and personally I’d just wholesale steal (er, borrow) text from that, adjusting where it makes sense.

1 Like

What I am missing is why we need namespaces when all we really want
is an easier way to say “I know these developers and trust their
packages”.

If pip/setuptools were to allow for a list of trusted authors,
I think it’d be more helpful for the use case of preventing
typosquatting, or for trust inference.

Namespaces or prefixes are a form of regulation, with all strings
attached, including the need for a workforce to take care of reviews,
appeals, revocation, etc. etc.

1 Like

To my mind the primary use case isn’t the compliance one, though it can be used for that. The primary use case is really about making it easier for projects that have a large number of packages to communicate with users about which packages are theirs or not.

I gave the example earlier of aws_cdk.* where AWS has just about 100 packages published using that namespace, roughly one per service. Right now any random person can grab a new aws_cdk.* package and pretend to be AWS. Now you can, with a little bit of digging, figure out it’s not AWS by looking at the users associated with the package, but that introduces a risk of confusion for users, because it will be pretty easy to miss the one package, out of the dozens they might be using, that is a little bit different from the rest.

This isn’t really just an AWS problem either; Azure also ships a large number of packages that follow a common pattern like this, and I’m not sure whether Google does or not.

Roughly: when projects ship a large number of related packages, there’s currently no way to strongly link those projects together so that end users can easily differentiate between projects that are part of that “set” of packages and projects that are not. There may be another way of solving it, but namespaces do solve it pretty cleanly, and I think they solve it in the most unambiguous way for users, such that they are least likely to inadvertently fall into a footgun trap.

1 Like

For cases like this, it would probably be worth the PEP discussing how the transition would work - particularly if people are using foo.bar as package names right now, but the namespace version would be foo-bar (or vice versa, or whatever). I don’t think that keeping the old names and publishing new versions under the new names is a very good idea - it makes the name clutter on PyPI even worse, and adds to user confusion (should I use aws_cdk.foo or aws_cdk-foo?)

The transition discussion should also address where there’s a clash right now - with aws_cdk.official owned by AWS, and aws_cdk.impostor that’s not owned by AWS, how would transitioning to an AWS owned namespace work? Obviously the name of the non-owned package wouldn’t change. So what would change? And how would users perceive that change?

For this I think the answer is that it wouldn’t change? Packaging considers -, _, and . as exactly the same character, so I don’t think namespace support would be any different.
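
For reference, this is the standard PEP 503 name normalization, under which aws_cdk.foo and aws_cdk-foo are literally the same name:

```python
import re

def normalize(name: str) -> str:
    """PEP 503 normalization: runs of '-', '_' and '.' collapse to '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()

assert normalize("aws_cdk.foo") == normalize("aws_cdk-foo") == "aws-cdk-foo"
```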

Agreed. The NuGet solution to this is that “official” packages get some visual indicator and non-official ones do not. In the case of existing packages there isn’t really a perfect solution, but the general suggestion would be for orgs that want a reserved namespace to select one that isn’t already in use by anyone, even if we technically allow it.

Yep, ultimately I just think it needs to be spelled out in the PEP. The actual approach taken won’t make much difference, as long as it’s all clearly stated. (Specifically, so that other tools like devpi which want to reflect PyPI’s model can do so).

In terms of transition strategy for how this would be implemented, the best parallels I’ve found for existing flat namespaces that adopted some sort of namespace strategy are NuGet (.NET packages) and npm (Node.js packages). The two communities took very different approaches. I’ll attempt to summarize them here.

NuGet’s adoption of namespaces was inspired by concerns about the integrity and trust of NuGet packages: trying to make it easier for package consumers to determine who had produced a given package and gauge whether the provider is trustworthy. They took an “in place transition” approach using a dot-delimited syntax, where all existing packages were grandfathered, even if they happened to sit smack in the middle of a namespace claimed by some group. (It reminds me of Chinese nail houses.) They did indicate that those grandfathered packages would be delisted if they exhibited malicious behavior that took advantage of their status within that namespace, and as @dstufft mentioned, they also had visual indicators showing the verified identity of the package provider, distinct from the package namespace.

NPM took a different approach, and described their motivation as being largely driven by having over 140,000 packages in their global namespace, with devs having difficulty coming up with original yet meaningful package names. NPM introduced a completely new syntax for what they called “scoped modules”, using a @<namespace>/<packagename> syntax. This was a much more invasive change, as it required changes to both their package repository and the package installers. Each npm user received a namespace based on their username, and paying organizational customers could create an org namespace (for those that aren’t aware, npmjs.org has paid tiers of service). This syntax introduction also enabled “private” packages, which reside within a reserved namespace but aren’t shared publicly.
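
To make the syntax concrete, a scoped npm name splits into a scope and a bare package name roughly like this (a toy parser for illustration, not npm’s actual implementation):

```python
def split_scoped(name: str) -> tuple[str | None, str]:
    """Split an npm-style name, e.g. '@aws-cdk/core' -> ('aws-cdk', 'core')."""
    if name.startswith("@") and "/" in name:
        scope, _, package = name[1:].partition("/")
        return scope, package
    return None, name  # unscoped: lives in the global namespace

assert split_scoped("@aws-cdk/core") == ("aws-cdk", "core")
assert split_scoped("lodash") == (None, "lodash")
```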

Since NPM introduced scoped packages, they’ve been widely embraced by the Node.js community, but the global namespace is still used as well. NPM also implemented a “deprecated” flag on their package repo to give package maintainers moving to scoped namespaces a way to signal to consumers what had changed.

To me, NuGet’s approach is simpler (and we all know that “simple is better than complex”), but NPM’s approach is more explicit (and we all know that “explicit is better than implicit”). The bottom line for me (as I think back on Russell’s keynote last week) is which approach best positions Python for another 25 years of growth, innovation, and collaboration?

I think a crucial difference between Python and JS here is that Python also has a global package namespace after installation, and JS doesn’t. For JS, if you let different people register @user1/foo and @user2/foo, then that’s fine; for Python, they probably both want to use the foo name at runtime, and thus couldn’t be installed in the same environment.

Given that we do have a global namespace at runtime, and are stuck with that (barring major changes to the Python language itself), it’s probably simpler if packaging also keeps a global namespace.
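
You can see the clash concretely with the standard library: on Python 3.10+, importlib.metadata.packages_distributions() maps import names back to the distributions that provide them, and two distributions shipping the same top-level package simply collide:

```python
# Python 3.10+: map top-level import names to their providing distributions.
from importlib.metadata import packages_distributions

mapping = packages_distributions()
# If hypothetical user1-foo and user2-foo both shipped a top-level `foo`,
# both would claim the same import name -- and whichever was installed
# last would own the files on disk.
print(mapping.get("foo"))  # e.g. ['user1-foo', 'user2-foo']
```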

1 Like

My sense is the more likely use-case there is folks wanting to use @user2/foo as a “drop-in-replacement” for @user1/foo, vice wanting to use two different packages, both of which happen to be named foo. I’ll admit that’s a bit of a double-edged sword, as it could encourage forking vice collaboration, but also enable more innovation across the ecosystem.

The other thing I’ve observed with JS (in at least a few cases) is that the move towards scoped modules seems to have encouraged more modular code, with smaller individual packages than existed previously, as it was easy to split out ancillary functionality alongside the core package in the same namespace. (While this can be done in a flat namespace as well, the scoping provides a much stronger cue that the packages are developed and intended to work together.)

FWIW I think that the NPM style is likely a non-starter. I don’t think it’s worth the likely confusion users will have when the PyPI package name and the import package name don’t match. This isn’t a new thing, in that they don’t have to match today, but the general recommendation is that they do; with an NPM-style namespace, the requirement would have to be that they don’t match.

From my POV, the NuGet style is the best match for PyPI going forward, as it doesn’t require any real changes anywhere but PyPI itself.

I’ve been doing some analysis of the top 5000 packages (by download count), and have found a few relatively clean namespaces (e.g. azure- contains nearly 200 packages, all but 2 of which are maintained by Microsoft), and others that are very polluted (I’m sure nobody will be surprised to find that there are over 300 packages in the top 5000 in the django- namespace, from many different contributors).
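
For anyone who wants to reproduce that analysis, the gist of it is a simple grouping pass (the input data here is assumed to be (normalized name, maintainer set) pairs gathered separately, e.g. from PyPI’s APIs and the public download-count dataset):

```python
from collections import defaultdict

def group_by_prefix(packages, prefix_len=1):
    """Group (name, maintainers) pairs by their first `prefix_len` name tokens."""
    groups = defaultdict(list)
    for name, maintainers in packages:  # maintainers as a frozenset
        prefix = "-".join(name.split("-")[:prefix_len]) + "-"
        groups[prefix].append((name, maintainers))
    return groups

def distinct_maintainer_sets(group):
    """A rough 'pollution' measure: how many maintainer sets share the prefix."""
    return len({maintainers for _, maintainers in group})

# e.g. the "azure-" group has essentially one maintainer set, while the
# "django-" group has hundreds of distinct contributors.
```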

Similarly, some groups have packages scattered across the global namespace with completely unrelated names. On this piece, I am a bit leery of unintended consequences of introducing namespaces… will it encourage maintainers to relocate/consolidate existing packages under “their” namespace? And if so, how do we support that in a way that doesn’t break existing users? Do we implement some kind of package redirect config/support in PyPI? Or a deprecation warning like what NPM did? I agree that the NuGet approach is simpler in terms of limiting the impact, but there are definitely still devils in the details.

I did stumble across something in my research where somebody suggested a convention where if the namespace and the package had the same name they could use a shorthand reference of just name vice @name/name. Unfortunately I can’t find that link now that I’m looking for it. I’m not sure that changes the argument against NPM-style here, but did strike me as an interesting way of keeping things simple from a user perspective while giving power to package devs.

This seems like a non-starter outside of niche packages that are only used in a single organization, because even if I want to use @user2/foo as a drop-in replacement for @user1/foo, then I probably can’t convince everyone else on PyPI to update their requirements to replace @user1/foo with @user2/foo.

You’d typically only see this in the case of abandoned packages, but it does also exist with “private” packages that aren’t shared publicly. I have a customer right now that does this because they have a slightly non-spec OAuth endpoint, so they drop in a replacement package, forked from the upstream, to accommodate that situation.

That sounds like they want to overlay their own packages on the existing names, not create whole new names.

I’ve seen scenarios for both. The problem with overwriting existing names (in a global namespace construct) is that it makes mirroring much more of a PITA, as you have to either maintain separate repos with the “internal” versions (and remember to point to those) or do other annoying things to mirror these packages but not those.

My point is that namespaces don’t help this case. If you need a drop-in replacement for the package foo, and calling it mycompany-foo doesn’t work, then calling it @mycompany/foo won’t work either.

What you call “polluted” is actually very common for open source
projects. It’s clear that a vendor will aim to use a more consistent
approach, simply because their internal policies typically require
this.

Please also take into account that in Python, the PyPI package name !=
the Python package name, i.e. a PyPI package may well install a
completely different Python package on PYTHONPATH.

Your example with drop-in PyPI packages for a particular Python
package can easily be made to work, because of this.
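
The classic example of this: the distribution name on PyPI and the import name it provides can be entirely unrelated.

```python
# "beautifulsoup4" is the PyPI distribution name; what it installs is
# the `bs4` import package:
#
#     pip install beautifulsoup4
#
import bs4
```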

The only benefit I see from having prefixes reserved to vendors
is to make it easier for users of those PyPI package to quickly
identify the source of the package.

But then again: they can have the same ease of use by simply looking
at the package maintainer field, so it’s only a very minor win.

1 Like

The biggest advantage for me is that a maintainer can be at peace, confident that their package naming scheme on PyPI is guaranteed to be future-proof.

PyObjC, for example, publishes a number of packages under the pyobjc-framework-[name] prefix, each corresponding to a macOS framework called [name].framework. If Apple announces a new framework called FooKit today, there’s nothing stopping me from publishing pyobjc-framework-fookit, putting both PyObjC and its users in an awkward position. Yeah, they can publish their FooKit binding under pyobjc-framework-fookit-real, and it’s easy to identify the actual package by reading the maintainers list, but that’s just annoying for everybody.

Granted, nobody actually does this, but it’s nice not to have to rely solely on good intent. I feel there aren’t really any downsides to this proposal, and those minor wins accumulate to make it worthwhile in the end.