Namespace support in PyPI

(Donald Stufft) #16

To my mind the primary use case isn’t the compliance one, though it can be used for that. The primary use case is really about making it easier for projects that have a large number of packages to communicate with users about which packages are theirs or not.

I gave the example earlier of aws_cdk.*, where AWS has just about 100 packages published using that namespace, roughly one per service. Right now any random person can grab a new aws_cdk.* package name and pretend to be AWS. You can, with a little bit of digging, figure out it’s not AWS by looking at the users associated with the package, but that introduces a risk of confusion for users: it will be pretty easy to miss that one of the dozen or so such packages they might be using is a little bit different than the rest.

This isn’t really just an AWS problem either; Azure also ships a large number of packages that follow a common pattern like this, and I’m not sure whether Google does or not.

Roughly, when projects ship a large number of related projects, there’s currently no way to strongly link those projects together so that end users can easily differentiate between projects that are part of that “set” of packages, versus projects that are not. There is maybe another way of solving it, but namespaces do solve it pretty cleanly and I think solve it in the most unambiguous way for users such that they are least likely to inadvertently fall into a footgun trap.

(Paul Moore) #17

For cases like this, it would probably be worth the PEP discussing how the transition would work - particularly if people are using foo.bar as a package name right now, but the namespaced version would be foo-bar (or vice versa, or whatever). I don’t think that keeping the old names and publishing new versions under the new names is a very good idea - it makes the name clutter on PyPI even worse, and adds to user confusion (should I use aws_cdk.foo or aws_cdk-foo?).

The transition discussion should also address where there’s a clash right now - with aws_cdk.official owned by AWS and aws_cdk.impostor not owned by AWS, how would transitioning to an AWS-owned namespace work? Obviously the name of the non-owned package wouldn’t change. So what would change? And how would users perceive that change?

(Donald Stufft) #18

For this I think the answer is it wouldn’t change? Packaging considers -, _, and . as exactly the same character, so I don’t think Namespace support would be any different.
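
(For anyone unfamiliar with the rule being referenced: PEP 503 name normalization collapses runs of -, _, and . into a single - and lowercases the name, so aws_cdk.foo and aws_cdk-foo already refer to the same project. A minimal illustration, using the packaging library:)

```python
# Minimal illustration of PEP 503 name normalization: runs of '-', '_' and '.'
# collapse to a single '-' and the name is lowercased, so these spellings all
# identify the same project on PyPI.
import re

from packaging.utils import canonicalize_name

def normalize(name: str) -> str:
    # The normalization rule as written in PEP 503.
    return re.sub(r"[-_.]+", "-", name).lower()

for spelling in ("aws_cdk.foo", "aws-cdk-foo", "AWS_CDK.FOO"):
    assert normalize(spelling) == canonicalize_name(spelling) == "aws-cdk-foo"
```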

Agreed. The NuGet solution to this is that “official” packages get some visual indicator and non-official ones do not. In the case of existing packages there isn’t really a perfect solution, but the general suggestion would be for orgs that want a reserved namespace to select one that isn’t already in use by anyone, even if we technically allow it.

(Paul Moore) #19

Yep, ultimately I just think it needs to be spelled out in the PEP. The actual approach taken won’t make much difference, as long as it’s all clearly stated. (Specifically, so that other tools, like devpi, which want to reflect PyPI’s model can do so.)

(Dave Ashby) #20

In terms of a transition strategy for how this would be implemented, the best parallels I’ve found for existing flat namespaces that adopted some sort of namespace strategy are NuGet (.NET packages) and npm (Node.js packages). The two communities took very different approaches. I’ll attempt to summarize them here.

NuGet’s adoption of namespaces was inspired by concerns about the integrity and trust of NuGet packages - trying to make it easier for package consumers to determine who had produced a given package and gauge whether the provider is trustworthy. They took an “in place transition” approach using a dot-delimited syntax, where all existing packages were grandfathered, even if they happened to sit smack in the middle of a namespace claimed by some group. (It reminds me of Chinese nail houses.) They did indicate that those grandfathered packages would be delisted if they exhibited malicious behavior that took advantage of their status within that namespace, and, as @dstufft mentioned, also had visual indicators showing the verified identity of the package provider, which is distinct from the package namespace.

NPM took a different approach, and described their motivation as being largely driven by having over 140,000 packages in their global namespace and devs having difficulty coming up with original yet meaningful package names. NPM introduced a completely new syntax for what they called “scoped modules”, using an @<namespace>/<packagename> syntax. This was a much more invasive change, as it required changes to both their package repository and the package installers. Each npm user received a namespace based on their username, and paying organizational customers could create an org namespace (for those who aren’t aware, npmjs.org has paid tiers of service). This syntax also enabled “private” packages, which reside within a reserved namespace but aren’t shared publicly.

Since NPM introduced scoped packages, they’ve been widely embraced by the Node.js community, but the global namespace is still used as well. NPM also implemented a “deprecated” flag on their package repo to give package maintainers moving to scoped namespaces a way to signal to consumers what had changed.

To me, NuGet’s approach is simpler (and we all know that “simple is better than complex”), but NPM’s approach is more explicit (and we all know that “explicit is better than implicit”). The bottom line for me (as I think back on Russell’s keynote last week) is which approach best positions Python for another 25 years of growth, innovation, and collaboration?

(Nathaniel J. Smith) #21

I think a crucial difference between Python and JS here is that Python also has a global package namespace after installation, and JS doesn’t. For JS, if you let different people register @user1/foo and @user2/foo, then that’s fine; for Python, they probably both want to use the foo name at runtime, and thus couldn’t be installed in the same environment.

Given that we do have a global namespace at runtime, and are stuck with that (barring major changes to the Python language itself), it’s probably simpler if packaging also keeps a global namespace.

(Dave Ashby) #22

My sense is that the more likely use case there is folks wanting to use @user2/foo as a “drop-in replacement” for @user1/foo, rather than wanting to use two different packages that both happen to be named foo. I’ll admit that’s a bit of a double-edged sword, as it could encourage forking rather than collaboration, but it could also enable more innovation across the ecosystem.

The other thing I’ve observed with JS, in at least a few cases, is that the move towards scoped modules seems to have encouraged more modular code with smaller individual packages than existed previously, as it’s easy to split out ancillary functionality alongside the core package in the same namespace. (While this can be done in a flat namespace as well, the scoping provides a much stronger cue that the packages are developed and intended to work together.)

(Donald Stufft) #23

FWIW I think that the NPM style is likely a non-starter. I don’t think it’s worth the confusion users will have when the PyPI package name and the import package name don’t match. This isn’t a new thing, in that they don’t have to match today, but the general recommendation is that they do; with an NPM-style namespace, the requirement would have to be that they don’t match.

From my POV, the NuGet style is the best match for PyPI going forward, as it doesn’t require any real changes anywhere but PyPI itself.

(Dave Ashby) #24

I’ve been doing some analysis of the top 5000 packages (by download count), and have found a few relatively clean namespaces (e.g. azure- contains nearly 200 packages, all but 2 of which are maintained by Microsoft), and others that are very polluted (I’m sure nobody will be surprised to find that there are over 300 packages in the top 5000 in the django- namespace, from many different contributors).
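
(A rough sketch of the kind of prefix grouping described here - not the exact analysis that was run, and top-packages.txt is an invented input file with one project name per line:)

```python
# A rough sketch of grouping popular project names by their leading
# dash-delimited component (not the exact analysis described above).
# Assumes a hypothetical file "top-packages.txt" with one project name per line.
import re
from collections import Counter

def leading_prefix(name: str) -> str:
    # Normalize per PEP 503, then treat the first component as the "namespace".
    return re.sub(r"[-_.]+", "-", name).lower().split("-")[0]

with open("top-packages.txt") as f:
    names = [line.strip() for line in f if line.strip()]

counts = Counter(leading_prefix(name) for name in names)
for namespace, count in counts.most_common(10):
    print(f"{namespace}-*: {count} packages")
```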

Similarly, some groups have packages scattered across the global namespace with completely unrelated names. On this piece, I am a bit leery of the unintended consequences of introducing namespaces… will it encourage maintainers to relocate/consolidate existing packages under “their” namespace? And if so, how do we support that in a way that doesn’t break existing users? Do we implement some kind of package redirect config/support in PyPI? Or a deprecation warning like what NPM did? I agree that the NuGet approach is simpler in terms of limiting the impact, but there are definitely still devils in the details.

I did stumble across something in my research where somebody suggested a convention: if the namespace and the package have the same name, you can use a shorthand reference of just name instead of @name/name. Unfortunately I can’t find that link now that I’m looking for it. I’m not sure it changes the argument against the NPM style here, but it did strike me as an interesting way of keeping things simple from a user perspective while giving power to package devs.

(Nathaniel J. Smith) #25

This seems like a non-starter outside of niche packages that are only used in a single organization, because even if I want to use @user2/foo as a drop-in replacement for @user1/foo, then I probably can’t convince everyone else on PyPI to update their requirements to replace @user1/foo with @user2/foo.

(Dave Ashby) #26

You’d typically only see this in the case of abandoned packages, but it does also exist with “private” packages that aren’t shared publicly. I have a customer right now that does this because they have a slightly non-spec OAuth endpoint, so they drop in a replacement package that they forked from the upstream to accommodate that situation.

(Nathaniel J. Smith) #27

That sounds like they want to overlay their own packages on the existing names, not create whole new names.

(Dave Ashby) #28

I’ve seen scenarios for both. The piece about overwriting existing names (in a global namespace construct) is that it makes mirroring much more of a PITA, as you have to either maintain separate repos with the “internal” versions (and remember to point to those) or do other annoying things to mirror these packages but not those.

(Nathaniel J. Smith) #29

My point is that namespaces don’t help this case. If you need a drop-in replacement for the package foo, and calling it mycompany-foo doesn’t work, then calling it @mycompany/foo won’t work either.

(Marc-André Lemburg) #30

What you call “polluted” is actually very common for open source projects. It’s clear that a vendor will aim to use a more consistent approach, simply because their internal policies typically require this.

Please also take into account that in Python the PyPI package name != the Python package name, i.e. a PyPI package can install a completely different Python package onto PYTHONPATH.

Your example with drop-in PyPI packages for a particular Python package can easily be made to work because of this.
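
(As a purely hypothetical sketch of that decoupling - the project name examplecorp-oauth-fork and the forked source tree are invented for illustration - a setup.py can publish under one name on the index while installing a differently named import package:)

```python
# Hypothetical setup.py for an internal fork: the project name used on the
# index ("examplecorp-oauth-fork", invented here) differs from the import
# package it installs ("oauthlib"), which is what makes drop-in replacements
# possible. Assumes a forked ./oauthlib/ source tree sits next to this file.
from setuptools import setup

setup(
    name="examplecorp-oauth-fork",  # name used by pip and the package index
    version="1.0.0",
    packages=["oauthlib"],          # import name that ends up on PYTHONPATH
)
```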

The only benefit I see from having prefixes reserved to vendors is that it makes it easier for users of those PyPI packages to quickly identify the source of the package.

But then again: they can get the same ease of use by simply looking at the package maintainer field, so it’s only a very minor win.

(Tzu-ping Chung) #31

The biggest advantage for me is that a maintainer can be at peace, confident that their package naming scheme on PyPI is guaranteed to be future-proof.

PyObjC, for example, publishes a number of packages under the pyobjc-framework-[name] prefix, each corresponding to a macOS framework called [name].framework. If Apple announces a new framework called FooKit today, there’s nothing stopping me from publishing pyobjc-framework-fookit and putting both PyObjC and its users in an awkward position. Yeah, they can publish their FooKit binding under pyobjc-framework-fookit-real, and it’s easy to identify the actual package by reading the maintainers list, but that’s just annoying for everybody.

Granted, nobody actually does this, but it’s nice not to have to rely solely on good intent. I feel there aren’t really any downsides to this proposal, and those minor wins accumulate to make it worthwhile in the end.

(Dwight Hubbard) #32

One thing that seems to get missed, and that is a major need for namespaces, is allowing entities to run internal package repositories. Without some method for a company to set up a company namespace, there is no way to prevent conflicts between company-internal packages and those on the public Python repositories.

Simple example: say there is a large company called largeco that has internal packages under the largeco package namespace.

So they create a package like largeco.coolutils and they have a ton of internal packages that use that package as a dependency.

Then at some point someone creates a totally different package called largeco.coolutils and publishes it to the public repositories.

This breaks all of largeco’s internal packages that use the largeco namespace. Of course, if the public package is malicious it could do other things, many of which could be serious security issues.

So namespaces aren’t only about knowing the package source but also about being able to prevent package conflicts.
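
(A toy sketch of why this bites - the index contents below are invented, and this is not pip’s actual resolution logic - an installer that consults both an internal and a public index and simply prefers the highest version can end up choosing the public upload over the internal package:)

```python
# Toy model of dependency confusion (not pip's real algorithm): when an
# internal and a public index both offer a project with the same name, an
# installer that just picks the highest version silently prefers the
# public upload. Index contents below are invented for illustration.
from packaging.version import Version

internal_index = {"largeco-coolutils": ["1.2.0", "1.3.0"]}
public_index = {"largeco-coolutils": ["99.0.0"]}  # unrelated (or malicious) upload

def best_candidate(project: str) -> tuple[str, str]:
    candidates = []
    for source, index in (("internal", internal_index), ("public", public_index)):
        for version in index.get(project, []):
            candidates.append((Version(version), source))
    version, source = max(candidates)
    return str(version), source

print(best_candidate("largeco-coolutils"))  # -> ('99.0.0', 'public')
```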

(Tzu-ping Chung) #33

This would also provide a solution to people raising issues like pypa/pip#3454 and pypa/pipenv#2159, where the fundamental problem is that pip does not have a way to prefer a package source.

The problem IMO, however, is how to work out a balanced policy. The general criteria mentioned above (one can manually apply if the entity has a significant number of packages) are likely not useful to companies wanting to reserve names for internal packages, but a liberal approach (e.g. allowing reservation of <name>-* if the entity owns the <name> package and/or username) would be very vulnerable to name-squatting. Maybe some compromise would be possible? Say, automatically reserve <name>-internal-* for the owners of package <name>.

(Hervé Beraud) #34

I proposed something similar about a year ago on pypa/warehouse; for more information see my discussion and ideas here (a sort of idea draft).

(Nathaniel J. Smith) #35

So it sounds like we’ve identified three potential use cases for namespaces so far:

  1. Expanding the space of available package names to reduce conflicts and make it possible to publish forked packages without renaming everything.

    • Comment: IMO this doesn’t seem very promising right now, because we don’t have good ways to manage the resulting conflicts at the Python import level. Maybe it’s worth revisiting after we have a robust resolver and Conflicts metadata?
  2. Accurately signaling the origin of public packages. For example, if a package is called largeco-blah, end users might appreciate knowing whether the package is maintained by LargeCo Inc. or not.

    • Comment: this is essentially the same issue that classic trademark is trying to address – giving people accurate information about what they’re getting. We already have some relevant policies here – in particular, PEP 541 has mechanisms for handling trademark disputes – but they’re fairly ad hoc; this would be systematizing them. Some challenges include: how do we handle the tension between names that designate origin vs names that describe usage (e.g. pygithub is a package for working with github, so it’s an accurate descriptive usage, but it’s not maintained by GitHub Inc.)? How do we effectively communicate the difference to users? If PyPI is going to be in the business of promising to users that azure-storage comes from Microsoft, then how do the PyPI administrators figure out that they’re actually talking to Microsoft and not some scammer? (This is basically the same problem as Certificate Authorities have to solve, and it’s highly non-trivial.)
  3. Reserving portions of the namespace for private usage. Lots of organizations have internal packages; they definitely don’t want to accidentally get a public package that happens to use the same name, and they would prefer that no such public package exist (since it’s awkward to have unrelated packages where you can’t install both of them at the same time, and maybe their package will become public later).

    • Comment: This is essentially asking for PyPI to create a formal, blessed way to squat names. So the challenge would be to find a way to balance the public’s desire to keep names available to use and not locked up by speculation or some opaque and unaccountable process, versus organizations’ desire to avoid accidental conflicts. One approach might be to carve out a specific namespace for this usage, e.g. prohibit packages on PyPI that start with private- and then document that everyone’s internal packages should use this. In the meantime, there are other options like using devpi (as noted upthread). This is clearly a common problem though, so at a minimum we should have some docs addressing it.