Namespace support in pypi

Elaborating on an idea initially surfaced in the Packaging Summit 2019 ideas thread and also mentioned at some point on the Warehouse GitHub site, this thread is intended to capture thoughts on the work needed to implement namespaces on PyPI.

I expect this could ultimately need to be discussed more broadly on relevant mailing lists. As someone who's new to the PyPA ecosystem, I'm totally open to suggestions about where best to communicate these ideas for feedback.

The basic idea is to implement support for namespaces in PyPI, allowing groups to aggregate different projects under a single namespace. A non-goal would be to deprecate support for the current “flat” packaging namespace. The bottom line is that this would be an “opt-in” model.

A secondary goal would be to implement functionality within warehouse allowing orgs to establish policies around how their packages are managed. One example might be giving namespace owners the option to require 2FA for publishing packages under their purview, or (slightly crazy thought) requiring certain metadata be populated for their projects.

This is just a start. Please contribute other ideas on this thread, as well as things that we should consciously avoid.

One important topic that came up in discussions at the Packaging Summit was around how to manage namespace assignments. The process used by NuGet for namespace reservations was mentioned as a potential model to follow.

Key point of this is that (particularly initially) there would be some kind of gating process for creating namespaces, and even in the end-state there would be checks to see if namespaces have a close resemblance to other existing namespaces (as one of the points of this is to help protect against typosquatting attacks).

I’ve created a draft PEP for this idea. I’m open to any feedback folks may have.

Capturing some further conversation on this topic with Pradyun and Nic: sounds like their general preference would be to use - as the namespace delimiter, and to support multi-level namespaces so that different rules could be applied for something like the django- namespace vice django-contrib namespace. (Meaning that namespace owners could allow other community members to submit packages in the django-contrib namespace while maintaining tighter control over the higher-level django- namespace.)

It’s worth noting that the approach they have in mind would have the effect of making the existence of a namespace potentially less obvious to users (unless we require all packages containing dashes to use namespaces - which feels like a breaking change to a lot of folks). This is good in that there’s less for users to understand, but bad in that it’s potentially harder for users to intuit how a given namespace is being managed.

My personal bias (coming from a “high compliance” environment) would be to implement mechanisms that makes this linkage more obvious and intuitive. IMO, users should be able to recognize at a glance if they’re installing a package from a namespace vice from the global namespace.

I personally dislike using “-” as the namespace delimiter, since it is already used in non-namespaced packages.

So parsing the namespace from the name would be ambiguous, and in some cases it would be difficult to avoid treating part of the package name as the namespace name.
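To make the ambiguity concrete, here's a minimal sketch (a hypothetical helper of my own invention, not part of any installer) of what naive dash-splitting would do to existing names:

```python
def split_candidate(name):
    """Split a package name at the first dash, treating the left part
    as a possible namespace. Returns None if there is no dash."""
    head, sep, tail = name.partition("-")
    return (head, tail) if sep else None

# "scikit-learn" predates any namespace scheme, yet it would parse
# as namespace "scikit" containing package "learn"...
print(split_candidate("scikit-learn"))    # ('scikit', 'learn')
# ...and a genuinely namespaced name would parse identically, so an
# installer cannot tell the two cases apart from the name alone.
print(split_candidate("django-contrib"))  # ('django', 'contrib')
```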

I don’t see how namespaces would help prevent typosquatting or malicious attacks. Those are possible using package names in namespace categories as well.

By having namespace “owners” you create a situation where a single person or group of people would be able to gate access to packages in that namespace, which can easily lead to conflicts. E.g. let’s say one of the owners has a package in the namespace and the author of a competing package wants access as well.

Zope and Plone used such an approach, but without ownership. It created a bit more order in the index, but not much.

I think you’d get better security by having curated lists of packages for which the ownership has been verified. That would easily allow identifying typosquatting packages.

My thinking is the namespace would become part of the “public” interface used when installing packages (much like is done in npm and go ecosystems). I view the ownership situation as a feature/benefit of enabling clear line-of-sight to the package maintainers handling multiple projects under a single umbrella, consistent with how github/gitlab repos are organized.

In looking at the top 5000 package list there were clearly a number of groupings where different entities were managing large swaths of popular packages - but there was also considerable confusion around which packages had some support structure backing them vice being a hobbyist project that had long since gone dormant. (The aws project sticks in my head as one example that could trip up a novice pythonista)

Part of the discussion today was also around “community” namespaces (e.g. -contrib), where related packages could be “autoapproved” based on pypi config for that namespace. I’m not opposed to that, but would want to make sure that we implement mechanisms so that package consumers have an easy way to determine when those “relaxed” policies are in effect for a given namespace.

Can you provide an example of what such a namespace might be, and how people might use it? There are very different concepts which could be called namespaces - broad categories like Text:: for Perl, channels organised by owner like conda-forge, clusters of packages that are maintained together like Jupyter, etc.

You describe it as an ‘opt-in’ model, but who decides to opt in? If I release a package into a namespace, does that force all my users to understand this? If I ignore namespaces, can users opt-in to using them? Would I publish packages solely into a namespace, or would I publish a package in the global namespace and register it in one or more extra namespaces? Can someone other than the package author add a package to a namespace?

I don’t really understand what this means. How is it different from the current practice of using dotted names like (say) zope.interface? It’s hard to comment on a proposal that leaves out so much of the detail.

It would be better if you could describe the features you’re referring to inline, as not all of us are familiar with other ecosystems. I have no idea how go, npm or nuget handle package management, so the above aren’t particularly illuminating :frowning:

So we’d do pip install zope-interface and then what? import zope-interface isn’t valid. While it’s true that import name and package name don’t have to be the same, there’s a pretty strong tradition that they are, and breaking that raises more questions than it answers - it would be bad if, by introducing namespaces on PyPI, we increased the risk of name clashes in import statements.

I think this proposal needs quite a lot more detail before it’s something that can really go anywhere. I’m not against the idea, although I’d strongly prefer that it gets used sparingly. I have little time for Java-like import company.topic.package names, or for requiring users to over-use import aliases like import company.topic.package as pkg (which just dumps the problem of choosing a “friendly name” onto the end user). There may well be something I’m in favour of here (“Namespaces are one honking great idea” after all :slight_smile:) but it’s hard to be sure at this level of detail.

Here are some examples from other languages/package repositories:
npm uses a @<namespace>/<package-name> syntax. Angular is one example of a very high-visibility project using namespaces.

NuGet uses a <namespace>.<package-name> syntax. MySQL is one example of a high-visibility project using this syntax.

Go is an interesting case, as it uses references to the VCS as its method for namespacing. So a go reference would be something like import "github.com/spf13/cobra". (With that said, my personal bias would be to carry the namespace syntax only as far as setup.py / requirements.txt / Pipfile / Pipfile.lock / pyproject.toml, and not require individual files to modify their import statements.)

Julia follows a similar approach to go.

I’ll flesh this out a bit and add some examples from other communities to the draft PEP I’ve started. That will be a bit later this week, as now I need to focus on digging out at work (and home). Thanks all for the comments…please keep them coming.

Thanks for the extra references. One point you made particularly made me think - you said that you see this as being helpful in “high compliance” environments. I’m not 100% sure what you mean by that, but my experience of such environments is that they have a tendency to put purity ahead of practicality (to state it as kindly as I can :roll_eyes:) which concerns me.

What that means in practice is that I’d expect a certain amount of pressure for high-visibility Python projects to sit within some sort of “well respected” namespaces, in order to allow those high-compliance environments to simply approve the namespace and be done with the accreditation process. The consequences of that are not good for users or projects - the projects get pressure to change things, and if they do the users have to change their code. Imagine, for example, the impact on end users if pip and setuptools were moved to reside under a “pypa” namespace…

You say that the proposal would be “opt in”, and would have minimal effect on people who don’t want to use it. The same was said of typing in Python, and yet I’m seeing more projects these days where type annotations are being added and type checks run in CI. That means if I want to contribute to those projects, I’m going to be under pressure to add annotations to the code I submit, fix typing-related errors (some of which may be false positives - I’ve seen a fair number of “fix the annotations” changes, which say to me that the code is fine but there’s busywork in getting the annotations right), etc. I don’t want to dump on typing here, but I do think the parallel is worth considering - “it’s opt in, it will only affect people who choose to use it” is never quite as simple as the people pushing a new feature would want us to believe.

Anyway, people will be recovering from PyCon for a while, so I expect things will be quiet around here for a few days. I’ll hold off on any more comments until the discussion has had a bit more time to get going.

Short thought:

I think that we don’t need to do anything but change PyPI here and make it possible for someone to claim a prefix. As far as pip, setuptools, etc. are concerned, nothing has changed. All we should change, IMO, is that on PyPI we can claim entire prefixes.

A concrete example is a project I’ve recently worked on for AWS, where we have around a hundred packages named something like aws-cdk.something, and it would be awesome if we could claim anything that starts with aws-cdk.* on PyPI and prevent other users from using that namespace.

I do not think that pip or other projects would have to care at all about this. It is possible that some companies may whitelist entire namespaces, but I suspect that they would never require it. It would just be an optimization, plus allow some level of control so people can say “Oh, I know everything under this prefix on PyPI is from the same org, I don’t have to vet them individually”.

Seems reasonable, although I’d like it if there were some sort of rule that “you must have a certain number of packages with that prefix already uploaded to claim the prefix”, to deter pre-emptive name squatting. It’s hard enough to find good names that are unused on PyPI already.

Right, so I’ve mentioned before (though not on this thread) that I really like how NuGet has handled this, and basically it works like this:

  • If you register a namespace, any packages you own that use that namespace get a little icon that says “hey, this is part of this namespace”.
  • Existing packages that use a newly registered namespace can continue to exist, but do not get that namespace icon.
  • New packages that use that namespace are rejected unless the person trying to create them is a member of that namespace.
  • Namespaces can be flagged as “public”, which basically means the third item in this list doesn’t apply: anyone can continue to upload packages into that namespace, but only the ones by the “official” namespace owners get the little icon.
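As a rough sketch of how the first three rules could be enforced at upload time (the data model, prefix strings, and function names here are my own invention, not Warehouse's or NuGet's actual code):

```python
# Hypothetical reservation registry: prefix -> owner and public/closed flag.
RESERVED = {
    "aws-cdk-": {"owner": "aws", "public": False},           # closed namespace
    "django-contrib-": {"owner": "django", "public": True},  # open to anyone
}

def may_upload(package_name, uploader):
    """Allow a new upload unless it lands in a closed reserved prefix
    that the uploader does not own. Names outside any reserved prefix
    are unaffected, matching the opt-in model."""
    for prefix, meta in RESERVED.items():
        if package_name.startswith(prefix):
            return meta["public"] or uploader == meta["owner"]
    return True

print(may_upload("aws-cdk-s3", "mallory"))        # False: closed prefix, wrong owner
print(may_upload("aws-cdk-s3", "aws"))            # True: prefix owner
print(may_upload("django-contrib-foo", "anyone")) # True: public prefix
print(may_upload("requests2", "anyone"))          # True: no reserved prefix
```

The second rule (grandfathering existing packages, minus the icon) would be handled separately at display time, not at upload time.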

For NuGet, getting a namespace requires applying for it and getting it approved; they have some rough guidelines for the criteria they use:

  1. Does the package ID prefix properly and clearly identify the package owner?
  2. Are a significant number of the packages under the package ID prefix already submitted by the owner?
  3. Is the package ID prefix something common that should not belong to any individual owner or organization?
  4. Would not reserving the package ID prefix cause ambiguity and confusion for the community?
  5. Are the identifying properties of the packages that match the package ID prefix clear and consistent (especially the package author)?

They have one more thing about licenses, but that doesn’t really apply to us.

So, given that it requires an application plus approval, and that the criteria roughly boil down to “is your use of this prefix notable enough for us to block all future people from using it”, it’s a lot harder for there to be a land-grab situation.

We could also expose this information in the APIs, for projects that want to take advantage of it (hypothetically, you could imagine pip search telling you if something is part of a reserved namespace or not), but accepting that information should be completely optional.

For more information on how NuGet works, you can read https://docs.microsoft.com/en-us/nuget/reference/id-prefix-reservation and personally I’d just wholesale steal borrow text from that, adjusting where it makes sense.

What I am missing is why we need namespaces when all we really want is an easier way to say “I know these developers and trust their packages”.

If pip/setuptools were to allow for a list of trusted authors, I think it’d be more helpful for the use case of preventing typosquatting, or for trust inference.

Namespaces or prefixes are a form of regulation, with all the strings attached, including the need for a workforce to take care of reviews, appeals, revocation, etc.

To my mind the primary use case isn’t the compliance one, though it can be used for that. The primary use case is really about making it easier for projects that have a large number of packages to communicate with users about which packages are theirs or not.

I gave the example earlier of aws_cdk.*, where AWS has about 100 packages published under that namespace, roughly one per service. Right now any random person can grab a new aws_cdk.* package and pretend to be AWS. You can, with a little bit of digging, figure out it’s not AWS by looking at the users associated with the package, but that introduces a risk of confusion for users: it’s pretty easy to miss that one of the dozen such packages they might be using is a little bit different from the rest.

This isn’t really just an AWS problem either; Azure also ships a large number of packages that follow a common pattern like this, and I’m not sure whether Google does or not.

Roughly, when projects ship a large number of related projects, there’s currently no way to strongly link those projects together so that end users can easily differentiate between projects that are part of that “set” of packages, versus projects that are not. There is maybe another way of solving it, but namespaces do solve it pretty cleanly and I think solve it in the most unambiguous way for users such that they are least likely to inadvertently fall into a footgun trap.

For cases like this, it would probably be worth the PEP discussing how the transition would work - particularly if people are using foo.bar as package names right now, but the namespace version would be foo-bar (or vice versa, or whatever). I don’t think that keeping the old names and publishing new versions under the new names is a very good idea - it makes the name clutter on PyPI even worse, and adds to user confusion (should I use aws_cdk.foo or aws_cdk-foo?)

The transition discussion should also address where there’s a clash right now - with aws_cdk.official owned by AWS, and aws_cdk.impostor that’s not owned by AWS, how would transitioning to an AWS owned namespace work? Obviously the name of the non-owned package wouldn’t change. So what would change? And how would users perceive that change?

For this I think the answer is that it wouldn’t change. Packaging considers -, _, and . to be exactly the same character, so I don’t think namespace support would be any different.
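That equivalence is codified in PEP 503, which defines how the simple repository API normalizes project names:

```python
import re

def normalize(name):
    """PEP 503 name normalization: runs of '-', '_', and '.'
    collapse to a single '-', and the name is lowercased."""
    return re.sub(r"[-_.]+", "-", name).lower()

# All of these refer to the same project as far as PyPI is concerned:
print(normalize("aws_cdk.foo"))  # aws-cdk-foo
print(normalize("aws-cdk-foo"))  # aws-cdk-foo
print(normalize("AWS.CDK.Foo"))  # aws-cdk-foo
```

So a prefix reservation would presumably operate on the normalized form, making the aws_cdk.foo vs aws_cdk-foo question moot.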

Agreed. The NuGet solution to this is that “official” packages get some visual indicator and non-official ones do not. In the case of existing packages there isn’t really a perfect solution, but the general suggestion would be for orgs that want a reserved namespace to select one that isn’t already in use by anyone, even if we technically allow it.

Yep, ultimately I just think it needs to be spelled out in the PEP. The actual approach taken won’t make much difference, as long as it’s all clearly stated. (Specifically, so that other tools like devpi which want to reflect PyPI’s model can do so).

In terms of transition strategy for how this would be implemented, the best parallels I’ve found for existing flat namespaces that adopted some sort of namespace strategy are NuGet (.NET packages) and npm (nodejs packages). The two communities took very different approaches. I’ll attempt to summarize them here.

NuGet’s adoption of namespaces was inspired by concerns about the integrity and trustworthiness of NuGet packages - trying to make it easier for package consumers to determine who had produced a given package and gauge whether the provider is trustworthy. They took an “in place transition” approach using a dot-delimited syntax, where all existing packages were grandfathered, even if they happened to sit smack in the middle of a namespace claimed by some group. (It reminds me of Chinese nail houses.) They did indicate that those grandfathered packages would be delisted if they exhibited malicious behavior that took advantage of their status within that namespace, and as @dstufft mentioned, they also had visual indicators showing the verified identity of the package provider, distinct from the package namespace.

NPM took a different approach, and described their motivation as being largely driven by having over 140,000 packages in their global namespace and devs having difficulty coming up with original yet meaningful package names. NPM introduced a completely new syntax for what they called “scoped modules”, using a @<namespace>/<packagename> syntax. This was a much more invasive change, as it required changes to both their package repository and the package installers. Each npm user received a namespace based on their username, and paying organizational customers could create an org namespace (for those that aren’t aware, npmjs.org has paid tiers of service). This syntax also enabled “private” packages, which reside within a reserved namespace but aren’t shared publicly.

Since NPM introduced scoped packages, they’ve been widely embraced by the NodeJS community, but the global namespace is still used as well. NPM also implemented a “deprecated” flag on their package repo to give package maintainers moving to scoped namespaces a way to signal to consumers what had changed.

To me, NuGet’s approach is simpler (and we all know that “simple is better than complex”), but NPM’s approach is more explicit (and we all know that “explicit is better than implicit”). The bottom line for me (as I think back on Russell’s keynote last week) is which approach best positions Python for another 25 years of growth, innovation, and collaboration?