Dependency notation including the index URL

This seems like a very good option, and really shouldn’t require anything more than PEP 503 allowing a character (: works for me, and so would /) in package names, while PyPI disallows it. Much like how platform tags are treated. I don’t think there’s an issue with multiple prefixes per (private) index, nor collisions between prefixed packages.

Npm and Nuget have similar approaches for solving this issue.

Npm’s scopes basically match this proposal, with the additional feature that the package manager will let you specify at most one registry for each scope (with fallback to the default), so that if you use @mycompany/... you can restrict all consideration of these packages to only your private index. This also avoids sending web traffic to public servers, which may give away the names of private packages.
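For reference, npm’s scope-to-registry mapping is just a few lines of per-project configuration (the registry URL and token variable here are placeholders):

```ini
; .npmrc — route all @mycompany/* packages to the private registry;
; unscoped packages still come from the default public registry.
@mycompany:registry=https://npm.mycompany.example/
; credentials for that registry only (token value is a placeholder)
//npm.mycompany.example/:_authToken=${NPM_TOKEN}
```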

Nuget’s ID prefixes are reserved server-side, and only certain accounts are allowed to publish to them. This works great when coupled with code signing, such that only packages signed with a previously-submitted key are allowed to be published under that namespace. It also guarantees that private packages with that prefix cannot be published by someone who is not part of your organisation.

If our Python tools could support : in package names while PyPI blocks it, we could match npm very easily. That also leaves open the possibility of a Nuget-style server side permission for certain prefixes later on (maybe lease them out for extra revenue?), and I think it’s feasible to eventually support signature/publisher validation to those namespaces to offer some end-to-end confidence for users.

There’s no reason why the module names have to match the package name, and I think for this kind of namespacing they probably should not. There’s no need to enforce anything here, but I think it’s fine to have mycompany-requests lead to import requests. We already see plenty of (usually) non-malicious examples of this on PyPI.

However, where it could become confusing is if the “company” prefix becomes mixed with a legitimate namespace package. For example, we use azure as a traditional namespace package, but not all azure-* packages on PyPI contribute to it (I wish they did, but I don’t quite have the influence to win over every team, and certainly not the external people who publish with that in the name). Having azure::name for internal stuff and azure-name for public stuff seems like a good balance to me.

Does pip not have to work with this? :wink:

More seriously, I want to point that out since pip’s handling of requirement names would likely need a decent chunk of work to allow for this.

To stop normalising : to _, yeah. Is that really a big job? You’ve only got one normalisation function in there, right?

[Edit] But yes, pip, twine, etc. etc. have to work with it. I stopped listing tools off the top of my head when I remembered that we have a spec for it.
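For context, the PEP 503 normalisation is indeed a one-liner; a namespace-aware variant is easy to sketch (the `::` handling below is purely hypothetical, not part of any current spec):

```python
import re

def normalize(name: str) -> str:
    # PEP 503: runs of -, _, . collapse to a single -, then lowercase
    return re.sub(r"[-_.]+", "-", name).lower()

def normalize_namespaced(name: str) -> str:
    # Hypothetical extension: normalize each side of "ns::name" separately,
    # leaving the "::" separator untouched
    return "::".join(normalize(part) for part in name.split("::"))

print(normalize("Friendly.Bard"))                        # friendly-bard
print(normalize_namespaced("MyCompany::Requests_Fork"))  # mycompany::requests-fork
```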

An additional thought here (which I think Paul Ganssle had first and I just didn’t get it): private index software could also have a configured prefix that they assume on any un-prefixed packages. So packages could be published to that index without the prefix in their actual name, but would be treated by the indexes as if they had it.

For example, I could set up steves-feed: as a prefix for my private index and then push requests to it without modification. In my requirements file, I specify steves-feed:requests which is not found on PyPI, but is found on my private index. Not having to force users to modify and rebuild packages to use a private copy makes this vastly more usable.
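Under this hypothetical scheme, a requirements file might look something like the following (the prefix syntax and index URL are illustrative only):

```text
# requirements.txt (hypothetical syntax)
--extra-index-url https://feed.steves.example/simple/
steves-feed:requests        # only satisfiable by the private index
some-public-app             # resolved from PyPI as usual
```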

This is essentially the infrastructure-based equivalent of a lockfile. For those of us working in larger organisations, doing it in infrastructure is preferable over requiring each individual engineer to understand and apply the same concepts individually.

Am I understanding it right that what @pganssle and @steve.dower are arguing for here would circumvent the issue: 2 projects with the same name on 2 different indexes are considered as 1 project from the point of view of pip (dependency resolution, download cache, etc.)? In other words alpha::lib, bravo::lib, and lib would be considered by pip as 3 different projects. Is that right? That is interesting, that looks like it could solve some of the issues. I need to think about it.


I feel like it would miss some of the use cases, though. Well I did not make a clear list of use cases, so of course… :smiley:

Would that help in the following scenario?

  • I want App from PyPI
  • I know that App depends on Lib==2.*
  • I do not want Lib from PyPI
  • I want Lib from my own private index
  • I have Lib==2.0 on my private index
  • But there is Lib==2.1 on PyPI
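The scenario above can be sketched as a toy model of pip’s candidate selection: because candidates from every configured index are pooled under one project name, the public 2.1 wins (names, versions, and URLs are illustrative):

```python
# Toy model of pip's candidate pooling: candidates from every configured
# index are merged under the project name, and the highest satisfying
# version wins — so the private 2.0 can never beat the public 2.1.
candidates = [
    {"name": "lib", "version": (2, 0), "index": "https://pep503.private.dev/simple/"},
    {"name": "lib", "version": (2, 1), "index": "https://pypi.org/simple/"},
]

def pick(name, pool):
    matching = [c for c in pool if c["name"] == name]
    return max(matching, key=lambda c: c["version"])

chosen = pick("lib", candidates)
print(chosen["version"], chosen["index"])  # (2, 1) https://pypi.org/simple/
```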

I feel like the UX should look something like:

pythonX.Y -m pip install App --constraint constraints.txt

where constraints.txt is something like:

--index-url https://pep503.private.dev/simple/ Lib

But this does not work currently, as we already said. And the currently recommended solution is to deploy something like devpi or pydist.

So I guess the question that is at the top of my mind right now is: what would this namespaces solution (alpha::lib) help solve that can not be solved with devpi, or pydist, etc.?


(Thanks all for the feedback. It is greatly appreciated.)


That use case isn’t valid, because “I want Lib from my own private index, but not from PyPI” is in contradiction to the idea that “where Lib comes from” isn’t a distinguishing feature of a package in Python packaging.

If I reinterpret your requirement in a way that fits the model better, I suspect you mean

“I want something that allows me to do import lib and is declared as being version 2.*. I want to describe what I install in such a way that it doesn’t clash with the version 2.1 thing on PyPI called Lib”.

Put that way, it’s fairly clear that you want a different package name that provides the same import name. But there’s also an (implied) constraint that naming your package MyLib isn’t sufficient, because you can’t stop someone else uploading a package called MyLib to PyPI unless you claim that name on PyPI.

Is that a reasonable re-statement of what you want?

As far as I can see, two things:

  1. People don’t want to install devpi. That’s a fair requirement: installing a private index server isn’t a trivial exercise.
  2. Politically, saying that people shouldn’t use PyPI directly if they want to avoid name conflict exploits isn’t a message we want to promote (it’s far too close to “don’t trust PyPI” for people to understand the nuances).

I like this idea, though it’s entirely pip specific. It could also be used to handle things like platform-specific builds (e.g. maybe you need to use your own build of a particular package rather than the more generic manylinux builds published to PyPI). Deploying your own hosted service to deal with it certainly feels like overkill.

I don’t think we can sustain this point of view, in light of the complexities of native code compilation, platform-specific expectations, and potential malicious actors. Telling people that “wanting your forked copies of packages to be distinguishable from someone else’s is just wrong” will just push entire groups of users away from our ecosystem.

For 2, we need to provide reasonable workarounds for name conflicts if we don’t want that message going out like that. If we are able to say “use this feature to avoid name conflict issues” then it can sound positive - if we have nothing, then yeah, it’ll sound like “name conflict issues are unavoidable and you should avoid referencing PyPI”.


But since they do have a private package that they want to refer to by version specifier (instead of direct URL), they should already have some kind of architecture to host that package. What is that, if not a private index server? The only other viable choice I can see is --find-links, and that’s already only a few hundred lines of Python away from a legitimate index that can solve the use case.


It solves organisational problems, rather than technical ones. Trying to convince 1000+ engineers to follow the same set of guidelines is difficult, and teaching them all of the nuances to worry about near impossible.

With namespaces embedded in the package name, it’s easy to tell everyone “you must use <our prefix>: at the start of your package name”. It’s almost as easy to tell everyone “set --index-url to <our feed>”.

It’s very complicated to communicate to a large, distributed group of people that they should avoid creating name conflicts, keep track of new conflicts as they arise, correctly configure install commands to prefer internal packages, diagnose issues caused by misconfigurations that pulled down the wrong package, etc.

I personally get called in to help diagnose pip install failures due to network flakiness, which you’d think wouldn’t take a specialist to figure out (it’s right there in the log output :slight_smile:), but many smart engineers treat this stuff as a big black box. An install failure is outside of their area, and so they don’t care to dig any deeper.

The simpler the rules to follow, the more likely they are actually followed, and “use a prefix” is much simpler than what we have to use today (which includes “avoid PyPI when possible”…)

So, there are currently two namespaces. Please correct this terminology early:

  1. Index server package URLs/URNs (wherein prefixless names default to any of the specified index servers as a prefix, and the index server enforces uniqueness)

  2. The module namespace (where nothing at all enforces uniqueness or detects collisions). This is how pdbpp so easily shadows the stdlib pdb, for example.

It is very confusing to discuss the separate module namespace and package index namespaces.

  1. Caching proxies don’t care that the DNS is wrong, or that the package index is being MITM’d or DoS’d so pip chose the next index server on the list: application-unaware caching proxies will gladly overwrite a signed package with an unsigned package of the same name from a different backend server.

We should ask the TUF team how intentional name collisions, index priorities, and/or index server package groups can work with package signing.

  1. Should PyPI detect colliding namespaces within uploaded packages?

  2. Should PyPI support groups, or can you just create a package which installs those specific dependencies; a setup.py that calls pip install -r requirements.lock.signed.txt? Why would that be suboptimal?

  3. Could requirements.txt simply tag succeeding entries with the most recently specified (or default) -i/--index-url?

  4. Could there be a new option to specify the default and alternate package URI CURIEs (similar to JSON-LD @context) in a per-package way?

  5. It’s trivial to specify a repo priority with e.g. apt and yum, so we should expect users to ask for index server priorities; which requires new arg/conf parsing and a change to how the solver is called, AFAIU.

Agreed. This thread so far has been all about the package index namespace.

What happens after packages are installed is nicely flexible, and it’s one of Python’s more powerful features/conveniences. Nobody here is proposing any changes to that.


To make it a bit more concrete, let’s say that we know that App depends on Lib, but the Lib that is on PyPI does not work in our environment (for whatever reason), so we have our own private fork of Lib that we distribute privately only, on our own internal index. Basically it is the same Lib, but with additional patches (that we can not upstream for whatever reason).

[Seems like a credible scenario to me. Not a real use case for me personally. So far I have never really needed any of the things we are discussing here, but I see the question popping up often enough that I am looking for concrete background info to base my answers on, when I try to help people.]

Yes, seems like it would be one of those cases, where we want to make sure Lib is installed from our own private index.

In that scenario I do not think it is possible. If we use the name MyLib then we are cutting ourselves off the dependency resolution. App requires Lib. That is why I would say keeping the same name is important. I do not know how to do this otherwise. I am also not clear at all if / how mycompany:Lib would help in that case.

Yes, that would be my point of view as well. Although I reckon there is probably a wide array of PEP 503 implementations out there. And I can imagine that maybe the implementations that have the features to enforce installing Lib from the private index instead of from PyPI are not the ones people reach for first, because they are less easy to install and manage.


I see 2 use cases at play here (so far):

  • Coincidental name clash
    There are 2 distinct projects.
    We have a private project Lib that we publish only on our private index, then one day comes where there is also a public project Lib on PyPI, and suddenly nothing works anymore.
    To me this seems like it is where the introduction of namespaces such as acme:Lib vs. Lib could most likely help.

  • Intentional name clash
    There is only 1 project.
    We have our own private fork of the already existing public Lib.
    To me this seems like local versions such as Lib-2.0+acme vs. Lib-2.1 could maybe help. I do not know enough about that topic to judge. Probably missing would be a way to instruct pip to prefer Lib-2.0+acme over Lib-2.1, even if the version number is lower.
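For what it’s worth, PEP 440 ordering already places 2.0+acme above plain 2.0 but below 2.1, so a lower-versioned private build loses by default. A much-simplified comparison (ignoring most of PEP 440) illustrates this:

```python
def parse(version: str):
    # Much-simplified PEP 440 parsing: release tuple plus local segment.
    # A local segment ("+acme") sorts above the same release without one,
    # but the release tuple still dominates the comparison.
    public, _, local = version.partition("+")
    release = tuple(int(part) for part in public.split("."))
    return (release, (1, local) if local else (0, ""))

versions = ["2.0", "2.0+acme", "2.1"]
print(max(versions, key=parse))  # 2.1 — the private build still loses by default
```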

In both cases I definitely see how well configured private indexes can help.


OK. I’m not trying to defend this point of view, I’m just making the statement that it’s fundamental to a lot of code. If we want to change it, because that helps this issue, I’m fine with that. But someone needs to do the work.

For example, if pip’s cache serves up a copy of foo-1.0-py3-none-any.whl from PyPI when the user asked for foo 1.0 but specified the private index, that would be a security bug, correct? I don’t want to see pip suddenly subject to a raft of alerts like this which we don’t have the resources to deal with, so we should look at how to pre-emptively locate and fix these assumptions if we go down that route.
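One way to pre-empt that class of bug would be to key the wheel cache on origin as well as filename. This is a hypothetical sketch, not pip’s actual cache layout:

```python
# Hypothetical origin-aware cache: the same wheel filename fetched from two
# different indexes gets two distinct entries, so a private index's
# foo-1.0-py3-none-any.whl is never served to satisfy a PyPI request.
cache = {}

def cache_key(index_url: str, filename: str) -> tuple:
    return (index_url.rstrip("/"), filename)

cache[cache_key("https://pypi.org/simple/", "foo-1.0-py3-none-any.whl")] = b"public build"
cache[cache_key("https://private.example/simple/", "foo-1.0-py3-none-any.whl")] = b"private build"

print(len(cache))  # 2 entries — no cross-index collision
```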

Maybe it’s something we could/should look for funding on? After all, it’s helpful to a lot of business users, so we should be able to get some takers :slight_smile:

#!/bin/sh

# How do I detect that "myindex.org/group/requests" is installed but "requests" is not?

# Requirement as specified: "requests"
# Package URL inferred from implicit index URL: True
# Package URL: https://pypi.org/pypi/requests
# Package Download URL: https://files.pythonhosted.org/packages/45/1e/0c169c6a5381e241ba7404532c16a21d86ab872c9bed8bdcd4c423954103/requests-2.24.0-py2.py3-none-any.whl
# Package Download URL: https://files.pythonhosted.org/packages/da/67/672b422d9daf07365259958912ba533a0ecab839d4084c487a5fe9a5405f/requests-2.24.0.tar.gz

# Module names:

# export PYPATH=$(python -c 'import sys; print(":".join(sys.path))')
export _PYSITE=$(python -c 'import site; print(site.getusersitepackages())')
python -c 'import os, pprint, sys; pprint.pprint({x: sorted(os.listdir(x)) for x in sys.path if "site-packages" in x})'
python -c 'import os, pprint; x=os.environ["_PYSITE"]; pprint.pprint({x: sorted(os.listdir(x))})'
# find . -name '*.pth' -print0 | xargs -0 cat

# $ type cdpysite
# cdpysite is a function
cdpysite ()
{
    [ -z "$_PYSITE" ] && echo "_PYSITE is not set" && return 1
    cd "$_PYSITE"${@:+"/${@}"}
}

If a dependency specifies “requests”,
how and where is a relation declared such that we can determine that “index.org/group/requests” satisfies a request for requests?

How do we specify that we trust any of the index.org/group/ release-signing keys to satisfy this relation?

cat requirements.txt
cat requirements-other.lock.txt
cat Pipfile.lock

… How do I detect that “myindex.org/group/requests” is installed but “requests” is not?
Where should those extra relations be declaratively specified?
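As far as detecting what is installed goes: the installed metadata records only the project name, so nothing currently ties a distribution back to the index it came from. A quick look with importlib.metadata (Python 3.8+) shows what is available today:

```python
# List installed distribution names via importlib.metadata (Python 3.8+).
# The installed metadata records only the project name and version —
# nothing records which index a distribution came from, which is exactly
# the missing relation the questions above are pointing at.
from importlib.metadata import distributions

installed = {dist.metadata["Name"] for dist in distributions() if dist.metadata["Name"]}
print(len(installed), "distributions found")
```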

If it can be spec’d out and not held up on “we plan to refactor this one day when we have the time so no changes for now”, I can probably get contributors. But there’s a poor history of contributions towards most of the projects involved, so it’ll have to be a very specific “pull” from the projects, because I don’t think I can convince anyone to “push” a contribution at this point.

Yes, that looks like a very critical point to the discussion. At the very beginning of the discussion, I was wondering why the push back from you. Until you made it clear you had pip’s own on-disk cache in mind, and that started to make sense to me.

But isn’t it already an issue right now?

Let’s say yesterday I ran pip install Lib. And today I run pip install --index-url https://pep503.private.dev Lib. Am I not at risk of pip telling me: “I already have Lib in cache (or maybe I built a wheel for it yesterday), so I am not gonna download it again”? Or does pip already have checks in place for that sort of thing?

In the broadest form of my proposal, it’s up to pip to decide how to treat alpha::lib vs bravo::lib vs lib. I imagine that alpha::lib or lib should not be able to satisfy a dependency on bravo::lib, and alpha::lib should not use a cached wheel for bravo::lib. But I would expect pip download bravo::lib to download something called lib-0.0.1.whl or something, so it might be that the work on the pip side is “don’t normalize away ::” + “be aware that the file name won’t include the namespace”. Maybe that adds more complexity than I had originally hoped.

No, my idea would not help in this situation. You’d need App to depend on bravo::lib. The idea for these namespaces is that they should relate to names of packages in a private index. I think that your problem is more of a deployment problem, and you should use a caching server like devpi: if you need a patched version of Lib, you put it there.

The thing I’m trying to solve with the namespace proposal is for when you want to use the tools developed around PyPI for your own private packages, but you don’t want to publish to PyPI and you are worried about naming conflicts with upstream packages. The idea is that the packages don’t have to know about each other as long as you can specify, “I want package X, but this is a private package so if you find one on PyPI don’t send it to me!” It’s on you to configure everything else.

So, to be clear about how I saw this working, I think your “additional thought” was the only mode of operation I was considering. The idea would be:

  • Packager: Develops private package privlib and uploads it to the local devpi / artifactory. No changes to workflow at all.
  • Twine: No changes. You still do twine upload dist/*, with twine configured to upload to the local artifactory or whatever.
  • Dependent: Depends on mycorp::privlib, indicating that they only want privlib if it is supplied by an index that identifies itself as supplying the namespace mycorp.
  • pip: Allows dependencies on anything::anything-else. How it handles the cache is up to pip, but I think in the normal mode of operation it doesn’t matter.
  • PyPI: Presumably no change. If mycorp::privlib is requested, it will not be supplied, because mycorp::privlib is not a valid name for a package.
  • Local package index: When supplying packages, this is configured such that it has one or more namespaces that it considers valid. If you configure devpi or something to know that it serves packages as mycorp::, it will strip that off and serve whatever’s on the right side of it.
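The index-side behaviour described above amounts to a small lookup. A hypothetical sketch, following the strict variant where un-prefixed requests are refused:

```python
def resolve_project(requested: str, served_namespace: str):
    # Hypothetical index-side lookup: a request for "mycorp::privlib" on an
    # index configured with namespace "mycorp" resolves to the stored name
    # "privlib". Anything else — including an un-prefixed "privlib" in the
    # strict variant — is refused (None, i.e. a 404).
    prefix = served_namespace + "::"
    if requested.startswith(prefix):
        return requested[len(prefix):]
    return None

print(resolve_project("mycorp::privlib", "mycorp"))     # privlib
print(resolve_project("othercorp::privlib", "mycorp"))  # None
print(resolve_project("privlib", "mycorp"))             # None
```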

That said, the more I think about it, the more I actually like the idea of the local package index keeping a clean separation between indexed and non-indexed packages (so that mycorp::privlib and privlib don’t both work if you have your index server configured correctly — that will lead people to get sloppy and leave off the namespace).

I think we could implement that entirely on the local package index side by having the local package index have a special upload target that puts things in the specified namespace. So instead of uploading to local-pypi.mycorp.com/upload you’d upload to local-pypi.mycorp.com/mycorp/upload and it would go in the mycorp namespace specifically (this is basically equivalent to configuring your index server as two nested index servers, one of which has more packages and only accepts things in the mycorp:: namespace and one of which is a simple mirror).

We could potentially implement this in twine, but I think that would require a lot more changes everywhere — the packagers would need to be able to indicate what index namespace to use for that package, we’d need to include that in the metadata somewhere, and twine would have to be able to parse it, since (at least by default) twine gets the name of the package from the file name, and the file name wouldn’t include the namespace.


I’d have to check the pip code to be sure, but yes, that’s precisely my point. Pip treats anything that says it’s “Lib 1.0” as equivalent. If that’s a problem for your workflow, then it’s likely to be a problem in lots of different ways, and it’s not new.


It’s not just about contributors, it’s about the whole process. We’re talking about something that’s at least comparable in size and complexity with the new resolver work, and that is something we’d been trying to do for literally years without getting anywhere until we got funding for a project manager, and a full project structure with dedicated staff working on it. And a lot of effort went into handling publicity, user education, and preparing for post-implementation support.

If all you can provide is people willing to write PRs and leave, then we still have all of those other areas missing, and PRs will languish. If we can get a good shared agreement on what should be done, and a funded project with clear goals and dedicated resources to achieve it, then that’s a very different proposal.

But at the moment, we don’t even have a consensus among the pip developers (let alone the “whole packaging community”) that the behaviour of assuming that all artifacts with the same name and version are equivalent, is a problem that needs fixing. So talk of people writing PRs is very premature.


I would argue in this case they should either migrate to another tool, or provide enhancements to the tools they use. Making this a feature request in Python packaging feels like a shift of responsibility to me. It is easier if Python can offer a direct solution without them complicating the setup, because they don’t need to deal with complexities in implementing and maintaining Python packaging tools :slightly_smiling_face:
