Enable hash-checking mode by freezing my setup.py dependencies to a requirements.txt (possibly using pip-tools cc @matthewfeickert) so that I can force pip to use the version of B that I decide, by specifying the hash. If I understand correctly, this would make my installation procedure immune to PyPI squatting of internal dependencies. I would not immediately detect incompatibilities with new versions of my dependencies though (but perhaps it’s a good thing that my pipelines won’t break inadvertently!)
Provision a dedicated machine with a public IP, install devpi-server on it, and use it from CI as an --index-url.
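To illustrate how CI would consume such an instance, here is a hypothetical CI step; the server URL is an assumption, not a real deployment (devpi’s root/pypi index mirrors and caches PyPI, so public packages still resolve through it):

```yaml
# Hypothetical CI step: install strictly from the internal devpi index.
# The devpi host below is a placeholder.
- name: Install dependencies from internal index
  run: |
    python -m pip install \
      --index-url "https://devpi.internal.example/root/pypi/+simple/" \
      --requirement requirements.txt
```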
Are there other options I’m potentially missing? Any tips are appreciated. If my assessment of the available options is at least correct, I might contribute some docs to pip as suggested in the issue tracker (although I have other contributions waiting for me to look at them…).
I personally use pip-tools, as it’s fine for my workflow (although I wish I had put in the time to learn pipenv or Poetry for their advanced functionality). I only pin dependencies for my application deployments, however; both developers and libraries get the latest compatible versions.
This is in conjunction with --extra-index-url. We also prefix our project names with a disambiguation to reduce the likelihood of name collisions; perhaps you could ask the PyPI admins to blacklist that prefix, if you’re a big enough organisation.
Another option is to use a service like Azure Artifacts, which (like devpi) will cache public packages, but it’s managed.
As you’ve already pointed out @astrojuanlu, this is my suggestion, based on what I’ve learned from @brettcannon’s pip-secure-install recommendations. pip-tools makes this pretty easy. As @brettcannon points out in his comment too, there isn’t a formal lock file spec yet, though I’ve personally taken to calling the output of pip-compile --generate-hashes a lock file (and even taken to naming it requirements.lock, though the pip-tools team calls it requirements.txt). Importantly though, if you are doing as you point out with
# requirements.txt is the lock file
$ python -m pip install --no-deps --require-hashes --only-binary :all: --requirement requirements.txt
then you are only able to install from the wheels that match the hashes in the requirements.txt.
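For reference, entries in such a hash-pinned file look roughly like this; the package pin is arbitrary and the digests are placeholders, not real hashes:

```
# Generated by pip-compile --generate-hashes (digests replaced with placeholders)
packaging==23.1 \
    --hash=sha256:<digest-of-the-wheel> \
    --hash=sha256:<digest-of-the-sdist>
```

With --require-hashes, pip refuses any distribution file whose digest doesn’t match one of the listed hashes, which is what blocks a squatted substitute.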
I would not immediately detect incompatibilities with new versions of my dependencies though (but perhaps it’s a good thing that my pipelines won’t break inadvertently!)
In my mind this once again comes down to: are you developing a Python library or an application? As @EpicWink has already pointed out, if it is an application, then you really want to be using a lock file anyway; then you can carefully ease up restrictions in your requirements, rebuild your lock file, and rerun your tests to understand what you can update and when. If it is a Python library, then the best you can do is to test your dependencies’ lower bounds with a constraints.txt file that pins them, on the oldest Python you support, and also test against the latest releases or at HEAD of your dependencies (yay for nightly wheels!). (I’m writing this for completeness as Juan and I have already discussed this and he knows my views.)
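To make the lower-bound testing concrete, a sketch of that setup might look like this; the package names and versions are invented for illustration:

```
# constraints-lowest.txt (hypothetical): every dependency pinned at the
# oldest version the library claims to support
numpy==1.21.0
requests==2.25.0
```

The CI job running your oldest supported Python would then install with something like `python -m pip install --constraint constraints-lowest.txt .` and run the test suite against those minimums.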
A slight twist of option 3 would be to provide a pass-through server instead, e.g. a server that receives requests to various packages and simply redirects to one of your actual sources. That should be much cheaper (in many ways) than a full devpi setup. As a further improvement you could even run that server as a part of CI and just use localhost.
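A minimal sketch of such a pass-through server, assuming the PEP 503 “simple” index layout; the name prefix and both index URLs are hypothetical placeholders, not a real deployment:

```python
# Sketch of a pass-through index: redirect each /simple/<project>/ request
# to the index that should serve it. Prefix and URLs below are assumptions.
from http.server import BaseHTTPRequestHandler, HTTPServer

INTERNAL_PREFIX = "mycorp-"                              # hypothetical prefix
PRIVATE_INDEX = "https://pypi.internal.example/simple"   # hypothetical URL
PUBLIC_INDEX = "https://pypi.org/simple"

def route(project: str) -> str:
    """Return the project page URL on the index that should serve this name."""
    base = PRIVATE_INDEX if project.startswith(INTERNAL_PREFIX) else PUBLIC_INDEX
    return f"{base}/{project}/"

class PassThroughHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expect paths like /simple/<project>/
        parts = [p for p in self.path.split("/") if p]
        if len(parts) == 2 and parts[0] == "simple":
            self.send_response(302)  # redirect to the real index
            self.send_header("Location", route(parts[1]))
            self.end_headers()
        else:
            self.send_error(404)

# To run it locally (then point pip at --index-url http://127.0.0.1:8080/simple/):
# HTTPServer(("127.0.0.1", 8080), PassThroughHandler).serve_forever()
```

Because the redirect lands on the target index’s own project page, pip follows the absolute file links there, so the pass-through never has to serve distribution files itself.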
Thanks all for the comments, they’re really helpful!
Glad to see the efforts are still ongoing; I’ve been following this on and off for the past two years. I’ll see if I’m articulate enough to offer an opinion there.
Notice that I’m developing a library - that’s why I brought the question up, since the primary use case of many of these locking workflows is application development instead. However, pip-tools is good enough for me and keeping setup.py and the resulting requirements.txt in sync is not necessarily a huge pain.
Good to know, thanks!
Thanks @matthewfeickert - however if I understand correctly, constraints.txt files won’t save my users or my CI from pulling malicious packages from PyPI, am I right? I’d either need locking or a PyPI proxy for that. Therefore, I see constraints files as a nice addition (even though I’m still wrapping my head around them) on top of the other solutions.
This sounds interesting @uranusjr, and since “it should not be too hard”™, are you aware of any open source implementations of this idea? It sounds like it could be useful for a lot of people.
For more complex routing you could register custom route classes as described in the documentation. Although depending on the service you’re running CI on, sometimes just starting an nginx instance might even be easier, and definitely allow more configurability.
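For the nginx route, a pass-through could look something like the fragment below; the internal host and the name prefix are assumptions, not part of any real setup:

```nginx
# Hypothetical pass-through: internal (prefixed) packages are proxied to the
# private index, everything else is redirected to PyPI.
location ~ ^/simple/mycorp- {
    proxy_pass https://pypi.internal.example;
}
location /simple/ {
    return 302 https://pypi.org$request_uri;
}
```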
The documentation actually directly advises against using --extra-index-url:
In these commands, you can use --extra-index-url instead of --index-url. However, using --extra-index-url makes you vulnerable to dependency confusion attacks because it checks the PyPI repository for the package before it checks the custom repository. --extra-index-url adds the provided URL as an additional registry which the client checks if the package is present. --index-url tells the client to check for the package on the provided URL only.
In other words:
--index-url checks only the GitLab registry; PyPI is not consulted at all.
--extra-index-url checks PyPI in addition to the GitLab registry, so a squatted PyPI package can win, which is what enables dependency confusion.
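Putting that advice into practice, a job in .gitlab-ci.yml would use --index-url alone; the project ID placeholder and package name below are illustrative:

```yaml
# Hypothetical job: only the GitLab PyPI registry is consulted.
install:
  script:
    - pip install --index-url "https://gitlab.example.com/api/v4/projects/<project-id>/packages/pypi/simple" mycorp-utils
```

Note that with --index-url alone, public dependencies must also be resolvable through that registry (or through a proxy in front of it), since pip will never fall back to PyPI.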