Pip fetches from files.pythonhosted.org despite a local mirror being specified

Hello everyone,

I’m using pip 22.2.2 on Python 3.10 with a local package mirror. However, while the local mirror is used as an index, downloads are attempted from the global pythonhosted mirror.

For example, when I run

:~$ pip install pandas -vvvi http://pypi.repo.test.hhu.de/simple --trusted-host pypi.repo.test.hhu.de

pip correctly uses the specified mirror as a package index:

Looking in indexes: http://pypi.repo.test.hhu.de/simple
1 location(s) to search for versions of pandas:
* http://pypi.repo.test.hhu.de/simple/pandas/
Fetching project page and analyzing links: http://pypi.repo.test.hhu.de/simple/pandas/
Getting page http://pypi.repo.test.hhu.de/simple/pandas/
Found index url http://pypi.repo.test.hhu.de/simple
Looking up "http://pypi.repo.test.hhu.de/simple/pandas/" in the cache
Request header has "max_age" as 0, cache bypassed
Starting new HTTP connection (1): pypi.repo.test.hhu.de:80
http://pypi.repo.test.hhu.de:80 "GET /simple/pandas/ HTTP/1.1" 304 0
Fetched page http://pypi.repo.test.hhu.de/simple/pandas/ as application/vnd.pypi.simple.v1+json

but afterwards, all reported links point to pythonhosted:

  Found link https://files.pythonhosted.org/packages/b4/8e/057ebd80a3b6dcda154dd6878744fc5549832a484e72bc4189b8d782be75/pandas-0.1.tar.gz (from http://pypi.repo.test.hhu.de/simple/pandas/), version: 0.1
[...]

How is that possible? The package index page http://pypi.repo.test.hhu.de/simple/pandas/ does not contain a single reference to pythonhosted, as confirmed by these commands:

:~$ curl http://pypi.repo.test.hhu.de/simple/pandas/ | grep pythonhosted.org | wc -l
0
:~$ curl http://pypi.repo.test.hhu.de/simple/pandas/ | grep repo.test.hhu.de | wc -l
1405

0 lines in this HTML file matched pythonhosted, while 1405 lines matched our repository. What am I missing here?

Thanks a lot in advance!

You’ll need to provide a publicly available index for people to reproduce your issue. pip has many levels of caching, and there are too many variables for anyone to provide a concrete explanation based only on the information you provided.

Hi, thanks a lot for the fast reply. Unfortunately, I can’t provide a publicly available index, as it is firewalled for local university use only. Fortunately, we were able to fix it internally.

Our pip mirror is a local nginx instance with URL rewriting enabled. Apparently, since the commit “PoC of PEP 691” (pypa/pip@6f167b5 on GitHub), pip requests the index page with an Accept header that prefers a different MIME type (application/vnd.pypi.simple.v1+json, as seen in the log above). However, nginx by default applies URL rewriting only to text/html content. This made the problem difficult to debug: the responses looked perfectly fine when fetched via curl, but were not rewritten when fetched with pip. Adding sub_filter_types '*'; enables URL rewriting for all MIME types and therefore fixed the issue. From the nginx manual:

Syntax: sub_filter_types mime-type ...;
Default: sub_filter_types text/html;
Context: http, server, location

Enables string replacement in responses with the specified MIME types in addition to “text/html”. The special value “*” matches any MIME type (0.8.29).
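
To make the failure mode easier to see, here is a toy model in Python of a substitution filter that is gated on the response MIME type. It is only a sketch of the behaviour described in the manual excerpt, not nginx itself. The function name, the shortened file paths, and the assumption that the mirror serves files under the same host as the index are all made up for illustration.

```python
# Toy model of a MIME-type-gated substitution filter (illustration only, not nginx).
UPSTREAM = "https://files.pythonhosted.org"
MIRROR = "http://pypi.repo.test.hhu.de"  # assumed file host, for illustration


def apply_sub_filter(body: str, content_type: str, filter_types: set[str]) -> str:
    """Rewrite upstream URLs only if the response MIME type is enabled."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = content_type.split(";", 1)[0].strip().lower()
    if "*" in filter_types or mime in filter_types:
        return body.replace(UPSTREAM, MIRROR)
    return body  # any other MIME type passes through untouched


# Hypothetical, shortened responses in both formats.
html = '<a href="https://files.pythonhosted.org/packages/pandas-0.1.tar.gz">pandas-0.1.tar.gz</a>'
json_body = '{"files": [{"url": "https://files.pythonhosted.org/packages/pandas-0.1.tar.gz"}]}'
json_type = "application/vnd.pypi.simple.v1+json"

default_types = {"text/html"}  # the documented default

# Does the upstream host survive the filter?
print(UPSTREAM in apply_sub_filter(html, "text/html", default_types))      # False
print(UPSTREAM in apply_sub_filter(json_body, json_type, default_types))   # True
print(UPSTREAM in apply_sub_filter(json_body, json_type, {"*"}))           # False
```

With the default setting, the HTML page is rewritten while the JSON page keeps its pythonhosted links, which matches what we saw: curl looked fine, pip did not.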

Additionally, pip’s caching made debugging harder, as we initially didn’t notice that the issue was already fixed. So if you have similar problems, try clearing your cache with pip cache remove '*' (quote the asterisk so the shell does not expand it).

Still, thanks a lot for the fast response 🙂 Maybe this helps other people with a similar issue in the future.

Maybe this is related and helpful: PEP 691: JSON-based Simple API for Python Package Indexes - #84 by dstufft
(This is outside my area of knowledge, so it may not actually be related even though it seems so to me.)

This is indeed related, thanks for linking it! My search keywords for this error didn’t turn up that post, so it’s helpful that it is linked here now.

You also may want to keep an eye out for possible cases of this
issue:

It caught us by surprise doing similar URL rewriting with Apache,
because the JSON responses are all on one line of text and can be
many megabytes in length for some projects, so it required custom
tuning for us to continue being able to rewrite the file URLs in
those specific situations.
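
A different technique from line-based substitution, for setups where the proxy can hand the response body to a script, is to parse the JSON and rewrite the url field of each file entry. The Python sketch below does that; it is not what we run with Apache. The helper name and sample payload are hypothetical, the mirror prefix is borrowed from this thread, and the field names follow my reading of PEP 691.

```python
import json

UPSTREAM = "https://files.pythonhosted.org"
MIRROR = "http://pypi.repo.test.hhu.de"  # hypothetical prefix for mirrored files


def rewrite_file_urls(payload: str, old: str = UPSTREAM, new: str = MIRROR) -> str:
    """Point every files[].url that starts with `old` at `new` instead."""
    doc = json.loads(payload)
    for entry in doc.get("files", []):
        url = entry.get("url", "")
        if url.startswith(old):
            entry["url"] = new + url[len(old):]
    # Compact separators keep the output on a single line, like the input.
    return json.dumps(doc, separators=(",", ":"))


# Hypothetical, heavily shortened project page in the JSON format.
sample = json.dumps({
    "meta": {"api-version": "1.0"},
    "name": "pandas",
    "files": [{
        "filename": "pandas-0.1.tar.gz",
        "url": "https://files.pythonhosted.org/packages/pandas-0.1.tar.gz",
        "hashes": {},
    }],
})

print(rewrite_file_urls(sample))
```

Because the document is parsed as a whole, the length of a single line no longer matters, at the cost of holding the full response in memory.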

Thanks a lot! I’ll keep an eye on that, especially if further issues arise. So far my tests have been successful, so let’s hope it continues to work as-is.

I guess the issue arises because “text/html” comes last in the Accept header that pip sends: pip/collector.py at main · pypa/pip · GitHub

I did guess in the PR that this might cause an issue for someone, but it’s impossible to test against all the private configurations people might have set up and then somehow warn them: https://github.com/pypa/pip/pull/11158#issuecomment-1186210795
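
To illustrate the ordering point, here is a rough Python sketch of how a server might pick a format from that Accept header. The header string is what I recall collector.py sending at the time, so treat it as an assumption. The negotiation logic is a simplification with only basic q-value and `*/*` handling, and the function names are made up.

```python
from typing import Optional

# Accept header as recalled from pip's collector.py; treat as an assumption.
PIP_ACCEPT = (
    "application/vnd.pypi.simple.v1+json, "
    "application/vnd.pypi.simple.v1+html; q=0.1, "
    "text/html; q=0.01"
)


def parse_accept(header: str) -> dict[str, float]:
    """Map each media type in an Accept header to its q-value (default 1.0)."""
    prefs = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        q = 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                q = float(param[2:])
        prefs[parts[0].lower()] = q
    return prefs


def negotiate(header: str, offered: list[str]) -> Optional[str]:
    """Return the offered type the client prefers most, or None if none match."""
    prefs = parse_accept(header)

    def quality(media_type: str) -> float:
        # Fall back to the */* wildcard when the type is not listed explicitly.
        return prefs.get(media_type, prefs.get("*/*", 0.0))

    candidates = [t for t in offered if quality(t) > 0.0]
    # On ties, max() keeps the first candidate, i.e. the server's own order.
    return max(candidates, key=quality, default=None)


server_offers = ["text/html", "application/vnd.pypi.simple.v1+json"]
print(negotiate(PIP_ACCEPT, server_offers))  # pip-style request
print(negotiate("*/*", server_offers))       # curl-style request
```

In this model the pip-style request selects the JSON format, while a `*/*` request gets whatever the server lists first, here HTML. That would explain why the same URL can look different depending on the client.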

Apparently, you were right with that guess: that’s exactly what hit my university mirror.

If you still aren’t able to figure this out, consider running pip with -vvv and posting the output of that run. That’ll contain sufficient information to help point out what exactly is happening.

Fortunately, we were able to fix this already (see #3 for the full procedure: https://discuss.python.org/t/pip-fetches-from-files-pythonhosted-org-despite-local-mirror-was-specified/19320/3)

However, -vvv didn’t help much, as the output I posted above looked fine. We used tcpdump to look at the requests sent by pip and noticed that the request headers were different from curl’s, and therefore the response was as well. Afterwards we found the corresponding git change and the hints kindly provided in this thread. Thanks for all the support, and keep up the great work!
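
In hindsight, a small script would have shown the difference without tcpdump. The Python sketch below fetches the same URL twice, once with curl’s default `Accept: */*` and once with the header I believe pip sends (an assumption, recalled from collector.py), and counts which hosts the returned links point to. The function names are hypothetical.

```python
import re
import urllib.request
from collections import Counter
from urllib.parse import urlsplit

CURL_ACCEPT = "*/*"
# Assumed to match what pip sends; recalled from collector.py, not verified here.
PIP_ACCEPT = (
    "application/vnd.pypi.simple.v1+json, "
    "application/vnd.pypi.simple.v1+html; q=0.1, "
    "text/html; q=0.01"
)

URL_RE = re.compile(r"""https?://[^\s"'<>]+""")


def count_hosts(body: str) -> Counter:
    """Count how often each host appears among the absolute URLs in `body`."""
    return Counter(urlsplit(u).netloc for u in URL_RE.findall(body))


def fetch(url: str, accept: str) -> tuple[str, str]:
    """Return (Content-Type, body) for `url` requested with the given Accept header."""
    req = urllib.request.Request(url, headers={"Accept": accept})
    with urllib.request.urlopen(req) as resp:
        ctype = resp.headers.get("Content-Type", "")
        return ctype, resp.read().decode("utf-8", errors="replace")


def compare(url: str) -> None:
    """Print the content type and link hosts for both styles of request."""
    for label, accept in (("curl-like", CURL_ACCEPT), ("pip-like", PIP_ACCEPT)):
        ctype, body = fetch(url, accept)
        print(label, ctype, dict(count_hosts(body)))


# Offline demonstration of the counting step on a made-up snippet.
snippet = (
    '{"url": "https://files.pythonhosted.org/packages/a.tar.gz"} '
    '<a href="http://pypi.repo.test.hhu.de/packages/b.tar.gz">b</a>'
)
print(dict(count_hosts(snippet)))
```

Calling `compare("http://pypi.repo.test.hhu.de/simple/pandas/")` from inside our network would have been the real check. I would expect the pip-like request to show files.pythonhosted.org before the fix and only our mirror afterwards.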
