[SOLVED] Pip does not cache packages when using `--index-url`?

I have a pip.conf that sets a custom index-url pointing to an internal company Artifactory instance:

[global]
index-url = https://user:pass@subdomain.jfrog.io/artifactory/api/pypi/python/simple

and I observe that pip doesn’t cache packages, and always downloads them every single time:

❯ pip cache list
No locally built wheels cached.
❯ pip uninstall pyerfa -y && pip install pyerfa
Found existing installation: pyerfa 2.0.0.3
Uninstalling pyerfa-2.0.0.3:
  Successfully uninstalled pyerfa-2.0.0.3
Looking in indexes: https://juser:pass@subdomain.jfrog.io/artifactory/api/pypi/python/simple
Collecting pyerfa
  Downloading https://subdomain.jfrog.io/artifactory/api/pypi/python/packages/packages/62/58/accc45eea0a16180b0b91055ac2abf49af32407c5e2e24b9b746e1058a49/pyerfa-2.0.0.3-cp38-cp38-macosx_11_0_arm64.whl (333 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 333.2/333.2 kB 947.4 kB/s eta 0:00:00
Requirement already satisfied: numpy>=1.17 in ./.micromamba/envs/kedro38-dev/lib/python3.8/site-packages (from pyerfa) (1.22.4)
Installing collected packages: pyerfa
Successfully installed pyerfa-2.0.0.3
  ~ ·······························································································  4s 14:33:2
❯ pip cache list                                                                                     (kedro38-dev) 
No locally built wheels cached.

On the other hand, if I comment that setting out, pip caching behaves as normal (although somehow pip cache list doesn’t reflect that??)

  ~ ·······························································································  4s 14:33:2
❯ pip cache list                                                                                     (kedro38-dev) 
No locally built wheels cached.
  ~ ···································································································· 14:33:2
❯ vim ~/.config/pip/pip.conf                                                                         (kedro38-dev) 
  ~ ···································································································· 14:34:0
❯ pip uninstall pyerfa -y && pip install pyerfa                                                      (kedro38-dev) 
Found existing installation: pyerfa 2.0.0.3
Uninstalling pyerfa-2.0.0.3:
  Successfully uninstalled pyerfa-2.0.0.3
Collecting pyerfa
  Downloading pyerfa-2.0.0.3-cp38-cp38-macosx_11_0_arm64.whl (333 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 333.2/333.2 kB 2.8 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.17 in ./.micromamba/envs/kedro38-dev/lib/python3.8/site-packages (from pyerfa) (1.22.4)
Installing collected packages: pyerfa
Successfully installed pyerfa-2.0.0.3
  ~ ·······························································································  3s 14:34:1
❯ pip cache list                                                                                     (kedro38-dev) 
No locally built wheels cached.
  ~ ···································································································· 14:34:1
❯ pip uninstall pyerfa -y && pip install pyerfa                                                      (kedro38-dev) 
Found existing installation: pyerfa 2.0.0.3
Uninstalling pyerfa-2.0.0.3:
  Successfully uninstalled pyerfa-2.0.0.3
Collecting pyerfa
  Using cached pyerfa-2.0.0.3-cp38-cp38-macosx_11_0_arm64.whl (333 kB)
Requirement already satisfied: numpy>=1.17 in ./.micromamba/envs/kedro38-dev/lib/python3.8/site-packages (from pyerfa) (1.22.4)
Installing collected packages: pyerfa
Successfully installed pyerfa-2.0.0.3

(notice how the second time it’s “Using cached”, which is what I want :+1: although pip cache list says there’s nothing in the cache)

If I switch back to my --index-url, then pip stops “seeing” the package that is cached.

Is this normal/expected? I tried looking into the pip issues without success. Before opening a new one myself, I wanted to check.

Note that pip cache list refers to “locally built wheels”. In this case, there is no locally built wheel, the wheel is downloaded, and so is satisfied by the HTTP cache (which is separate, and not shown by pip cache).
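To see the HTTP cache that `pip cache list` does not show, one option is to look at pip's cache directory on disk. Below is a minimal sketch. It assumes the HTTP cache lives in subdirectories of the cache root whose names start with "http" (the exact name depends on the pip version), and the helper names `pip_cache_dir` and `http_cache_stats` are made up for this example.

```python
import subprocess
import sys
from pathlib import Path
from typing import Dict, Tuple


def pip_cache_dir() -> Path:
    """Ask pip for the root of its cache via `pip cache dir`."""
    out = subprocess.run(
        [sys.executable, "-m", "pip", "cache", "dir"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return Path(out)


def http_cache_stats(cache_dir: Path) -> Dict[str, Tuple[int, int]]:
    """Return {subdir_name: (file_count, total_bytes)} for HTTP cache dirs.

    Assumption: the HTTP cache is stored in subdirectories named "http*"
    next to the "wheels" directory that `pip cache list` reports on.
    """
    stats = {}
    for sub in sorted(cache_dir.glob("http*")):
        if not sub.is_dir():
            continue
        files = [p for p in sub.rglob("*") if p.is_file()]
        stats[sub.name] = (len(files), sum(p.stat().st_size for p in files))
    return stats
```

Calling `http_cache_stats(pip_cache_dir())` before and after an install should show whether the download added anything to the HTTP cache.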

The HTTP cache will treat the two indexes as different, and that could be part of the issue here. Also, it’s possible that Artifactory is for some reason not setting the HTTP headers correctly to allow caching - I doubt that, but it’s worth checking.

You should probably run pip in verbose mode to get more details. This is unlikely to be a pip issue, but it’s not clear what the problem actually is without more details (and you may well be able to work out the problem for yourself once you see those extra details).

I am in the process of implementing a private index internally to get away from Artifactory for compliance reasons and ran into the same issue. Basically, if you want to get the wheels use pip wheel -r ... and then compare the hashes to your internal mirror. If they differ then it means you built it.
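The hash comparison can be done with the standard library alone. This is only a sketch: `sha256_of` and `matches_index` are hypothetical helper names, and the expected digest is whatever your mirror lists for that file.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large wheels are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_index(path: Path, index_sha256: str) -> bool:
    """Compare a local wheel against the sha256 value listed by the index."""
    return sha256_of(path) == index_sha256.strip().lower()
```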

I think the index needs to provide wheel hashes before one of the forms of caching is allowed.

Run:

curl 'https://user:pass@subdomain.jfrog.io/artifactory/api/pypi/python/simple/pyerfa'

(replacing parts as required)

and see if the file links have hashes (each a element's href value should end with a URL fragment such as #sha256=...)
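If the page is long, checking by eye gets tedious. Here is a small sketch that lists the links lacking a hash fragment. It only looks for `sha256=` fragments, as suggested above, and `LinkCollector` and `links_without_hash` are names invented for the example.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect the href of every <a> element on an index page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def links_without_hash(page_html: str) -> List[str]:
    """Return the hrefs whose URL fragment is not of the form sha256=..."""
    parser = LinkCollector()
    parser.feed(page_html)
    return [
        href for href in parser.hrefs
        if not urlsplit(href).fragment.startswith("sha256=")
    ]
```

Feed it the output of the curl command; an empty list means every file link carries a sha256 fragment.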

That’s useful, thanks for the clarification.

This is what I see:

$ curl -u "user:password" https://subdomain.jfrog.io/artifactory/api/pypi/python/simple/pyerfa/
<!DOCTYPE html>
<html><head><title>Simple Index</title><meta name="api-version" value="2" /></head><body>
<a href="../../packages/packages/9b/f7/6eafee6e8028a692a5bf303d3f32f2e3ea289d7c6cf50079d34ad874a7dd/pyerfa-1.7.0-cp36-cp36m-macosx_10_9_x86_64.whl#sha256=56f4ca4898b3ac9e954b6fe5cadf13b99aff6e9ad0656c3430f5be26b94e6c33" data-requires-python="&gt;=3.6" rel="internal">pyerfa-1.7.0-cp36-cp36m-macosx_10_9_x86_64.whl</a>
<a href="../../packages/packages/b7/57/02a654525d9ccc60601a1fca2a11df7cffb67aff6308822c6dde6451cb1c/pyerfa-1.7.0-cp36-cp36m-manylinux1_i686.whl#sha256=fc719188101eff0f95afd711b1e26e62fe0a7b8790eb85c822638bf804ea5e10" data-requires-python="&gt;=3.6" rel="internal">pyerfa-1.7.0-cp36-cp36m-manylinux1_i686.whl</a>
...

so it’s my understanding that the index is providing hashes.

We are still investigating this issue with your input, will report back.

Can I ask, why is this an issue anyway? A download of a 333 kB file doesn’t seem like it’s going to be a major issue. And if it is, then that’s more of a network or index server performance problem, surely?

It also matters what the HTTP cache headers are. Can you run curl with -D -, and check that it does not set no-cache or a really short TTL on the response headers?

Multiply this by all packages, all their dependencies, over and over again…

The performance hit is noticeable, especially since pip cannot do concurrent downloads.

Thanks, this is what I see:

HTTP/1.1 200 
Date: Fri, 10 Nov 2023 13:09:27 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive
X-JFrog-Version: Artifactory/7.72.1 77201900
X-Artifactory-Id: 91c75d1c4d86af677162467b517822e47ce9a914
X-Artifactory-Node-Id: subdomain-artifactory-primary-2
Cache-Control: max-age=60
Strict-Transport-Security: max-age=31536000; includeSubDomains
X-Request-ID: 3033c3c6dc6862362954c9eebc8044b4

We suspect this has to do with how JFrog (or our installation) manages cloud storage (see the JFrog Help Center). We're still investigating; thanks for the help :pray:

This means that the cached response is considered fresh for only 60 seconds.
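For reading such headers programmatically, here is a rough sketch of a Cache-Control parser. It is a simplification (it ignores directives like `s-maxage` and the `Expires` header), and `parse_cache_control` and `freshness_seconds` are hypothetical names. Note also that the response above has `Content-Type: text/html`, so it appears to be the index page; the wheel download may send different headers.

```python
def parse_cache_control(value: str) -> dict:
    """Split a Cache-Control header into {directive: argument or None}."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def freshness_seconds(value: str):
    """Return max-age in seconds, 0 for no-store/no-cache, None if unspecified."""
    directives = parse_cache_control(value)
    if "no-store" in directives or "no-cache" in directives:
        return 0
    if directives.get("max-age") is not None:
        return int(directives["max-age"])
    return None
```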

It must be something else, because even if I run a pip install three times within 60 seconds, it still doesn’t cache anything:

❯ date && pip install --force-reinstall numpy --no-deps
Fri Nov 10 14:20:21 CET 2023
Looking in indexes: https://user:***@subdomain.jfrog.io/artifactory/api/pypi/python/simple
Collecting numpy
  Downloading https://subdomain.jfrog.io/artifactory/api/pypi/python/packages/packages/e3/63/fd76159cb76c682171e3bf50ed0ee8704103035a9347684a2ec0914b84a1/numpy-1.26.1-cp310-cp310-macosx_11_0_arm64.whl (14.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.0/14.0 MB 4.5 MB/s eta 0:00:00
Installing collected packages: numpy
Successfully installed numpy-1.26.1
  ~ ····································································································································  20s 14:20:4
❯ date && pip install --force-reinstall numpy --no-deps                                                                                       (kedro310) 
Fri Nov 10 14:20:42 CET 2023
Looking in indexes: https://user:***@subdomain.jfrog.io/artifactory/api/pypi/python/simple
Collecting numpy
  Downloading https://subdomain.jfrog.io/artifactory/api/pypi/python/packages/packages/e3/63/fd76159cb76c682171e3bf50ed0ee8704103035a9347684a2ec0914b84a1/numpy-1.26.1-cp310-cp310-macosx_11_0_arm64.whl (14.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.0/14.0 MB 8.2 MB/s eta 0:00:00
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.1
    Uninstalling numpy-1.26.1:
      Successfully uninstalled numpy-1.26.1
Successfully installed numpy-1.26.1
  ~ ····································································································································  10s 14:20:5
❯ date && pip install --force-reinstall numpy --no-deps                                                                                       (kedro310) 
Fri Nov 10 14:20:53 CET 2023
Looking in indexes: https://user:***@subdomain.jfrog.io/artifactory/api/pypi/python/simple
Collecting numpy
  Downloading https://subdomain.jfrog.io/artifactory/api/pypi/python/packages/packages/e3/63/fd76159cb76c682171e3bf50ed0ee8704103035a9347684a2ec0914b84a1/numpy-1.26.1-cp310-cp310-macosx_11_0_arm64.whl (14.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.0/14.0 MB 6.6 MB/s eta 0:00:00
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.1
    Uninstalling numpy-1.26.1:
      Successfully uninstalled numpy-1.26.1
Successfully installed numpy-1.26.1

At this point, I guess reach out to JFrog. :person_shrugging:t2:

This is almost definitely related to the configuration of the index server. If this is a pip bug, then we’d need a self-contained reproducer that pip maintainers can run.

https://subdomain.jfrog.io/artifactory/api/pypi/python/packages/packages/e3/63/fd76159cb76c682171e3bf50ed0ee8704103035a9347684a2ec0914b84a1/numpy-1.26.1-cp310-cp310-macosx_11_0_arm64.whl

It might be worth testing whether this URL serves the file directly or goes through a redirection, potentially to a URL with a short-lived access token in it. In that case, pip’s HTTP cache will store the response against the URL with the token, and the next time you request the file the URL will be different, so the cached entry is never found.
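One way to test this without a browser is to issue a single request and not follow redirects. A sketch using only the standard library follows; `probe_redirect` is a made-up name, and for a private index you would have to pass whatever `Authorization` header your server expects through the `headers` argument, since credentials embedded in the URL are not sent by this code.

```python
import http.client
from urllib.parse import urlsplit


def probe_redirect(url: str, headers=None, timeout: float = 10.0):
    """Send one GET without following redirects; return (status, Location)."""
    parts = urlsplit(url)
    conn_cls = (
        http.client.HTTPSConnection
        if parts.scheme == "https"
        else http.client.HTTPConnection
    )
    conn = conn_cls(parts.hostname, parts.port, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path, headers=headers or {})
        resp = conn.getresponse()
        # Only the status line and headers are needed; the body is not read.
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()
```

A 3xx status with a `Location` that changes between two calls (for example because of a signed query string) would point to the token problem described above.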

We use simpleindex to hide upstream URLs from pip so that caching works on Azure Artifacts. It wouldn’t surprise me if Artifactory was also using redirects and access tokens in a similar way.

@steve.dower nailed it: JFrog is performing an HTTP 302 redirect to S3, which is probably confusing pip.

Could you share a bit more of your setup? I’m already familiar with simpleindex (see Advice to avoid `--extra-index-url` to install private packages from GitLab CI - #8 by astrojuanlu) so I will follow your advice and give it a try for this problem.

There’s really nothing more to share. A blanket wildcard setting that directs all package requests to a single feed is enough to make pip only see localhost URLs and cache them properly.

I did write a small extension (currently private, but it’s on my list to publish) for connecting to Azure Artifacts transparently when inside an Azure Pipelines build environment. You may want something similar to make it safer to inject an access token, but it really is just the obvious implementation of a Route (with a few tricks to stream through a streaming request, but as the second stage is local that barely matters).

This is what I tried:

[routes."{project}"]
source = "http"
to = "https://user:pass@subdomain.jfrog.io/artifactory/api/pypi/python/simple/{project}/"

[server]
host = "127.0.0.1"
port = 7990

but then pip install still sees the JFrog URL:

  ~ ································································· 14:24:5
❯ date && pip install --force-reinstall numpy --no-deps              (kedro310) 
Wed Nov 15 14:24:53 CET 2023
Looking in indexes: http://localhost:7990
Collecting numpy
  Downloading https://user:pass@subdomain.jfrog.io/artifactory/api/pypi/python/packages/packages/2f/ac/be1f2767b7222347d2fefc18d8d58e9febfd9919190cc6fbd8a4d22d6eab/numpy-1.26.2-cp310-cp310-macosx_11_0_arm64.whl (14.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.0/14.0 MB 24.7 MB/s eta 0:00:00
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.1
    Uninstalling numpy-1.26.1:
      Successfully uninstalled numpy-1.26.1
Successfully installed numpy-1.26.2
  ~ ···························································  15s 14:25:0
❯ date && pip install --force-reinstall numpy --no-deps              (kedro310) 
Wed Nov 15 14:25:11 CET 2023
Looking in indexes: http://localhost:7990
Collecting numpy
  Downloading https://user:pass@subdomain.jfrog.io/artifactory/api/pypi/python/packages/packages/2f/ac/be1f2767b7222347d2fefc18d8d58e9febfd9919190cc6fbd8a4d22d6eab/numpy-1.26.2-cp310-cp310-macosx_11_0_arm64.whl (14.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.0/14.0 MB 25.8 MB/s eta 0:00:00
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.2
    Uninstalling numpy-1.26.2:
      Successfully uninstalled numpy-1.26.2
Successfully installed numpy-1.26.2

the logs of simpleindex say

INFO:     127.0.0.1:53829 - "GET /numpy/ HTTP/1.1" 302 Found
INFO:     127.0.0.1:53837 - "GET /numpy/ HTTP/1.1" 302 Found

I might have to look into implementing a custom Route then.

Oh, it looks like we do rewrite the page to use relative links (the for loop in the middle of the function). You’ll probably need to do the same, otherwise the subsequent request will bypass simpleindex.

    async def get_page(self, params):
        try:
            index_url = await self._get_package_url(params)
            page = await self.__client.get(index_url, auth=self._get_auth())
            page.raise_for_status()
            doc = html5lib.parse(await page.aread())
        except httpx.HTTPStatusError as ex:
            if ex.response.status_code != 404:
                log(ex, tb=True)
            return Response(
                status_code=ex.response.status_code,
                content=await ex.response.aread(),
                media_type=ex.response.headers.get("Content-Type", "text/plain"),
            )

        for link in doc.findall("*/{http://www.w3.org/1999/xhtml}a"):
            bits = urlsplit(link.get("href"))
            filename = bits.path.rpartition("/")[-1]
            self.__filenames[filename] = link.get("href")
            new_href = f"./{filename}#{bits.fragment}"
            link.set("href", new_href)

        return Response(
            status_code=page.status_code,
            content=html5lib.serialize(doc, encoding="utf-8"),
            media_type="text/html",
            headers={"Cache-Control": CACHE_CONTROL},
        )
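For anyone who wants to try the link rewriting from the for loop above without html5lib, here is a stdlib-only simplification. It swaps in a regular expression for the HTML parser and assumes every href is double-quoted, as in the Artifactory page shown earlier; `rewrite_links` is a name made up for this sketch and is not part of simpleindex.

```python
import re
from urllib.parse import urlsplit

# Assumption: hrefs are written as href="..." with double quotes.
HREF_RE = re.compile(r'href="([^"]*)"')


def rewrite_links(page_html: str):
    """Return (rewritten_html, {filename: original_href}).

    Each link is replaced by a relative one that keeps only the filename
    and the hash fragment, so the follow-up request goes to the proxy.
    """
    upstream = {}

    def _swap(match):
        href = match.group(1)
        bits = urlsplit(href)
        filename = bits.path.rpartition("/")[-1]
        upstream[filename] = href
        fragment = f"#{bits.fragment}" if bits.fragment else ""
        return f'href="./{filename}{fragment}"'

    return HREF_RE.sub(_swap, page_html), upstream
```

The returned mapping plays the role of `self.__filenames` above: the file handler of the route has to look up the original URL and fetch the file itself, otherwise pip is sent back upstream.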

After a bit of digging I’m almost there: I had to implement the get_file method from the Route and forward the response as explained in the HTTPX Async Support documentation.

However, when I run pip install ... now, I don’t see the Looking up "..." in the cache message in the verbose logs; it goes directly to fetching the package:

Given no hashes to check 146 links for project 'numpy': discarding no candidates
Collecting numpy
  Created temporary directory: /private/var/folders/r7/ywj0_kvj0mxfkdkx0jrgh73r0000gn/T/pip-unpack-2t5rq1jm
  Found index url http://localhost:7990/
  http://localhost:7990 "GET /numpy/numpy-1.24.4-cp38-cp38-macosx_11_0_arm64.whl HTTP/1.1" 200 None
  ...

How does pip decide whether to use CacheController.cached_request or not?

Can you run pip with --log out.log and post the logs in a GitHub Gist or something similar, for one of us to look at?

Sure, thanks! I posted them as a GitHub Gist titled "`pip install` logs when using a custom simpleindex, see https://discuss.python.org/t/pip-does-not-cache-packages-when-using-index-url/38228?u=astrojuanlu".

Oh I had it in front of my eyes:

pip is not caching http requests
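To spell out the conclusion for later readers: the local simpleindex proxy above is served over plain http://, and the linked issue title says pip does not cache HTTP requests. My understanding, which is an assumption worth verifying against the source of your pip version, is that pip only attaches its caching adapter to https:// URLs and to hosts passed via --trusted-host. The following is a simplified model of that decision, not pip's actual code, and `would_use_http_cache` is a hypothetical name.

```python
from urllib.parse import urlsplit


def would_use_http_cache(url: str, trusted_hosts=()) -> bool:
    """Simplified model (assumption, not pip's code) of when responses are cached.

    https:// URLs are treated as cacheable; http:// URLs only if the host,
    or host:port, appears in trusted_hosts.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        return True
    host = parts.hostname or ""
    netloc = f"{host}:{parts.port}" if parts.port else host
    return host in trusted_hosts or netloc in trusted_hosts
```

If that reading is right, serving the proxy over HTTPS or marking the proxy host as trusted would be the things to try next.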