astrojuanlu
(Juan Luis Cano Rodríguez)
November 8, 2023, 1:37pm
1
I have a pip.conf
that includes a custom --index-url
with an internal company artifactory:
[global]
index-url = https://user:pass@subdomain.jfrog.io/artifactory/api/pypi/python/simple
and I observe that pip
doesn’t cache packages, and always downloads them every single time:
❯ pip cache list (kedro38-dev)
No locally built wheels cached.
~ ···································································································· 14:33:1
❯ pip uninstall pyerfa -y && pip install pyerfa (kedro38-dev)
Found existing installation: pyerfa 2.0.0.3
Uninstalling pyerfa-2.0.0.3:
Successfully uninstalled pyerfa-2.0.0.3
Looking in indexes: https://user:pass@subdomain.jfrog.io/artifactory/api/pypi/python/simple
Collecting pyerfa
Downloading https://subdomain.jfrog.io/artifactory/api/pypi/python/packages/packages/62/58/accc45eea0a16180b0b91055ac2abf49af32407c5e2e24b9b746e1058a49/pyerfa-2.0.0.3-cp38-cp38-macosx_11_0_arm64.whl (333 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 333.2/333.2 kB 947.4 kB/s eta 0:00:00
Requirement already satisfied: numpy>=1.17 in ./.micromamba/envs/kedro38-dev/lib/python3.8/site-packages (from pyerfa) (1.22.4)
Installing collected packages: pyerfa
Successfully installed pyerfa-2.0.0.3
~ ······························································································· 4s 14:33:2
❯ pip cache list (kedro38-dev)
No locally built wheels cached.
On the other hand, if I comment that setting out, pip caching behaves as normal (although somehow pip cache list
doesn’t reflect that??)
~ ······························································································· 4s 14:33:2
❯ pip cache list (kedro38-dev)
No locally built wheels cached.
~ ···································································································· 14:33:2
❯ vim ~/.config/pip/pip.conf (kedro38-dev)
~ ···································································································· 14:34:0
❯ pip uninstall pyerfa -y && pip install pyerfa (kedro38-dev)
Found existing installation: pyerfa 2.0.0.3
Uninstalling pyerfa-2.0.0.3:
Successfully uninstalled pyerfa-2.0.0.3
Collecting pyerfa
Downloading pyerfa-2.0.0.3-cp38-cp38-macosx_11_0_arm64.whl (333 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 333.2/333.2 kB 2.8 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.17 in ./.micromamba/envs/kedro38-dev/lib/python3.8/site-packages (from pyerfa) (1.22.4)
Installing collected packages: pyerfa
Successfully installed pyerfa-2.0.0.3
~ ······························································································· 3s 14:34:1
❯ pip cache list (kedro38-dev)
No locally built wheels cached.
~ ···································································································· 14:34:1
❯ pip uninstall pyerfa -y && pip install pyerfa (kedro38-dev)
Found existing installation: pyerfa 2.0.0.3
Uninstalling pyerfa-2.0.0.3:
Successfully uninstalled pyerfa-2.0.0.3
Collecting pyerfa
Using cached pyerfa-2.0.0.3-cp38-cp38-macosx_11_0_arm64.whl (333 kB)
Requirement already satisfied: numpy>=1.17 in ./.micromamba/envs/kedro38-dev/lib/python3.8/site-packages (from pyerfa) (1.22.4)
Installing collected packages: pyerfa
Successfully installed pyerfa-2.0.0.3
(notice how the second time it’s “Using cached”, which is what I want although pip cache list
says there’s nothing in the cache)
If I switch back to my --index-url
, then pip stops “seeing” the package that is cached.
Is this normal/expected? I tried looking into the pip
issues without success. Before opening a new one myself, I wanted to check.
pf_moore
(Paul Moore)
November 8, 2023, 2:12pm
2
Note that pip cache list
refers to “locally built wheels”. In this case, there is no locally built wheel, the wheel is downloaded, and so is satisfied by the HTTP cache (which is separate, and not shown by pip cache
).
The HTTP cache will treat the two indexes as different, and that could be part of the issue here. Also, it’s possible that Artifactory is for some reason not setting the HTTP headers correctly to allow caching - I doubt that, but it’s worth checking.
You should probably run pip in verbose mode to get more details. This is unlikely to be a pip issue, but it’s not clear what the problem actually is without more details (and you may well be able to work out the problem for yourself once you see those extra details).
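To see what the separate caches actually hold, one option is to look at the cache directory directly. This is a minimal sketch, not part of pip: it asks pip for its cache location with `pip cache dir` and sums up each top-level subdirectory. The helper names are made up for illustration, and which subdirectories exist (an HTTP cache directory and a wheels directory, for example) depends on the pip version, so nothing here is hard-coded.

```python
import os
import subprocess
import sys
from pathlib import Path


def dir_stats(path):
    """Return (file_count, total_bytes) for all files below path."""
    count = 0
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            count += 1
    return count, total


def summarize_cache(cache_dir):
    """Map each top-level subdirectory name to its (file_count, total_bytes)."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return {}
    return {
        child.name: dir_stats(child)
        for child in sorted(cache_dir.iterdir())
        if child.is_dir()
    }


def pip_cache_dir():
    """Ask the pip of the running interpreter where its cache lives."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "cache", "dir"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def main():
    for name, (count, size) in summarize_cache(pip_cache_dir()).items():
        print(f"{name}: {count} files, {size / 1e6:.1f} MB")
```

Calling `main()` before and after an install should show whether the HTTP cache grows even though `pip cache list` reports no locally built wheels.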
ofek
(Ofek Lev)
November 9, 2023, 4:26am
3
I am in the process of implementing a private index internally to get away from Artifactory for compliance reasons and ran into the same issue. Basically, if you want to get the wheels use pip wheel -r ...
and then compare the hashes to your internal mirror. If they differ then it means you built it.
EpicWink
(Laurie O)
November 9, 2023, 6:56am
4
I think the index needs to provide wheel hashes before one of the forms of caching is allowed.
Run:
curl 'https://user:pass@subdomain.jfrog.io/artifactory/api/pypi/python/simple/pyerfa'
(replacing parts as required)
and see if the file links have hashes (each <a> element's href value should end in a URL fragment of the form sha256=...).
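The same check can be scripted instead of eyeballing the curl output. A minimal sketch using only the standard library, assuming the page is an HTML simple index like the one described above; `links_without_hash` is a hypothetical helper name, and the accepted algorithm list is an assumption you may want to widen.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> element on the page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def links_without_hash(html, algorithms=("sha256",)):
    """Return the hrefs whose URL fragment is not of the form <algorithm>=<value>."""
    parser = _LinkCollector()
    parser.feed(html)
    missing = []
    for href in parser.hrefs:
        name, _, value = urlsplit(href).fragment.partition("=")
        if name not in algorithms or not value:
            missing.append(href)
    return missing
```

Feed it the body returned by the curl command; an empty list means every file link carries a hash.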
astrojuanlu
(Juan Luis Cano Rodríguez)
November 10, 2023, 9:42am
5
Paul Moore:
Note that pip cache list
refers to “locally built wheels”. In this case, there is no locally built wheel, the wheel is downloaded, and so is satisfied by the HTTP cache (which is separate, and not shown by pip cache
).
The HTTP cache will treat the two indexes as different, and that could be part of the issue here. Also, it’s possible that Artifactory is for some reason not setting the HTTP headers correctly to allow caching - I doubt that, but it’s worth checking.
That’s useful, thanks for the clarification.
This is what I see:
$ curl -u "user:password" https://subdomain.jfrog.io/artifactory/api/pypi/python/simple/pyerfa/ (kedro310)
<!DOCTYPE html>
<html><head><title>Simple Index</title><meta name="api-version" value="2" /></head><body>
<a href="../../packages/packages/9b/f7/6eafee6e8028a692a5bf303d3f32f2e3ea289d7c6cf50079d34ad874a7dd/pyerfa-1.7.0-cp36-cp36m-macosx_10_9_x86_64.whl#sha256=56f4ca4898b3ac9e954b6fe5cadf13b99aff6e9ad0656c3430f5be26b94e6c33" data-requires-python=">=3.6" rel="internal">pyerfa-1.7.0-cp36-cp36m-macosx_10_9_x86_64.whl</a>
<a href="../../packages/packages/b7/57/02a654525d9ccc60601a1fca2a11df7cffb67aff6308822c6dde6451cb1c/pyerfa-1.7.0-cp36-cp36m-manylinux1_i686.whl#sha256=fc719188101eff0f95afd711b1e26e62fe0a7b8790eb85c822638bf804ea5e10" data-requires-python=">=3.6" rel="internal">pyerfa-1.7.0-cp36-cp36m-manylinux1_i686.whl</a>
...
so it’s my understanding that the index is providing hashes.
We are still investigating this issue with your input, will report back.
pf_moore
(Paul Moore)
November 10, 2023, 10:09am
6
Can I ask, why is this an issue anyway? A download of a 333Kb file doesn’t seem like it’s going to be a major issue. And if it is, then that’s more of a network or index server performance problem, surely?
pradyunsg
(Pradyun Gedam)
November 10, 2023, 10:26am
7
It also matters what the HTTP cache headers are. Can you run curl with -D -
, and check that it does not set no-cache or a really short TTL on the response headers?
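To interpret the header once you have it, a small parser is enough. This is a rough sketch that only looks at `no-store`, `no-cache` and `max-age`; real HTTP caches consider more than this (validators, `Expires`, and so on), and the function names are made up for illustration.

```python
def parse_cache_control(value):
    """Split a Cache-Control header value into a {directive: argument} dict."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        # Directives without an argument (e.g. no-cache) map to None.
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else None
    return directives


def freshness_seconds(value):
    """Return max-age in seconds, 0 if a stored copy cannot be reused
    without asking the server again, or None if the header does not say."""
    directives = parse_cache_control(value)
    if "no-store" in directives or "no-cache" in directives:
        return 0
    max_age = directives.get("max-age")
    if max_age is not None and max_age.isdigit():
        return int(max_age)
    return None
```

For example, `freshness_seconds("max-age=60")` gives 60, which would count as a really short TTL for an immutable wheel file.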
astrojuanlu
(Juan Luis Cano Rodríguez)
November 10, 2023, 1:12pm
8
Paul Moore:
Can I ask, why is this an issue anyway? A download of a 333Kb file doesn’t seem like it’s going to be a major issue. And if it is, then that’s more of a network or index server performance problem, surely?
Multiply this by all packages, all their dependencies, over and over again…
The performance hit is noticeable, especially since pip cannot do concurrent downloads.
Thanks, this is what I see:
HTTP/1.1 200
Date: Fri, 10 Nov 2023 13:09:27 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive
X-JFrog-Version: Artifactory/7.72.1 77201900
X-Artifactory-Id: 91c75d1c4d86af677162467b517822e47ce9a914
X-Artifactory-Node-Id: subdomain-artifactory-primary-2
Cache-Control: max-age=60
Strict-Transport-Security: max-age=31536000; includeSubDomains
X-Request-ID: 3033c3c6dc6862362954c9eebc8044b4
We suspect this has to do with how JFrog (or our installation) manages cloud storage (see the JFrog Help Center). Still investigating, thanks for the help.
pradyunsg
(Pradyun Gedam)
November 10, 2023, 1:13pm
9
This means that a cached response is treated as fresh for only 60 seconds.
astrojuanlu
(Juan Luis Cano Rodríguez)
November 10, 2023, 1:22pm
10
It must be something else, because even if I run a pip install
three times within 60 seconds, it still doesn’t cache anything:
❯ date && pip install --force-reinstall numpy --no-deps (kedro310)
Fri Nov 10 14:20:21 CET 2023
Looking in indexes: https://user:***@subdomain.jfrog.io/artifactory/api/pypi/python/simple
Collecting numpy
Downloading https://subdomain.jfrog.io/artifactory/api/pypi/python/packages/packages/e3/63/fd76159cb76c682171e3bf50ed0ee8704103035a9347684a2ec0914b84a1/numpy-1.26.1-cp310-cp310-macosx_11_0_arm64.whl (14.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.0/14.0 MB 4.5 MB/s eta 0:00:00
Installing collected packages: numpy
Successfully installed numpy-1.26.1
~ ···································································································································· 20s 14:20:4
❯ date && pip install --force-reinstall numpy --no-deps (kedro310)
Fri Nov 10 14:20:42 CET 2023
Looking in indexes: https://user:***@subdomain.jfrog.io/artifactory/api/pypi/python/simple
Collecting numpy
Downloading https://subdomain.jfrog.io/artifactory/api/pypi/python/packages/packages/e3/63/fd76159cb76c682171e3bf50ed0ee8704103035a9347684a2ec0914b84a1/numpy-1.26.1-cp310-cp310-macosx_11_0_arm64.whl (14.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.0/14.0 MB 8.2 MB/s eta 0:00:00
Installing collected packages: numpy
Attempting uninstall: numpy
Found existing installation: numpy 1.26.1
Uninstalling numpy-1.26.1:
Successfully uninstalled numpy-1.26.1
Successfully installed numpy-1.26.1
~ ···································································································································· 10s 14:20:5
❯ date && pip install --force-reinstall numpy --no-deps (kedro310)
Fri Nov 10 14:20:53 CET 2023
Looking in indexes: https://user:***@subdomain.jfrog.io/artifactory/api/pypi/python/simple
Collecting numpy
Downloading https://subdomain.jfrog.io/artifactory/api/pypi/python/packages/packages/e3/63/fd76159cb76c682171e3bf50ed0ee8704103035a9347684a2ec0914b84a1/numpy-1.26.1-cp310-cp310-macosx_11_0_arm64.whl (14.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.0/14.0 MB 6.6 MB/s eta 0:00:00
Installing collected packages: numpy
Attempting uninstall: numpy
Found existing installation: numpy 1.26.1
Uninstalling numpy-1.26.1:
Successfully uninstalled numpy-1.26.1
Successfully installed numpy-1.26.1
pradyunsg
(Pradyun Gedam)
November 10, 2023, 5:42pm
11
At this point, I guess reach out to JFrog.
This is almost definitely related to the configuration of the index server. If this is a pip bug, then we’d need a self-contained reproducer that pip maintainers can run.
https://subdomain.jfrog.io/artifactory/api/pypi/python/packages/packages/e3/63/fd76159cb76c682171e3bf50ed0ee8704103035a9347684a2ec0914b84a1/numpy-1.26.1-cp310-cp310-macosx_11_0_arm64.whl
It might be worth testing whether this goes directly to the file or through a redirection, potentially to a URL with a short-lived access token in it. In this case, pip’s caching will cache against the URL with the token and the next time you request it it’ll be different.
We use simpleindex
to hide upstream URLs from pip so that caching works on Azure Artifacts. It wouldn’t surprise me if Artifactory was also using redirects and access tokens in a similar way.
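One way to test for a redirect is to make a single request and look at the status code and Location header without following it. The sketch below uses `http.client`, which never follows redirects on its own. `first_hop` is a hypothetical helper; it assumes the server answers HEAD the same way it answers GET (pass `method="GET"` if not), and it turns credentials embedded in the URL into a Basic auth header.

```python
import base64
import http.client
from urllib.parse import unquote, urlsplit


def first_hop(url, method="HEAD"):
    """Return (status, Location header) of the first response, without following redirects."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=30)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=30)

    headers = {}
    if parts.username:
        # user:pass@host in the URL becomes an Authorization header.
        creds = f"{unquote(parts.username)}:{unquote(parts.password or '')}"
        headers["Authorization"] = "Basic " + base64.b64encode(creds.encode()).decode()

    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    try:
        conn.request(method, path, headers=headers)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

A 200 means the file is served directly. A 3xx with a Location that changes between calls (for example because of a signed query string) would match the token scenario described above.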
astrojuanlu
(Juan Luis Cano Rodríguez)
November 15, 2023, 1:18pm
13
@steve.dower nailed it: JFrog is performing an HTTP 302 redirect to S3, which is probably confusing pip.
Could you share a bit more of your setup? I’m already familiar with simpleindex
(see Advice to avoid `--extra-index-url` to install private packages from GitLab CI - #8 by astrojuanlu ) so I will follow your advice and give it a try for this problem.
There’s really nothing more to share. A blanket wildcard setting that directs all package requests to a single feed is enough to make pip only see localhost
URLs and cache them properly.
I did write a small extension (currently private, but it’s on my list to publish) for connecting to Azure Artifacts transparently when inside an Azure Pipelines build environment. You may want something similar to make it safer to inject an access token, but it really is just the obvious implementation of a Route (with a bit of tricks to stream through a streaming request, but as the second stage is local that barely matters).
astrojuanlu
(Juan Luis Cano Rodríguez)
November 15, 2023, 1:29pm
15
This is what I tried:
[routes."{project}"]
source = "http"
to = "https://user:pass@subdomain.jfrog.io/artifactory/api/pypi/python/simple/{project}/"
[server]
host = "127.0.0.1"
port = 7990
but then pip install
still sees the JFrog URL:
~ ································································· 14:24:5
❯ date && pip install --force-reinstall numpy --no-deps (kedro310)
Wed Nov 15 14:24:53 CET 2023
Looking in indexes: http://localhost:7990
Collecting numpy
Downloading https://user:pass@subdomain.jfrog.io/artifactory/api/pypi/python/packages/packages/2f/ac/be1f2767b7222347d2fefc18d8d58e9febfd9919190cc6fbd8a4d22d6eab/numpy-1.26.2-cp310-cp310-macosx_11_0_arm64.whl (14.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.0/14.0 MB 24.7 MB/s eta 0:00:00
Installing collected packages: numpy
Attempting uninstall: numpy
Found existing installation: numpy 1.26.1
Uninstalling numpy-1.26.1:
Successfully uninstalled numpy-1.26.1
Successfully installed numpy-1.26.2
~ ··························································· 15s 14:25:0
❯ date && pip install --force-reinstall numpy --no-deps (kedro310)
Wed Nov 15 14:25:11 CET 2023
Looking in indexes: http://localhost:7990
Collecting numpy
Downloading https://user:pass@subdomain.jfrog.io/artifactory/api/pypi/python/packages/packages/2f/ac/be1f2767b7222347d2fefc18d8d58e9febfd9919190cc6fbd8a4d22d6eab/numpy-1.26.2-cp310-cp310-macosx_11_0_arm64.whl (14.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.0/14.0 MB 25.8 MB/s eta 0:00:00
Installing collected packages: numpy
Attempting uninstall: numpy
Found existing installation: numpy 1.26.2
Uninstalling numpy-1.26.2:
Successfully uninstalled numpy-1.26.2
Successfully installed numpy-1.26.2
the logs of simpleindex
say
INFO: 127.0.0.1:53829 - "GET /numpy/ HTTP/1.1" 302 Found
INFO: 127.0.0.1:53837 - "GET /numpy/ HTTP/1.1" 302 Found
I might have to look into implementing a custom Route
then.
Oh, it looks like we do rewrite the page to use relative links (the for loop in the middle of the function). You’ll probably need to do the same, otherwise the subsequent request will bypass simpleindex.
async def get_page(self, params):
    try:
        index_url = await self._get_package_url(params)
        page = await self.__client.get(index_url, auth=self._get_auth())
        page.raise_for_status()
        doc = html5lib.parse(await page.aread())
    except httpx.HTTPStatusError as ex:
        if ex.response.status_code != 404:
            log(ex, tb=True)
        return Response(
            status_code=ex.response.status_code,
            content=await ex.response.aread(),
            media_type=ex.response.headers.get("Content-Type", "text/plain"),
        )
    for link in doc.findall("*/{http://www.w3.org/1999/xhtml}a"):
        bits = urlsplit(link.get("href"))
        filename = bits.path.rpartition("/")[-1]
        self.__filenames[filename] = link.get("href")
        new_href = f"./{filename}#{bits.fragment}"
        link.set("href", new_href)
    return Response(
        status_code=page.status_code,
        content=html5lib.serialize(doc, encoding="utf-8"),
        media_type="text/html",
        headers={"Cache-Control": CACHE_CONTROL},
    )
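The link-rewriting step can be tried on its own, outside of a Route. Below is a simplified, standalone sketch of the same idea using a regular expression instead of html5lib; it assumes double-quoted href attributes and is not the code from the private extension. `rewrite_links` is a hypothetical name.

```python
import re
from urllib.parse import urlsplit

# Assumption: hrefs are double-quoted, as in typical simple index pages.
_HREF = re.compile(r'href="([^"]*)"')


def rewrite_links(html):
    """Rewrite every href to ./<filename>#<fragment>.

    Returns (new_html, originals), where originals maps each filename to the
    upstream href so a later file request can be forwarded to the right place.
    """
    originals = {}

    def _replace(match):
        href = match.group(1)
        bits = urlsplit(href)
        filename = bits.path.rpartition("/")[-1]
        originals[filename] = href
        new_href = f"./{filename}"
        if bits.fragment:
            new_href += f"#{bits.fragment}"  # keep the hash for pip
        return f'href="{new_href}"'

    return _HREF.sub(_replace, html), originals
```

With links rewritten this way, pip requests the files from the same host that served the index page, so the proxy gets a chance to handle the download as well.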
astrojuanlu
(Juan Luis Cano Rodríguez)
November 19, 2023, 12:34pm
17
After a bit of digging I’m almost there: I had to implement the get_file
method from the Route
and forward the response as explained in Async Support - HTTPX.
However, when I run pip install ...
now, I don’t see the Looking up "..." in the cache
message in the verbose logs; it goes directly to fetching the package:
Given no hashes to check 146 links for project 'numpy': discarding no candidates
Collecting numpy
Created temporary directory: /private/var/folders/r7/ywj0_kvj0mxfkdkx0jrgh73r0000gn/T/pip-unpack-2t5rq1jm
Found index url http://localhost:7990/
http://localhost:7990 "GET /numpy/numpy-1.24.4-cp38-cp38-macosx_11_0_arm64.whl HTTP/1.1" 200 None
...
How does pip decide whether to use CacheController.cached_request
or not?
pradyunsg
(Pradyun Gedam)
November 19, 2023, 1:17pm
18
Can you run pip
with --log out.log
and post the logs in a GitHub Gist or something similar, for one of us to look at?
astrojuanlu
(Juan Luis Cano Rodríguez)
November 19, 2023, 1:27pm
19
astrojuanlu
(Juan Luis Cano Rodríguez)
November 19, 2023, 1:44pm
20
Oh I had it in front of my eyes:
# We want to _only_ cache responses on securely fetched origins or when
# the host is specified as trusted. We do this because
# we can't validate the response of an insecurely/untrusted fetched
# origin, and we don't want someone to be able to poison the cache and
# require manual eviction from the cache to fix it.
if cache:
    secure_adapter = CacheControlAdapter(
        cache=SafeFileCache(cache),
        max_retries=retries,
        ssl_context=ssl_context,
    )
    self._trusted_host_adapter = InsecureCacheControlAdapter(
        cache=SafeFileCache(cache),
        max_retries=retries,
    )
else:
    secure_adapter = HTTPAdapter(max_retries=retries, ssl_context=ssl_context)
    self._trusted_host_adapter = insecure_adapter

self.mount("https://", secure_adapter)
self.mount("http://", insecure_adapter)
pip does not cache plain http:// requests unless the host is specified as trusted.
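To illustrate why the local proxy falls through to the non-caching adapter: as far as I know, a requests-style session picks the adapter whose mounted prefix is the longest match for the URL. The sketch below models only that lookup with plain strings; it is not pip's code, and the adapter labels and the trusted-host prefix are assumptions for illustration.

```python
def pick_adapter(url, mounts):
    """Return the value mounted at the longest prefix that matches url."""
    for prefix in sorted(mounts, key=len, reverse=True):
        if url.lower().startswith(prefix.lower()):
            return mounts[prefix]
    raise ValueError(f"no adapter mounted for {url!r}")


# Mirrors the two mount() calls in the excerpt above.
mounts = {
    "https://": "caching adapter",
    "http://": "non-caching adapter",
}

proxy_url = "http://localhost:7990/numpy/"
before = pick_adapter(proxy_url, mounts)

# Assumption: marking the host as trusted mounts a caching adapter at a
# more specific prefix, which then wins the longest-prefix lookup.
mounts["http://localhost:7990/"] = "caching adapter (trusted host)"
after = pick_adapter(proxy_url, mounts)
```

If that model holds, passing `--trusted-host localhost:7990` to pip (or serving the proxy over https) would be the things to try next.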