What are the caching rules for wheels installed from URLs?

polm · December 2, 2022, 10:00am

I work on spaCy, which provides a number of trained machine learning pipelines as wheels hosted on GitHub using its releases feature. We have a wrapper so that you can run spacy download xxx from the command line, which runs pip install https://.../xxx.whl with the appropriate URL. The issue I have noticed is that running this command repeatedly will unpredictably either use a cached copy or download the model again. You can test this with this command:

pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl

I was under the impression that because this is a wheel the filename should be used for caching, or, failing that, syntax like package@https://... or https://...#egg=package could be used. But none of those seem to change behavior, and the installed package never shows up in pip cache list (I assume because it is not a locally built wheel).

I assume that what is happening here is that caching is just based on the URL, and the issue is that the GitHub URL uses redirects with tokens that change pretty frequently, like an AWS URL. If this post-redirect URL is used, that would explain the cache being used sometimes but not always.

Is my understanding that the cache is based on post-redirect URLs correct? Is there something we can do to ensure wheels are cached reliably based on name and version?

I believe this thread is related, but there is a lot going on and most issues seem to be about the opposite problem of packages not being reinstalled.

pf_moore · December 2, 2022, 10:23am

Pip caches HTTP responses, by URL, to reduce network traffic. Independently, it caches wheels that it built, to reduce the need for repeated builds.

It sounds like you were hoping that the latter would also apply to downloads, but you’re seeing the former, which is what actually happens.

polm · December 2, 2022, 10:45am

Thanks for the quick response. Just to be clear about the URL-based caching, that’s based on the final URL after redirects, not the URL provided on the command line?

pf_moore · December 2, 2022, 10:59am

Honestly, I’m not sure. We use CacheControl, and I haven’t checked the details. It’s not documented/guaranteed what we cache, but if it matters to you, you can check the code.

steve.dower · December 2, 2022, 1:01pm

Yes, it’s based on the final file URL, and includes query parameters (?...) but not fragments (#...).
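To make that concrete, here is a minimal sketch of a key that keeps the query string and drops the fragment. It is not pip's or CacheControl's actual code; the cache_key helper and the example.invalid URLs are made up for illustration.

```python
from urllib.parse import urldefrag


def cache_key(final_url: str) -> str:
    """Hypothetical key: the post-redirect URL with any #fragment removed.

    The query string (?...) is kept, so a changing token changes the key.
    """
    return urldefrag(final_url).url


# Two downloads of the same wheel, each redirected to a URL with a fresh token.
first = cache_key("https://objects.example.invalid/pkg-1.0-py3-none-any.whl?token=abc")
second = cache_key("https://objects.example.invalid/pkg-1.0-py3-none-any.whl?token=xyz")

# The same URL with a fragment appended, as in the #egg=package form.
with_fragment = cache_key(
    "https://objects.example.invalid/pkg-1.0-py3-none-any.whl?token=abc#egg=pkg"
)

print(first == second)         # different tokens give different keys
print(first == with_fragment)  # the fragment does not affect the key
```

Under these assumptions, a rotated token looks like a brand new resource to the cache, while adding #egg=package makes no difference either way.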

This thread is directly related. The best workaround I’ve found is to use https://github.com/uranusjr/simpleindex/ so that pip thinks it’s accessing http://localhost URLs and will cache correctly.

polm · December 6, 2022, 11:13am

Thanks for confirming it’s based on the final URL!

Also thanks for the tip on simpleindex. I’m not sure that’ll help with the situation we have, but it does give us some other approaches to consider at least.
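One approach a wrapper could take, sketched here only as an assumption and not something pip or spaCy is confirmed to do, is to keep its own download directory keyed by the wheel filename and hand pip a local path. The function names and the cache directory are hypothetical.

```python
import posixpath
import subprocess
import sys
import urllib.request
from pathlib import Path
from urllib.parse import urlsplit


def wheel_filename(url: str) -> str:
    """Take the filename from the URL path, ignoring any query or fragment."""
    name = posixpath.basename(urlsplit(url).path)
    if not name.endswith(".whl"):
        raise ValueError(f"not a wheel URL: {url}")
    return name


def ensure_wheel(url: str, cache_dir, fetch=urllib.request.urlretrieve) -> Path:
    """Download the wheel only if no file with the same name is cached yet."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / wheel_filename(url)
    if not target.exists():
        # Download to a temporary name first so a partial file is never reused.
        partial = target.with_name(target.name + ".part")
        fetch(url, str(partial))
        partial.replace(target)
    return target


def install_wheel(path: Path) -> None:
    """Install a local wheel file with the pip of the current interpreter."""
    subprocess.run([sys.executable, "-m", "pip", "install", str(path)], check=True)
```

A wrapper would call install_wheel(ensure_wheel(url, some_cache_dir)). Because the filename already encodes name and version, a changed redirect token no longer forces a new download. The trade-off is that a wheel re-uploaded under the same filename would not be noticed.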
