What are the caching rules for wheels installed from URLs?

I work on spaCy, which provides a number of trained machine learning pipelines as wheels hosted on GitHub using their releases feature. We have a wrapper so that you can run spacy download xxx from the command line, which runs pip install https://.../xxx.whl with the appropriate URL. The issue I have noticed is that running this command repeatedly will sometimes use a cache and sometimes download the model again, with no obvious pattern. You can test this with this command:

pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl

I was under the impression that because this is a wheel the filename should be used for caching, or, failing that, syntax like package@https://... or https://...#egg=package could be used. But none of those seems to change the behavior, and the installed package never shows up in pip cache list (I assume because it is not a locally built wheel).

I assume that what is happening here is that caching is just based on the URL, and the issue is that the GitHub URL uses redirects with tokens that change pretty frequently, like an AWS URL. If this post-redirect URL is used, that would explain the cache being used sometimes but not always.

Is my understanding that the cache is based on post-redirect URLs correct? Is there something we can do to ensure wheels are cached reliably based on name and version?

I believe this thread is related, but there is a lot going on and most issues seem to be about the opposite problem of packages not being reinstalled.

Pip caches HTTP responses, keyed by URL, to reduce network traffic. Independently, it caches wheels that it built, to reduce the need for repeated builds.

It sounds like you were hoping that the latter would also apply to downloads, but you’re seeing the former, which is what actually happens.

Thanks for the quick response. Just to be clear about the URL-based caching: is that based on the final URL after redirects, not the URL provided on the command line?

Honestly, I’m not sure. We use CacheControl, and I haven’t checked the details. It’s not documented/guaranteed what we cache, but if it matters to you, you can check the code.

Yes, it’s based on the final file URL, and includes query parameters (?...) but not fragments (#...).
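To make that concrete, here is a minimal sketch of the behavior described above. It is not pip's or CacheControl's actual code; cache_key is a hypothetical helper and the example.invalid URLs are made up. It only shows why a changing token in the query string produces a new cache entry while a changing fragment does not.

```python
from urllib.parse import urldefrag


def cache_key(final_url: str) -> str:
    """Hypothetical helper: keep the query string, drop the fragment."""
    url, _fragment = urldefrag(final_url)
    return url


# Two post-redirect URLs for the same wheel, differing only in a token.
a = cache_key("https://example.invalid/pkg-1.0-py3-none-any.whl?token=abc")
b = cache_key("https://example.invalid/pkg-1.0-py3-none-any.whl?token=xyz")

# Same URL, with and without an #egg= fragment.
c = cache_key("https://example.invalid/pkg-1.0-py3-none-any.whl#egg=pkg")
d = cache_key("https://example.invalid/pkg-1.0-py3-none-any.whl")

print(a == b)  # different tokens give different keys
print(c == d)  # fragments do not affect the key
```

So if the redirect target carries a short-lived token, each new token looks like a new resource to the cache, and adding #egg=package cannot help.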

This thread is directly related. The best workaround I’ve found is to use https://github.com/uranusjr/simpleindex/ so that pip thinks it’s accessing http://localhost URLs and will cache correctly.

Thanks for confirming it’s based on the final URL!

Also thanks for the tip on simpleindex. I’m not sure that’ll help with the situation we have, but it does give us some other approaches to consider at least.
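One such approach, sketched here purely as an illustration, is for the wrapper to keep its own cache keyed on the wheel filename and hand pip a local file. This is not a pip feature: ensure_wheel, wheel_filename, install, and the cache directory are all hypothetical names, and the sketch assumes the filename alone identifies name and version.

```python
import subprocess
import sys
import urllib.request
from pathlib import Path, PurePosixPath
from urllib.parse import urlsplit


def wheel_filename(url: str) -> str:
    # Take the last path segment; query string and fragment are ignored.
    return PurePosixPath(urlsplit(url).path).name


def ensure_wheel(url, cache_dir, fetch=urllib.request.urlretrieve):
    """Return a local path to the wheel, downloading only if it is missing."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / wheel_filename(url)
    if not target.exists():
        # Download to a temporary name so an interrupted transfer
        # never leaves a partial file under the final name.
        tmp = cache_dir / (target.name + ".part")
        fetch(url, str(tmp))
        tmp.replace(target)
    return target


def install(url, cache_dir):
    # Install from the local copy using the current interpreter's pip.
    wheel = ensure_wheel(url, cache_dir)
    subprocess.check_call([sys.executable, "-m", "pip", "install", str(wheel)])
```

The trade-off is that the wrapper then owns cache invalidation: a re-uploaded wheel with the same filename would not be fetched again.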