I work on spaCy, which provides a number of trained machine learning pipelines as wheels hosted on Github using their releases feature. We have a wrapper so that you can run
spacy download xxx from the command line, which runs
pip install https://.../xxx.whl with the appropriate URL. The issue I have noticed is that running this command repeatedly will unpredictably use a cache or download the model again. You can test this with this command:
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl
I was under the impression that because this is a wheel the filename should be used for caching, or, failing that, syntax like
https://...#egg=package could be used. But none of those seem to change behavior, and the installed package never shows up in
pip cache list (I assume because it is not a locally built wheel).
I assume that what is happening here is that caching is just based on the URL, and the issue is that the Github URL uses redirects with tokens that change pretty frequently, like an AWS URL. If this post-redirect URL is used that would explain the cache being used sometimes but not always.
Is my understanding that the cache is based on post-redirect URLs correct? Is there something we can do to ensure wheels are cached reliably based on name and version?
I believe this thread is related, but there is a lot going on and most issues seem to be about the opposite problem of packages not being reinstalled.