Decent pip caching on GitHub Actions

GitHub Actions has support for preserving the pip cache across builds, but because a GHA cache is not updated on an exact key match, caching the pip cache only works optimally if the repository contains a lock file pinning the exact versions of all build & test dependencies in use. It’s my understanding (I could be wrong) that lock files generally shouldn’t be committed to repositories, and generating a lock file as part of a build requires downloading the relevant files from PyPI before cache restoration can even take place, defeating the point of the GHA cache.

The best way I’ve found so far to manage caching of the pip cache is this suggestion, which has one create a separate GHA cache for each run: the most recent cache is restored at the start of each run, and at the end of the run the updated cache is saved as a new GHA cache. This results in a series of ever-growing (mostly duplicate) caches accumulating until you hit the 5 GB limit, at which point old caches start getting deleted. This could be better.
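
For concreteness, here’s a minimal sketch of that per-run-key approach (the key names are my own, and the path assumes a Linux runner’s default pip cache location):

```yaml
- name: Cache pip downloads
  uses: actions/cache@v3
  with:
    path: ~/.cache/pip
    # github.run_id is unique per run, so this key never gets an exact
    # hit and a fresh cache is saved at the end of every run...
    key: pip-${{ runner.os }}-${{ github.run_id }}
    # ...while this prefix falls back to the most recently created
    # matching cache at the start of the run.
    restore-keys: |
      pip-${{ runner.os }}-
```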

Has anyone figured out a decent setup for caching the pip cache on GitHub Actions?

Lock files should definitely be committed to repositories! That’s the only way of sharing the set of versions you’re saying work.

Here’s an example, which I’m reasonably sure I just copied from some documentation: til/publish.yml at main · Julian/til · GitHub
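
For anyone not clicking through, the heart of that kind of setup is just a cache step keyed on a hash of the committed lock file; a rough sketch of the pattern (requirements.lock is a placeholder name, not necessarily what that workflow uses):

```yaml
- uses: actions/cache@v3
  with:
    path: ~/.cache/pip
    # The lock file pins exact versions, so the hash (and hence the
    # cache) only changes when the locked dependencies change.
    key: pip-${{ runner.os }}-${{ hashFiles('requirements.lock') }}
```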

> Lock files should definitely be committed to repositories! That’s the only way of sharing the set of versions you’re saying work.

Perhaps I should clarify: I’m talking about building Python libraries, with requirements in setup.cfg and tox.ini files, not Pelican sites. When I declare a version range in setup.cfg, I’m claiming that my library works with that entire range, and if there’s a version in that range for which it doesn’t work, I want to hear about it. I honestly don’t understand why one would want to use a lockfile when developing a library.

IMHO there’s a clear distinction you should make first: are we talking about an application or a library? If it’s an application, you should pin all your dependencies, periodically upgrade them, and commit your lock file. If it’s a library, you might want to get notified when one of your dependencies breaks you, in which case you probably shouldn’t create/commit a lock file. In that case you pay with no cache on GitHub, but perhaps getting notified right away when a dependency breaks you is worth the trade-off. As usual, choose the right solution for the right case; there’s no universal truth.

PS. One day GitHub will allow updating the cache, but until then there’s no great solution in sight, just workarounds.

Certainly agree: if you’re talking about a library, not an app, you generally wouldn’t have a lockfile.

I do think, if you’d like, you can try having a cache anyhow and stash some locally built versions of your dependencies in it (keyed on a hash of your setup.cfg). It won’t fully save you when newer versions are available, but it will likely save some downloading for any package where that isn’t the case (pip should hit PyPI, see that the version is up to date, and not download it again because it finds the artifact locally). I haven’t bothered with that personally; for libraries I maintain, downloading dependencies isn’t a terribly long part of their CI runs.
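
Something along these lines, as a rough sketch (the wheelhouse directory name is hypothetical):

```yaml
- uses: actions/cache@v3
  with:
    path: wheelhouse
    # Keyed on setup.cfg, so the cache is only re-saved when the declared
    # requirements change, not when new versions are released.
    key: wheels-${{ runner.os }}-${{ hashFiles('setup.cfg') }}

- name: Build wheels for the project and its dependencies
  run: pip wheel --wheel-dir wheelhouse .

- name: Install, letting pip use local artifacts where they're current
  run: pip install --find-links wheelhouse .
```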

I was looking into this a while back. Remember that not all requirements need to be cached: you can still speed up installation time by using a static cache key.
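
I.e., roughly this (key name arbitrary): a cache that’s saved once and then reused as-is, which still spares re-downloading anything already in it even though it’s never updated:

```yaml
- uses: actions/cache@v3
  with:
    path: ~/.cache/pip
    # A fixed key: saved on the first run, exact hit (and therefore no
    # re-save) on every run after that. Stale, but better than nothing.
    key: pip-${{ runner.os }}-static
```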

Not quite. Certainly there’s an argument that lock files are overkill for smaller projects, but they do help with ensuring a consistent environment for development and testing, and for application usage. Really, the only time I wouldn’t use them is for end-user usage of libraries.

I’m not sure that’s true. From my understanding, the key for the cache in GHA simply controls whether that cache is reused elsewhere. There’s still a post-run action for the cache action to update the cache if any changes are made. So if you’re caching your pip cache directory, it will still have the desired effect of caching what was downloaded from PyPI, even if your requirements.txt file doesn’t change while what gets pulled in does.

> There’s still a post-run action for the cache action to update the cache if any changes are made.

Not quite. The cache is not updated if there’s an exact key hit; it’s only saved if the cache action falls back to using one of the restore keys. Thus, if your cache key is based solely on a hash of your (non-pinned) requirements.txt, and your requirements.txt doesn’t change but a new version of a dependency is released, that new version won’t be added to the GHA cache, and pip will still have to download it on every run.
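
To illustrate, this is the configuration being described (a sketch; the names are mine):

```yaml
- uses: actions/cache@v3
  with:
    path: ~/.cache/pip
    # While requirements.txt is unchanged this key hits exactly on every
    # run, the post-run save is skipped, and newly released versions of
    # unpinned dependencies never make it into the cache.
    key: pip-${{ runner.os }}-${{ hashFiles('requirements.txt') }}
```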
