A shareable content-addressable wheel artifact cache

Hello there! :wave: I’m trying to gauge interest: are there other folks with similar problems / use cases, and is it worth writing a full PEP about this, or is there an easy way to achieve it without modding pip (too much / at all)? Feedback appreciated!

Abstract

A shareable, content-addressable wheel artifact cache (as in, the individual files inside the .whl archive, and maybe also the per-major-version .pyc files or even full directories that are ready to be symlinked) can be created such that pip does not need to extract files into the venv and can instead hardlink / symlink them from the artifact cache. That way, even with many different venvs on a single machine sharing packages among them, the disk space for each individual file is paid only once, and each venv’s package contains only hardlinks / symlinks (or is even a single symlink, like .egg-links but for non-editable packages). Furthermore, most package versions change only a few files between releases, so there is already some amount of de-duplication across versions.
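For illustration, here is a minimal sketch of what such a cache could look like (the layout and function names are mine, not a proposed standard): files are stored under their SHA-256 digest, which lines up with the digests wheels already record in their RECORD metadata, and installation becomes a link instead of a copy.

```python
import hashlib
import os
from pathlib import Path

def cache_file(cache_dir: Path, src: Path) -> Path:
    """Store src's contents under their SHA-256 digest; return the entry."""
    digest = hashlib.sha256(src.read_bytes()).hexdigest()
    entry = cache_dir / digest[:2] / digest[2:]  # fan out like git objects
    if not entry.exists():
        entry.parent.mkdir(parents=True, exist_ok=True)
        entry.write_bytes(src.read_bytes())
    return entry

def install_file(cache_dir: Path, src: Path, dest: Path) -> None:
    """Install a file from an unpacked wheel as a symlink into the cache."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    os.symlink(cache_file(cache_dir, src), dest)  # or os.link() on one fs
```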

Motivation

  • When in an environment where there are multiple virtual environments, each venv incurs additional disk space usage and setup time.
  • Distinct users are also unable to share the pip wheel cache, so the same wheels get downloaded repeatedly, which costs both the user (download time) and the PyPI infrastructure (bandwidth). (This is probably only relevant for larger providers or enterprises that control shared disks which can host the content-addressable cache.)

Backwards compatibility

  • If a particular version of a file in a wheel archive is not found in the cache, pip can fall back to installing it normally by extracting it as a regular file.
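A minimal sketch of that fallback path, reusing the hypothetical digest fan-out from the sketch above: link when the cache already has the content, extract normally when it does not (or when the cache is unavailable).

```python
import os
from pathlib import Path

def materialize(data: bytes, digest: str, dest: Path, cache_dir: Path) -> None:
    entry = cache_dir / digest[:2] / digest[2:]
    dest.parent.mkdir(parents=True, exist_ok=True)
    if entry.exists():
        os.symlink(entry, dest)   # cheap: share the cached copy
    else:
        dest.write_bytes(data)    # fall back to a plain extraction
```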

Security Implications

  • Any hash algorithm can have collisions, which could make it possible to replace a file in the content-addressable cache without anybody noticing. Current wheels use SHA-256 hashes to track files, and SHA-256 has no known practical collision attacks, but this might not hold in the future.
  • The content-addressable cache can be shared in a read-only fashion to avoid accidentally overwriting a file that is hosted in the cache.

Alternatives considered

  • Using system packages: we tried this in the past and it did not scale :frowning: Different users needing different package versions made this a very difficult problem.
  • Using other package managers like Conda: we wanted to diverge as little as possible from the “stock” Python experience to avoid forcing folks into certain technologies. Plus, PyPI is already well-maintained, and using another package manager would mean that we would have to maintain that ourselves.

Proof of concept

At Replit, we recently released a version of this idea, and we want to expand on it (hence this post: we want to avoid painting ourselves into a corner if there’s community interest). The cache is mounted read-only into all users’ directories (with a per-user overlayfs on top just to make it appear writable, to further prevent accidents). We went with symlinks because the cache is hosted on a different filesystem, and hardlinks only work within a single filesystem.

I don’t think this calls for a PEP, as even pip’s current caching isn’t standardized. Using links of some sort to avoid copying files around seems like an implementation detail of an installer more than anything.

Thank you for sharing. I have considered building something like this in the past, mostly for the same reasons.

I was not able to find that many technical details on the backend. I understand that you mount the cache into the user directory via overlayfs, but what are you using for the storage backend? Just disk storage?

I wonder if you wouldn’t be able to achieve better performance and perhaps cut down on disk storage if you were to build a custom loader that used something like memcached or redis as a backend, allowing you to have a central cache shared by multiple machines.
This might not make sense for most users, but for a service like Replit it might make a big difference.

That said, it is unclear to me what you would be proposing in a PEP. Your current approach depends on specific OS features that are not available everywhere, and it is something that could be implemented today by installers if they want. For that reason, I think it might be best to leave it as an installer implementation detail.

What I think would be exciting is having a new installer that specifically targets this use case.

This would work for .py and .pyc files, but not for shared extensions on most platforms. On Linux, dlopen requires an actual file on the file system.

(There are ways to hack around the requirement, e.g. create a memfd, then dlopen /proc/self/fd/$fd. Or you can mmap PROT_EXEC and perform ELF shared library magic manually. I wouldn’t recommend it.)
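For the curious, here is a minimal Python sketch of the memfd variant (Linux-only, Python 3.8+; as said above, a hack and not a recommendation):

```python
import ctypes
import os

def dlopen_from_bytes(so_bytes: bytes) -> ctypes.CDLL:
    # Create an anonymous in-memory file and fill it with the library.
    fd = os.memfd_create("shared-extension")
    try:
        os.write(fd, so_bytes)
        # dlopen() wants a path, so hand it the /proc alias of the fd;
        # the resulting mappings keep the memfd alive after we close it.
        return ctypes.CDLL(f"/proc/self/fd/{fd}")
    finally:
        os.close(fd)
```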

Agreed. This sounds like an interesting idea, but I’m not sure I’d want to see it in pip. We’ve already got a ton of complexity, and adding something like this (particularly as we’d have to have a fallback approach for systems which don’t have symlinks/hardlinks) would have to give significant benefit to justify it. Also, there are probably security issues on multi-user systems if you have the cache owned by one user but linked from a different user’s environment. (And one cache per user negates a chunk of the benefits.) I don’t want to expose pip to those sorts of issues, to be honest.

I’d encourage exploring this technique, but I’d imagine it might be something that a new installer, with more freedom to be experimental, would use, rather than pip.

I was not able to find that many technical details on the backend. I understand that you mount the cache into the user directory via overlayfs, but what are you using for the storage backend? Just disk storage?

Yup, it’s a run-of-the-mill read-only persistent disk that can be mounted on multiple devices simultaneously.

That said, it is unclear to me what you would be proposing in a PEP. Your current approach depends on specific OS features that are not available everywhere, and it is something that could be implemented today by installers if they want. For that reason, I think it might be best to leave it as an installer implementation detail.

Mostly to gather consensus. I was also thinking that maybe the directory layout of the cache could be standardized. But so far it seems like the consensus is to leave it as an implementation detail, so the PEP would be moot!

I’d encourage exploring this technique, but I’d imagine it might be something that a new installer, with more freedom to be experimental, would use, rather than pip.

Gotcha! Thanks for the feedback. Does the community have a preferred alternative installer? I’d rather avoid reinventing the wheel once again to add a relatively small thing.

No, currently pip is the main Python installer. installer and distlib are libraries with installation capabilities, but there’s no “full” installer apart from pip[1].

That’s more or less my point, though - unless someone creates a viable alternative to pip, ideas like this don’t really have a good place to live.


  1. I’m not 100% sure what higher-level tools like poetry use behind the scenes; they may use pip, or they may implement their own install logic. ↩︎

I’m not 100% sure what higher-level tools like poetry use behind the scenes; they may use pip, or they may implement their own install logic.

Yup, poetry relies on pip under the covers.

That’s more or less my point, though - unless someone creates a viable alternative to pip, ideas like this don’t really have a good place to live.

Oh! I see. I’ll keep an eye out for proposals. Once again, thanks for the feedback!

As an example, I have 65 virtual environments, in total about 40GB. Whether that 40GB is important enough to minimise is left as an exercise for the reader (I would say I have a much higher than average virtual environment usage). My pip cache is of course 5GB (100MB in wheels).
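(For what it’s worth, a rough way to put a number on the potential savings is to compare total bytes against bytes of unique file contents across the venvs; the ~/venvs location below is an assumption, point it at wherever yours live.)

```python
import hashlib
from pathlib import Path

def dedup_estimate(root: Path) -> tuple[int, int]:
    """Return (total bytes, bytes of unique contents) under root."""
    total, unique = 0, {}
    for f in root.rglob("*"):
        if f.is_file() and not f.is_symlink():
            data = f.read_bytes()
            total += len(data)
            unique.setdefault(hashlib.sha256(data).digest(), len(data))
    return total, sum(unique.values())

total, deduped = dedup_estimate(Path.home() / "venvs")
print(f"{total / 2**30:.1f} GiB total, {deduped / 2**30:.1f} GiB unique")
```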

As an example, I have 65 virtual environments, in total about 40GB. Whether that 40GB is important enough to minimise is left as an exercise for the reader

For personal workstations I’d recommend a file system with built-in deduplication like Btrfs or ZFS instead. You can also enable compression, which is really effective for text files.