A package mirror that recompresses its archives?

A mirror that recompresses packages is a pretty obvious next step from any discussion about package compression, but it has the potential to break a couple of longstanding assumptions. Most importantly, it means serving different bytes than the publisher sent. It could even mean serving different bytes under the same filename, even though the content underneath the compression is the same.

For example, a mirror might recompress the latest releases of the top 512 packages and serve a simple index that lists those recompressed files on their per-package pages (/simple/squished-package/). Every other file of those top 512 packages (older releases), and every other package, would link to or redirect to the ordinary index.

If your client supports the format, point it at this new index. Boom, less bandwidth. How broken would this be?

What’s the guarantee in question? Is it that installing from a local wheel

pip install dist/mypackage-1.0.whl

produces exactly the same files on the filesystem as installing from the index

pip install mypackage==1.0

If we convinced tools to rely on a RECORD hash instead of a wheel hash, it’d be just another mirror.
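To make the RECORD idea concrete, here is a minimal sketch, assuming a tool that pins the SHA-256 of the wheel's RECORD file rather than the SHA-256 of the archive. The names `record_digest` and `build_wheel` are made up for illustration and are not part of pip or any existing tool. Note that reading RECORD requires opening the zip first.

```python
import hashlib
import io
import zipfile


def record_digest(wheel_bytes: bytes) -> str:
    """Return the SHA-256 hex digest of the wheel's *.dist-info/RECORD file."""
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as zf:
        names = [n for n in zf.namelist() if n.endswith(".dist-info/RECORD")]
        if len(names) != 1:
            raise ValueError("expected exactly one RECORD file")
        return hashlib.sha256(zf.read(names[0])).hexdigest()


def build_wheel(files: dict[str, bytes], compression: int) -> bytes:
    """Hypothetical helper: pack files into a zip with the given compression."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=compression) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()
```

Two archives built from the same files with `zipfile.ZIP_STORED` and `zipfile.ZIP_DEFLATED` have different archive hashes but the same `record_digest`, which is the property a recompressing mirror would lean on.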

It certainly sounds like the right way to test this at scale. Plenty of people already use private indexes with modified or rebuilt packages that don’t quite match the “original”.


So far I’m considering changing the version number. dist-info/WHEEL will change. You could compare the original RECORD against all the file hashes if you were so motivated.
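A rough sketch of that comparison, assuming you kept the text of the original RECORD somewhere. RECORD is a CSV of path, `algorithm=urlsafe-base64-digest` (padding stripped), and size, and it lists itself with an empty hash. The function names and the `ignore` parameter are my own invention, there so you can skip files you expect to differ, such as dist-info/WHEEL.

```python
import base64
import csv
import hashlib
import io
import zipfile


def record_hash(data: bytes, algo: str = "sha256") -> str:
    """Format a digest the way RECORD does: algo=urlsafe base64, no padding."""
    digest = hashlib.new(algo, data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{algo}={encoded}"


def mismatched_files(wheel_bytes: bytes, original_record: str, ignore=()) -> list[str]:
    """Return paths in original_record that are missing or changed in the wheel."""
    bad = []
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as zf:
        for row in csv.reader(io.StringIO(original_record)):
            if not row:
                continue
            path, expected = row[0], row[1]
            if not expected or path in ignore:
                continue  # RECORD itself has no hash; ignored paths are skipped
            algo = expected.partition("=")[0]
            try:
                data = zf.read(path)
            except KeyError:
                bad.append(path)  # file listed in RECORD but absent from the wheel
                continue
            if record_hash(data, algo) != expected:
                bad.append(path)
    return bad
```

An empty list means every file named in the original RECORD is present with identical contents, apart from the paths you chose to ignore.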

It would generally work in today’s world, but you’d probably end up breaking the --hash feature in pip unless your mirror was deterministic or cached the compressed wheel. In a post-TUF world you’d have to effectively treat this as an entirely new set of packages rather than as a mirror, but it would otherwise be fine.
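For what “deterministic” might look like, here is a minimal sketch that rewrites a wheel with a fixed timestamp and a fixed deflate level while keeping the original entry order. The `recompress` name is hypothetical, and this uses plain deflate rather than any newer format. Identical output across machines also assumes the same zlib build, which this sketch does not control.

```python
import hashlib
import io
import zipfile

# The earliest timestamp the zip format can represent; any fixed value would do.
FIXED_DATE = (1980, 1, 1, 0, 0, 0)


def recompress(wheel_bytes: bytes, level: int = 9) -> bytes:
    """Rewrite a zip so the output depends only on entry names, order and contents."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as src, zipfile.ZipFile(out, "w") as dst:
        for info in src.infolist():  # keep the original order of entries
            fixed = zipfile.ZipInfo(info.filename, date_time=FIXED_DATE)
            fixed.compress_type = zipfile.ZIP_DEFLATED
            fixed.external_attr = info.external_attr  # preserve file mode bits
            dst.writestr(fixed, src.read(info), compresslevel=level)
    return out.getvalue()


def sha256_hex(data: bytes) -> str:
    """The whole-archive digest, which is what a pinned --hash value covers."""
    return hashlib.sha256(data).hexdigest()
```

If `sha256_hex(recompress(wheel))` is stable, users can pin it with --hash, though it will still differ from the digest of the file the publisher uploaded.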

I think that’s probably a bad idea. It’s not exactly the case, but it feels very much like something that falls in the same category as violating the cryptographic doom principle. More concretely, one big problem with it is that it’s a lot easier to get it wrong when you have to actually start decompressing and extracting the artifact to verify that you got non-malicious bits, versus verifying before you treat the artifact as anything but an opaque set of bytes. I get nervous when I see security features that introduce extra, largely needless ways to subtly void your entire security posture.


Yes, decompressing things is notoriously insecure. The CVE pages for things like unzip and gunzip are quite long (GNU Gzip : Security vulnerabilities, CVEs). So you definitely want to verify before decompressing.

You could use signatures to get around that. We do this with Spack binaries because for any one package there can be many builds, and we don’t want to encode hundreds of sha256s in the package recipes. We sign the outer archive, which is the first check done by the installer, and the metadata contains the hash of the inner archive.
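Here is a toy sketch of that ordering: check the signature over the metadata first, then check the inner archive's hash, and only then decompress. To stay within the standard library it uses HMAC-SHA256 with a shared key as a stand-in for a real public-key signature, so it is not how Spack or any real mirror does it. The envelope layout and function names are invented for illustration.

```python
import hashlib
import hmac
import json


def sign_archive(inner: bytes, key: bytes) -> dict:
    """Produce signed metadata that records the hash of the inner archive."""
    metadata = json.dumps(
        {"inner_sha256": hashlib.sha256(inner).hexdigest()}, sort_keys=True
    )
    tag = hmac.new(key, metadata.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"metadata": metadata, "signature": tag}


def verify_archive(envelope: dict, inner: bytes, key: bytes) -> bool:
    """Check the signature, then the inner hash, without decompressing anything."""
    expected = hmac.new(
        key, envelope["metadata"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(expected, envelope["signature"]):
        return False  # metadata was not produced by the key holder
    claimed = json.loads(envelope["metadata"])["inner_sha256"]
    return hmac.compare_digest(claimed, hashlib.sha256(inner).hexdigest())
```

The inner archive is treated as opaque bytes throughout, so nothing is unpacked until both checks have passed.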

Signatures and keys have their own management overhead, but in this case it seems like it could make sense to trust a mirror and have it sign recompressed archives. Downside is if the mirror is compromised in a way that gets you the signing key, it’ll happily send you bad archives.

Seems reasonable. Even though “check the hashes of individual files in an archive” is a common technique, you have to trust the mirror for other reasons. Better to solve one problem at a time.

The spec for signed JAR files. I assume you are familiar but disagree with this scheme. https://docs.oracle.com/javase/7/docs/technotes/guides/jar/jar.html#The_META-INF_directory

It turns out the wheel build number works for this application, with no special index needed: just a static index of only the packages you have added, and no redirects for every other PyPI package. When recompressing the wheels, append to the existing build number or, as in all 950 wheels I tried, add one to the new wheels. You can then expose them to pip via --extra-index-url; they sort as more preferable and you get the desired result.
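A small sketch of the renaming, assuming the standard wheel filename layout `{distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl`, where the optional build tag starts with a digit. The function names, the default build number "1", and the suffix "r" are my own choices. `build_tag_key` mimics the usual ordering, with no build tag as an empty tuple and otherwise `(number, rest)`; it is not pip's actual code.

```python
import re


def split_wheel_name(filename: str) -> list[str]:
    """Split a wheel filename into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError("unexpected number of components")
    return parts


def add_build_tag(filename: str, build: str = "1", suffix: str = "r") -> str:
    """Add a build tag, or append a suffix to one that is already there."""
    parts = split_wheel_name(filename)
    if len(parts) == 5:
        parts.insert(2, build)  # no build tag yet: add one
    else:
        parts[2] += suffix  # existing build tag: append to it
    return "-".join(parts) + ".whl"


def build_tag_key(filename: str) -> tuple:
    """Sort key for the build tag: () when absent, else (number, rest)."""
    parts = split_wheel_name(filename)
    if len(parts) == 5:
        return ()
    match = re.match(r"(\d+)(.*)", parts[2])
    if match is None:
        raise ValueError("build tag must start with a digit")
    return (int(match.group(1)), match.group(2))
```

For example, `add_build_tag("mypackage-1.0-py3-none-any.whl")` gives `mypackage-1.0-1-py3-none-any.whl`, whose key `(1, "")` compares greater than the `()` of the original. This only renames the file; the Build field in dist-info/WHEEL would need to change to match.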

If it was a public service you would have to hide the re-compressed copy if upstream released a wheel with the exact same filename and build number you chose.

RPM and Debian packages call their packaging-level versions “Release” and “revision”.