Want-Digest support for warehouse

There should be a consistent, standard way to extract the hashes of packages from an index.

pip-tools and hashin use the JSON API to extract SHA-256 hashes, but it does not support requesting hashes of other types, e.g. SHA-3.
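
For reference, this is roughly what those tools do today; the JSON response only carries the digests PyPI computed at upload time, which is exactly the limitation (the project name and version here are just examples):

```python
import json
from urllib.request import urlopen

def sha256_digests(project: str, version: str) -> dict[str, str]:
    """Return {filename: sha256} for one release, via PyPI's per-version JSON endpoint."""
    with urlopen(f"https://pypi.org/pypi/{project}/{version}/json") as resp:
        data = json.load(resp)
    return {f["filename"]: f["digests"]["sha256"] for f in data["urls"]}

print(sha256_digests("requests", "2.31.0"))
```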

Currently, including SHA-256 hashes is a "should" for the simple index API.

I think it would be good to add a MUST for supporting Want-Digest, and a MUST for supporting the Python 3 guaranteed hashlib algorithms: {'sha3_512', 'sha384', 'sha3_224', 'sha224', 'sha3_384', 'sha256', 'blake2b', 'sha3_256', 'shake_256', 'shake_128', 'blake2s', 'md5', 'sha1', 'sha512'}
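
For context, that set is just hashlib.algorithms_guaranteed, and precomputing all of them for a file is straightforward. A rough sketch (the 32-byte SHAKE output length below is an arbitrary choice on my part):

```python
import hashlib

def all_guaranteed_digests(path: str) -> dict[str, str]:
    """Hex digests of one file for every algorithm hashlib guarantees."""
    with open(path, "rb") as f:
        data = f.read()
    digests = {}
    for name in sorted(hashlib.algorithms_guaranteed):
        h = hashlib.new(name, data)
        # SHAKE is variable-length, so an output length has to be chosen
        digests[name] = h.hexdigest(32) if name.startswith("shake_") else h.hexdigest()
    return digests
```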

Why?

PyPI already provides MD5 and SHA-256. Aside from eventually growing support for a more modern hash algorithm, what’s the perceived value in adding support for all of these, many of which are older or prone to attack?

Perhaps the MUST list of hashes could be reduced, but making the Want-Digest protocol mandatory is the more important part for standardizing how hash information is requested.

From a practical standpoint, it would be non-trivial to support Want-Digest in PyPI. Serving a file goes directly from our CDN to S3 or GCS, so we don't have any web server that we control in the path of serving files.

Off the top of my head I can only think of two ways for us to implement this:

  • Add a proxy server, and change our infrastructure to go from CDN -> S3/GCS to CDN -> Proxy -> S3/GCS, and in this proxy server we basically just implement support for Want-Digest (see the sketch after this list).
  • I think both S3 and GCS support adding static metadata to a file that gets surfaced as a response header, so we could maybe do something like add a header per hash we support, and implement Want-Digest in VCL… maybe? I’m not 100% sure it’s feasible.
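
For what it's worth, the proxy option would look roughly like a small piece of WSGI middleware sitting in front of the file storage. This is only a sketch, not anything Warehouse-shaped: the digest_lookup callable and the algorithm tokens are assumptions, and the real token names and base64 encoding rules come from RFC 3230 and its successors.

```python
import base64

def parse_want_digest(header: str) -> list[str]:
    """Return the requested digest-algorithm tokens, highest q-value first."""
    prefs = []
    for item in header.split(","):
        if not item.strip():
            continue
        parts = item.strip().split(";")
        algo = parts[0].strip().lower()
        q = 1.0
        for p in parts[1:]:
            p = p.strip()
            if p.startswith("q="):
                q = float(p[2:])
        if algo and q > 0:
            prefs.append((q, algo))
    return [algo for _, algo in sorted(prefs, key=lambda t: t[0], reverse=True)]

class WantDigestMiddleware:
    """Answer Want-Digest by attaching a Digest header built from precomputed digests."""

    def __init__(self, app, digest_lookup):
        self.app = app
        # digest_lookup(path) -> {"sha-256": raw_bytes, "md5": raw_bytes, ...} or None
        self.digest_lookup = digest_lookup

    def __call__(self, environ, start_response):
        wanted = parse_want_digest(environ.get("HTTP_WANT_DIGEST", ""))
        known = self.digest_lookup(environ.get("PATH_INFO", "")) or {}

        def _start_response(status, headers, exc_info=None):
            values = [
                f"{algo}={base64.b64encode(known[algo]).decode()}"
                for algo in wanted
                if algo in known
            ]
            if values:
                headers = list(headers) + [("Digest", ", ".join(values))]
            return start_response(status, headers, exc_info)

        return self.app(environ, _start_response)
```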

Outside of PyPI itself, bandersnatch mirrors don't run any custom server code: bandersnatch is just a cron job that drops files on disk, and it expects you to have a web server configured to serve that directory. This would imply that your web server has to support Want-Digest, or that mirrors now need a specialized server process. Are there any general-purpose web servers that support Want-Digest?

A more foundational question is whether Want-Digest is the right tool for the job, particularly as it pertains to mirrors and general-purpose web servers (assuming any support it). Right now the hashes in /simple/foo/ act as an integrity check against anything done to the file; they effectively treat the file storage as untrusted, so disk corruption, S3/GCS modification, etc. that change the file hash will cause installers to fail, because unless someone also modifies the /simple/foo/ page the hashes won't align. But with a general-purpose file server, or the VCL option above, we start trusting the file storage, and modifications to it will be reflected naturally in the Want-Digest response. That isn't wrong exactly, but it's an open question whether it's the behavior we would want.
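
For concreteness, the existing integrity check works roughly like this: each file link on /simple/foo/ carries a #sha256=<hex> fragment (per PEP 503), and the installer verifies the downloaded bytes against it, so the file storage itself never has to be trusted:

```python
import hashlib
from urllib.parse import urlsplit

def verify_simple_link(url: str, file_bytes: bytes) -> bool:
    """Check downloaded bytes against the '#sha256=<hex>' fragment on a /simple/ link."""
    algo, _, expected = urlsplit(url).fragment.partition("=")
    return hashlib.new(algo, file_bytes).hexdigest() == expected
```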

Basically, I think supporting Want-Digest should probably be something that goes through the PEP process, both because of the open questions and because it’s modifying the public interface of something that was defined in the PEP process. Hopefully as part of that we would nail down specifics and figure out if it’s something we’d want to support or not.

For whatever it’s worth, my current gut instinct would be -0, but I’m happy to listen to arguments that it should happen.

I agree with Donald and Dustin here. What’s the point of additional hashing algorithms? Are you worried that SHA2-256 will be broken soon or do you have compliance requirements to use other hashes?

The list of hashing algorithms is overkill. The SHAKEs don't even make sense to include. At most it might make sense to include SHA3-384 in addition to SHA2-256. This would introduce a new type of hashing algorithm (sponge instead of Merkle-Damgård) and a stronger hash with more bits. I consider SHA3-512 overkill, too.

While BLAKE2 is a fantastic algorithm, it's not endorsed as an official standard by NIST or another governmental body. It might make more sense to include the SM3 or GOST R 34.11-94 hash instead, to cover the Chinese and Russian markets.

You can support Want-Digest statically by always providing the Digest: header with all possible/relevant hashes.
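
Roughly, you would precompute the header value once per file and attach it to every response, ignoring what the client actually asked for. A sketch of building such a value (the exact token set here is just my assumption; the registered tokens and base64 encoding come from RFC 3230):

```python
import base64
import hashlib

def static_digest_header(file_bytes: bytes) -> str:
    """Build a Digest header value carrying several digests at once."""
    tokens = {"md5": "md5", "sha-256": "sha256", "sha-512": "sha512"}
    parts = []
    for token, algo in tokens.items():
        raw = hashlib.new(algo, file_bytes).digest()
        parts.append(f"{token}={base64.b64encode(raw).decode()}")
    return ", ".join(parts)

# e.g. "Digest: md5=..., sha-256=..., sha-512=..."
```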

I picked the https://docs.python.org/3/library/hashlib.html#hashlib.algorithms_guaranteed set mostly arbitrarily

Perhaps a better list would be the hashes that are mandatory for TLS 1.3, e.g. {"sha256"}?

What problem are you trying to solve with additional file hashes? I cannot give you a reasonable answer without understanding your problem space and concerns first.

Also there is no such thing as TLS3. Do you mean TLS 1.3? Depending on your problem it may make sense to have a list of hashes that does NOT overlap with hashing algorithms in TLS.

sorry yes, I meant TLS 1.3

Oh is that valid? I didn’t read the RFC for it, I just read the Mozilla page which said:

The sender provides a list of digests which it is prepared to accept, and the server uses one of them

So I assumed it was at most one. If you can just return all of them every time, then Warehouse could implement it easily enough. I still think it would be problematic for mirrors, but I could be wrong there.

I still think it would require a PEP though either way.

Quoting the RFC:

A “Digest” field MAY contain multiple representation-data-digest values. This could be useful for responses expected to reside in caches shared by users with different browsers, for example.

A recipient MAY ignore any or all of the representation-data-digests in a Digest field. This allows the recipient to choose which digest-algorithm(s) to use for validation instead of verifying every received representation-data-digest.

A sender MAY send a representation-data-digest using a digest-algorithm without knowing whether the recipient supports the digest-algorithm, or even knowing that the recipient will ignore it.

“Digest” can be sent in a trailer section. When using incremental digest-algorithms this allows the sender and the receiver to dynamically compute the digest value while streaming the content.

Want-Digest is just an optimization for cases where the server is calculating the hashes on the fly. Want-Digest is probably the wrong term for me to have used here, and Digest is what I’m after.
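
On the client side, the flow I'm imagining is something like the following (illustrative only: PyPI doesn't send a Digest header today, and the sha-256 token follows the RFC 3230-style registrations):

```python
import base64
import hashlib
import urllib.request

def fetch_and_verify(url: str) -> bytes:
    """Download a file, asking for a sha-256 Digest up front, and verify the body."""
    req = urllib.request.Request(url, headers={"Want-Digest": "sha-256;q=1"})
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
        digest_header = resp.headers.get("Digest", "")
    for part in digest_header.split(","):
        algo, _, value = part.strip().partition("=")
        if algo.lower() == "sha-256":
            if hashlib.sha256(body).digest() != base64.b64decode(value):
                raise ValueError("sha-256 digest mismatch")
    return body
```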

@dstufft can you give me a hand with the PEP?