Draft PEP: Recording the source hash of installed distribution

this seems very pip-specific. Tools like pipenv and poetry also have hash verification.

I’m not a 100% sure about the specific wording in the PEP draft, but this is basically aimed at storing information about the original hash of the file in the “installation database” that we have. Both pipenv and poetry have access to these hashes (in their lock files and during installation) and can be augumented to store this in the .dist-info directory if needed – I think they both use pip under-the-hood, so this might actually “just work”?

My concern is that it might “just work” by pip recalculating a hash that they already maintain. And yes, this is again a matter of “is it costly enough to matter?”

More broadly, this feels more like “private pip data” being added to the database so that pip install can communicate with pip freeze. Or maybe a standard that’s only needed by pip freeze, and other tools have their own ways of doing this.

I’d be fine with defining a “tool specific data” area in the .dist-info directory that pip could use to stash this data if that’s all that’s really needed here. But if other tools are expected to save the data, I’d feel a lot better knowing that it was useful to more than just pip freeze

The sdist format is not defined, there is no guarantee the build process is deterministic. The build system could touch the files when creating the tarball, which would result in different timestamps and then a different hash, for example.

Could you explain how this would be more secure? The goal of adding hashes to requirements.txt is to verify the origin, you want to verify if the file you downloaded is the one the developers intended when they wrote requirements.txt, correct? If you store the hash file inside the wheel, this has no value. An attacker can just replace it with a malicious wheel and keep the correct hash metadata. Am I missing something here?

This proposal does successfully secure installations from sdists, but isn’t installing from wheels the most common use case? Shouldn’t we be thinking of a solution that covers both?

About the sdist format, I think that the process is nondeterministic, but the tarball that is published is constant and have constant hash.

This process is more secure because, unless the case specified in which the environment we want to copy is comprised, it does not matter if we place hashes or not, because the attacker has complete control. However, if we pinned the hashes correctly, and the attacker has control of the remote PyPI, he can not alter the installation as it will fail.

This proposal also secure installation from wheels, as we calculate hash from the received distribution.
The only problem we have is with editable/VCS source, as it is not received as a single compressed file.

About the file being optional, we need to discuss this, as I think I should add that tools SHOULD abort freeze if some of the hashes are not found, as there is no purpose for half pinned requirements.

I do not think there is an overhead for computing the hash, this only needs to happen once per distribution and it’s dependencies. I can try to benchmark the difference, if there is any. Calculating one hash per requirement is not very heavy, and because of that I think this should always be calculated and not in “opt in”

Note that pip’s hash mode already enforces every package in the requirements.txt must have a hash, so freezing an environment containing non-hashable packages would produce a non-installable requirements.txt anyway.

This would make the most sense to me:

  1. An installer SHOULD create a HASH file.
  2. A tool providing the freezing-with-hash functionality SHOULD abort if any of the installed distributions in the current environment does not contain a HASH file. If the tool chooses not to abort, it MUST display a warning message that the resulting requirements.txt is not installable.

You can’t really say that in a PEP though, without defining what “freeze” is. This is where I think this PEP is too closely tied to pip, in current terms.

Things that have no standard meaning at the moment:

  • Freezing
  • Requirements files
  • Hash mode

I could, for example, write a script that introspected my site-packages, read the HASH files, and wrote a file that included the names of everything and a hash for pip. Is that script allowed by the PEP? (Hint: It should be, because you don’t know what I want and can’t mandate that I follow any rules). If it is, why? Because it’s not a “tool”? Because the operation isn’t a “freeze”? Because the hashes were found but I chose not to write them?

I know this feels like nit-picking (and it is!) but insufficiently precise standards can be a real problem for implementors.

I’d suggest that you strip back the scope of this PEP and concentrate solely on something that:

  • Allows (but doesn’t require) the existence of a HASH file in .dist-info.
  • States what it will contain, if it exists.

Leave handling of cases where it doesn’t exist, and deciding whether to write it or not, to the individual tools (pip, other installers, etc). That way you don’t have to think about questions like those I raise above.

Some further thoughts:

  1. It’s not actually clear to me whether PEP 376 allows arbitrary files in .dist-info (see what I said above about unclear standards :slightly_smiling_face:). If arbitrary files are allowed, pip could just use an implementation-defined HASH file. But that risks clashes with other tools - having a namespacing mechanism for tool-specific files would be better (as would clarifying the intent of PEP 376!)
  2. This PEP suggests recording the hash of the distribution source (where the source is a single file) but it doesn’t record what that file was, or where it came from. pip freeze might not need this information, but other tools might. Has this been considered? Maybe at least the source filename (if not the actual location) would be useful?
  3. We’re getting very much into the area of lock files here (after all, requirement files with hashes are basically a form of lock file) so this discussion should probably be taken into account.

I see, it makes sense to step back and let the installers use this information at will, allowing the existence, or ever recommending it sounds like a good idea.

About the further thoughts:

  1. If we add a new file, we might as well standardize it. Leaving random files in the .dist-info seems like a worse option
  2. Not sure if we should make speculation such as this, but I can probably offer to transform this file into a JSON, which makes it easily changeable.
  3. I will take a look

Thanks so much for the helpful feedback!

My point here is that if the file is standardised, it’s very much not easily changeable. That’s not a technical issue, it’s a problem with backward compatibility and process. If we defer questions like “do we need other data”, then when that question comes up later, we need another round of standardisation, and we have to consider versioning the file, as there will be data “in the wild” using version 1 of the spec.

I’m suggesting that we broaden the scope if the spec now, so that we (a) avoid that problem to the extent that we can, and (b) save people’s time by just having one discussion.

Anyhow, rather than monopolise this conversation, I’ll step back and let other tool maintainers comment further.

Published wheels are also constant and should have a constant hash.

What do you mean here? If you pin a hash in requirements.txt, pip can download the file and verify the hash. If an attacker takes control of the environment and tries to replace the file, the hash wouldn’t match and pip would fail.

Correct, for sdists.

I don’t follow.

This means we should calculate the hash and insert it into the resulting built wheel.

This only secures installations from sdists. You fetch the sdist, build a wheel from it and install it. In this case, you are sure the wheel file hasn’t been tampered with because you were the one generating it.

How does this secure installations where we fetch the wheel file from PyPI and install it?

I am not sure why are you creating a difference between sdists and wheels in that case, both of them are published as a constant and have a constant hash.

In both cases, if we managed to install on our machine successfully, we can pin the hashes correctly and be safe, even if the remote is compromised.

Please remember that pinning the hash also helps in case the package was changed without a version change, this will verify that the package stayed the same without modification for better fidelity.

The PEP does not specify the format of hash. I assume that you imply hash digest as lower-case hexadecimal ASCII string (output of hexdigest()). Could you please clarify this?

In my opinion it does not make much sense to suggest SHA384 and SHA512. Internally they both virtually the same SHA2 algorithm. SHA384 is a truncated version of SHA512 with a different start vector. In the past decade or two cryptographers and protocol designers have learned the hard that choices can be a burden. I suggest that your change the list of hashing algorithms to:

  • make SHA2-256 (aka sha256) mandatory.
  • optionally allow SHA2-512 (sha512) as additional hash digest for users that require a stronger hash for compliance reasons
  • optionally allow SHA3-256 and SHA3-512 for the highly unlikely case that SHA2 becomes compromised. SHA3 is a different construct (sponge instead of Merkle-Damgard).

It simplifies implementations if you guaranteed one algorithm with decent security margins. I propose SHA256 because it’s standard and you get it from PyPI for free.

In your PEP proposal, the handling of sdists and wheels is different. I am raising concerns about how it handles wheels.

Can you be more explicit here? Which hashes? You say we store the the distribution (sdist) hashes, how does that work for wheels?

It would be maybe better if you described the process of downloading and verifying a wheel step by step. Something like:

  • Download “tire” file
  • Do voodoo magic
  • Validate tire width
  • Save some information

I went a little creative there to make it clear this is just an example, and not a proposal on how it would work.

This is the preferred algorithms of pip in hash-checking-mode option, I updated the PEP to note that:

This file MUST be formatted as lines of ``hash_algorithm:hash``.
``hash_algorithm`` specifies the hash algorithm used, it is RECOMMENDED that
only hashes which are specified here be used for source distribution hashes.
At time of writing, that list consists of 'sha256', 'sha384', and 'sha512'
as those are the preferred algorithms used by ``pip``'s hash-checking-mode.
``hash`` specifies the hash result of the hash algorithm operation on the
source distribution, represented as lower-case hexadecimal ASCII string.

see the following lines and let me know if it makes more sense:

In any case, we take the single compressed file that we downloaded,
calculate his hash, and place the results in the final HASH file.

I’d be inclined to say that pip should follow the PEP, not the other way around. If there’s a good reason for pip to have chosen those algorithms, it can be used to justify them for this PEP. If there isn’t, the PEP should recommend something better and pip can change to reflect that.`

As far as I am aware, pip always requires the provider (i.e. requirements.txt) to provide both the algorithm and value when checking hashes. The only place I can find pip calls out these algorithms over others is in pip hash, and I read it more like “don’t use SHA1 and SHA224” than “SHA256, SHA384, and SHA512 are equally recommended.”

That is in the PEP and I’ve read it.

The PEP can be used to secure wheels, but the proposed implementation doesn’t do this. To secure wheels, the hash of every wheel file must be present in requirements.txt so that pip can verify the hash of the wheel downloaded. I think I’ve made a bad job of differentiating this two things, sorry.

So, the PEP itself is fine, however the hash distribution/storage mechanism is designed in such a way that makes it difficult to be used. AFAIK there is no optimal solution, but I think it’s worth exploring other options.

I’m preparing to convert PEP 376 to a proper PyPA specification document (Edit: it is now PEP 627). If you end up wanting to change/convert PEP 376, please coordinate with me to avoid conflicts/duplication of work.
Adding a HASH file is out of scope of my effort, but it should be easy to add it to that spec if your PEP is accepted.

As an RPM packager, I would like to add that projects can be installed from other things than wheels/sdists/things pip can handle. Not all environments are pip freeze-able. (The current PEP draft, where HASH is optional, is fine. I’m just adding another point to keep in mind.)

2 Likes

Would it make sense, for the sake of extensibility, to include these hashes in a json file that provides information about the concrete distribution that the installer used.

In the future, this could be extended with more information such as the URL that was downloaded, the index that provided the distribution, etc.

I kind of agree, I think a json format is more extendible and flexible, so I will probably change it in the following days

I currently do not plan on changing PEP 376, but thanks for your feedback!

The idea is to create HASH file in each installation, and then when we look at the environment, we can easily get the hashes. If each time we install we drop a HASH file, every installed distribution will have its hash, and so the hash of every wheel/sdist file will be present in requirements.txt. I hope I explained this in a reasonable way.

Maybe, but this discussion is a big one, and not entirely related only to this PEP draft, I would welcome ideas, but this is not my focus in this PEP