Draft PEP: Recording the source hash of installed distribution

NoahGorny · July 5, 2020, 6:19pm

This is a draft PEP proposal that originated from https://github.com/pypa/pip/pull/8519
Remarks will be merged into https://github.com/NoahGorny/peps/blob/hash-wheel-source/pep-9999.rst

This is my first time doing this, any constructive criticism is very welcome!

PEP: 9999
Title: Recording the source hash of installed distribution
Author: Noah Gorny <noah.bar.ilan@gmail.com>
Sponsor: ??? <???>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 5-Jul-2020
Post-History:
Discussion-To:

Abstract
========

Currently, after installation, the hash of the downloaded sdist/wheel is not recorded.

This proposal defines
additional metadata, to be added to the installed distribution by the
installation front end, which records the source hash for use by
consumers which introspect the database of installed packages (see PEP 376).

Motivation
==========

The original motivation of this PEP was to let tools with a "freeze"
operation, which allows a Python environment to be recreated, extend their
capabilities and provide a secure way to generate hash-pinned requirements.

Specifically, the PEP originated from the desire to address `pip issue #4732`_,
i.e. improving the behavior of ``pip freeze`` so that it can output the hashes
of installed packages, allowing easy pinning of the requirements and easy
reproduction of the environment using hash-checking mode.

Freezing an environment
-----------------------

Pip provides a command named ``pip freeze``, which examines the Database of
Installed Python Distributions to generate a list of requirements. The main
goal of this command is to help users generate a list of requirements that
will later allow the re-installation of the same environment with the highest
possible fidelity.

However, it cannot currently output the hash of an installed distribution,
as this information is not stored and cannot always be computed at run time
from local information.
This means that there is no easy way to output source hashes using ``pip freeze``.

The advantages of installing in hash-checking mode
--------------------------------------------------
As noted in the pip `user guide`_, hash-checking mode protects against a
compromised PyPI or HTTPS certificate chain, and against a package changing
without a change of version. This approach allows for easier and more
secure automated server deployment.

It is also a labor-saving alternative to running a private index server with approved
packages. It can also substitute for a vendor library, providing easier
upgrades and less VCS noise.

Rationale
=========

This PEP specifies a new ``HASH`` metadata file in the
``.dist-info`` directory of an installed distribution.

The fields specified are sufficient to retrieve the hash of the source
distribution under various algorithms. The line-by-line format allows algorithms
to be added and removed easily in the future.

Specification
=============

This PEP specifies a ``HASH`` file in the ``.dist-info`` directory
of an installed distribution, to record the source hash of the distribution.

The canonical source for the name and semantics of this metadata file is
the `Recording the source hash of installed distribution`_ document.

This file MUST be created by installers in any installation.

This file MUST be formatted as lines of ``hash_algorithm:hash``.
``hash_algorithm`` specifies the hash algorithm used. It is RECOMMENDED that
only hashes which are specified here be used for source distribution hashes.
At time of writing, that list consists of 'sha256', 'sha384', and 'sha512'.
``hash`` specifies the hash result of the hash algorithm operation on the source distribution.

Note about different types of sources
-------------------------------------

A distribution can be obtained in different packaging formats. One example is
the ``wheel`` format (PEP 427), and another is the source distribution (sdist).
Note that we should take the hash of the *source*, regardless of its type.
This means that we should save the hash of the original sdist ``tar.gz`` and not
of the resulting built wheel as wheel building is nondeterministic. This means we
should calculate the hash and insert it into the resulting built wheel.

Use cases
=========

"Freezing" an environment

  Tools, such as ``pip freeze``, which generate requirements from the Database
  of Installed Python Distributions SHOULD exploit ``HASH``
  if it is present, and give it priority over other means to generate hashes, in order
  to generate a higher fidelity output. Tools are not required to output the hashes
  in the default use-case, and it is RECOMMENDED to allow this option via a specialized flag.

Backwards Compatibility
=======================

Since this PEP specifies a new file in the ``.dist-info`` directory,
there are no backwards compatibility implications.

Alternatives
============

There are various alternatives, which all share the same problem: they generate
hashes from remote sources, as they cannot generate a hash from the local
installation (unless saved in a cache).

pipenv
------
An environment manager that organizes your Python environment using ``Pipfile.lock``,
which contains hashes of the distribution source. Those hashes are obtained *after*
installation, using remote queries of the Warehouse API. This solution works, but
requires you to use pipenv to manage your entire Python package environment.
It also queries the hashes from the remote, where, if intercepted, they can be
tampered with regardless of the original hash of the locally installed distribution.

References
==========

.. _`pip issue #4732`: https://github.com/pypa/pip/issues/4732
.. _`user guide`:  https://pip.pypa.io/en/stable/user_guide/#hash-checking-mode

Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:

pf_moore · July 5, 2020, 6:38pm

This is my first time doing this, any constructive criticism is very welcome!

Welcome, and thanks for contributing this!

One thing that needs to be noted, there may not be a distribution to hash (consider pip install .). So “This file MUST be created by installers in any installation” needs to be weakened, the file needs to be optional, at least for that case.

In general, the file has to be optional anyway, if only to allow for backward compatibility with the current standard - tools must be prepared for it not to exist if they aren’t going to break on existing data.

Also, what’s the overhead of computing the hash of a large distribution file (something like scipy or tensorflow)? As someone who has no interest in using hash verification or freezing with hashes, I don’t particularly want to pay any significant cost to enable a feature I’m not interested in. If it’s cheap (relative to the install operation), then that’s fine, but if it’s costly, maybe it should be “opt in”?

I’m also a bit concerned that this seems very pip-specific. Tools like pipenv and poetry also have hash verification. We should make sure that this proposal fits in with what they do.

pradyunsg · July 5, 2020, 6:47pm

this seems very pip-specific. Tools like pipenv and poetry also have hash verification.

I’m not a 100% sure about the specific wording in the PEP draft, but this is basically aimed at storing information about the original hash of the file in the “installation database” that we have. Both pipenv and poetry have access to these hashes (in their lock files and during installation) and can be augmented to store this in the .dist-info directory if needed – I think they both use pip under-the-hood, so this might actually “just work”?

pf_moore · July 5, 2020, 7:00pm

My concern is that it might “just work” by pip recalculating a hash that they already maintain. And yes, this is again a matter of “is it costly enough to matter?”

More broadly, this feels more like “private pip data” being added to the database so that pip install can communicate with pip freeze. Or maybe a standard that’s only needed by pip freeze, and other tools have their own ways of doing this.

I’d be fine with defining a “tool specific data” area in the .dist-info directory that pip could use to stash this data if that’s all that’s really needed here. But if other tools are expected to save the data, I’d feel a lot better knowing that it was useful to more than just pip freeze…

FFY00 · July 5, 2020, 7:13pm

NoahGorny:

we should save the hash of the original sdist ``tar.gz`` and not
of the resulting built wheel as wheel building is nondeterministic. This means we
should calculate the hash and insert it into the resulting built wheel.

The sdist format is not defined, there is no guarantee the build process is deterministic. The build system could touch the files when creating the tarball, which would result in different timestamps and then a different hash, for example.

Could you explain how this would be more secure? The goal of adding hashes to requirements.txt is to verify the origin, you want to verify if the file you downloaded is the one the developers intended when they wrote requirements.txt, correct? If you store the hash file inside the wheel, this has no value. An attacker can just replace it with a malicious wheel and keep the correct hash metadata. Am I missing something here?

This proposal does successfully secure installations from sdists, but isn’t installing from wheels the most common use case? Shouldn’t we be thinking of a solution that covers both?

NoahGorny · July 6, 2020, 6:33am

About the sdist format, I think that the build process is nondeterministic, but the tarball that is published is constant and has a constant hash.

This process is more secure for the following reason. If the environment we want to copy is itself compromised, it does not matter whether we record hashes or not, because the attacker has complete control. However, if we pinned the hashes correctly and the attacker only has control of the remote PyPI, they cannot alter the installation, as it will fail.

This proposal also secures installation from wheels, as we calculate the hash from the received distribution.
The only problem we have is with editable/VCS source, as it is not received as a single compressed file.

NoahGorny · July 6, 2020, 6:39am

About the file being optional, we need to discuss this, as I think I should add that tools SHOULD abort freeze if some of the hashes are not found, as there is no purpose for half pinned requirements.

I do not think there is an overhead for computing the hash; this only needs to happen once per distribution and its dependencies. I can try to benchmark the difference, if there is any. Calculating one hash per requirement is not very heavy, and because of that I think this should always be calculated rather than being “opt in”.
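
For a rough sense of that cost, here is a minimal sketch of hashing a file in fixed-size chunks with `hashlib`, so memory use stays flat regardless of archive size. The names `hash_file` and `timed_hash` and the 1 MiB chunk size are illustrative assumptions, not something pip is known to use.

```python
import hashlib
import time


def hash_file(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the lower-case hex digest of the file at `path`.

    Hypothetical helper: reads the file in chunks so that large
    archives are never loaded into memory at once.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def timed_hash(path, algorithm="sha256"):
    """Return (hex digest, elapsed seconds) for one hashing pass."""
    start = time.perf_counter()
    value = hash_file(path, algorithm)
    return value, time.perf_counter() - start
```

Running `timed_hash` on a large downloaded wheel and comparing the elapsed time with the total install time would give the kind of benchmark mentioned above.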

uranusjr · July 6, 2020, 7:01am

Note that pip’s hash mode already enforces every package in the requirements.txt must have a hash, so freezing an environment containing non-hashable packages would produce a non-installable requirements.txt anyway.

This would make the most sense to me:

- An installer SHOULD create a HASH file.
- A tool providing the freezing-with-hash functionality SHOULD abort if any of the installed distributions in the current environment does not contain a HASH file. If the tool chooses not to abort, it MUST display a warning message that the resulting requirements.txt is not installable.
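
The two rules above could be sketched roughly as follows. This is only an illustration, assuming the draft's `HASH` file name and its `algorithm:hash` line format; `installed_records` and `freeze_with_hashes` are hypothetical names, and the output uses pip's existing `--hash=algorithm:value` requirement syntax.

```python
import sys
from importlib import metadata


def installed_records():
    """Yield (name, version, HASH text or None) per installed distribution.

    Assumes the draft's HASH file name; read_text() returns None
    when the file is absent from .dist-info.
    """
    for dist in metadata.distributions():
        yield dist.metadata["Name"], dist.version, dist.read_text("HASH")


def freeze_with_hashes(records, strict=True):
    """Turn records into 'name==version --hash=algo:hex' lines.

    With strict=True, raise if any distribution has no recorded hash;
    otherwise warn on stderr that the output is not installable in
    hash-checking mode.
    """
    lines, missing = [], []
    for name, version, hash_text in records:
        entries = [ln.strip() for ln in (hash_text or "").splitlines() if ln.strip()]
        if not entries:
            missing.append(name)
            lines.append(f"{name}=={version}")
            continue
        options = " ".join(f"--hash={entry}" for entry in entries)
        lines.append(f"{name}=={version} {options}")
    if missing:
        message = "no HASH file for: " + ", ".join(sorted(missing))
        if strict:
            raise RuntimeError(message)
        print(f"warning: {message}; output is not installable "
              "in hash-checking mode", file=sys.stderr)
    return lines
```

Calling `freeze_with_hashes(installed_records())` would abort on today's environments, since no installer writes a `HASH` file yet; passing `strict=False` gives the warn-and-continue behaviour instead.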

pf_moore · July 6, 2020, 7:45am

You can’t really say that in a PEP though, without defining what “freeze” is. This is where I think this PEP is too closely tied to pip, in current terms.

Things that have no standard meaning at the moment:

- Freezing
- Requirements files
- Hash mode

I could, for example, write a script that introspected my site-packages, read the HASH files, and wrote a file that included the names of everything and a hash for pip. Is that script allowed by the PEP? (Hint: It should be, because you don’t know what I want and can’t mandate that I follow any rules). If it is, why? Because it’s not a “tool”? Because the operation isn’t a “freeze”? Because the hashes were found but I chose not to write them?

I know this feels like nit-picking (and it is!) but insufficiently precise standards can be a real problem for implementors.

I’d suggest that you strip back the scope of this PEP and concentrate solely on something that:

- Allows (but doesn’t require) the existence of a HASH file in .dist-info.
- States what it will contain, if it exists.

Leave handling of cases where it doesn’t exist, and deciding whether to write it or not, to the individual tools (pip, other installers, etc). That way you don’t have to think about questions like those I raise above.

Some further thoughts:

1. It’s not actually clear to me whether PEP 376 allows arbitrary files in .dist-info (see what I said above about unclear standards). If arbitrary files are allowed, pip could just use an implementation-defined HASH file. But that risks clashes with other tools - having a namespacing mechanism for tool-specific files would be better (as would clarifying the intent of PEP 376!)
2. This PEP suggests recording the hash of the distribution source (where the source is a single file) but it doesn’t record what that file was, or where it came from. pip freeze might not need this information, but other tools might. Has this been considered? Maybe at least the source filename (if not the actual location) would be useful?
3. We’re getting very much into the area of lock files here (after all, requirement files with hashes are basically a form of lock file) so this discussion should probably be taken into account.

NoahGorny · July 6, 2020, 8:26am

I see, it makes sense to step back and let the installers use this information at will; allowing the existence of the file, or even recommending it, sounds like a good idea.

About the further thoughts:

1. If we add a new file, we might as well standardize it. Leaving random files in the .dist-info seems like a worse option.
2. Not sure if we should speculate like this, but I can probably offer to transform this file into JSON, which makes it easily changeable.
3. I will take a look.

Thanks so much for the helpful feedback!

pf_moore · July 6, 2020, 9:08am

My point here is that if the file is standardised, it’s very much not easily changeable. That’s not a technical issue, it’s a problem with backward compatibility and process. If we defer questions like “do we need other data”, then when that question comes up later, we need another round of standardisation, and we have to consider versioning the file, as there will be data “in the wild” using version 1 of the spec.

I’m suggesting that we broaden the scope of the spec now, so that we (a) avoid that problem to the extent that we can, and (b) save people’s time by just having one discussion.

Anyhow, rather than monopolise this conversation, I’ll step back and let other tool maintainers comment further.

FFY00 · July 6, 2020, 9:33am

Published wheels are also constant and should have a constant hash.

What do you mean here? If you pin a hash in requirements.txt, pip can download the file and verify the hash. If an attacker takes control of the environment and tries to replace the file, the hash wouldn’t match and pip would fail.

Correct, for sdists.

I don’t follow.

This means we should calculate the hash and insert it into the resulting built wheel.

This only secures installations from sdists. You fetch the sdist, build a wheel from it and install it. In this case, you are sure the wheel file hasn’t been tampered with because you were the one generating it.

How does this secure installations where we fetch the wheel file from PyPI and install it?

NoahGorny · July 6, 2020, 12:36pm

I am not sure why you are drawing a distinction between sdists and wheels in that case; both of them are published as constant files and have a constant hash.

In both cases, if we managed to install on our machine successfully, we can pin the hashes correctly and be safe, even if the remote is compromised.

Please remember that pinning the hash also helps in case the package was changed without a version change, this will verify that the package stayed the same without modification for better fidelity.
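
As a small illustration of that last point, the check below compares the bytes of an archive against a pin written as `algorithm:hexdigest`. It is a sketch only: `matches_pin` is a hypothetical helper, and the byte strings stand in for real archive contents.

```python
import hashlib
import hmac


def matches_pin(data, pin):
    """Check archive bytes against a pin of the form 'algorithm:hexdigest'."""
    algorithm, _, expected = pin.partition(":")
    actual = hashlib.new(algorithm, data).hexdigest()
    # compare_digest avoids leaking where the two strings first differ
    return hmac.compare_digest(actual, expected.lower())


# Placeholder contents: the file we installed, and a file later
# republished under the same version number.
original = b"contents of example-1.0.tar.gz"
republished = b"different contents, same version number"
pin = "sha256:" + hashlib.sha256(original).hexdigest()
```

Here `matches_pin(original, pin)` holds while `matches_pin(republished, pin)` does not, so the changed file is rejected even though its name and version are unchanged.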

tiran · July 7, 2020, 9:20am

The PEP does not specify the format of hash. I assume that you imply hash digest as lower-case hexadecimal ASCII string (output of hexdigest()). Could you please clarify this?

In my opinion it does not make much sense to suggest SHA384 and SHA512. Internally they are both virtually the same SHA2 algorithm. SHA384 is a truncated version of SHA512 with a different start vector. In the past decade or two cryptographers and protocol designers have learned the hard way that choices can be a burden. I suggest that you change the list of hashing algorithms to:

- make SHA2-256 (aka sha256) mandatory.
- optionally allow SHA2-512 (sha512) as additional hash digest for users that require a stronger hash for compliance reasons
- optionally allow SHA3-256 and SHA3-512 for the highly unlikely case that SHA2 becomes compromised. SHA3 is a different construct (sponge instead of Merkle-Damgard).

It simplifies implementations if you guarantee one algorithm with decent security margins. I propose SHA256 because it’s standard and you get it from PyPI for free.
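
A consumer could enforce such a policy with a check along these lines. This is a sketch under assumptions: `check_algorithms` is a hypothetical name, and the algorithm identifiers use `hashlib`'s spellings (`sha3_256`, `sha3_512`), which the draft does not specify.

```python
import hashlib

# Policy from the list above, spelled with hashlib's algorithm names.
REQUIRED = {"sha256"}
OPTIONAL = {"sha512", "sha3_256", "sha3_512"}


def check_algorithms(algorithms):
    """Return the sorted algorithm names, or raise ValueError.

    Rejects a set that lacks the mandatory algorithm or that
    contains an algorithm outside the allowed list.
    """
    algorithms = set(algorithms)
    missing = REQUIRED - algorithms
    unknown = algorithms - REQUIRED - OPTIONAL
    if missing:
        raise ValueError(f"missing mandatory algorithm(s): {sorted(missing)}")
    if unknown:
        raise ValueError(f"algorithm(s) not allowed: {sorted(unknown)}")
    return sorted(algorithms)
```

With one guaranteed algorithm, a reader of the file can always look up `sha256` first and treat anything else as optional extra data.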

FFY00 · July 7, 2020, 1:59pm

In your PEP proposal, the handling of sdists and wheels is different. I am raising concerns about how it handles wheels.

Can you be more explicit here? Which hashes? You say we store the distribution (sdist) hashes, how does that work for wheels?

It would be maybe better if you described the process of downloading and verifying a wheel step by step. Something like:

1. Download “tire” file
2. Do voodoo magic
3. Validate tire width
4. Save some information

I went a little creative there to make it clear this is just an example, and not a proposal on how it would work.

NoahGorny · July 7, 2020, 7:19pm

These are the preferred algorithms of pip’s hash-checking mode; I updated the PEP to note that:

This file MUST be formatted as lines of ``hash_algorithm:hash``.
``hash_algorithm`` specifies the hash algorithm used, it is RECOMMENDED that
only hashes which are specified here be used for source distribution hashes.
At time of writing, that list consists of 'sha256', 'sha384', and 'sha512'
as those are the preferred algorithms used by ``pip``'s hash-checking-mode.
``hash`` specifies the hash result of the hash algorithm operation on the
source distribution, represented as lower-case hexadecimal ASCII string.
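
To make the format concrete, a reader of such a file might look roughly like this. It is a sketch, not a reference implementation: `parse_hash_file` is a hypothetical name, and skipping blank lines is an assumption, since the draft text does not say how they should be treated.

```python
import re

# One or more lower-case hexadecimal characters, as the text above requires.
_HEX = re.compile(r"[0-9a-f]+")


def parse_hash_file(text):
    """Parse 'hash_algorithm:hash' lines into a dict.

    Blank lines are skipped (an assumption). Raises ValueError on a
    line without a colon or with a digest that is not lower-case hex.
    """
    hashes = {}
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        algorithm, sep, digest = line.partition(":")
        if not sep or not algorithm or not _HEX.fullmatch(digest):
            raise ValueError(f"line {number}: expected 'algorithm:hexdigest'")
        hashes[algorithm] = digest
    return hashes
```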

see the following lines and let me know if it makes more sense:

In any case, we take the single compressed file that we downloaded,
calculate its hash, and place the results in the final HASH file.
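
In code, that step might look something like the sketch below. It assumes the installer still has the downloaded archive on disk and knows the target `.dist-info` directory; `write_hash_file` is a hypothetical helper, not an existing pip function.

```python
import hashlib
from pathlib import Path


def write_hash_file(archive, dist_info, algorithms=("sha256",)):
    """Hash the downloaded archive and record it in <dist_info>/HASH.

    Writes one 'algorithm:hexdigest' line per requested algorithm
    and returns the path of the file written.
    """
    archive, dist_info = Path(archive), Path(dist_info)
    digests = {name: hashlib.new(name) for name in algorithms}
    with archive.open("rb") as f:
        # Read once, feeding every digest, so extra algorithms cost no extra I/O.
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            for digest in digests.values():
                digest.update(chunk)
    lines = [f"{name}:{digest.hexdigest()}" for name, digest in digests.items()]
    target = dist_info / "HASH"
    target.write_text("\n".join(lines) + "\n", encoding="ascii")
    return target
```

The same function applies whether the downloaded file is an sdist or a wheel, since it only ever sees the archive bytes.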

pf_moore · July 7, 2020, 7:31pm

I’d be inclined to say that pip should follow the PEP, not the other way around. If there’s a good reason for pip to have chosen those algorithms, it can be used to justify them for this PEP. If there isn’t, the PEP should recommend something better and pip can change to reflect that.

uranusjr · July 7, 2020, 8:08pm

As far as I am aware, pip always requires the provider (i.e. requirements.txt) to provide both the algorithm and value when checking hashes. The only place I can find pip calls out these algorithms over others is in pip hash, and I read it more like “don’t use SHA1 and SHA224” than “SHA256, SHA384, and SHA512 are equally recommended.”

FFY00 · July 8, 2020, 11:33am

NoahGorny:

see the following lines and let me know if it makes more sense:
In any case, we take the single compressed file that we downloaded,
calculate its hash, and place the results in the final HASH file.

That is in the PEP and I’ve read it.

The PEP can be used to secure wheels, but the proposed implementation doesn’t do this. To secure wheels, the hash of every wheel file must be present in requirements.txt so that pip can verify the hash of the wheel downloaded. I think I’ve done a bad job of differentiating these two things, sorry.

So, the PEP itself is fine, however the hash distribution/storage mechanism is designed in such a way that makes it difficult to be used. AFAIK there is no optimal solution, but I think it’s worth exploring other options.

encukou · July 8, 2020, 3:31pm

I’m preparing to convert PEP 376 to a proper PyPA specification document (Edit: it is now PEP 627). If you end up wanting to change/convert PEP 376, please coordinate with me to avoid conflicts/duplication of work.
Adding a HASH file is out of scope of my effort, but it should be easy to add it to that spec if your PEP is accepted.

As an RPM packager, I would like to add that projects can be installed from other things than wheels/sdists/things pip can handle. Not all environments are pip freeze-able. (The current PEP draft, where HASH is optional, is fine. I’m just adding another point to keep in mind.)