How to hash a directory in lockfiles

This is an offshoot from PEP 751: now with graphs!


I’m proposing a method based on blake2, given its suitability for combining digests, its guaranteed availability in the standard library (it is part of hashlib’s guaranteed algorithms, not an optional one), and the availability of reference implementations.

While blake3 is faster, there is no current evidence that blake2 is insufficiently secure, and blake3 is not guaranteed to be available in Python currently.
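
For context, this availability is checkable directly; hashlib documents a guaranteed set that always includes blake2:

import hashlib

# blake2b/blake2s are in algorithms_guaranteed (present on every build),
# unlike algorithms that depend on the linked OpenSSL
assert "blake2b" in hashlib.algorithms_guaranteed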

The things which are taken into account:

  • That any symlinks in the directory must only point inside the directory
  • A symlink to a path is treated as equivalent to that path being duplicated at the link’s location
    • This allows for portable behavior preferring symlinks on supported systems
  • File metadata such as last modified date is not hashed
  • The directory’s name does matter

These considerations allow portable relocation of a directory, including via tools that don’t preserve metadata, to resolve to the same hash, while retaining the parts we need to care about for source trees in Python.

I’ve written a reference implementation:

from hashlib import blake2b
from pathlib import Path


def _check_symlink(base_path: Path, to_check: Path, /) -> Path:
    if to_check.is_symlink():
        resolved = to_check.resolve(strict=True)
        try:
            resolved.relative_to(base_path)
        except ValueError as exc:
            msg = "Dangerous symlink detected"
            raise ValueError(msg) from exc
        return resolved
    return to_check


def _portable_path_repr(base_path: Path, dir_path: Path, /) -> bytes:
    """don't rely on os.pathsep"""
    base = base_path.as_posix()
    d = dir_path.as_posix()
    return ("." + d.removeprefix(base)).encode()


def hash_dir(dir_path: Path, /) -> str:
    if not dir_path.is_dir():
        msg = "This function is intended for use on paths"
        raise ValueError(msg)

    rel_root = dir_path.resolve(strict=True)

    dir_component = blake2b(last_node=False, person=b"dirs")
    file_component = blake2b(last_node=False, person=b"files")
    data_component = blake2b(last_node=True, person=b"data")
    root_node = blake2b(last_node=True, digest_size=32)
    # the directory's own name is part of the hash
    root_node.update(rel_root.name.encode())

    # walk the resolved root so every path below is absolute and resolved,
    # which both the symlink containment check and the portable repr rely on
    walk_gen = rel_root.walk(top_down=False, follow_symlinks=False)
    for root, dirs, files in walk_gen:
        for name in dirs:
            fp = root / name
            resolved = _check_symlink(rel_root, fp)
            dir_component.update(_portable_path_repr(rel_root, resolved))

        for name in files:
            fp = root / name
            resolved = _check_symlink(rel_root, fp)
            file_component.update(_portable_path_repr(rel_root, resolved))
            data_component.update(resolved.read_bytes())

    root_node.update(dir_component.digest())
    root_node.update(file_component.digest())
    root_node.update(data_component.digest())
    return root_node.hexdigest()
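
For illustration, usage is a single call (the path here is hypothetical):

from pathlib import Path

# prints a 64-character hex string (32-byte blake2b digest)
print(hash_dir(Path("src/mypackage")))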

There are a few places in this where the behavior of the standard lib is implicitly part of the algorithm; this might need adjusting to explicitly avoid a dependency on behavior which may change in the future.

In particular, while unlikely to change, these may be useful for non-Python implementations such as uv:

  • The defaults for hashlib.blake2b (unlikely to change)
  • The ordering of pathlib.Path.walk
  • The specific output of pathlib.Path.as_posix()

While I did consider other hash functions and other means of combining hashes, I was unhappy with the approaches needed for hashes not designed with incremental tree updates as part of their design, such as building an intermediate container as an implementation detail.

We might need a name for this process as well, so that if a future better way of hashing a directory becomes viable, it’s possible to communicate the process. As this is not raw use of blake2 on an input, the name should not be “blake2”, but might contain it. I don’t have a preference on naming, and would prefer to leave that bit of bikeshedding until we’ve figured out whether there’s consensus on this being possible.


Note: Implementation and concerns have received an update for file separators and to include empty dirs

/cc @sethmlarson , @dustin , and @woodruffw as the usual security folks.

1 Like

First, thanks for starting this conversation!

For those of us not steeped in cryptography, how is it better than SHA256 for this sort of situation? Are you saying we would need to mandate blake2 for this approach no matter what? That would be a first for Python packaging to mandate a specific hashing algorithm.

What are the benefits of hashing the file paths instead of just the contents? I would think the hashes alone in the appropriate order are good enough to be resilient and to avoid the “specific format of output of os.path.relpath” problem.

I’m assuming the data isn’t held on to until the digest is calculated? Otherwise this could take up a lot of memory.

We would have to be very explicit about this.

You could use the same algorithm git uses to calculate tree hashes (it can be implemented in about 150 lines of Python).
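
For anyone curious, the core of git’s scheme is compact; a rough sketch of it in Python (illustrative only: this ignores git’s special sort order for tree entries and the distinct modes for executables and symlinks):

import hashlib
from pathlib import Path


def git_blob_hash(data: bytes) -> bytes:
    # blobs are hashed as "blob <size>\0<contents>"
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).digest()


def git_tree_hash(dir_path: Path) -> bytes:
    # trees are "tree <size>\0" followed by "<mode> <name>\0<raw child digest>" entries
    entries = b""
    for child in sorted(dir_path.iterdir(), key=lambda p: p.name):
        if child.is_dir():
            entries += b"40000 " + child.name.encode() + b"\x00" + git_tree_hash(child)
        else:
            entries += b"100644 " + child.name.encode() + b"\x00" + git_blob_hash(child.read_bytes())
    return hashlib.sha1(b"tree %d\x00" % len(entries) + entries).digest()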

1 Like

Drawing on some prior-art, SPDX SBOM has a concept called a “Package Verification Code” which is a similar use-case. Some thoughts on it:

  • We probably shouldn’t mandate a specific algorithm.
  • Package Verification Code doesn’t take into account the name of the files. I’m kinda confused why it doesn’t, maybe due to cross-platform file names being a nightmare? Not sure.
  • Package Verification Code doesn’t track empty directories. Neither does the proposed implementation; how much do we care about those? Because permissions don’t translate cross-platform, I don’t think we can do much there?
  • Instead of sorting at the “walk” level, Package Verification Code sorts the hashes after every entry is accounted for, concatenates them, and then hashes that result (sketched below). Might be easier from a non-Python implementation POV?
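
For reference, a minimal sketch of that construction as I read the SPDX spec (it uses SHA-1, and this omits the spec’s excluded-files handling):

from hashlib import sha1
from pathlib import Path


def package_verification_code(package_dir: Path) -> str:
    # hash every file, sort the hex digests, concatenate, then hash the result;
    # note that file names and paths never enter the calculation
    file_hashes = sorted(
        sha1(p.read_bytes()).hexdigest()
        for p in package_dir.rglob("*")
        if p.is_file()
    )
    return sha1("".join(file_hashes).encode()).hexdigest()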

Note that Package Verification Code is less to do with integrity and more to do with identity? Definitely need to take that into account.

One problem with hashing paths is that trees won’t be portable between Unix and Windows, as the path separator is different.

I’m likely not the best person to communicate this one on the cryptographic side, but there are significant implementation benefits here: this can be made parallel safely (something uv would likely want to do), since it does not require all of its input at once but can safely combine digests.

As for mandating an algorithm, I think we actually should here, but this doesn’t need to be the only method we end up supporting. This could be the first, with more added as consensus forms on “is this implementable by all ecosystem members?”, so that users encountering a lockfile can expect any up-to-date tool consuming lockfiles to understand the process.

Renaming a file such that it orders the same way can still change behavior. Especially when renaming shared libraries that are found by another file.

This generates an internal node with its own digest in blake2, one of the benefits here.

I don’t see any reason we need to care; I’m unaware of any case where the presence of an empty dir should be a behavior change, except possibly code that tests for the existence of an empty dir in its own source tree (which would be odd)… but if I’m wrong, this approach can be adapted to include a dirs digest as well.

I can write a portable function that should be used instead if this is a dealbreaker; copying across operating systems was not something I considered, given that this was (at least for the PEP it would be included as part of?) for local source trees.

Not sure if this is relevant but Python treats empty directories as importable packages:

$ python -c 'import foo; print(foo.__path__)'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'foo'
$ mkdir foo
$ python -c 'import foo; print(foo.__path__)'
_NamespacePath(['/stuff/current/active/sympy/foo'])

Well, rather than guess at whether not handling this might break someone’s stability, I might as well update the above for this…

This method is already known to be susceptible to intentionally crafted collisions with git defaults (SHA-1). There’s a process to use sha256 (I’m not going to go find the reference implementation for this at this time).

This approach might make a good candidate for an alternative method to include and support, so long as it is using sha256.

I’m inclined to agree with not mandating an algorithm (assuming trees have any depth at all, directories could be hashed recursively, which would allow each directory to be calculated in parallel and then combined at its parent).

File ordering needs to be specified in terms of a portable, locale-independent ordering. This might be “convert to UTF-8 and do direct byte comparisons”, but following the platform/stdlib default is not sufficient.
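
As a concrete example of such a rule (illustrative, not a spec):

# sort by UTF-8 encoded bytes, independent of platform locale or collation
names = ["b.py", "a.py", "Z.py", "é.py"]
ordered = sorted(names, key=lambda name: name.encode("utf-8"))
# ['Z.py', 'a.py', 'b.py', 'é.py']: uppercase before lowercase, non-ASCII last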

I’ve updated the proposed implementation in the original post; as the generalized approach is unchanged, I’m not including it separately here.

change summary:

  • Accounts for empty directories
  • Accounts for copying across operating systems
  • No longer relies on os.path.relpath behavior
  • Gains a reliance on pathlib.Path.as_posix behavior

To expand further on needing to mandate a set of supported methods: this isn’t an out-of-the-box hashing algorithm that “just works” on a directory (if there were a suitable one available that I was aware of, simply naming it would have been much simpler), so the existing comparison to allowing a choice of hash for files doesn’t quite hold from my perspective.

It is allowed (and was a specific consideration when thinking about uv) to expand this into the parallelizable form; this is a reference implementation showing what should be combined and how.

So does that remove/reduce the benefit of blake2 far enough that we could allow sha256?

I don’t think so, no. blake2 use is what enables this to be parallelized. With sha256 we would need to evaluate the safety of various methods of mixing sha256 digests (or require that the hash of a directory be a list of hashes, which would explode in size) to gain that; blake2 is designed for tree updates.
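
To make “designed for tree updates” concrete: hashlib’s blake2b exposes the tree-mode parameters from the BLAKE2 spec, so leaves can be hashed independently and combined at a parent. A toy two-leaf example (illustrative parameter choices, not the proposal’s construction):

from hashlib import blake2b

# two leaves hashed independently (parallelizable), combined at a parent node
leaf0 = blake2b(b"chunk-0", fanout=2, depth=2, node_depth=0, node_offset=0)
leaf1 = blake2b(b"chunk-1", fanout=2, depth=2, node_depth=0, node_offset=1,
                last_node=True)  # last leaf at its level
root = blake2b(fanout=2, depth=2, node_depth=1, node_offset=0, last_node=True)
root.update(leaf0.digest())
root.update(leaf1.digest())
print(root.hexdigest())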

How does it explode in size? If the hash of a directory is the hash of the directories and files (non-recursive) within that directory, then we recurse on the calculation of each directory instead, which can be easily parallelised provided you get them back in the right order:

def hash_dir(root):
    hasher = hashlib.sha256()
    for name in stable_crossplatform_sort(os.listdir(root)):
        path = os.path.join(root, name)  # listdir yields bare names
        if os.path.isdir(path):
            hasher.update(hash_dir(path))
        else:
            with open(path, 'rb') as f:
                hasher.update(f.read())
    return hasher.digest()

(that’s some very quick pseudocode, pretty sure it doesn’t work, and I didn’t reference the original proposed code either - I’m illustrating where the recursion occurs, not the full process)

I can’t definitively say that’s possible to make safe, and evaluating whether it is doesn’t seem worth it when blake2 is widely available and designed for the ability to update trees.

The version you have there is not. You’re interspersing files and combined directory digests without differentiating them in any way. The obvious collision creatable here: create a file that contains the hash of a dir, remove the dir, and ensure that the file sorts into the same location.
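
Spelled out: in the unseparated construction, the parent absorbs the same bytes either way, so it cannot distinguish the two trees (toy demonstration):

import hashlib

subdir_digest = hashlib.sha256(b"whatever the subdir hashed to").digest()

h_with_dir = hashlib.sha256()
h_with_dir.update(subdir_digest)   # child was a directory

h_with_file = hashlib.sha256()
h_with_file.update(subdir_digest)  # child was a file whose contents are that digest

assert h_with_dir.digest() == h_with_file.digest()  # same parent hash: collision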

1 Like

Can you show what this would look like written recursively instead of flattening the directory like this? I think that would actually change the resulting hash, but improve how it is written to be less reliant on OS-specific things.

@steve.dower @Liz

Hopefully this addresses the feedback from both of you:

from hashlib import blake2b
from pathlib import Path


def _check_symlink(base_path: Path, to_check: Path, /) -> tuple[str, Path]:
    if to_check.is_symlink():
        resolved = to_check.resolve(strict=True)
        try:
            resolved.relative_to(base_path)
        except ValueError as exc:
            msg = "Dangerous symlink detected"
            raise ValueError(msg) from exc
        return to_check.name, resolved
    return to_check.name, to_check


def make_file_node(path: Path, name: str | None = None) -> bytes:
    # hash the name and the contents as separate, domain-separated components
    # (person= prevents one from being confused for the other), then combine
    # them into a single file node
    name = name if name is not None else path.name
    name_component = blake2b(name.encode(), last_node=True, person=b"name")
    file_data = path.read_bytes()
    data_component = blake2b(file_data, last_node=False, person=b"data")
    file_node = blake2b(last_node=False, person=b"file")
    file_node.update(data_component.digest())
    file_node.update(name_component.digest())
    return file_node.digest()


def make_dir_node(rel_root: Path, path: Path, name: str | None = None) -> bytes:
    bname = (name if name is not None else path.name).encode()
    name_component = blake2b(bname, last_node=True, person=b"name")
    data_component = blake2b(last_node=False, person=b"children")

    digests: list[bytes] = []
    for child in path.iterdir():
        name, resolved = _check_symlink(rel_root, child)
        if resolved.is_dir():
            digests.append(make_dir_node(rel_root, resolved, name))
        else:
            digests.append(make_file_node(resolved, name))

    # child ordering is defined by sorting the raw digests, not the file names
    digests.sort()
    for digest in digests:
        data_component.update(digest)

    # only the top-level directory is the blake2 "last node" (the tree's root)
    is_last = rel_root == path

    dir_node = blake2b(last_node=is_last, person=b"dir")
    dir_node.update(data_component.digest())
    dir_node.update(name_component.digest())
    return dir_node.digest()


def hash_dir(dir_path: Path) -> str:
    if not dir_path.is_dir():
        msg = "This function is intended for use on directories"
        raise ValueError(msg)

    if dir_path.is_symlink():
        msg = "Top level paths must not be symlinks"
        raise ValueError(msg)

    # resolve the root so the containment checks in _check_symlink compare
    # absolute, resolved paths
    root = dir_path.resolve(strict=True)
    return make_dir_node(root, root).hex()

This is probably a better implementation for multiple reasons. It’s no longer reliant on OS-specific things, for one. Making it recursive rather than flattening trades away a little of how parallel this can be made, but in return the path separators no longer need handling.

It’s also clearer that this is a Merkle tree, which makes documenting this as an algorithm much easier.

  • For each child, recurse and generate a node based on whether it is a file or a directory.
  • Symlinks should be followed only when they are not the top directory and their targets remain within the directory without forming a cycle, using the name of the link, not the destination.
  • Child nodes of directories are ordered by sorting their digests.

I could write something similar to this for sha256 if it’s desired to have choices, but there aren’t that many hashes that are considered cryptographically secure and suitable for use in this manner, and as far as I know, none of those hashes provide a standard way of hashing a directory (this is why we can’t just say “bring your own hash”; there needs to be an agreement on how to build the tree from that hash).

Given the small number of hashes suitable for this purpose, it would not be difficult to standardize within packaging the way in which a tree should be constructed, but this particular implementation uses features that exist in blake2b specifically for the purposes used here and do not exist in the other hashes the standard library supports.

2 Likes

Thanks for opening this topic @mikeshardmind!

I briefly skimmed the comments to catch up, so apologies if my response here misses anything.

Some notes:

  1. There’s a bit of prior art for directory hashing already: Go has dirhash, which is used in their checksum database for the Go packaging ecosystem. I’d suggest referencing that as a potential “known-good” approach, since Go’s checksum DB was built by professional cryptographers.
  2. In practice, there’s no security margin difference[1] between SHA256 and Blake2. SHA256 is strictly more available, but Blake2 is already baked into CPython and is used elsewhere in Python packaging (notably, PyPI itself). With that said, I strongly agree about mandating a specific algorithm: if the scheme ever needs to be rolled forwards, it could be versioned like Go’s and use h1: for the current selection and h2:... etc. for future selections.
  3. Regardless of digest supported, the construction here needs to perform adequate domain separation: a common cryptographic vulnerability when chaining/nesting hashes together is failing to separate each logical unit of work, allowing an attacker to produce a hash collision despite the hash itself being cryptographically sound. My colleagues wrote a good blog post on this. The way Go’s dirhash does this is by forbidding newlines in filenames and then using \n as the domain separator – I believe the current hash_dir reference implementation effectively achieves the same separation by using Blake2’s personalization ability (see the short illustration after this list), but I need to think a bit more about that to convince myself of that :slightly_smiling_face:
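
A short illustration of the personalization mechanism mentioned in point 3 (the primitive only, not the proposal’s exact construction):

from hashlib import blake2b

# person= provides built-in domain separation: identical input bytes under
# different personalization strings yield unrelated digests
assert (
    blake2b(b"payload", person=b"files").digest()
    != blake2b(b"payload", person=b"dirs").digest()
)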

TL;DR: Unless there’s a strong motivating reason to roll our own directory hashing construction, I recommend that Python packaging crib Go’s dirhash – the underlying hash function is just as secure, the construction was built by professional cryptographers with footguns like domain separation in mind, and there’s a reference implementation that can be used for behavioral/differential testing.
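
For concreteness, my reading of dirhash’s “h1:” construction transliterated to Python (hedged: check this against golang.org/x/mod/sumdb/dirhash before relying on it):

import base64
import hashlib
from pathlib import Path


def dirhash_h1(dir_path: Path, prefix: str) -> str:
    # list files as "<prefix>/<relative path>" using forward slashes, sorted
    entries = sorted(
        (f"{prefix}/{p.relative_to(dir_path).as_posix()}", p)
        for p in dir_path.rglob("*")
        if p.is_file()
    )
    h = hashlib.sha256()
    for name, path in entries:
        if "\n" in name:
            raise ValueError("dirhash cannot hash names containing newlines")
        # one "<sha256 hex>  <name>\n" line per file (two spaces, like sha256sum)
        file_hash = hashlib.sha256(path.read_bytes()).hexdigest()
        h.update(f"{file_hash}  {name}\n".encode())
    return "h1:" + base64.standard_b64encode(h.digest()).decode()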


  1. Blake2 is faster in typical workloads than SHA-256, however, which might be a sufficient reason to pick it. ↩︎

6 Likes

This was definitely part of why I reached for blake2. I figured using a hash function with these concerns baked into its construction by the experts creating and reviewing it would help ensure I didn’t accidentally create something less secure than intended. It was much easier to say “this was already evaluated by experts for this use” and then convince myself that I hadn’t missed a major concern than to do the same with sha256. I was unaware of the history on dirhash, but it looks more similar to the original hash_dir function I started with. While I think what I have here locally for sha256 is safe to use, I’m sure it’s not worth the time to fully evaluate right now, with easier-to-review options in either the blake2 implementation above or the existing dirhash from Go. [1]

While I definitely agree with the appeal of using dirhash from the perspective of having less cryptographic soundness to review, relying on prior expert construction, it does not appear that dirhash shares the same desirable behaviors around portable filenames (the OS path separator appears to be an implementation detail) and ensuring all symlinks resolve within the directory being hashed. While this could be addressed with preprocessing, it’s another thing that would need to be documented for those implementing it.


  1. My forbidden character in file names for the sha256 version is “/”, given that it’s already uniformly forbidden in filenames across the major operating systems, and that I was building it up recursively like the most recent blake2 solution rather than iteratively, fully relative to the base path like the original, so the full path need not be recorded. This does result in more work than the flattened iterative approach. Before this, I likely would have still gotten domain separation correct, because I was already primed to be aware of it by what I had re-discovered while considering hash digest functions; but cryptography is not my area of expertise, only an area I have found personal enjoyment in learning. ↩︎

1 Like

This is badly documented on Go’s part, but dirhash uses forward-slash uniformly. This is mentioned under DirFiles.

FWICT, Go’s dirhash seemingly doesn’t check symlinks at all – it hashes any and all files in the directory, regardless of their resolved path. I’m a little murky on why this would be a security issue, but it’s possible I’m missing something obvious or underthinking it :slightly_smiling_face:.