How to hash a directory in lockfiles

If you’re only considering being able to compromise the hash, I don’t think you’re missing anything or underthinking it.

I was considering two consequences of a mandated algorithm as well as the use cases people have said that they have.

  • If the algorithm is mandated, and so far I think at least the two of us agree that it needs to be, then there are novel resource-exhaustion attacks, like dropping a symlink that points at the drive root somewhere it will eventually be picked up downstream and hashed, to attack the resources of whoever is doing the hashing (they can’t fail early; they won’t know the hash mismatches until hashing completes). Addressing that would require revising the algorithm, or revising all tools that use it, rather than revising one unified tool that uses the algorithm (we don’t have one unified tool). Individual tools could address this themselves in an opinionated way, but then this ceases to be interoperability between tools in the ecosystem if some tools allow it and others fail it.

  • People have clarified that this isn’t just for locking local source trees in their view. Cases described as “stability” and “reproducible science” have also been brought up, with the note that these cases often involve sharing a directory of loose files. Mandating that the algorithm fail before distributing a directory that isn’t as standalone as expected would be better for cases like this, as someone interested in shipping something reproducible would want to be alerted to that. Since we need to mandate an algorithm, my own perspective was that the algorithm should do as much as is reasonable, without compromising security, to assist these varied use cases.

This is true, but I’m not super clear on the threat model: are there attackers who have the ability to drop arbitrary symlinks but not the ability to drop arbitrary files? In practice, symlink creation is typically as privileged as (if not slightly more privileged than) file creation.

I understand this perspective, and I think these are legitimate/valid use cases that deserve a standard algorithm! But this is an argument for either approach regardless of symlink handling, right?

(I think my underlying position here is that you’re 100% right about picking one thing, and not allowing any unnecessary agility, and I have a preference for picking the thing that has prior art in another ecosystem. But this is not something I feel especially strongly about, modulo the symlink parts – if there’s consensus that Python wants its own dir hashing primitive, then yours seems like a very good design.)

To be clear, I don’t think this particular angle is likely; it’s a novel non-security attack that would only exploit existing partial trust, and would still fail at the point where hashing happens to confirm the files are what was trusted before. In the worst reasonably foreseeable case, it’s probably only an annoyance that works exactly until noticed, in something like a CI job that pulls in remote files from a partially trusted source and verifies the set of files against a trusted lock file. Setting a timeout on the CI job would prevent future annoyances.

There have been a handful of more problematic issues with symlinks and trust before, for example with git submodules. However, if the lock file is fully trusted, this would fail before anything outright dangerous happens, with or without this restriction. This is one of those niche edge cases that makes it seem like there might be more here, but all I can currently conceive of is an avoidable annoyance without a compelling use case to allow it.

I think even if we adopt dirhash from Go, we’d want our own documentation of the steps and a pure Python reference implementation based on what’s guaranteed to be available in the standard library, so that the majority of existing tooling does not become responsible for implementing this from scratch, and so that there is an implementation everyone in the ecosystem can reasonably read and follow along with. I can’t expect people working on Python tooling to be able to follow Go source code and fully understand its subtleties. I’ll admit that even though I’m somewhat familiar with Go, on first read I missed, both in the documentation and in the code, that forward slashes would be used uniformly (though none of the Go code I’ve ever written touches the filesystem).

The unnecessary-agility point would argue against supporting symlinks that point outside the directory, but we could just as easily implement something that produces hashes equivalent to dirhash when that constraint is met. I’d be amenable to providing such an implementation and documenting the full algorithm in a single location, rather than split across the functions used internally, if this is an approach people would prefer.
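
For discussion’s sake, a pure-Python sketch of that idea, assuming the construction the thread attributes to Go’s dirhash (sha256 per file, one "hex-digest␣␣slash-separated-path" line per file in codepoint order, sha256 over the concatenated lines, "h1:" plus base64 of the result) while rejecting symlinks outright. This is illustrative and not guaranteed byte-for-byte compatible with Go’s output:

```python
import base64
import hashlib
from pathlib import Path


def dirhash_sketch(root: str) -> str:
    """Hash a directory tree in the spirit of Go's dirhash "h1:" form.

    Sketch only: one "<hex sha256>  <slash-separated path>\n" line per
    regular file, lines ordered by the relative path's codepoints, and
    an outer sha256 over the concatenation. Symlinks are rejected
    rather than followed, per the restriction discussed above.
    """
    base = Path(root)
    entries = []
    for path in sorted(base.rglob("*")):
        if path.is_symlink():
            raise ValueError(f"refusing to hash symlink: {path}")
        if not path.is_file():
            continue  # directories contribute only through their files
        file_digest = hashlib.sha256(path.read_bytes()).hexdigest()
        rel = path.relative_to(base).as_posix()  # forward slashes uniformly
        entries.append((rel, f"{file_digest}  {rel}\n"))
    entries.sort()  # codepoint order on the relative paths
    joined = "".join(line for _, line in entries).encode()
    return "h1:" + base64.standard_b64encode(hashlib.sha256(joined).digest()).decode()
```

The point is mostly that the whole algorithm fits in one screen of standard-library Python, which is the kind of single-location documentation I have in mind.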

Going back to something you said before, I don’t think this is true. If we’re considering things like FIPS-mode use of OpenSSL to be the source of the availability of these digest functions, it’s worth pointing out that blake2b and sha256 are both guaranteed to be available in Python’s hashlib even when not provided by OpenSSL, and I’m reasonably sure FIPS compliance would not preclude the use of blake2b here, as this use case is out of scope for what FIPS regulates.

It might have been good to pick blake3 rather than blake2b here, but blake3 is unfortunately not in Python’s standard library right now, and even if it were included for 3.14, it would be ~5 years away from a reasonable expectation of general ecosystem availability.

In terms of performance, both in total speed and in watt-seconds (for both parallel and non-parallel use), blake3 beats blake2b, which beats sha256, even on architectures with specialized sha256 instructions that get used (I can provide analysis of this later if people aren’t familiar). So I’d personally lean towards blake2b for Python, given the information, use cases, concerns, and digest availability I’m currently aware of, but I think either this or a customized version of Go’s dirhash that restricts symlinks would be acceptable.
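
On the availability and suitability points, a quick illustration: blake2b is in hashlib’s guaranteed set regardless of the linked OpenSSL, and it has built-in parameters (like `person` for personalization) that give cheap domain separation without inventing ad hoc framing. The `b"dir:leaf"`/`b"dir:node"` tags below are made-up examples, not a proposed format:

```python
import hashlib

# blake2b ships with CPython itself, independent of the linked OpenSSL.
assert "blake2b" in hashlib.algorithms_guaranteed

# Plain use as a file digest, truncated to 32 bytes if desired:
h = hashlib.blake2b(b"some file contents", digest_size=32)
print(h.hexdigest())

# Built-in personalization gives domain separation directly: the same
# input under different `person` tags yields unrelated digests, which
# is the kind of thing a directory-hash construction needs so that a
# leaf hash can never be confused with an interior/combined hash.
leaf = hashlib.blake2b(b"file bytes", digest_size=32, person=b"dir:leaf")
node = hashlib.blake2b(leaf.digest(), digest_size=32, person=b"dir:node")
```

(`person` is limited to 16 bytes for blake2b; both tags above fit.)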


A use case that I’m aware of for “external” symlinks is single-sourced version and README files in nontrivial packaging setups. I don’t think these are good per se, but I’ve written a few symlinks like the following over the years:

some-repo/
  README.md
  core/
    # ...
  python/
    package-a/
      README.md: ../../README.md
    package-b/
      README.md: ../../README.md

(I’d consider it a fair judgement to say “this is bad,” and to restrict external symlinks regardless. But I figure an example helps with characterization here.)

Fully agreed! Sorry, I didn’t mean to imply that we wouldn’t do these things: we would certainly need a Python reference implementation.

FWIW, this was meant to be more generic. It was more about SHA-256 being a universally available strong cryptographic hash, not a claim that Blake2 won’t be available in Python. For Python in particular, I fully agree that the presence of Blake2 is not a concern.

(The underlying thinking here was that Python packaging information gets produced and consumed in a lot of ecosystems that aren’t themselves Python, e.g. Rust, random shell scripts, Homebrew formulae, etc. SHA-2 is a much less “exotic” strong general-purpose digest in these settings.)


I wouldn’t call blake2 “exotic”, and it’s widely available everywhere.

  • It’s the default hash in libsodium
  • It’s part of Argon2
  • It’s used in WireGuard
  • There’s a formally verified implementation of it
  • There are fat binaries provided by the authors
  • The Linux kernel uses it
  • The authors mention all of the above and more, and link to third-party implementations for WASM, Rust, Go, Java, Scala, Haskell, Python, C, Dart, and even PHP, all of which existed before 2019, some as far back as 2012

I understand that fewer people are familiar with the hash than have heard of sha256, but I don’t think availability is a concern even outside of python.


I think in relative terms, it is. The measuring stick I’m using is the universal presence of shasum/sha256sum compared to b2sum for the Blake2 variants (which I have locally, but only because I installed it via Homebrew). This may seem silly, but I think it’s an important part of interoperation: a lot of Python package consumption tooling isn’t written in Python, and SHA-2 is universally available in other settings (without a third-party implementation) while providing the same security margin.

This isn’t intended to be a value judgement – Blake2 is great, and as I noted PyPI uses it internally. But I think in relative terms SHA-2 is a “commodity” hash and Blake2 is an “enthusiast’s choice” hash, and IMO external Python packaging interfaces should pick the thing that’s strictly more commodity by default (both in the interest of interop, and because it lets us skip the design step and use a well-defined construction that already exists).

As two related datapoints: neither Ruby’s Digest API nor the JVM’s MessageDigest includes Blake2 variants by default, but both include SHA-2. Both ecosystems have mature third-party implementations, but defaults matter!


b2sum is widely available on Linux systems - it’s part of GNU coreutils.

Yep, and that doesn’t even cover normalized forms.

Right, we have to worry about tooling for these standards that isn’t written in Python, which is why I’m at least asking whether choosing the hash algorithm is truly required to make this work.

But that doesn’t cover macOS or Windows, where support would also be important, and Python works on more than those OSs. And since this is motivated by source trees, we need broader compatibility than e.g. typical wheel files, as people would be making sdists from the source tree to then make a wheel.


IMO no, it’s not a hard requirement.

I think my baseline position here is that picking a known-good approach (like Go’s dirhash) instead of rolling our own is a “win” because we can piggyback on an already extant construction that’s sound (domain separation), strong (SHA-2), and universal (SHA-2). A new construction could have all of these properties (except universality, per above), but IMO the parsimonious thing to do is use the thing that already exists and has a good track record in a similar ecosystem (Go).


A b2sum fat binary is available for Windows and Linux, implementations exist in every programming language I’ve seen used to interact with packaging, and any language with reasonable FFI can just use the HACL* implementation that Python itself leans on.

I think choosing the hash algorithm is required, because this is an interoperability question. This is also a reason against sha256 in my view, as we’d be mandating a worse-performing hash, and this would likely be a long-term standard that interacts with a lot of files and computing resources (at least if wheels are any indication).

If someone can make a convincing argument that it’s fine for a lockfile consumer to reject a lockfile for lack of support for an algorithm, and that this is something for the publisher and consumer to reconcile between themselves so long as we document how each algorithm should work when supported, then I think we’d want an implementation that looks like Go’s dirhash for those needing sha256, the most recent revision above using blake2b, and possibly a blake3 variant in the future.


I had to give this more thought. I still have a nagging feeling that symlinking to a directory outside of the directory being hashed opens up too many things we don’t want to have to fully consider the behavior of. Do you (or anyone else seeing this) think it’s common enough that we should consider allowing a symlink to a file, for cases like this, but restrict symlinking to a directory outside? Does anyone have a use case for symlinking to a directory that’s outside the directory being hashed? If so, what if it’s allowed, but only when linking to files or directories that are at the same hierarchical level as the lockfile or below? (i.e. in your example there, the lock file lives at the same level as README.md, and points to each of the directories python/{package-a, package-b} with a hash for each.)
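
For concreteness, the “same level as the lockfile or below” containment check could be expressed with nothing but pathlib. The helper name and policy boundary here are hypothetical, purely to illustrate the rule being discussed:

```python
from pathlib import Path


def symlink_is_contained(link: Path, allowed_root: Path) -> bool:
    """Return True if `link`'s fully resolved target stays inside
    `allowed_root` (e.g. the directory holding the lock file).

    Hypothetical helper for the containment policy under discussion:
    resolve() follows every symlink component, so targets that escape
    via '..' or chained links are caught as well.
    """
    target = link.resolve()
    root = allowed_root.resolve()
    try:
        target.relative_to(root)
    except ValueError:
        return False
    return True
```

Under this rule, your `package-a/README.md -> ../../README.md` example would fail with the hashed package directory as the root, but pass if the lockfile’s directory (the repo root) were the allowed boundary.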


I’m not aware of a use case for symlinking a whole directory outside of the prefix! I suppose there could be something obtuse like subpackages foo and bar both sharing common code/extensions via a symlinked directory, but that’s not something I’ve actually seen. Curious what other folks think.

It’s not uncommon in monorepos, but one could argue that the security prefix here is the repository root, not the root of the hashed directory. That could be permitted as an additional parameter, but it probably adds more complexity than it’s worth.

You might be able to make it work by following symlinks outside the prefix, but treating those within the prefix as their own namespace and hashing the target path instead. You could then recreate the directory tree with different symlinks, provided all the files in them are the same. But I’d think that breaks the usefulness of having an “external” reference, even in the monorepo case.
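
Sketching the “hash the target path instead of the content” idea (names, tags, and framing here are illustrative only, not a proposed format):

```python
import hashlib
import os
from pathlib import Path


def entry_digest(path: Path, root: Path) -> bytes:
    """Digest one directory entry.

    Illustrative sketch: for an in-prefix symlink, hash the link's
    stored target path rather than following it, so a tree recreated
    with identical links verifies without ever reading through them.
    NUL separators and the "file"/"symlink" tags keep the encodings
    of the two entry kinds unambiguous.
    """
    rel = path.relative_to(root).as_posix().encode()
    if path.is_symlink():
        target = os.readlink(path).replace(os.sep, "/").encode()
        return hashlib.sha256(b"symlink\0" + rel + b"\0" + target).digest()
    return hashlib.sha256(b"file\0" + rel + b"\0" + path.read_bytes()).digest()
```

Note the consequence you describe: the symlink’s digest depends only on the target path text, not on what the target currently contains, which is exactly why it weakens an “external” reference.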


I’m not an expert here. I do what I can with the knowledge I have, and I try to learn more whenever it matters. It feels like there are too many claims here that I have reason to care about but that aren’t proven well enough for me to find any expert review consensus on.

It looks clear to me that the most recent blake2b-based option is safe, but only because I trust the people who have reviewed blake2b extensively, there’s no new evidence to the contrary, and it appears to use blake2 in a way blake2 was designed for.

It’s not clear to me that the same holds for Go’s dirhash. Its documentation leads me to believe that someone just had the idea of “let’s sha256 the output of sha256sum for a bunch of files”, because the documentation itself describes it that way first, and I can’t find a review or analysis of it. Various parts aren’t documented and are internal Go behavior, which further makes me question how thorough the design actually was (maybe unfairly, since my search to understand this better showed it was internal to Go at first). The ordering of filenames is not a documented part of dirhash, but an implementation detail of path/filepath.Walk.

I don’t say this with any intent of disrespect to those who constructed it, use it, have advocated for it, or anything else of that nature. A lazy construction that happens to also be safe wouldn’t be a bad thing; lazy isn’t pejorative, it’s smart. I’d call @mikeshardmind’s implementation above pretty lazy too: it’s built in a way where the hash can’t be swapped out, because it uses the things that blake provides.

I have reason to trust many of the people who have said neutral to positive things about using go’s dirhash, but I’d feel better about it if I understood, or at least could find better resources on why people are so confident that this construction is sound enough for this task. If you’ve got any links that I could read or reviews that I might have missed, I’d appreciate it.

It also seems like Go’s may have slightly different concerns based on what it does and doesn’t support, so if we have to tweak its behavior to meet those concerns, are we benefiting fully from using the existing construction?

I don’t have a use for it. While trying to better understand the alternative of using what Go has, I came across a couple of things relevant to this, I think:

  • Go apparently doesn’t follow symlinks for directories for dirhash either.
  • Go also doesn’t include empty directories in the hash, which can cause a behavioral difference for Python packages: empty directories are importable as namespace packages, as @oscarbenjamin pointed out and you adjusted for before.
  • The file paths are walked for the dirfiles step, and therefore emitted with a hash, in lexical ordering; here, lexical ordering means sorting by codepoint as if the paths were UTF-8, not a locale-based sort.

Essentially, we’re looking for a sound construction of a multi-hash.

Modern “out of the box” options here are tuplehash, blake2, and blake3

What @woodruffw linked above covers, somewhat obliquely, how to construct this from repeated use of a secure hash (domain separation). The two specific properties we’re looking for here are collision resistance and second-preimage resistance.

The only existing practical attack on sha256 is out of scope for this function, as we’re not concerned with a blind forgery via a length-extension attack. And though it’s technically true that sha256 is more likely to be fully broken before blake2 if the rate of research and flaw-finding continues unchanged, with current knowledge neither is broken for this purpose if things are constructed properly.
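
A minimal illustration of the “constructed properly” part: when combining many values under a general-purpose digest like sha256, tagging each domain and length-prefixing each field removes the ambiguity that naive concatenation allows. The tags and framing below are illustrative only, not a proposed format:

```python
import hashlib


def _framed(tag: bytes, *fields: bytes) -> bytes:
    # Unambiguous encoding: a domain tag, then each field preceded by
    # its 8-byte big-endian length. Two different field lists can never
    # serialize to the same byte string, so the outer hash inherits the
    # collision resistance of sha256.
    out = [tag]
    for f in fields:
        out.append(len(f).to_bytes(8, "big"))
        out.append(f)
    return b"".join(out)


def leaf_hash(path: bytes, content: bytes) -> bytes:
    return hashlib.sha256(_framed(b"LEAF", path, content)).digest()


def tree_hash(leaves: list[bytes]) -> bytes:
    return hashlib.sha256(_framed(b"TREE", *leaves)).digest()
```

Without the length prefixes, `(b"ab", b"c")` and `(b"a", b"bc")` would hash identically as leaves; with them, field boundaries are part of the input, and the LEAF/TREE tags keep a file hash from ever colliding with a combined hash.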

Thanks for listing these. I’ll probably get around to verifying and collecting everything from this thread so far, and provide the pure Python approach based on dirhash for comparison, some time later this week when I have more time.

Yes, I get it - I’m having a quite similar discussion with someone in a different project, who proposed xxhash for content signatures (for performance reasons). xxhash produces an astonishing number of wheels, but it’s still a package to install, and some legacy platforms aren’t supported.

But blake2 is considerably better off since it’s in hashlib, so we’re only worrying about support utilities used outside of Python, which seems a lesser burden. That’s where I thought it was worth pointing out that at least one common platform has a command-line tool commonly available.

An implementation comparison of the 3 discussed approaches is available at GitHub - mikeshardmind/directory_hash_canidates, including high-level notes on how they differ.

The 3 approaches being:

  • Leverage things that exist in blake2 for this
  • Use a custom construction based on sha256
  • Use a construction like Go’s dirhash (there are minor differences here, but the core construction, and therefore its safety, is identical)

If we have to pick a single winner, I still think the blake2-based implementation is the best option on the table. If people are leaning towards having multiple options, standardizing the options that tools in the ecosystem can use and letting tools choose what they implement, I think that would be blake2 plus one of the others here (I don’t think we need multiple based on sha256).

In any outcome where things can move forward from here, I’ll go into full detail on each one we want to keep and document each so that they do not rely on any Python implementation details in how the methods are described.

It’s been over a year since I brought this topic up; since then, the Blake2 variant in the above has been in use at my day job in internal tooling.

There have been no issues with it in use there, though our use has the benefit of fitting perfectly within how I designed the function (benefits of being fully aware of our own use case before writing it…). I was also recently asked to add a license to the above by someone interested in using it in a reproducible science case.

I believe it is the right approach for this tooling going forward, and would like to move forward with proposing it as a standard. I believe it may also have utility in other emerging use cases, such as for tools that implement things based on PEP 739 (Static description file for build details of Python installations), as it could provide a single hash for the state of an installation as it was recorded.

@sethmlarson @brettcannon I’m aware that there are concerns about using Blake2, since some consider it less available or less accessible than an option built on SHA-2, given that not all packaging tools are written in Python. Would providing multiple native implementations, usable as either a library or a standalone application, in Python, Zig, Go, and Rust be sufficient to alleviate these concerns and unblock the use of Blake2, or are there other concerns with deriving from Blake2 that need addressing?

As long as it’s in hashlib, that’s probably the key thing. But @sethmlarson is the security expert. I’m also not the right person to convince, since pylock.toml is out and this proposal will need to be its own thing.
