This is an offshoot from PEP 751: now with graphs!
I’m proposing a method based on blake2, given its suitability for combining digests, its guaranteed availability in the standard library (it is one of hashlib’s guaranteed algorithms, not an optional one), and the availability of reference implementations.
While blake3 is faster, there is no current evidence that blake2 is insufficiently secure, and blake3 is not guaranteed to be available in Python currently.
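For context on the “combining digests” point: hashlib exposes BLAKE2’s tree-hashing parameters directly, so component digests can be fed into a root node without any ad-hoc concatenation scheme. A minimal sketch (the personalization strings and inputs here are arbitrary, purely for illustration):

```python
from hashlib import blake2b

# Two "leaf" digests, domain-separated via the `person` parameter
left = blake2b(b"chunk-a", person=b"demo-leaf")
right = blake2b(b"chunk-b", person=b"demo-leaf")

# A root node marked last_node=True combines the leaf digests
root = blake2b(last_node=True, digest_size=32)
root.update(left.digest())
root.update(right.digest())
print(root.hexdigest())  # 32-byte digest, 64 hex characters
```

The `person` parameter gives cheap domain separation between the different components being combined, which the implementation below relies on.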
The method takes the following into account:
- That any symlinks in the directory must only point inside the directory
- A symlink to a path is treated as if its target were duplicated at the symlink’s location
- This allows for portable behavior preferring symlinks on supported systems
- File metadata such as last modified date is not hashed
- The directory’s name does matter
These considerations allow a directory to be relocated portably, including via tools that don’t preserve metadata, and still resolve to the same hash, while retaining the parts we need to care about for source trees in Python.
I’ve written a reference implementation:
```python
from hashlib import blake2b
from pathlib import Path


def _check_symlink(base_path: Path, to_check: Path, /) -> Path:
    """Resolve a symlink and ensure its target stays inside base_path."""
    if to_check.is_symlink():
        resolved = to_check.resolve(strict=True)
        try:
            resolved.relative_to(base_path)
        except ValueError as exc:
            msg = "Dangerous symlink detected"
            raise ValueError(msg) from exc
        return resolved
    return to_check


def _portable_path_repr(base_path: Path, dir_path: Path, /) -> bytes:
    """Don't rely on os.sep; hash a forward-slash relative representation."""
    base = base_path.as_posix()
    d = dir_path.as_posix()
    return ("." + d.removeprefix(base)).encode()


def hash_dir(dir_path: Path, /) -> str:
    if not dir_path.is_dir():
        msg = "This function is intended for use on directories"
        raise ValueError(msg)
    rel_root = dir_path.resolve(strict=True)
    dir_component = blake2b(last_node=False, person=b"dirs")
    file_component = blake2b(last_node=False, person=b"files")
    data_component = blake2b(last_node=True, person=b"data")
    root_node = blake2b(last_node=True, digest_size=32)
    root_node.update(dir_path.name.encode())
    # Path.walk requires Python 3.12+; walk the resolved root so the
    # containment check and the relative repr both see absolute paths.
    walk_gen = rel_root.walk(top_down=False, follow_symlinks=False)
    for root, dirs, files in walk_gen:
        for name in dirs:
            fp = root / name
            resolved = _check_symlink(rel_root, fp)
            dir_component.update(_portable_path_repr(rel_root, resolved))
        for name in files:
            fp = root / name
            resolved = _check_symlink(rel_root, fp)
            file_component.update(_portable_path_repr(rel_root, resolved))
            data_component.update(resolved.read_bytes())
    root_node.update(dir_component.digest())
    root_node.update(file_component.digest())
    root_node.update(data_component.digest())
    return root_node.hexdigest()
```
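The relocation property can be checked in isolation on the path-representation step: the bytes fed into the hash depend only on a file’s location *within* the tree, not on where the tree sits on disk. A small self-contained sketch (the helper and paths here are illustrative, mirroring `_portable_path_repr` above):

```python
from pathlib import PurePosixPath


def portable_repr(base: PurePosixPath, p: PurePosixPath) -> bytes:
    # Same shape as _portable_path_repr: relative, forward-slash, bytes
    return ("." + p.as_posix().removeprefix(base.as_posix())).encode()


# The same tree at two different absolute locations yields identical bytes
a = portable_repr(PurePosixPath("/home/a/proj"), PurePosixPath("/home/a/proj/src/m.py"))
b = portable_repr(PurePosixPath("/mnt/ci/proj"), PurePosixPath("/mnt/ci/proj/src/m.py"))
assert a == b == b"./src/m.py"
```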
There are a few places in this where the behavior of the standard library is implicitly part of the algorithm, and this might need adjusting to explicitly avoid a dependency on behavior which may change in the future.
In particular, while unlikely to change, these may be useful for non-Python implementations such as uv:
- The defaults for `hashlib.blake2b` (unlikely to change)
- The ordering of `pathlib.Path.walk`
- The specific output format of `os.path.relpath`
- The specific output of `pathlib.Path.as_posix()`
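Two of these behaviors are easy to pin down as spot checks for a non-Python implementation. The values below are facts of the current stdlib, not guarantees from any spec:

```python
from hashlib import blake2b
from pathlib import PureWindowsPath

# blake2b's default digest size is 64 bytes (512 bits)
assert blake2b().digest_size == 64

# as_posix() always emits forward slashes, even for Windows-flavored paths
assert PureWindowsPath("pkg\\src\\mod.py").as_posix() == "pkg/src/mod.py"
```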
While I did consider other hash functions and other means of combining hashes, I was unhappy with the approaches needed for hashes not designed with incremental tree updates in mind, such as building an intermediate container as an implementation detail.
We might need a name for this process as well, so that if a better way of hashing a directory becomes viable in the future, it’s possible to communicate which process was used. As this is not raw use of blake2 on an input, the name should not be “blake2”, but might contain it. I don’t have a preference on naming, and would prefer to leave that bit of bikeshedding until we’ve figured out whether there is consensus on this being possible.
Note: Implementation and concerns have received an update for file separators and to include empty dirs