hashlib.file_digest() should accept a pathlib.Path object to compute the hash of a file in the filesystem

Hashing a large file from the filesystem is a very common operation. The new hashlib.file_digest() in Python 3.11 makes this much easier by taking a file object, and it avoids the pitfall of a naive implementation that slurps a potentially very large file into memory. But it still requires a file object that you have to open and close yourself:

with open("file.txt", "rb") as f:
    print(hashlib.file_digest(f, hashlib.sha256).hexdigest())

A natural extension would be for hashlib.file_digest() to also accept a pathlib.Path, so you could instead do:

print(hashlib.file_digest(pathlib.Path("file.txt"), hashlib.sha256).hexdigest())

The main benefit of natively accepting a pathlib.Path is that callers of file_digest() no longer need to worry about closing the file object.

Currently, if you do:

print(hashlib.file_digest(open("file.txt", "rb"), hashlib.sha256).hexdigest())

That leaves the file object unclosed until it happens to be garbage-collected, which is bad.

Another alternative:

print(hashlib.sha256(pathlib.Path("file.txt").read_bytes()).hexdigest())

This works, but it slurps the whole file into memory. That is fine for smaller files but problematic for large ones.
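For comparison, here is a minimal sketch of the chunked loop people typically wrote by hand before file_digest() existed. The function name sha256_of_file and the 64 KiB chunk size are just illustrative choices, and this is not meant to mirror how the standard library implements file_digest() internally:

```python
import hashlib


def sha256_of_file(path, chunk_size=64 * 1024):
    """Hash a file in fixed-size chunks so memory use stays bounded."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # read() returns b"" at end of file, which ends the loop.
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

It gives the same hex digest as the read_bytes() version, but never holds more than one chunk in memory at a time.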

This is how the json module works iirc. You need to open a file object and close it yourself. I would agree that this is potentially problematic if the open context manager did not exist. But since it exists, I think the current way is fine.

Also, wouldn’t tying any function that works on a file object to a filesystem operation make it hard to work with non-filesystem file-like objects? (I think these exist but I rarely work with such objects.)

A universal rule of thumb that any API that takes a file object should also take a Path sounds very sensible on the surface, but I think it would likely snowball into undesirable designs.

The use case for accepting a pathlib.Path is not as strong for json as it is for file_digest(). The return value of json.load() is usually about as big as the input file, so if you know the file is always small enough to safely json.load(...), then it’s usually small enough for json.loads(Path(...).read_bytes()) too.

When you need to be able to handle bigger files and the standard library does support incremental parsing (e.g. csv, xml.sax), the code for incremental parsing is usually far from a one-liner to begin with. You almost always want to control the lifetime of the file explicitly when parsing incrementally, so requiring an explicit context manager makes a lot of sense:

# newline="" is what the csv docs recommend when opening files for csv
with pathlib.Path(...).open(newline="") as file:
    csvfile = csv.DictReader(file)
    ...

If those libraries did take a Path, then DictReader etc. would have to become context managers too:

path = pathlib.Path(...)
with csv.DictReader(path) as csvfile:
    ...

I’m not a fan of all sorts of libraries becoming context managers and having to handle closing their own file objects.

However, with hashlib, “incremental” hashing usually doesn’t make sense: most uses of a hash are to hash the entire file. And unlike csv/xml/json, digest objects wouldn’t need to become context managers if file_digest() started accepting a Path.

Also, note that an object implementing the Path interface does not necessarily have to be a filesystem path. It doesn’t even have to implement the whole Path interface; it just has to implement an .open(mode=...) method that returns a file object. There are many third-party libraries that implement the pathlib interface for other kinds of hierarchical data, like archive/compression file formats or remote APIs like FTP, S3, or HTTP paths.


Here’s some syntactic sugar to generalize the idea of supplying a Path where an open file is expected as the first argument:

from dataclasses import dataclass
from functools import partial
from pathlib import Path
from typing import Optional


@dataclass
class FileParams:
    mode: str = 'r'
    buffering: int = -1
    encoding: Optional[str] = None
    errors: Optional[str] = None
    newline: Optional[str] = None


def _invoke(path, params, func, *args, **kwargs):
    with open(path, **params) as f:
        return func(f, *args, **kwargs)


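# Subclass the concrete class (PosixPath/WindowsPath): subclassing
# pathlib.Path directly only works on Python 3.12+.
class InvokablePath(type(Path())):
    def invoker(self, *args, **kwargs):
        return partial(_invoke, self, vars(FileParams(*args, **kwargs)))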

Now it should be possible to write things like:

print(InvokablePath('file.txt').invoker('rb')(hashlib.file_digest, hashlib.sha256).hexdigest())

The calling code is completely DRY (although it doesn’t really save typing) and the file will be context-managed without needing the with syntax. The partial-application design is so that arguments for the open call can be kept separate from arguments for hashlib.file_digest, json.load etc. while still supporting *args and **kwargs for both.

One could envision the existing Path providing this functionality built-in, rather than needing a subclass wrapper (although the naming probably needs work).
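If the subclass is the sticking point, the same idea also works as a plain function. This is only a sketch; call_with_open and its open_kwargs parameter are names I made up, and keeping the open() arguments in a separate dict is one possible way to keep them apart from the callee’s arguments:

```python
def call_with_open(path, func, *args, open_kwargs=None, **kwargs):
    """Open path, call func(file, *args, **kwargs), and always close the file."""
    with open(path, **(open_kwargs or {})) as f:
        return func(f, *args, **kwargs)
```

For example, call_with_open("file.txt", hashlib.file_digest, "sha256", open_kwargs={"mode": "rb"}).hexdigest() hashes the file, and call_with_open("data.json", json.load) parses one, without a with block at the call site.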