Hashing a large file from the filesystem is a very common operation. The new hashlib.file_digest() in Python 3.11 makes this task much easier by taking a file object, and it avoids the pitfall of a naive implementation that slurps a potentially very large file into memory. But it still requires you to open and close a file object yourself:
with open("file.txt", "rb") as f:
    print(hashlib.file_digest(f, hashlib.sha256))
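For context, this is roughly the pattern file_digest() replaces. A hedged sketch of the pre-3.11 chunked-reading approach (the function name and chunk size here are my own choices, not from the thread):

```python
import hashlib

def file_sha256(path, chunk_size=2**20):
    """Hash a file in fixed-size chunks so the whole file never sits in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # read() returns b"" at EOF, which ends the loop.
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h
```

This memory-bounded loop is exactly the pitfall-avoidance that file_digest() now provides for us.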
It seems a natural extension that hashlib.file_digest() should also take a pathlib.Path(), so you can instead do:
print(hashlib.file_digest(pathlib.Path("file.txt"), hashlib.sha256))
This is how the json module works, iirc: you need to open a file object and close it yourself. I would agree that this is potentially problematic if the open() context manager did not exist. But since it exists, I think the current way is fine.
Also, wouldn’t tying any function that works on a file object to a filesystem operation make it harder to work with non-filesystem file-like objects? (I think these exist, but I rarely work with such objects.)
A universal rule of thumb that APIs that take a file object should also take a Path sounds very sensible on the surface, but I think it would likely snowball into undesirable designs.
The use case for taking pathlib.Path() is not as strong for json as it would be for file_digest(). The return value of json.load() is usually about as big as the input file; so if you know that the file is always small enough to safely json.load(...), then it’s usually small enough for json.loads(Path(...).read_bytes()) too.
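To make the equivalence concrete, here is a small illustration (the file name and contents are hypothetical, chosen just for this demo):

```python
import json
from pathlib import Path

path = Path("config.json")
path.write_text('{"answer": 42}')

# One-liner without an explicit context manager:
data = json.loads(path.read_bytes())

# Equivalent explicit version -- same result, same memory profile,
# since json.load() buffers the whole document either way:
with path.open("rb") as f:
    assert json.load(f) == data
```

Either way the full document ends up in memory, which is why a Path-accepting json.load() would buy relatively little.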
When you need to be able to handle bigger files, and where the standard library does support incremental parsing (e.g. csv, xml.sax), the incremental-parsing code is usually already far from a one-liner to begin with. You almost always want to explicitly control the lifetime of the file anyway when parsing incrementally, so requiring an explicit context manager makes a lot of sense:
with pathlib.Path(...).open() as file:
    csvfile = csv.DictReader(file)
    ...
If those libraries did take a Path(), then DictReader, etc. would have to become context managers too:
path = pathlib.Path(...)
with csv.DictReader(path) as csvfile:
    ...
I’m not a fan of all sorts of libraries proliferating into context managers and starting to need to handle closing their file objects themselves.
However, with hashlib, “incremental” hashing doesn’t usually make sense. Most uses of hashing are to hash the entire file, so unlike csv/xml/json, digest objects wouldn’t need to become context managers if file_digest() started accepting a Path.
Also, note that an object implementing the Path interface does not necessarily have to be a filesystem path. It doesn’t even have to implement the whole Path interface; it just has to implement an .open(mode=...) method that returns a file object. There are many third-party libraries that implement the pathlib interface for other types of hierarchical data, like archive/compression file formats, or for remote APIs like FTP, S3, or HTTP paths.
The calling code is completely DRY (although it doesn’t really save typing) and the file is context-managed without needing the with syntax. The partial-application design keeps the arguments for the open() call separate from the arguments for hashlib.file_digest(), json.load(), etc., while still supporting *args and **kwargs for both.
One could envision the existing Path providing this functionality built in, rather than needing a subclass wrapper (although the naming probably needs work).