While reading through the prospective changes for Python 3.14 I noticed something in the API of the new pathlib.Path.scandir() that could trip up users. The post here is prompted by a suggestion to do so in response to this GH comment.
In my opinion, Path.scandir() returning an iterator of os.DirEntry directly, whose .path member is a str or bytes, rather than a Path, is unexpected.
At least on first glance, it’s unexpected that a member named “path” on an object returned from a method of Path is not actually of type Path.
with path.scandir() as entries:
for entry in entries:
assert isinstance(entry.path, Path), "unexpected type"
There are two other factors to this: As pointed out by @barneygale on GH, a path to the entry can be acquired with path / entry.name (or alternatively with Path(entry.path)). The other factor is that solving this would require another type similar to os.DirEntry, and I’m not sure if the improved ergonomics is worth creating another type for it.
I just wanted to bring this up before the API is set in stone, since changing it after the fact isn’t really possible without breaking changes.
If Path.scandir were the only method to get a directory’s contents, you’d be right. But there is already Path.iterdir for an iterator of Paths. Also Path.glob for a recursive search. And Path.walk doesn’t yield Paths either.
Path.scandir is there to provide convenient access to os.scandir, and provide more info and potential performance boosts:
Using scandir() instead of iterdir() can significantly increase the performance of code that also needs file type or file attribute information, because os.DirEntry objects expose this information if the operating system provides it when scanning a directory.
However, I don’t understand the point in making it a context manager, and in the example, despite it being named entry, calling Dir_Entry.is_path could be confused with Path.is_dir.
This is called automatically when the iterator is exhausted or garbage collected, or when an error happens during iterating. However it is advisable to call it explicitly or use the with statement.
I feel quite torn on this, and I welcome everyone’s input.
FWIW, PEP 471 – os.scandir() function landed in Python 3.5, whereas PEP 519 – adding a filesystem path protocol landed in 3.6. Had os.fspath(dir_entry) been supported when os.scandir() was added, I think it’s possible that the os.DirEntry.path attribute might never have been added, which might have avoided the type confusion we’re experiencing now.
I use my own Path object.
And my Path.scandir has extra argument return_type=2:
a) 0 - returns DirEntry
b) 1 - returns DirEntry.path
c) 2 - returns Path objects.
Restricting it to Path objects I found is not a good idea as DirEntry has a lot of efficient good things, while it is inconvenience not being able to get Path objects.
I do not like functions that can change their return type based on a parameter. I feel like this makes it less ergonomic if anything because of the added mental load.
I think it should be Path because it’s the natural expectation and will likely satisfy the most common cases. Perhaps, even a new type (subclass?) that adds the missing attributes from os.DirEntry?
It’s producing DirEntry objects that provides the scandir laziness. If we don’t want laziness, then the existing iterdir method is a better choice.
While we could potentially try to do something clever about the path attribute, that introduces its own problems due to the differences it would create between the two ways of invoking scandir.
I think what @ncoghlan is saying is that, if the user iterates through entries and throws most of them away, that allows the implementation to avoid system calls.
I suspect that careful design of the entries iterator could obviate most of the overhead though.
To recap:
scandir() doesn’t load all directory entries into memory at once, which is why there’s the iterator
DirEntry .name and .path are free
DirEntry .is_file/dir/symlink are free on modern filesystems
DirEntry .stat incurs a system call
I would think that a DirEntry could be wrapped into a Path, or perhaps a Path-like object that would provide the same free/cheap/expensive semantics. Am I wrong?
Or is the issue, perhaps, that a Path could outlive the iteration, and therefore every element must be reified?
Finally, should Path.scandir() perhaps be considered a low-level primitive with Path.glob() and Path.walk() being the high-level primitives that user should be steered towards?
Personally (and not sure if I’ve said this elsewhere on this issue), I think os.scandir() is the low-level primitive, and if users particularly want that functionality then they should call os.scandir(Path).
Anything we attach to Path is an abstract operation - we have to allow for it to be implemented by another kind of Path-based hierarchy, and its return values should return compatible Paths that remain within that hierarchy. For example, if you create a path referring to an S3 bucket/blob container, then Path.scandir() needs to return more paths within that container, rather than file system paths.
Of course, Path isn’t actually the abstract base, so what we’re discussing here is a helper on WindowsPath or PosixPath[1], and not something that is generic. I’m not sure it’s worth it. To create an algorithm that works with abstract Path is useful, but if you intend to only work with local file system paths, then os.scandir(Path) will duck-type validate that for you in the most Pythonic way.
Or a common base class that is guaranteed to only refer to a local filesystem. ↩︎
If we decide to revert the addition of Path.scandir() then I’m OK with that, for what it’s worth.
I am interested in adding an abstract version of pathlib.Path, and tbh this was my main motivation for adding a scandir() method. The abstract base class already exists in cpython, but it’s private:
Previously the PathBase.iterdir() method was abstract, but this hampered performance for the PathBase.walk(), glob() and copy() methods, which were implemented using self.iterdir(). These methods are now implemented using self.scandir(), which improves performance a great deal. And it improves performance of Path.copy() too, because Path inherits the PathBase.copy() implementation.
But the pathlib ABCs are still private, and so it’s possible to conclude that none of the above matters. I’m sympathetic to this view and I’m open to adding scandir() only if we decide to make PathBase public, which would require a PEP but not much more technical work.
Some other non-ABC motivation for adding a scandir() method can be found in a previous post of mine
These could also be improved by implementing them with os.scandir in the subclasses (possibly with a helper function for shared code). That’s kind of the model with ABCs - code that works in the base, and subclasses implement what is essential and override what they can improve.
Absolutely, and that’s what we do in the implementations of Path.glob() and Path.walk() – they call os.scandir() directly, and in fact they work with strings internally for performance reasons, only converting to/from Path at the edges.
But it’s a shame for every implementation of PathBase to need to re-implement the walking, globing and copying algorithms if they want to make use of cached information about directory children. For example, the Artifactory “Get Folder Info” endpoint annotates each child with "folder": true or false, and it would be most excellent if PathBase could use this information in its default implementation of glob(), rather than needing to make additional HTTP calls to establish each child’s type as an iterdir()-based implementation would. Artifactory isn’t an oddball case either - it’s quite common to know more about directory children than just their names.
Yeah, and IME it’s more common to retrieve directories and files separately than to get them all at once and query later (or get them with the attribute and have to strip it).
Maybe we can add some kind of “cached stat” attribute that isn’t forced to go and refresh all the info? That’s probably the biggest win on DirEntry, and it’s only really because we started without caching that we’re forced to make every operation do a real stat each time (by contrast, most similar abstractions from the Windows side start with caching, because you wouldn’t dream of doing a fresh stat each time).
The TOCTOU risks do get worse, but only practically, not technically (as in, you needed to handle them anyway, you’re just more likely to get burnt if you use a cached attribute for the path type).
Maybe we can add some kind of “cached stat” attribute that isn’t forced to go and refresh all the info? That’s probably the biggest win on DirEntry, and it’s only really because we started without caching that we’re forced to make every operation do a real stat each time (by contrast, most similar abstractions from the Windows side start with caching, because you wouldn’t dream of doing a fresh stat each time).
Yeah, I wouldn’t do it in terms of DirEntry. I was thinking more like a set of cached results for the Path methods (like is_dir, etc.) and return those values instead of re-requesting them.
The challenge is invalidating that cache (could be a new named argument) without breaking subclasses (couldn’t be a new named argument…). Though potentially just having a new invalidate_attribute_cache() method[1] would be enough? If it’s not there, there’s no caching to worry about, and it means that the results from methods like exists() and is_dir() can be set at construction time if known.
Perhaps even a module-level function that silently handles the missing attribute on objects passed into it. ↩︎
How do we do that without breaking the API though? Path methods are presently guaranteed to return up-to-date results, and so code like this is kosher:
def delete_file(path):
assert path.exists()
os.unlink(path)
assert not path.exists()
Yeah, exactly. Which is why (unfortunately) the only good way to do it (now) is to fully implement the optimised functions on the subclasses, rather than trying to abstract the optimisations across different implementations.
(Though we could probably get a decent amount of them with a might_have_children attribute that’s set on first access or when known. That’s the main attribute that we can get for free.)