Is there a `pathlib` equivalent of `os.scandir()`?

kknechtel · February 25, 2024, 10:43am

I would guess that these are relatively simple objects; e.g. they aren’t caching the stat, inode etc. results when created. Plus they’re implemented in C.

I just tried it and os.scandir was much faster for a directory that contained only a few files. It had so much of an advantage that [x.name for x in os.scandir()] (to get the result in the same format) is almost as fast as os.listdir().
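
For anyone who wants to repeat that comparison, here is a rough sketch rather than a proper benchmark. The temporary directory, the five-file count and the helper names are assumptions made for illustration; absolute numbers will vary by platform and Python version.

```python
import os
import tempfile
import timeit
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # A handful of empty files, mirroring the "only a few files" case.
    for i in range(5):
        Path(tmp, f"file{i}.txt").touch()

    def via_listdir():
        return os.listdir(tmp)

    def via_scandir():
        # Using the iterator as a context manager closes it promptly.
        with os.scandir(tmp) as entries:
            return [entry.name for entry in entries]

    def via_iterdir():
        return [child.name for child in Path(tmp).iterdir()]

    funcs = (via_listdir, via_scandir, via_iterdir)
    # All three approaches should report the same set of names.
    results = {f.__name__: sorted(f()) for f in funcs}
    timings = {f.__name__: timeit.timeit(f, number=1000) for f in funcs}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f}s for 1000 calls")
```

The sketch only checks that the three approaches agree on the names and prints the timings; it makes no claim about which one wins on a given machine.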

barneygale · April 11, 2024, 12:15am

FWIW, using os.scandir() does speed up iterdir() a little bit if we forgo some upfront normalization of results. This makes iterdir() faster in the general case, but slower if the user subsequently calls any PurePath method on the results that needs to fully parse the path (e.g. .name or .parent). All in all I think it’s worthwhile. PR:

github.com/python/cpython

GH-117727: Speed up `pathlib.Path.iterdir()` by using `os.scandir()`

python:main ← barneygale:iterdir-scandir

opened 08:34PM - 10 Apr 24 UTC

barneygale

+8 -19

Replace use of `os.listdir()` with `os.scandir()`. Forgo setting `_drv`, `_root`… and `_tail_cached`, as these usually aren't needed. Use `os.DirEntry.path` to set `_str`.

Timing:

```bash
$ ./python -m timeit -s "from pathlib import Path; p = Path('Lib')" "list(p.iterdir())"
1000 loops, best of 5: 303 usec per loop  # before
1000 loops, best of 5: 257 usec per loop  # after
# --> 1.18x faster
```

* Issue: gh-117727

fireattack · May 10, 2024, 5:25am

Great update!

One pain point when using Path.iterdir (in terms of performance), however, is distinguishing files from folders.

From what I noticed (also mentioned above), Path.is_file() is very expensive. When processing thousands upon thousands of files and folders (especially when doing so recursively), the time spent on these checks easily exceeds the time spent on listing the files and directories, if your goal is, say, to get all the files in a folder.

For a quick example:

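from pathlib import Path

def iterdir_generator(directory):
    """Return a list of every file found recursively under directory."""
    dirpath = Path(directory)
    def core(dirpath):
        for x in dirpath.iterdir():
            # Each is_file()/is_dir() call performs a fresh stat() lookup.
            if x.is_file():
                yield x
            elif x.is_dir():
                yield from core(x)
    return list(core(dirpath))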

Is there any way to improve on that?

MegaIng · May 10, 2024, 9:11am

Don’t use Path.iterdir, use os.scandir. It sadly isn’t viable to add the stat caching performed by the latter to Path objects, because of the expectations people have about those objects.
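
A minimal sketch of that suggestion, applied to the recursive file listing from the question. The function name `iter_files` and the throwaway directory layout are made up for illustration. One deliberate difference from the original: it does not descend into symlinked directories, which is an assumption about what is wanted.

```python
import os
import tempfile
from pathlib import Path

def iter_files(directory):
    """Yield a Path for every file below `directory`, using os.scandir()."""
    with os.scandir(directory) as entries:
        for entry in entries:
            # The entry type usually comes from the directory listing itself,
            # so these checks often need no extra system call.
            if entry.is_dir(follow_symlinks=False):
                yield from iter_files(entry.path)
            elif entry.is_file():
                yield Path(entry.path)

# Small usage example on a temporary tree: a.txt and sub/b.txt.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "sub").mkdir()
    Path(tmp, "a.txt").touch()
    Path(tmp, "sub", "b.txt").touch()
    found = sorted(p.relative_to(tmp).as_posix() for p in iter_files(tmp))

print(found)
```

Converting to Path only at the point of yielding keeps the pathlib-friendly result while the type checks run against the cached os.DirEntry data.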

blhsing · May 10, 2024, 3:30pm

I’m not sure what expectations about objects you’re referring to, but I think it’s entirely viable and makes perfect sense to add the stats caching performed by os.scandir into the Path objects, since those os.DirEntry objects that os.scandir generates have direct 1-to-1 corresponding methods in Path objects.

To quote the documentation of os.DirEntry:

Note that there is a nice correspondence between several attributes and methods of os.DirEntry and of pathlib.Path. In particular, the name attribute has the same meaning, as do the is_dir(), is_file(), is_symlink(), is_junction(), and stat() methods.

Here’s a quick implementation of a scandir method for a Path object that generates Path objects instead of os.DirEntry objects:

import os
from pathlib import WindowsPath, PosixPath

class ScannablePath(WindowsPath if os.name == 'nt' else PosixPath):
    def scandir(self):
        yield from map(CachedPath, os.scandir(self))

class CachedPath(ScannablePath):
    def __new__(cls, dir_entry):
        path = super().__new__(cls, dir_entry.path)
        path._dir_entry = dir_entry
        return path

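    # Delegate to the cached os.DirEntry. The follow_symlinks keyword is
    # accepted so that callers passing it (as Path.stat() allows) keep working.
    def is_dir(self, *, follow_symlinks=True):
        return self._dir_entry.is_dir(follow_symlinks=follow_symlinks)

    def is_file(self, *, follow_symlinks=True):
        return self._dir_entry.is_file(follow_symlinks=follow_symlinks)

    def is_symlink(self):
        return self._dir_entry.is_symlink()

    def is_junction(self):  # os.DirEntry.is_junction() needs Python 3.12+
        return self._dir_entry.is_junction()

    def stat(self, *, follow_symlinks=True):
        return self._dir_entry.stat(follow_symlinks=follow_symlinks)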

for path in ScannablePath('/').scandir():
    print(path.name, path.is_dir(), path.stat().st_size)


MegaIng · May 10, 2024, 3:54pm

Being able to store them in a data structure for permanent use. Path objects shouldn’t go stale. With the DirEntry objects, this is something the user should be aware of and is something they are explicitly asking for by using scandir.
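
To make the staleness concern concrete, here is a small sketch, assuming a throwaway temporary directory and a made-up file name. It relies on the documented behaviour that the result of os.DirEntry.is_file() is cached on the entry object, whereas Path.is_file() looks the file up again each time.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp, "data.txt")
    target.write_text("hello")

    with os.scandir(tmp) as entries:
        entry = next(entries)
    entry.is_file()  # first call while the file exists; the result is cached

    target.unlink()  # the file disappears from disk

    cached_answer = entry.is_file()            # answered from the DirEntry cache
    fresh_answer = Path(entry.path).is_file()  # performs a new lookup

print("DirEntry says:", cached_answer)
print("Path says:", fresh_answer)
```

The stored DirEntry keeps reporting the file as present after it has been deleted, which is exactly the kind of surprise a long-lived Path object is expected not to produce.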

blhsing · May 10, 2024, 4:00pm

We can always add additional facilities to expire and/or refresh those stats. The point here is that it makes sense for Path objects to cache those stats and to leverage the output of os.scandir so that users can have both the friendliness of pathlib and the performance of os.scandir in one API.

barneygale · May 10, 2024, 4:52pm

Users expect that Path.is_dir() etc perform a fresh lookup, so we can’t change behaviour without it being opt-in.

We could perhaps add an entry: os.DirEntry | pathlib.Path attribute to Path that returns a cached os.DirEntry where available (e.g. for paths generated from iterdir()), or returns self if not. Or perhaps it should be status() -> os.DirEntry | pathlib.PathEntry, where PathEntry is a limited and caching version of Path.

barneygale · October 13, 2024, 7:50pm

I’ve logged a feature request for this, and I have a PR on the way:

github.com/python/cpython

Add `pathlib.Path.dir_entry`

opened 07:48PM - 13 Oct 24 UTC

barneygale

type-feature performance topic-pathlib

# Feature or enhancement

`Path.iterdir()` uses `os.scandir()` under-the-hood,… but it throws away the resulting [`os.DirEntry`](https://docs.python.org/3/library/os.html#os.DirEntry) objects, despite their [numerous useful features](https://peps.python.org/pep-0471/).

I propose we add a new `Path.dir_entry` attribute that stores an `os.DirEntry` object or `None`. This attribute will be set to a directory entry in paths yielded from `Path.iterdir()`. This would allow users to call methods such as `child.dir_entry.is_symlink()` to check for symlinks without incurring a mandatory system call. It will help speed up the implementation of `Path.copy()` too.

See discussion: https://discuss.python.org/t/is-there-a-pathlib-equivalent-of-os-scandir/46626

barneygale · October 28, 2024, 12:16am

@ncoghlan added some great feedback to the issue and PR about the potential dir_entry attribute, and how it could work for paths that aren’t generated by Path.iterdir(). I’ve been trying to get to grips with a potential API, but a lot of it isn’t obvious (e.g. when caches are generated, updated, expired), which I think indicates that my whole approach is wrong.

What if we simply added Path.scandir() instead?

There are a few things to weigh. The most obvious objection is that path.scandir() is only an alternate spelling of os.scandir(path). But some of pathlib’s utility comes down to trivially wrapping the most useful os functions in methods (for example, see Path.stat(), chmod(), rmdir()), despite (sometimes quite convincing) arguments that this is improper practice. I suggest that scandir() is among the most commonly used os functions and might deserve a Path method on its own merits.
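
To illustrate how thin such a wrapper would be, here is a sketch using a hypothetical subclass. `ScandirPath` and its `scandir()` method are invented for this example and are not a description of what pathlib itself provides; subclassing `type(Path())` is just a way to pick the concrete class for the current platform.

```python
import os
import tempfile
from pathlib import Path

class ScandirPath(type(Path())):
    """Hypothetical Path subclass that adds a scandir() method."""

    def scandir(self):
        # Thin wrapper: returns the same iterator/context manager
        # that os.scandir() returns.
        return os.scandir(self)

# Usage on a temporary directory with one file and one subdirectory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").touch()
    Path(tmp, "sub").mkdir()
    with ScandirPath(tmp).scandir() as entries:
        kinds = {entry.name: entry.is_dir() for entry in entries}

print(kinds)
```

The method body is a single call, which is the "alternate spelling" objection in a nutshell; the argument for it rests on convenience and on giving subclasses a hook to override.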

My own pet interest is in eventually exposing a PathBase class that users can subclass to implement virtual filesystems. To that end, having a scandir() method that yields caching os.DirEntry-like objects is a massive help for implementing various high-level PathBase methods, like glob(), walk() and copy().

There’s a bit of cross-over between these concerns. Python 3.14’s Path.copy() method is implemented in PathBase at the moment, and uses PathBase.iterdir() to walk directory trees. For performance reasons it ought to work with os.DirEntry when dealing with local paths, which means I can either 1) add a near-duplicate implementation in Path that uses os.scandir(), or 2) add scandir() to the PathBase interface and call that from copy().
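
As a rough illustration of option 2, here is a toy sketch of a high-level helper written only against a scandir()-style interface. Everything in it is hypothetical: the in-memory `TREE`, the `MemoryEntry` class and the function names are invented for the example and say nothing about how PathBase is actually implemented.

```python
from dataclasses import dataclass

# Hypothetical in-memory "filesystem": dicts are directories, bytes are files.
TREE = {"docs": {"readme.txt": b"hi"}, "main.py": b"print()"}

@dataclass(frozen=True)
class MemoryEntry:
    """Stand-in exposing a small subset of the os.DirEntry interface."""
    name: str
    path: str
    node: object

    def is_dir(self):
        return isinstance(self.node, dict)

    def is_file(self):
        return isinstance(self.node, bytes)

def scandir(tree, path=""):
    """Yield one MemoryEntry per child of the directory at `path`."""
    node = tree
    for part in filter(None, path.split("/")):
        node = node[part]
    for name, child in node.items():
        yield MemoryEntry(name, f"{path}/{name}" if path else name, child)

def walk_files(tree, path=""):
    """High-level helper that only needs scandir() and the entry methods."""
    for entry in scandir(tree, path):
        if entry.is_dir():
            yield from walk_files(tree, entry.path)
        elif entry.is_file():
            yield entry.path

files = sorted(walk_files(TREE))
print(files)
```

Because `walk_files` never touches the storage directly, the same helper could in principle run over local paths backed by os.scandir() or over a virtual filesystem that supplies its own entry objects.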

Thoughts?

ncoghlan · October 28, 2024, 12:42am

I think that is tidier than what I suggested on the issue, so +1 from me.