Add pathlib.Path.walk method

Ovsyanka · January 8, 2022, 12:48pm

Pathlib is great, yet every time I have to parse a bunch of files, I have to use os.walk and join paths by hand. That’s not a lot of code but I feel like pathlib should have higher-level abstractions for all path-related functionality of os. I propose we add a Path.walk method to combat this issue.

We have already had a small discussion in the related github issue where a few questions were raised:

Where do we put the os.walk logic?
i) Do we extend iterdir logic with walk logic?
ii) Make a Path.walk method with the same API as os.walk
iii) Do we make a method with a different name and API to provide a simplified access to os.walk features?
Do we re-implement Path.walk using only pathlib tools to optimize it?

Extending iterdir with os.walk logic will most likely lead to iterdir having 4-6 arguments and its “walk” usages being basically equivalent to Path.glob("**/*"), losing some of the core os.walk features. For example, in os.walk, the users can prevent descension into certain directories by removing their names from the yielded directory list.

barneygale has mentioned that os.walk api is a bit too complex so it would be great to simplify it for pathlib, yet I do not see any way of simplifying it without losing its features.

Here’s the current prototype of the implementation:

def walk(self, topdown=True, onerror=None, followlinks=False):
    for root, dirs, files in self._accessor.walk(
        self,
        topdown=topdown,
        onerror=onerror,
        followlinks=followlinks
    ):
        root_path = self._from_parts([root])
        modified_dirs = [root_path._make_child_relpath(d) for d in dirs]
        yield (
            root_path,
            modified_dirs,
            [root_path._make_child_relpath(file) for file in files],
        )

        # In topdown mode, os.walk() allows user to prevent descension
        # into certain dirs by removing them from the yielded list.
        # We use this little hack to reach the same behavior.
        # See os.walk for more details.
        dirs.clear()
        dirs.extend([d.name for d in modified_dirs])

It is twice as slow as os.walk but about ~2.7 times faster than Path.glob("**/*") and seems to support all of the features and has an equivalent API as os.walk.

EpicWink · January 8, 2022, 1:41pm

Is there notable desire for the extended features of os.walk? In my use-case, I only want to list all files recursively, and I implement that by calling pathlib.Path.iterdir recursively.

Also in my experience, if I care about speed, I don’t use pathlib at all.

I think I’m biased against the idea due to my dislike towards os.walk's interface

Ovsyanka · January 8, 2022, 5:52pm

Path.glob is suited for what you were doing with iterdir.

As for walk, I keep finding use cases specifically for its functionality both at work and in personal projects.

ben11kehoe · January 8, 2022, 8:11pm

I would like to see walk() available in pathlib, but I’d propose that one change be made from os.walk(): the yield value should be a four-tuple: the root of the walk (that is, the original Path object), the relative path to the current directory within the walk, and then the dirs and files. As os.walk() is today, only providing the current directory means keeping the root of the walk around separately and having to then get the relative path from it.

Ovsyanka · January 8, 2022, 10:11pm

Could you provide a specific use case for that fourth return value? Every time os.walk is discussed, people ask to do less return values because its api is complex enough. Adding another one without huge necessity seems a bit wasteful.

ben11kehoe · January 9, 2022, 3:35am

Today, if you do os.walk('/home/me/some_dir'), the first value in the 3-tuple when you’re somewhere in the walk could be like '/home/me/some_dir/foo/bar/loc', and since you often care about the relative path foo/bar/loc you have to do os.path.relpath() on it. So your code ends up looking like

root = '/home/me/some_dir'
for abs_dir, subdirs, files in os.walk(root):
  rel_dir = os.path.relpath(abs_dir, root)

with the pathlib proposal, I’d like to skip the dir.relative_to(root) and instead have:

for root, rel_dir, subdirs, files in Path('/home/me/some_dir').walk():
  # use e.g. rel_dir.parts
  abs_dir = root / rel_dir # if it's needed

The reason I think the root should be included is that I’ve encountered a lot of situations where the code handling the iteration wouldn’t actually reference to the root otherwise, so it ends up needing to be specifically injected there, whereas if it was returned as part of the iteration it would just be there.

Ovsyanka · January 9, 2022, 11:23am

I understand your idea but I don’t understand how crucial it is in walk. Could you provide a real-life use case? The way I see it now, we are only saving a single line of code and only in a case when the user needs relative paths. That sounds cool but in reality we’re bloating the already bloated API for a very minor benefit in a very specific use case.

barneygale · January 9, 2022, 3:25pm

-1 on adding another tuple item. I think our options here are:

Expose os.walk() fairly directly, but make it yield Path objects rather than strings
Invent a friendlier version of os.walk() and add it to pathlib.

It’s perhaps worth noting that os.walk() uses os.scandir() under-the-hood, which is the same thing pathlib uses to implement globbing.

barneygale · January 9, 2022, 6:49pm

For a return type, how about:

Generator[tuple[pathlib.Path, list[os.DirEntry]]

This differs from os.walk in the following ways:

The root path is expressed as a Path rather than a string
The children are expressed as a list of os.DirEntry objects, rather than two lists of strings.

This could be implemented efficiently with _make_child_relpath() - no splitting or joining on / tokens is needed.

Users could make use of os.DirEntry’s efficient filesystem access:

scandir() will provide as much of this information as possible without making additional system calls. When a stat() or lstat() system call is made, the os.DirEntry object will cache the result.

The list of entries could be modified to prevent further recursion, as in os.walk():

When topdown is True , the caller can modify the dirnames list in-place (perhaps using del or slice assignment), and walk() will only recurse into the subdirectories whose names remain in dirnames ; this can be used to prune the search, impose a specific order of visiting, or even to inform walk() about directories the caller creates or renames before it resumes walk() again.

If users need a full Path object for every file in the tree, they can do:

for root, entries in Path('/').walk():
    for entry in entries:
        path = root / entry
        # do stuff with path

Ovsyanka · January 10, 2022, 11:03pm

I agree that the optimization of os.walk using pathlib seems pretty cool but is it worth it?
I feel like there are three issues with this solution:

We duplicate the logic of os.walk in the code base which means that we will potentially need a lot more tests than with a wrapper.
os.walk splits directories and files into separate lists for two reasons: it makes no sense to remove items from the ‘files’ list and it makes it easier for the user to go through only files or only directories. If we go with the ‘entries’ idea, we lose this feature entirely. It yields two issues:
i) We will need to go through all files every time we descend into a new set of directories because we store them in a single list, and the user can alter that list after any yield. In many cases, it could yield a much worse performance than just writing a wrapper around os.walk.
ii) The user will have to distinguish between files and directories on his own. Not a big problem, but seems like a downgrade nonetheless.
The solution you offered seems to be focused on optimization and removal of the third return value. That is fine but it seems (besides optimization) nearly equivalent to:

for root, files, dirs in os.walk(top):
    root = Path(root)
    for entry in files + dirs:
        path = root / entry

So I’m doubtful whether this API is better. The more I look at os.walk, its tests, and prototypes of Path.walk, the more I see that its authors made the API this complex for a good reason.

barneygale · January 10, 2022, 11:14pm

We duplicate the logic of os.walk in the code base which means that we will potentially need a lot more tests than with a wrapper.

Yes, and pathlib duplicates a lot of glob.glob() to avoid splitting and joining on '/'. I don’t see why it would require many more tests

it makes it easier for the user to go through only files or only directories

You can call is_dir() / is_file() / is_symlink() / whatever to filter.

We will need to go through all files every time we descend into a new set of directories because we store them in a single list, and the user can alter that list after any yield . In many cases, it could yield a much worse performance than just writing a wrapper around os.walk.

I don’t follow I’m afraid. I’m suggesting we use the same algorithm as os.walk() in top-down mode.

The solution you offered seems to be focused on optimization and removal of the third return value. That is fine but it seems (besides optimization) nearly equivalent to:

Faster and simpler. What’s not to like?

Ovsyanka · January 10, 2022, 11:22pm

I agree on all points but we’re not really simplifying anything if the user has to do the same number of actions when using Path.walk + filtering.

About the part that you don’t follow:
os.walk’s algorithm is based on going through the list of dirs. Your algorithm is based on going through the list of entries. Say, there is a thousand files in root and a single subdirectory. When we descend, we might need to go through all files before getting to the subdir, yet we only need the subdir.

Honestly, that’s a minor issue. My biggest problem is the fact that we’re not strictly simplifying os.walk, we are just moving complexity from one user interaction (converting root to Path) to the other (filtering).

Note: I have just discovered that this website acts like a real-time chat. That’s nice!

barneygale · January 10, 2022, 11:41pm

Maybe the file/directory distinction is important enough to actually make part of the return type. I personally think returning os.DirEntry and asking users to call is_file(), is_directory(), etc might be OK, and that the slightly simplified return type (generator of 2-tuples rather than 3-tuples) feels “pathliby”. At least to my tastes.

For posterity, here’s a rough implementation of my idea. It’s missing an implementation of follow_links. I don’t think onerror or topdown=False are worth porting particularly - can’t you just reverse the results in the latter case? Anyway:

    def walk(self):
        paths = [self]
        while paths:
            path = paths.pop(0)
            with os.scandir(path) as scandir_it:
                entries = list(scandir_it)
            yield path, entries
            for entry in entries:
                if entry.is_dir():
                    paths.append(path._make_child_relpath(entry.name))

Hope I haven’t derailed things too badly! Best of luck getting this feature through in some form

Ovsyanka · January 10, 2022, 11:52pm

Yeah, your implementation is on point. The optimization problem that I mentioned lies in the for loop portion (i.e. we have to go through files even though they are completely unnecessary for our purposes).

I guess these are the main points of our disagreement:

To simplify the return type or not
To reimplement or to wrap

I only insist on not simplifying the return type (as I mentioned before, we can’t truly simplify it without losing some of the essential features of walk). I am ok with reimplementing walk instead of wrapping it.

If you feel like we really should try to simplify it – I am happy to continue this discussion further so that we might find some common ground. Otherwise, we can just finish discussing whether we need to reimplement it or not.

barneygale · January 11, 2022, 12:25am

Happy to discuss keeping the return type in line with os.walk()!

Here’s a (mostly untested, sorry) version that retains the return type, except that the first element is a Path. It preserves pathlib’s approach of avoiding splitting/joining on '/'.

    def walk(self, follow_symlinks=False):
        paths = [self]
        while len(paths) > 0:
            path = paths.pop(0)
            dirs = []
            files = []
            with os.scandir(path) as scandir_it:
                for entry in scandir_it:
                    try:
                        is_dir = entry.is_dir(follow_symlinks=follow_symlinks)
                    except OSError:
                        is_dir = False
                    if is_dir:
                        dirs.append(entry.name)
                    else:
                        files.append(entry.name)
            yield path, dirs, files
            paths.extend(path._make_child_relpath(name) for name in dirs)

What I’m hoping to show here is that spinning our own version isn’t too much of a headache (in addition to being more in-keeping with other bits of pathlib, and faster!)

Ovsyanka · January 11, 2022, 1:30am

Agreed. ~10% to ~30% faster and a little simpler than os.walk. It is also important to note that your code can raise a permission error on the with statement so that must be ignored as well.

We could make it a bit slower and yield paths directly. The (very rough) examples I’ve tested it on give the following results:
os.walk: 9.3815169 seconds
your walk: 6.5178949 seconds
walk2: 7.179286900000001 seconds (however, more tests show that sometimes this version can get even slower than os.walk, though only by a little bit)

def walk2(self, follow_symlinks=False):
    paths = [self]
    while len(paths) > 0:
        path = paths.pop(0)
        dirs = []
        files = []
        with os.scandir(path) as scandir_it:
            for entry in scandir_it:
                try:
                    is_dir = entry.is_dir(follow_symlinks=follow_symlinks)
                except OSError:
                    is_dir = False
                subpath = path._make_child_relpath(entry.name)
                if is_dir:
                    dirs.append(subpath)
                else:
                    files.append(subpath)
        yield path, dirs, files
        paths.extend(dirs)

I am, however, completely fine with keeping it as-is. It’s already pretty great. I decided to try walk2 because it seems to be more in line with the rest of pathlib (i.e. returning paths instead of strings/os.DirEntry)

steven.daprano · January 11, 2022, 11:18pm

os.walk topdown=False is not merely the reverse of topdown=True

Given this directory tree:

demo
+-- a.txt
+-- b.txt
+-- dir1
>   +-- c.txt
+-- dir2
    +-- dir3
    >   +-- e.txt
    +-- d.txt

os.walk('demo', topdown=True) yields items:

('demo', ['dir2', 'dir1'], ['b.txt', 'a.txt'])
('demo/dir2', ['dir3'], ['d.txt'])
('demo/dir2/dir3', [], ['e.txt'])
('demo/dir1', [], ['c.txt'])

and topdown=False yields:

('demo/dir2/dir3', [], ['e.txt'])
('demo/dir2', ['dir3'], ['d.txt'])
('demo/dir1', [], ['c.txt'])
('demo', ['dir2', 'dir1'], ['b.txt', 'a.txt'])

(The specific order that subdirectories are visited may depend on the
order you created them, and the implementation details of the file
system you are using.)

I don’t find the documentation to os.walk clear, but I think that we
can translate it into standard tree traversal modes:

topdown = True:

visit the current node;
visit the node’s children.

topdown = False:

visit the node’s children;
visit the current node.

barneygale · January 11, 2022, 11:33pm

Hmm. But if I reverse your topdown=False example:

('demo', ['dir2', 'dir1'], ['b.txt', 'a.txt'])
('demo/dir1', [], ['c.txt'])
('demo/dir2', ['dir3'], ['d.txt'])
('demo/dir2/dir3', [], ['e.txt'])

And then swap the order of dir1 and dir2 (the order is arbitrary, as you say):

('demo', ['dir2', 'dir1'], ['b.txt', 'a.txt'])
('demo/dir2', ['dir3'], ['d.txt'])
('demo/dir2/dir3', [], ['e.txt'])
('demo/dir1', [], ['c.txt'])

Then we’ve arrived back at your topdown=True example, right?

Is this a breadth-first vs depth-first thing?

BowlOfRed · January 12, 2022, 12:14am

Both are depth-first. The difference is if parents are visited before children or children are visited before parents.

Reversing the output wouldn’t help my use cases. I have needed to run tree walk with tasks on filesystems with over 2B files. In addition, I am doing tasks on files with side effects (like changing permissions), not just printing. In some cases, when I update the children it resets the parent timestamp. So during my walk, I have to visit parents last or they don’t get the update.

Ovsyanka · January 15, 2022, 3:42am

I have looked over os.walk again and I see that if we want to reimplement it, we will have to change only 3 lines out of ~92 lines. @barneygale 's example is neat, but it disregards topdown and onerror arguments, and does not handle the errors that arise during initial call to scandir and next() call to scandir iterator, which is why it is so small.

In reality, we will need to rewrite the entire walk (with minimal changes) and support it separately from os.walk. So I have even more doubts about reimplementing it now. For now, I will write more tests for the wrapper version and wait for more feedback.

Topic		Replies	Views
Using iterative filesystem walk instead of recursive Core Development	42	3642	October 21, 2023
Pathlib and os.path: feature parity and code de-duplication Ideas	6	1122	June 21, 2021
Incrementally move high-level path operations from shutil to pathlib Ideas	21	1939	November 7, 2022
Add pathlib.Path.rmtree Ideas	6	577	May 19, 2022
Request for review: pathlib._PathBase Core Development review-request	3	497	September 11, 2023

Add pathlib.Path.walk method

Related Topics