Using iterative filesystem walk instead of recursive

pochmann · January 2, 2023, 6:58pm

Ah, the recursive one might’ve been slower because of the recursive yielding:

python/cpython/blob/702a5bc4637c82dc011e98b84f0cede98eb08dda/Lib/os.py#L421-L423


      
          # Recurse into sub-directories
          for new_path in walk_dirs:
              yield from _walk(new_path, topdown, onerror, followlinks)

Worst case quadratic time, e.g., with 1000 dirs at depth 1000, it had a million yields.

(That quadratic time recursive yielding is also a still open general question of mine.)

Demo, walking 'd/' * 800 takes ~4 times as long as 'd/' * 400 (because it’s twice as many dirs and they get yielded through twice as many generator iterators):

depth 400:  26 ms
depth 800: 105 ms
depth 400:  28 ms
depth 800: 101 ms
depth 400:  25 ms
depth 800: 111 ms

Code

import os
from time import time

for n in [400, 800] * 3:
    os.makedirs('d/' * n)
    t0 = time()
    for _ in range(10):
      for _ in os.walk('.'):
        pass
    print(f'depth {n}: {round((time() - t0) * 100):3} ms')
    os.removedirs('d/' * n)

Try it online!

barneygale · April 7, 2023, 4:56pm

I’m looking at changing shutil.rmtree() to use an iterative, rather than recursive, filesystem walk. I’d like to use os.walk() and fwalk() to implement it. If we employ a bottom-up walk that doesn’t follow symlinks, then we can simply os.unlink() all the files in filenames, and os.rmdir() at each level.

But there’s a problem: os.walk() and fwalk() add symlinks to directories into the dirnames (rather than filenames) bucket, even when we set followlinks=False!

There’s some discussion on the github issue about how to best to deal with this. Some ideas we’re considering:

We could add a strict_dirnames argument to os.walk() and os.fwalk(), which would cause symlinks to directories to be added to the filenames bucket when followlinks=False
We could add new implementations of walk() and fwalk() that don’t suffer from this issue, probably exposing os.DirEntry objects, perhaps in os.path where they arguably belong.
- Jon has a draft PR here: gh-89727: add os.walkdir and os.fwalkdir by jonburdo · Pull Request #103234 · python/cpython · GitHub
We could implement rmtree() atop pathlib.Path.walk(), which doesn’t suffer from the issue. We’d need to also add a Path.fwalk() method, perhaps private.
- I have a WIP PR here: [WIP] GH-89727: Fix `shutil.rmtree()` recursion error on deep trees. by barneygale · Pull Request #103164 · python/cpython · GitHub
We could implement the iterative filesystem walk in/around shutil.rmtree() itself, without leveraging any walk functions or methods from os or pathlib.
Your idea here?

Would anyone like to suggest the best way forward? It’s not clear to me. I lean towards pathlib, but then I’m rather biased!

jonburdo · April 7, 2023, 8:04pm

Some additional benefits/notes about some of these options:

- Highlights this issue of mishandling symlinks to directories and allows users who are already using os.walk to fix this problem with minimal changes to their code.
- Provides an fs walk analog to os.scandir - very useful for efficiency, convenience, flexibility in shutil.rmtree and also for general use.
- Handles symlinks to directories correctly without relying on pathlib.Path objects. Many projects still intentionally use string paths and os.DirEntry objects for handling large numbers of paths efficiently.
- 1 can be trivially implemented on top. Simply take entry.name to get the values os.walk would give. A boolean kwarg could also be added, which allows users to get string paths instead of os.DirEntry objects.
- A working draft including shutil.rmtree exists.
- Makes nice use of pathlib, possibly without significant performance overhead. (We should double check performance, though. Also note the new import of pathlib)
- A drawback is that this would be more complex. It also seems likely we’d use whatever approach we use for shutil.rmtree for any other recursive-to-iterative conversions we do in the future (i.e. shutil.copytree perhaps)

pitrou · April 8, 2023, 9:00am

This sounds like a good option. Perhaps want to call it os.scantree?

barneygale · April 9, 2023, 3:03pm

Two things that make me lean towards using pathlib:

The pathlib.Path.walk() method does exactly what we need w.r.t symlinks. Brett Cannon mentioned that he only accepted the method because it improves upon os.walk() enough to differentiate it. Its handling of symlinks is one of these improvements IMHO.
The non-fd version can be built entirely of other pathlib methods, which means it can form part of a future pathlib.AbstractPath class. Users would need to implement stat(), iterdir(), unlink() and rmdir() methods, and by doing so their subclass would have a working implementation of rmtree() (along with glob(), walk(), etc). I appreciate this is pretty nebulous and not really a current concern.

csm10495 · April 10, 2023, 5:43am

To be honest adding yet another way to walk the directory structure seems against the Zen of Python to me.

One obvious way to do things… This is giving us at least 3 just via os… not even including pathlib.

steve.dower · April 10, 2023, 10:38am

I haven’t thought about the implications of yet-another-walk-function, but I’d push back against it going into os.path. If you ignore a few exceptions, right now os.path is “PurePath” while os is “Path” (and if you like, ntpath is “PureWindowsPath” while posixpath is “PurePosixPath”).

Basically, you can use any os.path implementation to manipulate path strings without touching the OS. A walk or list function (and yes, the is* functions… too late now I guess) doesn’t belong there.

barneygale · April 10, 2023, 1:30pm

If I count up the functions in os.path and roughly bucket them into pure/impure, I actually end up with slightly more impure (~17) than pure lexical functions (~13). Many of the impure functions could happily live in os, but as you say, might be too late now!

eryksun · April 10, 2023, 1:30pm

A few exceptions? None of these is a pure path operation: exists(), lexists(), isfile(), isdir(), islink(), isjunction(), ismount(), realpath(), getsize(), getatime(), getctime(), getmtime(), samefile(), sameopenfile(), and samestat(). That’s 15 out of 31 functions – almost half. It seems proper to me to a put a function that walks paths in os.path.

steve.dower · April 10, 2023, 3:26pm

Okay, okay, I didn’t actually go and count them

Still, I’d feel better about putting the new walk/scan function with the existing scan/walk functions.

eryksun · April 10, 2023, 4:41pm

I see os.path as the lower-level equivalent of pathlib.Path, but maybe the new function could be implemented in the shutil module instead of either os or os.path. I like Antoine’s suggested name “scantree”. An shutil.scantree() function would fit with the existing functions shutil.copytree() and shutil.rmtree(), and both of the latter could be implemented in terms of scantree() instead of using os.scandir() directly.

jonburdo · April 10, 2023, 6:22pm

os.scantree and os.fscantree are pretty compelling. (Note that shutil.rmtree needs an fd-yielding version as well.) These are like fs walk analogues to os.scandir, while os.walk and os.fwalk are the analogues to os.listdir.

An os.DirEntry-yielding walk function that handles symlinks to directories correctly would also allow the shutil function implementations to remain closer to what they are currently, since they currently rely on os.scandir and do their own recursion.

barneygale · April 14, 2023, 6:34pm

Two Qs about the proposed scantree():

How do I remove a subdirectory from the walk? With os.walk() I can dirnames.remove('.git'),
How do I add a subdirectory to the walk? In os.walk() I can dirnames.append('mynewdir').

kknechtel · April 14, 2023, 6:47pm

Is adding a subdirectory a realistic use case? Semantically, everything that could plausibly make sense to be there should already be there. As far as removal goes, personally I’ve always felt like modifying that list is pretty hacky UI, and it only works in top-down mode anyway. I’d rather be able to pass some kind of predicate for filtering. For more complicated cases, maybe something that injects the os.listdir dependency?

barneygale · April 14, 2023, 7:03pm

Not only realistic, but pretty handy! It’s called out in the docs:

this can be used to prune the search, impose a specific order of visiting, or even to inform walk() about directories the caller creates or renames before it resumes walk() again.

barneygale · April 15, 2023, 5:18pm

I hope I’m not being too keen here, but I’m increasingly convinced that fixing shutil.rmtree() by building upon pathlib.Path.walk() is the right way to go. The process looks like this:

Add pathlib.Path.fwalk() method, so we can avoid symlink races in rmtree()
- Nicely complements the walk() method we added in 3.12.
- Draft PR here
Add optional follow_junctions argument to pathlib.Path.walk(), so we can avoid walking into junctions on Windows in rmtree()
- A small patch.
Implement shutil.rmtree() using pathlib.Path.walk() and fwalk()
- This makes shutil.py about 100 lines shorter, and the rmtree() algorithm more obvious.

Note how we don’t need a function/method that yields os.DirEntry objects at any stage, so this strikes me as the least invasive way to solve the problem. If/when we introduce scantree(), we could always revisit the rmtree() implementation.

Also pinging @Ovsyanka, who previously brought up adding Path.fwalk().

eryksun · April 15, 2023, 6:50pm

Maybe add pathlib.Path.[f]scantree(), with support for follow_junctions, and use it to implement pathlib.Path.copytree() and pathlib.Path.rmtree(). The yielded files and directories could be instances of a subclass of pathlib.Path that wraps a DirEntry object. Use the latter to implement stat() and the is_file(), is_dir(), is_symlink(), and is_junction() tests.

jonburdo · April 15, 2023, 8:52pm

There are a few of options I see:

Don’t support editing of dir entries, at least to start. We don’t need it to implement shutil functions.
dir_entries could be an object which provides custom add and remove functions, which accept strings.

Use a mechanism like generator.send(dirnames) to indicate an alternate list of directory names to be used:

result = []
gen = os.scantree(some_dir)
for entry, dir_entries, file_entries in gen:
    dirnames = [d.name for d in dir_entries if d.name != "__pycache__"]
    if ".git" in dirnames:
        dirnames.remove(".git")
        if "tests" not in dirnames:
            os.mkdir("tests")
            create_test_files(os.path.join(entry.path, "tests"))
            dirnames.append("tests")
    result.append(entry)
    result.extend(file_entries)
    gen.send(dirnames)

This relies on less shared state, since we’re not passing a list to the user to edit and then iterating over it. This resulted in pretty odd behavior for os.walk in the past ( `os.walk()` no longer supports late edits to `dirnames` · Issue #102932 · python/cpython · GitHub )

Any way that we do this requires creation of os.DirEntry objects for newly added directories since we need an entry value to yield for them.

kknechtel · April 15, 2023, 9:26pm

Oh, having a generator and responding to .send sounds quite elegant, yes. That would allow even more flexibility, even. (re-creating the list from scratch!) Of course, it would have to respond to a .send of None by using the list it already has (since next would be equivalent to that), and it might be courteous to type-check what’s passed in and fail fast (since the object could get replaced).

… Well, maybe .sending another sort of iterable could be practical…

I’m still not sure about the utility (deliberately creating more files and folders in the current directory in the middle of the walk, and asking the walk to consider them, sounds very error-prone), but still.

jonburdo · April 17, 2023, 12:47am

Here’s an implementation of os.scantree with dir editing via generator.send: cpython/os.py at 3dcae98c37553e2c34d01e689a81fb7d329daa4d · python/cpython · GitHub
I added it to this draft PR: gh-89727: add os.walkdir and os.fwalkdir by jonburdo · Pull Request #103234 · python/cpython · GitHub