Using iterative filesystem walk instead of recursive

Functions that do a recursive filesystem walk, such as os.walk, pathlib.Path.walk, and shutil.rmtree, hit the recursion limit on deep directory trees. I suggest these functions be changed to be iterative rather than recursive; I’d expect this to improve performance as well. It’s pretty easy to hit the limit, especially when generating directory trees programmatically, and it prevents walking arbitrary trees without adjusting the recursion limit.
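To illustrate the failure mode, here is a minimal sketch, assuming a Python version where os.walk is still implemented recursively:

```python
import os
import sys

# Build a directory tree deeper than the default recursion limit.
# Directories are created one level at a time via chdir, because a
# single path this deep would exceed the OS path-length limit.
depth = sys.getrecursionlimit() + 100  # deep enough to overflow
start = os.getcwd()
os.mkdir("deep")
os.chdir("deep")
for _ in range(depth):
    os.mkdir("d")
    os.chdir("d")
os.chdir(start)

try:
    for _ in os.walk("deep"):
        pass
except RecursionError:
    print("os.walk hit the recursion limit")
# Note: cleaning up afterwards is just as awkward, since shutil.rmtree
# is recursive too.
```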

I created a PR for doing this with os.walk, but wanted to get some feedback before going further: gh-89727: Fix os.walk RecursionError on deep trees by jonburdo · Pull Request #99803 · python/cpython · GitHub

Here’s a relevant issue as well: shutil.rmtree and os.walk are implemented using recursion, fail on deep hierarchies · Issue #89727 · python/cpython · GitHub

5 Likes

I’ve wanted to do this for a while but never had the courage for some reason. Your work has inspired me to try to do this for pathlib.Path.walk! Thanks for your commits!

2 Likes

In which kind of use case do you actually generate file trees with a depth in the hundreds or thousands?

2 Likes

I haven’t seen such a case in my career. However, I also don’t see much benefit in the recursive solution. It is slightly simpler, but it is also slightly slower, and it puts an unnecessary limitation on Python’s walk which could become problematic some day.

And the changes proposed in the issue above don’t make scaling or working with walk any harder, so I think they are well worth the slightly increased complexity.

Update:
Forgot to mention: the proposed changes also let us get rid of an extra Path._walk method and os.walk function.

2 Likes

I thought about this idea, but decided that it is easier to increase the recursion limit than to rewrite every recursive filesystem walk in the stdlib. Increasing the recursion limit is a more universal solution, and in recent Python releases there is no physical limit on recursion in pure Python code.

I think that we can increase the default recursion limit by an order of magnitude or two. But code which consumes the C stack should increase the recursion depth by more than 1, to prevent stack overflow. @markshannon
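For anyone hitting this today, the workaround being described is a one-liner (the value here is arbitrary):

```python
import sys

# Raise the interpreter's recursion limit before walking a very deep
# tree. In recent CPython releases pure-Python recursion no longer
# consumes the C stack, so a large limit mainly risks memory use.
sys.setrecursionlimit(100_000)
```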

1 Like

Software bugs, or when extracting less than trustworthy archive inputs.

In either case you need software to clean up the mess, and you may have written your automated cleanup software in Python… :sweat_smile:

9 Likes

That’s a rather good point, thank you. It also goes hand in hand with the ability to correctly handle non-decodable filenames.

2 Likes

I think this is a worthy cause. Maybe the discussion can move to GitHub?

3 Likes

There is already an active PR: gh-89727: Fix os.walk RecursionError on deep trees by jonburdo · Pull Request #99803 · python/cpython · GitHub

That’s great to hear! Thanks for the help and feedback :slight_smile:

I don’t have anything to add except that this is an excellent point. I’ve faced similar problems with IMAP folders hundreds of levels deep, so it’s not unbelievable that this could be a problem with regular filesystem directories.

1 Like

I am skeptical of the claim that not using recursion will be faster. But I hope it’s true!

1 Like

“than to rewrite every recursive filesystem walk in the stdlib”

How many are there!? Shouldn’t there be one that’s parameterised?

And surely an iterative walk could simply be an option, with no need to replace much of anything?

@jonburdo and I have measured our versions, and I think there was a 6–7% increase in speed in each. I measured by walking through the entire CPython repository.

That’s not a very good benchmark, but it is the one we have. Besides, it makes sense: fewer function calls almost always means better performance, especially if all other operations are the same. Unless, of course, we have a JIT, which we don’t :))
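For what it’s worth, a measurement along those lines can be reproduced with something like this (a sketch; the local `cpython` checkout path is an assumption):

```python
import os
import timeit

# Rough benchmark in the spirit described above: fully consume a walk
# of a large local checkout and take the best of several runs.
def consume_walk(top="cpython"):
    for _ in os.walk(top):
        pass

print(min(timeit.repeat(consume_walk, number=5, repeat=3)))
```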

1 Like

Fewer function calls, yes. But the call stack can be faster than the manual list fiddling.

2 Likes

What do you mean by “manual list fiddling”? What’s the difference?

Manually implementing what is naturally recursive means you need to keep a stack in some kind of list structure somewhere.
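Roughly like this (a simplified sketch; the real os.walk also handles symlinks, bottom-up ordering, and error reporting):

```python
import os

def walk_iterative(top):
    # The explicit list-of-paths stack replaces the call stack.
    stack = [top]
    while stack:
        path = stack.pop()
        dirs, files = [], []
        try:
            with os.scandir(path) as it:
                for entry in it:
                    if entry.is_dir(follow_symlinks=False):
                        dirs.append(entry.name)
                    else:
                        files.append(entry.name)
        except OSError:
            continue  # the real os.walk reports errors via onerror
        yield path, dirs, files
        # Instead of a recursive call per subdirectory, push it.
        for name in dirs:
            stack.append(os.path.join(path, name))
```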

1 Like

Yes, that’s precisely what we’re doing. But how is that worse than an internal stack that’s used in recursion?

I’m not saying it’s worse. I mean, code-simplicity-wise it’s worse of course, but let’s ignore that. I’m saying it’s different. If you remove the function calls (win!), you now pay something for that by having to manually handle the stack (lose!). So you have two numbers, X and Y, and the net win/loss is X - Y. It matters whether X is bigger than Y or not. You can’t just say that fewer function calls means it’s faster.
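To make X and Y concrete, one can time both styles on the same synthetic tree (a sketch; the tree shape here is arbitrary):

```python
import timeit

# Synthetic tree: nested lists, three levels with fan-out 10.
tree = [[[[] for _ in range(10)] for _ in range(10)] for _ in range(10)]

def visit_recursive(node):
    # Traversal via function calls on the interpreter's own stack.
    for child in node:
        visit_recursive(child)

def visit_iterative(root):
    # Traversal via manual list fiddling instead of function calls.
    stack = [root]
    while stack:
        stack.extend(stack.pop())

print("recursive:", timeit.timeit(lambda: visit_recursive(tree), number=1000))
print("iterative:", timeit.timeit(lambda: visit_iterative(tree), number=1000))
```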

Aaaanyway, if you see a few percentage points of a win in one scenario, it would be very good to see whether that generalizes; it’s certainly reason for optimism.

2 Likes

Me too. This doesn’t match what I’ve seen previously: when people replaced built-in recursion with their own iterative implementation in Python, it got slower.

It isn’t even fewer function calls. Instead of the recursive call, you have the stack.append(...) call.