Functions that do a recursive filesystem walk, such as os.walk, pathlib.Path.walk, and shutil.rmtree, hit the recursion limit on deep directory trees. I suggest these functions be changed to be iterative rather than recursive; I’d expect this to improve performance as well. It’s easy to hit the limit, especially with programmatically generated directory trees, and it prevents walking arbitrary trees without adjusting the recursion limit.
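A minimal way to reproduce the failure (a sketch: the depth, the directory name, and the naive walker below are illustrative, not the stdlib’s actual code):

```python
import os
import tempfile

# Programmatically build a tree nested deeper than the default
# recursion limit of 1000. Depth 1500 keeps the absolute path under
# Linux's PATH_MAX; Windows may need long paths enabled.
root = tempfile.mkdtemp()
path = root
for _ in range(1500):
    path = os.path.join(path, "d")
    os.mkdir(path)

# A naive recursive walker of the same shape as the recursive stdlib code:
def walk(top):
    yield top
    for entry in os.scandir(top):
        if entry.is_dir(follow_symlinks=False):
            yield from walk(entry.path)

for _ in walk(root):
    pass  # RecursionError once the generator nesting passes the limit
```

An iterative version keeps its own list of pending directories, so the depth it can handle is bounded only by memory.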
I’ve wanted to do this for a while but never had the courage for some reason. Yet your work has inspired me to try to do this for pathlib.Path.walk! Thanks for your commits!
I haven’t seen such a case in my career. However, I also don’t see much benefit in using a recursive solution: it is slightly simpler, but it is also slightly slower, and it puts an unnecessary limitation on Python’s walk that could become problematic some day.
And the changes proposed in the issue above don’t make scaling or working with walk problematic, so I think they are well worth the slightly increased complexity.
Update:
Forgot to mention: the proposed changes also let us get rid of an extra Path._walk method and os.walk function.
I thought about this idea, but decided that it is easier to increase the recursion limit than to rewrite every recursive filesystem walk in the stdlib. Increasing the recursion limit is a more universal solution, and in recent Python releases there is no physical limit on recursion in pure Python code.
I think that we can increase the default recursion limit by an order of magnitude or two. But code that consumes the C stack should then increase the recursion depth counter by more than 1 per call, to prevent a stack overflow. @markshannon
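For reference, the knob in question (the raised value below is an arbitrary example):

```python
import sys

print(sys.getrecursionlimit())  # 1000 by default

# Raising the limit allows deeper pure-Python recursion. In recent
# CPython releases, pure-Python calls no longer consume the C stack,
# which is what makes a much higher default plausible.
sys.setrecursionlimit(100_000)
```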
I don’t have anything to add except that this is an excellent point. I’ve faced similar problems with IMAP folders hundreds of levels deep, so it’s not unbelievable that this could be a problem with regular filesystem directories.
@jonburdo and I have measured our versions, and I think there was a 6-7% increase in speed in each. I measured by walking the entire cpython repository.
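For anyone who wants to reproduce this, a harness along these lines works (a sketch: walk_fn and the checkout path are placeholders; the 6-7% figure above is our measurement, not something this snippet asserts):

```python
import os
import time

def bench(walk_fn, top, repeats=5):
    """Best-of-N wall-clock time for a full traversal of top."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _root, _dirs, _files in walk_fn(top):
            pass
        best = min(best, time.perf_counter() - start)
    return best

# e.g. bench(os.walk, "cpython") vs. bench(patched_walk, "cpython")
```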
That’s not a very good benchmark, but it is the one we have. Besides, it makes sense: fewer function calls almost always mean better performance, especially if all the other operations are the same. Unless, of course, we have a JIT, which we don’t :))
I’m not saying it’s worse. I mean, code-simplicity-wise it’s worse, of course, but let’s ignore that. I’m saying it’s different. If you remove the function calls (win!), you now pay something for that by having to handle the stack manually (lose!). So you have two numbers, X and Y, and the net result is X - Y: it matters whether X is bigger than Y or not. You can’t just say that fewer function calls make it faster.
Aaaanyway, if you see a few percentage points of a win in one scenario, it would be very good to see whether that generalizes; it’s certainly reason for optimism.
Me, too. That doesn’t match what I’ve seen previously: when people replaced built-in recursion with their own iterative implementation in Python, it got slower.
It isn’t even fewer function calls. Instead of the recursive call, you have the stack.append(...) call.
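To make that concrete, the two shapes side by side (a sketch with a hypothetical subdirs() helper, not the stdlib code):

```python
import os

def subdirs(path):
    # Immediate subdirectories, not following symlinks.
    return [e.path for e in os.scandir(path)
            if e.is_dir(follow_symlinks=False)]

def walk_recursive(top):
    yield top
    for sub in subdirs(top):
        yield from walk_recursive(sub)   # one Python-level call per subdir

def walk_with_stack(top):
    stack = [top]
    while stack:
        path = stack.pop()
        yield path
        for sub in subdirs(path):
            stack.append(sub)            # the call that replaces it
```

The per-directory call count is about the same; the difference is that list.append is a cheap C-level call that creates no Python frame, while the recursive call does, which is where an iterative version can still come out ahead.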