Add pathlib.Path.walk method

barneygale · January 15, 2022, 1:04pm

Is onerror useful? We don’t have anything like that in pathlib currently.

Ovsyanka · January 16, 2022, 3:21pm

It is only used in walk — even in os. Might be either a strange design decision or a use case unique to walk.

Unlike most os functions, walk is intended to be error-free, which, I think, is why it requires such an argument.

BowlOfRed · January 16, 2022, 6:54pm

The difficulty is that walk does multiple operations. For opening a file, returning data or an error is fine. But for a tree traverse, there can be partial success.

It might be ok to ignore the fact that you can’t enter /proc when looking for your PDF and just continue looking. It might not be ok to ignore errors when you’re trying to correct some permissions in a tree. There’s no way to make a default that works well for both use cases.

Ovsyanka · January 16, 2022, 9:05pm

Okay. So to summarize:

Top down and bottom up modes are not simple inverses of each other.
Onerror mode is quite necessary for the (partial correctness) purposes of walk.
Its return type (arguably) makes sense and is not that easy to simplify
The majority of pathlib functions return Paths, not direntries
Only about 3-5 lines of os.walk will need to be changed if we decide to follow their exact implementation but with paths

So it feels to me that an easy solution is not gonna cut it. We either go with the wrapper (which is bad because it’s slow) or with duplicating most of the os.walk code (which is bad because of code duplication, but pretty fast, and tests can be taken straight from os with minimal changes).

TeamSpen210 · January 16, 2022, 9:24pm

There is a potential third option: modify os.walk itself to add some parameters to do the required changes. It already immediately defers to a private os._walk(), so that could be given some parameters. Downside is that would slow os.walk(), but doing a few boolean/none checks hopefully isn’t significant.

Ovsyanka · January 16, 2022, 11:45pm

Well, we don’t exactly need any new parameters. We could add ‘use_pathlib’ but no one has done that before so it seems bizarre.

Ovsyanka · May 2, 2022, 1:10am

I apologize for disappearing but I’m back. I have tried to recreate os.walk from scratch but with Path objects. Then I ran them all using timeit and got the following results:

import os, os.path, sys, timeit; from pathlib import Path, PosixPath

# p = CPython repository root
p = Path.cwd()
p_str = str(p)

# 19.180382559992722 seconds
print(timeit.timeit("list(os.walk(p_str))", number=1000, globals=globals()))
# 32.1955898909946 seconds
print(timeit.timeit("list(p.recreated_walk())", number=1000, globals=globals()))
# 37.04862585999945 seconds
print(timeit.timeit("list(p.wrapped_walk())", number=1000, globals=globals()))

I understand that @barneygale 's example with simplified implementation was a little faster than os.walk, but I’m afraid it’s because he didn’t fill dirs and files with Path objects and did less error-handling. I profiled all three cases (timed above) and saw that the creation of Path objects is indeed the reason for the slowdown (It’s a little obvious but I tried to approach it scientifically).

As for the idea of adding stuff to os.walk – I completely disagree. To add support for pathlib there, we would need to rewrite multiple parts of os.walk – it’s not just a simple constructor argument. I’m against adding this much complexity to an already overwhelming function, especially considering the possible bugs we could introduce.

Hence I once again do not see much value in rewriting or modifying os.walk. After this much testing and discussion, I see the wrapper solution as the optimal one. ~15% difference doesn’t seem to be big enough to justify introducing this much code duplication.

P.S. if people reading this discussion or the PR disagree with me and voice their opinion, I’d be happy to agree to a recreation of os.walk. os.walk’s implementation is extremely unlikely to have any bugs or to change in general so even if we do follow the recreation route – we don’t risk much. But for now I consider that the safer wrapper option is better.

barneygale · May 2, 2022, 1:40am

Why should dirs and files contain Path objects? In os.walk() they contain names, not paths.

In the case of os.walk() you can generate a full paths like:

for root, dirs, files in os.walk('python/Lib/email'):
    for dir_name in dirs:
        dir_path = os.path.join(root, dir_name)

Under my proposal, the equivalent in pathlib would be:

for root, dirs, files in Path('python/Lib/email').walk():
    for dir_name in dirs:
        dir_path = root / dir_name

barneygale · May 2, 2022, 1:47am

One reason I want to avoid adding a new dependency on os is that it makes introducing an AbstractPath class more difficult. See this PR for an example implementation, and note that we already have importlib.abc.Traversable. Introducing a walk() that’s independent of iterdir() / _scandir() complicates the interface.

Ovsyanka · May 2, 2022, 5:25am

I personally want to convert to paths because every time I use walk, I have to do this conversion :))
Take a look at iterdir – os version will also only return names but pathlib’s version returns Paths. I believe that it makes as much sense here to return Paths.

Essentially if we go the route of yielding names instead of paths, then our wrapper simplifies to:
yield from os.walk(self, *args, **kwargs). And I don’t think pathlib has ever done this kind of thin wrapping (though you probably know better as you regularly commit into pathlib and are more familiar with it). As you said before, pathlib tries to fix rough edges. The discussion above showed us why its arguments are not rough edges but necessities. The return type is also hard to argue. So if we just yield names, we would be improving/simplifying nothing. Pathlib’s version will be no better than os’s version.

As I said, I do not oppose recreating walk without depending on os.walk. But implementing walk without conversion to paths seems weird: why would someone ever use pathlib for that, then? Only if the person has been using paths all along, I guess. Also remember that we need to convert all directories to paths anyway because we will need to make recursive calls into them so all we change is do that conversion earlier and add conversion of files to paths.

barneygale · May 2, 2022, 12:48pm

Take a look at iterdir – os version will also only return names but pathlib’s version returns Paths

I assume this is part of the reason they have different names - iterdir() vs listdir(). Folks would expect listdir() to return names. Or at least, I would

Essentially if we go the route of yielding names instead of paths, then our wrapper simplifies to:
yield from os.walk(self, *args, **kwargs)

Small clarification: root should be yielded as a Path object rather than a str path, right?

Ovsyanka · May 2, 2022, 4:33pm

Your clarification is very on point, yes!

So in order to win an argument with you, I looked through all methods we expose in pathlib and tried to find those that are named the same yet return paths instead of names. And I didn’t find any. Even glob.glob returns paths (though relative which makes them look like names in the non-recursive mode).

So I agree with you now. I also see that you do not want to do a wrapper but a reimplementation, right? If so, I’ll most likely finish everything for the pull request today.

barneygale · May 3, 2022, 6:35pm

Yeah, exactly. I’d love an implementation that uses self._scandir(), because it means users of the future AbstractPath class will get a fully functional walk() implementation just by implementing iterdir() (which will provide a default implementation for AbstractPath._scandir()). Strictly speaking I can re-implement it myself when I add AbstractPath but it will result in some churn.

I’ve thought more about the top_down=False and I’m +1 on supporting that. I think the existing os.walk() implementation might be overcomplicated by trying to support both algorithms in a single function, and I’d recommend you experiment with adding a Path._walk_bottom_up() method which you call conditionally at the beginning of walk(). It may result in cleaner code by removing some repeated ifs/elses. It may also be a little faster. (edit: or it could be neither!)

On onerror: I really don’t like it - it feels like idiomatic javascript rather than python! I wonder if PEP 654 exception groups might offer an alternative here. I suppose the requirement is that we handle OSError gracefully (i.e. don’t walk into the path, but continue to walk into its ancestors/siblings/etc), but still give the user the opportunity to handle exceptions that occurred during the walk.

barneygale · May 3, 2022, 8:42pm

I’m struggling to come up with a viable alternative to onerror that retains all the use cases, specifically the ability to ignore or re-raise exceptions based on a user-defined filter. If we think that use case is compelling I’d be OK with an on_error parameter. I would love for someone to suggest something better though!

Ovsyanka · May 3, 2022, 10:10pm

Then it’s decided. I do not have high hopes about simplifying or speeding it up but I’ll certainly try! I hope you will have time to review my code later.

Ovsyanka · May 8, 2022, 6:03pm

So I split it into walk_up and walk_down, removing topdown argument. I tried to optimize around it but nothing was faster than single-function implementation.

I also fail to understand how to include Path._walk_bottom_up in there for optimization.

Yet I still decided to leave it as two public methods: walk_up and walk_down (naming is crap but I couldn’t come up with anything better. We could do walk and walk_bottom_up, for example). The reason for having it as two methods is because we now have one less argument and each method has only one mode of operation instead of a single method with two modes of operation. I feel like it makes for a less complicated API.

Any suggestions are welcome. I will make a new pull request (the old one got stale and severely outdated)

barneygale · May 8, 2022, 8:02pm

Just to explain what I meant in my earlier post, I was suggesting something like this:

def walk(self, top_down=True, ...):
    if not top_down:
        return self._walk_bottom_up()
    # implementation of top-down here.

def _walk_bottom_up(self):
    # implementation of bottom-up here.

Ovsyanka · May 8, 2022, 8:43pm

Yes, I’ve tried a similar idea. There are a few issues with it:
scandir loop is mostly the same for both modes and is always called at the beginning. It is also required for descent for both modes. I tried to abstract it as a separate method but then error handling gets ugly, slower, and you effectively write more code, not less.

barneygale · May 8, 2022, 8:44pm

Yep fair enough looking forward to the PR!

Ovsyanka · May 8, 2022, 8:45pm

Yeah, hopefully within an hour. 99% of time is spent figuring out how to refactor os.walk and its tests. And in all cases I just give up and copy because the original implementation is pretty great.

Topic		Replies	Views
Using iterative filesystem walk instead of recursive Core Development	42	3716	October 21, 2023
Pathlib and os.path: feature parity and code de-duplication Ideas	6	1142	June 21, 2021
Incrementally move high-level path operations from shutil to pathlib Ideas	21	1969	November 7, 2022
Add pathlib.Path.rmtree Ideas	6	586	May 19, 2022
Request for review: pathlib._PathBase Core Development review-request	3	508	September 11, 2023

Add pathlib.Path.walk method

Related Topics