Add pathlib.Path.walk method

Ovsyanka · May 8, 2022, 11:10pm

The new PR

Docs and tests will be ready tomorrow; the news blurb will be ready once we decide on the names.

Ovsyanka · May 8, 2022, 11:34pm

I noticed a problem with walk vs fwalk.
First of all, we forgot about fwalk :D.
Second, the behavior of fwalk is inconsistent with walk. I have only glanced through fwalk but noticed many small differences in how they handle errors. Here’s a short comparison:

We might need to add support for fwalk and refactor fwalk/walk to have the same behavior but obviously in other issues/PRs. Or do I misunderstand the use case of fwalk?

Ovsyanka · May 9, 2022, 10:01am

Another question to all users of os.walk:

How often do you replace a directory from dirnames with a symlink (or remove it entirely) in between os.walk resumptions? os.walk actually handles such changes gracefully, but we think it makes more sense to drop that handling in Path.walk for optimization purposes. What do you think?

Together with a few other optimizations, this makes os.walk 3% slower on average than Path.walk (though os.walk will still be faster if we have more directories than non-directories). Path.walk_bottom_up is ~25% slower than os.walk(…, topdown=False), but it’s still miles better than what we had in the beginning.

BowlOfRed · May 9, 2022, 3:58pm

Remove it from dirnames or remove it from the filesystem?

I’d never want to remove it from the filesystem unless I was doing a “bottom-up” walk. I can guess at a use for replacing with a symlink in some sort of backup/mirror process, but it’s not one that I have ever attempted.

I might remove it from dirnames if I wanted to prune it from the walk.
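A minimal sketch of that pruning idiom with os.walk, assuming a throwaway tree in a temporary directory (the helper name and the directory layout are made up for illustration). Removing a name from dirnames in place during a top-down walk stops os.walk from descending into it:

```python
import os
import tempfile


def walk_without(top, skip_name):
    """Return the directories visited by a top-down walk that skips skip_name."""
    visited = []
    for root, dirnames, filenames in os.walk(top):
        if skip_name in dirnames:
            # The removal must be in place; rebinding dirnames would not prune.
            dirnames.remove(skip_name)
        visited.append(os.path.relpath(root, top))
    return sorted(visited)


with tempfile.TemporaryDirectory() as top:
    # Hypothetical layout: src/, CVS/ and CVS/nested/
    os.makedirs(os.path.join(top, "src"))
    os.makedirs(os.path.join(top, "CVS", "nested"))
    visited = walk_without(top, "CVS")
    print(visited)
```

This only works with the default topdown=True; in a bottom-up walk the subdirectories have already been visited by the time dirnames is handed to the caller.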

Ovsyanka · May 9, 2022, 8:00pm

I’m talking about removing it from the filesystem or replacing it with a symlink, without removing it from dirnames.

Ovsyanka · May 9, 2022, 9:01pm

One more question for the users of walk:

If you set follow_symlinks to False, would you still expect symlinks to directories to appear in dirnames? Or would you expect them to appear in filenames?

BowlOfRed · May 9, 2022, 9:18pm

As I’m not going to descend into it like a directory, I would expect it to be in filenames.
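For comparison, here is a rough sketch of where a symlink to a directory ends up today. It assumes a POSIX-like system where os.symlink works without extra privileges, and the file and directory names are invented for the example. os.walk documents that such symlinks are listed in dirnames even when they are not followed; the Path.walk branch only runs on interpreters that have the method.

```python
import os
import tempfile
from pathlib import Path


def top_level_listing(top):
    """Return (dirnames, filenames) that os.walk reports for top itself."""
    _, dirnames, filenames = next(os.walk(top, followlinks=False))
    return sorted(dirnames), sorted(filenames)


with tempfile.TemporaryDirectory() as top:
    # Hypothetical layout: real/ (directory), link -> real (symlink), file.txt
    os.mkdir(os.path.join(top, "real"))
    os.symlink(os.path.join(top, "real"), os.path.join(top, "link"),
               target_is_directory=True)
    Path(top, "file.txt").touch()

    os_dirs, os_files = top_level_listing(top)
    print("os.walk dirnames: ", os_dirs)
    print("os.walk filenames:", os_files)

    # Path.walk is not available on older interpreters, so guard the call.
    if hasattr(Path, "walk"):
        _, p_dirs, p_files = next(Path(top).walk(follow_symlinks=False))
        print("Path.walk dirnames: ", sorted(p_dirs))
        print("Path.walk filenames:", sorted(p_files))
```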

Ovsyanka · May 9, 2022, 9:36pm

Does anyone have any suggestions about how we could get an opinion of the community at large?

merwok · May 20, 2022, 12:12pm

Can someone summarize the reason for returning strings for dirnames and filenames rather than Path objects?

Thinking about this in the context of Add `rmtree` & `copy` method to pathlib · Issue #92771 · python/cpython · GitHub (adding rmtree and copy methods to Path, wrapping shutil functions), I think that a general idea of pathlib is that the operations there work in terms of paths. If you want strings, you can still use os.path or shutil functions and pass them a Path object, thanks to their support of the fspath protocol. So if equivalent methods are added to Path and you are using them, isn’t the point to keep working with Path objects?

barneygale · May 20, 2022, 3:59pm

You often don’t want full paths, you want names, as in the first example in the os.walk() docs:

    if 'CVS' in dirs:
        dirs.remove('CVS')  # don't visit CVS directories

Parity between the two functions is beautiful and easy to explain:

import os
from pathlib import Path

for root, dirnames, filenames in os.walk('python/Lib/email'):
    for dirname in dirnames:
        dirpath = os.path.join(root, dirname)

for root, dirnames, filenames in Path('python/Lib/email').walk():
    for dirname in dirnames:
        dirpath = root / dirname

… and generating a Path object for everything in filenames and dirnames has a performance cost.

My 2c
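To get a rough feel for that cost, here is a small timing sketch. It is not a benchmark of either walk implementation: the root path is borrowed from the example above, while the file names, the helper names, and the repeat count are made up, and the absolute numbers will vary by machine and Python version.

```python
import timeit
from pathlib import Path

# Hypothetical directory listing: 1,000 made-up file names under one root.
root = Path("python/Lib/email")
names = [f"module_{i}.py" for i in range(1000)]


def keep_names():
    """What a names-based walk hands back: the strings, untouched."""
    return list(names)


def build_paths():
    """What a paths-based walk would have to do: one Path per entry."""
    return [root / name for name in names]


names_time = timeit.timeit(keep_names, number=200)
paths_time = timeit.timeit(build_paths, number=200)
print(f"names only:   {names_time:.4f}s")
print(f"Path objects: {paths_time:.4f}s")
```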

Ovsyanka · May 20, 2022, 5:50pm

To add to what Barney said, I also wanted to make them paths instead of names (you can take a look above and see that my original implementation actually used paths).

But a paths-only implementation is twice as slow and, as Barney mentioned, quite unnatural in real-life use.

merwok · May 21, 2022, 9:49am

Indeed, better to have names on hand rather than extracting them from a Path.

I do not find the parity argument compelling. There is no such principle of parity between strings and Paths because they are different classes with different interfaces.

But the performance argument is very convincing. Thanks for repeating these points!

In the other pathlib discussion (Add `rmtree` & `copy` method to pathlib · Issue #92771 · python/cpython · GitHub), Brett makes this point:

The reasons for adding Path.walk seem to be:

- people don’t want to build full paths → they could use os.walk(some_path, ...) + some_path / name
- someone said the new method is simpler and faster than os.walk:
  - I suppose it’s not possible to get os.walk faster?
  - it feels a bit weird that the method in pathlib would have less capability than the function in os, supposed to be lower-level; thoughts on that?

barneygale · May 21, 2022, 10:17am

people don’t want to build full paths → they could use os.walk(some_path, ...) + some_path / name

This isn’t quite correct. To use os.walk() one would need to do:

import os
import pathlib

path = pathlib.Path('Lib')
for root, dirnames, filenames in os.walk(path):
    root = pathlib.Path(root)
    for dirname in dirnames:
        dirpath = root / dirname

Note that dirname is relative to root, not path, and that we must manually construct a Path object for root.

it feels a bit weird that the method in pathlib would have less capability than the function in os, supposed to be lower-level; thoughts on that?

Could you expand on why you think it has less capability than the function in os? I don’t quite follow, but I’m not fully awake yet.

barneygale · May 21, 2022, 10:56am

One point I forgot to mention earlier: we’ve built the new method closely around os.scandir() and avoided a couple of stat() calls that are made in the os.walk() implementation. I suspect the new method is faster than os.walk() even when accounting for the cost of constructing Path objects. Will try to get some numbers in the next few days.
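As an illustration of the general idea only (this is not the code from the PR), a top-down walk built directly on os.scandir() might look like the sketch below. The function name scandir_walk is hypothetical, error handling is left out, and symlinks are never followed.

```python
import os
import tempfile
from pathlib import Path


def scandir_walk(top):
    """Top-down walk yielding (Path, dirnames, filenames), never following symlinks."""
    stack = [Path(top)]
    while stack:
        root = stack.pop()
        dirnames, filenames = [], []
        with os.scandir(root) as entries:
            for entry in entries:
                # DirEntry.is_dir can usually answer from data scandir already
                # fetched, so no separate stat() call is needed per entry.
                if entry.is_dir(follow_symlinks=False):
                    dirnames.append(entry.name)
                else:
                    filenames.append(entry.name)
        yield root, dirnames, filenames
        # dirnames is read after the yield so the caller can prune it in place.
        stack.extend(root / name for name in reversed(dirnames))


with tempfile.TemporaryDirectory() as top:
    # Hypothetical layout: pkg/mod.py and pkg/sub/
    Path(top, "pkg", "sub").mkdir(parents=True)
    Path(top, "pkg", "mod.py").touch()
    found = [
        (r.relative_to(top).as_posix(), sorted(d), sorted(f))
        for r, d, f in scandir_walk(top)
    ]
    print(found)
```

Only one Path is built per directory visited, while the entries inside each directory stay plain strings, which matches the cost trade-off described earlier in the thread.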

Ovsyanka · May 21, 2022, 12:30pm

We have removed a try-except block and a single is_symlink call, which made Path.walk(bottom_up=False) around 7% faster for some use cases (one of which is Path("cpython").walk()). Specifically, it is faster when we have a lot of files and not too many directories (because Path construction is still expensive, and the more directories we have, the more paths we have to construct).

bottom_up=True is still slower than os.walk.

merwok · May 23, 2022, 3:51pm

It’s something I read here (or on the PR?) by one of the implementers of the new method.

Ovsyanka · May 23, 2022, 4:42pm

Not sure if that was ever the case (the discussion is too huge to re-read), but I can confidently assert that Path.walk has as many features as os.walk. Unlike os.walk, however, it doesn’t hold the user’s hand when the user changes a directory to a symlink in between iterations, and Path.walk(follow_symlinks=False) considers symlinks to directories to be files.

These two differences have the biggest impact in terms of performance. And they also make walk’s code incredibly simple.

Off topic: I am astonished that such a simple small addition has created so much discussion. Obviously, I understand why we needed this discussion but I can only imagine what horrors the community went through when adding pattern matching and walrus…

barneygale · April 10, 2023, 5:00pm

Just spotted this reddit post about 3.12 changes, where the addition of pathlib.Path.walk() is celebrated in the most-upvoted comment. Nice one @Ovsyanka!
