Get DirEntry objects collected during os.walk()

BowlOfRed · April 8, 2021, 5:39pm

os.walk() uses os.scandir() to capture stat information on the objects during the walk. Is this information exposed to the caller? I will often want to walk the tree looking for information that was captured by the stat. I don’t want to run a second stat on every file, but the return from the walk is just strs, not DirEntry objects.

Maroloccio · April 8, 2021, 8:45pm

I assume you already read these observations from the PEP (PEP 471, which introduced os.scandir())?

BowlOfRed · April 8, 2021, 11:58pm

I hadn’t read that PEP, just the current documentation on os.walk. But I’m not sure what I’m supposed to understand from it. The section linked seems to be something proposed for os.scandir(). The return values from os.scandir() are fine. What I was hoping for is that os.walk() not only consumes that information but also could expose it to the caller. Unfortunately, the return value of strings was set long ago, so I’m not sure it could be done easily.

I believe that if I want to write something like the “get_tree_size()” from that PEP, I have two options: use os.scandir() directly and handle the recursion/tree walk myself, or use os.walk() for the recursion and repeat the os.stat() call on every file to gather the size data.

Maroloccio · April 9, 2021, 3:39pm

I meant this part:

However, the stat_result is only partially filled on POSIX-based systems (most fields set to None and other quirks), so they’re not really stat_result objects at all, and this would have to be thoroughly documented as different from os.stat().

It might be that those stat results aren’t usable in their fetched form?

BowlOfRed · April 9, 2021, 4:39pm

I was reading that portion as saying that not 100% of the stat object was mapped. But as I want the same kind of information that os.scandir() already provides, I don’t see a problem at that level. It’s just that the stat results gathered inside os.walk() can’t be reused by the caller.

Just to make sure there was no magic going on under the covers, I tried a couple of tests and confirmed that the os.walk() version took almost twice as long for a large tree, presumably due to the repeated stat call.

os.walk() version:

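import os
from pathlib import Path

def path_total_size(root):
    # st_blocks is not available on Windows.
    du_total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for file in filenames:
            # os.walk() yields only names, so every file is stat'ed again here.
            du_total += Path(dirpath, file).stat().st_blocks
    return du_total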

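os.scandir() version:

import os

def scan_total_size(root):
    du_total = 0
    # The with block closes the scandir iterator promptly.
    with os.scandir(root) as entries:
        for dirent in entries:
            # Note: is_dir() follows symlinks, unlike os.walk()'s default.
            if dirent.is_dir():
                du_total += scan_total_size(dirent.path)
            else:
                # The stat result is cached on the DirEntry after the first call.
                du_total += dirent.stat().st_blocks
    return du_total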
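
For what it’s worth, the recursion can be wrapped up once in a walk-style generator that hands back the DirEntry objects instead of strs. This is only a rough sketch under a few assumptions, not anything that exists in the standard library: the names walk_entries and total_size are made up for illustration, there is no error handling, and symlinked directories are listed with the files and never descended into (os.walk() would put them in dirnames instead). It also sums st_size rather than st_blocks so that it runs on Windows too.

```python
import os
from typing import Iterator, List, Tuple


def walk_entries(root: str) -> Iterator[Tuple[str, List[os.DirEntry], List[os.DirEntry]]]:
    """Hypothetical helper: yield (dirpath, dir_entries, file_entries), top-down."""
    dirs, files = [], []
    with os.scandir(root) as entries:
        for entry in entries:
            # follow_symlinks=False: a symlink to a directory lands in `files`
            # and is never recursed into, which avoids symlink loops.
            if entry.is_dir(follow_symlinks=False):
                dirs.append(entry)
            else:
                files.append(entry)
    yield root, dirs, files
    for d in dirs:
        yield from walk_entries(d.path)


def total_size(root: str) -> int:
    """Sum st_size over every non-directory entry below root."""
    return sum(
        entry.stat(follow_symlinks=False).st_size
        for _dirpath, _dirs, files in walk_entries(root)
        for entry in files
    )
```

The caller’s loop then looks the same as with os.walk(), except that each item in the two lists is a DirEntry, so name, path, is_dir() and stat() are available without building paths by hand.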