Traversing directories with asyncio is incomplete

I have a Python source directory containing a compiled Python 3.10 build along with the rest of the libraries needed to build Python. I wrote two functions, one synchronous and one asynchronous, to measure how long each took to traverse the source directory and count the files in it by file extension.

The async function was orders of magnitude faster than the sync function, but I realized later that the async function’s traversal was incomplete. It only traversed the top-level directories and their immediate children, not the deeper levels below them. Here is the code I wrote for the sync and async functions; hopefully someone here can help me diagnose the issue while I keep looking into it myself.

synchronous function

from pathlib import Path

files = {}
def check (folder):
    for file in folder.iterdir():
        if file.is_dir():
            print("checking: ",str(file)) #added this to track the folders being traversed
            check(file)
        else:
            ext = file.suffix
            if ext in files:
                files[ext] += 1
            else:
                files[ext] = 1

check(Path(r"path-to-python-built-directory"))

asynchronous function

import asyncio
from pathlib import Path

files = {}
async def check (folder):
    for file in folder.iterdir():
        if file.is_dir():
            print("checking: ",str(file)) #added this to track folders being traversed
            asyncio.create_task(check(file))
        else:
            ext = file.suffix
            if ext in files:
                files[ext] += 1
            else:
                files[ext] = 1

async def main ():
    await check(Path(r"path-to-python-built-directory"))

asyncio.run(main())

As for the speed test results: the synchronous function took somewhere in the range of 25s - 35s to traverse the folder the first time, but repeated runs brought that down to the range of 0.6s - 0.7s.

The asynchronous function took 0.01s to traverse the folder the first time, and subsequent calls took 0.0s. That was the timing that shocked me. Later I realized the traversal was incomplete, so I guess those results aren’t valid.

You’re creating a Task here, but you never wait for it to complete. asyncio.run waits for the one task it was given, not necessarily for any other tasks that were started.

See the note in the docs: “Important Save a reference to the result of this function, to avoid a task disappearing mid execution.”

If you want to run in parallel, do:

task = asyncio.create_task(check(file))
# ... some time later
await task

Or put the tasks in a list and wait for all of them using something like asyncio.gather.
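
For example, a rough sketch of that gather approach applied to your check function (keeping the same global files dict; untested against your directory, but it shows the shape):

async def check(folder):
    tasks = []
    for file in folder.iterdir():
        if file.is_dir():
            # remember the task so it can be awaited below
            tasks.append(asyncio.create_task(check(file)))
        else:
            ext = file.suffix
            files[ext] = files.get(ext, 0) + 1
    # wait for every subdirectory before this coroutine finishes
    await asyncio.gather(*tasks)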

However, you’re not doing any async IO in this code. Don’t be surprised to find that introducing asyncio will just slow things down.


Thanks @encukou. I did that the first time I tried to compare the two, but then both ran in almost the same time (0.6s - 0.7s), so I thought I was just recreating a synchronous world in asynchronous code that way. I was actually trying to recreate what @ambv was doing with URLs in one of his videos about asyncio, where he achieved concurrency with asyncio.create_task. He doesn’t await the tasks there, but rather relies on that method to schedule them.

I sadly don’t have time to watch the video, sorry.
There are async functions and libraries for fetching content from the network, so that’s where asyncio can help the most, and where it’s best to learn it.
For reading/writing files there are libraries like aiofiles, but I don’t know if anyone has made an async API for directory traversal (or if it would help much in real-world applications).

You can still use normal sync functions like iterdir with asyncio, of course, but asyncio can’t speed that code up (without something like a thread pool).
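
To illustrate that thread-pool idea (just a sketch; the function name is made up, and asyncio.to_thread needs Python 3.9+), the blocking iterdir call can be pushed to a worker thread so several directories can be listed concurrently:

import asyncio
from pathlib import Path

async def count_files(folder, counts):
    # list the directory in a worker thread instead of blocking the event loop
    entries = await asyncio.to_thread(lambda: list(folder.iterdir()))
    subtasks = []
    for entry in entries:
        # note: entry.is_dir() still does a blocking stat here; fine for a sketch
        if entry.is_dir():
            subtasks.append(asyncio.create_task(count_files(entry, counts)))
        else:
            counts[entry.suffix] = counts.get(entry.suffix, 0) + 1
    await asyncio.gather(*subtasks)

counts = {}
asyncio.run(count_files(Path("."), counts))
print(counts)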


So I figured out what I was doing wrong: I didn’t use await in the asynchronous function. I rewrote the async function like this:

import asyncio
from pathlib import Path

files = {}
async def check (folder):
    for file in folder.iterdir():
        if file.is_dir():
            asyncio.create_task(check(file))
            await asyncio.sleep(0) #forgot to add this
        else:
            ext = file.suffix
            if ext in files:
                files[ext] += 1
            else:
                files[ext] = 1

asyncio.run(check(Path(r"path-to-python-built-directory")))

This works well and traversed the whole folder in approximately 0.2s, which is better than the 0.7s for the synchronous function.

Are you sure it traversed the whole directory? It doesn’t really make sense for asyncio to be faster here since you’re not actually doing any asynchronous IO (as Petr already pointed out).

Yes it did. I used some print-based logging to make sure I recorded all the directories that were traversed. That’s how I was able to notice earlier that not all directories were traversed.

In my opinion, reading files and directories is a form of IO in itself. I don’t think IO is only for sockets, because at the end of the day those are also files.

I’m sure the power of asyncio can be extended into more areas rather than limited to a single use case.
If you check out @ambv’s asyncio series on EdgeDB’s YouTube channel, you’ll get to understand the whole paradigm a lot better.

Traversing directories does IO, but not asynchronous IO, because you’re not awaiting anything. You could use something like https://github.com/Tinche/aiofiles, but as I understand it asynchronous file I/O doesn’t usually help performance.
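
For reference, this is roughly what aiofiles usage looks like for reading a single file (the filename is just a placeholder); the library runs the blocking file operations in a thread pool behind the scenes:

import asyncio
import aiofiles

async def read_one():
    # opening and reading happen in a worker thread, so the event loop stays free
    async with aiofiles.open("README.rst") as f:   # placeholder filename
        return await f.read()

asyncio.run(read_one())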

I tried your code on a large directory and confirmed that the asyncio version is buggy. The synchronous version finds 10175 files, the asyncio one only 7089.

Indeed it’s quite buggy; that’s the reason I raised this topic, because traversing directories with coroutines is incomplete. I’m still investigating why it sometimes completes and other times doesn’t.

However, I’d like to note that coroutine-based traversal could be much faster than normal recursion. I can’t prove it solidly yet because of all these inconsistencies, but preliminary comparisons suggest so.

I too ran a synchronous and an asynchronous version over my home directory, and the async one is missing a lot of files.

Here is my code:

import asyncio
from pathlib import Path

a = []
b = []

def sync_ls(path):
    a.append(str(path))
    for file in path.iterdir():
        if file.is_dir():
            sync_ls(file)
        else:
            a.append(str(file))

async def async_ls(path):
    b.append(str(path))
    for file in path.iterdir():
        if file.is_dir():
            asyncio.create_task(async_ls(file))
            await asyncio.sleep(0)
        else:
            b.append(str(file))

sync_ls(Path('.'))
asyncio.run(async_ls(Path('.')))

The result is that the sync version finds 260879 files and directories, but the async version only finds 158456.

When I tried them on a small directory with only 28 items, they both give the same listing.

I have literally no idea how to debug the async version.

The problem is

asyncio.create_task(check(file))
await asyncio.sleep(0)

This has the same problem Petr mentioned – you’re not actually awaiting the recursion. His post above has the ways I’d solve this.

As everyone’s mentioned though, there’s no real reason to do this.

A


Awaiting the task is not the problem, as the recursion still runs when scheduled with tasks. Awaiting each task immediately would actually recreate a synchronous environment no different from the synchronous function, and wouldn’t give us the benefits of concurrency in this case.

What we’re investigating is why the traversal completes in some cases and not in others. This behavior might only be showing up in directory traversal right now, but it could probably show up in a “real world application” as well, as you’re emphasizing, where perhaps a web crawler fails to traverse an entire URL tree or something.

Modifying Steve’s code, the version below traverses fully. In the original version nothing awaits the tasks, and so the main function can finish first. It is also more than twice as slow.

“wouldn’t give us the benefits of concurrency in this case.”

There are no benefits of concurrency in this case!

A

import asyncio
from pathlib import Path

a = []
b = []


def sync_ls(path):
    a.append(str(path))
    for file in path.iterdir():
        if file.is_dir():
            sync_ls(file)
        else:
            a.append(str(file))

async def async_ls(path):
    tasks = []
    b.append(str(path))
    for file in path.iterdir():
        if file.is_dir():
            tasks.append(asyncio.create_task(async_ls(file)))
        else:
            b.append(str(file))
    await asyncio.gather(*tasks)


sync_ls(Path('.'))
asyncio.run(async_ls(Path('.')))

Adam: “In the original version nothing awaits the task”

What I don’t understand is why the original version half works.

If the original version gathered the file names from the top directory and then never recursed down into the subdirectories at all, that would make sense to me. As you say, nothing awaits the tasks, so they don’t run at all.

But what I don’t get is how they partially run. If nothing awaits them, how do they run a bit but not to completion?

Adam: “There are no benefits of concurrency in this case!”

For me, the main benefit is as a learning exercise.

But surely the benefit is the same as for any other asynchronous processing – you don’t have to halt all other processing while the task runs to completion.

On my computer, walking my entire home directory synchronously takes about 25 seconds, and that’s on a local file system on an SSD. If I were traversing a huge remote file system, it might take three or five minutes, during which time my program can do nothing else.

But if I could do it asynchronously, it might take longer, but my program could continue doing other things while it was happening. Is that important? Obviously not for this toy file lister, but for a real application, maybe.

A guess (not having looked at the internals of asyncio recently) is that since create_task immediately schedules the coroutine for execution, the tasks that were able to start and finish before the main function (asyncio.run(...)) returned are what account for the partial count.

I’m very much a dabbler in asyncio though, not an expert. I see your point on the rest of it, but that’s more an argument for using ProcessPoolExecutor or ThreadPoolExecutor with a sync function. If the file system were fully remote, then async for the network might also have benefits, I think.

A


I believe Adam is right. I noticed that the asyncio version always produces the same (incorrect) number of files. It would be an interesting exercise to figure out exactly what produces that number. My guess is that every sleep(0) call allows all the other tasks that are currently active to also advance to the next sleep. But if they have any work left to do after the top-level coroutine is done, that work is thrown away.
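
A tiny standalone sketch (names made up) that reproduces this “runs a bit but not to completion” behaviour:

import asyncio

async def worker():
    print("worker: step 1")
    await asyncio.sleep(0)      # yield back to the event loop
    print("worker: step 2")     # never printed: the task is cancelled first

async def main():
    asyncio.create_task(worker())
    await asyncio.sleep(0)      # gives the worker one chance to advance
    # main returns here; asyncio.run() then cancels the still-pending worker

asyncio.run(main())             # output: only "worker: step 1"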

The reason asyncio doesn’t help here is that the actual IO operations aren’t asynchronous: both the sync and the async versions still call .iterdir() and .is_dir(), and those are synchronous calls. asyncio runs in a single thread, so you don’t get to run any other code while waiting for these IO operations to complete.

But the asyncio version does reorder some of the IO calls, and depending on the state of your file system cache (and who knows what else), you may observe that one call is faster than another.


So I tried to investigate Adam’s statement. I kept track of the tasks that were created, and realized that while all of the tasks were created, some got cancelled.

import asyncio
import time
import sys
from pathlib import Path

files = {}
task_set = set()

async def check (dir):
    for file in dir.iterdir():
        if file.is_dir():
            task = asyncio.create_task(check(file))
            task_set.add(task)
            await asyncio.sleep(0)
        else:
            ext = file.suffix
            if ext in files:
                files[ext] += 1
            else:
                files[ext] = 1

async def main ():
    await check(Path(r"C:\Users\user\Desktop\OpenSourceCode\cpython-3.10"))

if __name__ == "__main__":
    asyncio.run(main())
    total_tasks = len(task_set)
    cancelled_tasks = 0
    not_done_tasks = 0
    for t in task_set:
        if t.cancelled():
            cancelled_tasks += 1
        elif not t.done():
            not_done_tasks += 1
    stats = f"""\t    total tasks: {total_tasks} \n
            cancelled tasks: {cancelled_tasks} \n
            not done tasks: {not_done_tasks} \n
            """

    print(stats)

In this case I got
total tasks: 220
cancelled tasks: 44
not done tasks: 0

I don’t know why I checked for not-done tasks, since cancelled tasks count as done anyway, but I have finally gotten to see the cause of my frustration. Now I just have to find a way of making sure the tasks don’t get cancelled midway.
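
One possible way to do that (a sketch building on the tracking code above, not the only option) is to keep awaiting whatever is still pending in task_set before the main task returns:

async def main():
    await check(Path(r"path-to-python-built-directory"))
    # keep waiting until nothing is left unfinished; tasks that finish here
    # may have scheduled new child tasks, so re-check the set each time
    while pending := [t for t in task_set if not t.done()]:
        await asyncio.gather(*pending)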

That’s correct.
The task gets started and allowed to run for a while, but asyncio.run only waits for the “main” task. After the main task finishes, any other tasks that are still running are cancelled.
To ensure a task runs to completion, something needs to await it. (Or do something equivalent to await, e.g. asyncio.run for the main task, or a low-level callback.)

Why is it done this way?
For me, the most important reason is this: If you care about the work a task does, you need to be notified of a possible error that happened in it. The await is the place where such an exception is raised: it’s what allows an unhandled failure in a task to become a failure of the whole process.
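
For example (a small illustrative sketch), an exception raised inside a task only surfaces where the task is awaited:

import asyncio

async def might_fail():
    raise OSError("permission denied")   # e.g. an unreadable directory

async def main():
    task = asyncio.create_task(might_fail())
    try:
        await task                       # the task's exception is re-raised here
    except OSError as exc:
        print("handled:", exc)

asyncio.run(main())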

1 Like