Asyncio for files

I would like to see asyncio for files, using the underlying OS asynchronous I/O features (overlapped I/O on Windows, or io_uring on Unix). For example:

import aiofile as a

await a.print("example")
async with a.open("myfile") as f:
    await f.readline()…

aiofiles has an API for asynchronous file I/O, but the implementation uses synchronous I/O in a thread pool, and doesn’t really take much advantage of asyncio.
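Under the hood its wrappers do something roughly like this (a sketch of the pattern, not aiofiles’ actual code):

import asyncio

async def read_file(path):
    # The file is opened and read with ordinary blocking calls; the only
    # asynchronous part is shipping the work off to a worker thread.
    def _read():
        with open(path, 'rb') as f:
            return f.read()
    return await asyncio.to_thread(_read)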


io_uring is still affected by severe vulnerabilities.


Previous discussion: Add support for io_uring to cpython · Issue #85443 · python/cpython · GitHub


Well, Win32 has had asynchronous I/O (overlapped I/O) since about 1992, so that might be the place to start, using the aiofiles technique for operating systems lacking asynchronous I/O.

It really does seem a shame not to leverage asynchronous I/O on operating systems that have it.


See pywin32 · PyPI, which provides access to many of the Windows APIs from Python.

It appears to support overlapped I/O too: https://github.com/search?q=repo%3Amhammond%2Fpywin32%20FILE_FLAG_OVERLAPPED&type=code
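For anyone wanting to experiment, a rough sketch of a single overlapped read through pywin32 might look like the following (error handling omitted; 'myfile' is a placeholder, and ReadFile may also complete synchronously):

import pywintypes
import win32con
import win32event
import win32file

handle = win32file.CreateFile(
    'myfile',
    win32con.GENERIC_READ,
    0,                               # no sharing
    None,                            # default security
    win32con.OPEN_EXISTING,
    win32con.FILE_FLAG_OVERLAPPED,   # ask for asynchronous I/O
    None,
)

overlapped = pywintypes.OVERLAPPED()
overlapped.hEvent = win32event.CreateEvent(None, True, False, None)

buf = win32file.AllocateReadBuffer(4096)
win32file.ReadFile(handle, buf, overlapped)   # returns without waiting

# ... do other work while the read is in flight ...

win32event.WaitForSingleObject(overlapped.hEvent, win32event.INFINITE)
nbytes = win32file.GetOverlappedResult(handle, overlapped, True)
data = bytes(buf[:nbytes])
win32file.CloseHandle(handle)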


Yes, but not with asyncio. I think there is merit in implementing the interface provided by aiofiles on top of Win32 overlapped I/O and the Linux equivalent.

If this is known, is it actively being worked on by the Linux team and community?

It’s best if programming languages remain OS-agnostic and most functionality is abstracted, except for OS-specific functionality. Take the dbm module as an example.

Users of Unix-based systems might expect and know of the Unix dbm functionality, and it is known that there isn’t a Windows alternative. I haven’t seen the dbm module used much in the wild or in modern projects, but it is there for backwards compatibility and for Unix systems.

Async I/O for files is, however, functionality expected by users of any OS, for obvious reasons. So if we implemented it only for Windows, we would need to implement something for when the code runs on Unix. What do you do? Ignore the expectation of async and just provide synchronous behaviour, thus not behaving as expected, or provide an alternative to the OS implementation?

For this reason, I think it would be safer to have an additional module, possibly third-party, for Windows async file operations, until the problem is solved on Linux. Then it could safely be implemented in the core Python code.


io_uring is evolving rapidly, which makes implementing it in the Python stdlib challenging. This rapid evolution also increases the number of 0-day security vulnerabilities.

The consensus is to implement it as a third-party library on PyPI.


The reasons I think it shouldn’t be implemented as a separate package on PyPI:

  • The interfaces should be very similar to those in io, with a few methods replaced by coroutine methods.
  • The implementations would be very similar to those already in io, so io could be refactored to support reading from an asynchronous source.

It would be easy enough on Linux to offer an asynchronous API on top of synchronous I/O (like aiofiles). So at least the interface would be ready, we could code away, and if a native Linux asynchronous API became usable, the implementation could be made more efficient.
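For instance, a minimal wrapper along these lines (the class and method names are invented for illustration) would pin the interface down now, while leaving the thread pool as the initial implementation:

import asyncio

class AsyncFileWrapper:
    # Hypothetical wrapper exposing the proposed coroutine methods on top
    # of an ordinary synchronous file object, via a worker thread.
    def __init__(self, sync_file):
        self._f = sync_file

    async def async_read(self, size=-1):
        return await asyncio.to_thread(self._f.read, size)

    async def async_write(self, data):
        return await asyncio.to_thread(self._f.write, data)

A native io_uring- or overlapped-I/O-backed class could later replace the thread-pool body without changing the interface.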

Here is a simple, possibly naïve, approach that might work, with minor changes to the io module:

Add async methods:
In io:

Add to IOBase these coroutine methods:
async_readable, async_readline, async_readlines, async_tell, async_writable, and async_writelines.

Add to RawIOBase:
async_readinto, async_write.

Add to BufferedIOBase: async_read, async_read1, async_write.

Add to TextIOBase: async_read, async_readline, async_write.

In the base classes, the stub for the blocking methods would be to run the coroutine method to completion. All the concrete subclasses in the stdlib presumably replace the stub methods anyway.
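A rough sketch of that stub arrangement (illustrative names only; note that asyncio.run only works when no event loop is already running in the current thread):

import asyncio

class AsyncIOBaseSketch:
    async def async_read(self, size=-1):
        raise NotImplementedError  # concrete subclasses override this

    def read(self, size=-1):
        # Blocking stub: drive the coroutine method to completion.
        return asyncio.run(self.async_read(size))

The benefits of this approach: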

  1. Any future file-like objects that are asynchronous could be used with code expecting synchronous file-like objects, provided there is an event loop.
  2. We would have a fairly standard API for developing asynchronous file-like objects. Then, as suggested, asynchronous file I/O can be implemented outside of the standard library.
  3. A little refactoring in io would allow us to reuse the existing code for buffering etc. to develop asynchronous file I/O using features available in the operating system, such as built-in asynchronous I/O, or just another thread/thread pool for blocking I/O (like aiofiles).

Would this approach make sense?

There are many libraries on PyPI to access files asynchronously.

Which one should be implemented in the Python standard library?

None of them. Let’s just get some standard interfaces in io and some reusable code, so it becomes dead easy to implement asynchronous versions of the various io implementations, and so we can pass them around as file-like objects.

These coroutines can be implemented within the asyncio module, where both high-level and low-level API designs are already defined.

However, these coroutines should yield better performance than the io module; that would justify their presence in the asyncio package. In other words, they should not be slower than the blocking approach.

For example, aiofiles is consistently slower than io on my computer with an SSD. See the benchmark:

Benchmark
import asyncio
import time
import timeit

import aiofiles


async def aio():
    t = timeit.default_timer()
    # Write 1 GiB in a single call via the thread-pool-backed aiofiles.
    async with aiofiles.open('aio.test', mode='wb') as f:
        await f.write(b'0' * 1024 * 1024 * 1024)

    t = timeit.default_timer() - t
    print('aio: ', t)

    t = timeit.default_timer()
    # Write 1 GiB as 1024 chunks of 1 MiB each.
    async with aiofiles.open('aio.test', mode='wb') as f:
        for i in range(1024):
            await f.write(b'0' * 1024 * 1024)

    t = timeit.default_timer() - t
    print('aio small chunks: ', t)


def io():
    t = timeit.default_timer()
    # Same workloads with plain blocking I/O, for comparison.
    with open('io.test', mode='wb') as f:
        f.write(b'0' * 1024 * 1024 * 1024)

    t = timeit.default_timer() - t
    print('io: ', t)

    t = timeit.default_timer()
    with open('io.test', mode='wb') as f:
        for i in range(1024):
            f.write(b'0' * 1024 * 1024)

    t = timeit.default_timer() - t
    print('io small chunks: ', t)


async def run():
    io()

    # Pause between runs; a blocking sleep is fine here since nothing
    # else is scheduled on the event loop.
    time.sleep(1.0)

    await aio()


asyncio.run(run())

Given the complexity of these libraries, their maintenance burden is too high. Furthermore, the performance gain from using these operating systems’ async I/O facilities might not be noticeable in Python, as CPython could become the bottleneck.

aiofiles is a meaningless performance comparison, as it adds a pile of overhead emulating asynchronous I/O with workers in thread pools.

What I am calling for is an asynchronous interface on every file-like object which, if not overridden, simply calls the synchronous one. I can write my code as if it were asynchronous, and if that code receives a file object that actually does async I/O, then the performance will be better.
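Concretely, the default could be as small as this (an illustrative sketch, not an existing io API):

class IOBaseSketch:
    async def async_read(self, size=-1):
        # Default: no native asynchronous I/O available, so simply call
        # the synchronous method; subclasses with real async I/O override it.
        return self.read(size)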