```python
import aiofile as a

await a.print(f"example")
async with a.open("myfile") as f:
    await f.readline()
    ...
```
aiofiles has an API for asynchronous IO to files, but the implementation runs synchronous IO in a thread pool and doesn't really take advantage of asyncio.
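That thread-pool emulation can be sketched in a few lines. Assuming a hypothetical `async_read` helper (not part of any real library), this is roughly the pattern such libraries follow:

```python
import asyncio

async def async_read(path):
    # Roughly what a thread-pool-based library does: hand the blocking
    # read off to the default executor (a thread pool) and await the result.
    loop = asyncio.get_running_loop()

    def blocking_read():
        with open(path, 'rb') as f:
            return f.read()

    return await loop.run_in_executor(None, blocking_read)
```

The event loop isn't blocked while the worker thread reads, but every call still pays thread hand-off overhead, which is why this is emulation rather than true asynchronous IO.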
Well, Win32 has had asynchronous IO (overlapped IO) since about 1992, so that might be the place to start, using the aiofiles technique for operating systems lacking asynchronous IO.
It really does seem a shame not to leverage asynchronous io on operating systems that have it.
It's best if programming languages remain OS agnostic and most functionality is abstracted, except for OS-specific functionality. Take the dbm module as an example.
Users of Unix-based systems might expect and know of the Unix dbm functionality, and it is known that there isn't a Windows alternative. I haven't seen the dbm module used much in the wild or in modern projects, but it is there for backwards compatibility and for Unix systems.
Async IO for files is, however, functionality that users of any OS will expect, for obvious reasons. So if we implemented it only for Windows, we would still need something for when the code runs on Unix. What do you do? Ignore the expectation of async and just provide synchronous behaviour, thus not behaving as expected, or provide an alternative to the OS implementation?
For this reason, I think it would be safer to have an additional module, possibly third-party, for Windows async file operations, until the problem is solved on Linux. Then it could safely be implemented in the core Python code.
It is evolving rapidly, which makes implementing it in the Python stdlib challenging. This rapid evolution also increases the number of 0-day security vulnerabilities.
The consensus is to implement it as a third-party library on PyPI.
It would be easy enough on Linux to offer an asynchronous API on top of synchronous IO (like aiofiles). At least the interface would be ready, we could code away, and if a native Linux asynchronous API became usable, the implementation could be made more efficient.
Add to IOBase these coroutine methods: async_readable, async_readline, async_readlines, async_tell, async_writable, and async_writelines.
Add to RawIOBase: async_readinto, async_write.
Add to BufferedIOBase: async_read, async_read1, async_write.
Add to TextIOBase: async_read, async_readline, async_write.
In the base classes, the stub for the blocking methods would be to await the coroutine method. All the concrete subclasses in the stdlib presumably replace the stub methods anyway.
Any future file like objects that are asynchronous would be able to be used with code expecting synchronous file like objects, provided there is an event loop.
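A minimal sketch of that idea, using a hypothetical `SketchTextIO` class (not the real io API): the coroutine method does the work, and the blocking stub just drives it to completion on an event loop:

```python
import asyncio
import io

class SketchTextIO(io.TextIOBase):
    # Hypothetical sketch: the coroutine method is primary and the
    # blocking read() stub runs it to completion.
    def __init__(self, data):
        self._buffer = io.StringIO(data)

    async def async_read(self, size=-1):
        # A real implementation would await OS-level asynchronous IO
        # here; this stand-in reads from an in-memory buffer.
        return self._buffer.read(size)

    def read(self, size=-1):
        # Blocking stub: drive the coroutine on a fresh event loop
        # (assumes the caller is not already inside a running loop).
        return asyncio.run(self.async_read(size))
```

Synchronous callers see an ordinary `read()`, while async-aware callers can await `async_read()` directly.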
We would then have a fairly standard API for developing asynchronous file-like objects. And, as suggested, asynchronous file IO itself can be implemented outside of the standard library.
A little bit of refactoring in io would allow us to reuse the existing code for buffering etc. to develop asynchronous file IO using features available in the operating system, such as built-in asynchronous IO, or just another thread/thread pool for blocking IO (like aiofiles).
None of them. Let’s just get some standard interfaces in io and some reusable code, so it becomes dead easy to implement asynchronous versions of the various implementations of io, and so we can pass them around as file like objects.
These coroutines can be implemented within the asyncio module, where both high-level and low-level API designs are already defined.
However, these coroutines should yield better performance numbers than the io module. This will justify their presence in the asyncio package. In other words, these should not be more blocking than the blocking approach.
For example, aiofiles is consistently slower than io on my computer with an SSD. See the benchmark:
```python
import asyncio
import time
import timeit

import aiofiles


async def aio():
    t = timeit.default_timer()
    async with aiofiles.open('aio.test', mode='wb') as f:
        await f.write(b'0' * 1024 * 1024 * 1024)
    t = timeit.default_timer() - t
    print('aio: ', t)

    t = timeit.default_timer()
    async with aiofiles.open('aio.test', mode='wb') as f:
        for i in range(1024):
            await f.write(b'0' * 1024 * 1024)
    t = timeit.default_timer() - t
    print('aio small chunks: ', t)


def io():
    t = timeit.default_timer()
    with open('io.test', mode='wb') as f:
        f.write(b'0' * 1024 * 1024 * 1024)
    t = timeit.default_timer() - t
    print('io: ', t)

    t = timeit.default_timer()
    with open('io.test', mode='wb') as f:
        for i in range(1024):
            f.write(b'0' * 1024 * 1024)
    t = timeit.default_timer() - t
    print('io small chunks: ', t)


async def run():
    io()
    time.sleep(1.0)
    await aio()


asyncio.run(run())
```
Given the complexity of these libraries, their maintenance burden is too high. Furthermore, the performance gain from using the operating system's async IO facilities might not be noticeable in Python, as CPython could become the bottleneck in this scenario.
aiofiles is a meaningless performance comparison, as it adds a pile of overhead emulating asynchronous IO with workers in thread pools.
What I am calling for is an asynchronous interface on every file-like object which, if not overridden, simply calls the synchronous one. I can write my code as if it were asynchronous, and if that code receives a file object that actually does async IO, then the performance will be better.
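A minimal sketch of that fallback, using a hypothetical `AsyncFallbackMixin` (not an existing API): the `async_*` coroutines default to calling the synchronous method, so async-style code works unchanged with a plain file object:

```python
import asyncio
import io

class AsyncFallbackMixin:
    # Hypothetical mixin: async_* coroutines that, unless overridden,
    # simply call the synchronous counterpart.
    async def async_read(self, size=-1):
        return self.read(size)

    async def async_write(self, data):
        return self.write(data)

class PlainStringIO(AsyncFallbackMixin, io.StringIO):
    pass

async def shout(f):
    # Code written against the async interface; it would work unchanged
    # with a truly asynchronous file object that overrides async_read.
    text = await f.async_read()
    return text.upper()

print(asyncio.run(shout(PlainStringIO("hello"))))  # HELLO
```

Here the fallback still blocks the event loop, but an implementation that overrides `async_read` with genuine asynchronous IO would be a drop-in replacement for callers like `shout`.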