Ever since gh-120754: Speed up open().read() by reducing the number of system calls, I have been trying to reduce Buffered I/O overhead. The I/O stack is really effective in a lot of cases, as is evident from the minor evolution needed since PEP 3116 in 2007. Over the past decade, I/O devices have improved substantially in latency and throughput, while CPython has developed async, free-threading, and multiple interpreters. With those improvements I think there is good reason, from both the systems side and the Python side, to look at reducing overheads in Python I/O. In particular, for the “Buffered I/O” layer I’ve been studying:
~15% overhead reading small files. Code in CPython optimizes around this by using FileIO and/or TextIOWrapper directly instead of open() (ex. .pyc writing). The same pattern shows up in the broader community, where open(buffering=0) is somewhat common but can lead to partial writes (FileIO doesn’t retry partial writes; other open() invocations guarantee write-all or an exception; see the sketch after this list).
seek system calls when no seeking is used or needed (ex. read-only stdin, write-only stdout, read/write a single buffer, etc)
Locking and critical sections: buffered I/O uses a single shared buffer and tracks the absolute position inside the file (every call currently must touch both).
Reducing copies of buffer data.
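As a small illustration of the partial-write point in the first bullet (a sketch; the file name is arbitrary), FileIO.write() reports however many bytes the single underlying write() accepted, while the buffered writer returned by plain open() retries until everything is written or an exception is raised:

```python
import io

payload = b"x" * (1024 * 1024)

raw = io.FileIO("out.bin", "w")
n = raw.write(payload)       # may be < len(payload) (e.g. pipes, some devices);
raw.close()                  # the caller is responsible for retrying the rest

buffered = open("out.bin", "wb")   # BufferedWriter over FileIO
buffered.write(payload)            # buffered; flush/close retries partial writes,
buffered.close()                   # so callers get write-all or an exception
```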
I think reworking the “Buffered I/O” layer (BufferedReader, BufferedWriter, BufferedRandom, …) can fairly significantly improve performance and set up longer-term I/O performance projects (ex. utilizing newer system I/O APIs, batching operations, …). I’ve brought this up offline with a couple of groups as well as prototyping locally. At a high level, I think there are significant improvements available:
Remove the absolute position tracking (and contention coming from it)
Operate on a list of buffers as core primitive
a. When possible, just refer to user-provided buffer rather than copying (ex. bytes)
b. Use writev to write multiple buffers simultaneously (a minimal sketch follows this list)
Reduce/remove critical sections and locking in at least read-only and write-only cases (stdin, stdout, stderr)
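For reference, a minimal sketch of the gather-write primitive mentioned above (POSIX only): os.writev() submits several buffers with a single system call instead of copying them into one intermediate buffer first.

```python
import os

fd = os.open("out.bin", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
buffers = [b"header\n", b"body of the record\n", b"trailer\n"]
# One system call for all three buffers. Like write(), the result may still be
# a partial write, so real code has to check `written` and retry the remainder.
written = os.writev(fd, buffers)
os.close(fd)
```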
I have the next several months free and am hoping to focus directly on this project. I definitely need core developer help figuring out the right tradeoffs (ex. handling seek, flush, buffer protocol / bytes-like objects, …), drafting a PEP for actually shipping it, finding the right benchmarks to measure performance, the right tools to validate behavior, etc.
I am attending PyCon US and hope to discuss this in person there as well as online here.
I want to take a moment to thank those who have helped review my Issues, PRs, and ideas so far. I’ve found engaging with Python development really rewarding, and I hope I can keep contributing back to a language I’ve had the fortune to use for over 15 years.
I think a good start would be to expose those capabilities at the raw IO level, e.g. FileIO. It is also a less risky endeavour than introducing additional complexity in the buffered IO layer.
You seem to be hinting that it’s important to provide multi-threaded writes of disjoint buffers. Is that a common use case?
Your 3 examples here are non-seekable streams that are not capable of explicitly-positioned IO. How do you plan to remove locking while keeping correct behavior with them?
Having a leaner and faster I/O subsystem would be great! Let’s just pay extra attention to not sacrificing simplicity or safety for speed. For example, the situation you mention where passing buffering=0 to open may lead to partial writes: that is surprising and probably unexpected.
I’m not saying that we should not allow particular ways of using the I/O subsystem, like “let me do this, I don’t want buffering because whatever, I’ll take care of some trickiness because of that”, but the high-level interface should always be simple and correct.
You mean in the case where the write length is bigger than the kernel will accept?
Usually when buffering=0 is used, the user wants each write to be written atomically (as defined by the kernel) so it does not get split.
I also think it is fairly independent in implementation from the BufferedIO changes I’m proposing. The BufferedIO changes I’m proposing should reduce overhead in existing code without any changes, and with additional projects can yield even bigger improvements (ex. batched I/O).
The main reason for including these in the proposal is to try to validate that the designs, engineering, and investment in this project work well with those projects. I prototyped using a context manager to hold an io_uring ring, which served as a “deferred” I/O error collection point plus asynchrony. There are some cool improvements possible with that (ex. dispatching reads of lots of small files in parallel from a single interpreter/thread), but it is also a lot of new code and primitives that I’m not sure are the right ones. It likely needs to be an experimental third-party module first.
My goal is that open().write(), open().read(), and print() should be faster, even in just single-threaded code. For print() today, each object passed in is serialized to a string then written (usually with a memcpy) into the BufferedIO buffer (ignoring -u, tty vs. non-tty stdout, and WindowsConsoleIO for brevity here). Based on my measurements, reducing the copies (particularly of 1KB-100KB objects) into the shared buffer should notably increase performance.
Ideally the model also scales well for multi-threaded code, but that is a want for me rather than a requirement currently. With free-threading I think it’s likely that code adopting multi-threading will have print() (or logging) occurring in multiple distinct OS threads simultaneously, and I’m hoping this work improves the scalability of that / helps make all the threads faster. A lot of library or more optimized code I’ve encountered does things like a logger per thread / process plus a logging aggregator to keep things fast. I’m hoping to make that less necessary.
One of the questions I’m hoping for help answering: to what extent, if I/O is multi-threaded and unordered, can Python rely on the system kernel to order it? Does the I/O stack need to synchronize it (which BufferedIO somewhat does implicitly today)? What are acceptable orderings there?
My current thought is to use tarfile and zipimport / zipapp as validators for this, to help measure correctness and ensure performance is maintained or improved. A couple of ideas I’ve looked at but don’t have strong opinions on yet (I’m definitely looking for more ideas):
seek working as a deferred / buffered operation (just like a read or write); getting the position sums the buffered operation state to calculate the needed result (rough sketch below).
Treat position-changing seek as a “flushing” operation (all buffered changes to that point need to be sent to OS).
Note that for both of these, if the user seeks / moves the underlying file around underneath the BufferedIO without explicitly flushing first, behavior would change. I think that is a reasonable behavior change to make? (The existing buffered implementation has a lot of edge cases around this itself.)
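A rough sketch of the first idea (hypothetical names, write-only, error handling and partial-write retries omitted): seeks are queued alongside pending writes and only turned into system calls at flush time, rather than being issued eagerly.

```python
import io

class PendingSeek:
    def __init__(self, offset, whence=io.SEEK_SET):
        self.offset = offset
        self.whence = whence

class DeferredSeekWriter:
    """Hypothetical sketch, not the proposed implementation."""

    def __init__(self, raw):
        self.raw = raw        # e.g. an io.FileIO opened for writing
        self.pending = []     # interleaved bytes objects and PendingSeek markers

    def write(self, data):
        self.pending.append(bytes(data))
        return len(data)

    def seek(self, offset, whence=io.SEEK_SET):
        # No system call yet: the seek is just another buffered operation;
        # tell() would be computed from the pending state instead of the OS.
        self.pending.append(PendingSeek(offset, whence))

    def flush(self):
        for op in self.pending:
            if isinstance(op, PendingSeek):
                self.raw.seek(op.offset, op.whence)
            else:
                self.raw.write(op)
        self.pending.clear()
```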
My personal thought / idea for resolving this: users can manually construct a FileIO if they want, and it works exactly as specified and documented; a special-purpose tool. Changing FileIO is possible but not worth the risk. I want to change open(buffering=0) to actually return a Buffered{Reader,Writer} with a buffer size of 0, which would mean “dispatch I/O immediately on receiving data” (don’t buffer in Python) while keeping the no-partial-writes-without-an-exception behavior. All open() invocations would then guarantee write-all or exception. I think that is a reasonable feature change path. It definitely changes the return type and could break some code, but I think it’s worth it for the improved semantics, with FileIO still existing for people who require the previous behavior.
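A minimal sketch (not CPython code) of the write-all behavior a size-0 buffer would keep, in contrast to a bare FileIO.write():

```python
import errno
import io

def write_all(raw: io.FileIO, data) -> int:
    """Keep issuing writes until data is fully written or an error is raised."""
    view = memoryview(data)
    total = 0
    while len(view):
        n = raw.write(view)
        if n is None:
            # Non-blocking raw object with nothing writable right now.
            raise BlockingIOError(errno.EAGAIN, "write could not complete", total)
        total += n
        view = view[n:]       # retry whatever the kernel did not accept
    return total
```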
The Python I/O code at the underlying call level (_Py_write_impl) doesn’t actually do this today, as some kernels have limits on how big a write() to specific device types is allowed to be…
For the open() builtin, buffering= just lets you choose between line buffering, fixed-size in-Python buffering, or 0, which is a special value meaning (currently) that the “Raw I/O” object (typically FileIO) should be returned directly without a buffering wrapper. Note that reading text via open (ex. open("README.rst", "rt")) requires buffering today; disabling it is explicitly disallowed in that case.
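A quick illustration of that mapping (file names are arbitrary):

```python
import io

raw = open("data1.bin", "wb", buffering=0)   # io.FileIO: no Python-level buffer
buffered = open("data2.bin", "wb")           # io.BufferedWriter with a fixed-size buffer
print(type(raw), type(buffered))
raw.close()
buffered.close()

# Text mode cannot opt out of buffering today:
# open("README.rst", "rt", buffering=0) raises ValueError.
```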
I’m not sure what the question means, but ultimately Python file objects are an abstraction over kernel-provided IO facilities. So it makes sense that what Python can do is bound to kernel-exposed semantics.
The problem here is more to ensure that Python file objects don’t have race conditions or inconsistencies of their own.
Buffered IO merely synchronizes its own state (which strikes me as necessary, unless you come up with an extremely clever lockless implementation). It might also synchronize raw IO as a side effect, but that’s not part of the semantics.
What’s dictated by the current semantics, though, is the reliance of all reads and writes on an implicit “file position” that’s also updated by those reads and writes, and that has to be carefully handled when several threads issue IO operations on the same file object. It might be a good idea to add “read at” and “write at” primitives that ignore the internal file position.
(beware that ReadFile on Windows is annoying in that regard: even if you give it an absolute position to read from, it will still update the file’s current position… is there a modern alternative to that @steve.dower ?)
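For reference, CPython already wraps the POSIX positioned calls as os.pread() / os.pwrite(); they take an absolute offset and leave the file position alone (POSIX only, not available on Windows):

```python
import os

fd = os.open("data.bin", os.O_RDWR | os.O_CREAT, 0o644)
os.pwrite(fd, b"hello world", 0)   # write 11 bytes at offset 0; position unchanged
chunk = os.pread(fd, 5, 6)         # read 5 bytes starting at offset 6 -> b"world"
os.close(fd)
```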
At what points buffers are flushed is indeed an implementation detail, so you can freely change it (except when flush is called, of course).
Can you make a separate discussion for this? That’s unrelated to the rest of the topic.
I think you misread it? _Py_write_impl does not retry on partial writes. At most a single successful write call is issued.
Kind of. Using overlapped IO (the OVERLAPPED structure) with a synchronous[1] file handle lets you set the position to start reading from (see this section of the ReadFile docs), though it still updates the file pointer afterwards. But if every reader is doing it this way, then none of them will care about the initial file pointer position.
This seems like a proposal that can pretty easily be put together as a standalone library initially. The current _io module isn’t critically tied into the core runtime[2], and its type hierarchy could be replicated as much or as little as someone liked. Most of the IO hierarchy is transparent to callers, so from modernio import open is likely enough to upgrade any existing code, and I think we’d be incredibly open to merging it into the core runtime after it’s proven itself (provided it’s designed for being merged in - so pure C, etc.).
I agree that people should use either the explicitly-positioned APIs or the implicitly-positioned APIs, but not both on the same file object (and arguably, any multi-threaded reader should only use explicitly-positioned APIs).
If we introduce explicitly-positioned APIs, we’ll have to document that, and also mention that the behavior when mixing both API styles is system-dependent (because POSIX pread doesn’t update the file position).
I mean it more generically (and I totally defer to your good understanding): the general / external / simple interface should be as safe and as surprise-free as possible, and the very particular cases where the user seeks special behaviours should be exposed in ways that make it obvious that “stuff may not behave as we always expected”.
Absolutely, though the IO stack could manage its own current position and still make it transparent for regular users. Users who want to directly mess with file descriptors are already in trouble on Windows (it’s emulated… barely), and those who get as far as a native HANDLE can cope with caveats.
Years ago I dreamed of replacing the IO stack on Windows with a properly native one, rather than trying to wrap up the C runtime’s emulation of POSIX semantics. fileno() is the only real problem, but I daresay it could be made to be slow/limited emulation when called and not before (while the CRT can be made to “open” arbitrary HANDLEs, it makes all sorts of assumptions that make it generally not a good idea).
Currently BufferedIO has a lock and, for every write (and most reads), acquires it exclusively with ENTER_BUFFERED; that means BufferedIO forces an ordering on the I/O calls via that lock.
FileIO much more closely exposes the underlying POSIX semantics, just adapting from Python objects to system calls, and doesn’t do any locking; what happens with multiple simultaneous or independent read() or write() calls (ex. ordering of writes from multiple threads) is left to the underlying system/kernel.
As a specific example, I tend to think that even in -u / PYTHONUNBUFFERED mode developers want a single print() not to be interleaved with any other output, even in the presence of threads with multiple print() calls happening simultaneously, but that is not what the implementation actually guarantees today. In unbuffered mode, each individual write happens immediately. This results in strace python -u -c "print('test', '1')" showing 4 write system calls (on Linux each tends to be fairly atomic / non-interleaved).
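Roughly, the trace looks like this (a reconstruction; exact arguments and return values may vary, surrounding setup calls omitted):

```text
write(1, "test", 4)  = 4
write(1, " ", 1)     = 1
write(1, "1", 1)     = 1
write(1, "\n", 1)    = 1
```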
There I think the ideal for the user would be to have one call that is “atomic” per the POSIX standard (ex. writev), ordered before/after other threads’ output at the newline (line buffering). Some code works around this by pre-building an individual buffer and writing once (example below). I’d like to provide tools to address that more precisely. It comes up in “line buffered” mode vs. fixed-size buffered mode as well (is Python writing to a tty for stdout or not; CPython run as a subprocess vs. at a terminal will result in a different set of system calls today). I think people want the whole line to not be interleaved across threads. I think it’s part of the reason stdout and stderr have quite so many “modes of operation” today (see: sys.stdin docs).
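For example, the usual workaround (plain Python, nothing new here) is to pre-join so only one write reaches the stream:

```python
import sys

values = ("test", "1")
# One pre-composed buffer instead of a write per argument, separator, and newline.
sys.stdout.write(" ".join(values) + "\n")
```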
Python’s current “Buffered I/O” layer doesn’t expose primitives for that. Moving to a list of buffers, I think, can add primitives to help express that at least for in-CPython cases; if those turn out to be nicely simple/general, they can be exposed more broadly: make existing cases faster, resolve issues, and in time enable new features and optimizations.
These are some of the design tradeoffs that I’m hoping to work through with developers to figure out the right options / directions. There are a number of open issues, especially around writing / printing during interpreter/thread shutdown as well as “read + write” on buffered I/O objects, that I think can be closed, or at least significantly improved, with some thought.
Note I’m explicitly not looking to expose “write at” or “read at” here; those would definitely be new additional features. I think that’s interesting (particularly pwritev with RWF_ATOMIC), but it’s a very differently scoped project. I want to focus on making existing code and cases faster and lower overhead; in particular, the ~15% measured overhead in BufferedIO writing a single pre-composed buffer to a single file. Similar overheads exist when reading files in fixed-size chunks (readall bypasses that somewhat, but can still have moderate buffered overhead).
If I remove the BufferedIO lock, that may change behavior. I think we can keep compatible behavior with an atomically-appended-to list of operations / buffers / bytes objects: when the aggregate buffered size passes a threshold (ex. DEFAULT_BUFFER_SIZE), do a single writev, which I think will mirror current behavior sufficiently (rough sketch below). That is really hard to measure without testing against lots of cases. Fortunately, there’s lots of usage in the CPython community and the test suite to validate all the edge cases :).
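A rough sketch of that idea (hypothetical names, POSIX-only because of os.writev, partial-write handling elided):

```python
import io
import os
import threading

class GatherWriter:
    """Hypothetical sketch, not the proposed implementation."""

    def __init__(self, fd, threshold=io.DEFAULT_BUFFER_SIZE):
        self.fd = fd
        self.threshold = threshold
        self._lock = threading.Lock()   # could become an atomic list append
        self._pending = []              # list of bytes objects awaiting writev
        self._size = 0

    def write(self, data):
        data = bytes(data)
        with self._lock:
            self._pending.append(data)
            self._size += len(data)
            if self._size >= self.threshold:
                self._flush_locked()
        return len(data)

    def flush(self):
        with self._lock:
            self._flush_locked()

    def _flush_locked(self):
        if self._pending:
            # Real code must handle partial writes from writev(); omitted here.
            os.writev(self.fd, self._pending)
            self._pending.clear()
            self._size = 0
```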
In particular, from my studying of the APIs, the Buffered locks around “absolute position tracking” and the single fixed-size in-memory buffer aren’t required to meet the requirements of BufferedReader/BufferedWriter/BufferedRandom today. That is definitely how the current implementation works, but I think the special common cases (ex. write-only and read-only), the regular cases (ex. write this bytes/buffer to a regular file / pathlib.Path.write_text), and the complex cases (ex. lots of seeking with small reads/writes à la tarfile or zipapp) can all be implemented with less overhead, without those two pieces of state explicitly tracked, by using a list of operations/buffers and calculating the position only when needed / requested.
I’d like to work with CPython developers to prove that theory out with code. If it doesn’t prove true, then there will be lessons learned and I will delete the experimental work / keep the current code as is. I hope the gh-120754 / FileIO changes helped demonstrate building piece by piece toward something definitely better/faster/simpler, listening to feedback both from people and from tools / measurement, to improve existing code without breaking critical guarantees.
It’s not implementable without reworking buffered I/O to allow a size-zero buffer, which is part of what I want to add in this work; to me this work is a prerequisite. Agreed it’s out of scope for the initial work though; I linked to the existing discussion, which didn’t seem to get enough traction to actually motivate changing it independently.
Agreed it doesn’t retry partial writes. It’s more that it may itself split any write / buffer passed to it, which means code using it has no guarantee the buffer will be written “atomically” in any form, especially across platforms. On all platforms, no write that goes over _PY_WRITE_MAX will ever be passed through in a single call. On some platforms and for some “Raw I/O” devices that split size is smaller.
I looked at this, but because Buffered I/O lives in the middle of the I/O stack it is particularly difficult to do. The Python test suite tries to do this with _pyio and _io across the existing io tests, but often what ends up being tested isn’t what was intended (ex. test cases that call open directly instead of self.open, which dispatches to the two implementations).
I definitely think a compile-time or, ideally, run-time way to swap between the two implementations would make validating new vs. old a lot easier. _io is used in early startup, and the behavior / performance there, particularly within importlib and zipimport, matters a lot to me to validate and not regress. I don’t know a way to test/validate that without being in the cpython repository.
The order is undefined as far as users are concerned, so that doesn’t matter. The goal of the lock is not to force a particular (undefined) order, but to ensure that the internal state of the Buffered object remains consistent.
I think so, but I’m not sure that’s relevant to the “reworking buffered I/O” discussion. Up until now, we were implicitly talking about binary I/O, not text I/O.
Well, I’m not sure which kind of multi-threaded IO you’re hoping to do without such primitives? Unless you’re talking about very simplistic file formats, you want to know at which file position your IO calls will operate, and that’s not possible if you issue implicitly-positioned IO from multiple threads.
I don’t think it’s worth discussing such details until there’s an implementation that shows it to be technically possible (and reasonably debuggable/maintainable as well).
Such a guarantee is impossible to make due to the semantics of the corresponding system calls, so what is the problem exactly? The only guarantee you get is that the partial write that is actually done is atomic.
Agreed it is behavior users should not be relying on. Unfortunately, when a system maintainer upgrades the system Python and the behavior of a tool they rely upon changes, they often report it against CPython. To me, acknowledging and planning for that is important in order to make the upgrade path as smooth as possible for the community, developers investigating, and triagers handling reports.
I can build an equivalent demo with sys.stdin.buffer / the BufferedWriter, avoiding the TextIOWrapper, just not as short a one-liner. From my reading of _io_TextIOWrapper_write_impl and the perf profiles I’ve looked at of writing text to files, there isn’t any locking in that layer; the per-write locking in user space is entirely in BufferedWriter at the moment. With Python in UTF-8 mode the Text I/O work is a really small part of the profile, which is really nice.
The big one I think users will run into with free-threading is multi-core scalability of writing to sys.stdout. A single print() in a thread could have a significant performance impact. Currently all threads printing to sys.stdout are serialized in user space by BufferedWriter’s lock. From my understanding and experience POSIX compatible kernels handle multi-threaded contention writing to a single file descriptor and often do so with significant optimizations. I’d like to get Python closer to the speed of the underlying kernel here; reduce the Python BufferedWriter overhead. To reiterate though: my main focus is improving single thread cases, multi-thread gains are a nice to have.
As a concrete example, .pyc file reading in a single thread currently does this set of system calls in main.
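Roughly (a reconstruction; exact names, sizes, flags, and return values vary):

```text
openat(AT_FDCWD, "__pycache__/mod.cpython-313.pyc", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=4321, ...}) = 0
lseek(3, 0, SEEK_CUR)         = 0
read(3, "..."..., 4322)       = 4321
read(3, "", 1)                = 0
close(3)                      = 0
```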
There Python, in the io.open_code hook, opened the .pyc file for binary buffered reading. That created a BufferedReader, which calls seek to get the absolute position, allocates an internal buffer, and creates a lock. BufferedReader.read() is then called without a size, which defers to _bufferedreader_read_all and, in this case, since the BufferedReader buffer is empty, defers entirely to FileIO.readall. FileIO.readall uses the fstat result from its initial open to allocate a bytes object of file size + 1, uses one read system call to fill it, and a second to validate it’s actually at the end. readall guarantees handling of cases where the file size changes between open() and the readall call, so it always reads until a read system call with a non-zero requested length returns 0 bytes.
In that particular case the BufferedReader internal buffer, position tracking, and locking are all unnecessary overhead introduced by “Buffered I/O”. They happen because of how BufferedReader is structured today. By reworking “Buffered I/O” I want to reduce or preferably eliminate that excess work.
In my measurements and experiments, many cases can be improved by reworking Buffered I/O. That includes ones with many small reads, a mix of reads + seeks, a single big write, many small writes, and a mix of writes + seeks. The overheads of the current structures were small compared to older kernel and I/O device speeds/overheads; currently they’re a significant portion of the runtime.
It’s a bit chicken-and-egg here. I’m trying to build consensus that it’s worth core developer time and effort to review this work if written, learn what features are important to people far more experienced and familiar with CPython than me, find mentor(s), and get buy-in for the work it would take to ship such changes in a release. If the only thing needed to make progress is prototype code, I’m happy to sit down and write that.
Prototype code is certainly progress. And there’s a set of people who will be convinced by working code (and another by the performance of that code), as well as people who will be convinced by a clear architecture, nicer type hierarchy, or other less concrete aspects of the design. Doing a bit of all of them is the best way to build a consensus, as they’ll all convince someone, and that gets more people convinced.
That said, working, production-ready code is the hardest to argue against. If you can build a library that code can opt into and it’s good enough to run benchmarks on, you’ve already got the best chance of it being taken seriously by the most people. There are a few people around (I won’t name them - they’re reading already I’m sure) who might take something standalone and integrate it into massive apps at huge companies and report back. That sort of thing is convincing in a way that logical arguments without code are not.
The key point is not to benchmark and criticise what’s already there, but to create something that could be there and benchmark that. If it has to be done in place, then you’ll be building it in a fork of CPython for the time being. If you can do it as a library, it’s much much much easier for people to test and get involved in. It’s incredibly rare that any significant contribution will get merged without changes anyway, so the process of integrating a library is going to be about the same as integrating work that’s been done in-tree.[1]