Reworking "Buffered I/O" in CPython

:+1:. I also think it's fairly independent in implementation from the BufferedIO changes I'm proposing. The BufferedIO changes should reduce overhead in existing code without any changes, and with additional projects could get even bigger improvements (ex. batched I/O).

The main reason for including these in the proposal is to try and validate that the design, engineering, and investment in this project work well with those projects. I prototyped using a context manager to hold an io_uring ring which served as a "deferred" I/O error collection point plus a source of asynchrony. There are some cool improvements possible with that (ex. dispatching reads of lots of small files in parallel from a single interpreter/thread), but it's also a lot of new code and primitives that I'm not sure are the right ones. It likely needs to be an experimental third-party module first.
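To give a rough shape of what I mean (not the prototype's actual API; names like `read_ring` are made up, and a thread pool stands in for the io_uring ring here), a minimal sketch of a context manager that dispatches reads and collects errors until the block exits:

```python
# Hypothetical sketch only -- illustrative names, thread pool in place of io_uring.
import contextlib
import pathlib
from concurrent.futures import ThreadPoolExecutor


@contextlib.contextmanager
def read_ring(max_in_flight=64):
    """Dispatch many small reads from one thread; defer errors to block exit."""
    futures = {}  # path -> Future[bytes]
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        def submit(path):
            fut = pool.submit(pathlib.Path(path).read_bytes)
            futures[path] = fut
            return fut

        yield submit

        # The "deferred error collection point": failures surface here, at
        # the end of the with-block, rather than at each submit() call.
        failures = [f.exception() for f in futures.values() if f.exception()]
        if failures:
            raise ExceptionGroup("deferred I/O errors", failures)


# Usage sketch: dispatch reading lots of small files from a single thread.
# with read_ring() as submit:
#     futs = {p: submit(p) for p in ("a.txt", "b.txt", "c.txt")}
```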

My goal is that open().write(), open().read(), and print() should be faster, even in single-threaded code. With print() today, each object passed in is serialized to a string and then written (usually with a memcpy) into the BufferedIO buffer (ignoring -u, tty vs. non-tty stdout, and WindowsConsoleIO for brevity here). Based on my measurements, reducing the copies into the shared buffer (particularly of 1KB-100KB objects) should notably improve performance.
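As a very simplified model of that flow (ignoring sep/end handling, the encoding layer, and error handling):

```python
import sys


def print_model(*objects, sep=" ", end="\n", file=None):
    """Roughly what print() does today -- deliberately simplified."""
    stream = file if file is not None else sys.stdout
    # 1. Each object is serialized to a brand-new str.
    text = sep.join(str(obj) for obj in objects) + end
    # 2. The text layer encodes it, and BufferedWriter then copies the
    #    encoded bytes into its internal buffer (the extra memcpy).
    stream.write(text)
    # 3. The actual OS write() only happens later, when the buffer fills
    #    or is flushed.
```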

Ideally the model also scales well for multi-threaded code, but that is a want for me rather than a requirement currently. With free-threading I think it's likely that code "adopting multi-threading" will call print() (or logging) from multiple distinct OS threads simultaneously, and I'm hoping this work improves the scalability of that / helps make all the threads faster. A lot of library or more optimized code I've encountered does things like a logger per thread / process plus a logging aggregator to keep things fast. I'm hoping to make that less necessary.
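For reference, this is the kind of pattern I mean, using the stdlib's QueueHandler / QueueListener so that only one thread does the actual I/O; it works, but it's exactly the plumbing I'd like to make less necessary:

```python
# "Per-thread producers + single aggregator": worker threads only enqueue
# records; one listener thread does the real I/O.
import logging
import logging.handlers
import queue
import threading

log_queue = queue.Queue()
listener = logging.handlers.QueueListener(
    log_queue, logging.StreamHandler()  # the one place actual I/O happens
)
listener.start()

logger = logging.getLogger("app")
logger.addHandler(logging.handlers.QueueHandler(log_queue))
logger.setLevel(logging.INFO)


def worker(n):
    for i in range(3):
        logger.info("thread %d message %d", n, i)


threads = [threading.Thread(target=worker, args=(n,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
listener.stop()
```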

One of the questions I'm hoping for help answering :slightly_smiling_face:. To some extent: if I/O is multi-threaded and unordered, can Python rely on the kernel to order it? Does the I/O stack need to synchronize it (which BufferedIO somewhat does implicitly today)? What orderings are acceptable there?

My current thought is to use tarfile and zipimport / zipapp as validators for this, to help measure correctness and ensure performance is maintained or improved. A couple of ideas I've looked at but don't have any strong opinions on yet (and am definitely looking for more ideas):

  1. Make seek work as a deferred / buffered operation (just like a read or write). A get-position seek sums the buffered operation state to calculate the needed result (see the sketch after this list).
  2. Treat a position-changing seek as a "flushing" operation (all buffered changes up to that point need to be sent to the OS).

Note that for both of these, if the user seeks / moves around underneath the BufferedIO without explicitly flushing first, behavior would change. I think that is a reasonable behavior change to have? (The existing buffered code has a lot of edge cases around this itself.)
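Here's the sketch referenced above: a hypothetical write-only wrapper (not the real BufferedWriter) showing both ideas, with tell() answered from buffered state and a position-changing seek() flushing first.

```python
# Hypothetical sketch of ideas 1 and 2 above, for a write-only stream.
import os


class DeferredWriter:
    def __init__(self, fd):
        self._fd = fd
        self._pending = bytearray()       # bytes not yet handed to the OS
        self._os_pos = os.lseek(fd, 0, os.SEEK_CUR)

    def write(self, data):
        self._pending += data             # buffered, no syscall yet
        return len(data)

    def tell(self):
        # Idea 1: a get-position seek is answered from buffered state --
        # the OS position plus whatever we still owe it.
        return self._os_pos + len(self._pending)

    def flush(self):
        while self._pending:
            n = os.write(self._fd, self._pending)
            del self._pending[:n]
            self._os_pos += n

    def seek(self, pos, whence=os.SEEK_SET):
        # Idea 2: a position-changing seek is a "flushing" operation --
        # everything buffered so far has to reach the OS first.
        self.flush()
        self._os_pos = os.lseek(self._fd, pos, whence)
        return self._os_pos
```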

This was discussed in another thread, where it was generally agreed the current behavior is unexpected: Change `open().write()` to guarantee all bytes will be written or an exception will be thrown.

My personal thought / idea for resolving this is that users can manually make a FileIO if they want, and it works exactly as specified and documented; a special-purpose tool. Changing FileIO is possible but not worth the risk. I want to change open(buffering=0) to actually return a Buffered{Reader,Writer} with a buffer size of 0, which would mean "dispatch I/O immediately when given data" (don't buffer in Python) with the no-partial-writes-without-an-exception behavior. All open() invocations would then guarantee write-all or exception. I think that is a reasonable feature-change path. It definitely changes the return type and could break some code, but I think it's worth it for the improved semantics, with FileIO still existing for people who require the previous behavior.
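Concretely, the difference looks like this today (the retry loop is what a careful caller has to write themselves against raw FileIO, and what a zero-size BufferedWriter would take care of):

```python
raw = open("out.bin", "wb", buffering=0)
print(type(raw))            # <class '_io.FileIO'> today -- no wrapper at all

data = b"x" * (1 << 20)
# FileIO.write() is a single OS write(): it may report fewer bytes written
# than requested, and the caller is expected to notice and retry.
written = raw.write(data)
while written < len(data):
    written += raw.write(data[written:])
raw.close()

buffered = open("out.bin", "wb")        # default buffering
print(type(buffered))       # <class '_io.BufferedWriter'>
# BufferedWriter.write() already behaves as "all bytes or an exception";
# the proposal is for buffering=0 to return this type (with a 0-byte
# buffer) so every open() gets that guarantee.
buffered.write(data)
buffered.close()
```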

I just found open() of read-write non-seekable streams broken · Issue #64273 · python/cpython · GitHub, which may complicate that plan somewhat…

The Python I/O code at the underlying call level (_Py_write_impl) doesn't actually do this today, as some kernels have limits on how big a write() to specific device types is allowed to be…
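So even at the lowest level, a single "write it all" call isn't something the OS promises; a write-all ends up being a loop along these lines (illustrative Python only; the real code is C and clamps per-call sizes on some platforms):

```python
import os


# Illustrative only: the kind of loop a "write all bytes or raise" layer
# ends up doing, since a single os.write() may be shorter than asked for
# and some kernels cap how large one write() to a given device may be.
def write_all(fd, data, max_chunk=2**30):
    view = memoryview(data)
    while view:
        n = os.write(fd, view[:max_chunk])   # may be a short write
        view = view[n:]
    # Any failure raises OSError, giving "all bytes written or an exception".
```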

For the open() builtin, buffering= just lets you choose between line buffering, fixed-size in-Python buffering, or 0, a special value meaning that (currently) the "Raw I/O" object (typically FileIO) is returned directly without a buffering wrapper. Note that reading text via open (ex. open("README.rst", "rt")) requires buffering today; disabling it is explicitly disallowed in that case.
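For illustration, what that looks like from the caller's side today:

```python
print(type(open("README.rst", "rb", buffering=0)))  # <class '_io.FileIO'> -- raw, no wrapper
print(type(open("README.rst", "rb")))                # <class '_io.BufferedReader'>
print(type(open("README.rst", "rt", buffering=1)))   # <class '_io.TextIOWrapper'> (line buffered)
try:
    open("README.rst", "rt", buffering=0)
except ValueError as exc:
    print(exc)                                        # can't have unbuffered text I/O
```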

(The docs don't currently match the implementation of that… an open issue I really should get back to, though I've been hoping to just make it so open() always has a cleaner behavior: Update FileIO comments, documentation to match implementation · Issue #129011 · python/cpython · GitHub, open() built in's doc does not say what the buffering default of -1 means · Issue #93600 · python/cpython · GitHub)