Change `open().write()` to guarantee all bytes will be written or an exception will be thrown

I’ve been looking a bit at what open() does under different circumstances, and found what are, to me, surprisingly different contracts for its behavior:

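# Buffered (default) open(): f is an io.BufferedWriter / io.BufferedReader,
# which retry partial writes and reads internally.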
with open("testfile.txt", mode='wb') as f:
    print(f"{type(f) = }")
    count = f.write(b"test")

with open("testfile.txt", mode='rb') as f:
    print(f"{type(f) = }")
    buffer = f.read()

assert len(buffer) == count, "Should have read and written everything"
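
# Unbuffered open(buffering=0): f is a raw io.FileIO, whose write() may
# return after writing only part of the data.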
with open("testfile.txt", mode='wb', buffering=0) as f:
    print(f"{type(f) = }")
    count = f.write(b"test")

with open("testfile.txt", mode='rb', buffering=0) as f:
    print(f"{type(f) = }")
    buffer = f.read()

assert len(buffer) == count, \
        "Usually reads and writes everything, but not guaranteed"

This differs from what a lot of code seems to expect:

  1. .write(): in both the CPython codebase and open source code I’ve looked at, callers expect a write() to either write everything or raise an exception. CPython code and quite a few projects use open(buffering=0) or _io.FileIO directly with a single call and seem to expect that behavior[1]. For read() this is relatively safe because both read-all and fixed-size reads do fulfill that contract, and most of the time write() does as well. A partial write only occurs in certain cases, and CPython already retries a number of them; it comes up when the user hits system limits (ex. getrlimit on Linux[2]). A sketch of the single-call pattern next to a checked write follows this list.
  2. .read(): the docs don’t mention the PEP 475 retries, and the readall case isn’t currently noted in the code comments.
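To make that concrete, here is a minimal sketch (hypothetical file name, not taken from any particular project) of the common single-call pattern next to a version that checks the return value of an unbuffered write():

data = b"spam" * 1024

# Common pattern: assumes a single write() writes everything. With
# buffering=0 this is a raw FileIO.write(), which may legally return
# after writing only part of the data (e.g. under the system limits
# mentioned above).
with open("testfile.bin", mode="wb", buffering=0) as f:
    f.write(data)  # return value silently discarded

# Checked pattern: keep writing the remainder until everything is out.
# (A non-blocking raw file could also return None from write(), which
# would need extra handling; not shown here.)
with open("testfile.bin", mode="wb", buffering=0) as f:
    view = memoryview(data)
    while view:
        written = f.write(view)
        view = view[written:]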

This difference in behavior is what was originally specified by PEP 3116: “Raw I/O” generally isn’t supposed to make more than one system call, while Buffered I/O will loop. With PEP 475 a lot of cases were updated to retry for better default behavior.

I’m curious if it would be better to leave the behavior as is or if there are ways we could improve the standard I/O patterns.

I’ve been pondering a couple of possible tweaks:

  1. Update the read() code comments and docs to note that it retries per PEP 475, and document the readall behavior (read() without a size keeps reading until EOF, as indicated by an underlying read returning 0 bytes).
  2. Add a documentation warning to FileIO and its write() method that users must check the returned size to guarantee everything has been written.
  3. Make BufferedIO usable efficiently with a zero-sized buffer (pass-through), and update open(buffering=0) to always use BufferedIO so code gets the “read and write are batteries included” behavior it expects when using open() (getting similar behavior today is sketched after this list).
  4. Potentially update FileIO.write() to retry partial writes. That would likely make code that relies on non-blocking I/O slower (behavior would go from a partial write returning the size written to an exception being raised).
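For context on tweak 3: code can already get the “writes everything or raises” behavior today by stacking a BufferedWriter on top of the raw file itself, which is roughly what open() does when buffering is left at its default. A minimal sketch (file name is just an example):

import io

raw = io.FileIO("testfile.bin", "wb")
# BufferedWriter.write() loops over partial raw writes, so by the time
# flush()/close() has returned, either every byte reached the file or an
# exception was raised.
with io.BufferedWriter(raw) as f:
    f.write(b"test")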

I’m curious about other thoughts on how to make this safer, since it seems to be a commonly used pattern in the Python ecosystem that contains a subtle bug.


  1. a. Code Search buffering=0 language:Python
    b. gh-126606
  2. write(2) - Linux manual page

    Note that a successful write() may transfer fewer than count
    bytes. Such partial writes can occur for various reasons; for
    example, because there was insufficient space on the disk device
    to write all of the requested bytes, or because a blocked write()
    to a socket, pipe, or similar was interrupted by a signal handler
    after it had transferred some, but before it had transferred all
    of the requested bytes. In the event of a partial write, the
    caller can make another write() call to transfer the remaining
    bytes. The subsequent call will either transfer further bytes or
    may result in an error (e.g., if the disk is now full).

1 Like

That is already documented.

1 Like

My thought is adding a warning / more prominent attention box that the size must be checked (and possibly a linter / code style rule). CPython’s importlib had this issue, and quite a few packages like numpy and tensorflow contain cases of it. It’s a usage pattern that seems to be common but has bugs (it would be nicer, to me, to change the behavior, but that has a lot of things it could break). gevent made a “WriteIsWriteallMixin” to change the behavior. Numpy allows io.FileIO but doesn’t check the write size (numpy/numpy/lib/format.py at main · numpy/numpy · GitHub). So it feels like the information isn’t getting to end users or was lost somewhere along the way, and its visibility needs to be raised.
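As an illustration of the kind of fix gevent applies (a hand-written sketch, not gevent’s actual mixin), a small subclass can restore the “all bytes or an exception” contract on top of FileIO:

import io

class WriteallFileIO(io.FileIO):
    # Hypothetical helper for illustration only; assumes a blocking file
    # (a non-blocking raw write() can return None, which isn't handled).
    def write(self, data):
        view = memoryview(data)
        total = 0
        while view:
            written = super().write(view)
            total += written
            view = view[written:]
        return total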

1 Like

Adding a hint in the open() documentation would make it more visible, as the io module directs users straight to open().

The easiest way to create a text stream is with open(), optionally specifying an encoding:

f = open("myfile.txt", "r", encoding="utf-8")

Not that users are likely to read the open() documentation either, but that hint would still be helpful for seasoned users.

1 Like

I’d like to add a brief note that there is a difference between writing bytes (where partial writes are possible) and writing text strings.

Are you saying partial writes aren’t possible if a file is opened in text mode? That would be great. Is it documented somewhere? I’ll admit that all of the io abstractions make it hard to find things like this in the docs.

1 Like

This is how I understand io.TextIOBase.write:

write(s, /)

Write the string s to the stream and return the number of characters written.

and it makes sense, because, contrary to writing bytes in chunks, a partial write (e.g. one byte of a 3-byte UTF-8 character) would make it very hard to write the next chunk.
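A tiny illustration of that: splitting the UTF-8 encoding of a single character across writes leaves bytes behind that don’t decode on their own:

encoded = "€".encode("utf-8")   # 3 bytes: b'\xe2\x82\xac'
encoded[:1].decode("utf-8")     # UnicodeDecodeError: unexpected end of data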

2 Likes

I was referring to open() because its documentation covers the buffering parameter. A link to RawIOBase.write() would be useful there to explain unbuffered I/O.

Unbuffered Text I/O is not supported:

open('/dev/random', buffering=0)
# ValueError: can't have unbuffered text I/O
1 Like

Yes, TextIO is largely safe: all text ends up written or an exception is raised (and I think that’s where the user expectation comes from).


The BufferedWriter code does retry partial writes (so it will either write all the data or raise an exception), and TextIO generally requires BufferedIO / gets it by default. When TextIOWrapper is used directly around a FileIO (which some code does), it contains a write loop for its various write paths as well, so it should largely be safe (although that could likely use a more comprehensive audit; some cases don’t seem to check the buffered write size…).

Splitting individual Unicode characters / partial writes of UTF-8 encoded characters can and does happen; it has had a couple of bugs, largely reported against WindowsConsoleIO to date (gh-110913, gh-82052). There are some additional code paths I think have latent bugs around that (ex. _Py_write in fileutils caps at _PY_WRITE_MAX and doesn’t watch for multi-byte character boundaries), but they usually don’t affect things which read immediately, and only occur when writing particularly large chunks of data to a TTY.
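For reference, this is the shape of the “TextIOWrapper directly around a FileIO” construction mentioned above (as opposed to the usual FileIO -> BufferedWriter -> TextIOWrapper stack that open() builds); the file name is just an example:

import io

raw = io.FileIO("testfile.txt", "wb")
# No BufferedWriter in between: the wrapper's own write loops are all that
# stand between the caller and partial raw writes.
with io.TextIOWrapper(raw, encoding="utf-8") as f:
    f.write("test")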