I’ve been looking a bit at what `open()` does under different circumstances, and found what to me are surprisingly different contracts for behavior:
with open("testfile.txt", mode='wb') as f:
print(f"{type(f) = }")
count = f.write(b"test")
with open("testfile.txt", mode='rb') as f:
print(f"{type(f) = }")
buffer = f.read()
assert len(buffer) == count, "Should have read and written everything"
with open("testfile.txt", mode='wb', buffering=0) as f:
print(f"{type(f) = }")
count = f.write(b"test")
with open("testfile.txt", mode='rb', buffering=0) as f:
print(f"{type(f) = }")
buffer = f.read()
assert len(buffer) == count, \
"Usually reads and writes everything, but not guaranteed"
This differs for `.write()`. In both cases, code in the CPython codebase and open-source code I’ve looked at expects a `write()` to either write everything or throw an exception. CPython code and quite a few projects use `open(buffering=0)` or `_io.FileIO` directly with a single call and seem to expect that behavior[1]. For `read()` this is relatively safe, because a read-all or fixed-size read does fulfill that contract, and most of the time `write()` does as well. The partial write only occurs in some cases, and CPython already retries in a number of them; it comes up when there are system limits (e.g. `getrlimit` on Linux[2]) that the user runs into. The `.read()` docs don’t talk about the PEP-475 retries, and the code comments don’t currently note the readall case.
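To make the raw case concrete, here is a minimal sketch (illustrative only; `write_all` is a hypothetical helper, not anything in the stdlib) of the retry loop a caller needs with `buffering=0` to actually get the “write everything or raise” behavior, which is roughly what the buffered layer does for you when it flushes:

```python
import io

def write_all(raw: io.RawIOBase, data: bytes) -> int:
    """Keep calling raw.write() until every byte of data has been written."""
    view = memoryview(data)
    total = 0
    while total < len(view):
        n = raw.write(view[total:])
        if n is None:
            # Non-blocking raw file with nothing writable right now;
            # real code would wait for writability instead of spinning.
            continue
        total += n
    return total

with open("testfile.txt", mode="wb", buffering=0) as f:
    count = write_all(f, b"test")
```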
This difference in behavior is what was originally specified by PEP-3116: “Raw I/O” generally isn’t supposed to make more than one system call, while Buffered I/O will loop. With PEP-475 a lot of cases were updated to aim for better standard behavior.
I’m curious if it would be better to leave the behavior as is or if there are ways we could improve the standard I/O patterns.
I’ve been pondering a couple of possible tweaks:
- Update `read()` code comments and docs to note that it retries per PEP-475, and include the readall behavior (`read()` without a size retries until it finds EOF, as indicated by a `read()` returning size 0); there’s a small sketch of that loop after this list.
- Add a documentation warning around `FileIO` and its `write` method that users must check the returned size to guarantee everything has been written.
- Make BufferedIO able to be used with a zero-sized buffer (pass-through) efficiently, and update `open(buffering=0)` to always use BufferedIO, so code gets the “read and write are batteries included” behavior it expects when using `open()`.
- Potentially update `write()` in FileIO to retry partial writes. That would likely make code that relies on non-blocking I/O slower (it goes from a partial write returning a size to an exception being thrown).
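For the first item, the readall behavior I mean is roughly the following (an illustrative sketch, not the actual C implementation; `read_all` is a hypothetical helper): keep issuing raw reads until one returns zero bytes, which signals EOF:

```python
import io

def read_all(raw: io.RawIOBase) -> bytes:
    """Read until a raw read() returns b'' (size 0), i.e. EOF."""
    chunks = []
    while True:
        chunk = raw.read(64 * 1024)
        if chunk is None:
            # Non-blocking raw file with no data available right now;
            # real code would wait for readability instead of spinning.
            continue
        if len(chunk) == 0:  # EOF reached
            return b"".join(chunks)
        chunks.append(chunk)
```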
Curious about other thoughts on how to possibly make this safer, since this seems to be a commonly used pattern in the Python ecosystem that contains a subtle bug.
1. a. Code Search: `buffering=0 language:Python`
   b. gh-126606
2. write(2) - Linux manual page:

   > Note that a successful write() may transfer fewer than count bytes. Such partial writes can occur for various reasons; for example, because there was insufficient space on the disk device to write all of the requested bytes, or because a blocked write() to a socket, pipe, or similar was interrupted by a signal handler after it had transferred some, but before it had transferred all of the requested bytes. In the event of a partial write, the caller can make another write() call to transfer the remaining bytes. The subsequent call will either transfer further bytes or may result in an error (e.g., if the disk is now full).