Simple `Reader` and `Writer` protocols

storchaka · August 29, 2024, 6:55am

The problem with “file-like” protocols is that everyone has different expectations of them (there is a similar problem with “bytes-like”). There are text and binary streams (well, this can be solved by generics), but there is also non-blocking IO, which needs special handling (and very little code supports it). Some code only needs read() (with or without an argument?), other code needs readline() or iteration. Such code may work with a class that supports only a single method. There is also code that requires readinto(), read1(), peek() or readexactly(); it needs richer protocols. If we consider all combinations of methods, the number of protocols grows exponentially with the number of methods.

codecs.StreamReader has a read() method with a different signature: read(self, size=-1, chars=-1, firstline=False). Is it compatible with this protocol? Oh, and there are asynchronous streams: how does the runtime protocol check work with them?
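To make the runtime-check question concrete, here is a minimal sketch. The `Reader` protocol and the `AsyncSource` class below are hypothetical stand-ins written for this illustration, not the definitions from the proposal. It shows that `isinstance()` against a `runtime_checkable` protocol only tests that a `read` attribute exists; it does not compare signatures or notice that a method is a coroutine.

```python
import codecs
import io
from typing import Protocol, TypeVar, runtime_checkable

T_co = TypeVar("T_co", covariant=True)


@runtime_checkable
class Reader(Protocol[T_co]):
    """Hypothetical minimal protocol with a single read() method."""

    def read(self, size: int = ..., /) -> T_co: ...


# codecs.StreamReader.read() takes (size, chars, firstline).
utf8_reader = codecs.getreader("utf-8")(io.BytesIO(b"hello"))


class AsyncSource:
    """Hypothetical asynchronous stream: read() returns a coroutine."""

    async def read(self, size: int = -1) -> bytes:
        return b""


# The runtime check only looks for an attribute named "read".
print(isinstance(utf8_reader, Reader))    # True
print(isinstance(AsyncSource(), Reader))  # True, although read() must be awaited
print(isinstance(object(), Reader))       # False
```

Whether the different signatures are acceptable is a question for a static type checker; the runtime check cannot answer it.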

srittau · August 30, 2024, 3:05pm

I’ve now run a script on the top 1000 PyPI packages to extract some information about the usage of the various methods. The script can be found here: Calculate I/O attribute usage in Top 1000 Python packages · GitHub

The script looks at function arguments annotated with IO, BinaryIO, or TextIO and records which attributes/methods are accessed on them. Notable omissions: it doesn’t look at instance attributes, and it can’t determine how often __iter__ was used. It also doesn’t look at usage of existing protocols from _typeshed or private protocols, which a few libraries have started to use.
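The kind of analysis described above can be sketched with the standard `ast` module. This is not the linked script, only a simplified illustration; the helper names `_is_io_annotation` and `count_io_attributes` are made up here, and the sketch counts raw attribute accesses without the per-function deduplication or package handling a real survey would need.

```python
from __future__ import annotations

import ast
from collections import Counter

IO_NAMES = {"IO", "BinaryIO", "TextIO"}


def _is_io_annotation(node: ast.expr | None) -> bool:
    """True for annotations such as IO, TextIO, typing.IO, or IO[str]."""
    if isinstance(node, ast.Subscript):   # IO[str] -> IO
        node = node.value
    if isinstance(node, ast.Attribute):   # typing.IO -> IO
        return node.attr in IO_NAMES
    return isinstance(node, ast.Name) and node.id in IO_NAMES


def count_io_attributes(source: str) -> Counter[str]:
    """Count attribute accesses on parameters annotated with an IO type."""
    counts: Counter[str] = Counter()
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        args = func.args
        params = {
            a.arg
            for a in (*args.posonlyargs, *args.args, *args.kwonlyargs)
            if _is_io_annotation(a.annotation)
        }
        # Simplification: nested functions are walked again on their own.
        for node in ast.walk(func):
            if (
                isinstance(node, ast.Attribute)
                and isinstance(node.value, ast.Name)
                and node.value.id in params
            ):
                counts[node.attr] += 1
    return counts


sample = '''
from typing import IO, TextIO

def copy(src: IO[bytes], dst: IO[bytes]) -> None:
    dst.write(src.read())
    dst.flush()

def log(out: TextIO, msg: str) -> None:
    out.write(msg)
'''
print(count_io_attributes(sample))
```

Like the survey, this sketch cannot see iteration (`for line in stream`), because no attribute is accessed by name.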

Here are the results:

Packages examined: 908
Files examined: 147480
I/O vars: 17125
Total occurences: Counter({'write': 379, 'read': 136, 'seek': 61, 'decode': 56, 'readline': 39, 'input': 38, 'tell': 33, 'name': 31, 'flush': 28, 'write_line': 18, 'fileno': 17, 'isatty': 14, 'close': 12, 'write_error': 12, 'write_error_line': 9, 'encode': 8, 'set_verbosity': 8, 'error_output': 7, 'is_debug': 6, 'buffer': 5, 'writelines': 4, 'supports_utf8': 4, 'mode': 4, 'getvalue': 4, 'seekable': 3, 'set_input': 3, 'output': 3, '__parameters__': 3, 'stdout': 2, 'encoding': 2, 'readlines': 2, 'decorated': 2, 'interactive': 2, 'is_interactive': 2, 'read_bytes': 2, 'getbuffer': 2, 'closed': 1, 'original_stdout': 1, 'readable': 1, 'is_verbose': 1, 'read_line': 1, 'is_very_verbose': 1, 'wait': 1, 'remoteaddress': 1, 'execmodel': 1, 'numpy': 1, 'close_intelligently': 1, '_mode': 1, 'isascii': 1, 'items': 1})
Unique occurences: Counter({'write': 132, 'read': 94, 'decode': 56, 'seek': 35, 'readline': 24, 'flush': 24, 'tell': 23, 'name': 19, 'fileno': 14, 'isatty': 12, 'write_line': 12, 'close': 10, 'encode': 8, 'input': 8, 'error_output': 6, 'write_error_line': 5, 'write_error': 5, 'writelines': 4, 'is_debug': 4, 'mode': 4, 'seekable': 3, 'output': 3, 'supports_utf8': 3, '__parameters__': 3, 'buffer': 2, 'encoding': 2, 'set_input': 2, 'readlines': 2, 'is_interactive': 2, 'read_bytes': 2, 'getvalue': 2, 'getbuffer': 2, 'closed': 1, 'stdout': 1, 'original_stdout': 1, 'readable': 1, 'is_verbose': 1, 'interactive': 1, 'set_verbosity': 1, 'decorated': 1, 'read_line': 1, 'is_very_verbose': 1, 'remoteaddress': 1, 'wait': 1, 'execmodel': 1, 'numpy': 1, 'close_intelligently': 1, '_mode': 1, 'isascii': 1, 'items': 1})

Unique occurrences deduplicates accesses within the same function, so it’s the more interesting statistic. decode and encode are most likely false positives from cases where TextIO | str or some similar annotation was used.

After these findings, I’m happy with the protocols I originally suggested. I assume that a fair number of readers will use __iter__, which isn’t reflected here. The inclusion of readline() is arguable, but it complements __iter__() and is a natural fit.

One option would be to split Reader into Reader (which includes only read()) and LineReader (which includes __iter__ and readline()).
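As a rough sketch of what such a split could look like (the exact signatures here are assumptions for illustration, not the proposal's definitions, and `OnlyRead` is a hypothetical class):

```python
import io
from typing import Iterator, Protocol, TypeVar

T_co = TypeVar("T_co", covariant=True)


class Reader(Protocol[T_co]):
    """Assumed minimal form: only read()."""

    def read(self, size: int = ..., /) -> T_co: ...


class LineReader(Protocol[T_co]):
    """Assumed line-oriented form: readline() plus iteration."""

    def readline(self, size: int = ..., /) -> T_co: ...
    def __iter__(self) -> Iterator[T_co]: ...


def slurp(stream: Reader[str]) -> str:
    # Needs nothing beyond read().
    return stream.read()


def count_lines(stream: LineReader[str]) -> int:
    # Needs iteration, but never calls read().
    return sum(1 for _ in stream)


class OnlyRead:
    """Hypothetical class that implements read() and nothing else."""

    def __init__(self, data: str) -> None:
        self._data = data
        self._pos = 0

    def read(self, size: int = -1) -> str:
        if size < 0:
            size = len(self._data) - self._pos
        chunk = self._data[self._pos : self._pos + size]
        self._pos += len(chunk)
        return chunk


print(slurp(OnlyRead("a\nb\n")))
print(count_lines(io.StringIO("a\nb\n")))
```

With this split, `OnlyRead` satisfies `Reader` without having to provide line handling, while regular file objects satisfy both protocols.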

In the future, a protocol like Seeker or Seekable could be interesting, although it would probably be more interesting if we had protocol composition.

tusharc · August 30, 2024, 4:31pm

I also support this proposal. These make communicating intent—“just give me something I can write to”—easier. I think covering the common cases (particularly in light of empirical data from PyPI, which I appreciate!) is a net benefit.

srittau · December 5, 2024, 7:03pm

bluetech · December 5, 2024, 8:47pm

Having Reader and Writer would be great!

Including readline (and iteration) in Reader has some notable downsides:

Once you require read, line-reading is necessarily derived functionality, and hence is redundant in a fundamental protocol like Reader. By which I mean, it can be implemented on top of read outside of the protocol.
Unless I’m mistaken, including readline requires the implementation to either buffer, support peek (also a buffer), or very slowly read one character at a time.
Since the protocol is using arbitrary T, it is not clear what a “line” means. (BTW, what is the reason T is not bound to str/bytes?)
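The first two points can be illustrated with a small sketch: line reading layered on top of an object that only offers read(). The `SupportsRead` protocol and `BufferedLines` class are hypothetical names for this example, and the code assumes a blocking bytes stream where an empty result means EOF.

```python
import io
from typing import Protocol


class SupportsRead(Protocol):
    """Assumed minimal bytes reader for this sketch."""

    def read(self, size: int = ..., /) -> bytes: ...


class BufferedLines:
    """Derives readline() from read() by keeping an internal buffer."""

    def __init__(self, raw: SupportsRead, chunk_size: int = 8192) -> None:
        self._raw = raw
        self._chunk_size = chunk_size
        self._buf = b""

    def readline(self) -> bytes:
        while True:
            idx = self._buf.find(b"\n")
            if idx >= 0:
                # A complete line is buffered; keep the remainder for later.
                line, self._buf = self._buf[: idx + 1], self._buf[idx + 1 :]
                return line
            chunk = self._raw.read(self._chunk_size)
            if not chunk:
                # EOF: return whatever is left (possibly empty).
                line, self._buf = self._buf, b""
                return line
            self._buf += chunk


lines = BufferedLines(io.BytesIO(b"one\ntwo\nthree"), chunk_size=4)
print(lines.readline())  # b'one\n'
print(lines.readline())  # b'two\n'
print(lines.readline())  # b'three'
print(lines.readline())  # b''
```

The buffer is what makes this efficient; without it (or without peek()), the wrapper would have to call read(1) repeatedly, which is the slow path mentioned above.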

As prior work, Go has the io.Reader interface, which only has Read of bytes, and in my experience is regarded as good and successful (including the other io interfaces). Although in Python I would not go so far as to omit str support.

I’m referring to Go rather than other languages, because its interfaces are similar to Python Protocols in that they don’t support default/provided methods.

I think, if the aim is to add more io protocols in the future, like seeking, closing, reading/writing at offsets, readinto, then the best course of action is to either design the entire hierarchy up front, or to “play it safe” and keep the protocols minimal, such that they would definitely fit in future expansions.

I think the protocol should specify some “laws” in addition to the type signature. Most importantly, the EOF behavior, and short reads/writes.
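To show why such laws matter to callers, here is a sketch of a helper that is only correct under two assumed rules, which are this example's assumptions and not part of the proposal: read(n) may return fewer than n bytes (a short read), and it returns an empty result only at EOF. `SupportsRead`, `read_exactly` and `ShortReader` are hypothetical names.

```python
from typing import Protocol


class SupportsRead(Protocol):
    """Assumed minimal bytes reader for this sketch."""

    def read(self, size: int = ..., /) -> bytes: ...


def read_exactly(stream: SupportsRead, n: int) -> bytes:
    """Read exactly n bytes, looping over short reads.

    Relies on the assumed law that read() returns b"" only at EOF.
    """
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError(f"expected {n} bytes, got {n - remaining}")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


class ShortReader:
    """Hypothetical stream that returns at most two bytes per call."""

    def __init__(self, data: bytes) -> None:
        self._data = data

    def read(self, size: int = -1) -> bytes:
        limit = 2 if size < 0 else min(size, 2)
        chunk, self._data = self._data[:limit], self._data[limit:]
        return chunk


print(read_exactly(ShortReader(b"abcdef"), 5))  # b'abcde'
```

If a non-blocking stream were allowed to return an empty result (or None) to mean "no data yet", the EOF test in this loop would be wrong, which is why the protocol would need to state the behavior explicitly.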

Monarch · December 5, 2024, 9:50pm

-1 on readline and __iter__ in such a fundamental Protocol. Reader should only have the read method. @bluetech already raised some great points. read() and write() seem to be used overwhelmingly more than every other method, so I don’t think users should be forced to implement the much less used methods just to satisfy these new protocols. Personally, I’ve rarely ever needed anything more than read() and write() but when I did need something more it usually involved several of the more obscure methods without a clear “winner”.

mikeshardmind · December 5, 2024, 10:44pm

I can’t see using the protocols as PR’d here in either personal or professional projects. These define more in some ways and less in others than I rely on.

I agree largely with using protocols for this kind of thing, as it behaves much better than ABCs, but I disagree with making it runtime checkable as there are enough issues with runtime checkable protocols not checking everything needed as it is.

To be clear, I’m not really opposed to this existing as is, but as a data point, I can’t find any places in my personal or professional code where switching to this would be appropriate. If others find it useful, fine: my objections are more philosophical, in that I don’t think this accurately captures what it means to be a reader or writer, and those who find that it does describe their expectations should be able to use it without issue.

srittau · December 6, 2024, 12:45pm

Ironically, I’m with you here – at least in the case of Reader. I will continue to use either custom protocols or the finer protocols from _typeshed. Long term I hope some of the finer-grained protocols from _typeshed will make it into the standard library, but I think this has to wait until we get easy protocol composition (def foo(stream: SupportsRead & SupportsSeek)). But there is a lot of code out there that uses typing.IO (or its subclasses) for convenience, which is problematic for the reasons outlined above. And this proposal is mostly intended to provide an easy-to-use alternative for IO for the common case – for users that don’t want to use the more cumbersome alternatives.
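Until an inline composition syntax exists, the usual workaround is to declare a combined protocol by hand, which is presumably part of what makes the fine-grained approach cumbersome. A minimal sketch follows; the protocols are defined locally with names mirroring those in _typeshed, because _typeshed is stub-only and cannot be imported at runtime, and `SupportsReadSeek` and `read_header_twice` are hypothetical names.

```python
import io
from typing import Protocol, TypeVar

T_co = TypeVar("T_co", covariant=True)


class SupportsRead(Protocol[T_co]):
    def read(self, length: int = ..., /) -> T_co: ...


class SupportsSeek(Protocol):
    def seek(self, offset: int, whence: int = ..., /) -> int: ...


class SupportsReadSeek(SupportsRead[T_co], SupportsSeek, Protocol[T_co]):
    """Hand-written composition of the two protocols above."""


def read_header_twice(
    stream: SupportsReadSeek[bytes], size: int
) -> tuple[bytes, bytes]:
    # Uses both capabilities: read, rewind, read again.
    first = stream.read(size)
    stream.seek(0)
    return first, stream.read(size)


print(read_header_twice(io.BytesIO(b"abcdef"), 3))  # (b'abc', b'abc')
```

Each combination of capabilities needs its own named class this way, which is the combinatorial growth raised at the start of the thread.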