I’ve now run a script on the top 1000 PyPI packages to extract some information about the usage of the various methods. The script can be found here: Calculate I/O attribute usage in Top 1000 Python packages · GitHub
The script looks at function arguments annotated with IO, BinaryIO, or TextIO and records which attributes and methods are accessed on them. Notable omissions: it doesn’t look at instance attributes, and it can’t determine how often __iter__ was used. It also doesn’t look at usage of existing protocols from _typeshed or of private protocols, which a few libraries have started to use.
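To give a rough idea of the approach, here is a minimal sketch of this kind of analysis using the standard ast module. It is not the actual script: the helper names (annotation_name, io_attribute_usage) are made up for illustration, and it ignores string annotations, unions, and name shadowing in nested functions.

```python
import ast
from collections import Counter

# Annotation names treated as "I/O" (assumption: matched by bare name only).
IO_ANNOTATIONS = {"IO", "BinaryIO", "TextIO"}


def annotation_name(node):
    """Return the bare name of an annotation such as IO, typing.IO, or IO[str]."""
    if isinstance(node, ast.Subscript):  # IO[str] -> IO
        node = node.value
    if isinstance(node, ast.Attribute):  # typing.IO -> IO
        return node.attr
    if isinstance(node, ast.Name):
        return node.id
    return None  # unions, string annotations, etc. are skipped


def io_attribute_usage(source):
    """Count attribute accesses on I/O-annotated function arguments."""
    counts = Counter()
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        args = func.args.posonlyargs + func.args.args + func.args.kwonlyargs
        io_args = {
            a.arg
            for a in args
            if a.annotation is not None
            and annotation_name(a.annotation) in IO_ANNOTATIONS
        }
        # Look for `arg.attr` where `arg` is one of the annotated arguments.
        for node in ast.walk(func):
            if (
                isinstance(node, ast.Attribute)
                and isinstance(node.value, ast.Name)
                and node.value.id in io_args
            ):
                counts[node.attr] += 1
    return counts


example = '''
from typing import IO, TextIO

def copy(src: IO[bytes], dst: IO[bytes]) -> None:
    dst.write(src.read())
    dst.flush()

def first_line(f: TextIO) -> str:
    return f.readline()
'''

usage = io_attribute_usage(example)
print(usage)
```

Because only explicit attribute access is visible in the syntax tree, a loop such as `for line in f:` leaves no trace here, which is why __iter__ usage can’t be counted this way.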
Here are the results:
Packages examined: 908
Files examined: 147480
I/O vars: 17125
Total occurrences: Counter({'write': 379, 'read': 136, 'seek': 61, 'decode': 56, 'readline': 39, 'input': 38, 'tell': 33, 'name': 31, 'flush': 28, 'write_line': 18, 'fileno': 17, 'isatty': 14, 'close': 12, 'write_error': 12, 'write_error_line': 9, 'encode': 8, 'set_verbosity': 8, 'error_output': 7, 'is_debug': 6, 'buffer': 5, 'writelines': 4, 'supports_utf8': 4, 'mode': 4, 'getvalue': 4, 'seekable': 3, 'set_input': 3, 'output': 3, '__parameters__': 3, 'stdout': 2, 'encoding': 2, 'readlines': 2, 'decorated': 2, 'interactive': 2, 'is_interactive': 2, 'read_bytes': 2, 'getbuffer': 2, 'closed': 1, 'original_stdout': 1, 'readable': 1, 'is_verbose': 1, 'read_line': 1, 'is_very_verbose': 1, 'wait': 1, 'remoteaddress': 1, 'execmodel': 1, 'numpy': 1, 'close_intelligently': 1, '_mode': 1, 'isascii': 1, 'items': 1})
Unique occurrences: Counter({'write': 132, 'read': 94, 'decode': 56, 'seek': 35, 'readline': 24, 'flush': 24, 'tell': 23, 'name': 19, 'fileno': 14, 'isatty': 12, 'write_line': 12, 'close': 10, 'encode': 8, 'input': 8, 'error_output': 6, 'write_error_line': 5, 'write_error': 5, 'writelines': 4, 'is_debug': 4, 'mode': 4, 'seekable': 3, 'output': 3, 'supports_utf8': 3, '__parameters__': 3, 'buffer': 2, 'encoding': 2, 'set_input': 2, 'readlines': 2, 'is_interactive': 2, 'read_bytes': 2, 'getvalue': 2, 'getbuffer': 2, 'closed': 1, 'stdout': 1, 'original_stdout': 1, 'readable': 1, 'is_verbose': 1, 'interactive': 1, 'set_verbosity': 1, 'decorated': 1, 'read_line': 1, 'is_very_verbose': 1, 'remoteaddress': 1, 'wait': 1, 'execmodel': 1, 'numpy': 1, 'close_intelligently': 1, '_mode': 1, 'isascii': 1, 'items': 1})
The unique count deduplicates accesses within the same function, so it’s the more interesting statistic. decode and encode are most likely false positives from cases where TextIO | str or some similar annotation was used.
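To make the difference between the two counters concrete, here is a small sketch with made-up per-function access lists (hypothetical data, not taken from the actual run):

```python
from collections import Counter

# Hypothetical attribute accesses, one inner list per analyzed function.
accesses_per_function = [
    ["write", "write", "flush"],
    ["read", "seek", "read"],
    ["write"],
]

total = Counter()
unique = Counter()
for accesses in accesses_per_function:
    total.update(accesses)        # every access counts
    unique.update(set(accesses))  # each attribute counts once per function

print(total)
print(unique)
```

A function that calls write() ten times in a row inflates the total count but contributes only one to the unique count, which is closer to "how many functions need this method".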
After these findings, I’m happy with the protocols I originally suggested. I assume that a fair number of readers will use __iter__, which isn’t reflected here. The inclusion of readline() is arguable, but it complements __iter__() and is a natural fit.
One option would be to split Reader into Reader (which includes only read()) and LineReader (which includes __iter__() and readline()).
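As a rough sketch of what such a split could look like (the exact signatures, the type variable, and the helper functions are my assumptions for illustration, not a finished proposal):

```python
import io
from typing import Iterator, Protocol, TypeVar

T_co = TypeVar("T_co", covariant=True)


class Reader(Protocol[T_co]):
    """Anything with a read() method."""

    def read(self, size: int = -1, /) -> T_co: ...


class LineReader(Protocol[T_co]):
    """Anything that supports line-wise access."""

    def __iter__(self) -> Iterator[T_co]: ...
    def readline(self, size: int = -1, /) -> T_co: ...


def slurp(f: Reader[bytes]) -> bytes:
    # Only read() is required of the argument.
    return f.read()


def count_lines(f: LineReader[str]) -> int:
    # Only iteration and readline() are required of the argument.
    return sum(1 for _ in f)


class OnlyRead:
    """Hypothetical minimal class that satisfies Reader[bytes] but not LineReader."""

    def read(self, size: int = -1, /) -> bytes:
        return b"data"


print(slurp(OnlyRead()))
print(count_lines(io.StringIO("a\nb\n")))
```

With this split, a function like slurp() can accept objects that implement nothing but read(), while functions that iterate over lines ask for LineReader instead.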
In the future, a protocol like Seeker or Seekable could be interesting, although it would probably be more useful if we had protocol composition.
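For illustration, this is roughly what it looks like today without composition: every combination needs its own named protocol. The members of Seekable and the names SeekableReader and read_tail are assumptions I made up for this sketch.

```python
import io
from typing import Protocol


class Reader(Protocol):
    def read(self, size: int = -1, /) -> bytes: ...


class Seekable(Protocol):
    # Assumed members; a real Seekable protocol might look different.
    def seek(self, offset: int, whence: int = 0, /) -> int: ...
    def tell(self) -> int: ...


# Without intersection types, the combination has to be spelled out by hand.
class SeekableReader(Reader, Seekable, Protocol):
    pass


def read_tail(f: SeekableReader, n: int) -> bytes:
    """Return (up to) the last n bytes of f."""
    end = f.seek(0, io.SEEK_END)
    f.seek(max(end - n, 0))
    return f.read()


print(read_tail(io.BytesIO(b"hello world"), 5))
```

With some form of protocol composition, read_tail() could presumably annotate its argument as "Reader and Seekable" directly instead of requiring the extra SeekableReader class.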