Simple `Reader` and `Writer` protocols

srittau · August 21, 2024, 12:20pm

As we all know, I/O is a particularly difficult part to type, especially safely. When the typing module was introduced, the IO, TextIO, and BinaryIO classes were supposed to represent “file-like” objects. Unfortunately, the introduction of these classes precedes protocols, so they are concrete classes that need to be sub-classed for other classes to be considered compatible with them. They are also fairly broad and many I/O classes don’t fully implement the required protocol, sometimes leading to unsafe calls.

In typeshed we have defined a few fairly tight protocols to alleviate these problems and we aim to use these tight protocols if possible. I also encourage library authors to do the same. Still, they are sometimes a bit clunky to use (especially since we don’t have a convenient method to compose protocols in type annotations) and they have a discoverability problem.
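To make the composition point concrete, here is a minimal sketch. The protocol names `SupportsRead`, `SupportsSeek`, and `SupportsReadSeek` and the helper `read_header` are local stand-ins written for this illustration; they are not typeshed's actual definitions. Since there is no inline way to say "has both `read` and `seek`" in an annotation, a third named protocol has to be declared first:

```python
import io
from typing import Protocol


class SupportsRead(Protocol):
    def read(self, n: int = ..., /) -> bytes: ...


class SupportsSeek(Protocol):
    def seek(self, offset: int, whence: int = ..., /) -> int: ...


# Combining the two tight protocols requires a new named class.
class SupportsReadSeek(SupportsRead, SupportsSeek, Protocol):
    pass


def read_header(f: SupportsReadSeek, size: int = 4) -> bytes:
    """Rewind the stream and return its first `size` bytes."""
    f.seek(0)
    return f.read(size)


header = read_header(io.BytesIO(b"\x89PNG\r\n"))
```

This works, but every new combination of capabilities needs its own class, which is part of why such protocols feel clunky in practice.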

Therefore I suggest we add two fairly simple, fairly tight protocols to the typing module that will probably be good enough for 90% of use cases (number entirely made up) where IO, BinaryIO, and TextIO are currently used: a reader and a writer class. Something along the lines of this:

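from collections.abc import Iterable
from typing import Protocol, runtime_checkable

# PEP 695 syntax (Python 3.12+): AnyStr below is a fresh, unconstrained
# type parameter, not typing.AnyStr.
@runtime_checkable
class Reader[AnyStr](Iterable[AnyStr], Protocol):
    def read(self, n: int = ..., /) -> AnyStr: ...
    def readline(self) -> AnyStr: ...

@runtime_checkable
class Writer[AnyStr](Protocol):
    def write(self, s: AnyStr, /) -> int: ...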

This is not a final proposal, and we’d need to put a bit more research into which methods are most used in practice, but it should give an idea. This splits the tasks of reading and writing (since consumers of file-like objects will usually do either but not both), and leaves out the more esoteric features like file seeking and physical file management, including closing files. These are still available through IO and its sub-classes or more specific protocols.
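As a rough illustration of how such protocols might be consumed, the sketch below re-declares them with classic `TypeVar` syntax so it runs on Python versions before 3.12. The unbound type variables, the explicit `__iter__` member (in place of an `Iterable` base), and the names `ListWriter` and `copy_upper` are assumptions made for this example, not part of the proposal:

```python
import io
from typing import Iterator, Protocol, TypeVar, runtime_checkable

T_co = TypeVar("T_co", covariant=True)
T_contra = TypeVar("T_contra", contravariant=True)


@runtime_checkable
class Reader(Protocol[T_co]):
    def __iter__(self) -> Iterator[T_co]: ...
    def read(self, n: int = ..., /) -> T_co: ...
    def readline(self) -> T_co: ...


@runtime_checkable
class Writer(Protocol[T_contra]):
    def write(self, s: T_contra, /) -> int: ...


class ListWriter:
    """Hypothetical duck-typed sink: no I/O base class, just a write method."""

    def __init__(self) -> None:
        self.chunks: list[str] = []

    def write(self, s: str, /) -> int:
        self.chunks.append(s)
        return len(s)


def copy_upper(src: Reader[str], dst: Writer[str]) -> int:
    """Copy every line from src to dst in upper case; return characters written."""
    total = 0
    for line in src:
        total += dst.write(line.upper())
    return total


sink = ListWriter()
written = copy_upper(io.StringIO("a\nbc\n"), sink)
```

The function only asks for what it uses: anything iterable with `read` and `readline` on one side, anything with `write` on the other. Neither `io.StringIO` nor `ListWriter` has to inherit from the protocols.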

I think this could reduce a lot of the problematic uses of IO etc. and would be a big step forward for the safe, easy-to-use typing of I/O in Python.

srittau · August 21, 2024, 12:22pm

The Writer class should probably not use AnyStr, but a custom type var bound to something like str | Buffer or even be unbound. (Same for Reader, really.)

rchen152 · August 22, 2024, 12:52am

I like this idea. These seem like they would be much easier to use and understand than the existing IO base classes.

NeilGirdhar · August 22, 2024, 2:35am

Isn’t that a benefit? It gives the writer a guarantee that their subclass is an instance of the superclass they want (otherwise they have to assert on the type), it makes isinstance checking explicit rather than implicit, and probably faster. And if the base classes ever change, then the subclasses will reflect those changes.

Can you illustrate some benefits of Reader/Writer protocols rather than Reader/Writer base classes?

srittau · August 22, 2024, 2:58am

I’m not sure I understand the question. There currently are no Reader/Writer base classes, only the very broad IO classes. But using base classes instead of protocols would mean that every existing class supporting I/O would need to be changed to include these (abstract) base classes. I don’t see that happening and don’t see any benefit over using protocols. Also, isinstance checks do not mix well with duck typing, which is very prevalent when it comes to I/O in Python.

NeilGirdhar · August 22, 2024, 3:11am

Right, that’s what I’m asking. I’m still confused as to why that would be? Couldn’t you make IO derive from these, or were you planning on removing those from IO?

In the ABC world, isinstance always works, which is one reason I prefer that pattern.

srittau · August 22, 2024, 7:56am

Of course IO could derive from those if we used ABCs instead of protocols, but I can’t find any good reason in favor of ABCs here, only disadvantages: Every I/O class (even third-party and legacy ones) would have to derive from those ABCs (and that not only takes years, it’s also unlikely to happen comprehensively), it adds extra base classes to the MRO, and it would require extra imports. The same arguments basically apply to any protocol.

isinstance also works with protocols. (I’ve sneak edited the required decorators into my example above.)
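A small sketch of what the decorator changes at runtime. The protocol names `PlainWriter` and `CheckableWriter` are made up for this example; the behavior shown is that of `typing.runtime_checkable` as documented:

```python
import io
from typing import Protocol, runtime_checkable


class PlainWriter(Protocol):
    def write(self, s: str, /) -> int: ...


@runtime_checkable
class CheckableWriter(Protocol):
    def write(self, s: str, /) -> int: ...


buf = io.StringIO()

# With the decorator, isinstance performs a structural check.
checkable_ok = isinstance(buf, CheckableWriter)

# Without it, isinstance against a protocol raises TypeError.
try:
    isinstance(buf, PlainWriter)
    plain_raises = False
except TypeError:
    plain_raises = True
```

Note that the runtime check only looks for the presence of the named members; it does not compare signatures.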

NeilGirdhar · August 22, 2024, 12:31pm

Don’t they already derive from IO?

Should we itemize the benefits of ABCs and compare them to protocols?

Benefits of ABC:

isinstance(A, B) always returns true if and only if A implements B whereas for protocols, it can be fooled by methods that don’t quite match the interface just because they have the same name. These false positives are going to be pretty common when you have common method names like read and write.
isinstance(A, B) will always return true even if A or B’s definition changes whereas for protocols changing either A or B can silently break isinstance.
If A’s definition changes in a way that’s incompatible with the interface B, then in the ABC case, the type checker will complain whereas with a protocol the type checker stays silent.
Having an MRO is an explicit declaration of intent: It says, I expect my class to implement a given interface.
It makes isinstance(A, B) faster since it will simply scan the MRO rather than verifying that all of the methods are in the interface.
Type checkers process isinstance/issubclass constraints faster with MROs than with protocols (AFAIK).

Benefits of protocols:

It requires old inheritors who want to inherit from Reader or Writer, but not IO to explicitly mention the ABC. Are there many such classes?
It requires new inheritors who want to inherit from Reader or Writer, but not IO to explicitly mention the ABC. (Generally, explicit is better than implicit, so I personally don’t find this convincing.)
The MRO gets larger. (I’m not sure why you think that’s such a big deal though?)
Did I miss any benefits?

I think in general protocols are really useful when you can’t add the base class, for example, when:

the concrete classes are basic types like int that don’t admit new base classes, or
when the amount of existing code that would have to change is large enough that adding ABCs is too onerous.

If the latter is true in this case, then I agree with you that protocols make sense. Is that the case? Could you name a few Readers/Writers that don’t derive from IO?

Daverball · August 22, 2024, 1:30pm

The IO classes in the standard library are not all the IO-like classes there are. There’s many frameworks that implement their own duck-typed IO classes that don’t inherit from IO and likely never will for the sake of simplicity.

A few come to mind, but just to name one, so that one has been named: webob.response.ResponseBodyFile

NeilGirdhar · August 22, 2024, 1:32pm

Interesting! But FYI, the one you linked wouldn’t work with the given protocols since it has no read or write methods.

Daverball · August 22, 2024, 1:34pm

It has a sneaky write instance attribute. The object is basically a proxy object for the Response instance that created it.
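The pattern described here can be sketched as follows. `Response` and `BodyFile` are hypothetical classes written for this illustration and are not webob's actual implementation; the only point is that `write` lives on the instance rather than on the class:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Writer(Protocol):
    def write(self, s: str, /) -> int: ...


class Response:
    """Hypothetical response object that collects body chunks."""

    def __init__(self) -> None:
        self.chunks: list[str] = []

    def write(self, s: str) -> int:
        self.chunks.append(s)
        return len(s)


class BodyFile:
    """Hypothetical proxy: no write method on the class, only on instances."""

    def __init__(self, response: Response) -> None:
        # Forward writes to the response that created this proxy.
        self.write = response.write


response = Response()
body_file = BodyFile(response)
is_writer = isinstance(body_file, Writer)
written = body_file.write("hello")
```

Searching the class body for `def write` finds nothing, yet instances still satisfy a runtime-checkable `Writer` protocol because the check looks at the instance's attributes.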

srittau · August 22, 2024, 1:35pm

Not necessarily. I don’t think any stdlib class does at runtime, although we pretend they do at type checking time. isinstance doesn’t work:

Python 3.12.3 (main, Jul 31 2024, 17:43:48) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from typing import IO
>>> isinstance(open("x", "w"), IO)
False

There are also classes that can’t derive from IO, since they don’t implement the full interface, like io.BufferedIOBase.

I don’t expect many third-party I/O classes to derive from it.

NeilGirdhar · August 22, 2024, 1:36pm

Ah, my mistake! Didn’t see that. I guess type checkers will be okay with that too.

srittau · August 22, 2024, 1:49pm

The same is true for abstract base classes:

>>> from abc import ABCMeta, abstractmethod
>>> class ABC(metaclass=ABCMeta):
...     @abstractmethod
...     def foo(self) -> None:
...         raise NotImplementedError()
... 
>>> class Impl(ABC):
...     def foo(self, x: int) -> None:
...         pass
... 
>>> isinstance(Impl(), ABC)
True

I don’t see how? Can you give an example?

Again, I don’t see how? A type checker will notice the incompatibility when trying to pass the changed object. If you want to notice these changes at definition time, you can always derive a class from the protocol.

Yes, basically every class in the stdlib, and potentially many, many third-party classes.

And we can’t. Because there are potentially thousands of third-party classes out there over which we don’t have any control. Also, if we added IO or the Reader and Writer classes as base classes starting with Python 3.14, we would still have 5+ years of incompatibility to deal with.

NeilGirdhar · August 22, 2024, 2:28pm

I think you’ve misunderstood. The ABC case is returning true because Impl derives from ABC, and returns false if it doesn’t. The protocol case can exhibit false positives when some class happens to have a write method even though that method doesn’t match the interface. This is one way that protocols are inferior to ABCs.
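The false-positive case can be reproduced in a few lines. `ReportExporter` is a hypothetical class invented for this sketch; its `write` method has nothing to do with stream writing, but the runtime check only compares member names:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Writer(Protocol):
    def write(self, s: str, /) -> int: ...


class ReportExporter:
    """Hypothetical class whose write() takes a path and a flag, not a string chunk."""

    def write(self, path: str, overwrite: bool) -> None:
        pass


exporter = ReportExporter()

# The runtime check passes because a member named "write" exists.
looks_like_writer = isinstance(exporter, Writer)
```

A static type checker is expected to reject passing `exporter` where a `Writer` is required, since the signatures are incompatible; the mismatch is only invisible to the runtime `isinstance` check.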

I get literally hundreds of false positives doing rg "def write\(" in my src directory from projects like tensorflow, MyPy, NumPy, cpython, etc.

If you were to add a method to the protocol, various classes that previously would have returned true to issubclass may now return false. Also, if someone modifies a derived class and changes the name of a method slightly, then it will no longer inherit from the protocol. This will not trigger any error at the definition. This is unlike ABCs.
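A minimal sketch of the first scenario, assuming a hypothetical later revision of the protocol that adds `flush`. `WriterV1`, `WriterV2`, and `MemorySink` are names made up for this example:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class WriterV1(Protocol):
    def write(self, s: str, /) -> int: ...


# Hypothetical later revision of the same protocol with one more method.
@runtime_checkable
class WriterV2(Protocol):
    def write(self, s: str, /) -> int: ...
    def flush(self) -> None: ...


class MemorySink:
    """Third-party style class written against the first revision."""

    def __init__(self) -> None:
        self.data = ""

    def write(self, s: str, /) -> int:
        self.data += s
        return len(s)


matches_v1 = issubclass(MemorySink, WriterV1)
matches_v2 = issubclass(MemorySink, WriterV2)
```

Nothing in `MemorySink` changed between the two checks, and no error is raised where it is defined; the class simply stops matching.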

I’m talking about definition time. If you’re going to explicitly inherit from the protocol, then you may as well explicitly inherit from the ABC. The main benefit of protocols is their implicit nature, in my opinion.

The stdlib is easy to repair. If there are many existing third-party classes that don’t already inherit from related stdlib classes (which could easily be modified), then I agree with you that this would be a good use of protocols.

Right, good point.

From my point of view, it’s too bad there’s no way to maneuver it in such a way that in 5 years, we are in a place where Reader and Writer are ABCs rather than protocols.

Jelle · August 22, 2024, 4:03pm

@NeilGirdhar just so I understand better, is your argument that ABCs are always preferable over Protocols? If not, what makes these Reader/Writer classes different from other classes that makes it so Protocols are preferable?

Not necessarily, since many of the relevant classes are implemented in C and cannot easily use multiple inheritance or inherit from a Python-defined ABC or protocol.

NeilGirdhar · August 22, 2024, 4:21pm

For the list of reasons I gave, I prefer ABCs when they’re not too onerous to use. But I concede from Sebastian’s last post that this seems to be a case where it is onerous.

Oh, I didn’t realize that C classes had trouble adding base classes.

Just so we’re clear though, we’re talking about an ABC in the sense of a class with abstract methods. Not necessarily one that inherits from abc.ABC. And this would just be interface inheritance (no member variables, no issues with super(), etc.).

moi90 · August 24, 2024, 9:48am

I find myself introducing such protocols ad-hoc in my code fairly often because of the aforementioned problem of classes that implement reader/writer/whatever protocols without inheriting from any specific base class. It would be cool to have them included in Python.

randolf-scholz · August 28, 2024, 10:33am

But Protocols are ABCs (type(Protocol) directly inherits from ABCMeta), so what is the actual advantage/difference at definition time, assuming @abstractmethod is added to the methods of the Protocol definition?
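The scenario in this question can be sketched as follows, with `GoodWriter` and `RenamedWriter` as hypothetical classes that inherit explicitly from the protocol. The sketch only shows runtime behavior; what a given type checker reports at definition time is not covered here:

```python
from abc import ABCMeta, abstractmethod
from typing import Protocol


class Writer(Protocol):
    @abstractmethod
    def write(self, s: str, /) -> int: ...


class GoodWriter(Writer):
    def write(self, s: str, /) -> int:
        return len(s)


class RenamedWriter(Writer):
    # The method name drifted, so the abstract write() stays unimplemented.
    def write_text(self, s: str, /) -> int:
        return len(s)


protocol_meta_is_abc = issubclass(type(Protocol), ABCMeta)
good_result = GoodWriter().write("abc")

try:
    RenamedWriter()
    instantiation_failed = False
except TypeError:
    instantiation_failed = True
```

With explicit inheritance and `@abstractmethod`, the protocol behaves like an ordinary ABC at runtime: the class with the renamed method cannot be instantiated.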

NeilGirdhar · August 28, 2024, 11:40am

In my comment that you quoted, I’m saying that the benefit of protocols over ABCs is that you implicitly inherit from them, and you don’t have to explicitly inherit from them. In this comment, I’m using ABC in the computer science meaning (an ordinary class with only abstract methods). I’m contrasting ABCs with protocols.

I listed what I think the benefits of ABCs over protocols are in another comment.