Removing `Sequence[str]` as base class of `str`

erictraut · June 28, 2024, 7:06pm

The typeshed builtins.pyi stub indicates that the str class derives from Sequence[str]. This is somewhat consistent with runtime behavior in that Sequence is a registered virtual base class for str (and hence isinstance(str, Sequence) evaluates to True), but it’s also somewhat inconsistent because Sequence is not a real base class of str and does not appear in its __mro__.

Many mypy and pyright users have requested that str not be assignable to Sequence[str] or Iterable[str]. This is a frequent source of bugs, and it’s something a static type checker could catch.

Here are some historical threads that discuss this pain point:

As a contributor to pyright, I’m always reluctant to override the type information provided in typeshed with custom logic in the type checker. Doing so has almost always resulted in unintended side effects, surprising behaviors, and maintenance nightmares. I’m even more reluctant to override the subtyping rules documented in the typing spec. I’d much prefer that we collectively decide on a desired behavior and reflect this in the type stubs.

I just did an experiment where I changed the typeshed definition of str to not derive from Sequence[str]. The fallout of this change (as measured by mypy_primer output) was much less than I anticipated. It affected only six repos in mypy_primer (out of many dozens). Across these six repos, there were only about ten code sites that were affected. Approximately half of them appear to be actual bugs. Most of the remaining cases affect only test code. All of them have a straightforward and obvious fix: specify str | Sequence[str] if the intent is to support both types.

Would it make sense to make this change in typeshed? Has this been suggested and tried in the past?

I admit that such a change will have some backward compatibility impact, but I think it will be relatively minor. If we can agree that this is the right thing to do, it might make sense to “rip the bandaid off”.

AlexWaygood · June 28, 2024, 7:16pm

I agree that there are good reasons for experimenting with removing Sequence[str] as a base class of str, but I don’t think this is one of them, unless you’re also proposing to remove (Mutable)Sequence as a base class of bytes, list, tuple, memoryview and bytearray. The vast majority of collections in the Python standard library do not inherit from the abstract base classes in collections.abc or typing at runtime.

beauxq · June 28, 2024, 7:22pm

I would very much like a way to exclude str when I want a collection of str - this has caused me a lot of difficulty trying to make things type safe.

But I think it would not make things better to make the inconsistency between typeshed and runtime behavior.

If typeshed says str isn’t a Sequence then the runtime isinstance(str, Sequence) needs to match - and that doesn’t really seem like a good idea.

beauxq · June 28, 2024, 7:28pm

The useful-types library has SequenceNotStr

But that leaves the problem unsolved for Iterable[str] and other collection types.

mikeshardmind · June 28, 2024, 9:08pm

my 2c on this is that it’s the right thing to do, but that there’s not a good and low-breakage path to doing it; doing it right now without other changes would cause other worse design warts.

str doesn’t conform to Sequence, as has been explored before (see __contains__ contravariance violation), but this will introduce a new wart in its place with isinstance(str, Sequence) I believe this to be a worse wart than the current issues with Sequence[str] if not addressed as part of this.

Generally speaking, I don’t think it’s safe for type checkers to narrow on isinstance involving ABCs or types that have defined __instancecheck__ or __subclasscheck__, but changing that behavior would violate longstanding user expectations. I also don’t think we can remove the registration that happens at runtime, even if with the benefit of hindsight we would say this isn’t something we would register. These are methods of bypassing subclassing which are neither typesafe nor possible to be checked for type-safety and are explicitly escape hatches.

This was on a list of things I thought would likely never be entirely fixed due to the history and many decisions that predated the type system, so I’m pleasantly surprised to see some appetite to address this one, but this was also one of the harder ones from that list to have a suitable long term fix for everyone in my estimation.

beauxq · June 28, 2024, 9:44pm

After we have difference types, I think the best solution is to have a strict rule in the type checkers (maybe not enabled by default) to disallow Sequence[str] (and other applicable collection types) unless they have a union or a difference with str

def foo(x: Iterable[str]) -> None:  # report problem, with suggestions below
    ...

def goo(x: Iterable[str] | str) -> None:  # ok
    ...

def hoo(x: Iterable[str] - str) -> None:  # ok
    ...

(depending on the syntax decided for difference types)

But this solution doesn’t work until we have difference types.

mikeshardmind · June 28, 2024, 9:49pm

Difference types have other problems of soundness that make them inappropriate for narrowing this case.

Consider the following:

def outer(s: Sequence[str]):
    inner(s)

def inner(s: Sequence[str] - str):
    ...

outer("boom")

This would require proof at every call site.

Disallowing Sequence[str], Iterable[str], etc on it’s own does not avoid this issue, it would apply to user defined structural types too, and to more than just str

beauxq · June 28, 2024, 9:56pm

But following my suggested strict rule, how would you annotate outer? (since bare Sequence[str] would not be allowed)

(I accept that there may still be problems with my strict rule suggestion - probably some adaptation of this.)

mikeshardmind · June 28, 2024, 10:14pm

This applies to all negated types, even those not the subject of discussion right now. You can’t use a negated type without proof the negation holds. Special casing Sequence[str] to be the one case you don’t need to check this anymore would require everyone writing more type info that isn’t even accurate, which is harmful to user intuition and ease of use.

The important parallel is below:

class A:
    ...

class B(A):
    ...

def function_returning_seqa() -> Sequence[A]:
    ...

def runtime_penalty_req(s: Sequence[A] & ~Sequence[B]):
    ...

How do you ever call this function safely? If you know it is a sequence[A], this is a covariant context, and B is always a possibility unless removed. you need positive proof that it isn’t B

def seq_narrower(s: Sequence, inner_type: T) -> TypeIs[Sequence[T]]:
    ...

def safe():
    s = function_returning_seqa()
    if not seq_narrower(s, B):
        runtime_penalty_req(s)

What you’re proposing here is to special case the exact negation : Sequence[str] - str from this, and require everyone to write Sequence[str] | str to mean Sequence[str] as it exists today and Sequence[str] - str to mean the “correct” way.

The special casing alone should sink this, this won’t compose well if this type arises inside a more complex type

Liz · June 28, 2024, 10:23pm

Honest question, do people actually rely on isinstance(x, Sequence) where x is a str? I’ve only seen discussion about how this is always an api design issue. How hard would it actually be to just fix all of this, runtime and typing?

beauxq · June 28, 2024, 10:42pm

The only way I see to totally fix it would be a fundamental change to the Python language.
The problem is that when you iterate through a str, the type yielded is str

for x in "hello":
    print(type(x))

I think this ^ is the api design issue. But we can’t change that without at least a change to Python 4.

If we didn’t have this as a fundamental part of Python, sure, people could make their own data structures with this problem, but at least we could more easily avoid it and recommend against designing data structures this way. But with this being a fundamental part of Python, we don’t really have a way out of it.

There’s no way to avoid it being a special case.

mikeshardmind · June 28, 2024, 10:58pm

That’s not a problem here, you may want to see the prior discussions on the matter. it isn’t an issue that str is Iterable[str], the problem is a combination of Sequence having str be registered as a subclass at runtime, and a lie in the typeshed to match that, allowing str to claim compatibility with Sequence (it isn’t compatible, as mentioned above, it’s currently in the typeshed as being a subclass of Sequence, which required type ignores to make happen due to the incompatibility)

beauxq · June 28, 2024, 11:04pm

I have never seen a single person complain about that for any practical situation (only for theoretical situations).

All of the linked issues in the original post and the many more issues I’ve seen are just as much a problem with Iterable[str].

mikeshardmind · June 29, 2024, 12:40am

There might be a way to do a slightly more powerful (still ugly) fix here

Adding a rule where for all G[T] where T is a type that would normally be compatible with G[T], T is not assignable to G[T] and requires a specification of G[T] | T, ensuring that the use of G[T] to include T is intentional.

This covers all basic cases with str and does not require modifying Sequence, str, isinstance, or waiting for type negation. I’m unsure of the impact or if this causes collateral damage for recursively defined types.

beauxq · June 29, 2024, 12:45am

That sounds very worth exploring.

Liz · June 29, 2024, 12:47am

If you change that to be for any non recursive definitions G[T] and T, all of that, it shouldn’t mean we can’t also fix the typeshed right?

adriangb · July 1, 2024, 10:19pm

I won’t speak as to the consequences or history but I will say this is also a common request from Pydantic users (who want Pydantic to essentially prevent the bugs).

srittau · July 2, 2024, 8:28am

I would generally prefer if we’d try to move to stubs that more closely resemble the runtime implementation. In the past we’ve used quite a few hacks to make things work one time or another, but often while these hacks help in one particular case, they make life harder in other cases. As the typing spec and type checkers have matured, often these hacks are more of a liability.

That said, moving off these hacks can be a challenge, but a challenge that will be worth it, in my opinion. One idea for these “virtual abc base classes” could be to have a special decorator like @collections.abc.register that can be used to decorate classes:

@register(Sequence)
class str:
    ...

This of course would need special casing by type checkers.

Also it might be worthwhile to consider how we could replace the ABCs with protocols or turn them into protocols somehow.

petercordia · July 2, 2024, 2:31pm

Would there be any chance of describing str as Sequence[Char] instead of Sequence[str]?

Char could be defined as: any string char such that

for c in char:
   c == char

Jelle · July 2, 2024, 5:06pm

Note that @Sequence.register already works:

>>> from collections.abc import Sequence
>>> @Sequence.register
... class B: pass
... 
>>> B
<class '__main__.B'>
>>> issubclass(B, Sequence)
True

So we could use this in stubs and reflect the runtime accurately. However, we’d need to amend the spec to require type checkers to support this decorator.