Removing `Sequence[str]` as base class of `str`

matthewyu0311 · July 3, 2024, 5:54am

char (code point to be pedantic) is still a subtype of str. Such a type may be useful outside of the Sequence[str] vs Sequence[char] debate though, for example, when typing functions that accept a single Unicode code point such as ord() and those in unicodedata.

In the language spec, strings are defined as

immutable sequences of Unicode code points.

but there seems some room for interpretation as to how well this conceptual sequence corresponds to collections.abc.Sequence. The language itself has made some practical exceptions regarding strings(strings are *unpack-able, but strings, like bytes, don’t match sequence patterns in pattern matching), and I as a user for one do appreciate these bits of practicality. Maybe this is an opportunity to clarify what kind behavior of sequence we want to encompass in collections.abc.Sequence.

Liz · July 3, 2024, 7:21am

Should type checkers support this at all though? str isn’t a sequence.

Right now, the typeshed places Sequence[str] as a base class for str. This involves needing several type ignores because this would otherwise violate rules for subclassing (being incompatible)

github.com

python/typeshed/blob/683ff51acb234b8a71dc63207bb53d65b8cb42f2/stdlib/builtins.pyi#L433-L612


      
          class str(Sequence[str]):
              @overload
              def __new__(cls, object: object = ...) -> Self: ...
              @overload
              def __new__(cls, object: ReadableBuffer, encoding: str = ..., errors: str = ...) -> Self: ...
              @overload
              def capitalize(self: LiteralString) -> LiteralString: ...
              @overload
              def capitalize(self) -> str: ...  # type: ignore[misc]
              @overload
              def casefold(self: LiteralString) -> LiteralString: ...
              @overload
              def casefold(self) -> str: ...  # type: ignore[misc]
              @overload
              def center(self: LiteralString, width: SupportsIndex, fillchar: LiteralString = " ", /) -> LiteralString: ...
              @overload
              def center(self, width: SupportsIndex, fillchar: str = " ", /) -> str: ...  # type: ignore[misc]
              def count(self, sub: str, start: SupportsIndex | None = ..., end: SupportsIndex | None = ..., /) -> int: ...
              def encode(self, encoding: str = "utf-8", errors: str = "strict") -> bytes: ...
              def endswith(

This file has been truncated. show original

github.com

python/typeshed/blob/683ff51acb234b8a71dc63207bb53d65b8cb42f2/stdlib/builtins.pyi#L585-L586


      
          # Incompatible with Sequence.__contains__
          def __contains__(self, key: str, /) -> bool: ...  # type: ignore[override]

What would it even mean for type checkers to support this via register? If the answer to that is that str remains a subtype of Sequence to type checkers, then the type system becomes less sound, not more. Right now, the damage is limited to places where people have told type checkers to “shut up” with a type ignore or an unsafe cast.

Daverball · July 3, 2024, 7:51am

The way I understand the suggestion is for type checkers to add support for virtual subclasses in the way they work at runtime, i.e. they are only considered a subtype as far as isinstance/issubclass checks are concerned, but everywhere else it would not be considered a subtype because Sequence is neither part of the mro, nor does it structurally match str.

This seems like a pretty good solution to me, because it properly models the runtime behavior. But it is also asking type checkers to add explicit support for virtual subclasses of ABCs and it makes narrowing the type after isinstance/issubclass checks more complex and wouldn’t compose well with TypeIs/TypeGuard functions that internally use either of those checks. So the potential for soundness holes is still there.

mikeshardmind · July 3, 2024, 7:53am

If that’s the goal, then understanding register isn’t sufficient

We don’t need a special case understanding of register, just an understanding of which types as the second argument to isinstance might have differing runtime behavior from typing behavior to avoid narrowing on those types.

mikeshardmind · July 3, 2024, 11:07am

So, most of the options various people have presented are not mutually exclusive options to improve the situation here. Some of them improve different aspects of the situation than others, so maybe we should compare what gets improved on them directly, what the costs are, and what the impact is for different interested parties (typecheckers and users, primarily)

Just changing the typeshed for str

If we just change the typeshed entry for str to not inherit from Sequence[str], This “fixes” the user perspective on Sequence[str], but not Iterable[str]. It improves user’s expressibility, It improves soundness, has no complexity changes for type checker implementations, no new special casing is strictly required. It adds a cognitive complexity if anyone has a case where they rely on isinstance(x, Sequence), but that also subjectively removes a cognitive complexity of needing to evaluate every function that accepts Sequence[str], does it intend to accept str

Changing the typeshed for Sequence

We could instead change the typeshed for Sequence so that Sequence[T].__contains__ is only guaranteed to accept T, then augment the various types that subclass it and accept object with more information. This would improve soundness, but have no other positive or negative impacts.

Adding a rule so that T is not assignable to G[T]

This would improve soundness and expressibility for users. It does not change soundness directly. It would increase complexity for typecheckers, but the rule generically applies and does not require special casing specific types. It solves both Iterable[str] and Sequence[str] cases presented alone. By the logic above, this likely improves cognitive complexity for users in some cases, but may harm it in others. This might warrant additional user diagnostic info or a specific callout for the str case (The only one which can occur from composition of builtin types) in the typing docs.

Specifically, for SubscriptedType: ST, and GenericType: GT, if GT is not exactly ST, and ST is a subtype of GT[ST], ST is not assignable to GT[ST] despite the subtyping relation, and requires writing GT[ST] | ST.

Do nothing and wait for type negation

Type negation requires either positive proof at runtime or violating principles of gradual typing. It doesn’t play well with covariant return types, which could have a wider type than expressed. I’ve been exploring this as part of all of: intersections, user denotable Not, and an experimental non-specification compliant type-system designed around soundness.^[1] While it could help here, it would increase the number of runtime uses of isinstance rather than decrease them, and would decrease ergonomics if done without violating principles of gradual typing over the current status quo (where the best option is check once, in the function that cares)

Change isinstance behavior to be aware of typing escape hatches

Improves soundness, disruptive to ABC use, increase in complexity for typecheckers. Likely no significant change to ergonomics.

typecheckers gain awareness of ABC.register

Same as above, but more limited in scope.

Add special type aliases

This could be used as a stopgap to make annotating intent easier, but this would exclude some valid cases. The current definition in useful_types is fragile, and possible to break in a few ways, these include slightly less, but none of these allow str itself.

type CommonSeqStr = list[str] | tuple[str]
type CommonIterableStr = Generator[str, Any, Any] | dict[str, Any] | set[str] | CommonSeqStr

Personal Assesment

A combination of Adding a rule with T not assignable to G[T], and modifying Sequence in the typeshed (not modying str in the typeshed!) so that str can be compatible with its definition would improve both soundness and expressibility with minimal churn for users. It would come at a cost of increased complexity for typecheckers and would likely require buy-in from existing typecheckers to accomplish. The other steps here would be possible to entertain separately from these to continue improving other aspects beyond the initial handling of what users currently feel is missing.

I believe this would also be the least disruptive option. Altering when typecheckers should narrow in isinstance would be the logical followup, but is significantly more complex and likely to have mismatches with user expectations, and I think requires a lot more care in evaluating.

This is currently only an exploratory experiment, and not something ready to use, and may not ever be suitable for production use. ↩︎

srittau · July 3, 2024, 11:11am

Exploratory typeshed PRs would be welcome. This could help use judge the impact of the suggested changes.

mikeshardmind · July 3, 2024, 11:50am

Impact of changing sequence in typeshed here seems to have a small, but non-negligible of collateral damage. There might be an alternative change possible that hits less. It also better highlights that type variable bounds might benefit from gaining capabilities in the future, but this would be another increase in overall system complexity ^[1]
Without more than a cursory exploration into the new issues, this seems to catch a few legitimate issues, and stops catching at least one false positive. I don’t think the new issues are false positives from what I changed, there are a lot of issues that look like:

Unsupported operand types for in (“str | None” and “Sequence[str]”) [operator]

which would be the desired outcome of fixing it from this direction, I think. (None in “some str” is a RuntimeError, other cases where you know the input can’t be in the Sequence statically are also possible to avoid with alternate construction…)

Right now, the way variance is calculated for typevariables, and the inability to place generic bounds on a typevariable is limiting for sound expressibility of type constraints as they apply to methods of generic types. ↩︎

Liz · July 3, 2024, 12:12pm

That looks like a lot of new issues at first, but there’s 15 affected libraries

of those 15, 2 are positively changed, no longer having errors.

of the remaining 13

7 only have issues of the form Something | T in Container[T], where Something is disjoint with T (the most common version is None via Optional)

2 have an issue of the form object in Container[T], where T is more specific than object.

The remaining 4 are a bad interaction with numpy’s ndtype/ndarray, which looks to be possible to fix centrally in the stub for numpy.

I was expecting a much larger impact.

beauxq · July 3, 2024, 5:56pm

I think it’s important to make it clear what problems we’re dealing with:

There are 2 different problems being discussed here, and these 2 problems are only slightly related to each other.

Problem A

The type of the argument for __contains__ (whether it is object or _T_co) is currently inconsistent and technically unsafe.

This is why str is technically not a valid Sequence.

Although it is technically unsafe, and working to make the Python type system more safe is worthwhile, we should note that not very many users run into issues with this problem.

Problem B

Users often want to design an API like this:

def foo(x: Iterable[str]):
    ...

(Iterable[str] could be replaced with Sequence[str] or Collection[str] or others, but Iterable and Sequence are the most common.)

It is usually a bug if someone passes a str to this function.

foo("abc")

Users expect the type-checker to report an error for this, because this isn’t a type they’re expecting, but it doesn’t report an error.
And (in the case of Iterable) there isn’t any good way to annotate this so the type checker will report an error for this bug.

This is a common issue that many users run into, and they report issues with mypy and pyright and other typing projects.

One of the most significant ways to describe how these 2 problems are related to each other is that:
The existence of one problem makes it possible to have a hacky-workaround-half-solution to the other problem.

SequenceNotStr

jshumphrey · July 3, 2024, 8:55pm

I absolutely agree that Problem B is an important side of this discussion - but I believe I’m drawing the opposite conclusion: because of Problem B, SequenceNotStr is not a good solution.

The problem that users often have (myself included, when I ran into this behavior for the first time and started reading about this issue) is that they don’t realize/remember that str is compatible with Sequence[str], and as a result, this code is expected to be correct and bug-free:

def foo(x: Sequence[str]):
    for arbitrary_length_string in x:
        ...

Annotating x as SequenceNotStr would certainly cause a type-checker to report an error when analyzing foo("bar"). However, adding that annotation requires the user to expect the bug; in order to annotate x with SequenceNotStr, the user must already know that Sequence[str] can be dangerous, and must know that SequenceNotStr exists to help mitigate the issue.

For users that are aware of this potential bug and would like to annotate their variables more specifically in order to preclude it, SequenceNotStr makes sense. But that doesn’t help this other group of users that don’t anticipate the bug and get bitten. (Perhaps this is really a separate “Problem C”, where the user isn’t aware of potential problems with Sequence[str].)

(I apologise for any major misunderstandings; I’ve read a lot of discussion on this topic, but this is my first post here.)

dimaqq · July 4, 2024, 6:56am

I’m curious about the bugs that Eric found.
There’s output in the linked experiment… I wish there was a processed version pointing to specific bugs with some explanation why those are bugs.

mikeshardmind · July 4, 2024, 12:32pm

bugs with this are often logical bugs:

def tokenize_text(s: str, locale: Locale) -> Sequence[str]:
    """Split text into tokens based on locale rules,
    preserving association of punctuation."""

def justify_text(width: int, locale: Locale, s: Sequence[str]) -> str:
    """Justifies a sequence of textual tokens in a column,
    minimizing excess whitespace
    while preserving boundaries between tokens.
    uses locale rules for breaking up tokens that exceed column width. 
    """

These functions are probably possible for people to imagine how they would work. With both of them exposed in a library it would clearly be an issue catchable by a type-checker if someone is calling justify text with a string that hadn’t been tokenized yet ie

justified = justify_text(30, "en-US", "this isn't going to work how the user or library author intends at all")

The general point of Sequence is to be relatively permissive about the container, you’re fine just being able to iterate over it and index into it, but that fact that Sequence[str] includes str (It’s the only Sequence[T] = T situation other than Sequence[Sequence]) has negative outcomes when the intent isn’t to iterate over text by unicode codepoint. By not treating str as Sequence[str], misuse based bugs like this would be possible to catch statically. Not all bugs like this are going to be as obvious as this one is.

Jelle · July 4, 2024, 1:52pm

This would work for the intended purpose (make it so str is not assignable to Iterable[str] or Sequence[str]), but it also creates a weird special case that might confuse users in other contexts. For example, I could easily imagine users writing a protocol like SupportsAdd[T] where SupportsAdd[int] describes something that can be added to an int. Such a user would want int to be compatible with SupportsAdd[int].

There are also some details of the specification to consider. What if G has multiple type parameters? What if it has multiple but some of them have defaults?

mikeshardmind · July 4, 2024, 3:55pm

I can’t. This isn’t expressible generically in a meaningful way as a protocol with only a single typevariable due to operator overloading and the presence of __radd__. Best you can do is below, which for any sensible use would also need annotating the resulting type of the addition.

class AddableTo(Protocol[T, R]):
    def __add__(self, other: T) ->  R: ...
class RAddableTo(Protocol[T, R]):
    def def __radd__(self, other: T) -> R: ...

type AddableWith[T, R] = AddableTo[T, R] | RAdaddbleTo[T, R]

and this is still incomplete, because a bad order of operations with add and radd defined in conflicting ways is possible.

This would result in multiple type parameters, only one of which matches the input type which would definitely avoid the issue.

Considering the actual purpose here is distinguishing between an intended and unintended inclusion when two types are overlapping, we can formulate this more precisely, and get the intended result. There are a few slightly more fringe issues possible, so we can take it a step further and really narrowly tailor this, then consider widening it at a later date if someone has a use for it to apply more broadly.

Given a generic type GT and a subscription of that type ST, where GT is a generic type that accepts only a single covariant type variable, if ST <: GT[ST], ST is not exactly GT, and ST is not a gradual type, ST is not assignable to GT[ST]

the inclusion of covariance there will catch a host of other constructed cases I’ve never seen in the wild.

We could go even more narrow, and say that in the above, swap ST for str, but I’m trying to avoid special casing specific types here.

Liz · July 4, 2024, 4:29pm

I appreciate that you’re exploring this and trying to reconcile everything, but in reading where this has gone, especially this part:

It’s important to remember that neither Sequence nor Container should be covariant to begin with, so you’re hedging that construction on that implementation detail that’s not quite correct either.

after seeing it explored, I’ve come to a different conclusion about a mix of ways to handle this.

remove sequence from str’s bases in the typeshed, add a note that isinstance isn’t safe on abcs, wait for type negation to fix the rest. maybe fix that all of these container types should be invariant too thanks to their methods.

beauxq · July 8, 2024, 10:19pm

One of the main reasons that people use Sequence is because it’s covariant rather than invariant.

If we’re going to treat it as invariant, then we need a covariant replacement (with only the methods that can be safely covariant).

Liz · July 8, 2024, 11:33pm

Many people using it for covariance are introducing unsafety in doing so. Passing a list (invariant) to function expecting a covariant sequence is unsafe, but typecheckers are allowing that. We don’t need a covariant sequence type.

mikeshardmind · July 8, 2024, 11:47pm

I think the discussion on fixing variance in general within python needs to be another topic, and doesn’t need to be a predicate to fixing Sequence[str] vs str

davidfstr · July 12, 2024, 12:39pm

From the POV of making the default behavior of type checkers catch common mistakes like accidentally passing a str where an Iterable[str] is expected, I’d be in support of making Sequence[str] and Iterable[str] not be supertypes of str at type checking time.

I have no opinion RE whether isinstance(my_str, Sequence) or isinstance(my_str, Iterable) should continue to return True at runtime or not. That’s not a behavior I’ve ever relied upon IIRC.