Type intersection and negation in type annotations

I think my preferred resolution here is:

  • Not type gets added
  • some type alias like typing.StrictStr is added for str & ~typing.Iterable[str]
  • type checkers can set --strict-str to mean that str is treated as StrictStr

After that, making --strict-str the default (or, for mypy, making it part of --strict) can be handled as tool maintainers see fit.

There’s one really hard part here, based on what I’ve seen the typing experts say, which is the first part, Not. After that, it’s all seems like it should be easy.

EDIT: Sorry, I meant Not (~) and Intersection (&) above, not just Not.

1 Like

I would like to see type intersections too, but I don’t think type intersections or negation are necessary for this problem.

Like you say, type negation is “the hard part”, so I’m proposing skipping it and doing two easy things (bulleted points) instead.

1 Like

This implies that it’s the string itself that’s different - that some strings are iterable and some are not. Much more commonly, what happens is that one situation wants to treat strings as iterable and another situation wants to treat them as non-iterable.

If the idea of two operators doesn’t make sense, would it be possible to have “type subtraction” rather than using the and-not notation? So Sequence - Sequence[int] would mean “a sequence of anything that isn’t integers”. That way, there’s no semantics needed for the negation without the intersection. Downside: It’s less neatly matched with the “this or that” notation for type union.

1 Like

Since str is a structural subtype of Iterable[str], it follows that str & ~Iterable[str] is Never.
Instead, one needs a type alias for non-recursive collections, i.e. Iterable[T] & Not[T]

3 Likes

Is this the case, or does str implement the interface of Iterable[str]?

If the latter, then what is ~Iterable[str]? Is it the (infinite) set of all methods that Iterable[str] doesn’t have? Knowing that Iterable inherits from object, does this set omit __str__() -> str, __eq__(Any) -> bool etc?

Yeah, that’s why I’m unsure of the value of ~Type as a concept - it’s ONLY valid in conjunction with &, and vice versa. Hence thinking that type subtraction may be a better choice.

3 Likes

Well, isn’t this exactly what a structural subtype is? Wikipedia says:

I guess in python it is called an implicit subtype instead (from PEP 544 – Protocols: Structural subtyping (static duck typing) | peps.python.org):

If a class includes a protocol in its MRO, the class is called an explicit subclass of the protocol. If a class is a structural subtype of a protocol, it is said to implement the protocol and to be compatible with a protocol. If a class is compatible with a protocol but the protocol is not included in the MRO, the class is an implicit subtype of the protocol.

I think X ≤ ~T should simply be equivalent to X≰ T, i.e. any type X that is not a subtype of T would be valid.

Adding to the type negation thread. How about this concrete use case?

From the typeshed, replace

@overload
def is_dataclass(obj: type) -> TypeGuard[type[DataclassInstance]]: ...
@overload
def is_dataclass(obj: object) -> TypeGuard[DataclassInstance | type[DataclassInstance]]: ...

with

@overload
def is_dataclass(obj: type) -> TypeGuard[type[DataclassInstance]]: ...
@overload
def is_dataclass(obj: object & ~type) -> TypeGuard[DataclassInstance]: ...

Resulting in a stricter type? Or have I misunderstood something?

Thanks for the correction above, and the further thoughts on this.

Subtraction may be what’s really wanted here.
We want str_minus_iter to be an expressible type. Whether that’s expressed in terms of Iterable or not, or how the type algebra expands to support it, doesn’t really matter to the vast majority of users.

I have no doubt that typeshed/stdlib will have various uses for whichever addition is made.

My basic thought is that the type system doesn’t currently allow this idea to be expressed, so asking for str to be special-cased is asking type checkers to implement a behavior which does not have any kind of parity with types that users can write. That seems likely to confuse developers sooner or later, and could even lead to more divergent behaviors between different type checkers.

Aren’t changes to typeshed’s definition of str required regardless? Even if the type system develops a way to express “str minus Iterable”, that won’t make sense as a type since str itself includes the methods that satisfy Iterable, so the set of allowed types would be empty.

1 Like

The proposal is type checkers special case str and no change to typeshed is done for it. This would be rule type checkers implement that go outside normal type system rules.

Rules like this have been done before. When you subclass a type normally parent methods are checked for liskov compatibility. That is not for init and code wise the type checkers just make a special case and ignore that method for that rule. Similarly in past implicit optional was thing and was special rule type checkers had for None.

Maybe I completely misunderstand the nomenclature here, but don’t we want Iterable[str]_minus_str to be a type?

That is any Iterable of strings except str itself.

I understand that there is the idea that this would all be easier if str was not iterable. And maybe it would – but the fact is that it IS iterable, and it would be a very large breaking change to change that now (maybe py4, but that’s not relevant now).

So I think the goal here it to have a way to specify that, e.g. a function is expecting Iterable[str], but doesn’t want to accept str – and that doesn’t require a new “type” of string – which wouldn’t be practical anyway – regular old strings are all over the place in code, and always will be.

I think the easiest way to solve the atomic string issue is to do what pytype did, which is to make type checkers not consider string to be iterable (they also blocked container, but that’s not necessary imo). They have observed that this finds many bugs and has a very small cost. Specifically, the cost is having to call cast when you want to use a string as an iterable, or passing a string to a function that requires a generic iterable, collection, or sequence. Both cases are not that common.

One proposal in this thread of using type negation would mean rewriting all annotations of iterables (e.g., Collection[T]) to be Collection[T] - str. The advantage of this is that it’s opt-in, but being opt-in, you don’t find bugs until you opt-in. It has a much larger churn cost than the other solution. Even if we had type negation, I don’t think this is a good way to do it. And we’re not even close to having type negation since we would probably need to implement intersection first (even if you write it as subtraction, the evaluation is probably by conjunction).

The strict string (str - Iterable[str]) proposed at the top of the thread doesn’t work for the reason that Randolf points out.

1 Like

The way you have answered this makes it sound to me like it is categorically impossible to define a type, with the current type system or a future one, which describes a string which is not iterable.

But that doesn’t ring true to me, so it seems like I am somehow misunderstanding you. I would appreciate some help filling in the gaps.

If we accept that str defines some interface, then surely we can construct a type which has that same interface but doesn’t have __iter__? Other languages have managed to implement exclusion and subtraction types – have they only done so by committing some significant taboo and by breaking type theoretic rules?

I happily accept that str - Iterable[str] is the wrong way to phrase this, and that it means the bottom type Never. But I would appreciate some explanation of either

  • how it is possible to define AtomicString (as it was called in the original thread) using type operations other than subtraction

or

  • what it is about python’s type system in particular that makes AtomicString impossible

I think this narrows the space of costs to consider too much.

Some other costs off the top of my head:

  • Furthering the divergence of type checker semantics (mypy vs pytype vs pyright vs pylance…)
  • Library maintainers cannot publish types which use AtomicString or IterableString explicitly

Surely the real problem is that str isn’t compatible with that type (by definition) so such a type isn’t useful in practice?

The key thing is that it’s a type that explicitly prohibits iterability. Having a type that is satisfied if you have a set of methods, but doesn’t care what extra methods you have is easy. But that’s not what people want from “atomic strings” as far as I can tell.

1 Like

I still think this is approaching the problem from the wrong direction – IIUC, the problem is not that strings are Iterable, but that they are specifically Iterable[str]. And most of the time a function is expecting a Iterable[str] it will work with a str, but not with the result that’s expected – technically it’s a value error, not a type error, but practically, it’s a type error (and I think the most insidious common one in Python since integer division was removed).

So what would solve the problem is a way to define any iterable[str] except not str itself – i.e,

Iterable[str] - str

Yes, that would be a bit klunky, considering that’s it’s probably the most common use case [*] – but if there was a way to define it, then it could be given an alias, and no more klunk.

What I wonder is what other applications there are for “subtracting” types – if this is the only one, then probably better to put a kludge [uh – special case] in the type checkers.

[*] maybe not unheard of to take both an Iterable[str] and an [str] – this has been an ongoing issue long before static typing – and there’s plenty of code out there (at least in my codebases :slight_smile: ) like:

def fun(bunch_o_strings):
    if isinstance(bunch_o_strings, str):
        bunch_o_strings = [bunch_o_strings]
    for string in bunch_o_strings:
        do_something(string)

And that can be correctly typed as Iterable[str] right now.

And that doesn’t just “fix” the type error, it provides a “nicer” API – maybe the same people that want strict typing wouldn’t like such a squishy API, but my scripting users do :slight_smile:

I think the answer here may be that it’s impossible to describe a type that doesn’t exist, but pretend that an existing type is that type. It wouldn’t be the least bit hard if there actually WAS an AtomicString

But anyway: these issues with str being an iterable of str have existed for years, long before static typing – so maybe it’s too much to expect the type system to solve it without making any changes – if folks really need an AtomicString, then maybe make one?

But i don’t think the problem actual is that str is iterable – that’s actually handy sometimes. The problem is that str yields str when you iterate – and that goes on forever.

When you have nested lists, you can iterate and get a list, and iterate and get a list, but eventually, you get to some non-list type.

When you iterate a numpy array you get a lower rank numpy array, but eventually you get a scalar.

When you iterate a str what you always get is a length-one str – forever. So maybe rather than an AtomicString, we need a Char.

Maybe the type system could consider a string an Iterator[Char] – so it would not satisfy Iterator[‘str’] anymore. And I think that could be done without there actually being a Char.

Hmm – then you would need to cast when you DO iterate through a string and want them to be considered strings. OR could you define a type in the type system that doesn’t actually exist?

Maybe back to Iterable[str] - str – if there could be a way to spell that.

Stephen, of course one could define e.g. a typing.Protocol that matches the signatures of str, but lacks the __iter__ method. But from a typing perspective it doesn’t make much sense to do this because likely, when you use strings, you also will use methods that one way or another make use of the str.__iter__ method. It’s not that “AtomicString”, or rather more succinctly “NonIterableString” is impossible, but it seems not like a type that is particularly useful.

Like, what problem would this type solve? The usual issue people have is that they want to type hint a function that takes a collection of strings. Being good typists, they want to use collections.abc generics such as Iterable[str]/Collection[str]/Sequence[str] instead of list[str]/tuple[str, ...]. However, since strSequence[str], this matches str as well which is usually not intended. And the solution to this is to define a non-recursive collection Sequence[T] & ~T as I explained in my previous comment.

I like that! that is exactly the point here.

However, would this be used anywhere other than str?

For instance, it’s certainly possible (and common?) to represent a 2D list as a list-of-lists – which would be a recursive collection – would you want to disallow that in code that otherwise takes a generic Iterable? maybe.

What is confusing about this for me is that what’s “special” about str is not just that it’s a “recursive” iterable – but also that it’s an iterable that will only ever yield one type: str itself, but really a non-defined type, which is a length-1 str [*]. I don’t think there’s any other type in the stdlib that is like that, but of course there are any number of custom types that might be. Perhaps that can be taken advantage of?

[*] I haven’t kept track, but maybe it would be useful to look at what numpy has been doing – I know there’s been effort to type numpy arrays properly, where the size (or rank, at least) of the array is part of the definition --any lessons there?

Another place where this happens is when working with matrices/tensors. For example, when working with pytorch, iterating over a tensor of shape (10,) returns 10 instances of a tensor of shape (). Here, one might want to annotate a collection of tensors of possibly incompatible shape (so that they cannot be stacked as a single tensor with more axes). But annotating it as Iterable[Tensor] runs into the same problem because Tensor is iterable itself.

For numpy, it is slightly differently, because numpy returns special scalar objects instead of 0-dimensional tensors. however, the same still applies if one considers iterating over a higher dimensional tensor.

Does this not suggest that the right abstraction here is “constrained types” - “str with 1 character”, or “tensor with 0 dimensions”? Other examples might be “positive integer”, for example. It may be that the type system can’t express such types. So be it - there’s plenty of runtime semantics that the type system can’t express, that’s in the nature of it being a static system.

Type intersection and negation might be useful concepts. That’s still to be established. And they seem unrelated to constrained types, or expressing more precisely the type of the value you get when iterating over (or indexing!) a string or tensor.