An interesting pytype experiment, and a possible extension to strings

I believe that a typing.SequenceString type solves your requirements, Paul. At some point the code does have to tell the type checker that using the sequence protocol is intentional. In your function this would be done by using the SequenceString type in the function argument’s annotation in the function signature.

If I understand OP correctly, the type checker would then understand that strings passed to this function are supposed to also be iterable and so on.

This gets me thinking that the automatic promotion idea here is somewhat underspecified in that it’s unclear whether the string in question should keep being a SequenceString when passed further down the call stack. It’s impure if it isn’t but somewhat useless if it is (because that particular string object will no longer reject being used for iteration elsewhere).

If SequenceString were implemented as a typing.NewType of str, that would require explicit promotion, which you don’t like, but it would also keep the string object typed as a SequenceString across function boundaries, which, it seems to me, defeats the purpose.
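For concreteness, here is a minimal sketch of the NewType variant described above. SequenceString is the hypothetical name from this thread, not something that exists in typing, and the helper functions are made up for illustration.

```python
from typing import NewType

# Hypothetical type from this thread; it is NOT part of the typing module.
SequenceString = NewType("SequenceString", str)


def num_vowels(s: SequenceString) -> int:
    # The annotation signals that iterating over the characters is intentional.
    return sum(1 for ch in s if ch in "aeiou")


def helper(s: SequenceString) -> int:
    # The promoted type travels with the value into further calls.
    return num_vowels(s)


name = "education"
# Explicit promotion is required at the call site; at runtime it is a no-op.
count = helper(SequenceString(name))
```

At runtime SequenceString(name) simply returns the same str object, so the promotion only exists for the type checker, and once promoted the value stays a SequenceString wherever it is passed.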

So, all in all I start to think this is indeed too hacky.

IterableOfStrButNotStr seems cleaner indeed because a str would just not type check at the function call boundary. But the name implies this is just a special case of intersection and negation, which was split into a separate topic here.
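As an aside, something close to IterableOfStrButNotStr can be approximated today on a per-function basis with overloads. This is only a sketch: join_names is a made-up function, and checkers differ in how loudly they report a call that resolves to the NoReturn overload.

```python
from collections.abc import Iterable
from typing import NoReturn, overload


@overload
def join_names(names: str) -> NoReturn: ...
@overload
def join_names(names: Iterable[str]) -> str: ...


def join_names(names):
    # The runtime guard mirrors the first overload: a bare str is rejected.
    if isinstance(names, str):
        raise TypeError("expected an iterable of strings, not a single str")
    return ", ".join(names)


joined = join_names(["Ada", "Grace"])
```

This has to be repeated for every function, which is part of why a special case inside the type checker keeps coming up.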

Yes, it absolutely is that – but I’m naively imagining that a type checker could implement that somehow with a special case, rather than providing a general way to express negation. I thought that’s what this thread was about – special casing str, because it is indeed special.

I don’t see how having a SequenceString helps here – in library code anyway, users need to be able to pass in a regular string, don’t they? You’d have to have everything typed from top to bottom with this special case.

I like how you wrote “Sequence / Iterable / Collection”. This is slightly different from Pytype, which also blocked Container. I don’t think Container should be blocked.

Okay. Based on various comments in this thread, it appears that str.chars is too ambitious. We can amend this proposal to just recommend casting where necessary.

Okay, let’s remove the config switch from this proposal. I thought it would make it easier to accept, but it had the opposite effect!

I understand the sentiment, but it’s not possible to catch misuses of str being used as Iterable while at the same time accepting str being used as Iterable. The question is which is worth more? Is catching the misuses worth adding casts to your code? The Pytype people seem to believe that it is, based on their experiences with their users.

I understand wanting to give people a way to opt out, but even if such a type (say StrAndSequence) existed (and as Christopher points out, it would be tricky to implement because it has to match strings), when would you use it? Nearly always, you either want to accept string or iterable (but not string). You almost never want to accept both. Changing annotations to StrAndSequence is probably similar work in most situations to adding casts where appropriate, but has the downside that you lose out on the catching of misuses of strings. So all the work, but no benefit.


So, in short, the new proposal is to replace in the typeshed the following line:

class str(Sequence[str]):

with

class str(Container[str], Sized):
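To make the difference concrete, here is a rough runtime stand-in for what a checker would be told about str under the second line. AtomicText is a hypothetical class written only for this illustration; the real str keeps all of its runtime behaviour, and only the stub would change.

```python
from collections.abc import Container, Iterable, Sized


class AtomicText:
    """Hypothetical stand-in: supports `in` and len(), but not iteration."""

    def __init__(self, value: str) -> None:
        self._value = value

    def __contains__(self, item: object) -> bool:
        return isinstance(item, str) and item in self._value

    def __len__(self) -> int:
        return len(self._value)


t = AtomicText("hello")
supports_in = "ell" in t            # Container: membership still works
supports_len = len(t) == 5          # Sized: len() still works
is_container = isinstance(t, Container)
is_sized = isinstance(t, Sized)
is_iterable = isinstance(t, Iterable)           # no __iter__, so not Iterable
real_str_iterable = isinstance("hello", Iterable)  # the runtime str is unchanged
```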

So, to be clear - we are talking here about causing uses of str as an iterable to fail type checking. And we are assuming that this is an acceptable change, and the general rules about backward compatibility (breaking changes need a deprecation period, for example) don’t apply because this is “only” a change in type checkers.

How popular does type checking have to get before it’s treated as a first class part of the language, subject to the stability guarantees that the rest of the language provides? The argument “you don’t have to use typing” is wearing thin at this point - I considered saying “I’ll just type my strings as Any then”, but assumed that was too passive aggressive.

And the workaround offered to code that gets broken is “change your code”. But adding a .chars method won’t be acceptable, so people should cast. Why won’t .chars be acceptable? The reasons given seem just as relevant to cast. It says “allow me to use this existing method that isn’t being deprecated or removed”. No other methods that I’m aware of need special permission before you use them, other than ones being deprecated (where you might need to switch off a warning).

:person_shrugging: I don’t have real world code (that I know of) where this would be relevant. And if I do, I probably will just switch off typing (maybe by using Any). My argument here is mainly about the principle (of compatibility breaks, and typing no longer being special in that regard). But I’ve said my piece, so I’ll leave it at that.


A plan for gradual rollout across the ecosystem is an essential part of any proposal here.
Changing str in-place in a single step seems very rash.

If the type system can express str - Iterable[str] – or whatever flavor of that is palatable to those who are better versed in type theory than I – then it can presumably also express the thing that str is today.

It’s valuable to be able to write all three of these values side-by-side:

  • iterable string
  • non-iterable (“atomic?”) string
  • str with no specified meaning with respect to the other two, to be interpreted by the type checker

Then, once we have all three types available to type checkers, users can opt-in incrementally:

def num_vowels(s: typing.IterableString) -> int:
    # lower() returns an IterableString too?
    return sum(1 for ch in s.lower() if ch in "aeiou")

def strict_casefold(s: typing.AtomicString) -> typing.AtomicString:
    return s.upper().lower()

def casefold(s: str) -> str:
    return s.upper().lower()

This makes the most sense in the context of a type-checker flag which starts out with a lenient default, like how no-implicit-optional was handled.


Did you see this comment? str - Iterable[str] is simply Never.

Even if you could define AtomicString, then functions that accept Iterable[T] will continue to accept strings, which is exactly the kind of bug that we are aiming to prevent.

All of your points make perfect sense, and I appreciate your patient and valuable feedback. As you say, you are arguing from principles and not real world code. Could observing results from the MyPy Primer be convincing to you? Maybe we should try that to evaluate

  • how much code would require casts, versus
  • how many bugs are uncovered?

(And if you’re tired of this thread, no worries, didn’t mean to pull you back in :smile: )

My understanding from that thread was that the combination of Intersection with Not is distinct from Minus, and that the former can’t express this but the latter can.

I don’t want to get hung up on the exact way that these types are expressed, an area where I am no expert, unless it is not possible in any future iteration of the type system to have a coherent view of both AtomicString and IterableString.

If it is not possible to extend the type system such that both of those things can be defined (I’d be surprised), that should give us pause about the whole idea.


You mean that with AtomicString added, but without redefining str?

That is an intentional feature of my suggestion. Don’t break existing things until you have a migration path setup.

IMO, you shouldn’t tell people that you’re going to treat str as an AtomicString in python 3.14 (or whatever future version) if the two things can’t coexist. That’s simply too hard of a switch to throw.

Not directly, no. What would make me feel more comfortable with this discussion is if it was more focused on the question of backward compatibility. Has anyone actively checked large projects like pip, rich or sympy to see if they would be impacted by this change? Has anyone looked at how we’d introduce deprecation warnings before making this change?

I really don’t think “it’s typing, not core Python” is a reason for having lower compatibility standards these days.

Intersection and not is, by definition, the same thing as minus.

The suggestion doesn’t involve changing Python.

Just FYI, those three packages are checked by the MyPy primer. (The list is here.)

Great point. It would be really nice if MyPy could have a deprecation period in which it raises warnings whenever string matches Iterable.


You’re linking to the code, not to any actual data. But nevertheless, thanks for the information.

As a “typing skeptic”, I share Paul’s concerns here. I’m still not totally sure whether there are any changes being proposed to the language (adding str.chars would be a change, but yes, backward compatible). But it does seem that there is a proposal to change something in the stdlib typing module, yes?

Strictly speaking, it would only affect those that choose to use typing. But the fact is that typing is creeping into the Python ecosystem a LOT now – to the point where (I know from experience) many newbies think it’s recommended, if not actually required. And some of those folks very quickly start changing their code to make the type checker happy, rather than changing the type annotations.

All that is about messaging and education – but in regard to this thread, the point is that folks that are not strictly needing to do, or choosing to do, typing will be affected. And it’s not just newbies - maybe I’m confused, but this conversation seems to be focused on use cases within a system: i.e. where the same person (group) is writing both the type annotations and the calling code. But for libraries with type annotations, the users of those libraries should be able to ignore the distinction between an “atomic string” and an “iterable string” – I’m still not sure if that’s the case with this proposal.

Python is a highly dynamic language that we are adding optional static type checking to – the type checkers should contort themselves to match the language, not the other way around. Which means, as awkward as it might be, rather than essentially changing what type checkers think a str can do, the typing system should have a way to express:

“An iterable of str that is not a str”

Rather than re-defining what a str is and adding “a ‘str’ that is also an iterable”

Yes, I know that “An iterable of str that is not a str” is far more commonly needed than “a str that is also an iterable”, but the str is and has always been iterable, which is why I think the typing systems should adapt themselves to what’s already there.

Heck, maybe an ugly hack, but could Iterable[str] be special cased by the type checkers to reject str by default? That’s the actual problem, isn’t it? I understand that Iterable[str] is not strictly a single type – but couldn’t we pretend that it was?

That would be an interesting question to look into – in the pytype user community (and everywhere, I guess) – how often do they want a “non iterable string” other than in the context of Iterable[str]?

I guess what all this boils down to is that there IS a “typing problem” – but the problem is with Iterable[str], not with str itself – adding a static type system on top of a dynamic language is going to be messy now and again.

PS: please tell me what FM to R to answer this, but I’m confused as to what the problem is with Paul’s example:

def uppercase_and_vowel_count(s: str) -> tuple[str, int]:
    vowels = sum(1 for ch in s if ch in "aeiou")
    return s.upper(), vowels

Even if str is interpreted as non-iterable by type checkers, wouldn’t the type checker simply check that you specified str, and the caller is passing a str, and you’re all set?

I can see that this would mess up static type analysis – the code in the function couldn’t be correctly analyzed to know what type s is supposed to be – but it doesn’t have to, it’s been specified. Are there tools that check the actual code to see if it’s type-correct? And if so, couldn’t that tool be special cased to know that when you specify an actual string, you do mean an actual string?

Side note: I have used iteration through strings a LOT – but as I think about it, that’s because I use it in a lot of toy examples when teaching Python – I’m not sure I’ve used it in production code ever – the string methods provide most of the functionality I might use that for out of the box.

No, just the typeshed.

The proposal is essentially to make strings atomic in the typeshed.

That’s similar to this proposal. One major difference is that this proposal blocks all Iterable[T] from matching string. There are also a variety of consistency problems with the “ugly hack”.

Right, the proposal doesn’t affect calling the function.

s has type str. The problem is in the generator—if string is not considered iterable—you would need a cast, which some people find (understandably) objectionable.
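A sketch of what that cast might look like in Paul’s example, assuming a checker that no longer treats str as iterable. The cast is a no-op at runtime and the function behaves exactly as before.

```python
from collections.abc import Iterable
from typing import cast


def uppercase_and_vowel_count(s: str) -> tuple[str, int]:
    # cast() does nothing at runtime; it only tells the checker that
    # iterating over the characters of s is intentional here.
    chars = cast(Iterable[str], s)
    vowels = sum(1 for ch in chars if ch in "aeiou")
    return s.upper(), vowels


result = uppercase_and_vowel_count("banana")
```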

Yes, that’s what type checkers do. What we’re discussing is the behaviour of these tools.

That’s similar to the proposal, yes.

Thanks, yes, that’s what we’re hoping!

I can’t imagine any issues with str not satisfying any Iterable[T] – it does, after all, not satisfy any of the others anyway. But it’s only Iterable[str] that actually causes problems, yes?

My point is that IIUC, the proposal is to change what str means, rather than what Iterable[str] means, and that has impacts when you do want to use a string as an iterable – i.e. when you directly type a variable as an str.

and also with changing the meaning of str – maybe those inconsistency problems are insurmountable, so be it.

Sorry to be chiming in from a position of ignorance – I really am not that familiar with the latest in the typing scene – and I know that.

But I think there’s been an issue in the Python community: the folks working on typing have clearly said that typing is and will remain optional – so folks that are not all that interested don’t pay attention.

But typing, while optional, is making its way into the full Python community. And once in a while it directly affects non-static typing users (see the hoopla over PEP 563).

So I keep an eye on proposals that I think may affect me down the road. I may be misunderstanding this one, but it strikes me that it’s designed to make it easy on the type checkers, rather than consistent with what Python strings really are – maybe that will have minimal to no impact to folks like me, let’s hope so.

Thanks for being patient and responding to my questions :slight_smile:

The problem is str matching any Iterable[T].

Changing Iterable[str] wouldn’t fix the problem for generics, and would have other consistency issues. To give an example: with your proposal, a generic class C[T] with method f(self, x: Iterable[T]) would accept strings. Upon specializing it to C[str], it would no longer accept strings. This is an example of inconsistency.
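A small sketch of the generic class in question. The body of f is made up for illustration; the comments describe how a hypothetical checker that special-cased only Iterable[str] might behave, not what any current checker does.

```python
from collections.abc import Iterable
from typing import Generic, TypeVar

T = TypeVar("T")


class C(Generic[T]):
    def __init__(self) -> None:
        self.items: list[T] = []

    def f(self, x: Iterable[T]) -> None:
        # Declared against Iterable[T]: nothing here mentions str, so a rule
        # that only special-cases Iterable[str] has nothing to reject.
        self.items.extend(x)


c_str: C[str] = C()
c_str.f(["ab", "cd"])  # intended use: an iterable of strings
# After specialization the parameter reads as Iterable[str], so the same
# hypothetical rule would now reject this call. At runtime it silently
# adds the single characters "e" and "f".
c_str.f("ef")
```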

I don’t agree that changing str has consistency issues. It may require annoying casts, but everything remains consistent. (Edit, on reading this again, I guess you mean that the typeshed would be inconsistent with the language. Yes, I agree, but that’s not what I meant by consistency issues.)

The goal is for the impact on you to be positive rather than minimal by helping you to find common errors such as accidentally passing a string to an iterable consumer, while hopefully not requiring too many annoying casts. I think a good error message and an introductory period (where warnings are given on str matching Iterable) would hopefully mitigate confusion.

My pleasure.

Hmm – maybe, but isn’t that common? – if I have a tuple[int] – it would get accepted as an Iterable but rejected as not an Iterable[str] as well. Isn’t that the whole point of specifying what an Iterable holds? And I’m not making this up – isn’t an array.array or numpy array very much a container, and an Iterable, but not of arbitrary types?

So it makes perfect sense that str would be an arbitrary Iterable, but not an Iterable[str] or an Iterable[int], or, in fact any other specific type.

If we were to start all over again, I might give Python a character type – then str would be an Iterable and an Iterable[char], but not an Iterable[str] – which is exactly what it is – we just don’t have a type for a length-1 string. And we don’t really need an actual Iterable[char], because that’s spelled str already.

This can all work OK, because str is one type that is virtually never duck typed – there won’t be any other Iterable[char], so we don’t need a way to spell it.

Is an Iterable type that can only yield one type not compatible with the type system?

I’ve certainly written classes that are iterable, but always yield one and only one type.

First of all, it’s very common to have generic iterables (including containers and sequences). Many of these containers can be accidentally constructed with strings. list(s) where s is a string is an example of Iterable[T] matching string.
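To illustrate the kind of accident meant here, a minimal sketch with a made-up generic helper. Because a str is an iterable of str, T is solved as str and the call type checks today.

```python
from collections.abc import Iterable
from typing import TypeVar

T = TypeVar("T")


def unique(items: Iterable[T]) -> list[T]:
    # Order-preserving de-duplication over any iterable.
    seen: list[T] = []
    for item in items:
        if item not in seen:
            seen.append(item)
    return seen


intended = unique(["spam", "eggs", "spam"])
# Probably a bug: a single string was passed where a collection was meant,
# so the function iterates over its characters instead.
accidental = unique("spam")
chars = list("spam")  # list(s) is the same kind of match
```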

Numpy arrays are not iterables (or containers or sequences). This was discussed lengthily here.

It’s not logical for string to match arbitrary iterables if it won’t match Iterable[str].

I think that would result in the same atomicity problems. If we were to start all over again, I’d make strings non-iterable and expose the characters via a method. Anyway, there’s a discrepancy with the semantics of the container. Every other container checks membership of its elements—but string checks membership of substrings.
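The membership discrepancy is easy to see at the prompt. A short sketch, using only built-in behaviour:

```python
words = ["a", "b", "c"]
text = "abc"

# A list checks whether the operand is one of its elements.
element_in_list = "a" in words
sublist_in_list = ["a", "b"] in words  # a sub-list is not an element

# A str checks whether the operand is a substring.
char_in_text = "a" in text
substring_in_text = "ab" in text

# Iterating a str yields single characters, so membership on the
# iterated elements disagrees with membership on the string itself.
substring_in_chars = "ab" in list(text)
```

So for strings, x in s and x in list(s) can give different answers, which is not the case for other built-in containers such as list or tuple.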

I think you’re misunderstanding the problematic nature of strings matching with generic containers. Your character idea would still have this problem.

AIUI the proposal is actually for Iterable[str] - str, which should be plenty of options.

The general confusion over what’s being proposed suggests that one matter of concern with the proposal is “how would we document and explain it in a way that didn’t confuse users?” :thinking:
