An interesting pytype experiment, and a possible extension to strings

There have been many discussions about the problematic way str matches iterable types. Last year Pytype did an interesting experiment: it rejects matching str against Container[str], Iterable[str], Collection[str], and Sequence[str]. It then recommends that users who want to accept a string as a sequence of characters annotate with str | Sequence[str].

I really like this approach, so I was wondering if we could consider:

  • Changing the definition of strings in the typeshed from
    class str(Sequence[str]):
    
    to
    class str(Container[str], Sized):
    
    (and thereby removing Iterable[str], Collection[str] and Sequence[str] also),
  • removing str.__iter__ in the typeshed, and
  • possibly adding a method to strings like:
    class str:
      def chars(self) -> Sequence[str]:  ...
    
    to provide access to the blocked interfaces.

This would increase type safety at the minor cost of calling chars() or casting explicitly.
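To make the cost concrete, here is a minimal sketch of what calling code might look like. Since str has no chars() method today, the free function chars() below is a hypothetical stand-in for it, and count_vowels is just an illustrative caller.

```python
from typing import Sequence, cast


def chars(s: str) -> Sequence[str]:
    """Hypothetical stand-in for the proposed str.chars() method."""
    # At runtime a str already is a sequence of 1-character strs,
    # so the cast only changes what the type checker sees.
    return cast(Sequence[str], s)


def count_vowels(s: str) -> int:
    # Iteration goes through chars(), so it stays valid even if the
    # checker no longer treats str as Iterable[str].
    return sum(1 for ch in chars(s) if ch in "aeiou")
```

Code that never iterates over a string would not change at all; only the places that do would need the extra call.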

(Edit: Renamed to atomic strings per Steven’s suggestion below. Updated proposal to block iterable, but not container. Updated proposal to remove the unpopular flag. Made chars an optional suggestion. Explicitly remove str.__iter__ to prevent iterable protocol match.)


Is the method necessary? list(s) will still be a sequence of individual characters, unless you’re planning to make far more fundamental changes than just type checking.

Given the strict-strings flag, list(s) would fail type-checking since list requires an iterable.

Ah, gotcha.

I quite like this idea, although I suspect it might break a lot of things. The linked issue says it found “400 potential errors, 30 of which were verified as genuine bugs”. That’s a pretty high rate of false positives, and even if the fix is simple, it’s still a lot of churn. I hadn’t heard of pytype before, so I don’t know how popular it is, but I bet there’s a lot more people who would be affected if mypy (for example) did this.

But in any case, the linked thread seems to be an active discussion in the typing community, so honestly I’m happy enough for the experts over there to evaluate the idea and decide.


7 posts were split to a new topic: Type intersection and negation in type annotations

It’s not really “strict-strings”. It’s more like “atomic strings”: treating strings as atomic, not as “strict”.

If type checkers can accept a flag to reject strings from Collections, Iterables, etc., isn’t that type negation? The only difference is that the set of types being negated is hard-coded to str alone.

So it seems to me that the type checkers would have to implement type negation, but artificially limited to a single type.

Having atomic strings at runtime is a frequently desired feature. Here’s a radical suggestion for people to shoot down:

  • Introduce a new abstract type, “basestr” which implements all the string functionality except for iteration etc.

  • Subclass that as a new builtin, “atomicstr”, with “a-string” syntax: s = a"spam eggs and spam" would be an atomic string.

  • Atomic strings don’t just type-check as atomic but behave as atomic at runtime as well.

  • Regular str inherits from basestr, and adds back in iteration etc.

Disadvantages:

  • There’s a bit of work needed to rearrange the string implementation, but it is mostly refactoring, and a one-off cost.

  • Two new builtin names.

  • One more string prefix.

  • str methods will have a teeny bit more overhead. But that should be negligible.

  • Can’t be ported back to older Pythons.

Advantages:

  • Easy for both static and runtime type checks to specify that you want atomic strings, regular strings, or both.

  • Type checkers don’t have to implement anything special. atomicstr and str are different types.

  • Dealing with strings as atomic non-iterable objects becomes much easier for everybody, not just users of type hints.

  • No need for the meaning of type hints to depend on a non-obvious global flag.

  • Fully backwards compatible.
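As a rough illustration of what “atomic at runtime” could mean, here is a sketch that can be run today. It is not the basestr/atomicstr design above: AtomicStr is a hypothetical name, it subclasses str rather than a new base type, and it only blocks iteration (indexing and substring tests still work).

```python
class AtomicStr(str):
    """Hypothetical str subclass that refuses to be iterated."""

    # Setting a special method to None tells Python the operation is
    # unsupported, so iter(), for loops and list() raise TypeError.
    __iter__ = None


a = AtomicStr("spam eggs and spam")

try:
    for ch in a:
        pass
except TypeError as exc:
    print("not iterable:", exc)

print(a.upper())    # regular str methods still work
print("eggs" in a)  # substring test still works
```

One limitation of this sketch is that methods such as upper() return a plain str, so the atomic behaviour is lost after the first transformation. A real builtin would have to decide what its methods return.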


They don’t have to negate anything. They can just have a simpler definition of str that doesn’t expose the interfaces.

Introduce a new abstract type, “basestr”

I considered something like this, but the advantage of making the change on str is that there is no ugly conversion from str to basestr everywhere, and typically only a very tiny fraction of lines will change.

Does this flag get set by the caller e.g. mypy --atomic-strings file.py or does it get set as a global by the module being checked?

Global per module or global per process?

def spam(s: str, n: int) -> MyClass:
    ...

Which definition of str does this function need? How would the reader find out?

I would set it globally by project (in the MyPy configuration file), the same way that other errors are enabled or disabled.

Per-module would make this feature more work to implement. I’m not sure that’s worth it?

“atomic strings”

I agree, that’s a better name. I’ll edit the proposal up top.


Can we create a new thread for the type negation ideas, and move those comments there? I think it’s an interesting idea that deserves some discussion. (I’d also like to add a concrete case for them). However, they’re fairly unrelated to this idea.

(Flagging my own comment so that the moderators might take a look.)


This was not merely an “experiment”. This has been Pytype’s behavior since mid-2021.

I think adding a flag to type checkers for this is overkill. Just adopt the new behavior. On each of them.

Adding yet another method to the public str API for a typing specific use case is a hard sell. It duplicates an existing API (iteration) that is not going to be deprecated and removed.

It is better for everyone to just fix the incorrect annotations on the very rare existing APIs that actually want to accept a str as an iterable or sequence of single-character strs, so that they declare it explicitly with a str | Sequence[str] style union annotation.
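A minimal sketch of such an annotation, using a made-up function count_chars as the “rare API” that really does want either a string or a sequence of single-character strings:

```python
from __future__ import annotations

from collections.abc import Sequence


def count_chars(chars: str | Sequence[str], target: str) -> int:
    """Count occurrences of a single character.

    The union states explicitly that a bare str is welcome here,
    rather than relying on str matching Sequence[str].
    """
    return sum(1 for ch in chars if ch == target)


print(count_chars("banana", "a"))         # a plain string
print(count_chars(["a", "b", "a"], "a"))  # a list of characters
```

Both calls run identically today; the difference would only show up in a checker that stops matching str against Sequence[str].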

Practicality beats purity in this case. Special case str. We don’t need to go into deep type theory expressions of how to represent the concept in some logically pure form.


There are a number of instances (not very rare!) in our code base where we specifically require a string but as part of handling we iterate over its characters. This change would completely disallow this behaviour (according to typing), requiring an unnecessary refactor.

Even a flag would be very disruptive to my colleagues as this translates to a buried option in PyCharm’s settings that all of us would have to set (or another ignore-comment for each of these locations).

I should say that making the refactor (and casting to Iterable) is not that much effort, and if this went through I would be happy to make those changes.

I understand. We have very different usage patterns. I rarely use string’s sequence interface. And, I consider refactoring to be a pretty minor annoyance. For example, MyPy 1.0 just added support for typing.Self, so I immediately converted all of my TypeVars to Self. The code is simpler and easier to read. Similarly, when I drop Python 3.8, I’ll run this convenient script to refactor automatically. Someone could do something similar for most strings based on inferred types. It might take some of the sting out of refactoring (yes, there are still code reviews, rebasing pull requests, churn, etc.).

Even a flag would be very disruptive to my colleagues as this translates to a buried option

Fair enough. There are benefits to those of us who value stricter type annotations, but I see your point.

Another benefit is it removes a wart:

['a', 'b', 'c'] in ['x', 'a', 'b', 'c']  # False
'abc' in 'xabc'  # True: string is unlike other containers.

With the chars property, x in s.chars would return true iff some s.chars[i] == x (like other sequences) whereas x in s would keep the current meaning.
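The difference can be seen today by using list(s) as a stand-in for the proposed s.chars (an assumption made only for this sketch, since no such property exists):

```python
s = "xabc"
s_chars = list(s)  # stand-in for the proposed s.chars

substring_match = "abc" in s      # substring test on the string itself
element_match = "abc" in s_chars  # element test, like other sequences
single_char = "a" in s_chars      # a single character is an element

print(substring_match, element_match, single_char)
```

The first test is True because str.__contains__ checks for a substring, while the second is False because no single element of the character sequence equals "abc".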

Adding yet another method to the public str API for a typing specific use case is a hard sell. It duplicates an existing API (iteration) that is not going to be deprecated and removed.

Well, you caught me: that was my secret long term dream, yes. I think I’m often motivated by the ideal language I’d like to see Python become irrespective of transition pain.

Practicality beats purity in this case. Special case str . We don’t need to go into deep type theory expressions of how to represent the concept in some logically pure form.

You may be right. I worry about being too cavalier with type annotations. The same argument created def f(*args: T) instead of def f(*args: tuple[T, ...]), which many people consider to be a design error in retrospect.

If you only do what Pytype is doing (block str < Sequence[str] only, but not str < Sequence[T] as well) then you may get some very weird cases when strings are used in overloads, generic classes, etc. Making this “practical choice”, I think, would require very careful consideration of consequences.

On the other hand, moving the sequence interface on string to a property would be totally safe, but as you and Laurie point out, could be a very annoying transition.


I split off the discussion of type intersection and negation to its own topic.

In terms of special-casing str to not include its sequence nature from the type checker’s perspective, I am very interested in this idea. I found the “string is an iterable of strings” wart when working on PEP 484 and it seemed to me at the time that type intersections and negations might be a solution there. The idea to include those concepts was shot down then as it would broaden the type algebra beyond what the proponents of PEP 484 were ready to implement at that time. I accepted strings being iterable as an inevitable part of Python as a change in the default str behavior would bring “Python 4”-style backward compatibility breakage. Making this a type checker-only feature makes perfect sense to me.

In my time at Facebook, I observed this being one of the cases where Python’s type system as currently defined cannot help catch obvious programming errors. Those errors aren’t as common as missing None checks, and aren’t as tricky to debug compared to some other classes of bugs. I mean, when this happens, a well-placed unit test will discover the problem very quickly. Even just running the code rarely succeeds with this kind of bug and the data mismatch is curious and unique enough that with some experience it becomes easy to spot what went wrong.

This wart in particular contributed to type-annotated code leaning into concrete collection types. You don’t say Iterable[str] even if you only iterate. You say list[str] because it’s simpler to type, doesn’t require an import, and works around the “strings are iterables of strings” issue altogether. This is sometimes wasteful in terms of both efficiency and flexibility, but it turned out to be a good enough workaround for me to drop pursuing this.

Now, having a type checker option to exclude the Sequence / Iterable / Collection nature from strings, that sounds like a workable solution! Especially since it’s all static analysis, so it still behaves the same at runtime. Then all it needs to recover the excluded functionality is a cast() to inform the type checker that iteration is actually explicitly needed.

I’d say it’s worth trying it out in mypy too, as passing a single string where a collection of them was expected does occasionally happen and is a time waster for everybody involved. It is disappointing that the type checker is unable to spot the error in this case. I would use this mode of the type checker if it were available, and I would advertise for everybody to use it.

I am less excited about str.chars and ideas to transition to str excluding iteration, indexing, etc. Using data from the experiment, 30 bugs caught in 400 cases is barely above noise level so it suggests a change like that would be mostly churn.

Finally, the config switch being global per invocation works in a mono repo environment where all code is game for modification if needed. In the open-source world where a good chunk of your code is third-party libraries, this will be somewhat more tricky because some code will always emit the wrong kind of string or accept the wrong kind of string. Casting every time would certainly be possible but some casts would be pretty ugly when what you’re passing isn’t a string but (ironically) a collection of them like a dictionary or list, and so on.


I’m curious. If this were to be done, then consider the following function (yes, it’s a made up toy example).

def uppercase_and_vowel_count(s: str) -> tuple[str, int]:
    vowels = sum(1 for ch in s if ch in "aeiou")
    return s.upper(), vowels

How would I annotate that so it would be valid, and a call uppercase_and_vowel_count("abc") would typecheck?

I don’t consider that function to be in any way unreasonable, so I do not want to have to change it - I’d only be willing to alter the type annotations.

OP suggests:

def uppercase_and_vowel_count(s: str | Sequence[str]) -> tuple[str, int]:
    vowels = sum(1 for ch in s if ch in "aeiou")
    return s.upper(), vowels

That’s workable and certainly an option. I personally think we can do better because str | Sequence[str] explicitly allows for any kind of sequence of strings which might be overly permissive if all we want is to index or iterate over a single string.

An explicit typing.SequenceString would solve that. There is precedent of “special strings” with LiteralString.

Any automatic type specialisation will fail with eg “Sequence has no attribute upper”. I think casting is required:

from typing import Iterable, cast

def count_vowels(s: Iterable[str]) -> int:
    return sum(1 for ch in s if ch in "aeiou")

def uppercase_and_vowel_count(s: str) -> tuple[str, int]:
    # cast() takes the target type first, then the value.
    vowels = count_vowels(cast(Iterable[str], s))
    return s.upper(), vowels

And that’s precisely what I disagree with. Casting to tell the typing system about runtime shenanigans that can’t be inferred statically is one thing, but having to add a runtime cast (and apparently refactor my code to add a helper function) isn’t a reasonable requirement to deal with the typing system not accepting perfectly legitimate code.

I’m fine with a way to catch unintentional use of str as an iterable. I’m even OK with the str type being repurposed as a “non-iterable string” (although I think people are being remarkably casual about backward compatibility here - typing has been around for long enough now that we should be more respectful of people’s existing codebases IMO). But I don’t think it’s remotely acceptable to leave developers with no way to express the type “what str was before it got changed”. Breaking legitimate uses of str as a type is one thing, breaking them with no workaround in the typing system is entirely different.


Forgive my ignorance here – I’m confused:

It seems this function is designed to work with strings (certainly iterable of char that has an upper() method) – and it’s typed as str, so according to the Type Checker, passing a str would be OK, but passing, e.g. a list of strings would not. Isn’t that what you want?

If type checkers are set to pretend that str is not an iterable, then this would still work, even though you are iterating through the string.

What am I missing?

I really like Python’s dynamic nature, and am not particularly excited about type checking, but there are two type bugs that have bitten me a LOT:

  1. the whole integer division thing – and THAT’s been solved for years with Py3’s “real division”

  2. A str is an iterable of strs – the topic at hand. So I find it ironic that the ONE type issue I would like some help with has not been helped by type checking up to today – that’s why I’m engaging with this thread. Anyway, the example of a function designed to work with str is not that problem. (As it happens, there are essentially no iterables of a char other than str – and virtually no duck typing of strings, either, so if you want a str, specify a str.)

As we all know, the problem is a function that requires an iterable of strings, and people can accidentally pass in a single string:

def read_files(filenames: Iterable[str]):
    for filename in filenames:
        process_a_file(filename)
        ...

I’m sure we’ve all seen errors like “Filename ‘r’ not found” – which, as @ambv points out, is obvious and common enough that those of us who write Python immediately recognize it. But my users, who write scripts with my library but are not experienced programmers, get very confused!

In practice, I tend to runtime type check (and fix) this with this code:

def read_files(filenames: Iterable[str]):
    if isinstance(filenames, str):
        filenames = [filenames]
    for filename in filenames:
        process_a_file(filename)
        ...

But a lot of folks don’t like such a flexible (kludgy) API 🙂

And worse, I’ve seen this code (not sure why)

if not isinstance(filenames, list):
    filenames = [filenames]

which passes all unit tests because no one thought to test a non-list iterable 🙂

I like some kind of negation – “every iterable of strings except str” – but as practicality beats purity, simply telling the type checker not to consider a str to be an Iterable[str] seems totally fine to me. And the fact is that there are a lot of Iterables, and a lot of sequences, but only one str – so special-casing str makes some sense.

Another thought: perhaps explicit is better than implicit – wouldn’t a new type:

IterableThatsNotAString[str] [*]

address the issue as well? Though I don’t know there’s any way to define that today.

Sure, it wouldn’t let you do Sequence[str] safety – but is that all that common? If so, then SequenceThatsNotAString.

[*]horrible name, but maybe someone can come up with something better