An interesting pytype experiment, and a possible extension to strings

Fair :slight_smile: Though I suspect that most people won’t concern themselves with the full details of type subtraction, and will just use a predefined type alias that means “an iterable of strings but not a single string”.

Agreed. But that wasn’t the original proposal, which said

Maybe the proposal has changed somewhere along the line, but I don’t recall anyone explicitly saying that the new proposal was simply to add “a predefined type alias that means “an iterable of strings but not a single string”” somewhere…

The proposed typeshed change would remove __getitem__ from str methods, implying s[0] becomes illegal to type checkers (but s.startswith('x') remains legal).

That is inevitable because __getitem__ makes something iterable (though not an instance of Iterable), but it seems to be a high price. It is however consistent with the suggested name AtomicString, i.e. indivisible.

(Clarified “to type checkers”)

Surely blocking s[0] is unacceptable? No-one is seriously proposing that indexing strings should become a type error, are they? :astonished:

Also clarifying: I understand this is only about type checkers, but I still can’t believe people are seriously suggesting that type checkers should error on indexing a string…


Should have been clearer and added “illegal to type checkers”, will edit

But even then, indeed, that’s what I meant by “high price”

(to better my own understanding, I made a Minimal proof of concept for atomic and iterable string types · GitHub)

I don’t think it would. Why do you say that?

I don’t think type checkers use the sequence protocol to discover iterability. E.g., MyPy 1.0 doesn’t.

Right. No one is proposing that.

You’re right, this is not a part of this suggestion.

My thinking:

  • an object with __getitem__ using integer indices and __len__ is a sequence
  • a sequence object is iterable even without __iter__ method
  • however, without __iter__ such an iterable (lowercase ‘i’) object is not an Iterable (with capital ‘I’), as Iterable checks for method __iter__ (and not for __getitem__)
  • confusingly, because in the collections.abc type hierarchy Sequence (with capital ‘S’) indirectly subclasses Iterable, such a sequence object (lowercase ‘s’) also is not a Sequence
  • consequently, to prevent an object from being iterable (lowercase ‘i’), it cannot have __getitem__
  • and so I assumed (apparently incorrectly, sorry) that removal of __getitem__ was proposed for NonIterableString
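The runtime side of the points above can be checked with a minimal sketch. The class name Letters is hypothetical and exists only for illustration; it defines __getitem__ and __len__ but no __iter__.

```python
from collections.abc import Iterable, Sequence


class Letters:
    """Hypothetical class: only __getitem__ and __len__, no __iter__."""

    def __init__(self, data):
        self._data = list(data)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        # Raises IndexError past the end, which ends legacy iteration.
        return self._data[index]


letters = Letters("abc")

# Iteration falls back to calling __getitem__ with 0, 1, 2, ...
items = list(letters)

# Iterable only looks for __iter__, so the check fails.
is_iterable = isinstance(letters, Iterable)

# Sequence needs inheritance or registration, so this fails too.
is_sequence = isinstance(letters, Sequence)

print(items, is_iterable, is_sequence)
```

So the object is iterable (lowercase ‘i’) at runtime while both ABC checks report False.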

Very true. Strangely, MyPy recognises that the sequence protocol suffices for iter().

That leads to the counterintuitive situation that a NonIterableString object with a __getitem__ method:

  • can be iterated over at runtime
  • is flagged by the type checker when iterated over
  • but is acceptable to the type checker as argument to iter(), after which iteration over the resulting iterator is also okay
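One way to picture why iter() is accepted: my understanding is that the stubs for iter() include an overload that takes anything with an integer __getitem__, but treat that as an assumption rather than a quote from typeshed. The sketch below uses hypothetical names (SupportsIntGetItem, iter_by_index) and mimics that shape.

```python
from typing import Iterator, Protocol, TypeVar

T = TypeVar("T")
T_co = TypeVar("T_co", covariant=True)


class SupportsIntGetItem(Protocol[T_co]):
    """Hypothetical protocol: anything indexable by int."""

    def __getitem__(self, index: int, /) -> T_co: ...


def iter_by_index(obj: SupportsIntGetItem[T]) -> Iterator[T]:
    """Hypothetical stand-in for iter() using the legacy sequence protocol."""
    index = 0
    while True:
        try:
            yield obj[index]
        except IndexError:
            return  # running off the end terminates the iteration
        index += 1


chars = list(iter_by_index("abc"))
print(chars)
```

A parameter typed like this matches on __getitem__ alone, so dropping __iter__ from a string type would not stop it from being accepted here.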

I’m glad that such is not the intent, and fully agree.

But I’d like to understand how the outlined conundrum would be resolved then.
Is the intent to use iter() to revert a NonIterableString into an Iterable (with capital ‘I’), in other words with an @overload like def iter(s: NonIterableString, /) -> type(iter(IterableString())): ...?
Please explain.

Thanks for outlining your thinking. I see your point now. Yes, even with the proposed change, type-checkers would still not complain about iteration over strings wrapped by iter. I guess that’s an artifact of type-checkers choosing to partially support the sequence protocol. Thanks for pointing this out! I didn’t realize this would be the case.

(For the record, I’ve always preferred the sequence ABC to the sequence protocol.)

I disagree that the third one is ever useful. It effectively means “It’s a string of some sort, but I don’t know which, and have to search for some obscure flag or environment variable, in some non-obvious place.”

There are surely places where one will want to accept either an iterable string or an atomic string or both, but there are surely no places where one actively wants to make str ambiguous as to which it is.


Huh, you must have hated Python 1.5 and friends, when the sequence protocol was literally the only way to get iteration.

One good thing about the sequence protocol days was that we never had those interminable arguments about giving iterators a length :slight_smile:

Going back to the original post:

I have a genuine question here, the answer to which would at least clarify my thoughts on all this – I feel like we’ve been talking past each other a bit.

(I know that is totally redundant to anyone paying attention to the conversation, but I’m trying to clarify it all for myself, so I’m being wordy…)

I am well aware of the problem with code expecting an Iterable of strings, e.g. a list of strings, and when folks pass a single string in, the code works fine, but you get a bunch of one-character strings which is usually not what’s expected. This has been an issue for years, well before anyone was trying to do static type checking. And it sure seems like the type of error that a static type checker could catch.
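A minimal illustration of that failure mode, using a hypothetical function greet_all:

```python
def greet_all(names):
    """Hypothetical function that expects an iterable of names."""
    return [f"Hello, {name}!" for name in names]


intended = greet_all(["Chris", "Bob"])

# A lone str is silently iterated character by character.
accidental = greet_all("Bob")

print(intended)
print(accidental)
```

Nothing raises in the second call; the output is just three greetings to single characters.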

So: Problem 1: how can you tell the type checker to fail when a str is passed when an Iterable[str] is expected?

Certainly one way to do that is to tell the type checker that a str is not an Iterable at all.

However, str is an iterable – so then the type checker will get confused / complain when it is used that way. Which is why I don’t think this is a good solution.

Now, finally, the question:

Do folks think it’s important that str not match with Iterable in all other (or most other) contexts, other than Iterable[str]?

That is, is it a goal to not pass str into most of itertools and the like without machinations or type errors?

Personally I think that would just make a lot of things more awkward – maybe not now, but certainly when type checking becomes more ubiquitous.

TL;DR: breaking (OK, altering) all use of str as an Iterable to solve the Iterable[str] problem seems like overkill. But maybe I’ve simply missed the goals here: if the actual goal is to make type checkers think str is not iterable, then, well, that’s a way to do it, obviously.

There are plenty of other iterable contexts in which string is inappropriate. Most of NetworkX’s interface accepts iterators and then goes to great lengths to try to treat strings as atomic.
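As a rough idea of what “treating strings as atomic” looks like at runtime, here is a minimal sketch. The helper as_items is hypothetical and is not taken from NetworkX.

```python
from collections.abc import Iterable


def as_items(obj):
    """Hypothetical helper: treat a lone string as one item, not as characters."""
    if isinstance(obj, (str, bytes)):
        return [obj]  # strings are wrapped, never iterated
    if isinstance(obj, Iterable):
        return list(obj)
    return [obj]  # any other scalar becomes a one-element list


print(as_items("abc"))
print(as_items(["a", "b"]))
print(as_items(5))
```

Every function that accepts “one thing or many things” needs a guard like this, which is the boilerplate a type-level distinction would aim to avoid.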

As has come up multiple times, the reality is that there is no way to block problematic uses without also blocking intentional ones. In my opinion, the tradeoff should be based on the benefit of blocking compared with the awkwardness induced on correct code.

Maybe the best way forward is to get the MyPy Primer to run an experiment just like the Pytype people did. I only proposed this because the Pytype people had good results.

Thanks – I remember that thread – it started in 2012, and was originally about ABCs, not static type checking – and I think the conclusion was that np.ndarray should not inherit from Sequence – which it does not now, 10 years later:

In [20]: arr
Out[20]: array([0., 0., 0., 0., 0.])

In [21]: isinstance(arr, collections.abc.Sequence)
Out[21]: False

And this was because there are a couple of ways in which a numpy array isn’t quite a Sequence (can’t remember what, but it’s definitely a mutable sequence, but not a MutableSequence :frowning:) – this shows the power of Python’s duck typing and the weakness of the ABC system.

Anyway, numpy arrays are very much iterables – unlike strings, they are used as iterables, very, very often. And indeed:

In [22]: isinstance(arr, collections.abc.Iterable)
Out[22]: True

Which is good, in my mind.

I absolutely am – that’s what I mean by “talking past each other” in my last post.

But I still don’t get it – yes, a_str in another_str means something somewhat different than a_value in a_generic_container – I DO get that – what I don’t get is why you want the type checker to catch an error like that, if indeed it is an error – is that an error that is likely to occur? no idea. But I do know that a_str in another_str is a VERY common idiom.
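To make the difference concrete, a small sketch comparing the two meanings of in:

```python
needle = "ab"

# On a str, `in` is a substring test.
in_string = needle in "cab"

# On a generic container, `in` is an element test against ['c', 'a', 'b'].
in_container = needle in list("cab")

print(in_string, in_container)
```

The same operands give different answers depending on whether the right-hand side is a string or a list of its characters.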

I can see why folks might think it’s OK to mess with the iterability of strings, but messing with ‘in’ would be very disruptive.

NOTE: maybe a Char type would help with the in problem too? (Not thought out, that would probably require a realized type and a lot of code changes – I’m not proposing it)

I looked back on some of the threads originally pointed to in the OP – and dug a little deeper. Turns out ideas similar to mine about making str an Iterable[Char] have been proposed in the past (I didn’t think it was that unique) – and as far as I can tell, fizzled out without resolution, e.g.:

https://mail.python.org/archives/list/typing-sig@python.org/thread/ENTSMRILZN5YERQFSTWJXLDGX7KGH5DG/

Maybe it was rejected in another thread I haven’t found. If so, it would be good to know why.

It’s not a mutable sequence. For one, it doesn’t support __delitem__. It’s not even a sequence since it doesn’t expose index or count. In fact, the new array API does not expose __len__ or __iter__ either.
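For reference, this is what the ABCs themselves ask for. The sketch only inspects collections.abc and does not import NumPy.

```python
from collections.abc import MutableSequence, Sequence

# Methods a subclass must implement itself.
sequence_abstract = set(Sequence.__abstractmethods__)
mutable_abstract = set(MutableSequence.__abstractmethods__)

# Methods Sequence supplies as mixins, which callers may therefore rely on.
sequence_mixins = [
    name for name in ("index", "count", "__iter__", "__contains__")
    if hasattr(Sequence, name)
]

print(sorted(sequence_abstract))
print(sorted(mutable_abstract))
print(sequence_mixins)
```

Anything claiming to be a Sequence is expected to offer index and count, and anything claiming to be a MutableSequence is expected to offer __delitem__ and insert.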

I also was tempted to ask for inheritance from Sequence. Then I read the thread and realized that the problem is naive user expectations. The ABC system is fine. NumPy arrays are not sequences, nor should they be. This is not the “weakness of the ABC system” at all.

You may be interested to know that they are not iterable in the new array API:

import numpy.array_api as xp
from collections.abc import Iterable
x = xp.asarray([1, 2])
isinstance(x, Iterable)  # False

Honestly, I made a proposal similar to this thread’s for arrays years ago.

I understand the convenience of passing things around, but there are many issues with numpy arrays being sequences, and I think if you’re curious, you should read through the thread for convincing arguments.

and yet you can still iterate them:

In [28]: x
Out[28]: Array([1, 2], dtype=int64)

In [29]: for i in x:
    ...:     print(i)
    ...: 
1
2

Statically typing numpy arrays is a very hard problem, but I hate to see more disconnect between how things are typed and how they can be used. I never cared about the ABC issue because I never used it – I never use isinstance(an_abc) at all.

I won’t say any more, because I haven’t been paying much attention, maybe I should start.

Yes because of the sequence protocol. To block that, you’d have to make __getitem__ raise something other than IndexError.
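A minimal sketch of that mechanism, using a hypothetical class Blocked whose __getitem__ raises KeyError instead of IndexError:

```python
class Blocked:
    """Hypothetical container whose __getitem__ never raises IndexError."""

    def __init__(self, data):
        self._data = list(data)

    def __getitem__(self, index):
        try:
            return self._data[index]
        except IndexError:
            # Legacy iteration only stops cleanly on IndexError,
            # so any other exception escapes the for loop.
            raise KeyError(index) from None


blocked = Blocked([1, 2])
seen = []
error = None
try:
    for item in blocked:
        seen.append(item)
except KeyError as exc:
    error = exc

print(seen, repr(error))
```

In this sketch the loop still sees the existing items and only fails once it indexes past the end, because the legacy protocol has no other end-of-iteration signal.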

That sounds great to me – however, to be clear (and it was not clear to me until I dug into the old notes and pytype’s docs):

The Pytype people had good results with making str not satisfy Iterable[str] – it does not make str no longer an Iterable.

I for one, am +1 on that.

I think you are proposing something far more disruptive.

For those interested that may not have followed the details:

Here is the pytype FAQ entry:

And here is a little experimental code:

from collections.abc import Iterable


def it_of_str(names: Iterable[str]):
    for i, name in enumerate(names):
        print(f"name {i}: {name}")

def generic_it(an_iterable: Iterable):
    it = iter(an_iterable)
    while True:
        try:
            print(f"item: {next(it)}")
        except StopIteration:
            print("All Done!")
            return

print("Doing it right:")
it_of_str(["Chris", "Bob", "Nancy"])

print("\nOopsie:")
it_of_str("Chris")

print("\nGeneric Iterable:")
generic_it(range(5))

print("\nPassing str to generic Iterable")
generic_it("Chris")

And this is what pytype does with this code:

 % pytype pytype_str.py 
Computing dependencies
Analyzing 1 sources with 0 local dependencies
ninja: Entering directory `.pytype'
[1/1] check pytype_str
FAILED: /Users/chris/PythonStuff/pytype/.pytype/pyi/pytype_str.pyi 
/Users/chris/miniconda3/envs/pytype/bin/python -m pytype.single --imports_info /Users/chris/PythonStuff/pytype/.pytype/imports/pytype_str.imports --module-name pytype_str --platform darwin -V 3.10 -o /Users/chris/PythonStuff/pytype/.pytype/pyi/pytype_str.pyi --analyze-annotated --nofail --quick /Users/chris/PythonStuff/pytype/pytype_str.py
File "/Users/chris/PythonStuff/pytype/pytype_str.py", line 22, in <module>: Function it_of_str was called with the wrong arguments [wrong-arg-types]
         Expected: (names: Iterable[str])
  Actually passed: (names: str)
  Note: str does not match iterables by default. Learn more: https://github.com/google/pytype/blob/main/docs/faq.md#why-doesnt-str-match-against-string-iterables

It only complains about using str for an Iterable[str] – which would allay most of the concerns I’ve seen in this thread.


Have you read this thread? This is discussed way earlier.

I have read this thread, and honestly I’m unsure of what the proposal is any more. Rather than just express your frustration (and I can understand if you are frustrated) could you:

  1. Confirm what the current proposal is
  2. If it isn’t “Make str not satisfy Iterable[str]”, clarify why that isn’t sufficient, and how the Pytype results relate to what is actually being proposed.

Oh, sorry, I’m not frustrated. I just thought we shouldn’t make the thread longer by rehashing things that are already in it.

  1. If it isn’t “Make str not satisfy Iterable[str]”, clarify why that isn’t sufficient, and how the Pytype results relate to what is actually being proposed.

I’m happy to do that for you. I’ll just copy and paste to save time.

It may be possible to try the Pytype approach with the MyPy primer, but it may be more difficult.
