Add a character subclass of str

Python doesn’t have a builtin type for 1-character strings (chars). This makes it impossible to express that constraint in a function signature, for example:

def ord(char: str, /) -> int:
    if len(char) != 1:
        raise TypeError(f"ord() expected a character, but string of length {len(char)} found")
    ...

Example implementation in Python (but should be implemented in C):

from __future__ import annotations

import builtins
from collections.abc import Iterator, Sequence
from typing import Literal, SupportsIndex, overload


def chr(i: int, /) -> CharType:
    return CharType(builtins.chr(i))


class String(str, Sequence['CharType']):
    @overload
    def __getitem__(self, key: SupportsIndex, /) -> CharType: ...
    @overload
    def __getitem__(self, key: slice, /) -> String: ...

    def __getitem__(self, key: SupportsIndex | slice, /) -> String:
        # Slicing keeps the String type; indexing a single position yields a CharType.
        if isinstance(key, slice):
            return String(super().__getitem__(key))

        return CharType(super().__getitem__(key))

    def __iter__(self) -> Iterator[CharType]:
        # Iteration yields CharType items instead of plain length-1 strings.
        return map(CharType, super().__iter__())


class CharType(String):
    def __new__(cls, object: str) -> CharType:
        # Enforce the single-character invariant at construction time.
        if len(object) != 1:
            raise TypeError(f"CharType() expected a character, but string of length {len(object)} found")

        return super().__new__(cls, object)

    def capitalize(self) -> CharType:
        return CharType(super().capitalize())

    def lower(self) -> CharType:
        return CharType(super().lower())

    def swapcase(self) -> CharType:
        return CharType(super().swapcase())

    def title(self) -> CharType:
        return CharType(super().title())

    def upper(self) -> CharType:
        return CharType(super().upper())

    def __len__(self) -> Literal[1]:
        return 1

    def __repr__(self) -> str:
        return f'c{super().__repr__()}'

Note: this implementation needs to be backwards compatible with the old behaviour, so CharType must remain iterable.
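
For illustration, this is roughly how the sketch above behaves at runtime (the variable names here are mine, not part of the proposal):

s = String("hello")
ch = s[0]

print(repr(ch))                    # c'h'   (indexing yields a CharType)
print(isinstance(s[1:4], String))  # True   (slicing stays a String)
print(list(ch))                    # [c'h'] (CharType is still iterable, so it stays backwards compatible)
print(ord(ch))                     # 104    (a CharType works anywhere a str is expected)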

Characters could be constructed like this, detecting typos at compile time:

char: CharType = c"\u20AC"
char = c"u20AC" # SyntaxError

This can already be done using NewType, with no changes to the standard library necessary. I’ve played around with this idea in the past, but it’s really, really not worth the additional hassle it creates.
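
A minimal sketch of that NewType approach (the names Char and as_char are illustrative, not from this thread):

from typing import NewType

Char = NewType("Char", str)

def as_char(s: str, /) -> Char:
    # NewType adds no runtime checks, so validation still has to happen somewhere.
    if len(s) != 1:
        raise ValueError(f"expected a single character, got a string of length {len(s)}")
    return Char(s)

def ord_char(char: Char, /) -> int:
    return ord(char)

ord_char(as_char("€"))   # OK
# ord_char("foo")        # type checker error: expected "Char", got "str"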

For one, since str is a Sequence[CharType], it is also a Sequence[str], so you’re right back where you started and haven’t actually solved anything. The only place where this is marginally useful would be functions that expect a single character as input, but it’s not really worth the price of creating a whole bunch of new false positives with invariant generics.


A type checker could see str as an alias for str | CharType, and give a warning when you try to iterate over CharType:

from typing import reveal_type

Str = str | CharType


def foo(string: Str):
    reveal_type(string)  # Revealed type is "Union[builtins.str, test.CharType]"
    list(string)  # Warning: "string" could be "CharType"
    if isinstance(string, CharType):
        reveal_type(string)  # Revealed type is "test.CharType"
        list(string)  # Warning: "string" is "CharType"
    else:
        reveal_type(string)  # Revealed type is "builtins.str"
        list(string)  # OK

This has been brought up before, but usually from the other direction, which is much more likely to be harmful (where a user passes a single string and the API treats it as multiple strings, when the user should have passed a single-element Sequence[str]).

The current “best” way to handle that case is with overloads that treat str and Sequence[str] differently.
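
A rough sketch of that overload pattern (the names are illustrative): a single string is treated as one item rather than as a sequence of its characters.

from collections.abc import Sequence
from typing import overload

@overload
def as_items(value: str) -> list[str]: ...
@overload
def as_items(value: Sequence[str]) -> list[str]: ...

def as_items(value: str | Sequence[str]) -> list[str]:
    if isinstance(value, str):
        return [value]      # "abc"        -> ["abc"]
    return list(value)      # ["a", "bc"]  -> ["a", "bc"]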

For your case however, you can’t remove capabilities in a subclass (this breaks the Liskov Substitution Principle, which is fundamental to nominal subtyping), so to have a char type that doesn’t support iteration, it can’t be a subclass of str, and NewType also doesn’t work. You could create your own wrapper class if you need to ensure users are only passing single characters, but this sounds like hostile API design that is better solved by accepting a string, validating that it is a single character in the places where this matters, raising documented errors where it is not, and simply not iterating over it yourself.
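
A minimal sketch of such a wrapper class (the name Char is mine, not part of this thread):

from dataclasses import dataclass

@dataclass(frozen=True)
class Char:
    value: str

    def __post_init__(self) -> None:
        # Enforce the invariant eagerly so downstream code can rely on it.
        if len(self.value) != 1:
            raise ValueError(f"expected a single character, got a string of length {len(self.value)}")

Char("a").value   # 'a'
# Char("ab")      # ValueError
# Char deliberately does not subclass str and defines no __iter__,
# so it cannot be iterated over like a str.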


This is a great feature of Python. What problems does it cause beyond the known typing limitation?

That’s not what I’m suggesting here.

What I’m suggesting is that type checkers would give a warning when you try to iterate over a str without first checking whether it is a CharType. At runtime, no error would be raised.

You can get into infinite recursion, because when you iterate over a str, you get a str, which you can iterate over…

I read what was said, and that still violates LSP, and is worse besides in ways that I didn’t bother getting into (further divergence between runtime and type checker behavior goes in the opposite direction from the goal), as the LSP violation alone should have been reason enough not to do it. LSP is (mostly) a concern for static analysis and for reasoning about whether one type can be substituted for another. Python’s interpreter doesn’t care if you violate LSP, and there are a few cases of this that type checkers already have to special-case (__hash__ being settable to None); adding more of these would not be a positive thing, especially ones that don’t come from strong runtime reasons grounded in the data model.

I think I’d file this under ‘pub trivia’ rather than ‘problems’; it probably never comes up unless you want to recursively flatten a container, in which case you can check if the container is a string.
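
A minimal sketch of that check in a recursive flatten (purely illustrative):

from collections.abc import Iterable, Iterator
from typing import Any

def flatten(obj: Any) -> Iterator[Any]:
    # Treat strings (and bytes) as leaves: iterating a str yields more
    # strings, which would otherwise recurse forever.
    if isinstance(obj, Iterable) and not isinstance(obj, (str, bytes)):
        for item in obj:
            yield from flatten(item)
    else:
        yield obj

list(flatten([["ab", "c"], ("d",)]))   # ['ab', 'c', 'd']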

OK, let’s not break LSP then. Here are some advantages:

  • Could be more memory efficient: 3 bytes for the Unicode code point & up to 4 bytes for the UTF-8 encoding
  • Could be faster to manipulate because the length is fixed
  • Could protect against typos in escape sequences:
    char: CharType = c"u20AC" # SyntaxError
    
  • Could indicate a function only accepts 1 length strings:
    def ord(char: CharType, /) -> int: ...
    
    ord("foo") # Argument 1 to "ord" has incompatible type "str"; expected "CharType"
    
  • Could detect accidental modification of 1 length strings:
    char: CharType = c'f'
    char += c'0' * 2 # Incompatible types in assignment (expression has type "str", variable has type "CharType")
    

How could this be done if CharType is a subtype of str? This implies ABI subtyping as well.

My bad. Then only the other points.

That could only happen if adding a non-empty string to a char gives a different type, but adding an empty string gives back CharType again. That seems like a recipe for complexity.

Is this something that you actually see happen? That someone intends to write an escape sequence, but forgets the backslash? Even if it does happen, this would ONLY help in the specific case where that’s in a single-character string.

You’re asking for a LOT here. A string prefix that creates a subclass of str, which changes type as soon as it gets made longer? That’s pretty complicated, and for what benefit - detecting multi-character strings being passed to ord()? Is it really worth all that?

Returning a CharType only happens when the function always returns a CharType. See the example implementation. Length-1 strings will still exist.

It has happened to me, but I agree it won’t happen very often.

Mind you, that’s exactly how booleans behave. I don’t see how this is any different:

>>> True * 1
1

So adding a CharType to a str always returns a str? That’s also weird, just in different ways.

But, again, it’s not something that happens very often, and a HUGE amount of complexity to make it happen.

A more radical solution to this kind of problem is dependent types. The type system could be extended to allow str[len], such that str[1] would mean a string of length 1 and str[1…10] would mean a string of length between 1 and 10. This has to be approached very cautiously, as it can make the type system undecidable.
See also Can Python implement dependent types? - Stack Overflow
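
Not dependent types, and not something current type checkers enforce, but as a rough present-day approximation the length constraint can at least be recorded as Annotated metadata for a runtime validator to act on (the aliases below are illustrative):

from typing import Annotated

Char = Annotated[str, "len == 1"]
ShortStr = Annotated[str, "1 <= len <= 10"]

def ord(char: Char, /) -> int:
    # Type checkers treat Char as plain str; the metadata only documents
    # the intended constraint unless a validation library enforces it.
    ...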
