Exotic kinds of string annotations

The type system allows enclosing annotations in strings in cases where the annotation would not evaluate successfully at runtime (e.g., when the annotation contains a forward reference).

But Python has many kinds of strings. How should the following program be type checked?

def invalid_strings(
    raw: r"int",
    fstring: f"int",
    bytestring: b"int",
    uprefix: u"int",
    implicit_concat: "in" "t",
    unicode_name: "\N{LATIN SMALL LETTER I}nt",
    backslash_escape: "\x69nt",
):
    reveal_type(raw)
    reveal_type(fstring)
    reveal_type(bytestring)
    reveal_type(uprefix)
    reveal_type(implicit_concat)
    reveal_type(unicode_name)
    reveal_type(backslash_escape)

My thoughts:

  • f-strings are too dynamic and should not be allowed in annotations. All type checkers should reject them. Pytype currently allows them, at least in this simple example.
  • bytestrings should be rejected because bytestrings are conceptually a different type than text strings, and type annotations are text, not binary data. Pyright and pyre currently allow them.
  • unicode strings with the u prefix should be accepted because in Python 3 they are completely equivalent to non-prefixed strings. All type checkers already allow them.
  • raw strings may be allowed. I can envision use cases involving Literal where a raw string may be useful. All type checkers currently allow them. However, use cases are very limited and they may be difficult to support correctly, especially for type checkers written in languages other than Python. Support for raw string annotations can be made optional.
  • \N and \x escapes in strings are similar: they could conceivably be useful in some cases involving literals, but they may also make life harder for type checker maintainers for little gain. Support should be optional.
  • implicit concatenation may also be allowed as it can be useful with very long annotations. The spec already explicitly allows triple-quoted strings for this case, but Python the language provides multiple ways to build up long string literals, and I don’t see a strong reason why the type system should be more restrictive. Pyright currently rejects this case; Eric Traut brought up concerns about character ranges for error messages. I am OK with making support for this feature optional.
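For anyone who wants to see how these forms differ at the syntax level, here is a minimal sketch using the standard ast module. The helper name describe is made up for illustration, and this says nothing about how any particular type checker handles these cases; it only shows what Python's own parser produces for each annotation.

```python
import ast

# The program from the top of the thread, kept as raw text so the
# backslash escapes reach the parser unchanged.
SOURCE = r'''
def invalid_strings(
    raw: r"int",
    fstring: f"int",
    bytestring: b"int",
    uprefix: u"int",
    implicit_concat: "in" "t",
    unicode_name: "\N{LATIN SMALL LETTER I}nt",
    backslash_escape: "\x69nt",
):
    pass
'''


def describe(node):
    """Hypothetical helper: name the kind of node an annotation parsed to."""
    if isinstance(node, ast.JoinedStr):
        return "f-string"
    if isinstance(node, ast.Constant) and isinstance(node.value, bytes):
        return "bytes"
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
        return f"str {node.value!r}"
    return "other"


func = ast.parse(SOURCE).body[0]
kinds = {arg.arg: describe(arg.annotation) for arg in func.args.args}

for name, kind in kinds.items():
    print(f"{name}: {kind}")
```

In the AST, the f-string and the bytestring are the only two that are not a plain str constant; the other five all come out as the same cooked string.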

I started looking at this because @erictraut brought up a few of these cases in Basic terminology for types and type forms - #43 by davidfstr. The situation seems tricky enough that I’m splitting it out of that thread (which is about definitions of basic terms in the type system).

In summary, my proposal is:

  • Type checkers must support string annotations that do not contain any backslash escapes and that have either no string prefix or the u prefix.
  • Type checkers must reject string annotations that are byte strings or f-strings.
  • Type checkers may support raw strings and backslash escapes in string annotations.

This aligns best with mypy’s current behavior.

6 Likes

It might be worth letting type checkers optionally support f-strings that are statically determinable, for the same reasons given in support of implicit concatenation.

I don’t have a strong opinion on this being necessary, but I also don’t think we need to reject all f-strings in order to remain sound here.

Bytestrings and raw strings, as well as \N and \x escapes, should be supported. These are language features that interact with the type system when it comes to literals and should not end up discouraged because of typing, so I don’t think we should treat support for them as optional.

edit: bytestrings specifically as the outer string encapsulating a “Stringified annotation” could be disallowed, but doing so needs to not be done in a way that also disallows them inside of a Literal inside of that annotation.

We’re talking about the outermost stringified annotation here, not about what goes inside Literal. Literal[b"x"] is fine and should be allowed, but I don’t see a reason for allowing b'Literal[b"x"]' or similar. Bytestrings are a distinct type meant for binary data and type annotations are text, not binary data.

5 Likes

I agree with Jelle’s recommendations. Definitely disallow b"..." and f"...". My litmus test here would be “is it a legal docstring?”

3 Likes

I also agree with Jelle’s thoughts.

Out of these, the one I see myself most likely to want to use is implicit concatenation.

I agree with much of Jelle’s proposal, but my thoughts differ in a few areas.

  1. I think we need to differentiate between strings that appear as type arguments to Literal and strings that indicate a forward reference. Literal requires support for bytes, and as Jelle points out, there are legitimate use cases for r-strings and escape sequences in strings used with Literal. We have to recognize that Literal has some unique behaviors and requirements, so some special-casing is unavoidable here. Annotated is also special in that it allows any value expression (including any string form) in type arguments beyond the first one.

  2. I don’t like the idea of the spec making aspects of annotations “optional”. There are parts of the typing spec where optional behavior is reasonable and even desirable, but I maintain this isn’t one of them. It’s important for library and stub authors to know which annotation forms will and will not be supported across type checkers, and I think the spec should be clear about this. If some annotation forms are optional, it’s going to create pain for library/stub developers and consumers. I only see downsides in making these forms optional. No upsides.

  3. I agree that b-strings, r-strings, and f-strings should not be allowed in a forward-reference annotation. There’s no legitimate reason I can think of to allow these.

  4. I also cannot think of any legitimate reason to support escape sequences (outside of Literal), so I think these should be disallowed. Allowing them (optionally or otherwise) unnecessarily complicates the tooling and creates opportunities for tools to deviate in behavior.

  5. I also think that implicit concatenation should be explicitly disallowed for the practical reasons I raised previously.

Disallowing raw strings, string escapes, and implicit concatenation would mean that mypy has to make a change to start disallowing those. As a mypy maintainer, I would find it hard to justify such a change: these annotations work already and have reasonable behavior.

1 Like

Since we might be heading towards a standoff here :-), I think we should try to follow a more objective standard than “doing/not doing this would require type checker X/Y to be changed”. (Leaving the special cases for Literal and Annotated aside.)

Such a standard might be derived from the core language’s semantics for strings.

  • Byte strings are a different type
  • F-strings are a dynamic construct (not a literal)
  • R-strings, u-strings, implicit concatenation and escape sequences are part of the literal notation

We want other tools that look at or transform Python source code to be able to reliably handle and transform string literals, without having to understand type annotations.

I agree that optional parts in the spec are problematic, so hopefully we can require type checkers to follow the whole standard here.

4 Likes

What you’re proposing is problematic for tools like pyright or other language servers. These tools need to associate errors with text ranges (including start/end column numbers). For command-line tools like mypy, this isn’t such a concern because errors are reported only with a start location, and that’s not done very accurately in many cases.

Consider the following:

x: list[undefined]
y: "list[undefined]"

For this code, pyright reports an error indicating that “undefined” is not defined. Note that it correctly reports the error range regardless of whether the annotation is quoted or not.

This approach to error reporting is in the same spirit as the improvements made in Python 3.11 and 3.12. This makes it easier for developers to understand error messages.

Now, consider if we were to allow the following type annotation:

x: "l\N{LATIN SMALL LETTER I}st[undefined]"

or

y: "list" "[undefined]"

To interpret the meaning of these annotations, a type checker can no longer simply parse the quoted text. It first needs to translate the escaped string(s) into a single unescaped string. (I’ll note that this process is not easy if the tool is written in a language other than Python because it needs to include knowledge of all unicode character name translations!) The tool then needs to parse and evaluate the translated string. Any syntax or semantic errors that result can no longer be reported at the correct location in the original string because the unescape process is not reliably reversible.
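A small sketch of the offset problem, using the standard ast module as a stand-in for a checker's parser (an assumption for illustration only; pyright has its own parser). It locates "undefined" in the cooked string and then naively applies that offset to the source text.

```python
import ast

source = r'x: "l\N{LATIN SMALL LETTER I}st[undefined]"'
annotation = ast.parse(source).body[0].annotation

cooked = annotation.value  # escapes already resolved by the parser
offset_in_cooked = cooked.index("undefined")

# Naive mapping: start of the literal, plus one for the opening quote,
# plus the offset found in the cooked string.
naive_start = annotation.col_offset + 1 + offset_in_cooked
naive_text = source[naive_start:naive_start + len("undefined")]

true_start = source.index("undefined")

print(repr(cooked))
print(repr(naive_text), naive_start, true_start)
```

The naive range lands inside the \N{...} escape rather than on the name the user needs to fix, because the escape takes up many more columns in the source than in the cooked string.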

So, in an effort to support a feature that has no justified use case, you have added significant complexity to the tool’s implementation and harmed the user experience in the process. That’s not a good tradeoff, in my opinion.

I think it’s better if we take an “originalist” approach here. Let’s look at why quoted type annotations were introduced in the first place. It was done to allow for forward references through deferred evaluation. For forward references, the quoted text must be a valid annotation, both syntactically and semantically. That is, the text for a quoted annotation should be the same as the text for a non-quoted annotation. If PEP 484 had included a mechanism like PEP 649’s deferred evaluation, quoted annotations probably never would have been added to the type system, and we wouldn’t even be having this discussion.

As I said above, I can’t think of any upside of supporting escaped strings or implicit concatenation within quoted annotations. I see only downsides.

I don’t feel as strongly about r-strings, but I see no good reason to support them for purposes of creating a forward reference type annotation.

4 Likes

For \N (edit: and likewise for \x and raw strings, because of how strings nest inside a quoted annotation and how escapes interact with that), the use case I saw for allowing it is below:

'list[Literal["\N{LATIN SMALL LETTER I}"]]'

The named escape there is part of the string, but it is in a place it would be valid, and should work just the same as in

list[Literal["\N{LATIN SMALL LETTER I}"]]

To that end, I just want to ensure the language is precise enough, because that escape is part of the outer string (there’s only one string here from the interpreter’s point of view) but sits inside what the type checker should see as the inner string, and the type checker should not need to do a replacement for it.

This becomes even more apparent when people want to use named escapes for things like Unicode combining glyphs so as not to commit those as plain characters to source code.

Whatever rules we end up with, I agree that the end result still needs to be a valid type expression when in a context that type checkers are intended to be allowed to use for type information.

edit: case for needing r-strings, but that this should still be scoped in the rules to where the type checkers should be able to “ignore” the outermost r

r'Literal[r"some_regex\x"]'  # without the outer r, the \x is a syntax error
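The claim in that comment can be checked with the built-in compile(). This is only a sketch of what the interpreter does with the two spellings; the helper name compiles is made up, and nothing here reflects type checker behavior.

```python
def compiles(source):
    """Hypothetical helper: report whether Python accepts the source text."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


# Raw triple-quoted strings so the backslashes reach compile() unchanged.
without_outer_r = r"""x: 'Literal[r"some_regex\x"]'"""
with_outer_r = r"""x: r'Literal[r"some_regex\x"]'"""

print(compiles(without_outer_r))
print(compiles(with_outer_r))
```

Without the outer r, the interpreter tries to decode \x as an escape in the outer literal and fails before the annotation is ever looked at as a type expression.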
2 Likes

What you’re proposing is problematic for tools like pyright or other language servers. These tools need to associate errors with text ranges (including start/end column numbers). […]

I find this pretty convincing. Allowing exotic kinds of string annotations doesn’t appear to have any benefit, and this is a clear downside.

If the concern is that it would be a pain to change existing behavior, can we label the treatment of such string annotations as “undefined behavior”? True, that’s effectively pretty much the same as saying “optional support,” but it at least carries a more negative connotation.

This doesn’t strike me as impossible. The algorithm would be to (1) resolve the string literal as written in the source (including escapes, implicit concatenation, etc.) into a “cooked” string plus a data structure that maps each character to its source location in the original file; (2) parse the “cooked” string into an AST associated with data mapping the AST nodes to original source locations; (3) show errors where appropriate, using the source code mappings to show the error in the right place.
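Step (1) could be sketched roughly as follows. This is a toy under stated assumptions: the function name cook_with_offsets is hypothetical, it takes the text between the quotes of a single non-raw literal, and it understands only \N{NAME}, \xHH, and a handful of single-character escapes, raising ValueError for anything else. A real implementation would need the full escape grammar and implicit concatenation.

```python
import re
import unicodedata

# Only a subset of Python's escapes is recognized here.
_ESCAPE = re.compile(
    r"\\(?:N\{(?P<name>[^}]+)\}|x(?P<hex>[0-9a-fA-F]{2})|(?P<simple>.))",
    re.DOTALL,
)
_SIMPLE = {"n": "\n", "t": "\t", "\\": "\\", "'": "'", '"': '"'}


def cook_with_offsets(body):
    """Return (cooked, offsets); offsets[i] is where cooked[i] starts in body."""
    cooked = []
    offsets = []
    pos = 0
    while pos < len(body):
        match = _ESCAPE.match(body, pos)
        if match is None:
            # Ordinary character: copy it and record its position.
            cooked.append(body[pos])
            offsets.append(pos)
            pos += 1
            continue
        if match["name"] is not None:
            char = unicodedata.lookup(match["name"])
        elif match["hex"] is not None:
            char = chr(int(match["hex"], 16))
        else:
            char = _SIMPLE.get(match["simple"])
            if char is None:
                raise ValueError(f"unsupported escape at index {pos}")
        # The whole escape maps to one cooked character starting at pos.
        cooked.append(char)
        offsets.append(pos)
        pos = match.end()
    return "".join(cooked), offsets


body = r"l\N{LATIN SMALL LETTER I}st[undefined]"
cooked, offsets = cook_with_offsets(body)
start = offsets[cooked.index("undefined")]
print(cooked, start, body[start:start + len("undefined")])
```

With the offsets list in hand, a position found while parsing the cooked string can be translated back to a position in the original literal, which is what steps (2) and (3) rely on.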

To be sure, that is a lot of complexity for little gain, though as @mikeshardmind highlighted, there are some conceivable use cases. However, prohibiting these cases would impose a similar complexity cost on type checkers like mypy that use the Python AST: the AST module already resolves strings into “cooked” strings, so to disallow e.g. escape sequences, mypy would have to add some awkward code to look at the original source code and scan it for escape sequences.
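To give a feel for the kind of source scanning that prohibition would need, here is a rough sketch built on ast.get_source_segment and tokenize. The function name exotic_features and the labels it returns are invented for this example; it is not taken from mypy and glosses over details such as f-strings and bytes.

```python
import ast
import io
import re
import tokenize


def exotic_features(source, node):
    """Hypothetical check: which 'exotic' forms does this str constant use?"""
    segment = ast.get_source_segment(source, node)
    # Parenthesize so a literal split across lines still tokenizes cleanly.
    reader = io.StringIO("(" + segment + ")").readline
    pieces = [
        tok.string
        for tok in tokenize.generate_tokens(reader)
        if tok.type == tokenize.STRING
    ]
    found = set()
    if len(pieces) > 1:
        found.add("implicit concatenation")
    for piece in pieces:
        # Everything before the opening quote is the string prefix.
        prefix = re.match(r"[A-Za-z]*", piece).group().lower()
        if prefix:
            found.add("prefix " + prefix)
        if "\\" in piece and "r" not in prefix:
            found.add("backslash escape")
    return found


SOURCE = r'''
x: "list" "[undefined]"
y: r"int"
z: "\x69nt"
w: "int"
'''

report = {
    stmt.target.id: exotic_features(SOURCE, stmt.annotation)
    for stmt in ast.parse(SOURCE).body
}
for name, features in report.items():
    print(name, sorted(features))
```

The AST alone cannot answer these questions because all four annotations above are plain str constants once parsed; the information only exists in the original source text.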

Both supporting and prohibiting escape sequences would therefore require complex logic in some type checkers for handling an obscure edge case. That’s why in my original proposal I adopted the compromise position of making support optional.

If we don’t want to make this an optional part of the spec, then I would strongly prefer to say that string annotations follow normal Python string semantics, allowing all forms of escapes and implicit concatenation. As @guido said above, that makes it easier for other Python tools to handle annotation strings.

For example, Black has a feature that attempts to split long strings using implicit concatenation to make them fit into the line length. If implicit concatenation is not allowed in type expressions, then Black would have to somehow know not to use implicit concatenation in a line like cast("Really[Long, Type]", x) where the type string exceeds the line length limit.

Undefined behavior means “anything can happen”. I would much prefer to limit the options to two: either type checkers support the full range of Python string literals with Python’s standard semantics, or they reject specific subsets of the literal syntax with a clear error message. Anything else would be very user-unfriendly.

5 Likes

Specifically responding to Michael’s claim that escapes such as \N{...} should be preserved, I feel that that is too high a bar. In

'list[Literal["\N{LATIN SMALL LETTER I}"]]'

the \N{LATIN SMALL LETTER I} is translated to simply i, so we get

'list[Literal["i"]]'

before we even start parsing the type expression in the string. If we really want to assign a different meaning to the two, the user will have to use r'...' for the outer literal.

This and Michael’s later example suggest that at least the ‘r’ prefix should be supported by all type checkers.

1 Like

To give a slightly more extensive example, I’d expect all of the following to type check, and to behave the same:

x: Literal["\N{CHECK MARK}"] = "\N{CHECK MARK}"
x: 'Literal["\N{CHECK MARK}"]' = "\N{CHECK MARK}"
x: r'Literal["\N{CHECK MARK}"]' = "\N{CHECK MARK}"
x: Literal["\N{CHECK MARK}"] = "✓"
x: 'Literal["\N{CHECK MARK}"]' = "✓"
x: r'Literal["\N{CHECK MARK}"]' = "✓"
x: Literal["✓"] = "✓"
x: 'Literal["✓"]' = "✓"
x: r'Literal["✓"]' = "✓"

I could easily see using an \N escape rather than typing in a Unicode character like ✓. And I can see people either using simple quotes (on the assumption that the \N escape would be translated when the string was constructed) or raw quotes (on the assumption that to quote an annotation that contains a backslash, you need to use raw quotes). So all of these options are, to me, plausible (although I’d hope any individual developer would only use a subset of them!)

And I’d expect any of those annotations to evaluate at runtime (for example, via inspect.get_annotations when used as a parameter annotation) to Literal["✓"].
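A quick runtime check of that expectation, as a sketch: it calls eval directly on the three spellings of the string rather than going through inspect.get_annotations, and passes Literal in explicitly as the only global.

```python
from typing import Literal

annotations = [
    'Literal["\N{CHECK MARK}"]',   # escape resolved when the outer literal is compiled
    r'Literal["\N{CHECK MARK}"]',  # escape resolved when the inner literal is evaluated
    'Literal["✓"]',                # no escape at all
]

evaluated = [eval(text, {"Literal": Literal}) for text in annotations]

for text, value in zip(annotations, evaluated):
    print(repr(text), "->", value)
```

The first and third strings are identical by the time they exist as objects, and the raw one differs as text but evaluates to the same Literal type.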

I understand the arguments about adding complexity to tools, but surely we have to start by not introducing complexity for users? I could get behind annotations being limited “like a docstring” (so it must be a string literal, not a bytes literal or an f-string) but adding extra constraints (and worse, constraints that only fail in type checkers and IDEs, but are fine at runtime) is going to make worse the “uncanny valley” feeling that Python typing is sort of, but not actually, the same as “normal Python”.

2 Likes

There’s one exceptional case though.

'\N{QUOTATION MARK}'

is equivalent to

'"'

But what to do with

'Literal["\N{QUOTATION MARK}"]'

???
Without using r'...' at the outer level, this becomes invalid syntax:

'Literal["""]'

Interesting case. Personally (and this is very much just my personal view) I’d value consistent semantics over “do what I mean” logic - which means, I’d prefer it to be an error for a type checker, matching the error that you’d get using inspect.get_annotations with eval_str=True:

>>> def f(x: 'Literal["\N{QUOTATION MARK}"]'):
...     pass
>>> inspect.get_annotations(f)
{'x': 'Literal["""]'}
>>> inspect.get_annotations(f, eval_str=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Gustav\AppData\Local\Programs\Python\Python312\Lib\inspect.py", line 285, in get_annotations
    value if not isinstance(value, str) else eval(value, globals, locals)
                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 1
    Literal["""]
            ^
SyntaxError: unterminated triple-quoted string literal (detected at line 1)

I guess I could accept a rule that made it “work the way it looks”, but I don’t see how such a rule could be worded in such a way that it was both straightforward to understand and easy to reconcile with the runtime behaviour.

2 Likes