PEP 701 – Syntactic formalization of f-strings

isidentical · December 24, 2022, 1:52pm

A familiar example is the set of challenges we are having with f-strings in ast.unparse (a standard library code generator that takes an AST and tries to output valid Python code for it). The f-string code generation is by far the most complicated part, due to the implied tracking of quote types and the need to avoid backslashes. Even with this complexity, IIRC there are still a few edge cases where we can’t roundtrip some f-strings that appear in the wild (big source code datasets) as regular Python syntax. Lifting the quote restrictions would help quite a bit here: we could simplify the expression code generator for f-strings by emitting the result as '{' + ast.unparse(inner) + '}' instead of the complex steps we were taking before.
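To make that simplification concrete, here is a minimal sketch of what the replacement-field step could look like once the generator no longer has to choose quote characters. The helper name `unparse_fstring_body` is hypothetical and this is not the actual `ast.unparse` implementation: it ignores conversions (`!r`) and format specs, and it only rebuilds the text between the quotes.

```python
import ast


def unparse_fstring_body(node: ast.JoinedStr) -> str:
    """Rebuild the text between the quotes of an f-string (hypothetical helper)."""
    parts = []
    for value in node.values:
        if isinstance(value, ast.Constant):
            # Literal text: braces must be doubled to survive a round trip.
            parts.append(value.value.replace("{", "{{").replace("}", "}}"))
        else:
            # ast.FormattedValue: conversions and format specs are ignored here.
            parts.append("{" + ast.unparse(value.value) + "}")
    return "".join(parts)


source = "f'total: {n * 2} {{approx}}'"
tree = ast.parse(source, mode="eval")
print(unparse_fstring_body(tree.body))
```

With quote reuse allowed, the result can be wrapped in any quote style without inspecting the strings inside the braces, which is the part that currently needs the quote tracking.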

cben · December 29, 2022, 12:27pm

An additional asymmetry is that the final literal text segment after the last {…} is part of the FSTRING_END:

Pablo Galindo Salgado:

[TokenInfo(type=61 (FSTRING_START), string="rf'", start=(1, 0), end=(1, 3), line="rf'hello {1+2} bye'\n"),
 TokenInfo(type=62 (FSTRING_MIDDLE), string='hello ', start=(1, 3), end=(1, 9), line="rf'hello {1+2} bye'\n"),
 ...
 TokenInfo(type=63 (FSTRING_END), string=' bye', start=(1, 14), end=(1, 18), line="rf'hello {1+2} bye'\n"),
 ...]

Could FSTRING_START / FSTRING_END cover just the opening f'/closing ', with literal text pieces everywhere being FSTRING_MIDDLE?
(I’d even rename that to FSTRING_TEXT or _LITERAL (well too overloaded) or _STATIC… The interesting aspect imho is the contrast to dynamic segments LBRACE…RBRACE, not its position)

[TokenInfo(type=61 (FSTRING_START), string="rf'", start=(1, 0), end=(1, 3), line="rf'hello {1+2} bye'\n"),
 TokenInfo(type=62 (FSTRING_TEXT), string='hello ', start=(1, 3), end=(1, 9), line="rf'hello {1+2} bye'\n"),
 ...
 TokenInfo(type=62 (FSTRING_TEXT), string=' bye', start=(1, 14), end=(1, 18), line="rf'hello {1+2} bye'\n"),
 TokenInfo(type=63 (FSTRING_END), string="'", start=(1, 18), end=(1, 19), line="rf'hello {1+2} bye'\n"),
 ...]

pablogsal · December 29, 2022, 12:29pm

Thanks for your comment! In general, you should not focus on what the tokens contain.

As I mentioned before, these tokens are just informative so the reader of the PEP knows what they do, but the implementation is not visible to the user (it is the C tokenizer) and has no semantic meaning. We may even change how we implement these tokens in the final implementation if the PEP is affected. What these tokens (the ones in the C tokenizer, which I insist is not visible) will contain is going to be based fundamentally on what makes the implementation simpler.

guido · December 29, 2022, 7:43pm

You’ve said this before, but it feels like the asymmetry in the tokens is going to keep bothering people. I know it bothers me. I’m trying to understand here why this might be, and I think it’s just that the “grammar” you give here is not very helpful without precisely defining the tokens, and that is exactly what the PEP refuses to do. We could look at the reference implementation, but it’s a huge pile of code that’s not so easy to follow, and it doesn’t seem to have a version of Lib/tokenize.py (which might be more accessible).

When I thought about this idea I’ve always started with how the tokenizer could be implemented, in particular what states it should maintain on a stack. (Somehow your suggestion of having a stack of tokenizers didn’t occur to me, but it sounds isomorphic.)

1. Regular or classic mode.
2. In the “string” part of an f-string, looking for a matching closing quote OR a single {.
3. Inside an interpolation expression; this is mostly the same as regular/classic mode, but also stops at !, :, = and }, when outside parens/braces/brackets. (I realize there’s more to it, and we’d like the parser to play a role here too.)
4. Possibly a special mode after !. Or maybe ! doesn’t need to be so special. (Though I’d prefer clarity over whether spaces are allowed around the following r, s or a.)
5. Definitely a special mode after :, where we are looking for { and }; this is slightly different from “string” mode because we’re not expecting a closing quote.

Did I miss anything?

pablogsal · December 29, 2022, 9:42pm

I sympathise with this point of view. This is one of the reasons we are including examples of the token descriptions, so that the grammar makes sense. I also understand that the asymmetry of the token contents may upset people. On the other hand, we want to avoid the discussion focusing on this, because it only affects the implementation: any appreciation of the token contents (the names of the tokens or what part of the f-string they should contain) at this stage of the discussion is mainly about aesthetics and not about the consequences of the grammar change. I apologize if it seems that I am overly insisting on this, but I think it is important. The proposal doesn’t change anything user-visible whether or not the tokens contain the starting quote, the end quote, or other alterations. We also have many other important points in flight, the implementation is complex, and we want to at least have the main PEP accepted before exploring how to clean up or improve the actual implementation.

Guido van Rossum:

When I thought about this idea I’ve always started with how the tokenizer could be implemented, in particular what states it should maintain on a stack. (Somehow your suggestion of having a stack of tokenizers didn’t occur to me, but it sounds isomorphic.)

Regular or classic mode.

In the “string” part of an f-string, looking for a matching closing quote OR a single {.

Inside an interpolation expression; this is mostly the same as regular/classic mode, but also stops at !, :, = and }, when outside parens/braces/brackets. (I realize there’s more to it, and we’d like the parser to play a role here too.)

Possibly a special mode after !. Or maybe ! doesn’t need to be so special. (Though I’d prefer clarity over whether spaces are allowed around the following r, s or a.)

Definitely a special mode after :, where we are looking for { and }; this is slightly different from “string” mode because we’re not expecting a closing quote.

Did I miss anything?

Here you seem to be referring to the description of how to tokenise the f-string text to produce these tokens. Note that even if you specify the switching points, the tokens and their contents are relatively unspecified (should we include the quote at the start or should the tokenizer retain that information? should we emit “empty” tokens for the start and the end or should they contain stuff?).

(Though I’d prefer clarity over whether spaces are allowed around the following r, s or a.)

We don’t allow them to preserve behavior but technically the parser does because the grammar is:

fstring_replacement_field
    | '{' (yield_expr | star_expressions) "="? [ "!" NAME ] [ ':' fstring_format_spec* ] '}'

which means that at parsing time there can be any whitespace between “!” and the NAME token for the r, s or a. But we artificially disallow it when the AST node is formed:

>>> f"{2.34! r}"
  File "<stdin>", line 1
    f"{2.34! r}"
           ^^^
SyntaxError: conversion type must come right after the exclamation mark

Definitely a special mode after :, where we are looking for { and }; this is slightly different from “string” mode because we’re not expecting a closing quote.

Yep, but you can also see it as the same as the second mode, because if we find a quote we just handle the error that the } closing the expression was not found. Note that in the first mode finding a '}' is also an error, so it is isomorphic to this problem with the quote and the bracket interchanged.

Regarding the algorithm, I don’t think you are missing anything other than how to handle the = sign, which requires the tokenizer to retain the raw source, and maybe the fact that looking for { or } is not enough, because {{ doesn’t mark the start and }} doesn’t mark the end (there is some tricky stuff like f"{1:<3}}}"). Of course, the devil is in the details, because including locations for these tokens and other stuff can be a bit challenging (especially at the end of the string).

More or less this:

Taking into account that when the f-string closes we check for unbalanced { and other errors.
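As a rough illustration of the modes discussed in this exchange, here is a toy tokenizer sketch. It is not the C implementation, and every name in it (`tokenize_fstring`, `_string_mode`, `_expr_mode`, `_spec_mode`, the `EXPR` token kind) is hypothetical. It assumes a non-raw, single-line f-string with `'` or `"` quotes, keeps each expression as one opaque `EXPR` chunk instead of tokenizing it, ignores `!` and `=`, and uses the symmetric token layout suggested earlier in the thread (`FSTRING_END` is just the closing quote). The Python call stack plays the role of the stack of modes.

```python
QUOTES = "'\""


def tokenize_fstring(src):
    """Split a non-raw f-string such as f'...' or f"..." into (kind, text) pairs."""
    if src[:1] not in ("f", "F") or src[1:2] not in ("'", '"'):
        raise ValueError("expected a literal starting with f' or f\"")
    out = []
    try:
        end = _string_mode(src, 0, out)
    except (IndexError, ValueError):
        # Ran off the end of the input while a mode was still open.
        raise SyntaxError("unterminated f-string") from None
    if end != len(src):
        raise SyntaxError("unexpected text after the closing quote")
    return out


def _string_mode(src, i, out):
    """Literal text: stop at the closing quote or at a single '{'."""
    quote = src[i + 1]
    out.append(("FSTRING_START", src[i:i + 2]))
    i = start = i + 2
    while True:
        c = src[i]
        if c == quote or (c == "{" and src[i + 1] != "{"):
            if i > start:
                out.append(("FSTRING_MIDDLE", src[start:i]))
            if c == quote:
                out.append(("FSTRING_END", quote))
                return i + 1
            out.append(("LBRACE", "{"))
            i = start = _expr_mode(src, i + 1, out)
        elif c == "}" and src[i + 1] != "}":
            raise SyntaxError("single '}' is not allowed")
        elif c in "{}\\":
            i += 2  # doubled brace or backslash escape stays in the text
        else:
            i += 1


def _expr_mode(src, i, out):
    """Expression: track bracket depth, stop at ':' or '}' at depth zero."""
    start, depth = i, 0
    while True:
        c = src[i]
        prev_is_name = src[i - 1].isalnum() or src[i - 1] == "_"
        if c in "fF" and src[i + 1] in QUOTES and not prev_is_name:
            _emit_expr(out, src[start:i])
            i = start = _string_mode(src, i, out)  # nested f-string: new mode
        elif c in QUOTES:
            i = src.index(c, i + 1) + 1  # plain string literal, escapes ignored
        elif depth == 0 and c in ":}":
            _emit_expr(out, src[start:i])
            if c == "}":
                out.append(("RBRACE", "}"))
                return i + 1
            out.append(("COLON", ":"))
            return _spec_mode(src, i + 1, out)
        elif c in "([{":
            depth, i = depth + 1, i + 1
        elif c in ")]}" and depth > 0:
            depth, i = depth - 1, i + 1
        else:
            i += 1


def _emit_expr(out, text):
    if text.strip():
        out.append(("EXPR", text.strip()))


def _spec_mode(src, i, out):
    """Format spec: like string mode, but no closing quote and no doubled braces."""
    start = i
    while True:
        c = src[i]
        if c not in "{}":
            i += 1
            continue
        if i > start:
            out.append(("FSTRING_MIDDLE", src[start:i]))
        if c == "}":
            out.append(("RBRACE", "}"))
            return i + 1
        out.append(("LBRACE", "{"))
        i = start = _expr_mode(src, i + 1, out)


for token in tokenize_fstring("f' foo { f' {1 + 1:.10f} ' } bar'"):
    print(token)
```

Because the format-spec mode never treats `}}` as an escape, `f"{1:<3}}}"` splits into a spec of `<3` followed by a literal `}}`, which matches the behaviour described above.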

In any case, would you like us to include a refined version of this description in the PEP document for clarification?

guido · December 30, 2022, 3:21am

One way to refocus the discussion on the important things is to just give in, change the tokens to what people seem to expect, and move on. As you say, it’s not normative anyway, so there’s no reason for the tokens used in the PEP to match your implementation.

(I do think that the tokens as returned by tokenize.py should be described by the PEP, because that is a public API.)

Maybe the PEP should have a section that explains in detail what will change?

The PEP makes it clear that the AST does not change (since it already supports everything that’s needed). There may be additional nesting, but that shouldn’t be a problem.
I assume that the semantics don’t change either – everything that can affect the semantics (other than recursion limits) should be represented in the AST.
So what changes, apart from being able to reuse string quotes? IIUC you believe that it is necessary to accept spaces around the “identifier” following !. But we don’t have to change that – the parser can insist that two tokens are adjacent without intervening whitespace by checking the end line/col of one token and the start line/col of the next. (I discovered this trick when prototyping custom string prefixes for Jim Baker.)
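The adjacency trick can be sketched with the public tokenize module. This is only an illustration of comparing token positions, not the parser’s actual check; the helper names `adjacent` and `leading_tokens` are made up for this example, and `.` stands in for `!` because a standalone `!` is tokenized differently across Python versions.

```python
import io
import tokenize


def adjacent(left, right):
    """True when `right` starts exactly where `left` ends (same line and column)."""
    return left.end == right.start


def leading_tokens(source, count=3):
    """Return the first `count` tokens of `source`."""
    readline = io.StringIO(source).readline
    return list(tokenize.generate_tokens(readline))[:count]


for source in ("value.real\n", "value. real\n"):
    name, dot, attr = leading_tokens(source)
    print(repr(source), "->", adjacent(dot, attr))
```

A parser rule for the conversion could apply the same comparison to the `!` token and the NAME that follows it, and reject the input when the two are not adjacent.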

(Anecdote: I was idly typing some Python code this afternoon and received an unexpected syntax error. Upon inspection I realized I had written this:

>>> for x in dir(f): print(f"{x:20s} : {getattr(f, x, "NOPE")!r:.40}")

)

I’ll respond to the tokenization details later.

guido · December 30, 2022, 4:57am

Hm. I think I may have found an example where I might disagree. But before we argue about this, I think it would help if there was a specification for the tokens – not for what is returned by the tokenizer, but e.g. a regular expression showing what the lexer accepts for each token. This would be helpful for people writing alternate tokenizers.

(In fact, it’s possible that one reason people get hung up on the contents of the tokens is that what they really would like to understand is the input acceptable for each token.)

Yes I would like that very much. It would also be nice if we had an executable version of that specification in the form of a Python class into which one can feed examples.

pablogsal · December 30, 2022, 11:47am

An example where you disagree with treating the tokenizer mode after : the same as the mode for f-string, or where you disagree with the proposal in general?

Maybe the PEP should have a section that explains in detail what will change?

We can add that. We already have something similar in “consequences of the grammar”; I can adapt it so it is clearer.

but e.g. a regular expression showing what the lexer accepts for each token

I don’t think a regular expression is possible (or at least not one straightforward enough to help with clarification), because the cut points depend on the parenthesis and bracket nesting level of some of the characters and on other state the lexer needs to keep track of.

Yes I would like that very much.

Ok, we will incorporate a refined version of that description into the document.

It would also be nice if we had an executable version of that specification in the form of a Python class into which one can feed examples.

If you don’t mind that the token contents are asymmetric, you can already play with it by getting the tokens from the C tokenizer using the private interface we use for testing (you need to do this from the implementation branch):

>>> import pprint
>>> import tokenize
>>> pprint.pprint(list(tokenize._generate_tokens_from_c_tokenizer("f' foo { f' {1 + 1:.10f} ' } bar'")))

[TokenInfo(type=61 (FSTRING_START), string="f'", start=(1, 0), end=(1, 2), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=62 (FSTRING_MIDDLE), string=' foo ', start=(1, 2), end=(1, 7), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=25 (LBRACE), string='{', start=(1, 7), end=(1, 8), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=61 (FSTRING_START), string="f'", start=(1, 9), end=(1, 11), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=62 (FSTRING_MIDDLE), string=' ', start=(1, 11), end=(1, 12), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=25 (LBRACE), string='{', start=(1, 12), end=(1, 13), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=2 (NUMBER), string='1', start=(1, 13), end=(1, 14), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=14 (PLUS), string='+', start=(1, 15), end=(1, 16), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=2 (NUMBER), string='1', start=(1, 17), end=(1, 18), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=11 (COLON), string=':', start=(1, 18), end=(1, 19), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=62 (FSTRING_MIDDLE), string='.10f', start=(1, 19), end=(1, 23), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=26 (RBRACE), string='}', start=(1, 23), end=(1, 24), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=63 (FSTRING_END), string=' ', start=(1, 24), end=(1, 25), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=26 (RBRACE), string='}', start=(1, 27), end=(1, 28), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=63 (FSTRING_END), string=' bar', start=(1, 28), end=(1, 32), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=4 (NEWLINE), string='', start=(1, 33), end=(1, 33), line="f' foo { f' {1 + 1:.10f} ' } bar'\n")]

I will discuss with @isidentical and @lys.nikolaou about the changes in tokenize.py to see if we can add this to the PEP and the proposed implementation.

guido · December 30, 2022, 9:50pm

The former – I am still very much in favor of the proposal, I am just trying to tease out details and edge cases.

Here’s my edge case:

A pair of }} is allowed in the regular “middle” part of an f-string, e.g. f"a}}z" produces "a}z".
But inside the format specifier, this is currently not allowed:
```
>>> f"{X():_}}_}"
  File "<stdin>", line 1
    f"{X():_}}_}"
                 ^
SyntaxError: f-string: single '}' is not allowed
>>> 
```
If this were truly treated as a middle part, it would be interpreted as a format specifier equal to _}_.
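The difference between the two positions can be checked directly. The snippet below is a small sketch that uses 1 in place of X() and a made-up helper, `compiles`, which only reports whether the running interpreter accepts an expression.

```python
def compiles(source):
    """Report whether `source` is accepted as an expression by this interpreter."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


# In the literal ("middle") part, a doubled brace is an escape for one brace.
print(f"a}}z")

# After a format spec, the first "}" closes the field; only then does "}}" escape.
print(repr(f"{1:<3}}}"))

# A "}}" placed inside the spec therefore leaves a stray "}" behind.
print(compiles('f"{1:_}}_}"'))
```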

Now, I haven’t built your branch yet so maybe it does that in your version? In that case I give up the quibble (other than that this difference isn’t mentioned in the PEP’s “Consequences …” section).

I did miss that, and I’m glad that it mentions backslashes. I guess there are some other options for comments though, e.g. one could argue that this ought to be valid:

f"___{
    x
}___"

since the newlines are technically enclosed by brackets, similar to

{
    x
}

Or if you want to exclude those outer brackets, what about this?

f"___{(
    x
)}___"

I wasn’t thinking of describing the entire f-string as a regex, just each piece. For example a true “middle” piece might be described as follows:

r"""(?x)  # Verbose mode
(
    \\.  # Escaped character (ignoring octal/hex)
|
    {{  # Left brace, must be doubled
|
    }}  # Right brace, must be doubled
|
    [^'"{}\\]  # Everything else
)*
(?={)  # Must be followed by left brace
"""

This is not the whole story, but it feels like a good start, and (for me at least) it’s more helpful than a hundred words.
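For anyone who wants to feed examples into it, the pattern above can be compiled as-is. This sketch only wraps it in re.compile under the made-up name `MIDDLE`; it inherits the pattern’s stated gaps (it requires a following left brace, so it does not describe the final text segment).

```python
import re

# The "middle" pattern from the post, unchanged apart from being compiled.
MIDDLE = re.compile(r"""(?x)  # Verbose mode
(
    \\.  # Escaped character (ignoring octal/hex)
|
    {{  # Left brace, must be doubled
|
    }}  # Right brace, must be doubled
|
    [^'"{}\\]  # Everything else
)*
(?={)  # Must be followed by left brace
""")

for text in ("hello {name}!", "a{{b}}c{x}", "it's {x}", "no field here"):
    match = MIDDLE.match(text)
    print(repr(text), "->", repr(match.group(0)) if match else None)
```

The third input fails because quote characters are excluded from the “everything else” class, and the fourth fails because no left brace follows.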

I will have to try this for myself. Somehow when I looked at the branch it seemed there weren’t any changes to tokenize.py, seems I was mistaken!

pablogsal · December 30, 2022, 10:05pm

This is not the whole story, but it feels like a good start, and (for me at least) it’s more helpful than a hundred words.

I see what you mean, but how do you express with regular expressions the fact that the expression part is “normal tokenization, but it must stop at } or : only if it is not inside parentheses”? Even if you somehow hand-wave over the “normal tokenization”, I don’t think you can easily express the parentheses part with automaton-based regular expressions.

Humm seems that I didn’t clarify this correctly and there is some confusion. Allow me to clarify it better:

The example I gave uses the C tokenizer interface (notice the function name is _generate_tokens_from_c_tokenizer). This is a non-public interface that we exposed in 3.11 to test the C tokenizer. This is the one that you can already use to play with in our branch.
At this time there are no changes to the tokenize.py file. I meant that I will discuss with @lys.nikolaou and @isidentical about adding them to the proposed implementation.

Apologies for the confusion.

pablogsal · December 30, 2022, 10:19pm

I checked and we handle that correctly at the time:

>>> f"{X():_}}_}"
  File "<stdin>", line 1
    f"{X():_}}_}"
             ^
SyntaxError: unmatched '}'

we do it here:

github.com/we-like-parsers/cpython/blob/383a3f0d08f0a57ea50be2c230ee34c148b85613/Parser/tokenizer.c#L2500-L2506

          if (peek == '}' && current_tok->bracket_mark_index <= 0
              // We can not have }} inside the format spec, so we are going to assume
              // this that the first closing brace belongs to the f-string expression
              // and the second one needs to deal with later (e.g. f"{1:<3}}}").
              && !current_tok->format_spec) {
              p_start = tok->start;
              p_end = tok->cur - 1;

We will mention it in the clarifications

pablogsal · December 30, 2022, 10:23pm

Guido van Rossum:

I guess there are some other options for comments though, e.g. one could argue that this ought to be valid:
f"___{
    x
}___"
since the newlines are technically enclosed by brackets, similar to
{
    x
}
Or if you want to exclude those outer brackets, what about this?
f"___{(
    x
)}___"

Both of these are valid with our current implementation:

>>> x = 1
>>> f"___{
...     x
... }___"
'___1___'
>>> f"___{(
...     x
... )}___"
'___1___'

It also handles debug expressions:

>>> f"___{
... 1
... +
... 1
... =}___"
'___\n1\n+\n1\n=2___'
>>> print(_)
___
1
+
1
=2___

guido · December 31, 2022, 1:14am

That could be driven by the parser though, right? The parser can have lookaheads for }, : and :=, and then signal to the lexer to switch back.

Humm seems that I didn’t clarify this correctly and there is some confusion. […]

Got it. I glanced over that (I seem to be missing a lot of detail today – must be the weather :-).

I checked and we handle that correctly at the time:

Ah, but who says I don’t want to allow }} in a format specifier? Previously maybe we couldn’t but now we could, right?

(Also, could we allow \{ and \} to escape curlies? If we could, would we want to? The doubling feels jarring because that’s not how Python escapes anything else.)

Both of these are valid with our current implementation:

That’s a relief. I guess it’s another part I skimmed incorrectly. The PEP has “Comments, using the # character, are possible only in multi-line f-string literals” which made me think that these also weren’t possible. But if I added a comment to my examples those should work, right?

ncoghlan · January 1, 2023, 1:00am

Definite +1 on the PEP as a whole.

Like others in the thread, I voted in favour of grammar consistency in the poll (allowing quote reuse), but I would also like to see the PEP more strongly discourage actually relying on that in manually written code, in the form of a proposed addition to PEP 8.

The section on string quotes would be the appropriate place for an addition like:

When an f-string contains nested f-strings in subexpressions, use different quote styles at each level of nesting to make it easier to distinguish nested strings from the end of the containing f-string. (By implication, f-string nesting should never exceed 4 levels outside auto-generated code intended solely for machine consumption)

(The part in parentheses probably shouldn’t be added to PEP 8, but could be included in the PEP 701 section on adding the new PEP 8 text)
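The “4 levels” figure follows from the four available quote styles (', ", ''' and """). A minimal sketch of the deepest nesting that uses a different style at each level, with a made-up variable name:

```python
# One quote style per nesting level: """ then ''' then ' then ".
deepest = f"""{f'''{f'{f"{1+1}"}'}'''}"""
print(deepest)
```

Going one level deeper would require reusing a quote style, which is exactly what the proposed PEP 8 wording asks authors of hand-written code to avoid.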

ncoghlan · January 1, 2023, 1:05am

Maybe tweak the phrasing to be “f-strings that span multiple lines”? (As you say, it’s possible to span multiple lines without using triple quotes to make an explicitly multi-line string)

sirosen · January 2, 2023, 3:39pm

In my view, the various mentions of nested f-strings and unreadable code as a possibility are not particularly compelling arguments. It is always possible to write unreadable code.

O0OO0 = 1
OO0O0 = 0
OOO00 = O0OO0 + OO0O0 - ...

And I haven’t even started defining functions yet!

The backslash restriction shows up pretty naturally when working with \n. The other day, I ran into the following real-world example:

f"{'\n'.join(mylist)}\n"

It is not intuitive that this raises a SyntaxError, which makes it a (possibly unnecessary) point of friction for developers.
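For reference, the usual workaround on interpreters that enforce the backslash restriction is to move the escape out of the braces. A small sketch with made-up sample data:

```python
mylist = ["a", "b", "c"]

# Bind the newline to a name so that no backslash appears inside the braces.
newline = "\n"
joined = f"{newline.join(mylist)}\n"

print(repr(joined))
```

It works, but it is the kind of detour that the proposal would make unnecessary.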

pitrou · January 2, 2023, 3:46pm

“It is already possible to write unreadable code so we should not discourage more of it” is definitely not a compelling argument. You might as well jump straight to “it’s possible to write unsafe Python code so there’s no compelling argument against C++”.

sirosen · January 2, 2023, 4:10pm

I take your meaning, but that’s not what I’m trying to say.

There are many examples in this thread which are very confusing but also intentionally constructed, like

f"{some + f"{nested + f"string {concat}"}"} example"

This doesn’t seem to me to be an attempt to show how the feature, being used merely “cleverly” or by a beginner, makes code difficult to understand. Rather, it intends to show that it is possible to construct something unreadable.

There is a good amount of discussion about nested f-strings here, but I don’t know that anyone in favor of the PEP cares very much about nested f-strings? More than one comment seems to suggest a view that no one yet knows of a legitimate and useful case for nested f-strings, so we’d probably keep them out during code reviews.

That line of argument might not be convincing to those against. But nobody seems to be strongly in favor of nested f-strings as a particular consequence of this proposal.

ajoino · January 2, 2023, 5:10pm

This will depend on what one considers legitimate, but I found myself writing

print(f"{some_dict["foo"]}")

not once, but twice in the same day when experimenting in the REPL recently. Even though I had recently read this thread, I was still confused for a while before I realized what the error was. So quick experiments in the REPL are one use case that I would like this feature for.

sirosen · January 2, 2023, 5:54pm

That usage features use of the outer quote character inside of the f-string.
By nested f-strings, I mean any use of an f-string inside of the {} format specifiers:

f"f-string {'containing ' + f'another {"f-string"}'}"