PEP 701 – Syntactic formalization of f-strings

Hm. I think I may have found an example where I might disagree. But before we argue about this, I think it would help if there was a specification for the tokens – not for what is returned by the tokenizer, but e.g. a regular expression showing what the lexer accepts for each token. This would be helpful for people writing alternate tokenizers.

(In fact, it’s possible that one reason people get hung up on the contents of the tokens is that what they really would like to understand is the input acceptable for each token.)

Yes, I would like that very much. It would also be nice if we had an executable version of that specification in the form of a Python class into which one can feed examples.

An example where you disagree with treating the tokenizer mode after the : the same as the f-string mode, or where you disagree with the proposal in general?

Maybe the PEP should have a section that explains in detail what will change?

We can add that; we already have something similar in the “consequences of the grammar” section. I can adapt it so it is clearer.

but e.g. a regular expression showing what the lexer accepts for each token

I don’t think a regular expression is possible (or at least not straightforward enough to help as a clarification) because the cut points depend on the parenthesis and bracket nesting level at which some of the characters appear, and on some other state the lexer needs to keep track of.
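
For instance (a quick example of my own), whether a given : ends the expression part depends on the bracket depth at which it appears:

x = list(range(5))
print(f"{x[1:3]}")    # the ':' is inside '[]', so it is slicing; there is no format spec
print(f"{x[1]:>10}")  # the ':' at bracket depth zero starts the format spec '>10'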

Yes I would like that very much.

Ok, we will incorporate a refined version of that description into the document.

It would also be nice if we had an executable version of that specification in the form of a Python class into which one can feed examples.

If you don’t mind that the token contents are asymmetric, you can already play with it by getting the tokens from the C tokenizer using the private interface we use for testing (you need to do this from the implementation branch):

>>> import pprint
>>> import tokenize
>>> pprint.pprint(list(tokenize._generate_tokens_from_c_tokenizer("f' foo { f' {1 + 1:.10f} ' } bar'")))

[TokenInfo(type=61 (FSTRING_START), string="f'", start=(1, 0), end=(1, 2), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=62 (FSTRING_MIDDLE), string=' foo ', start=(1, 2), end=(1, 7), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=25 (LBRACE), string='{', start=(1, 7), end=(1, 8), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=61 (FSTRING_START), string="f'", start=(1, 9), end=(1, 11), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=62 (FSTRING_MIDDLE), string=' ', start=(1, 11), end=(1, 12), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=25 (LBRACE), string='{', start=(1, 12), end=(1, 13), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=2 (NUMBER), string='1', start=(1, 13), end=(1, 14), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=14 (PLUS), string='+', start=(1, 15), end=(1, 16), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=2 (NUMBER), string='1', start=(1, 17), end=(1, 18), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=11 (COLON), string=':', start=(1, 18), end=(1, 19), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=62 (FSTRING_MIDDLE), string='.10f', start=(1, 19), end=(1, 23), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=26 (RBRACE), string='}', start=(1, 23), end=(1, 24), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=63 (FSTRING_END), string=' ', start=(1, 24), end=(1, 25), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=26 (RBRACE), string='}', start=(1, 27), end=(1, 28), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=63 (FSTRING_END), string=' bar', start=(1, 28), end=(1, 32), line="f' foo { f' {1 + 1:.10f} ' } bar'\n"),
 TokenInfo(type=4 (NEWLINE), string='', start=(1, 33), end=(1, 33), line="f' foo { f' {1 + 1:.10f} ' } bar'\n")]

I will discuss with @isidentical and @lys.nikolaou about the changes in tokenize.py to see if we can add this to the PEP and the proposed implementation.

The former – I am still very much in favor of the proposal, I am just trying to tease out details and edge cases.

Here’s my edge case:

  • A pair of }} is allowed in the regular “middle” part of an f-string, e.g. f"a}}z" produces "a}z".
  • But inside the format specifier, this is currently not allowed:
    >>> f"{X():_}}_}"
      File "<stdin>", line 1
        f"{X():_}}_}"
                     ^
    SyntaxError: f-string: single '}' is not allowed
    >>> 
    
    If this were truly treated as a middle part, it would have been interpreted as a format specifier equal to _}_.

Now, I haven’t built your branch yet so maybe it does that in your version? In that case I give up the quibble (other than that this difference isn’t mentioned in the PEP’s “Consequences …” section).

I did miss that, and I’m glad that it mentions backslashes. I guess there are some other options for comments though, e.g. one could argue that this ought to be valid:

f"___{
    x
}___"

since the newlines are technically enclosed by brackets, similar to

{
    x
}

Or if you want to exclude those outer brackets, what about this?

f"___{(
    x
)}___"

I wasn’t thinking of describing the entire f-string as a regex, just each piece. For example a true “middle” piece might be described as follows:

r"""(?x)  # Verbose mode
(
    \\.  # Escaped character (ignoring octal/hex)
|
    {{  # Left brace, must be doubled
|
    }}  # Right brace, must be doubled
|
    [^'"{}\\]  # Everything else
)*
(?={)  # Must be followed by left brace
"""

This is not the whole story, but it feels like a good start, and (for me at least) it’s more helpful than a hundred words.
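
For anyone who wants to experiment, here is one way to exercise that sketch against a few sample middle pieces (the test strings are just mine):

import re

MIDDLE = re.compile(r"""(?x)  # Verbose mode
(
    \\.  # Escaped character (ignoring octal/hex)
|
    {{  # Left brace, must be doubled
|
    }}  # Right brace, must be doubled
|
    [^'"{}\\]  # Everything else
)*
(?={)  # Must be followed by left brace
""")

for text in [" foo {x}", "a}}z{x}", "a}z{x}"]:
    m = MIDDLE.match(text)
    print(repr(text), "->", repr(m.group(0)) if m else "no match")

# ' foo {x}' -> ' foo '
# 'a}}z{x}' -> 'a}}z'
# 'a}z{x}' -> no match  (a lone '}' is not accepted by this piece)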

I will have to try this for myself. Somehow when I looked at the branch it seemed there weren’t any changes to tokenize.py, seems I was mistaken!

This is not the whole story, but it feels like a good start, and (for me at least) it’s more helpful than a hundred words.

I see what you mean, but how do you express with regular expressions the fact that the expression part is “normal tokenization, but it must stop at } or : only if it is not inside parentheses”? Even if you somehow hand-wave over the “normal tokenization” part, I don’t think you can easily express the parentheses condition with automaton-based regular expressions.

Hmm, it seems that I didn’t explain this correctly and there is some confusion. Allow me to clarify:

  • The example I gave uses the C tokenizer interface (notice the function name is _generate_tokens_from_c_tokenizer). This is a non-public interface that we exposed in 3.11 to test the C tokenizer. This is the one you can already play with in our branch.
  • At the moment there are no changes to the tokenize.py file. I meant that I will discuss with @lys.nikolaou and @isidentical about adding them to the proposed implementation.

Apologies for the confusion.

I checked, and we currently handle that correctly:

>>> f"{X():_}}_}"
  File "<stdin>", line 1
    f"{X():_}}_}"
             ^
SyntaxError: unmatched '}'

we do it here:

We will mention it in the clarifications :+1:

Both of these are valid with our current implementation:

>>> x = 1
>>> f"___{
...     x
... }___"
'___1___'
>>> f"___{(
...     x
... )}___"
'___1___'

It also handles debug expressions:

>>> f"___{
... 1
... +
... 1
... =}___"
'___\n1\n+\n1\n=2___'
>>> print(_)
___
1
+
1
=2___

That could be driven by the parser though, right? The parser can have lookaheads for }, : and :=, and then signal to the lexer to switch back.

Humm seems that I didn’t clarify this correctly and there is some confusion. […]

Got it. I glanced over that (I seem to be missing a lot of detail today – must be the weather :-).

I checked, and we currently handle that correctly:

Ah, but who says I don’t want to allow }} in a format specifier? Previously maybe we couldn’t but now we could, right?

(Also, could we allow \{ and \} to escape curlies? If we could, would we want to? The doubling feels jarring because that’s not how Python escapes anything else.)

Both of these are valid with our current implementation:

That’s a relief. I guess it’s another part I skimmed incorrectly. The PEP has “Comments, using the # character, are possible only in multi-line f-string literals” which made me think that these also weren’t possible. But if I added a comment to my examples those should work, right?
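
That is, something like this (just to spell out the question):

f"___{
    x  # a comment inside the braces
}___"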

Definite +1 on the PEP as a whole.

Like others in the thread, I voted in favour of grammar consistency in the poll (allowing quote reuse), but I would also like to see the PEP more strongly discourage actually relying on that in manually written code, in the form of a proposed addition to PEP 8.

The section on string quotes would be the appropriate place for an addition like:

When an f-string contains nested f-strings in subexpressions, use different quote styles at each level of nesting to make it easier to distinguish nested strings from the end of the containing f-string. (By implication, f-string nesting should never exceed 4 levels outside auto-generated code intended solely for machine consumption)

(The part in parentheses probably shouldn’t be added to PEP 8, but could be included in the PEP 701 section on adding the new PEP 8 text)
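
For example, something along these lines would follow that guidance (illustrative):

name = "world"
# Outer level uses double quotes, the nested f-string uses single quotes.
print(f"outer {f'inner, hello {name}!'} done")
# -> outer inner, hello world! done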

Maybe tweak the phrasing to be “f-strings that span multiple lines”? (As you say, it’s possible to span multiple lines without using triple quotes to make an explicitly multi-line string)

In my view, the various mentions of nested f-strings and unreadable code as a possibility are not particularly compelling arguments. It is always possible to write unreadable code.

O0OO0 = 1
OO0O0 = 0
OOO00 = O0OO0 + OO0O0 - ...

And I haven’t even started defining functions yet!


The backslash restriction shows up pretty naturally when working with \n. The other day, I ran into the following real-world example:

f"{'\n'.join(mylist)}\n"

It is not intuitive that this is rejected with a SyntaxError, which makes it a (possibly unnecessary) point of friction for developers.
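
The usual workaround today is to hoist the escape into a variable; a minimal illustration with made-up data:

mylist = ["spam", "eggs"]
nl = "\n"
print(f"{nl.join(mylist)}\n")  # works today; inlining the '\n' is a SyntaxError before PEP 701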

“It is already possible to write unreadable code so we should not discourage more of it” is definitely not a compelling argument. You might as well jump straight to “it’s possible to write unsafe Python code so there’s no compelling argument against C++”.

I take your meaning, but that’s not what I’m trying to say.

There are many examples in this thread which are very confusing but also intentionally constructed, like

f"{some + f"{nested + f"string {concat}"}"} example"

This doesn’t seem to me to be an attempt to show how the feature, being used merely “cleverly” or by a beginner, makes code difficult to understand. Rather, it intends to show that it is possible to construct something unreadable.

There is a good amount of discussion about nested f-strings here, but I don’t know that anyone in favor of the PEP cares very much about nested f-strings? More than one comment seems to suggest a view that no one yet knows of a legitimate and useful case for nested f-strings, so we’d probably keep them out during code reviews.

That line of argument might not be convincing to those against. But nobody seems to be strongly in favor of nested f-strings as a particular consequence of this proposal.

This will depend on what one considers legitimate, but I found myself writing

print(f"{some_dict["foo"]}")

not once, but twice in the same day when experimenting in the REPL recently. Even though I had recently read this thread, I was still confused for a while before I realized what the error was. So quick experiments in the REPL are one use case that I would like this feature for.

That usage involves reusing the outer quote character inside of the f-string.
By nested f-strings, I mean any use of an f-string inside of the {} replacement fields:

f"f-string {'containing ' + f'another {"f-string"}'}"

10 posts were split to a new topic: String joining design

We actually started with that, but it turned out to be a bad idea for the following reasons:

  • The tokenizer needs to be able to tokenize the source without the parser driving the modes. This is because for many operations (such as some error reporting, tokenize.py and others) the parser is actually not available.
  • When the parser backtracks, we would need the tokenizer to “untokenize” the source. This is not only quite chaotic but also complicated to get “right” with the memoization.
  • It breaks the nice separation between the tokenizer and parser. This has proven to help debugging a lot because we can check using our exposed functions that the tokenization works correctly before having to debug the parser. Debugging the parser at the same time as the tokenizer is quite painful.

In general, we found that just modifying the tokenizer to know when to switch is not only much easier, but also much cleaner and more debuggable, and requires far fewer changes. At the end of the day, one of our cornerstones is making this part of CPython more maintainable, so our implementation went this route.
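
To illustrate just the general idea, here is a toy sketch in pure Python (nothing like the real C tokenizer): the lexer itself keeps a mode flag plus a bracket-depth counter and can decide where a replacement field ends with no signal from the parser.

def toy_fstring_split(body):
    """Split the body of an f-string (the text between the quotes) into
    ('middle', text) and ('expr', text) chunks.  Toy only: it ignores
    strings, nested f-strings, format specs and error handling."""
    chunks = []
    mode = "middle"   # or "expr"
    depth = 0         # bracket depth inside the current replacement field
    buf = []
    i = 0
    while i < len(body):
        ch = body[i]
        if mode == "middle":
            if body[i:i + 2] in ("{{", "}}"):   # doubled brace: a literal brace
                buf.append(ch)
                i += 1
            elif ch == "{":                     # start of a replacement field
                chunks.append(("middle", "".join(buf)))
                buf, mode, depth = [], "expr", 0
            else:
                buf.append(ch)
        else:  # mode == "expr"
            if ch in "([{":
                depth += 1
                buf.append(ch)
            elif ch in ")]}" and depth > 0:
                depth -= 1
                buf.append(ch)
            elif ch == "}":                     # depth == 0: the field ends here
                chunks.append(("expr", "".join(buf)))
                buf, mode = [], "middle"
            else:                               # a ':' at depth 0 would switch to
                buf.append(ch)                  # format-spec mode, elided in this toy
        i += 1
    if buf:
        chunks.append((mode, "".join(buf)))
    return chunks

print(toy_fstring_split("a {x[1:3]} b {{not a field}}"))
# [('middle', 'a '), ('expr', 'x[1:3]'), ('middle', ' b {not a field}')]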

Thanks, this actually makes sense.

Trying to get the discussion back on track (away from .join()) a bit, have you thought about my parenthetical remark about using \{ and \} to escape curlies instead of, or in addition to, {{ and }}? I believe that when PEP 498 was discussed, there was a good reason to use the latter, but somehow I can’t find that reason in the PEP, and I can’t recall what it was. Maybe it was for compatibility with str.format()? The doubling has always irked me, and I think I’ve forgotten that it existed several times over the years (because I so rarely have a real need to put a curly brace in an f-string).

Not yet; I want to discuss this in detail with @lys.nikolaou and @isidentical, and currently we are focusing on making the tokenize.py changes and acting on the feedback so far. We will discuss it and check whether it is too hard to support. If it proves to be trivial and we all agree, we can add it to the proposal. If it proves to be tricky, we can do it as a follow-up (maybe in an issue and not in a PEP).

In any case, I think the question is whether we can support both, right (for backwards compatibility)? The doubled { makes it somewhat harder to implement, but I don’t think there is a way around supporting it.

I believe that when PEP 498 was discussed, there was a good reason to use the latter, but somehow I can’t find that reason in the PEP,

Meanwhile, maybe @ericvsmith knows the answer to this?

Yes, that was it.