PEP 701 – Syntactic formalization of f-strings

10 posts were split to a new topic: String joining design

We actually started with that, but it turned out to be a bad idea for the following reasons:

  • The tokenizer needs to be able to tokenize the source without the parser driving the modes. This is because for many operations (such as some error reporting, tokenize.py and others) the parser is actually not available.
  • When the parser does backtracking, we would need the tokenizer to “untokenize” the source. This is not only quite chaotic but also complicated to get “right” with the memoization.
  • It breaks the nice separation between the tokenizer and parser. This has proven to help debugging a lot because we can check using our exposed functions that the tokenization works correctly before having to debug the parser. Debugging the parser at the same time as the tokenizer is quite painful.

In general, we found that just modifying the tokenizer to know when to switch is not only much easier, but also much cleaner and more debuggable, and requires far fewer changes (see the sketch below). At the end of the day, one of our cornerstones is making this part of CPython more maintainable, so our implementation went this route.
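For illustration, here is a minimal sketch of what that separation buys you on CPython 3.12 and later: the tokenizer can be exercised entirely on its own through tokenize.py, with no parser involved (the token stream in the comment is indicative, not guaranteed):

# Tokenize an f-string without ever invoking the parser (Python 3.12+).
import io
import tokenize

src = 'f"hello {name}"\n'
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))

# Expected output (roughly): FSTRING_START 'f"', FSTRING_MIDDLE 'hello ',
# OP '{', NAME 'name', OP '}', FSTRING_END '"', NEWLINE, ENDMARKER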

2 Likes

Thanks, this actually makes sense.

Trying to get the discussion back on track (away from .join()) a bit, have you thought about my parenthetical remark about using \{ and \} to escape curlies instead of, or in addition to, {{ and }}? I believe that when PEP 498 was discussed, there was a good reason to use the latter, but somehow I can’t find that reason in the PEP, and I can’t recall what it was. Maybe it was for compatibility with str.format()? The doubling has always irked me, and I think I’ve forgotten that it existed several times over the years (because I so rarely have a real need to put a curly brace in an f-string).
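For reference, this is the doubled-brace escaping the post refers to; the \{ form is the suggestion here, not current syntax:

# Today: doubled braces are the only way to get literal braces.
print(f"{{x}}")           # prints: {x}
print("{{x}}".format())   # prints: {x} (the same convention as str.format)
# The suggestion above would additionally (or instead) allow f"\{x\}".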

3 Likes

Not yet; I want to discuss this in detail with @lys.nikolaou and @isidentical, and currently we are focusing on making the tokenize.py changes and acting on the feedback so far. We will discuss it and check if it is too hard to support. If it proves to be trivial and we all agree, we can add it to the proposal. If it proves to be tricky, we can do it as a follow-up (maybe in an issue and not in a PEP).

In any case, I think the question is whether we can support both, right (for backwards compatibility)? The double { makes it somewhat harder to implement, but I don’t think there is a way around supporting it.

I believe that when PEP 498 was discussed, there was a good reason to use the latter, but somehow I can’t find that reason in the PEP,

Meanwhile, maybe @ericvsmith knows the answer to this?

Yes, that was it: compatibility with str.format().

It’s unclear from the PEP whether the following (newlines inside f-string expressions) are valid or not:

f'{1
+
1}'

f'{1\
+\
1}'

# triple-quoted f-string
f'''{1
+
1}'''

f'''{1\
+\
1}'''

Similar code is valid in string interpolation in other languages, e.g. C# (see [Proposal]: Remove restriction that interpolations within a non-verbatim interpolated string cannot contain new-lines. · Issue #4935 · dotnet/csharplang · GitHub) and Kotlin (see the Kotlin language specification).

Thanks for your comment! I answered this partially here:

But I can confirm that this will soon be incorporated into the PEP document 👍
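For reference, this is how it ended up working in CPython 3.12, which implements the PEP: newlines are accepted inside the braces even for single-quoted f-strings (a quick sketch; 3.11 and earlier reject it):

# Valid on Python 3.12+; a SyntaxError on 3.11 and earlier.
print(f'{1
+
1}')  # prints: 2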

1 Like

The SC (sans Pablo, who didn’t get a vote since he’s one of the PEP authors) is happy to accept PEP 701. Thanks for the hard work in getting f-strings to an even better place 🙂

19 Likes

A big thanks to the authors of this PEP and its implementation. This is a nice improvement!

I’m implementing support for PEP 701 in pyright, which has its own tokenizer and parser. I tried to implement it based on the documentation in the PEP, but I found that there’s an important detail missing in the “How to produce these tokens” section. Step 3 doesn’t say anything special about what happens after a : character is encountered. It says that the tokenizer should go back to step 2 after emitting a : token and popping the current tokenizer mode from the stack. However, when in “format specifier” mode (i.e. after the : is encountered), I think the rules for step 2 need to change so double braces ({{ or }}) are no longer assumed to be part of an FSTRING_MIDDLE token.

To test this, I tried f"{'':{{1}.pop()}}" with the latest CPython implementation. This f-string parses without error (and evaluates to ' ', since the nested field’s expression {1}.pop() yields 1, making the effective format spec 1). The {{ after the : is clearly being treated as two separate { tokens rather than as part of an FSTRING_MIDDLE token.

If I use the same subexpression prior to the :, it works consistently with the PEP’s “step 2” description, and the {{ is treated as part of an FSTRING_MIDDLE token. As proof, the f-string f"{{1}.pop()}" results in a syntax error unless a space is added between the two opening braces.

Could the authors of the PEP confirm that my assumption is correct? If so, would it make sense to update the PEP so it accurately reflects the intended behavior?
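For illustration, the behavior can be observed directly with tokenize.py on CPython 3.12 and later; on my reading, the {{ after the : comes out as two separate OP tokens, matching the assumption above:

# Inspect how the format specifier is tokenized (Python 3.12+).
import io
import tokenize

src = "f\"{'':{{1}.pop()}}\"\n"
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))

# After the OP ':' token, the {{ should appear as two OP '{' tokens:
# the first opens a nested replacement field, the second starts the
# set literal {1}, so no FSTRING_MIDDLE '{' is produced.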

Edit: Another related question… in “step 3”, the PEP implies that when a = or ! token is seen, the tokenizer should switch back to step 2. I presume that’s incorrect and it should stay in step 3 and continue to produce normal tokens (as opposed to FSTRING_MIDDLE tokens) until it sees a : or }.

Another small topic that isn’t mentioned in the PEP has to do with nested format specifiers. It appears that the implementation supports two levels of nesting but not three.

f"{'':*^{1:{1}}}" works
f"{'':*^{1:{1:{1}}}}" produces a syntax error (f-string: expressions nested too deeply)

The PEP indicates that there is a “lower bound of 5 levels of nesting”. I presume this statement applies to nesting of f-strings, not of format specifiers within an f-string. Would it make sense for the PEP to mention a separate lower bound for format specifier nesting?

Thanks @erictraut for the message! I am very excited to know that you are adding PEP 701 support in pyright! Thanks a lot for the great work :slight_smile:

Let me answer the points that you raise.

“How to produce these tokens” section. Step 3 doesn’t say anything special about what happens after a : character is encountered. It says that the tokenizer should go back to step 2 after emitting a : token and popping the current tokenizer mode from the stack. However, when in “format specifier” mode (i.e. after the : is encountered), I think the rules for step 2 need to change so double braces ({{ or }}) are no longer assumed to be part of an FSTRING_MIDDLE token.

I think we may want to be a bit more explicit here, but the section was intended to be indicative rather than prescriptive, as different tokenizers will require different approaches to the problem. You have a good point, though: there is ambiguity over how the format specifier should be tokenized in general, so we should describe that better. I will try to make a PR after the beta freeze, as we are currently fully immersed in the Python tokenizer changes.

You are right here, but note this is not new: it is something we just preserved from the previous implementation:

Python 3.11.1 (main, Dec 15 2022, 18:32:52) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> f"{'':{{1}.pop()}}"
' '

Another small topic that isn’t mentioned in the PEP has to do with nested format specifiers. It appears that the implementation supports two levels of nesting but not three.

`f"{'':*^{1:{1}}}"` works
`f"{'':*^{1:{1:{1}}}}"` produces a syntax error (f-string: expressions nested too deeply)

This is also something we preserved from the previous implementation:

Python 3.11.1 (main, Dec 15 2022, 18:32:52) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> f"{'':*^{1:{1:{1}}}}"
  File "<stdin>", line 1
    f"{'':*^{1:{1:{1}}}}"
                         ^
SyntaxError: f-string: expressions nested too deeply

For both of these things we just preserved existing behavior, while for the nesting of f-strings we had to dictate it, because arbitrary nesting is now possible while before it was not.
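To make the f-string nesting bound concrete, here is a sketch of the arbitrarily deep nesting (with quote reuse) that PEP 701 enables, which is why a lower bound had to be dictated at all:

# Five levels of f-string nesting, all reusing double quotes (Python 3.12+).
print(f"{f"{f"{f"{f"{'deep'}"}"}"}"}")  # prints: deep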

Would it make sense for the PEP to mention a separate lower bound for format specifier nesting?

I think this is something that makes sense, just so it is clearer what the default prescription/recommendation is.

1 Like

Personally, I don’t find any of the arguments against supporting quote reuse to be compelling. I find the benefits of restoring referential transparency inside f-string expressions far outweigh any concerns I’ve seen so far, for a couple of reasons:

One, the current status quo can cause huge problems when generating code from a template (like with Mako), and can still be a pain even in normal everyday use (see the sketch at the end of this post). It’s just simpler to think of an f-string expression as a normal Python expression without any special restrictions.

Two, allowing quote reuse inside f-strings doesn’t, by itself, make code unreadable. What people are worrying about is actually the abuse of quote reuse.

While discouraging the abuse of language syntax is a noble goal, I personally feel that it should never come at the expense of legitimate use of the language. In my opinion a language should empower developers to write concise, readable, and expressive code first. Discouraging misuse of syntax should always come second. That’s because being able to provide the expressive power to describe a legitimately complex system in clear and simple terms is always going to be more important for a language.

Anyhow, any sufficiently powerful syntax can be abused; the question is whether misuse is easy enough that it becomes a footgun. IMO that is not the case here.
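To make the first point concrete, here is a small sketch (CPython 3.12+) of the kind of everyday expression that quote reuse makes legal, with no quote-juggling needed when the f-string is generated from a template:

# Valid on Python 3.12+; a SyntaxError on 3.11 and earlier because the
# inner quotes match the outer ones.
d = {"name": "world"}
print(f"hello {d["name"]}")  # prints: hello world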

1 Like

Not sure this is worth debating any more… the PEP was accepted into Python 3.12 🙂

12 Likes