Pre-PEP: d-string / Dedented Multiline Strings with Optional Language Hinting

methane · May 7, 2025, 1:12pm

This is just a hack, but it serves a similar purpose.
PHP has heredoc with an END identifier: <<<END. PhpStorm can recognize the END identifier as a language marker: <<<JSON, <<<SQL.

C# doesn’t have such a feature. But since C# is a statically typed language, Visual C# can do syntax highlighting for the new Regex() argument:


Pickup Roslyn: raw string literal | ++C++; // Unidentified Flying C blog

Inline comments like /* lang=html */ are also used.

A similar syntax was proposed for C#, but static analysis was chosen instead:

github.com/dotnet/csharplang

[Proposal]: Embedded Language Indicators for raw string literals

opened 10:47PM - 27 Jun 22 UTC

closed 11:34PM - 19 Nov 24 UTC

333fred

Proposal Proposal champion

# Embedded Language Indicators for raw string literals

* [x] Proposed
* [ ] …Prototype: Not Started
* [ ] Implementation: Not Started
* [ ] Specification: Not Started

## Summary
[summary]: #summary

When we were designing [raw string literals](https://github.com/dotnet/csharplang/issues/4304), we intentionally left the door open for putting a language indicator at the end of the opening `"""` for the multi-line form. This proposal adds the support to do that.

## Motivation
[motivation]: #motivation

In the BCL, we added `StringSyntaxAttribute` for applying to parameters, which allows parameters to indicate the strings passed to them contain some form of embedded language, which is then used for syntax highlighting. However, this only works for strings passed directly to the parameter. For strings first stored in a variable, the only solution is a `// lang = x` comment. This means that, if the IDE wants to extract a multi-line raw string literal, it cannot neatly preserve the highlighting that was used. This syntax form is intended to help bridge that gap.

## Detailed design
[design]: #detailed-design

The existing raw string literal [proposal](https://github.com/dotnet/csharplang/blob/main/proposals/raw-string-literal.md) has the following multi-line grammar:

```antlr
multi_line_raw_string_literal
    : raw_string_literal_delimiter whitespace* new_line (raw_content | new_line)* new_line whitespace* raw_string_literal_delimiter
    ;
```

This is updated to the following:

```antlr
multi_line_raw_string_literal
    : raw_string_literal_delimiter identifier? whitespace* new_line (raw_content | new_line)* new_line whitespace* raw_string_literal_delimiter
    ;
```

Where the `identifier?` token is added right after the delimiter.

## Drawbacks
[drawbacks]: #drawbacks

This form is not equally applicable to all string types, so it would only apply to multi-line raw string literals. Ideas on other forms that could be more broadly applied would be useful: maybe putting the identifier after the closing quote could work?

## Alternatives
[alternatives]: #alternatives

## Unresolved questions
[unresolved]: #unresolved-questions

## Design meetings

* https://github.com/dotnet/csharplang/blob/main/meetings/2022/LDM-2022-09-21.md#embedded-language-indicators-for-raw-string-literals
* https://github.com/dotnet/csharplang/blob/main/meetings/2023/LDM-2023-10-09.md#embedded-language-indicators-for-raw-string-literals

Since Python doesn’t have inline comments, it would be interesting to allow something in the first line.
Another idea is to allow a comment in the first line, instead of a language hint:

query = d"""# lang=SQL. This is comment
    SELECT
        id, name, age
    FROM
        user
    WHERE
        id=?
    """

nedbat · May 7, 2025, 1:15pm

For dedenting, I’ve long used textwrap.dedent or its lesser-known and possibly more convenient cousin inspect.cleandoc. I don’t find the extra function call to be that much trouble or distracting.
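
For readers who have not used either helper, here is a minimal sketch of the two standard-library calls mentioned above. The function names and the SQL text are made up for illustration; only `textwrap.dedent` and `inspect.cleandoc` come from the post.

```python
import inspect
import textwrap


def query_with_dedent() -> str:
    # The backslash after the opening quotes suppresses the leading newline;
    # dedent() then removes the common leading whitespace.
    return textwrap.dedent("""\
        SELECT
            id, name
        FROM
            user
        """)


def query_with_cleandoc() -> str:
    # cleandoc() removes the common indentation and also drops
    # leading and trailing blank lines, so no backslash is needed.
    return inspect.cleandoc("""
        SELECT
            id, name
        FROM
            user
        """)
```

The visible difference is at the edges: the `dedent` version keeps the final newline, while `cleandoc` strips it.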

There seem to be new proposals for string prefixes frequently these days. I’d like to propose a general rule of thumb:

If the behavior you want can be implemented with a function call, then it isn’t going to become a string prefix.

For example, r-, f-, and t- strings cannot be implemented as a function call with another kind of string as an argument. They require special processing before the string becomes a string object.

On the other hand, d-strings can be implemented as a function call. Therefore, I propose that we will continue to recommend the function call, and won’t add a new lexical rule to the language for d-strings.

Nodd · May 7, 2025, 1:54pm

Julia has dedent included natively in multiline strings: Strings · The Julia Language. The rules are not simple but they look practical (I never used Julia myself).

In my code I use global variables, but usually the need doesn’t come up that often.

jamestwebber · May 7, 2025, 2:17pm

This proposal has two (basically unrelated) parts: dedenting and a mechanism for annotating the language in the string, for syntax highlighting. I believe that question is about the second part, not the first.

bwoodsend · May 7, 2025, 2:17pm

I would agree except that it can’t really. If you do a multiline insertion into a big f"string" then it’ll butcher the indentation.

multiline_insertion = """\
    foo
    bar
"""
template = f"""
    Some
    long
    template
    {multiline_insertion}
    blah
    blah
    blah
"""

No use of textwrap.dedent() is going to result in that string looking the way it’s supposed to. The best you can do is to write the {multiline_insertion} bit without indentation then dedent the whole thing afterwards (assuming the two literals are written at the same indentation level – it’s a lot muddier if one of those strings is inside an extra if block).
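
A shortened version of the template above makes the failure concrete. This is only a sketch: `render`, `naive` and `pre_dedented` are names invented here, and the template is trimmed to a few lines.

```python
import textwrap

multiline_insertion = """\
    foo
    bar
"""


def render(insertion: str) -> str:
    # The placeholder is indented like the surrounding template lines.
    template = f"""
    Some
    template
    {insertion}
    blah
"""
    return textwrap.dedent(template)


# Dedenting only the finished template: "foo" ends up indented twice.
naive = render(multiline_insertion)

# Dedenting the insertion first: "bar" now sits at column 0, so the
# common margin is empty and dedent() changes nothing.
pre_dedented = render(textwrap.dedent(multiline_insertion))
```

In the first case only the first inserted line picks up the placeholder's indentation; in the second the template keeps all of its indentation. Both results also carry an extra blank line from the insertion's trailing newline.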

And using \ to get a long single line string is also impossible. Writing:

textwrap.dedent("""
    A long \
    piece of \
    text
""")

results in a string where the indentation of the continuation lines isn’t removed (i.e. '\nA long     piece of     text\n').

jamestwebber · May 7, 2025, 2:22pm

What set of rules would result in that string looking correct? It seems like you’ll have extra newlines no matter what.

Although this works:

-    {multiline_insertion}
+    {multiline_insertion.strip()}

bwoodsend · May 7, 2025, 2:24pm

Yeah, you’re right. In reality I’d append .strip("\n") to the multiline_insertion = """...""".
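
Putting the two suggestions from this exchange together (strip the newline from the insertion, write the placeholder at column 0, dedent the result) gives something like the sketch below. It assumes both literals are written at the same indentation level, as noted earlier in the thread.

```python
import textwrap

# Strip only newlines so the insertion keeps its own indentation.
multiline_insertion = """\
    foo
    bar
""".strip("\n")

# The placeholder is written without indentation; the inserted text
# supplies the same four spaces as the rest of the template.
template = textwrap.dedent(f"""
    Some
    long
    template
{multiline_insertion}
    blah
""")
```

This produces the intended text, at the cost of a placeholder that visually breaks out of the indented block.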

gpshead · May 16, 2025, 2:34am

The one thing that has come up repeatedly in past dedented string discussions is that the reason people want it is to save both runtime overhead and memory. It would be done at compile time, so the resulting string stored in the bytecode and in memory would be smaller, with zero runtime cost. Either saving repeated pain or startup time pain, and always saving memory. The measurable need for that savings impacts huge codebases more so than smaller projects.

I still agree with continuing to use the function call syntax as today’s recommendation. It clearly expresses intent. But it has never been possible to reliably optimize that call away at compile time due to Python being as dynamic as Python… Actual syntax is one way to finish that thought - for constants, a .dedent() method on a str is something that could also be optimized out. That doesn’t require syntax, just a method:

Some of those cross-link to other discussions (one or more of which was probably linked above… I’m not rereading these right now)
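
The memory and runtime point can be checked on CPython by looking at what the compiler stores for a function that calls `textwrap.dedent`. This is a rough sketch: `usage` and its help text are invented, and inspecting `co_consts` is a CPython implementation detail.

```python
import textwrap


def usage() -> str:
    # The literal below is stored in the code object as written,
    # indentation included; dedent() runs again on every call.
    return textwrap.dedent("""\
        usage: tool [options]
            --help  show this message
        """)


# String constants that the compiler stored for usage().
stored = [c for c in usage.__code__.co_consts if isinstance(c, str)]
raw_literal = max(stored, key=len)
```

`raw_literal` still carries the source indentation, so it is longer than what `usage()` returns. That difference, multiplied across a large codebase, is the saving a compile-time dedent would provide.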

methane · May 16, 2025, 2:59am

str.dedent() cannot dedent t-strings and f-strings at compile time.
Additionally, with str.dedent() it is difficult to produce a string that keeps some indentation:

html_parts = """\
      <div>
        Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur sed efficitur ante.
      </div>
    END""".dedent().removesuffix("END")
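
Since `str.dedent()` does not exist today, the same trick can be tried with `textwrap.dedent` standing in for it. This is a sketch with shortened text; `html_fragment` is a made-up name, and `str.removesuffix` needs Python 3.9 or later.

```python
import textwrap


def html_fragment() -> str:
    text = """\
          <div>
            Lorem ipsum
          </div>
        END"""
    # The "END" line is the least indented, so it sets the margin that
    # dedent() removes; the HTML keeps two extra spaces relative to it.
    return textwrap.dedent(text).removesuffix("END")
```

Without the sentinel line the least indented line would be `<div>`, so the fragment would lose its two leading spaces.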

Other examples that str.dedent() cannot solve, from the previous thread:

h-vetinari · May 16, 2025, 4:51am

I programmed a bit in Scala a while ago, and one approach that language has, which I liked a lot, is to denote the “baseline” indentation of a string that is itself indented, which looks like this (docs):

val quote = """The essence of Scala:
               |Fusion of functional and object-oriented
               |programming in a typed setting.""".stripMargin

This produces the exact string

The essence of Scala:
Fusion of functional and object-oriented
programming in a typed setting.

This is particularly useful where the string you’d like to write within some indented code itself has some varying degrees of indentation. Without a baseline marker, it becomes very difficult to tell where the dedent ends and the indentation begins.

With |, there’s a clear visual marker that’s easy to digest visually, as well as easy to lint and/or syntax-highlight on.
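
For comparison, the marker rule is easy to approximate as a plain Python function. This is a rough sketch of the behaviour described above, not Scala's actual implementation; `strip_margin` is a hypothetical helper, and lines without a marker are left untouched.

```python
def strip_margin(text: str, marker: str = "|") -> str:
    """Drop leading whitespace up to and including the first marker on each line."""
    result = []
    for line in text.splitlines(keepends=True):
        stripped = line.lstrip(" \t")
        if stripped.startswith(marker):
            # Keep everything after the marker, including the line ending.
            result.append(stripped[len(marker):])
        else:
            # No marker on this line: leave it exactly as written.
            result.append(line)
    return "".join(result)


quote = strip_margin("""The essence of Scala:
               |Fusion of functional and object-oriented
               |programming in a typed setting.""")
```

Only the first marker on a line is consumed, so a line written as `|| cell |` keeps one leading pipe.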

I could imagine that d-strings could do the .stripMargin part by default, so modifying @methane’s example from above slightly, this could look like

if some_condition:
    html_parts += d"""\
        |  <div>
        |    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur [...]
        |  </div>
        |"""

which would add the following string at the end of html_parts [1]:

# no line break before (see \)
  <div>
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur [...]
  </div>
# with line break at the end, because the final """ was on a new line, and no \ before

[1] minus the lines starting with #, which are for explanation only

blhsing · May 16, 2025, 5:45am

H. Vetinari:

With |, there’s a clear visual marker that’s easy to digest visually, as well as easy to lint and/or syntax-highlight on.

I could imagine that d-strings could do the .stripMargin part by default, so modifying @methane’s example from above slightly, this could look like
if some_condition:
    html_parts += d"""\
        |  <div>
        |    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur [...]
        |  </div>
        |"""
which would add the following string at the end of html_parts [1]:
# no line break before (see \)
  <div>
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur [...]
  </div>
# with line break at the end, because the final """ was on a new line, and no \ before

I don’t really like this idea because:

1. The marker adds one character of indentation to the content so the content would no longer follow the indentation of the surrounding code.
2. The marker on every line makes the content less copy-and-paste-friendly.
3. It makes content with leading literal |s difficult to express, e.g. Markdown tables.

xitop · May 16, 2025, 6:43am

I like the “baseline” idea and in Python I guess the baseline can be made implicitly one level after the d-string start. Modern code editors show a hint of a line there.

            if problem:
               raise ValueError(d"\
                    Multi line message
                    ... line 2 ...
                   ")

It is similar to the way a longer line is usually wrapped now:

            if problem:
               raise ValueError(
                    "Some longer single line message")

This would allow creating text with some intended indentation (pun not planned).

blhsing · May 16, 2025, 6:46am

It might make sense to simply allow the indentation before the closing quote to dictate the level of dedentation, so that the code above can be rewritten as:

if some_condition:
    html_parts += d"""\
          <div>
            Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur [...]
          </div>
        """ # leaves 2 characters of indentation when compared to the lines above

h-vetinari · May 16, 2025, 7:04am

I fail to see the problem. The issue is whether raw strings end up completely disregarding the surrounding indentation, or rather that existing workarounds like dedent are cumbersome. Whether the actual text is indented 16 or 17 characters is negligible visually (and the | would be aligned correctly with other code on the same indentation level).

Indeed, but that cost is worth the gain IMO (aside from the very substantial likelihood that the Python REPL as well as major IDEs would learn to ignore the leading \s*\| when pasting [1]).

That’s a corner case that has some obvious solutions (e.g. don’t use d-strings), but even if we posit this as a required case, I fail to see the issue. In that case you’d just have a double pipe, where the first one would get stripped (and probably greyed out by syntax highlighting)

    the_markdown_table = d"""
        || header | header | header |
        || ------ | ------ | ------ |
        ||  cell  |  cell  |  cell  |
        |"""

That’s fine for machines but terrible for humans. If you’ve got more than a handful of lines, it’ll be very hard to keep track of where the closing quote is (perhaps even off-screen), leading to repeated yet avoidable mistakes. I like the |-approach because it avoids exactly this sort of papercut.

[1] say, if all lines being pasted start with that pattern.

blhsing · May 16, 2025, 7:39am

What I mean is that the quoted content itself often has nested indentation that follows the indentation of the surrounding Python code. The HTML example above uses a 2-character indentation so it isn’t actually representative of my preferred style.

The code I have in mind is more like this (note that the indentation is now 4 characters per level even in the HTML):

first = True
if some_condition:
    if first:
        html_parts += "some header"
        first = False
    html_parts += d"""\
        <div>
            Lorem ipsum dolor sit amet, consectetur adipiscing elit.
        </div>
    """

And with your proposed | baseline marker it would read:

first = True
if some_condition:
    if first:
        html_parts += "some header"
        first = False
    html_parts += d"""\
    |    <div>
    |        Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    |    </div>
    |"""

which makes it obvious that <div> is not indented evenly with the surrounding code such as first = False.

What makes it worse is that the IDE for Python code usually has a tab setting of 4 spaces, so if I press the tab key to further indent Lorem it would incorrectly insert 3 spaces instead of 4 because | already pushes the indentation by 1 character.

My points #2 and #3 are relatively minor and can be worked around like you suggested but there really is no way around the downside of my point #1.

I agree that my suggested workaround isn’t great either but I don’t see an elegant solution so far.
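
For reference, the closing-quote rule suggested earlier (the whitespace before the closing quotes decides how much is stripped) can be sketched as a function over an ordinary string. The helper name, the error handling and the trailing newline are assumptions made for this sketch, not part of any proposal.

```python
def dedent_to_closing_line(text: str) -> str:
    """Strip from every line the whitespace found on the final line."""
    body, _, margin = text.rpartition("\n")
    if margin.strip():
        raise ValueError("closing quotes must be on a line of their own")
    result = []
    for line in body.split("\n"):
        if line.startswith(margin):
            result.append(line[len(margin):])
        elif not line.strip():
            # Blank lines may be shorter than the margin.
            result.append("")
        else:
            raise ValueError(f"line is indented less than the closing quotes: {line!r}")
    # Assumption: closing quotes on their own line imply a final newline.
    return "\n".join(result) + "\n"


html = dedent_to_closing_line("""\
          <div>
            Lorem ipsum
          </div>
        """)
```

Here the closing quotes sit two columns to the left of `<div>`, so the result keeps two spaces of indentation.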

gpshead · May 16, 2025, 12:12pm

Trying to predict and solve all possible use cases instead of taking the practical simple approach means we’re unlikely to ever do anything.

Fancier pie in the sky ideas that could be done should not negate the value of doing the simple thing that works to achieve real savings via str.dedent today.

They’re not in conflict.

bwoodsend · May 16, 2025, 1:08pm

Aren’t two solutions to apparently (but not strictly) the same problem always in conflict? If """...""".dedent() was ever implemented, it would be even harder to get an option that is aware of f-strings, t-strings and trailing backslashes past the “why do we need yet another way of dedenting strings” sayers.

h-vetinari · May 16, 2025, 10:47pm

I would indent the | portion relative to html_parts.

if some_condition:
    if first:
        html_parts += "some header"
        first = False
    html_parts += d"""\
        |    <div>
        |        Lorem ipsum dolor sit amet, consectetur adipiscing elit.
        |    </div>
        |"""

so then the content of html_parts is aligned with first = False.

I haven’t looked at Java-flavoured IDEs in a while, but I’m willing to wager that this is a solved problem for Scala. That IDEs can provide some support for these common workflows doesn’t solve the issue across all editors, but it would still soften the blow substantially.

methane · May 17, 2025, 12:28am

I don’t like Scala’s stripMargin.
It uses a | marker because stripMargin is a string method. It has some of the same limitations as str.dedent().
If we add a method to str, it should behave the same as textwrap.dedent(), not Scala’s stripMargin.

Swift, C#, Julia, and Java have multiline string literals with dedenting. None of them uses a | marker.

Java JEP: JEP 378: Text Blocks
Java Guide: Programmer's Guide to Text Blocks
C#: Raw string literals - """ - C# reference | Microsoft Learn
Swift: Swift Documentation
Julia: Strings · The Julia Language
PHP: PHP: rfc:flexible_heredoc_nowdoc_syntaxes

effigies · May 17, 2025, 2:24pm

Reading that JEP, I would be in favor of adopting those rules for all multiline strings, including f-strings and t-strings. If a d'' tag is needed to support a transition period, fine, but I would hope that it would be like a __future__ import and become unnecessary after a certain point.
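
As a rough illustration of what adopting those rules might mean, here is a simplified Python approximation of the indentation-stripping step in JEP 378. It assumes the text starts on the line after the opening quotes, counts tabs and spaces as one column each, and ignores escape processing; `text_block_dedent` is a hypothetical name.

```python
def text_block_dedent(text: str) -> str:
    lines = text.split("\n")
    # Non-blank lines decide the margin, and so does the last line
    # (the one holding the closing quotes) even when it is blank.
    deciding = [ln for ln in lines[:-1] if ln.strip()] + [lines[-1]]
    margin = min(len(ln) - len(ln.lstrip(" \t")) for ln in deciding)
    # Remove the margin from non-blank lines, strip trailing whitespace,
    # and reduce blank lines to empty ones.
    return "\n".join(
        ln[margin:].rstrip(" \t") if ln.strip() else "" for ln in lines
    )


block = text_block_dedent("""\
        <div>
            Lorem ipsum
        </div>
    """)
```

Because the closing quotes are indented less than the content, they set the margin and the content keeps four spaces. Aligning them with `<div>` would strip the indentation completely.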