Indented multi-line string literals

alextretyak · July 26, 2021, 12:00am

I propose adding support for indented multi-line string literals [similar to triple-quoted string literals in Julia or Swift] to Python via an I (or i) prefix [i means indented], so that the contents of the string can be indented to match the surrounding code.

For example, this code:

print(I"""First line.
          Second line.""")
print(RI"""//\\
           \\//""")

will be equivalent to:

print("First line.\nSecond line.")
print("//\\\\\n\\\\//")

This is especially useful for strings inside deeply nested functions (inspired by this question):

    def method():
        string = I"""line one
                     line two
                     line three"""

or

    def method():
        string = \
            I"""line one
                line two
                line three"""

looks much better than

    def method():
        string = """line one
line two
line three"""

or even

    def method():
        string = """\
line one
line two
line three"""

and will work faster than using textwrap.dedent() or something like that.
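For comparison, here is a minimal sketch of the textwrap.dedent() approach that exists today. The common indentation is stripped by a function call each time the statement executes, which is the run-time cost the proposal aims to avoid. The function name method is taken from the examples above; returning the string is added only for illustration.

```python
import textwrap


def method():
    # The backslash after the opening quotes suppresses the first newline,
    # so every line of the text carries the same 8-space indentation.
    string = textwrap.dedent("""\
        line one
        line two
        line three""")
    return string


print(method())
```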

To avoid the ‘combinatorial explosion’ mentioned here, I think that the I prefix should always come last, i.e. prefixes such as IR or If are not allowed. (Also, I looks like a vertical bar |, so it makes more sense for the quotes to follow right after it.)

P.S. I know that it’s not a new idea, but still I think this feature should be added to Python, sooner or later, in some form or another.

Do you find this feature useful?


cameron · July 26, 2021, 12:24am

[…]

I have mixed feelings about this. Mostly reluctance to add yet another
permutation to the string literals.

That said, I actually use indented strings a lot in docstrings, eg:

class Foo:
    def method(....):
        ''' Describe the method here
            with multiple lines of description.

            Blah blah blah.
        '''

which, like you, I find FAR FAR more readable.

For myself, I use dedent to unindent these - I remove the first line,
dedent the rest, put back the lstrip()ed first line. Oh, how I hate what
black does to these when it autoformats.

I’ve got a stripped_dedent() function for this purpose here:
https://hg.sr.ht/~cameron-simpson/css/browse/lib/python/cs/lex.py#L385
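The three steps described above (set the first line aside, dedent the rest, put back the lstrip()ed first line) might look roughly like this. It is a sketch based only on the description in this post, not the code at the link; the name strip_and_dedent is hypothetical.

```python
import textwrap


def strip_and_dedent(text):
    """Hypothetical helper: dedent a string whose first line starts right after the quotes."""
    # Split off the first line, which has no meaningful indentation of its own.
    first, newline, rest = text.partition("\n")
    # Dedent the remaining lines together, then reattach the stripped first line.
    return first.lstrip() + newline + textwrap.dedent(rest)


class Foo:
    def method(self):
        ''' Describe the method here
            with multiple lines of description.

            Blah blah blah.
        '''


print(strip_and_dedent(Foo.method.__doc__))
```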

Cheers,
Cameron Simpson cs@cskk.id.au

steven.daprano · July 26, 2021, 12:40am

Please see this:

https://bugs.python.org/issue36906

EpicWink · July 26, 2021, 7:59am

As stated in the linked bugs.python.org thread, you can use inspect.cleandoc:

>>> from inspect import cleandoc as I
>>> print(I("""spam
...            eggs"""))
spam
eggs

alextretyak · July 26, 2021, 8:40am

from inspect import cleandoc as I

print(I("""customer:
               first_name:   Dorothy
               family_name:  Gale
           """))

prints this:

customer:
first_name:   Dorothy
family_name:  Gale

Also it removes all empty lines at the beginning and end.

A closer look reveals that implementing the proposed syntax outside the CPython core [i.e. in some library function] is simply impossible, because the starting position of the string literal within its line of source code is known only to the lexer and parser.
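To make the point about the starting position concrete, here is a rough sketch of what the dedenting step could look like if the column were known. It assumes the proposed semantics are "strip from each continuation line as many spaces as the column at which the string's content starts", which is an interpretation of the examples above rather than a specification. The helper name dedent_from_column is hypothetical, and the column has to be passed in by hand precisely because a library function cannot discover it.

```python
def dedent_from_column(text, column):
    """Hypothetical helper: strip up to `column` leading spaces from each continuation line."""
    first, *rest = text.split("\n")
    stripped = []
    for line in rest:
        # Remove at most `column` leading spaces; never remove non-space characters.
        prefix = line[:column]
        stripped.append(prefix.lstrip(" ") + line[column:])
    return "\n".join([first, *stripped])


# The value of the literal from the example above, written out explicitly.
literal = (
    "customer:\n"
    "               first_name:   Dorothy\n"
    "               family_name:  Gale\n"
    "           "
)

# In `print(I("""customer:` the content starts at column 11.
result = dedent_from_column(literal, column=11)
print(result)
```

With the column supplied, the nested keys keep their four spaces of relative indentation, which inspect.cleandoc discards.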

pf_moore · July 26, 2021, 9:24am

I’m curious. How do other languages handle this? It’s not like Python is the only language with multi-line strings, and even though Python is unusual in being indentation-sensitive, people still write code indented in other languages, so they’ll still want a solution…

I’m pretty sure people embed indented multi-line text in other languages (SQL and ASCII art in C/C++, for example). Are C/C++ programmers debating how to do this neatly?

Is there any precedent for having language support for this?

Docstrings may be a special case here - most other languages I know of have documentation comments that are parsed specially by the compiler/documentation generator, and which (because they are special syntax) can be given special treatment.

Note - I’m not against a feature like this, I’m just not convinced it’s important enough to warrant all the energy spent on it.

alextretyak · July 26, 2021, 11:00am

Well, I’ve already mentioned Julia and Swift (their triple-quoted string literals always take into account indentation without any prefixes).

But it should be noted that multi-line string literals in these languages work slightly differently:

"""
hello
"""

is equivalent to "hello" in Swift, but to "hello\n" in Julia.

In other languages there is an alternative called “heredoc”.

PHP since version 7.3 has Flexible Heredoc and Nowdoc Syntaxes:

echo <<<END
      a
     b
    c
    END;
/*
  a
 b
c
*/

The indentation of the closing marker (END;) dictates the amount of whitespace to strip from each line within the heredoc.

In Perl since version 5.26, heredocs can include indentation:

#prints "Hello there\n" with no leading whitespace.
if (1) {
  print <<~EOF;
    Hello there
    EOF
}

Ruby provides the “<<~” syntax for omitting indentation in a here document:

puts <<~EOF
  This line is indented two spaces.
    This line is indented four spaces.
      This line is indented six spaces.
  EOF

The common indentation of two spaces is omitted from all lines:

This line is indented two spaces.
  This line is indented four spaces.
    This line is indented six spaces.

I think that looking at C++ is not very relevant, because C++ is rarely used to generate web-pages or something like that, where indented multi-line strings or heredocs are most useful.

pf_moore · July 26, 2021, 12:47pm

Thanks. Sounds like those languages have a very straightforward “remove the common whitespace prefix” behaviour, which is what textwrap.dedent does (rather than inspect.cleandoc). That suggests that docstrings are special, because the ideal indent rules for them don’t match the common rules for other forms of “embedded multiline strings”.
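The difference between the two standard-library functions can be seen on a small YAML-like string. This is only an illustrative sketch; the variable names are made up for the example.

```python
import inspect
import textwrap

yaml_text = """\
    customer:
        first_name:   Dorothy
        family_name:  Gale
"""

# textwrap.dedent removes only the whitespace common to all lines,
# so the nested keys stay indented relative to "customer:".
dedented = textwrap.dedent(yaml_text)

# inspect.cleandoc treats the first line separately: it strips that line,
# computes the common indentation from the remaining lines only,
# and drops blank lines at the beginning and end.
cleaned = inspect.cleandoc(yaml_text)

print(dedented)
print(cleaned)
```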

Whether the existence of docstrings and the difference in what’s preferred behaviour for them should mean that Python needs a more complex solution than other languages is a question that needs to be answered (IMO, the answer should be “no”, we have to draw the line somewhere).

If this is just for generating webpages “or something like that”, then IMO it’s not generally useful enough to warrant language support.

But having said that, embedding chunks of SQL in code as strings is another use case, and it’s definitely something you see in C/C++, Java, etc. Embedding things like YAML/TOML config data is also something I’ve seen (in test cases and occasionally as “the default if no config file exists”). You gave YAML as an example yourself, in your comment about cleandoc…

Consider me somewhat +1 on having something better than textwrap.dedent(). I like the .dedent() method for strings suggested in the issue @steven.daprano linked. But I’m against yet another string prefix. The number of string prefixes Python has feels like it’s getting out of hand, and I don’t think this warrants adding to the complexity.

steven.daprano · July 26, 2021, 11:17pm

I agree with Paul that we should be cautious about adding yet more
string prefixes. We already have b, r, u and f plus uppercase and
combinations.

(u’’ is used only for backwards compatibility with Python 2, which means
we could deprecate it and remove it if necessary.)

I suspect that the only advantage of a string prefix is that it
guarantees that the indentation is normalised at compile-time, rather
than leaving it up to the interpreter to decide. But to me, I don’t take
that feature as important enough to mandate compile-time processing. I
think that allowing interpreters to use the peephole optimizer to shift:

<string literal>.dedent()

to compile-time, without mandating that all interpreters must do it, is
sufficient.

Aside from the legacy u prefix, which no longer has a meaning, the other
prefixes all change the way the string is parsed. b restricts legal
characters to ASCII:

>>> b'--μ--'
  File "<stdin>", line 1
    b'--μ--'
            ^
SyntaxError: bytes can only contain ASCII literal characters.

f turns parts of the string into evaluated code, and r changes the
meaning of backslashes. A “dedent” prefix would be the only one which
could be handled by a post-processing step.

I am +1 into making the dedent functionality more readily available, -1
on using a string prefix for it.

gpshead · July 27, 2021, 12:41am

A strong motivating reason for doing this at the language level is for it to happen at compile time. Issue 36906 (“Compile time textwrap.dedent() equivalent for str or bytes literals”) on the Python tracker already summarizes my thoughts on next steps. We’ve got a .dedent() method PR. Moving that forward, followed by a PR that optimizes the case where it is called on a constant literal so that the work happens at compile time, makes sense.
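As a rough illustration of what "doing it at compile time for a constant literal" means, the sketch below folds calls at the AST level before compiling. Since str has no .dedent() method at the time of this thread, it folds textwrap.dedent(<string constant>) instead. The class name FoldDedent is hypothetical, and this is not how CPython's optimizer is implemented; it only shows the idea of replacing the call with its result.

```python
import ast
import textwrap


class FoldDedent(ast.NodeTransformer):
    """Replace textwrap.dedent(<string constant>) with the already-dedented constant."""

    def visit_Call(self, node):
        self.generic_visit(node)
        func = node.func
        is_dedent = (
            isinstance(func, ast.Attribute)
            and func.attr == "dedent"
            and isinstance(func.value, ast.Name)
            and func.value.id == "textwrap"
        )
        if (
            is_dedent
            and len(node.args) == 1
            and not node.keywords
            and isinstance(node.args[0], ast.Constant)
            and isinstance(node.args[0].value, str)
        ):
            # Do the work once, here, instead of on every execution.
            folded = ast.Constant(value=textwrap.dedent(node.args[0].value))
            return ast.copy_location(folded, node)
        return node


source = (
    "import textwrap\n"
    "text = textwrap.dedent('    line one\\n    line two\\n')\n"
)

tree = ast.fix_missing_locations(FoldDedent().visit(ast.parse(source)))
code = compile(tree, "<folded>", "exec")

namespace = {}
exec(code, namespace)
folded_text = namespace["text"]
print(repr(folded_text))
print(code.co_names)  # the name "dedent" is no longer referenced
```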

ammaraskar · July 27, 2021, 1:13am

Yeah, this is such a common idiom that I think getting rid of import textwrap for it would be a great quality-of-life improvement.

Like Gregory and Steven mentioned, since this is primarily meant for constant strings, adding an optimization in the compiler for it provides the benefit of no run-time cost without adding more prefixes/language features.

I’m a big +1 towards that approach.

holdenweb · July 31, 2021, 11:00am

There’s been a huge amount of effort expended on keeping text and data indented the same. While I realise this may not be everyone’s preference, an easy way to achieve the required dedenting is to begin multiline string literals with an escaped newline, meaning the actual text of the literal appears relative to the margin and does not need to be dedented, thus:

print("""\
First line.
Second line.""")

This has the further advantage of clearly delineating text and data in indented code in a way that’s idempotent to blackening - this is actual black output from an indented version of the above.

def x():
    print(
        """\
First line.
Second line."""
    )

A possible disadvantage is that in raw multiline strings there is no way to escape the initial newline:

print(R"""\
//\\
\\//""")

prints the initial backslash. I would argue there are few such use cases, and I might tongue-in-cheek suggest that they could easily be dealt with by a method to trim the leading newline.
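The behaviour described above can be checked directly. The sketch below also shows one possible workaround for raw strings: start the text on the line after the quotes and slice off the leading newline. The slicing is just one way to do it and is not something proposed in this thread; the variable names are made up.

```python
# In a normal string, backslash-newline is a line continuation and vanishes.
plain = """\
First line.
Second line."""

# In a raw string, the backslash and the newline are both kept.
raw = R"""\
//\\
\\//"""

# Workaround: begin with a real newline and drop that one character.
trimmed = R"""
//\\
\\//"""[1:]

print(repr(plain))
print(repr(raw))
print(repr(trimmed))
```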

b11c · August 1, 2021, 11:17am

This is exactly the same pattern I used to do, until Black messed up the docstrings…

ferdnyc · October 14, 2021, 2:20pm

Someone (at least one someone) was wondering how C++ coders deal with this.

A GitHub search of C++ files containing likely long string literals (specifically, SQL code, found in a good percentage of code containing all of the words SELECT, FROM, and WHERE) is proving tedious and unenlightening enough that I’m going to stop paging through it after ~15 pages or so, but I’m seeing a mix of strings constructed into a std::ostringstream, e.g.:

std::ostringstream sql;
sql << "SELECT `something` FROM "
    << "`" << schema_name << "`.`" << table_name << "`"
    << " WHERE `somecolumn` = `the_right_value` "
    << " AND `othercolumn` = " << (want_othercolumn ? "TRUE" : "FALSE");
send_db_query(sql.str());

A lot of implicitly concatenated string literals…

send_db_query("SELECT something, somethingelse FROM some_table, someother_table "
        " WHERE some_table.id = someother_table.id "
        " AND somecolumn = `the_right_value`"
        " AND someother_table.column = TRUE "
        " GROUP BY some_table.id");

And a bunch of… just… really long-ass lines:

send_db_query("SELECT something, somethingelse FROM some_table, someother_table  WHERE some_table.id = someother_table.id AND somecolumn = `the_right_value` AND someother_table.column = TRUE GROUP BY some_table.id");

¯\_(ツ)_/¯ (There was even one neckbeard using sprintf(). Terrifying.)

I know in my own code, the very few times I’ve had to embed multi-line string literals in the actual code (the string to compare our JSON-producing methods’ output to, in unit tests), I’ve used the “raw string that’s shoved over to the first column from line 2” method that’s already been advocated here.

        const std::string expected_json =
R"json({
 "key": "value",
 "position": {
  "x": 10,
  "y": 20
 },
 "size": {
  "width": 100,
  "height": 100
 }
})json";

…But we don’t use a code formatter, so no idea whether they’d typically take issue with that.