D-string vs str.dedent()

I previously created a thread about str.dedent().
Before writing a PEP for it, I would like to consider adding a prefix of d/D to string literals (d-string).

This is a rough idea of how d-strings would work:

  • A d-string starts with a triple quote (""" or ''') followed by a newline. That newline is not included in the string.
    • d"spam" and d'''egg''' are syntax errors.
  • A d-string ends with indentation followed by the closing triple quote. Only the indentation to be removed is allowed before the closing triple quote on that line.
  • Indentation matching that of the last line is stripped from every line.
  • A d-string can be combined with the f/F and r/R prefixes.
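
The rules above can be emulated at run time to experiment with the proposed semantics. This is only a sketch: d_dedent is a hypothetical helper, not part of the proposal, and it works on the already-evaluated text between the triple quotes, so it cannot reproduce anything that depends on the source representation. How blank lines inside the body are treated is my assumption; the rules above do not spell it out.

```python
def d_dedent(raw):
    # `raw` is the text between the opening and closing triple quotes.
    if not raw.startswith("\n"):
        raise ValueError("content must start with a newline")
    lines = raw[1:].split("\n")
    indent = lines.pop()  # the closing line holds the indent to remove
    if indent.strip(" \t"):
        raise ValueError("closing line may contain only indentation")
    out = []
    for line in lines:
        if line.startswith(indent):
            out.append(line[len(indent):])
        elif not line.strip(" \t"):
            out.append("")  # assumption: whitespace-only lines become empty
        else:
            raise ValueError("line is indented less than the closing line")
    return "".join(line + "\n" for line in out)


raw = "\n      Lorem ipsum dolor\n\n      sit amet, consectetur\n    "
print(repr(d_dedent(raw)))
```

With a four-space closing line and six-space content lines, two spaces of indentation survive on each content line.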

Example:

    text = d"""
      Lorem ipsum dolor

      sit amet, consectetur
    """

    # This is equivalent to above.
    text = "  Lorem ipsum dolor\n\n  sit amet, consectetur\n"

    # current syntax
    text = """\
  Lorem ipsum dolor

  sit amet, consectetur
"""

The advantages of d-string over str.dedent are as follows:

  • It has slightly less visual noise than .dedent(). It would be nice to use with multiline strings in function arguments. (sample)
  • It can be combined with f-strings. The indentation can be statically calculated and removed at compile time before evaluating the f-string.
  • The string can start with d""" instead of """\. The compiler will remove the first newline, which can reduce visual noise.
    • d-strings might make triple quotes and the first newline mandatory.
  • It is possible to specify the amount of indentation to remove on the final line of the string.

On the other hand, d-string would complicate Python’s syntax, as opposed to str.dedent(), which is just one method. There are currently 14 combinations in stringprefix. I don’t know how difficult it would be to add d/D to this.

stringprefix ::=  "r" | "u" | "R" | "U" | "f" | "F"
                     | "fr" | "Fr" | "fR" | "FR" | "rf" | "rF" | "Rf" | "RF"
9 Likes

There’s a small benefit to the fd-string idea, in that a dedent() method (since it would only see the resulting string) would potentially be affected by indentation caused by interpolated values; but I don’t think that that’s going to be hugely impactful. I highly doubt that the removal of the triple quotes will be possible, though - currently, in Python, a string that doesn’t start with three quotation marks WILL end within a single line, regardless of its prefixes. Though the rule has been weakened slightly, to the benefit of the language; previously, you could always scan for the end of a string without needing to know its prefix, but that led to this quirk:

Python 3.9
>>> f"What is this? {'"'} It is unbalanced!"
  File "<stdin>", line 1
    f"What is this? {'"'} It is unbalanced!"
                                            ^
SyntaxError: EOL while scanning string literal


Python 3.13.0a0
>>> f"What is this? {'"'} It is unbalanced!"
'What is this? " It is unbalanced!'

So there are definitely some small advantages to be gained here. But what are the disadvantages, the costs? Firstly, this is actual syntax, as opposed to simply being a method. That means you can’t try/except to test for this functionality:

Python 3.7

>>> try: "".removeprefix
... except AttributeError: print("fallback, removeprefix not available")
... 
fallback, removeprefix not available
>>> try: f"{1=}"
... except SyntaxError: print("fallback, self-documenting not available")
... 
  File "<fstring>", line 1
    (1=)
      ^
SyntaxError: invalid syntax

That’s an inherent benefit of non-syntactic changes. Since I personally run early alphas of Python a lot of the time, but I deploy code to other people’s computers, this is something I frequently stumble upon. New methods are cheaper than new syntax. And as to the string prefix combinations… uhh…
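
For what it's worth, new syntax can still be probed at run time, just not with a plain try/except around the code itself. A minimal sketch, where supports_syntax is a hypothetical helper name: compile() parses the snippet when it is called, so the SyntaxError becomes catchable.

```python
def supports_syntax(source):
    # compile() parses the snippet at run time, so a SyntaxError
    # raised by unsupported syntax can be caught here.
    try:
        compile(source, "<feature-test>", "exec")
    except SyntaxError:
        return False
    return True


# The result depends on the interpreter version running this.
print(supports_syntax('f"{1=}"'))
```

The catch is that any code actually using the new syntax still has to live in a separate module or a string, because a file is parsed in full before any of it runs. That is clumsier than an AttributeError check, which is the point being made above.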

Let’s see… this is going to be hard to estimate. Guess we should calculate properly.

Does u need to be able to be combined with d? No, since u is only relevant for backward compatibility.

What about rd? Yes - imagine something like this:

class Stuff:
    def boxify(self):
        box = r"""\
            /-----\
            |     |
            \-----/
        """.dedent()
        ...

FD-strings? Definitely.

So we have three letters r, d, and f, and all combinations of them make sense. So that’s eight functionally-distinct combinations: r f rf d rd fd rfd and plain unprefixed strings. If you allow both d and D, that’s two more letters. And assuming you don’t mandate the order of the letters (which so far hasn’t been a problem - you can write fr"..." and rf"..."), that gives even more options.

The full combo of “raw interpretation of backslashes, formatting interpolation, and dedent” can be spelled in a LOT of ways. There are six permutations of the letters: rfd rdf drf dfr frd fdr. In each of those permutations, the letters can individually be upper or lower cased: fdr fdR fDr fDR Fdr FdR FDr FDR. (And one of these is also a US president.) So that’s 48 options.

Each of the pairs (the two new ones, fd and rd) can be written in 8 ways: fd fD Fd FD df dF Df DF and equivalently for rd. So that’s 16.

And then there’s the two new options for d D on its own.

So… 66 new string prefixes. Welcome to combinatorics.
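
The count can be double-checked by brute force. A small sketch (spellings is just an illustrative helper) that enumerates every ordering and casing of each letter combination that includes d:

```python
from itertools import permutations, product


def spellings(letters):
    # Every ordering of the letters, each letter in lower or upper case.
    result = set()
    for order in permutations(letters):
        for cased in product(*[(c.lower(), c.upper()) for c in order]):
            result.add("".join(cased))
    return result


new_prefixes = set()
for combo in ("d", "rd", "fd", "rfd"):
    new_prefixes |= spellings(combo)

print(len(spellings("rfd")))  # 48
print(len(new_prefixes))      # 66
```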

To be honest, I can’t imagine Python adding 66 additional string prefixes. If this were to be done, there would need to be a completely new system whereby string prefixes are (a) case insensitive, (b) order insensitive, and (c) arbitrarily combinable, which would make the grammar significantly more complicated, but scaling linearly with the letters available.

The method is far less of a cost, and in my opinion, would be easily sufficient. There’s no need to go to the syntactic complexity of a string prefix here.

8 Likes

Python was one of the first programming languages to support triple-quoted string literals. Since then, support for them has been added to many other programming languages, both new (like Julia) and old (like Java). Where they were added later, usage patterns and issues were taken into account. It would be good if Python borrowed from this experience.

I would like to change the interpretation of triple-quoted strings by default (after a long adaptation period, with a corresponding __future__ import). But if that is not possible, adding a new “d” prefix is the second-best option.

Advantages of automatically dedenting triple-quoted string literals:

  1. It applies before interpreting special sequences, so it is easy to disable dedenting, control the depth of dedenting, or add leading or trailing newlines that should not be stripped. The str.dedent() method does not have access to the original representation.
  2. It can be used with docstrings. It will make docstrings more readable by default, without preprocessing in pydoc.
  3. It can be used with f-strings. It will dedent the original template, not the resulting string after substitution.

https://docs.julialang.org/en/v1/manual/strings/#Triple-Quoted-String-Literals
https://openjdk.org/jeps/378

8 Likes

If the question is in implementation, I am ready to do this.

1 Like

I’m not sure that adding yet another string prefix or making dedent the default for triple-quoted strings is a good idea.

The case for doc-strings doesn’t really apply, since those are read by programmers when reading the source code and only need to be dedented when processing them for e.g. help screens or documentation. Tools doing this can easily apply the textwrap.dedent() function.
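
As an illustration of what such tools already do, here is a small sketch using inspect.getdoc() from the standard library, which cleans up docstring indentation via inspect.cleandoc(). It is used here instead of textwrap.dedent() because a docstring's first line usually starts right after the opening quotes, so there is no common margin for textwrap.dedent() to find. The greet function is only an example.

```python
import inspect


def greet():
    """Print a greeting.

    The body of this docstring is indented in the source
    to line up with the code around it.
    """


# getdoc() returns the docstring with the source indentation removed.
print(inspect.getdoc(greet))
```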

On the other hand, doing this by default changes the meaning of triple-quoted strings in a quite non-intuitive way due to the many corner cases the programmer would have to consider. This can lead to an unexpected mismatch between what you write as a literal and what actually gets processed by Python. After all, whitespace does have its use cases :wink: E.g. think about cases where you want to combine several such triple-quoted strings to build an indented XML string.

Overall, I find the status quo quite reasonable. Adding a str.dedent() method may make some things easier (e.g. no need to import textwrap), but then again, having this implemented in C is not really necessary.

So -1 on making dedent the default for triple-quoted strings or adding a d-prefix. -0 on adding a str.dedent() method.

4 Likes

And Python will calculate this for you! One of my favorite parlor tricks:

>>> import tokenize
>>> tokenize._all_string_prefixes()
{'', 'b', 'rf', 'fR', 'Rb', 'bR', 'RB', 'Rf', 'B', 'u', 'RF', 'BR', 'U', 'R', 'F', 'f', 'rB', 'fr', 'FR', 'rF', 'rb', 'r', 'br', 'Fr', 'Br'}
>>> len(tokenize._all_string_prefixes())
25

If you add “rd”, “fd”, and “frd”:

>>> import tokenize
>>> tokenize._all_string_prefixes()
{'', 'DFR', 'rF', 'fRd', 'fRD', 'rFD', 'dR', 'fDr', 'FD', 'DR', 'drF', 'RDF', 'dFR', 'BR', 'Dr', 'RB', 'Rfd', 'fDR', 'RFd', 'B', 'u',
 'rd', 'r', 'Rdf', 'fr', 'frD', 'Df', 'fdr', 'Br', 'rfd', 'fD', 'DrF', 'FR', 'Fd', 'rDF', 'rdf', 'f', 'Frd', 'Fr', 'dr', 'fdR', 'Rb',
 'DRF', 'bR', 'DRf', 'frd', 'Rd', 'dF', 'Drf', 'fd', 'Rf', 'R', 'FRD', 'dfr', 'FdR', 'RdF', 'FDR', 'RDf', 'U', 'DFr', 'rdF', 'Fdr',
 'rB', 'rb', 'FrD', 'RFD', 'dRF', 'df', 'rDf', 'dFr', 'DfR', 'drf', 'rFd', 'RfD', 'fR', 'DF', 'FRd', 'FDr', 'dfR', 'b', 'RF', 'Dfr',
 'F', 'rD', 'rf', 'rfD', 'RD', 'dRf', 'br'}
>>> len(tokenize._all_string_prefixes())
89

When I added f-strings, I argued that we should restrict it to lower case, even when used in combination with other characters, so that “fR” would be invalid, although “fr” and “rf” would be okay. But I lost that one.

5 Likes

I quite like the idea; I think prefixes are a very out-of-your-way method for changing string behaviour without run-time overhead.[1] Why would the “combinatorial explosion” of possible string prefixes matter here? Surely a reader who understands fdr"" can understand dRF"" after learning that the order and case of the prefixes don’t matter, which is already the case.


  1. In fact, I would love user-definable literals as in C++, but that’s a different topic. ↩︎

2 Likes

And as a bonus, Python won’t forget about bytestrings. Do they also need to support dedenting?

If the answer is “yes”, then I really think this proposal will depend on a different system for string prefixes. This is getting insane.

1 Like

How sad. Please only add ‘d’ if this is done. Yes, one can understand ‘dRF’, but it is ugly and jarring at least to my eyes.

I believe tokenize._all_string_prefixes is newish. In any case, it is not documented. Years ago, I wrote out all 24 non-blank prefixes to test IDLE’s syntax colorizer and have opposed any more prefixes because of the explosion. Exposing this method should be part of any proposal to add more prefixes.

There are, of course, other proposals to add other prefixes.

3 Likes

I must be missing something, because this example does not match my expectation created by textwrap.dedent!
Why isn’t it equivalent to:

    text = "Lorem ipsum dolor\n\nsit amet, consectetur\n"

dedent means remove common leading indent to me.
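
For comparison, this is what textwrap.dedent() does with the same text. It strips the longest common margin of the non-blank lines and ignores the indentation of the whitespace-only closing line, so nothing survives of the two extra spaces:

```python
import textwrap

text = """\
      Lorem ipsum dolor

      sit amet, consectetur
    """

print(repr(textwrap.dedent(text)))
# 'Lorem ipsum dolor\n\nsit amet, consectetur\n'
```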

2 Likes

This idea is borrowed from Julia and Swift.

I agree that a from __future__ import is a better idea.

I am a total beginner at parsers, so that is very welcome.

I really like this idea. I have always dedented triple-quoted strings, and can’t remember not wanting to dedent them.

1 Like

OK. That is one of the disadvantages of the from __future__ import approach.

This is where a d-string is better than str.dedent(). str.dedent() cannot be used for indented string parts, but a d-string can.

        # Two-space indented XML snippet in current triple quote string.
        x = """\
  <p>
    hello
  </p>
"""
        # Same snippet in d-string (if we don't use __future__ import).
        x = d"""
          <p>
            hello
          </p>
        """

This would make Python code that builds XML/HTML/SQL look better.

We can reuse the u prefix to disable autodedenting.

Please don’t. I think using d"..." is fine. I’m not too keen on a __future__ import. (I maintain some tests that happen to require

I feel it’s important that there is some API (not necessarily spelled textwrap.dedent()) that implements the exact same algorithm as the d prefix does (to the extent possible – it cannot see escaped newlines and other escape characters that are expanded by the time the string literal in the source has been converted to an object).

1 Like

The proposed algorithm is tailored to multiline literals. I don’t think a str method should do exactly the same thing, because it would need rules like these:

  • The input string must begin with a newline (an empty string is not allowed), which is removed. If the string doesn’t start with a newline, ValueError is raised (instead of SyntaxError).
  • The last line is the indent to be removed. If the last line contains characters other than SP or TAB, ValueError is raised.

We can relax the restriction on the beginning newline: just ignore it instead of raising ValueError.
But specifying the indent to be removed via the last line looks very ugly for a generic string method.

If we have str.dedent(), it should take an indent parameter or detect the “longest common indent” like textwrap.dedent().

Apart from the beginning newline and the “indent-only last line”, str.dedent() can remove indentation exactly as a d-string does. And it may behave exactly like textwrap.dedent().
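
To make the “indent parameter or longest common indent” idea concrete, here is a rough sketch as a free function. The name dedent, its indent parameter, and the handling of lines that do not start with the given indent are all assumptions for illustration, not a proposed signature.

```python
import textwrap


def dedent(text, indent=None):
    # Hypothetical stand-in for a str.dedent() method.
    if indent is None:
        # No explicit indent: remove the longest common indent.
        return textwrap.dedent(text)
    out = []
    for line in text.splitlines(keepends=True):
        if line.startswith(indent):
            out.append(line[len(indent):])
        elif not line.strip():
            # Assumption: whitespace-only lines keep only their line ending.
            out.append(line.lstrip(" \t"))
        else:
            # Assumption: lines with less indentation are left unchanged.
            out.append(line)
    return "".join(out)


snippet = "      <p>\n        hello\n      </p>\n"
print(repr(dedent(snippet)))                # common indent removed
print(repr(dedent(snippet, indent="    ")))  # only four spaces removed
```

The explicit indent covers the “keep two spaces of indentation” case from the XML example without needing a special last line.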

With d-strings you can write the following:

def f():
    return d"""First shalt thou take out the Holy Pin.
        Then shalt thou count to three, no more, no less.
        Three shall be the number thou shalt count, and the number of the \
        counting shall be three.
        Four shalt thou not count, neither count thou two, excepting that \
        thou then proceed to three.
        Five is right out!
        Once the number three, being the third number, be reached, then \
        lobbest thou thy Holy Hand Grenade of Antioch towards thy foe, \
        who, being naughty in My sight, shall snuff it."""

Sentences are separated by newlines, long sentences are split across several physical lines at 80 columns in the source code, and these lines are nicely indented, but the resulting string does not contain newlines or extra spaces in the middle of a sentence. It is not possible to do this with str.dedent().
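
A small demonstration of why a run-time method cannot do this, using textwrap.dedent() as a stand-in for the hypothetical str.dedent(). By the time the method sees the string, the backslash-newline is gone and the indentation of the continuation line has become ordinary spaces in the middle of a line:

```python
import textwrap

text = """\
    First sentence, which is \
    split across two lines.
    Second sentence.
    """

result = textwrap.dedent(text)
print(repr(result))
# 'First sentence, which is     split across two lines.\nSecond sentence.\n'
```

The leading indent is removed, but the four spaces that preceded "split" in the source remain inside the first sentence.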

10 Likes

I see. Very cool. And now I agree that requiring an API for the same thing is problematic.

1 Like

I like the idea, and I would like formatters or other tools to be able to add this syntax.

Problem:

I have some files where I have source code in my tests, for example:


def test_something():
    code = """\
a=1+2
b=3+4
"""
    ...

I would like it if the source code could be converted automatically to the new syntax.
The requirement for such a change would be string == formatting_algo(string)

But I think there would be a problem with the .dedent() approach in combination with f-strings.

def test_something():
    var="(1+\n2)"  # line break without indentation in this string

    code1 = f"""\
        a=1+{var}
        b=3+4
""".dedent()

    code2 = f"""\
a=1+{var}
b=3+4
"""
    
    assert code1 == code2  # would fail
    ...

This would mean that .dedent() could not be used by tooling to automatically indent f-strings.
It would also mean that it is not possible to change the indentation of an f-string that is already indented with .dedent().

For this reason, I would prefer the d"", because it applies the dedent before the variable replacement.
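
The difference can be reproduced today with existing tools. In this sketch textwrap.dedent() stands in for the hypothetical .dedent() method, and str.format() on a plain template stands in for the f-string, so that the order of “dedent” and “substitute” can be swapped:

```python
import textwrap

var = "(1+\n2)"  # multi-line value, second line has no indentation

template = """\
    a=1+{var}
    b=3+4
"""

# Dedent after substitution: the unindented "2)" line means there is
# no common indent left, so nothing is removed.
after = textwrap.dedent(template.format(var=var))

# Dedent the template first, then substitute: this is the order a
# d-string would use.
before = textwrap.dedent(template).format(var=var)

expected = "a=1+(1+\n2)\nb=3+4\n"
print(before == expected)  # True
print(after == expected)   # False
```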

But I would also recommend asking someone who writes a formatter for their opinion.