Str.dedent vs str.removeindent

I want to add textwrap.dedent() as a str method. That would make compile-time dedenting possible, which saves not only execution time but also some RAM and disk space for bytecode.
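As a rough illustration of the bytecode-size point (a sketch, not taken from the linked issue): when textwrap.dedent() is applied to a literal at run time, the constant stored in the code object is still the indented text. The function name `message` is hypothetical.

```python
import textwrap


def message():
    # The literal below is stored in the code object exactly as written,
    # including the eight leading spaces on each line.
    return textwrap.dedent("""\
        hello
        world
        """)


# Collect the string constants of the compiled function and check that
# the indented form is what was stored (and would end up in the .pyc).
stored = [c for c in message.__code__.co_consts if isinstance(c, str)]
has_indented_constant = any(c.startswith("        hello") for c in stored)
```

A dedent performed by the compiler could presumably store only the already-dedented text instead.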

Issue: Add textwrap.dedent, .indent, as str methods. · Issue #62535 · python/cpython · GitHub

Now I am considering its name. We have long used dedent, but str also has lstrip() and removeprefix(). Which name is best?

  • dedent
  • removeindents
  • other (comment)

Just to muddy the water, other terms for this are “unindent” or “outdent”, but, as you say, the term “dedent” has long been used in Python circles.


I’m annoyed to have to import textwrap just to use textwrap.dedent(). Sometimes I’m lazy and use the ugly if 1: hack. Having a built-in method on the str type sounds appealing. I’m not sure that it’s useful on bytes, but I don’t have a strong reason not to add it either, for consistency. Should str-like and bytes-like types, such as UserString and the ABC classes, also implement it?
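For readers who have not seen the if 1: hack, here is a minimal sketch of both styles. The function names are hypothetical; the point is only that prefixing the string with if 1: turns the indented lines into a valid block, so no dedent is needed before exec().

```python
import textwrap


def run_with_dedent():
    # Indented source text, cleaned up with textwrap.dedent() before exec().
    code = textwrap.dedent("""
        x = 1
        y = x + 1
    """)
    namespace = {}
    exec(code, namespace)
    return namespace["y"]


def run_with_if_hack():
    # Same source text, but "if 1:" makes the indentation syntactically valid.
    code = """if 1:
        x = 1
        y = x + 1
    """
    namespace = {}
    exec(code, namespace)
    return namespace["y"]
```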


So long as the implementation is 100% identical behavior to what we’ve called textwrap.dedent() in the past then reusing the dedent name makes sense. If the behavior differs in any way, a new name makes sense and documentation will need to explain the difference.

I’d only do it for str.


Another immediate performance boost would be to re-implement inspect.cleandoc using str.dedent, as at least I have a few projects that store the dedented docstring for many functions.


I suspect folks may be voting without reading the linked Github issue, since the proposed implementation isn’t identical to textwrap’s multiple-regex-passes approach.

It may be that the lower level implementation will give the same effect as the regex based implementation, but it isn’t clear to me if that’s intended to be the case.

Thus GPS’s comment applies: if the effect is the same, reusing the name makes sense. If the effect is different, a new name would be more appropriate (and removeindent would be fine, although “removemargin” could also be considered - that’s what dedent calls the common indent prefix internally).

(Given that dedent specifically works on tabs and spaces rather than arbitrary whitespace, it shouldn’t be that hard to devise a non-regex algorithm with the same effect as the textwrap implementation)
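One possible shape for such a non-regex algorithm is sketched below. This is an illustration only, not the implementation proposed in the GitHub issue; the name `dedent_no_regex` is hypothetical, and it is only intended to agree with textwrap.dedent() for text that uses "\n" line endings.

```python
import textwrap


def dedent_no_regex(text):
    lines = text.split("\n")

    # Pass 1: the margin is the longest common prefix of the space/tab
    # indents of all lines that contain something besides spaces and tabs.
    margin = None
    for line in lines:
        stripped = line.lstrip(" \t")
        if not stripped:
            continue  # empty or whitespace-only lines do not affect the margin
        indent = line[: len(line) - len(stripped)]
        if margin is None:
            margin = indent
        else:
            common = 0
            for a, b in zip(margin, indent):
                if a != b:
                    break
                common += 1
            margin = margin[:common]

    # Pass 2: drop the margin; whitespace-only lines become empty lines.
    cut = len(margin or "")
    return "\n".join(line[cut:] if line.strip(" \t") else "" for line in lines)


samples = [
    "    foo\n   bar\n     baz",
    "     \n  foo\n\t\t",
    "\tfoo\n  bar",
    "  a\n\n  b\n",
    "",
]
matches = [dedent_no_regex(s) == textwrap.dedent(s) for s in samples]
```

The comparison at the end covers only a handful of inputs, so it suggests rather than proves that the two behave the same.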


I voted in favour of the same name regardless of that. In the same way that string split and regex split aren’t identical but get the same name, it’s reasonable for str.dedent and textwrap.dedent to have slightly different algorithms. You should be able to summarize the differences and then make a decision as to which form of dedenting you want.
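A small example of the split analogy, using only standard library calls: the two functions share a name but do not treat separators the same way.

```python
import re

text = " a  b "

# str.split() with no argument collapses runs of whitespace and trims the ends.
plain = text.split()

# re.split() treats every single match as a separator, so empty strings appear.
regex = re.split(r"\s", text)
```

Here `plain` is `['a', 'b']` while `regex` is `['', 'a', '', 'b', '']`.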

One benefit of using the same name, even with non-identical behaviour, is that it removes the “wait, which name is which?” confusion. I get that confusion all the time when switching languages (my web browser is very familiar with me searching things like “mdn array contains” to remind me that, oh right, it’s includes()), and one of the benefits of coherent design is that you should be able to predict the name of something. Having textwrap.dedent and str.removeindent will lead to people typing textwrap.removeindent and then it not working.

Not a huge deal but that’s why I voted as I did.


FYI, I implemented compile time cleandoc for docstring.

The behavior of inspect.cleandoc() is different from textwrap.dedent().
So I chose to implement it separately.
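A short example of how the two differ on a typical docstring-shaped string (the sample text is made up for illustration):

```python
import inspect
import textwrap

doc = """Summary line.

    Details are indented.
    """

# cleandoc() ignores the first line when computing the indent and also
# strips leading and trailing blank lines.
cleaned = inspect.cleandoc(doc)

# dedent() considers every line; the unindented first line means there is
# no common margin, so only the whitespace-only last line is normalized.
dedented = textwrap.dedent(doc)
```

`cleaned` is `'Summary line.\n\nDetails are indented.'`, whereas `dedented` is `'Summary line.\n\n    Details are indented.\n'`.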


I would like to register strong opposition to adding new text-like functionality to bytes. The following is a brief rant that superficially digresses from the topic (the issues I describe here oughtn’t matter to the proposed functionality), but motivates my opinion on the matter.

If I had my druthers, there would be an active push to remove string-formatting-like methods from the bytes type (really, zfill? What’s special about bytes with a value of 48?) and give it a repr that doesn’t encourage thinking of the type as textual (for values outside the ASCII printable range, an ordinary hex dump would be shorter and more pleasant to read, as repeated \x sequences would be replaced with spaces).

To say nothing of the fact that privileging ASCII is just not actually that useful for dealing with legacy data - considering that “legacy data” exists worldwide and lots of systems used to blithely assume a “native” encoding not recorded in metadata anywhere. Latin-1 would make some sense to assume, but b'\xdf'.upper() leaves the data unchanged, whereas '\xdf'.upper() not only recognizes a letter but properly uppercases to 'SS' even without any German-specific locale configuration. For my own purposes: I’ve had to work with old binary dumps that include byte sequences that I needed to scan for and recognize as text - in Shift-JIS.
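The behaviour described above can be checked directly; the variable names are only for illustration.

```python
# bytes methods only know about ASCII letters, so 0xDF is left untouched.
bytes_upper = b"\xdf".upper()

# str knows that U+00DF is LATIN SMALL LETTER SHARP S and uppercases it.
str_upper = "\xdf".upper()

# Decoding explicitly first gives the str behaviour for Latin-1 data.
decoded_upper = b"\xdf".decode("latin-1").upper()

# zfill() on bytes pads with the ASCII digit zero, i.e. byte value 48.
padded = b"7".zfill(3)
```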

Having tools that only work on half the data and silently ignore the rest is IMO not an improvement over lacking those tools and therefore being expected to convert back and forth with a proper string type, specifying the encoding. What happened to “Explicit is better than implicit”? What happened to “Errors should never pass silently”?

The most sympathetic use cases I’ve seen for any text-like bytes functionality are for formatting/interpolation, working with stuff like network protocols that specify byte sequences that form readable ASCII-only “text”. But we didn’t get a .format method or f-bytes, so that has to use the ancient legacy %-style interpolation. Ecch.
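For reference, this is roughly what that looks like for a protocol-style message. The request line and host are made-up sample values.

```python
path = b"/index.html"
host = b"example.org"

# %-interpolation is the only built-in formatting available on bytes;
# %b inserts another bytes object unchanged.
request = b"GET %b HTTP/1.1\r\nHost: %b\r\n\r\n" % (path, host)

# There is no bytes.format() to use instead.
bytes_has_format = hasattr(bytes, "format")
```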

I initially voted for dedent as a name, but now I’m reconsidering. To me, the name “dedent” implies that it should remove exactly one “level” of indentation, which in turn requires that the indentation levels are actually specified in some standard way. I like the proposed name removemargin; that sounds more like something that detects the longest common prefix of whitespace and removes it.

I actually did implement this in one of my projects, not in a necessarily great way, because I didn’t think of textwrap.dedent :sweat:

# where `data` is a list of lines, e.g. as one might get from `.splitlines()`
while {d[0] for d in data if d} in [{' '}, {'\t'}]:
    data = [d[1:] for d in data]

(A previous version checked whether all of the non-empty lines started with a space or tab, but of course this hangs when there are no such lines…)

Perhaps this is a better approach, using an unexpected corner of the standard library:

from os.path import commonprefix
prefix = commonprefix(data)
amount = len(prefix) - len(prefix.lstrip())
data = [d[amount:] for d in data]

The need to create a non-regex-based algorithm isn’t clear to me, but this approach does significantly outperform textwrap.dedent for me in a probably-far-too-naive test:

>>> from os.path import commonprefix
>>> def removemargin(s):
...     lines = s.splitlines()
...     prefix = commonprefix(lines)
...     amount = len(prefix) - len(prefix.lstrip())
...     return '\n'.join(l[amount:] for l in lines)
... 
>>> from timeit import timeit
>>> from textwrap import dedent
>>> teststr = '    foo\n   bar\n     baz'
>>> assert dedent(teststr) == removemargin(teststr)
>>> timeit('dedent(teststr)', globals=globals())
3.216312870150432
>>> timeit('removemargin(teststr)', globals=globals())
2.019660951104015

(The semantics are slightly different: at least, this code removes trailing newlines.)


dedent does more than remove the common margin.
It removes the common margin from non-blank lines, and it turns whitespace-only lines into empty lines.

>>> s = "     \n  foo\n\t\t"
>>> import textwrap
>>> textwrap.dedent(s)
'\nfoo\n'
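To make the contrast concrete, here is the commonprefix-based removemargin from earlier in the thread next to textwrap.dedent() on an input with an empty line (the sample string is made up):

```python
from os.path import commonprefix
import textwrap


def removemargin(s):
    # Copied from the earlier post: the margin comes from the common prefix
    # of *all* lines, including empty ones.
    lines = s.splitlines()
    prefix = commonprefix(lines)
    amount = len(prefix) - len(prefix.lstrip())
    return "\n".join(l[amount:] for l in lines)


s = "  foo\n\n  bar\n"

# dedent() skips the empty line when computing the margin.
with_dedent = textwrap.dedent(s)

# The empty line makes the common prefix empty, so nothing is removed,
# and splitlines()/join() drops the trailing newline.
with_removemargin = removemargin(s)
```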

Personally I don’t think there should be a bytes variant. textwrap doesn’t handle bytes, and no-one has added that support in all the time it’s been around, so why add a bytes variant now?

As far as behavior is concerned, I don’t really understand the proposed differences in the issue (I didn’t read the actual code) but IMO as long as it’s a sufficiently similar behaviour to be the same in a practical sense (and yes, I know that’s subjective) then I’m fine with that. I can’t say I like the idea of not handling tabs (it feels right on the border of “the same in a practical sense”), but for all the use cases I care about, I’ve never wanted tab processing. In particular, tabs in Python source code are frowned on, so the primary use case of """...""".dedent() would be unaffected by this restriction.
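For context, this is how the existing textwrap.dedent() treats tabs: they count as indentation, but a tab and spaces are never considered equivalent. The sample strings are made up.

```python
import textwrap

# A shared leading tab is a common margin and is removed.
tabs_only = textwrap.dedent("\tfoo\n\t\tbar")

# One line indented with spaces and one with a tab share no margin,
# so the text comes back unchanged.
mixed = textwrap.dedent("  hello\n\thello")
```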

I disagree with the statement in the issue that “textwrap.dedent() is complex function”. Maybe in terms of implementation it is, but the documented behaviour seems pretty straightforward, so it doesn’t seem to me like there’s much to debate (beyond the tabs point, which as I say I don’t really care about).


Because other string methods have bytes versions.
But after reading this discussion, I am now inclined to add only str.dedent(). It also makes the implementation easier.


As a string method, removeindents seems more aligned with the newly added removeprefix, but we sure wouldn’t die retaining the analogy with textwrap.


The term “dedent” also seems to be used outside the Python community:

  • ES6 dedent project: “ES6 string tag that strips indentation from multi-line strings.”
  • NPM dedent-js
  • Dart flutter-dedent: Dedent - Remove any common leading whitespace from every line in text. Ported from Python.

See also the TC39 discussion about using a ``` quote to automatically dedent strings: “Triple-backtick template literal with indentation support” (Ideas, TC39). There it’s not even a string method, but built directly into the ES language.


By the way, if we add a method to remove the indentation, should we also add the inverse, an indent() method, to add indentation? :slight_smile: I don’t think so. Moreover, it would not be the inverse, since dedent() removes multiple levels of indentation, whereas “add indentation” usually means adding a single level.

str.strip() removes leading and trailing spaces. str.center() adds leading and trailing spaces :slight_smile: But it’s not exactly the inverse: strip() removes different kinds of spaces, and an arbitrary number.
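A quick sketch of that asymmetry using the existing textwrap functions and str methods (sample strings made up):

```python
import textwrap

text = "a\n  b\n"

# indent() adds exactly the prefix it is given to each non-blank line.
indented = textwrap.indent(text, "    ")

# dedent() undoes that here, but only because `text` had no margin itself.
round_trip = textwrap.dedent(indented)

# dedent() removes the whole common margin, however deep it is.
deep = textwrap.dedent("        a\n          b\n")

# center() and strip() are similarly lopsided.
centered_then_stripped = "x".center(5).strip()
stripped_then_centered = "  x".strip().center(3)
```

`round_trip` equals `text`, and `deep` is also `'a\n  b\n'`: all eight spaces are gone, not one “level”. Likewise `"x".center(5).strip()` gives back `'x'`, but `"  x".strip().center(3)` gives `' x '` rather than the original `'  x'`.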


It depends on what you mean by “multiple levels of indentation”. What is 8 spaces? Is it 2 levels of 4 spaces or 1 level of 8 spaces?

If you wanted more control over dedenting, you’d need to specify what constitutes indentation and the maximum number of times it can be applied. We could always add that later, if there were a need.
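Purely as a hypothetical sketch of what such control could look like (nothing like this is proposed in the issue; the function name `dedent_levels` and the parameters `unit` and `max_levels` are invented here):

```python
def dedent_levels(text, unit="    ", max_levels=1):
    """Remove at most `max_levels` copies of `unit` from every line."""
    lines = text.split("\n")
    content = [line for line in lines if line.strip()]

    # Count how many whole units every non-blank line starts with,
    # stopping at the requested maximum.
    levels = 0
    while (
        levels < max_levels
        and content
        and all(line.startswith(unit * (levels + 1)) for line in content)
    ):
        levels += 1

    cut = len(unit) * levels
    # Blank lines become empty, mirroring what textwrap.dedent() does.
    return "\n".join(line[cut:] if line.strip() else "" for line in lines)


sample = "        a\n            b\n"
one_level = dedent_levels(sample)                 # treats 8 spaces as 2 levels of 4
two_levels = dedent_levels(sample, max_levels=2)
one_wide_level = dedent_levels(sample, unit=" " * 8)
```

With the sample above, `one_level` keeps four spaces in front of `a`, while both `two_levels` and `one_wide_level` remove all eight.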

But that’s mostly due to either the 2 → 3 transition or because it made sense from the perspective of binary protocols one might work with. I can’t think of any binary protocol where you might want to strip out whitespace. Plus, what counts as “whitespace” is fancier once you take full Unicode into consideration, compared to bytes, where you would just say that binary “whitespace” is ord(" ").


So it is. I don’t really understand the purpose of specifying it, though. I thought the goal is just to fix the visual alignment of the text; why does it matter what happens to a whitespace-only line? If it’s to avoid unnecessary overflowing from whitespace, then why not .rstrip every line while we’re at it?

Looks like the poll was created yesterday, but I am unable to vote – is it closed already?

I’ve opened it again. As a moderator I almost misclicked close yesterday when I voted, so I’m assuming someone else made that mistake.
