str.dedent vs str.removeindent

The current C dedent implementation (i.e. _PyUnicode_Dedent) is in https://github.com/python/cpython/pull/103998.

I added this patch to the pull request, and it passes 100% of the textwrap.dedent() test cases.

diff --git a/Lib/test/test_textwrap.py b/Lib/test/test_textwrap.py
index dfbc2b93df..17b3a0efab 100644
--- a/Lib/test/test_textwrap.py
+++ b/Lib/test/test_textwrap.py
@@ -10,7 +10,7 @@
 
 import unittest
 
-from textwrap import TextWrapper, wrap, fill, dedent, indent, shorten
+from textwrap import TextWrapper, wrap, fill, indent, shorten
 
 
 class BaseTestCase(unittest.TestCase):
@@ -760,6 +760,8 @@ def test_subsequent_indent(self):
                       initial_indent="  * ", subsequent_indent="    ")
         self.check(result, expect)
 
+def dedent(text):
+    return text.dedent()
 
 # Despite the similar names, DedentTestCase is *not* the inverse
 # of IndentTestCase!
diff --git a/Objects/unicodeobject.c b/Objects/unicodeobject.c
index 284185756f..c6a9756a4e 100644
--- a/Objects/unicodeobject.c
+++ b/Objects/unicodeobject.c
@@ -13484,6 +13484,12 @@ _PyUnicode_Dedent(PyObject *unicode)
     return res;
 }
 
+static PyObject *
+unicode_dedent(PyObject *u, PyObject *Py_UNUSED(ignored))
+{
+    return _PyUnicode_Dedent(u);
+}
+
 static PyMethodDef unicode_methods[] = {
     UNICODE_ENCODE_METHODDEF
     UNICODE_REPLACE_METHODDEF
@@ -13534,6 +13540,7 @@ static PyMethodDef unicode_methods[] = {
     UNICODE___FORMAT___METHODDEF
     UNICODE_MAKETRANS_METHODDEF
     UNICODE_SIZEOF_METHODDEF
+    {"dedent", unicode_dedent, METH_NOARGS},
     {"__getnewargs__",  unicode_getnewargs, METH_NOARGS},
     {NULL, NULL}
 };
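
For reference, here is roughly the resulting behavior, assuming the patch above is applied (it passes the textwrap.dedent() test suite, so the textwrap call is shown for comparison):

    >>> import textwrap
    >>> s = "\n    hello\n      world\n    "
    >>> textwrap.dedent(s)
    '\nhello\n  world\n'
    >>> s.dedent()   # same result with the patched method
    '\nhello\n  world\n'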

I just wanted to note that dedent() is not just removing a common prefix from each line.

Maybe I misread your comment. You just demonstrated that the non-regex approach outperforms the regex-based approach, so I didn't need to mention the difference between dedent and removing a common prefix.

Another reason not to use regex is simplicity.
One goal of the str.dedent() method is compile-time processing (constant folding).
Users may write long multi-line SQL in triple-quoted strings. Writing ...""".dedent() makes compile-time processing easy, so users can save both runtime cost and RAM usage.
And using sre in a builtin str method is more complex than implementing a non-regex approach in C.
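
As a concrete sketch of that use case (my own illustration; conn and the table schema are hypothetical):

    import textwrap

    def get_user(conn, user_id):
        # Today the dedent work happens at runtime, on every call,
        # unless the dedented string is hoisted out by hand.
        query = textwrap.dedent("""\
            SELECT id, name, email
            FROM users
            WHERE id = ?
        """)
        return conn.execute(query, (user_id,))

    # With str.dedent(), the same literal would be written as
    #     query = """\
    #         SELECT id, name, email
    #         FROM users
    #         WHERE id = ?
    #     """.dedent()
    # and, because the receiver is a constant, the compiler could in
    # principle fold the call away, leaving neither the dedent work nor
    # the indented original string at runtime.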

Jon Crall already implemented it in the pull request, so I want to reuse it for str.dedent().


Wow, congrats!


The addition of removesuffix and removeprefix to str required a PEP, right? Wouldn't these additions also require a PEP?


I'm not sure whether adding a builtin method always needs a PEP.

FYI, adding this method was already decided once, back in 2013. See this comment.

The goal was to end proposals for new syntax to do what dedent does for triple-quoted strings. Nick Coghlan gave reasons as follows: the runtime cost is small, it can be optimized away, and it would be used more than some other string methods.
[Python-ideas] Idea for new multi-line triple quote literal

In response, Guido said "That's a compelling argument. Let's do it."
[Python-ideas] Idea for new multi-line triple quote literal

Now, 10 years later, has this situation changed?
If the argument is now more controversial than compelling, I will write a PEP for it.

IMO, this seems like a logical enough addition to not need a PEP. Not that my word on the matter is worth much :slightly_smiling_face:


A few things changed in 10 years. For example, Guido is no longer in charge; the Steering Council is. But IMO this approval from the time machine is still relevant :slight_smile:

For removeprefix/suffix, I requested a PEP. I expected it to be written quickly. But, oh well, the devil is in the details... Have a look at the Rejected Ideas section: PEP 616 – String methods to remove prefixes and suffixes | peps.python.org

I think the most difficult point was deciding whether to accept only a string, or also a tuple of strings, similar to startswith(). Well, read the PEP for details.
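
To illustrate the asymmetry that was debated (behavior as specified by PEP 616 and shipped in Python 3.9):

    >>> "foo.py".startswith(("foo", "bar"))   # startswith() accepts a tuple
    True
    >>> "foo.py".removeprefix("foo")          # removeprefix() takes a single str
    '.py'
    >>> "foo.py".removeprefix(("foo", "bar"))
    Traceback (most recent call last):
      ...
    TypeError: removeprefix() argument must be str, not tuple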

A PEP is a nice place to drop links to past discussions, summarize the rationale, etc.


There's an SC now, so yes, things have changed. :wink: You can ask the SC if they think a PEP is necessary.


How is this going?

I support adding str.dedent. textwrap.dedent is widely used, and people would naturally expect str.dedent to behave the same.

I implemented the C algorithm carefully; it should be safe and correct, and it provides some performance boost.
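
For anyone who wants to verify the speed claim, here is a minimal harness (no numbers implied; the second measurement needs an interpreter built with the patch):

    import textwrap, timeit

    s = "    SELECT *\n    FROM users\n" * 50   # a typical indented literal

    # Today's pure-Python path:
    print(timeit.timeit(lambda: textwrap.dedent(s), number=10_000))

    # The proposed C path, on a patched build only:
    # print(timeit.timeit(lambda: s.dedent(), number=10_000))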


This is diverging a bit from indents, but…

I would like to register strong opposition to adding new text-like functionality to bytes. The following is a brief rant that superficially digresses from the topic (the issues I describe here oughtn't matter to the proposed functionality), but motivates my opinion on the matter.

I'm mildly against adding new text-like stuff to bytes for all the reasons Karl enumerates, but wanted to add a counterexample on the topic of bytes looking a bit like text.

Karl writes:

To say nothing of the fact that privileging ASCII is just not actually that useful for dealing with legacy data - considering that "legacy data" exists worldwide and lots of systems used to blithely assume a "native" encoding not recorded in metadata anywhere.

I'm writing a PDF parser at present, and PDF is the very image of a binary data format masquerading as ASCII text, with raw binary byte streams embedded in it which can only be parsed correctly if you parse/evaluate the dictionary which precedes it :slight_smile:

And it doesn't entirely pretend to be pure ASCII outside those streams: the PDF specification of a name is nearly "a sequence of bytes which doesn't include the ASCII whitespace code values". Gah!

Anyway, PDF being binary, I've benefited enormously from the existence of binary regexps.

There, that's my bit said.

Cheers,
Cameron Simpson cs@cskk.id.au

If we're including "formats" in "protocols", PDF qualifies (I'm using binary regexps in my own code). It reads like ASCII text with higher byte values allowed, without specifying what character set they're in, and with some raw binary shoehorned in.

And isn't HTTP technically binary-with-ASCII-headers? I seem to recall you can't treat it as ASCII (sorry, no citation).

FYI, that discussion was continued in another thread:


To be fair, the code page system was a pretty decent kludge for its purpose, and it has the consequence that lots of things are at least ASCII-transparent. It's just that you still need to figure out what to do when you get a byte with the high bit set. And no, a "name" that contains bytes in an unknown encoding and unknown semantics besides "not ASCII whitespace"... still doesn't really need to be thought of as "text".

I mean, you're just going to be comparing it for equality, right? Not, say, trying to uppercase it, when you don't even know whether a given byte represents ß, let alone whether a Turkish locale should be assumed?
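
A concrete illustration of that hazard (my example, not from the post):

    >>> "ß".upper()              # str.upper() knows the Unicode case rules
    'SS'
    >>> "ß".encode("latin-1")    # the same character as a Latin-1 byte
    b'\xdf'
    >>> b"\xdf".upper()          # bytes.upper() only maps ASCII a-z
    b'\xdf'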

For clarity, though, I don't think there's anything wrong with the concept of byte regexes per se. I just wish they didn't preserve text-like concepts like "word character" so strongly, and wish they did make it easier to, say, match a byte with a given value for the low nybble (or even specific bit patterns).
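
For example (my own sketch): \w keeps its word-character meaning in bytes patterns, while matching on bit patterns takes manual character-class construction:

    import re

    # \w in a bytes pattern still means ASCII letters, digits and _:
    re.findall(rb"\w+", b"foo\x80bar")            # -> [b'foo', b'bar']

    # There is no direct syntax for "any byte whose low nybble is 0x3";
    # you have to enumerate the matching byte values into a class:
    low3 = bytes(b for b in range(256) if b & 0x0F == 0x03)
    pat = re.compile(b"[" + re.escape(low3) + b"]")
    pat.findall(bytes([0x03, 0x13, 0x24, 0xF3]))  # -> [b'\x03', b'\x13', b'\xf3']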


A post was split to a new topic: Markdown code-block style syntax for string literals

Is the Steering Council fine with merging this without a PEP? I'm quite excited for this!

Based on recent precedent (e.g. str.removeprefix in PEP 616, fully qualified names in PEP 737), I think adding new builtin methods should go through a PEP, but the Steering Council hasn't said this explicitly.


Also BaseException.__notes__ in 3.11. This was originally implemented without a PEP, but the Steering Council voiced concern about that and, as a result, a PEP was written (which led to the design of the feature being changed). See Guidelines on semantic changes to any built-in? · Issue #93 · python/steering-council · GitHub, and PEP 678.
