str.dedent vs str.removeindent

The current C dedent implementation (i.e. _PyUnicode_Dedent) is in https://github.com/python/cpython/pull/103998.

I added this patch to the pull request, and it passes 100% of the textwrap.dedent() test cases.

diff --git a/Lib/test/test_textwrap.py b/Lib/test/test_textwrap.py
index dfbc2b93df..17b3a0efab 100644
--- a/Lib/test/test_textwrap.py
+++ b/Lib/test/test_textwrap.py
@@ -10,7 +10,7 @@
 
 import unittest
 
-from textwrap import TextWrapper, wrap, fill, dedent, indent, shorten
+from textwrap import TextWrapper, wrap, fill, indent, shorten
 
 
 class BaseTestCase(unittest.TestCase):
@@ -760,6 +760,8 @@ def test_subsequent_indent(self):
                       initial_indent="  * ", subsequent_indent="    ")
         self.check(result, expect)
 
+def dedent(text):
+    return text.dedent()
 
 # Despite the similar names, DedentTestCase is *not* the inverse
 # of IndentTestCase!
diff --git a/Objects/unicodeobject.c b/Objects/unicodeobject.c
index 284185756f..c6a9756a4e 100644
--- a/Objects/unicodeobject.c
+++ b/Objects/unicodeobject.c
@@ -13484,6 +13484,12 @@ _PyUnicode_Dedent(PyObject *unicode)
     return res;
 }
 
+static PyObject *
+unicode_dedent(PyObject *u, PyObject *Py_UNUSED(ignored))
+{
+    return _PyUnicode_Dedent(u);
+}
+
 static PyMethodDef unicode_methods[] = {
     UNICODE_ENCODE_METHODDEF
     UNICODE_REPLACE_METHODDEF
@@ -13534,6 +13540,7 @@ static PyMethodDef unicode_methods[] = {
     UNICODE___FORMAT___METHODDEF
     UNICODE_MAKETRANS_METHODDEF
     UNICODE_SIZEOF_METHODDEF
+    {"dedent", unicode_dedent, METH_NOARGS},
     {"__getnewargs__",  unicode_getnewargs, METH_NOARGS},
     {NULL, NULL}
 };
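
For reference, here is roughly the resulting behavior, assuming the patch above is applied (it passes the textwrap.dedent() test suite, so the textwrap call is shown for comparison):

    >>> import textwrap
    >>> s = "\n    hello\n      world\n    "
    >>> textwrap.dedent(s)
    '\nhello\n  world\n'
    >>> s.dedent()   # same result with the patched method
    '\nhello\n  world\n'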

I just wanted to note that dedent() is not just removing a common prefix from each line.

Maybe I misread your comment. You just demonstrated that the non-regex approach outperforms the regex-based approach, so I didn't need to mention the difference between dedent and removing a common prefix.

Another reason not to use regex is simplicity.
One goal of the str.dedent() method is compile-time processing (constant folding).
Users may write long multi-line SQL in triple-quoted strings. Writing ...""".dedent() makes compile-time processing easy, so users can save both runtime cost and RAM usage.
And using sre in a builtin str method is more complex than implementing a non-regex approach in C.
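
As a concrete sketch of that use case (my own illustration; conn and the table schema are hypothetical):

    import textwrap

    def get_user(conn, user_id):
        # Today the dedent work happens at runtime, on every call,
        # unless the dedented string is hoisted out by hand.
        query = textwrap.dedent("""\
            SELECT id, name, email
            FROM users
            WHERE id = ?
        """)
        return conn.execute(query, (user_id,))

    # With str.dedent(), the same literal would be written as
    #     query = """\
    #         SELECT id, name, email
    #         FROM users
    #         WHERE id = ?
    #     """.dedent()
    # and, because the receiver is a constant, the compiler could in
    # principle fold the call away, leaving neither the dedent work nor
    # the indented original string at runtime.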

Jon Crall already implemented it in the pull request, so I want to reuse it for str.dedent().


Wow, congrats!


The addition of removesuffix and removeprefix to str required a PEP, right? Wouldn't these additions also require a PEP?


I'm not sure whether adding a builtin method always needs a PEP.

FYI, adding this method was already decided once, back in 2013. See this comment.

The goal was to end proposals for new syntax to do what dedent does for triple-quoted strings. Nick Coghlan gave reasons as follows: the runtime cost is small, it can be optimized away, and it would be used more than some other string methods.
[Python-ideas] Idea for new multi-line triple quote literal

In response, Guido said "That's a compelling argument. Let's do it."
[Python-ideas] Idea for new multi-line triple quote literal

Now, 10 years later, has this situation changed?
If the argument is now more controversial than compelling, I will write a PEP for it.

IMO, this seems like a logical enough addition to not need a PEP. Not that my word on the matter is worth much :slightly_smiling_face:


A few things changed in 10 years. For example, Guido is no longer in charge; the Steering Council is. But IMO this approval from the time machine is still relevant :slight_smile:

For removeprefix/suffix, I requested a PEP. I expected it to be written quickly. But, oh well, the devil is in the details... Have a look at the Rejected Ideas section: PEP 616 – String methods to remove prefixes and suffixes | peps.python.org

I think the most difficult point was deciding whether to accept only a string, or also a tuple of strings, similar to startswith(). Well, read the PEP for details.
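
To illustrate the asymmetry that was debated (behavior as specified by PEP 616 and shipped in Python 3.9):

    >>> "foo.py".startswith(("foo", "bar"))   # startswith() accepts a tuple
    True
    >>> "foo.py".removeprefix("foo")          # removeprefix() takes a single str
    '.py'
    >>> "foo.py".removeprefix(("foo", "bar"))
    Traceback (most recent call last):
      ...
    TypeError: removeprefix() argument must be str, not tuple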

A PEP is a nice place to drop links to past discussions, summarize the rationale, etc.


There's an SC now, so yes, things have changed. :wink: You can ask the SC if they think a PEP is necessary.


How is this going?

I support adding str.dedent. textwrap.dedent is widely used, and people would naturally expect str.dedent to behave the same.

I implemented the C algorithm carefully; it should be safe and correct, and it provides some performance boost.
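
For anyone who wants to verify the speed claim, here is a minimal harness (no numbers implied; the second measurement needs an interpreter built with the patch):

    import textwrap, timeit

    s = "    SELECT *\n    FROM users\n" * 50   # a typical indented literal

    # Today's pure-Python path:
    print(timeit.timeit(lambda: textwrap.dedent(s), number=10_000))

    # The proposed C path, on a patched build only:
    # print(timeit.timeit(lambda: s.dedent(), number=10_000))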


This is diverging a bit from indents, but…

I would like to register strong opposition to adding new text-like functionality to bytes. The following is a brief rant that superficially digresses from the topic (the issues I describe here oughtn't matter to the proposed functionality), but motivates my opinion on the matter.

I'm mildly against adding new text-like stuff to bytes for all the reasons Karl enumerates, but wanted to add a counterexample on the topic of bytes looking a bit like text.

Karl writes:

To say nothing of the fact that privileging ASCII is just not actually that useful for dealing with legacy data - considering that "legacy data" exists worldwide and lots of systems used to blithely assume a "native" encoding not recorded in metadata anywhere.

I'm writing a PDF parser at present, and PDF is the very image of a binary data format masquerading as ASCII text, with raw binary byte streams embedded in it which can only be parsed correctly if you parse/evaluate the dictionary which precedes it :slight_smile:

And it doesn't entirely pretend to be pure ASCII outside those streams: the PDF specification of a name is nearly "a sequence of bytes which doesn't include the ASCII whitespace code values". Gah!

Anyway, PDF being binary, I've benefited enormously from the existence of binary regexps.

There, that's my bit said.

Cheers,
Cameron Simpson cs@cskk.id.au

If we're including "formats" in "protocols", PDF qualifies (I'm using binary regexps in my own code). It reads like ASCII text with higher byte values allowed, without specifying what character set they're in, and with some raw binary shoehorned in.

And isn't HTTP technically binary-with-ASCII-headers? I seem to recall you can't treat it as ASCII (sorry, no citation).

FYI, that discussion was continued in another thread:


To be fair, the code page system was a pretty decent kludge for its purpose, and it has the consequence that lots of things are at least ASCII-transparent. It's just that you still need to figure out what to do when you get a byte with the high bit set. And no, a "name" that contains bytes in an unknown encoding and unknown semantics besides "not ASCII whitespace"... still doesn't really need to be thought of as "text".

I mean, you're just going to be comparing it for equality, right? Not, say, trying to uppercase it, when you don't even know whether a given byte represents ß, let alone whether a Turkish locale should be assumed?
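
A concrete illustration of that hazard (my example, not from the post):

    >>> "ß".upper()              # str.upper() knows the Unicode case rules
    'SS'
    >>> "ß".encode("latin-1")    # the same character as a Latin-1 byte
    b'\xdf'
    >>> b"\xdf".upper()          # bytes.upper() only maps ASCII a-z
    b'\xdf'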

For clarity, though, I don't think there's anything wrong with the concept of byte regexes per se. I just wish they didn't preserve text-like concepts like "word character" so strongly, and wish they did make it easier to, say, match a byte with a given value for the low nybble (or even specific bit patterns).
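
For example (my own sketch): \w keeps its word-character meaning in bytes patterns, while matching on bit patterns takes manual character-class construction:

    import re

    # \w in a bytes pattern still means ASCII letters, digits and _:
    re.findall(rb"\w+", b"foo\x80bar")            # -> [b'foo', b'bar']

    # There is no direct syntax for "any byte whose low nybble is 0x3";
    # you have to enumerate the matching byte values into a class:
    low3 = bytes(b for b in range(256) if b & 0x0F == 0x03)
    pat = re.compile(b"[" + re.escape(low3) + b"]")
    pat.findall(bytes([0x03, 0x13, 0x24, 0xF3]))  # -> [b'\x03', b'\x13', b'\xf3']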


A post was split to a new topic: Markdown code-block style syntax for string literals

Is the Steering Council fine with merging this without a PEP? I'm quite excited for this!

Based on recent precedent (e.g. str.removeprefix in PEP 616, fully qualified names in PEP 737), I think adding new builtin methods should go through a PEP, but the Steering Council hasn't said this explicitly.


Also BaseException.__notes__ in 3.11. This was originally implemented without a PEP, but the Steering Council voiced concern about that and, as a result, a PEP was written (which led to the design of the feature being changed). See Guidelines on semantic changes to any built-in? · Issue #93 · python/steering-council · GitHub, and PEP 678.
