Add support for CRLF in textwrap.dedent

belfner · February 19, 2023, 11:29pm

The Problem

The function textwrap.dedent does not work as expected for text with CRLF line endings.

A key feature of textwrap.dedent is the ability to ignore lines that contain only whitespace (' ' and '\t') and end in a newline character. The problem arises when processing strings that use CRLF line endings, because the regex patterns used in the function treat the carriage return character as non-whitespace. This means the string ' foo\n\n' will be processed as expected into 'foo\n\n', but the string ' foo\r\n\r\n' will be unchanged. The second line contains a non-whitespace character ('\r') and is therefore included when calculating the largest common prefix, which in this case is '', so nothing is removed.
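To make the mechanism concrete, here is a small sketch that applies the leading-whitespace pattern directly. It assumes the pattern matches the one in CPython's Lib/textwrap.py around the time of this post; later releases may implement dedent differently, so treat the regex as an illustration rather than the current source.

```python
import re

# Assumed copy of the pattern textwrap.dedent used to collect line prefixes
# (as of early 2023; not guaranteed to match newer CPython releases).
leading_whitespace_re = re.compile('(^[ \t]*)(?:[^ \t\n])', re.MULTILINE)

lf_text = ' foo\n\n'
crlf_text = ' foo\r\n\r\n'

# Each match yields the indentation of a line that has a "non-whitespace"
# character. '\r' is not in the excluded set [ \t\n], so a bare '\r' line
# counts as content with an empty indent.
lf_indents = leading_whitespace_re.findall(lf_text)
crlf_indents = leading_whitespace_re.findall(crlf_text)

print(lf_indents)    # [' ']
print(crlf_indents)  # [' ', '']
```

The extra '' entry in the CRLF case is what drags the common prefix down to the empty string.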

One possible solution is to use the pattern 'textwrap.dedent(input_str.replace('\r\n','\n')).replace('\n','\r\n')'. This works as long as the input string uses only CRLF line endings. If the input string contains mixed line endings, this solution changes them all to CRLF. That behavior may be acceptable, since text with mixed line endings can be considered an edge case, but the code pattern is only needed because of the unexpected behavior of textwrap.dedent.
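The round-trip pattern can be wrapped in a small helper. This is only a sketch; the name dedent_crlf is made up for illustration and is not part of textwrap.

```python
import textwrap

def dedent_crlf(text):
    """Hypothetical helper: dedent text whose line endings are CRLF."""
    # Convert to LF so dedent sees ordinary newlines, then convert back.
    lf_text = text.replace('\r\n', '\n')
    return textwrap.dedent(lf_text).replace('\n', '\r\n')

crlf_only = dedent_crlf('    foo\r\n      bar\r\n    \r\n')
print(repr(crlf_only))   # 'foo\r\n  bar\r\n\r\n'

# Mixed line endings are not preserved: every ending comes back as CRLF.
mixed = dedent_crlf('  a\n  b\r\n')
print(repr(mixed))       # 'a\r\nb\r\n'
```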

This GitHub issue proposes another solution, but I see three problems with it. One, it changes the default behavior of textwrap.dedent, possibly breaking legacy code (this may or may not be a big deal). Two, it does not treat CRLF line endings the same as LF line endings. The current implementation replaces any line that contains only whitespace and ends with '\n' with just '\n'. The proposed solution does not do the same for lines that contain only whitespace and end with '\r\n'; those lines are left alone. Three, the new regex pattern it suggests treats the string '  \r\r\n' as having a leading whitespace of 1 rather than 2. For any string that starts with a non-zero number of whitespace characters followed by '\r\r\n', the pattern identifies the leading whitespace as all leading whitespace characters except the last one.

There is another GitHub issue (#59250) related to textwrap.dedent, but it is more focused on what should be considered a “whitespace” character.

The Solution

My solution (located at this commit) overcomes all issues raised above. To prevent changes to the default behavior of textwrap.dedent, a new argument was added that would act as a flag to enable the new behavior. Enabling the new behavior adds an extra processing step where lines with only whitespace and ending with CRLF line endings are replaced with '\r\n' to emulate the way LF lines are handled. It also changes the regex patterns used to gather all prefixes to ignore lines containing only '\r\n'. I have updated the tests for the function to demonstrate that passing True to the new argument treats CRLF line endings the same as LF line endings.
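As a rough illustration of the behavior described above, here is a self-contained sketch. It is not the code from the linked commit, which is not reproduced here; the function name dedent_sketch, the regex patterns, and the use of os.path.commonprefix are all assumptions made for the example. Only the keyword-only eol_agnostic flag comes from the post.

```python
import os
import re

# Hypothetical patterns, written for this sketch only.
_ws_only_lf = re.compile(r'^[ \t]+$', re.MULTILINE)
_ws_only_crlf = re.compile(r'^[ \t]+(?=\r\n)', re.MULTILINE)
_leading_ws = re.compile(r'(^[ \t]*)(?:[^ \t\n])', re.MULTILINE)
# Same as above, but a line consisting only of '\r\n' is not counted.
_leading_ws_crlf = re.compile(r'(^[ \t]*)(?!\r\n)(?:[^ \t\n])', re.MULTILINE)

def dedent_sketch(text, *, eol_agnostic=False):
    """Dedent text; optionally treat CRLF-terminated lines like LF ones."""
    # Whitespace-only lines ending in LF are reduced to a bare newline.
    text = _ws_only_lf.sub('', text)
    if eol_agnostic:
        # Extra step: whitespace-only lines ending in CRLF become '\r\n'.
        text = _ws_only_crlf.sub('', text)
        indents = _leading_ws_crlf.findall(text)
    else:
        indents = _leading_ws.findall(text)
    # commonprefix compares strings character by character.
    margin = os.path.commonprefix(indents)
    if margin:
        text = re.sub(r'(?m)^' + re.escape(margin), '', text)
    return text

default_result = dedent_sketch(' foo\r\n\r\n')
agnostic_result = dedent_sketch(' foo\r\n\r\n', eol_agnostic=True)
blank_line_result = dedent_sketch('  a\r\n   \r\n  b\r\n', eol_agnostic=True)
print(repr(default_result))     # ' foo\r\n\r\n'
print(repr(agnostic_result))    # 'foo\r\n\r\n'
print(repr(blank_line_result))  # 'a\r\n\r\nb\r\n'
```

With the flag left at its default, the sketch keeps the behavior described in the problem statement, so existing callers would see no change.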

Final Thoughts

I am interested to hear your thoughts on my proposed changes and am open to hearing suggestions on implementation. The current name of the “flag” is eol_agnostic but I welcome suggestions for a better one. I have also made the decision to make eol_agnostic keyword-only to avoid boolean traps.

I understand this is not a pull request or even an official issue, but I am including the output of make patchcheck:

The following modules are *disabled* in configure script:
_sqlite3                                                       

The necessary bits to build these optional modules were not found:
_bz2                  _curses               _curses_panel      
_dbm                  _gdbm                 _lzma              
_tkinter              readline                                 
To find the necessary bits, look in configure.ac and config.log.

Checked 111 modules (30 built-in, 71 shared, 1 n/a on linux-x86_64, 1 disabled, 8 missing, 0 failed on import)
./python ./Tools/patchcheck/patchcheck.py
Getting base branch for PR ... origin/main
Getting the list of files that have been added/changed ... 1 file
Fixing Python file whitespace ... 0 files
Fixing C file whitespace ... 0 files
Fixing docs whitespace ... 0 files
Docs modified ... NO
Misc/ACKS updated ... NO
Misc/NEWS.d updated with `blurb` ... NO
configure regenerated ... not needed
pyconfig.h.in regenerated ... not needed

Did you run the test suite?

And the output of ./python -bb -E -Wd -m test -r -w -uall -j0:

== Tests result: SUCCESS ==

404 tests OK.

1 test altered the execution environment:
    test_generators

29 tests skipped:
    test_bz2 test_check_c_globals test_curses test_dbm_gnu
    test_dbm_ndbm test_devpoll test_gdb test_idle test_ioctl
    test_kqueue test_launcher test_lzma test_msilib test_peg_generator
    test_readline test_sqlite3 test_startfile test_tcl test_tix
    test_tkinter test_ttk test_ttk_textonly test_turtle
    test_winconsoleio test_winreg test_winsound test_wmi
    test_zipfile64 test_zoneinfo

Total duration: 2 min 16 sec
Tests result: SUCCESS

MRAB · February 20, 2023, 12:04am

Why do you have ‘\r’ in your strings? Python normally uses ‘\n’ as the line ending with ‘\r’ existing only in files, and even then only in files that were written on Windows.

belfner · February 20, 2023, 12:16am

The ‘\r’ is just for the example. I ran into this issue when working with files from Windows. More specifically, I was copying text from a file on Windows, then grabbing the text from the clipboard. The string still had CRLF line endings.

PythonCHB · February 20, 2023, 4:57am

I think that it was / is that way exactly because of MRAB’s point – for the intended use cases, this would never be an issue (particularly since py3 and “universal newlines”).

However, I see no harm in making this change – it’s really hard to imagine that anyone would WANT \r\n not to be treated as a newline, and clearly, people can end up with mixed text.

guido · February 20, 2023, 5:29am

There are many string APIs in Python that use newlines and nearly all of them use only \n. (There may be some exceptions in the email and/or http modules). Let’s not make ad hoc exceptions.

If your editor (or other tool) preserves \r\n copied from Windows files on UNIX you should deal with it there, not by requesting Python to change.

PythonCHB · February 20, 2023, 7:36am

I still don’t think it would do any harm, but it’s also not hard for folks dealing with mixed-line-ending text to use a little utility to clean it up first.

Back before universal newlines, I wrote a fair bit of code like this:

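def normalize_line_endings(text):
    # Strings are immutable, so keep the value returned by replace().
    text = text.replace('\r\n', '\n')  # Windows CRLF -> LF
    text = text.replace('\r', '\n')    # old-style Mac CR -> LF
    return text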

Pretty simple, really – and the only reason for the second one is if you have old-style Mac line endings – which used to be a single '\r'. And the advantage of this is that you can decide for yourself how to deal with double or lone ‘\r’.

Or you can do it in a “fluent” style:

textwrap.dedent(text.replace('\r\n', '\n'))

Not all that much of a lift, is it?

barry-scott · February 20, 2023, 9:38am

It becomes a maintenance burden.
It is code that needs lots of comments to explain why it is needed.

I think it is close to falling into the lava-flow anti-pattern.

I agree with Guido’s comments, fix the editor.

tjreedy · February 20, 2023, 8:15pm

Based on the discussion above, ‘merwok’ closed the issue “textwrap.dedent doesn't work properly with strings containing CRLF” (Issue #63678, python/cpython). I agree with this.

What code editor were you using? I copied two lines from Notepad to IDLE. I am rather sure that it had \r\n in Notepad and the clipboard. It only had \n after pasting. (IDLE uses tk via tkinter and tk only uses \n.) Multiline string literals must only use \n as line separators, even if the lines are pasted rather than keyed.

belfner · February 20, 2023, 9:12pm

Thanks for the feedback. I agree that making this change would make this function non-standard. My point with the original post was to highlight what I had originally perceived to be a gap in functionality, not to ask for Python to change instead of solving my own problem. I understand that there are ways to work around the existing implementation to get the functionality I want even if I don’t “fix” my editor. The main reason I felt this post was even justified is that in the first GitHub issue I linked, it seemed there was interest but the patch was flawed.

I was copying text from a file with CRLF line endings in PyCharm on Windows 10, then pulling the clipboard with pyperclip. Again, this is not the motivation behind the original post, but I appreciate the help.

Thanks for everybody’s time.