textwrap.dedent does not work as expected for text with CRLF line endings.
A key feature of
textwrap.dedent is the ability to ignore lines that contain only whitespace (
' ' and
'\t') and end in a newline character. The problem arises when processing strings that use CRLF for line endings as the regex patterns used in the function consider the carriage return character as non-whitespace. This means the string
' foo\n\n' will be processed as expected into
'foo\n\n' but the string
' foo\r\n\r\n' will be unchanged. This is because the second line contains a non-whitespace character (
'\r') and thus is included when calculating the largest common prefix which in this case would be
'' causing no change to happen.
One possible solution is to use the pattern
'textwrap.dedent(input_str.replace('\r\n','\n')).replace('\n','\r\n')' . This will work as long as the input string only uses CRLF as its line endings. If the input string contains mixed line endings, this solution will change them all to CRLF. While this behavior may be okay since text with mixed line endings could be considered an okay edge case, however this code pattern still exists due to unexpected behavior of
This Github issue proposes another solution but there are three problems I see with it. One, it changes the default behavior of textwrap.dedent possibly breaking legacy code (this may or may not be a big deal). Two, it does not treat CRLF line endings the same as LF line endings. Currently any line that contains only whitespace and ends with
'\n' is replaced with just
'\n' by the current implementation. The proposed solution however does not do something similar for lines that contain only whitespace and ends with
'\r\n'. Instead the lines are left alone. The third problem is the new regex pattern suggested treats the string
' \r\r\n' as having a leading whitespace of 1 rather than 2. For any string that starts with a non-zero number of whitespace characters followed by
'\r\r\n', the pattern will identify the leading whitespace as all leading whitespace characters except for the last one.
There is another Github issue (#59250) related to
textwrap.dedent but it is more focused on what should be considered a “whitespace” character.
My solution (located at this commit) overcomes all issues raised above. To prevent changes to the default behavior of
textwrap.dedent, a new argument was added that would act as a flag to enable the new behavior. Enabling the new behavior adds an extra processing step where lines with only whitespace and ending with CRLF line endings are replaced with
'\r\n' to emulate the way LF lines are handled. It also changes the regex patterns used to gather all prefixes to ignore lines containing only
'\r\n'. I have updated the tests for the function to demonstrate that passing
True to the new argument treats CRLF line endings the same as LF line endings.
I am interested to hear your thoughts on my proposed changes and am open to hearing suggestions on implementation. The current name of the “flag” is
eol_agnostic but I welcome suggestions for a better one. I have also made the decision to make
eol_agnostic keyword-only to avoid boolean traps.
I understand this is not a pull request or even a official issue but I am including the output of
The following modules are *disabled* in configure script: _sqlite3 The necessary bits to build these optional modules were not found: _bz2 _curses _curses_panel _dbm _gdbm _lzma _tkinter readline To find the necessary bits, look in configure.ac and config.log. Checked 111 modules (30 built-in, 71 shared, 1 n/a on linux-x86_64, 1 disabled, 8 missing, 0 failed on import) ./python ./Tools/patchcheck/patchcheck.py Getting base branch for PR ... origin/main Getting the list of files that have been added/changed ... 1 file Fixing Python file whitespace ... 0 files Fixing C file whitespace ... 0 files Fixing docs whitespace ... 0 files Docs modified ... NO Misc/ACKS updated ... NO Misc/NEWS.d updated with `blurb` ... NO configure regenerated ... not needed pyconfig.h.in regenerated ... not needed Did you run the test suite?
And the output of
./python -bb -E -Wd -m test -r -w -uall -j0:
== Tests result: SUCCESS == 404 tests OK. 1 test altered the execution environment: test_generators 29 tests skipped: test_bz2 test_check_c_globals test_curses test_dbm_gnu test_dbm_ndbm test_devpoll test_gdb test_idle test_ioctl test_kqueue test_launcher test_lzma test_msilib test_peg_generator test_readline test_sqlite3 test_startfile test_tcl test_tix test_tkinter test_ttk test_ttk_textonly test_turtle test_winconsoleio test_winreg test_winsound test_wmi test_zipfile64 test_zoneinfo Total duration: 2 min 16 sec Tests result: SUCCESS