The Problem
The function textwrap.dedent
does not work as expected for text with CRLF line endings.
A key feature of textwrap.dedent
is the ability to ignore lines that contain only whitespace (' '
and '\t'
) and end in a newline character. The problem arises when processing strings that use CRLF for line endings as the regex patterns used in the function consider the carriage return character as non-whitespace. This means the string ' foo\n\n'
will be processed as expected into 'foo\n\n'
but the string ' foo\r\n\r\n'
will be unchanged. This is because the second line contains a non-whitespace character ('\r'
) and thus is included when calculating the largest common prefix which in this case would be ''
causing no change to happen.
One possible solution is to use the pattern 'textwrap.dedent(input_str.replace('\r\n','\n')).replace('\n','\r\n')'
. This will work as long as the input string only uses CRLF as its line endings. If the input string contains mixed line endings, this solution will change them all to CRLF. While this behavior may be okay since text with mixed line endings could be considered an okay edge case, however this code pattern still exists due to unexpected behavior of textwrap.dedent
.
This Github issue proposes another solution but there are three problems I see with it. One, it changes the default behavior of textwrap.dedent possibly breaking legacy code (this may or may not be a big deal). Two, it does not treat CRLF line endings the same as LF line endings. Currently any line that contains only whitespace and ends with '\n'
is replaced with just '\n'
by the current implementation. The proposed solution however does not do something similar for lines that contain only whitespace and ends with '\r\n'
. Instead the lines are left alone. The third problem is the new regex pattern suggested treats the string ' \r\r\n'
as having a leading whitespace of 1 rather than 2. For any string that starts with a non-zero number of whitespace characters followed by '\r\r\n'
, the pattern will identify the leading whitespace as all leading whitespace characters except for the last one.
There is another Github issue (#59250) related to textwrap.dedent
but it is more focused on what should be considered a âwhitespaceâ character.
The Solution
My solution (located at this commit) overcomes all issues raised above. To prevent changes to the default behavior of textwrap.dedent
, a new argument was added that would act as a flag to enable the new behavior. Enabling the new behavior adds an extra processing step where lines with only whitespace and ending with CRLF line endings are replaced with '\r\n'
to emulate the way LF lines are handled. It also changes the regex patterns used to gather all prefixes to ignore lines containing only '\r\n'
. I have updated the tests for the function to demonstrate that passing True
to the new argument treats CRLF line endings the same as LF line endings.
Final Thoughts
I am interested to hear your thoughts on my proposed changes and am open to hearing suggestions on implementation. The current name of the âflagâ is eol_agnostic
but I welcome suggestions for a better one. I have also made the decision to make eol_agnostic
keyword-only to avoid boolean traps.
I understand this is not a pull request or even a official issue but I am including the output of make patchcheck
:
The following modules are *disabled* in configure script:
_sqlite3
The necessary bits to build these optional modules were not found:
_bz2 _curses _curses_panel
_dbm _gdbm _lzma
_tkinter readline
To find the necessary bits, look in configure.ac and config.log.
Checked 111 modules (30 built-in, 71 shared, 1 n/a on linux-x86_64, 1 disabled, 8 missing, 0 failed on import)
./python ./Tools/patchcheck/patchcheck.py
Getting base branch for PR ... origin/main
Getting the list of files that have been added/changed ... 1 file
Fixing Python file whitespace ... 0 files
Fixing C file whitespace ... 0 files
Fixing docs whitespace ... 0 files
Docs modified ... NO
Misc/ACKS updated ... NO
Misc/NEWS.d updated with `blurb` ... NO
configure regenerated ... not needed
pyconfig.h.in regenerated ... not needed
Did you run the test suite?
And the output of ./python -bb -E -Wd -m test -r -w -uall -j0
:
== Tests result: SUCCESS ==
404 tests OK.
1 test altered the execution environment:
test_generators
29 tests skipped:
test_bz2 test_check_c_globals test_curses test_dbm_gnu
test_dbm_ndbm test_devpoll test_gdb test_idle test_ioctl
test_kqueue test_launcher test_lzma test_msilib test_peg_generator
test_readline test_sqlite3 test_startfile test_tcl test_tix
test_tkinter test_ttk test_ttk_textonly test_turtle
test_winconsoleio test_winreg test_winsound test_wmi
test_zipfile64 test_zoneinfo
Total duration: 2 min 16 sec
Tests result: SUCCESS