Changing str.splitlines to match file readlines

malemburg · November 17, 2018, 8:05am

We have already had the discussion on the bug tracker.

In summary the situation goes like this:

When Unicode was added, I focused on making the right design decisions from the Unicode perspective. Unicode defines more line breaks than ASCII, so we had to cover all of them to stay in line with the standard. In the same way, Unicode defines many more characters that can be interpreted as digits or even whole numbers, and the corresponding methods such as .isdigit() were extended accordingly. This was necessary, since the Unicode object in Python was supposed to represent Unicode according to the standard.
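To make this concrete, here is a small sketch of the current Python 3 behavior. The sample strings are made up for illustration; the set of separators is the one listed in the str.splitlines() documentation.

```python
# Line boundaries that str.splitlines() recognizes in addition to
# \n, \r and \r\n: VT, FF, FS, GS, RS, NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR.
text = "a\x0bb\x0cc\x1cd\x1de\x1ef\x85g\u2028h\u2029i"
parts = text.splitlines()
print(parts)  # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

# .isdigit() follows the Unicode definition as well:
# ASCII seven, ARABIC-INDIC DIGIT THREE, SUPERSCRIPT TWO.
digits = ["7", "\u0663", "\u00b2"]
flags = [d.isdigit() for d in digits]
print(flags)  # [True, True, True]

# int() accepts decimal digits from other scripts too.
arabic_three = int("\u0663")
print(arabic_three)  # 3
```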

Now, when Python 3 switched to using Unicode instead of bytes for “str”, we got the Unicode semantics, but without actually going through the stdlib (or other code), parts of which had been written with ASCII semantics in mind. In some cases this helped, in others it did not, since Python would now recognize more “characters” as line breaks or digits than the respective code authors had originally planned for.
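The mismatch is easy to see when the same data goes through the different APIs. The following is only a sketch with a made-up sample string; io.StringIO stands in for a text file here, and with its default newline setting it ends lines on "\n" only.

```python
import io

# Hypothetical sample: a form feed and a LINE SEPARATOR inside the first line.
text = "first\x0csecond\u2028third\nfourth\n"

# File-like reading: only "\n" ends a line.
via_readlines = io.StringIO(text).readlines()
print(via_readlines)   # ['first\x0csecond\u2028third\n', 'fourth\n']

# str.splitlines(): the Unicode line boundaries split as well.
via_splitlines = text.splitlines(keepends=True)
print(via_splitlines)  # ['first\x0c', 'second\u2028', 'third\n', 'fourth\n']

# bytes.splitlines(): only \n, \r and \r\n, as with Python 2 str.
via_bytes = text.encode("utf-8").splitlines(keepends=True)
print(via_bytes)       # [b'first\x0csecond\xe2\x80\xa8third\n', b'fourth\n']
```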

After all this time, I don’t think we can change the semantics radically anymore. We simply have to accept that we did not look closely enough at the implications when transitioning from str=bytes to str=unicode.

Overall, I don’t think there are that many cases where the Unicode semantics cause a problem in real life. The few extra line-ending characters are very rarely used in practice. In those cases where you don’t want the new code points to be interpreted as line breaks, their presence will most likely point to problems with the data rather than problems with the code.

If you are indeed working with Unicode text and not with ASCII text which you happen to receive as a Unicode str, you do want these line breaks, since that’s what Unicode defines and what the author of the text will have used as the basis for writing the text. In this respect, the handling of .splitlines() is correct. It is only incorrect for standards that mandate ASCII line break characters only.

I think for code which definitely may only break on ASCII newlines we should either create a new method (e.g. .asciilines()) or add a parameter to .splitlines() which restricts the set of characters to split on to the ASCII ones we had in Python 2.
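Until something like that exists, such code can use a small helper. The following is a minimal sketch, not an existing API: the name ascii_splitlines is hypothetical, and it assumes that "the ASCII ones we had in Python 2" means \n, \r and \r\n, which is also what bytes.splitlines() splits on.

```python
import re

# Assumption: the restricted set is \n, \r and \r\n. "\r\n" comes first in
# the alternation so that it is treated as a single line break.
_ASCII_NEWLINE = re.compile(r"\r\n|\r|\n")


def ascii_splitlines(text, keepends=False):
    """Hypothetical helper: like str.splitlines(), but only \\n, \\r, \\r\\n split."""
    lines = []
    start = 0
    for match in _ASCII_NEWLINE.finditer(text):
        end = match.end() if keepends else match.start()
        lines.append(text[start:end])
        start = match.end()
    # Keep a final line that has no trailing line break; like
    # str.splitlines(), do not add an empty string after a trailing one.
    if start < len(text):
        lines.append(text[start:])
    return lines


sample = "a\u2028b\r\nc\x0cd\n"
print(ascii_splitlines(sample))                 # ['a\u2028b', 'c\x0cd']
print(ascii_splitlines(sample, keepends=True))  # ['a\u2028b\r\n', 'c\x0cd\n']
```

The same result can be had by encoding to UTF-8, calling bytes.splitlines() and decoding each piece, since the bytes 0x0A and 0x0D never occur inside a multi-byte UTF-8 sequence.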