Changing str.splitlines to match file readlines

This is related to bug 18291. Perhaps some people will be surprised that str.splitlines() splits on a bunch more things besides newlines (\n) and carriage return (\r) characters. That’s unlike the str object in Python 2. See the doc for str.splitlines. Because the codecs module uses str.splitlines internally, you get different line splitting behaviour when you use codecs.getreader() vs using open() with a text mode file.

I don’t see the sense of having splitlines() treat all these extra control characters as splitting lines. If I look at the uses of splitlines() in the stdlib, it seems that usually people would want the same splitting behavior as what a io file does. I would not be surprised if there real bugs, some of them security related, due to this behavior.

After doing some digging, it looks like this behavior goes way back to when the unicode type as introduced in Python 2. I suspect when code was converted from 2 to 3, splitlines() got changed from operating on bytes to operating on unicode strings. And, I bet people didn’t realize this difference in behaviour.

IMHO, it should ideally be the case that:
open(fn).readlines() == open(fn).read().splitlines().

I’m posting here because I don’t know if this is something we can change, as it could be behaviour that people are relying on.

2 Likes

I agree that "same to io.readlines()" is the behavior 99% people expect.
This backward incompatible change will fix more bugs in user applications than break user application.

But breaking backward compatibility is a hard decision, always.

More easy way is adding a new method: str.iterlines{[keepends]).
This is ugly, but backward compatible.

1 Like

There are a number of bugs in the issue tracker related to this, some open, some closed. I found bug 22232 which seems to be most relevant (str.splitlines splitting on non-\r\n characters). Now I’m not sure I should have brought the discussion here, maybe should have keep it inside the issue tracker.

Marc-Andre Lemburg has commented on the issue and is -1 on changing str.splitlines behavior and also doesn’t think codecs should be changed either. I can see his argument from a Unicode purity standpoint. However, to me, it looks like this behavior creates way more bugs than it helps. Naming the unicode.splitlines method the same as str.splitlines was a mistake, IMHO.

If I look at the uses of .splitlines(...) in the Python standard library and in the top 50 PyPi packages, nearly every case appears to expect splitting on ‘\r’ and ‘\n’ or on ‘\n’ alone. In most cases the code matches the Python 2 code and often the value being split is a str. I haven’t found any case where it is clear that the Unicode-style splitting would be desired. So, that’s why I feel Marc-Andre is arguing for purity rather than practicality.

Adding a new method like iterlines() might be the best solution. If we do, I think it should split only on \n, so it matches readlines() on file objects. We should give it a keepends keyword so that fixing calls of splitlines() is easy. Then, we have to audit all uses of splitlines() and most of them need to get changed to iterlines().

I would like codecs reader objects to get a newline keyword parameter and have it work like the io.open version does. Then, modules like csv should have a way to check that newline='' has been used and warn if it hasn’t. To me, it looks like there are many bugs related to this Unicode splitting behaviour. Start putting \v characters in protocol headers, CSV file fields, etc, and you will see a lot of unexpected behaviour.

Would it make sense to give iterlines() a newline parameter and make it behave as open does? I.e. it controls universal newlines the same way as for file objects. That would make implementing support for newline in codecs.getreader trivial, just replace splitlines() with iterlines(newline=self._newline). The most common pattern seems to be “I have a chunk of text/bytes, give me the lines just like file.readlines() would give them”.

I actually didn’t even know str.splitlines() didn’t operate on just newlines, so I think that’s going to surprise a lot of people if it split on something else. For practicality purposes I would support changing it. We could probably has a DeprecationWarning in 3.8 when str.splitlines() would have split on something other than newlines and then in 3.9 make it only split on newlines (and possibly expand the docstring to explain how to get the old behaviour with a code snippet). Otherwise adding a keyword-only argument like newlines_only=True as an option would also work but still be an odd default compared to user expectation.

1 Like

Here is a quick analysis of the use of splitlines inside the stdiib. It is not so clear what is intended (or even if sometimes the data is str or bytes). So, take my analysis with a grain of salt as I did it quickly.

codecs.py
509:                    lines = newchars.splitlines(keepends=True)
551:                line = line.splitlines(keepends=False)[0]
568:            lines = line.splitlines(keepends=True)
584:                        line = line.splitlines(keepends=False)[0]
587:                line0withoutend = lines[0].splitlines(keepends=False)[0]
600:                    line = line.splitlines(keepends=False)[0]
619:        return data.splitlines(keepends)
822:        return data.splitlines(keepends=True)

Marc-Andre suggest that codecs should split lines according to the Unicode
ISNEWLINE rules and should not match file.readlines() for text files. I think
otherwise but let’s leave this case. At minimum, I think getreader() should
get an option newline to make it match what open() does.

nntplib.py
896:            f = f.splitlines()

In this case f is bytes so this is okay maybe. Not sure splitting on \r is
okay though.

doctest.py
1413:            return example.source.splitlines(keepends=True)
1676:            want_lines = want.splitlines(keepends=True)
1677:            got_lines = got.splitlines(keepends=True)

Split on ‘\n’ only.

argparse.py
655:        return ''.join(indent + line for line in text.splitlines(keepends=True))
666:        return text.splitlines()
2047:                        for arg_line in args_file.read().splitlines():

Split on ‘\n’ only.

linecache.py
113:                    [line+'\n' for line in data.splitlines()], fullname

Split on ‘\n’ only.

pprint.py
250:        lines = object.splitlines(True)

Split on ‘\n’ only.

difflib.py
785:    ... '''.splitlines(keepends=True)
794:    ... '''.splitlines(keepends=True)
880:        >>> print(''.join(Differ().compare('one\ntwo\nthree\n'.splitlines(True),
881:        ...                                'ore\ntree\nemu\n'.splitlines(True))),
1247:    >>> print(''.join(context_diff('one\ntwo\nthree\nfour\n'.splitlines(True),
1248:    ...       'zero\none\ntree\nfour\n'.splitlines(True), 'Original', 'Current')),
1366:    >>> diff = ndiff('one\ntwo\nthree\n'.splitlines(keepends=True),
1367:    ...              'ore\ntree\nemu\n'.splitlines(keepends=True))
2070:    >>> diff = ndiff('one\ntwo\nthree\n'.splitlines(keepends=True),
2071:    ...              'ore\ntree\nemu\n'.splitlines(keepends=True))

Split on ‘\n’ or ‘\r’?

pydoc.py
288:            result = module.__doc__.splitlines()[0] if module.__doc__ else None
2155:                    desc = module.__doc__.splitlines()[0] if module.__doc__ else ''

Split on ‘\n’ only.

smtpd.py
709:        lines = data.splitlines()

Split on ‘\r’ or ‘\n’?

site.py
180:                    for line in record.splitlines():

Split on ‘\n’ only.

textwrap.py
478:        for line in text.splitlines(True):

Split according to Unicode ISNEWLINE? I think splitting on ‘\r’ or ‘\n’ would
be fine too.

email/contentmanager.py
143:    lines = string.encode(charset).splitlines()

Okay, this is bytes.

email/header.py
86:    for line in header.splitlines():
369:            lines = string.splitlines()

Split on ‘\n’ only?

email/quoprimime.py
186:    for line in body.splitlines():
243:    for line in encoded.splitlines():

Split on ‘\n’ only?

email/policy.py
142:        if isinstance(value, str) and len(value.splitlines())>1:
207:        lines = value.splitlines()

Split on ‘\n’ only?

email/message.py
286:            value, defects = decode_b(b''.join(bpayload.splitlines()))

Split on ‘\n’ only?

distutils/util.py
511:        for line in template.splitlines():

Split on ‘\n’ only?

distutils/_msvccompiler.py
149:        (line.partition('=') for line in out.splitlines())

Split on ‘\n’ only?

lib2to3/main.py
19:    a = a.splitlines()
20:    b = b.splitlines()

Split on ‘\n’ only.

lib2to3/refactor.py
550:        for line in input.splitlines(keepends=True):
594:            new = str(tree).splitlines(keepends=True)

Split on ‘\n’ only.

urllib/robotparser.py
70:            self.parse(raw.decode("utf-8").splitlines())

Split on ‘\n’ or ‘\r’?

unittest/case.py
1017:            difflib.ndiff(pprint.pformat(seq1).splitlines(),
1018:                          pprint.pformat(seq2).splitlines()))
1130:                           pprint.pformat(d1).splitlines(),
1131:                           pprint.pformat(d2).splitlines())))
1207:            firstlines = first.splitlines(keepends=True)
1208:            secondlines = second.splitlines(keepends=True)

Split on ‘\n’ or ‘\r’? Like difflib.

test/bisect.py
62:    tests = proc.stdout.splitlines()

Split on ‘\n’.

lib2to3/pgen2/grammar.py
185:for line in opmap_raw.splitlines():

Split on ‘\n’.

lib2to3/pgen2/tokenize.py
328:        readline = iter(newcode.splitlines(1)).next

Split on ‘\n’.

I don’t see how the current behavior is a good default. I can understand if we can’t change it due to backwards compatibility. I might argue that we would be doing more good than harm by changing it.

I would assume some Unicode standards nuts would be disappointed. (And perhaps they would want to change .readlines() instead?)

I also assume you intended to write .splitlines(True) in your example.

I can’t speak for them but I would expect so. From a purely “Unicode text” standpoint, splitting on the extra characters makes sense. So, for a module like textwrap, it is okay. However, based on my quick survey of code, that’s often not what is expected or wanted. I think when Python programs deal with text, in many cases it is dealing with text that is part of some file format or protocol. Then, if you split on the extra characters, you end up with bugs.

I think there’s nothing that can be done here. The two functions have had their current behavior for a very long time and changing either one is going to disappoint a few folks. The vast majority of folks won’t care at all about the difference, no matter how surprising it is to you.

1 Like

I can accept that we don’t want to change splitlines() at this point. What about introducing iterlines() such that:

list(open(…).readlines()) == list(open(…).read().iterlines())

What’s your use case that you’re so determined to fix this? You’re not going to change all the code that’s already written. If we’re going to add new API functionality I’d rather make it a new keyword arg to splitlines() and/or readlines().

I think most instances of splitlines() in the Python stdlib need to be fixed. They should be splitting on ‘\n’ only, same as readlines(). So, what is the best way to fix all those calls? You can’t use x.split(’\n’), it doesn’t do the same thing. Adding a new keyword arg to splitlines() is an option. But, I recall that you thought a keyword that is always set to a constant value is bad API design. So, I think a new method would be a better design.

I wouldn’t say I’m determined to fix this, just motivated to try. I was personally bitten by a bug caused by this. When data contains characters that Unicode considers line-breaks (e.g. ^K), my software broke and did weird things. After a fair amount of debugging (figuring out what data is needed to trigger the bug) I tracked it down. Based on looking at other code, I expect the kind of bug I ran into is common. Nearly every case of splitlines() looks like it intends to split on ‘\n’.

1 Like

I am definitely a tiny bit sad about this state of affairs, and like you I wish splitlines() had used the same rules as readlines(). But I don’t think there’s much that can be done without breaking backwards compatibility, and in most cases it probably doesn’t matter – even if we had a new method or parameter, I doubt it would be a good idea to make sweeping changes to stdlib code calling splitlines(). And I wonder how much you’re overreacting because you had a painful debugging session – this is a common psychological effect.

We have already had the discussion on the bug tracker.

In summary the situation goes like this:

When Unicode was added, I had focused on making the right design decisions from the Unicode perspective. Unicode does define more line breaks than ASCII, so we had to cover all of them to stay in line with the standard, just as Unicode defines way more characters which can be interpreted as digits or even whole numbers and the corresponding methods such as .isdigit() were extended. This was necessary, since the Unicode object in Python was supposed to represent Unicode according to the standard.

Now, when Python 3 switched to using Unicode instead of bytes for “str”, we got the Unicode semantics, but without actually going through the stdlib (or other code) which was written with ASCII semantics in mind in some parts. In some cases, this helped, in others it did not, since Python would now recognize more “characters” as line breaks or digits than originally planned by the resp. code authors.

After all this time, I don’t think we can change the semantics radically anymore and simply have to accept that we did not look close enough at the implications when transitioning from str=bytes to str=unicode.

Overall, I don’t think there are that many cases where the Unicode semantics cause a problem in real life. The few extra line end characters are very rarely used in practice and will most likely point to problems with the data, rather than problems with the code, in those cases where you don’t want the new code points to be interpreted as line breaks.

If you are indeed working with Unicode text and not with ASCII text which you happen to receive as Uncode str, you do want these line breaks, since that’s what Unicode defines and what the author of the text will have used as basis for writing the text. In this respect, the handling of .splitlines() is correct. It is only not correct for standards that mandate ASCII line break characters only.

I think for code which definitely may only break on ASCII newlines we should either create a new method (e.g. .asciilines()) or add a parameter to .splitlines() which restricts the set of characters to split on to the ASCII ones we had in Python 2.

If add an option for ASCII-only splitlines(), it may be worth to add corresponding options for other str methods.

ASCII-only analogs of Unicode string methods.

  • s.isdigit():

    • s.isascii() and s.isdigit()
  • s.strip():

    • s.strip(string.whitespace)
  • s.split():

    • list(filter(None, re.split(r'(?a)\s+', s)))
  • s.splitlines():

    • re.split(r'\n|\r\n?', s)
  • s.splitlines(keepends=True):

    • re.split(r'(?<=\n)|(?<=\r)(?!\n)', s)
1 Like

To me, that sounds like a recipe for security bugs, like CVE-2019-9740. If an attacker controls the data and can inject “exotic” line breaks, they could trick server software into doing something unexpected. E.g. imagine you have two places were incoming data is parsed:

  • an initial layer that does input validation and splits on CR/LF

  • an inner layer that uses str.splitlines() and executes the actions specified by the (potentially hostile) user

The attacker can use exotic line breaks to bypass the validation. Since almost no one tests using those kinds of line breaks, the problem doesn’t appear unless it is intentionally exploited.

1 Like

Neil, we’re going around in circles. str.splitlines() implements
the Unicode semantics of splitting on Unicode defines line break
code points. As such, the code does what it should do as part of the
Unicode implementation.

If you’d rather like to see other semantics, we can add a parameter
or a new method which deals with just ASCII lines breaks or a
specific set of characters you want to regard as line breaks.
Or you could use re.split() or a dedicated ASCII line break
function instead.

However, changing the default semantics would break Unicode
compatibility, so is a no-go.

Regarding security issues: Yes, there are many ways to trick code
which wasn’t written for Unicode into doing all kinds of weird
things, but that only hints at problems in the code.

Porting to Unicode is not as easy as taking ASCII code and letting it
play with Unicode instead. There are many things to consider,
e.g. combining code points, normalizations, code points which
look alike but aren’t the same, format code points, control
code points, etc. etc.

Moreover, code point meanings do change over time. E.g. what
used to be tagging code points for the purpose of associating
language with a series of code points have now turned into
emojis.

In some cases, you may need to preserve ASCII semantics and
algorithms or standard may require this, so a new method or
even set of methods may be warranted to address this need in
a Unicode world.

1 Like

Sorry, I’ll drop it. I thought the problem is I’m failing to make myself understood. However, I guess I must be the one not understanding or we have some disagreement that we are not going to resolve with endless discussion.

Leaving incorrect usage of splitlines() in the standard library?

Hi Piotr,
There is a chance that anywhere splitlines() is being used, that code needs to be fixed (not just in stdlib but any Python code). Based on my little investigation work, it seems more often than not people expect to split on only CR/LF (like Python 2 did). Also, anywhere codecs.open() or similar (e.g. getreader) is used could also be buggy. I think often people expect line splitting like open() does (CR/LF) but the codes module line splits Unicode text like str.splitlines().

Figuring out what the code intends to do and if it should be changed is not trivial. In some cases, like wrapping text, it doesn’t matter much one way or another. In other cases, the text being split is part of some protocol and splitting on the wrong characters is definitely a bug. Nothing I found in the stdlib looked to me like very obviously buggy behavior. If someone else wants to investigate further and to file bugs if they find problems, that would be great.