My team routinely deals with all sorts of proprietary file formats from various customers. Many of those file formats are records with peculiar record separators that we have to use either str.split, re.split or re.finditer to parse.
Wouldn’t it be nice if io.TextIOWrapper, and by extension the open function, supported alternative characters as the line terminator, so we could take advantage of one of the most beautiful idioms in Python: reading “lines” with a for loop over a file-like object? The newline keyword argument could be aliased as lineterminator (a name borrowed from csv.Dialect), recordseparator (as in awk), or something more fitting:
```python
with open('records.dat', recordseparator=';') as records:
    for record in records:
        ...  # additional parsing of record here
```
It would be nice if csv.Dialect could support a custom lineterminator too.
I don’t believe this would be too technically difficult to implement (perhaps just removing the validation of the argument would do?), and it would surely eliminate a lot of ugly parsing code, especially when the content of the file-like object is streamed: str.split, re.split and re.finditer only work on strings, so we have to write a lot of code to buffer the stream and deal with incomplete fragments before we can use any of them.
Even more awesome would be support for regex patterns as line terminators (when the length is greater than 1 and not equal to \r\n, or when the argument is a re.Pattern object), but that can wait if it is deemed too complex a change.
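For what it’s worth, doing this today with re requires the same buffering dance as any custom separator. Here is a rough sketch (split_records_re is a hypothetical helper name, not an existing API, and note the caveat about matches spanning chunk boundaries):

```python
import io
import re

def split_records_re(stream, sep_pattern, chunk_size=4096):
    """Yield records delimited by a regex, buffering across chunk edges.

    Naive sketch: a separator match that straddles a chunk boundary can
    be mis-split (e.g. when a prefix of the separator is itself a match).
    """
    pat = re.compile(sep_pattern)
    tail = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            if tail:
                yield tail  # final record with no trailing separator
            return
        parts = pat.split(tail + chunk)
        tail = parts.pop()  # last piece may be an incomplete record
        yield from parts

print(list(split_records_re(io.StringIO("a1b22c"), r"\d+", chunk_size=3)))
# ['a', 'b', 'c']
```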
Until you’ve prototyped it, I would be careful about making assumptions about implementation difficulty.
Why do you think that is true for the majority of Python developers? Considering how old the io module is (Python 3.0), and that this is the first time I’ve seen this feature request (and I assume you checked the issue tracker and couldn’t find such an issue, hence this topic), I’m not sure it’s that common. I completely understand how it would benefit your needs at work, but we have to balance the cost of maintaining this, the potential performance impact of making it flexible, etc.
Of course just about everything one wishes to accomplish can be implemented on one’s own, and once implemented, all the ugliness and/or complexity can be kept out of sight in a separate module and reused easily by importing it. But the point of this proposal is that we already have a perfectly elegant solution built into the standard library, so why reinvent the wheel and duplicate logic if all it takes is for the built-in solution to be made slightly less restrictive?
By the way, your solution reads the entire file content into memory and therefore would not efficiently support a streamed file-like object. To support one, you would need something unsightly like this:
```python
def splitlines(file, newline, chunk_size=4096):
    """Yield newline-delimited records from a stream, chunk by chunk."""
    tail = ""
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            if tail:
                yield tail  # final record with no trailing separator
            return
        lines = (tail + chunk).split(newline)
        tail = lines.pop()  # last piece may be an incomplete record
        yield from lines
```
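To make the behavior concrete, here is a self-contained version of that buffering generator exercised against an in-memory stream (io.StringIO standing in for a real file; split_records is just an illustrative name):

```python
import io

def split_records(stream, sep, chunk_size=4096):
    """Yield sep-delimited records from a stream without loading it all."""
    tail = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            if tail:
                yield tail  # final record with no trailing separator
            return
        parts = (tail + chunk).split(sep)
        tail = parts.pop()  # last piece may be an incomplete record
        yield from parts

# chunk_size=2 forces records to span chunk boundaries
print(list(split_records(io.StringIO("alpha;beta;gamma"), ";", chunk_size=2)))
# ['alpha', 'beta', 'gamma']
```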
I did do a search before posting here, but I guess I wasn’t using the right keywords. Now that you mention it, I redid my search with a wider set of keywords and found this 18-year-old issue that was marked as “resolved” after someone uploaded a patch to _pyio.TextIOWrapper, while nothing was done to the C implementation itself: https://bugs.python.org/issue1152248
The demand is there, even if not “common” (a very subjective adjective). I still wonder why the newline argument had to be made so restrictive to begin with, since there should be no performance downside to allowing an alternative character as the line separator.
Because it’s dealing with newlines, which are delimited by a specific set of bytes, not with a “generic way to delimit text”. Reading text is either line-, chunk-, or byte-based. If you need parsing on top of that, read chunks and write a parser for your specific need.
I don’t get the need to be so stuck on the current naming of the argument when the concept of a line is really that of a record. We could easily alias the argument as recordseparator (think awk) so the elegant idiom can be used in a much broader range of applications.
Yes, I will do that. Will report back once I gather some usage statistics.
I just thought that this is a missed opportunity: a low-hanging fruit that would elegantly satisfy a meaningful set of use cases with a minimal amount of effort, in a language that has long maintained a “rich and versatile” standard library under a batteries-included philosophy.
I wouldn’t consider this low-hanging fruit or technically simple to implement. I was curious, so I spent a few minutes diving into the C code. Here’s what looks to be the core logic: _PyIncrementalNewlineDecoder_decode
I haven’t taken the time to unravel what it would take to add checking for an arbitrary character, but suffice it to say, it would not be simple. And even then, there are a ton of other considerations: performance, documentation, testing, platform support, etc.
Opening and working with text files is one of the fundamental features of Python, not to mention any programming language designed for solving problems. Even if the idea was universally accepted, making a change to such a stable, fundamental part of the language would merit a serious investigation. Gauging interest through a PyPI package or a search through public source code would be a simpler start than trying to actually code an implementation.
Sidebar - the real killer feature of open()
In my view, the best thing about Python’s end-of-line detection is that it works flawlessly across Windows and *nix, as well as being codec-aware and handling byte strings. I can write the same code on both operating systems and have it just work:
```python
with open('file.txt') as f:
    for line in f:
        ...  # process each line, regardless of platform line endings
```
That’s the killer feature, and it’s not a simple one to do well and to do fast. But Python manages to do so.
Here’s an example of it ‘just working’ for different line endings:
```python
>>> s = 'this\nis\na\nstring'
>>> for l in s.splitlines():
...     print(l)
>>> for l in bytes(s, 'utf-8').splitlines():
...     print(l)
>>> ws = 'this\r\nis\r\na\r\nwindows\r\nstring'
>>> for line in ws.splitlines():
...     print(line)
>>> mixed = 'this\nis\r\na\nweird\r\nstring'
>>> for line in mixed.splitlines():
...     print(line)
```
So what happens when something in a proprietary file format arrives by extraction from a zip file, via SFTP, or in an XML file element? If you tie your file format parsing in with text file I/O, then all those things become kludgy. Not impossible, there’s always io.StringIO, which I presume would be made to support the same newline varieties. But how would you incrementally parse data read from a ZipExtFile returned from ZipFile.open?
What happens when your customer tells you that obviously semicolons are not record separators if they are doubled, within quotes or backslash-escaped? You can’t push that additional logic into open, so you will have to rethink your structure.
Don’t repeat the csv module’s mistake: csv.writer needs a text file opened in a special way, with newline=''. Forgetting that is a mistake I’ve made more than once. If csv had instead been built on binary files, it would have been less error-prone to work with.
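The footgun, for anyone who hasn’t hit it: csv.writer emits \r\n itself, so without newline='' the text layer on Windows translates it again into \r\r\n. A minimal demonstration of the correct incantation (the file path is just for illustration):

```python
import csv
import os
import tempfile

# csv.writer writes '\r\n' itself; newline='' stops the text layer
# from translating that terminator a second time on Windows.
path = os.path.join(tempfile.mkdtemp(), "out.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerow(["a", "b"])

# newline='' on read likewise disables translation, exposing the raw '\r\n'
with open(path, newline="") as f:
    print(repr(f.read()))
# 'a,b\r\n'
```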
That’s exactly what io.TextIOWrapper is for. See the SO answer below:
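For instance, a zip member can be streamed as text without extracting it, by layering io.TextIOWrapper over the binary ZipExtFile that ZipFile.open returns (the member name and contents here are illustrative):

```python
import io
import zipfile

# Build a small zip in memory so the example is self-contained.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("records.dat", "alpha\nbeta\ngamma\n")
buf.seek(0)

# ZipFile.open returns a binary ZipExtFile; TextIOWrapper adds decoding
# and line iteration on top, reading the member incrementally.
with zipfile.ZipFile(buf) as z, z.open("records.dat") as raw:
    text = io.TextIOWrapper(raw, encoding="utf-8")
    print([line.rstrip("\n") for line in text])
# ['alpha', 'beta', 'gamma']
```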
That’s why support for regex as a record separator is a nice-to-have, but not a priority since even my team, having dealt with all sorts of legacy proprietary file formats over the years, rarely needed it. Not never, but rarely.
The no-translation mode only matters when the newline is the record separator, which is entirely irrelevant to my proposal, since I’m specifically asking for a non-newline record separator.
So… do it. Make a PR. From this discussion and others you’ve created or participated in, it seems you’re set on getting what you ask for the way you ask for it. So why discuss it more? Open the PR, and either a core dev will agree with it or not. Is there any more you need to get out of your discussions?