Make io.TextIOWrapper/open support custom line terminators/record separators

My team routinely deals with all sorts of proprietary file formats from various customers. Many of those file formats are records with peculiar record separators that we have to use either str.split, re.split or re.finditer to parse.

Wouldn’t it be nice if io.TextIOWrapper, and by extension the open function, supported alternative characters as the line terminator, so we could take advantage of one of the most beautiful idioms in Python: reading “lines” with a for loop over a file-like object? The newline keyword argument could be aliased as lineterminator (a name borrowed from csv.Dialect), recordseparator (as in awk), or something more fitting:

with open('records.dat', recordseparator=';') as records:
    for record in records:
        # additional parsing of record here

It would be nice if csv.Dialect could support a custom lineterminator too.

I don’t believe this would be too technically difficult to implement (perhaps just removing the validation of the argument would do?), and it would surely help eliminate a lot of ugly parsing code, especially when the content of the file-like object is streamed (str.split, re.split and re.finditer only work on strings, so we have to write a lot of code to buffer the stream and deal with incomplete fragments before we can use one of those methods).

Even more awesome would be to support regex patterns (when the length is greater than 1 and not equal to \r\n, or when it is a re.Pattern object) as the line terminator, but that can wait if it is deemed too complex a change.
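To illustrate what stdlib support would save: without it, regex-separated records on a stream need hand-rolled buffering along these lines. iter_records is a hypothetical helper name, and this is only a sketch with a known caveat: a greedy separator match that ends exactly at a chunk boundary may be treated as complete even if the next chunk would have extended it.

```python
import io
import re

def iter_records(stream, separator, chunk_size=4096):
    """Yield records from a text stream, split on a regex separator.

    A sketch, not a stdlib API.  Caveat: a greedy separator match that
    ends exactly at a chunk boundary is treated as complete even if the
    next chunk would have extended it.
    """
    pattern = re.compile(separator)
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            if buffer:
                yield buffer  # final record, with no trailing separator
            return
        buffer += chunk
        start = 0
        # Emit every record terminated by a complete separator match...
        for match in pattern.finditer(buffer):
            yield buffer[start:match.start()]
            start = match.end()
        # ...and hold back the unterminated tail for the next chunk.
        buffer = buffer[start:]

# Records separated by runs of one or more semicolons:
records = list(iter_records(io.StringIO("a;;b;c"), r";+"))
assert records == ["a", "b", "c"]
```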


I like this. The restriction on the newline argument seems entirely arbitrary.

If I might bikeshed, I think my preferred spelling is end_of_line.

It’s more about the maintenance of the feature.

Until you’ve prototyped it, I would be careful about making assumptions about implementation difficulty.

Why do you think that is true for the majority of Python developers? Considering how old the io module is (Python 3.0), and that this is the first time I’ve seen this feature request (and I assume you checked the issue tracker and couldn’t find such an issue, hence this topic), I’m not sure it’s that common. I completely understand how it would benefit your needs at work, but we have to balance maintaining this, potential performance costs in making it flexible, etc.


I don’t think this is something that needs to be put into the standard library. That said, it’s easily something you can accomplish on your own with:

  • __enter__
  • __exit__
  • __iter__
  • __next__
  • @classmethod

Here’s a silly, untested implementation I whipped up: Line Terminator Class · GitHub

You’d need to change it depending on what situations you’re looking to handle (not to mention test it), but the result is you get what you’re looking for: context managers and for loops.


Of course just about everything one wishes to accomplish can be implemented on one’s own, and once implemented, all the ugliness and/or complexity can be kept out of sight in a separate module and reused easily by importing it. But the point of this proposal is that we already have a perfectly elegant solution built into the standard library, so why reinvent the wheel or duplicate logic if all it takes is for the built-in solution to be made slightly less restrictive?

By the way, your solution reads all the file content into memory and therefore would not efficiently support a streamed file-like object. To support it you would need something unsightly like this:

def splitlines(file, newline, chunk_size=4096):
    """Yield "lines" from a file-like object, split on a custom separator."""
    tail = ""
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            if tail:
                yield tail  # last record, with no trailing separator
            break
        lines = (tail + chunk).split(newline)
        # The first fragment completes the previous tail; the last
        # fragment may be incomplete, so hold it back as the new tail.
        tail = lines.pop(0)
        if lines:
            yield tail
            tail = lines.pop()
            yield from lines
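A generator like this can be sanity-checked against str.split by forcing a deliberately tiny chunk size, so that separators straddle chunk boundaries (the function is repeated here so the snippet runs standalone):

```python
import io

def splitlines(file, newline, chunk_size=4096):
    # Same generator as in the post above, repeated so that this
    # snippet runs on its own.
    tail = ""
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            if tail:
                yield tail
            break
        lines = (tail + chunk).split(newline)
        tail = lines.pop(0)
        if lines:
            yield tail
            tail = lines.pop()
            yield from lines

# chunk_size=3 forces separators to straddle chunk boundaries, which is
# exactly the case the tail buffering has to handle.
data = "rec1;rec2;;rec3"
assert list(splitlines(io.StringIO(data), ";", chunk_size=3)) == data.split(";")
```

(Note one edge case: a trailing separator yields no final empty record, unlike str.split.)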

I did do a search before posting here, but I guess I wasn’t using the right keywords. Now that you mention it, I redid my search with a wider set of keywords and found this 18-year-old issue, which was marked as “resolved” after someone uploaded a patch to _pyio.TextIOWrapper while nothing was done to the C implementation in CPython itself:
https://bugs.python.org/issue1152248

The demand is there, even if not “common” (a very subjective adjective). I still wonder why the newline argument has to be so restrictive to begin with, since there should be no performance downside to allowing an alternative character as the line separator.

Because it’s dealing with newlines, which are delimited by a specific set of bytes, not with a “generic way to delimit text”. Reading text is either line-, chunk-, or byte-based. If you need parsing on top of that, read chunks and write a parser for your specific need.

I don’t see the need to be so stuck on the current naming of the argument, when the concept of a line is really that of a record. We could easily alias the argument as recordseparator (think awk) so the elegant idiom can be used in a much broader range of applications.

Maybe put together a package and throw it up on PyPI? See if it gets any usage? Or do a search if there are other people out there writing their own custom readers that are splitting on characters.

Yes, I will do that. Will report back once I gather some usage statistics.

I just thought that this is a missed opportunity: a low-hanging fruit that would elegantly satisfy a meaningful set of use cases with a minimal amount of effort, in a language that has long maintained a “rich and versatile” standard library under a batteries-included philosophy.

But this isn’t “parsing” any more than line-based (as normally understood) reading is. The necessary logic is already present, so why artificially restrict access to it?


I wouldn’t consider this low-hanging fruit or technically simple to implement. I was curious, so I spent a few minutes diving into the C code to look. Here’s what looks to be the core logic: _PyIncrementalNewlineDecoder_decode

I haven’t taken the time to unravel what it would take to add checking for an arbitrary character, but suffice it to say, it would not be simple. And even then, there are a ton of other considerations: performance, documentation, testing, platform support, etc.

Opening and working with text files is one of the fundamental features of Python, not to mention any programming language designed for solving problems. Even if the idea was universally accepted, making a change to such a stable, fundamental part of the language would merit a serious investigation. Gauging interest through a PyPI package or a search through public source code would be a simpler start than trying to actually code an implementation.

Sidebar - the real killer feature of open()

In my view, the best thing about Python’s EOL detection is that it works flawlessly across Windows and *nix, as well as being codec-aware and handling byte strings. I can write the same code on both operating systems and have it just work:

with open('file.txt') as f:
    for line in f:
        print(line)

That’s the killer feature, and it’s not a simple one to do well and to do fast. But Python manages to do so.

Here’s an example of it ‘just working’ for different line endings:

>>> s = 'this\nis\na\nstring'
>>> for l in s.splitlines():
...     print(l)
...
this
is
a
string
>>> for l in bytes(s, 'utf-8').splitlines():
...     print(l)
...
b'this'
b'is'
b'a'
b'string'
>>> ws = 'this\r\nis\r\na\r\nwindows\r\nstring'
>>> for line in ws.splitlines():
...     print(line)
...
this
is
a
windows
string
>>> mixed = 'this\nis\r\na\nweird\r\nstring'
>>> for line in mixed.splitlines():
...     print(line)
...
this
is
a
weird
string
>>>

So you haven’t really looked, and have already jumped to a conclusion.

_PyIncrementalNewlineDecoder_decode is only used when universal newline mode is enabled, in which case the decoder is set to the PyIncrementalNewlineDecoder_Type type, and _PyIncrementalNewlineDecoder_decode is called only if the decoder is of that type. Otherwise, the given newline argument is simply stored as readnl, and readnl is then properly searched for by the _PyIO_find_line_ending function, with no hardcoding of '\n'. It even supports multi-character line endings already (so a line ending of '\tl\n', as requested by the SO question above, would actually work).

So yes, it very much looks like low-hanging fruit: plugging in any other character as newline would just work out of the box, if only we removed its validation.


Adding another SO question as proof of demand, this time requesting '\0' as the line terminator:

To support it, the check for universal newline mode would need to use a length == 0 check instead of newline[0] == '\0'.


So what happens when something in a proprietary file format arrives by extraction from a zip file, via SFTP, or in an XML file element? If you tie your file format parsing in with text file I/O, then all those things become kludgy. Not impossible, there’s always io.StringIO, which I presume would be made to support the same newline varieties. But how would you incrementally parse data read from a ZipExtFile returned from ZipFile.open?

What happens when your customer tells you that obviously semicolons are not record separators if they are doubled, within quotes or backslash-escaped? You can’t push that additional logic into open, so you will have to rethink your structure.

Don’t repeat the csv module’s mistake: csv.writer needs a text file opened in a special way, with a newline='' parameter. Forgetting that is a mistake I’ve made more than once. If csv had instead been built on binary files, it would have been less error-prone to work with.
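The idiom the csv docs require can be shown with io.StringIO, which takes the same newline parameter as open: newline='' suppresses the text-layer translation of the '\r\n' terminators that csv.writer emits itself (on Windows, forgetting it yields doubled '\r\r\n' endings):

```python
import csv
import io

# newline='' disables newline translation, as the csv docs require;
# csv.writer supplies its own '\r\n' terminators (the default
# lineterminator of the excel dialect).
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["a", "b"])
writer.writerow(["c", "d"])
assert buffer.getvalue() == "a,b\r\nc,d\r\n"
```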

That’s exactly what io.TextIOWrapper is for; see the SO answer below.
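For example, any binary file-like object can be wrapped directly; here io.BytesIO stands in for the ZipExtFile or SFTP stream mentioned above:

```python
import io

# io.BytesIO stands in for any binary file-like object, e.g. the
# ZipExtFile returned by ZipFile.open() or a file opened over SFTP.
raw = io.BytesIO("héllo\nwörld\n".encode("utf-8"))
text = io.TextIOWrapper(raw, encoding="utf-8")
assert [line.rstrip("\n") for line in text] == ["héllo", "wörld"]
```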

That’s why support for regex as a record separator is a nice-to-have but not a priority: even my team, having dealt with all sorts of legacy proprietary file formats over the years, has rarely needed it. Not never, but rarely.

The no-translation mode only matters when a newline is the record separator, which is entirely irrelevant to my proposal, since I’m specifically asking for a non-newline record separator.


For completeness: is removing the part that rejects arbitrary values for newline as straightforward as one might imagine?

Yes, please see the last part of that post of mine:


So… do it. Make a PR. From this discussion and others you’ve created or participated in, it seems you’re set on getting what you ask for the way you ask for it. So why discuss it more? Open the PR, and either a core dev will agree with it or not. Is there any more you need to get out of your discussions?

Will do. Just wanted to gauge the chance of the PR getting approved before I make one. Thanks.