Why does Python automatically add Carriage Returns to text files?

tjryan · September 19, 2022, 12:28am

Why does Python silently add a carriage return to the end of a line of text when it writes it to a text file and then just as silently removes it when it reads it from the file? The read(), readline(), and readlines() commands never show the carriage return, just a newline. However, reading the file in binary form will show that the line is actually terminated with a \r\n escape sequence, as does reading the file with a hex file utility.

If you want to read one and a half lines of text from the file with the read(size) command, you have to count the number of characters to be read, including the one \n character at the end of the first line. In this instance, you ignore the \r character, since the read(size) command never sees it.

The file pointer, however, does see both the \r and the \n characters at the end of each line of text, and is incremented for both when they are read. If you use the seek() command to position the file pointer to some place in the second line of a text file before reading some characters, you have to count the number of characters to offset the file pointer from the start of the file. Instead of just adding one for the \n at the end of the line of text, you have to add two for the \r\n pair of escape characters.
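
The mismatch described above can be reproduced on any platform. This is a minimal sketch, assuming a scratch file in a temporary directory (the path and the sample text are made up for illustration) and using `newline="\r\n"` to mimic what Windows does by default:

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "crlf.txt")

# newline="\r\n" makes any platform store "\n" as "\r\n",
# which is what Windows does by default in text mode.
with open(path, "w", newline="\r\n") as f:
    f.write("first\nsecond\n")

# Text mode: read(size) counts characters after translation.
with open(path, "r") as f:
    text_chunk = f.read(8)

# Binary mode: the bytes actually stored on disk.
with open(path, "rb") as f:
    raw = f.read()

chars_in_first_line = len("first\n")        # what read(size) has to count
bytes_in_first_line = raw.index(b"\n") + 1  # what the file position has to skip

print(repr(text_chunk))
print(repr(raw))
print(chars_in_first_line, bytes_in_first_line)
```

The first line is six characters to `read(size)` but seven bytes on disk, which is where the extra +1 in those seek tutorials comes from.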

If you have to count characters, you need to make believe that there is only one escape character at the end of a line of text. If you need to figure a file pointer offset, you have to use the fact that there are actually two escape characters at the end of a text line. Is this a bug or is there some mystifying reason for this crazy operation? Look as hard as I might, I have not been able to find any documentation on this “feature”. I have found one or two tutorials on the seek() statement that used the +1 fact, but never explained it. I used my hex file utility to manually strip the \r escape characters from a text file, and all of the read commands still worked correctly, so it is rather superfluous.

Python also changes any solo \r escape characters that it finds in a text file to a \n character when it reads it. What if you really wanted the \r character?

smontanaro · September 19, 2022, 12:57am

Python’s end-of-line behavior for text files is described in the documentation of the built-in open function (see the newline parameter):

https://docs.python.org/3/library/functions.html?highlight=open#open

See also universal newlines:

https://docs.python.org/3/glossary.html#term-universal-newlines
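
As a rough illustration of what those docs describe, here is a small sketch of the `newline` parameter in action. The file names and contents are invented for the example; the behavior shown (no translation when `newline=""`, translation of `\r\n` and lone `\r` to `\n` with the default `newline=None`) is what the `open` documentation specifies:

```python
import os
import tempfile

# Hypothetical scratch directory, used only for this demonstration.
workdir = tempfile.mkdtemp()
unix_path = os.path.join(workdir, "unix.txt")
mixed_path = os.path.join(workdir, "mixed.txt")

# newline="" on write disables translation, so "\n" is stored as a
# single byte on every platform, Windows included.
with open(unix_path, "w", newline="") as f:
    f.write("one\ntwo\n")
with open(unix_path, "rb") as f:
    unix_bytes = f.read()

# A file mixing a CRLF pair and a lone CR, written as raw bytes.
with open(mixed_path, "wb") as f:
    f.write(b"one\r\ntwo\rthree\n")

# Default (newline=None): every line ending is translated to "\n".
with open(mixed_path, "r") as f:
    translated = f.read()

# newline="": line endings come back exactly as stored.
with open(mixed_path, "r", newline="") as f:
    untranslated = f.read()

print(repr(unix_bytes))
print(repr(translated))
print(repr(untranslated))
```

So if you really want the `\r` characters, opening with `newline=""` keeps them, and writing with `newline=""` stops Python from adding any.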

cameron · September 19, 2022, 1:50am

By Timothy Ryan via Discussions on Python.org at 19Sep2022 00:38:

Why does Python silently add a carriage return to the end of a line of
text when it writes it to a text file and then just as silently removes
it when it reads it from the file? The read(), readline(), and
readlines() commands never show the carriage return, just a newline.
However, reading the file in binary form will show that the line is
actually terminated with a \r\n escape sequence, as does reading the
file with a hex file utility.

Yes. Skip’s pointed you at the docs for Python’s newline handling model.
Internally (once text is being used inside Python) “lines of text” end
in newlines. The translation stuff accommodates which of a few “end of
line” conventions are in use by particular platforms or programs without
requiring every app to use special logic.

If you really care, reading the file in binary mode, or specifying an
explicit end-of-line convention (the newline parameter of open()), is
the approach to use.
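
A minimal sketch of the binary-mode route, assuming a made-up scratch file with CRLF line endings. In binary mode nothing is translated, so read sizes and file positions are both plain byte counts and ordinary arithmetic on them is safe:

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "crlf.bin")

with open(path, "wb") as f:
    f.write(b"first\r\nsecond\r\n")

with open(path, "rb") as f:
    first = f.readline()   # includes both the \r and the \n
    offset = f.tell()      # byte offset of the start of line two
    f.seek(offset + 3)     # arithmetic on byte offsets is fine here
    chunk = f.read(3)
    end = f.tell()

print(repr(first), offset, repr(chunk), end)
```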

If you want to read one and a half lines of text from the file with the
read(size) command, you have to count the number of characters to be
read, including the one \n character at the end of the first line. In
this instance, you ignore the \r character, since the read(size)
command never sees it.

Generally with text files, you shouldn’t assume you can compute precise
byte locations in the file data from text without having detailed knowledge
of the text encoding in use in the file. Many text encodings use
multiple bytes for various characters - the issue is not just newlines.
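
To make the multi-byte point concrete, here is a small sketch with an invented sample string. The same text has different lengths, and different offsets for the same word, depending on whether you count characters or encoded bytes:

```python
# Sample text: "naïve café" plus a newline, written with escapes so the
# accented letters are unambiguously single code points.
text = "na\u00efve caf\u00e9\n"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

char_count = len(text)
utf8_count = len(utf8)
utf16_count = len(utf16)

# Position of the word "caf..." counted in characters versus UTF-8 bytes.
char_offset = text.index("caf")
byte_offset = utf8.index(b"caf")

print(char_count, utf8_count, utf16_count)
print(char_offset, byte_offset)
```

No carriage returns are involved here at all, and the character and byte offsets still disagree.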

You can either do tedious, fiddly and error-prone computation of the text
encoding used in the file (not just end of line, but things like UTF-8
or other codings), or use unbuffered I/O (usually inefficient) and query
a position from the underlying OS file descriptor, or use things like
file.readline() and take note of file.tell() afterwards, which would
give you a reliable seek point for file.seek(). But it would fall on a
line boundary. You could do partial reads too, and note file.tell()
then.
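
The readline()/tell()/seek() approach might look something like this. It is only a sketch: the scratch file and its contents are invented, and `newline="\r\n"` is used to force Windows-style endings on any platform:

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "lines.txt")

# CRLF endings plus a non-ASCII character, so neither characters nor
# lines map one-to-one onto bytes.
with open(path, "w", encoding="utf-8", newline="\r\n") as f:
    f.write("caf\u00e9\nsecond line\nthird\n")

with open(path, "r", encoding="utf-8") as f:
    f.readline()              # consume the first line
    line_mark = f.tell()      # opaque position at a line boundary
    f.read(3)                 # partial read of the second line
    mid_mark = f.tell()       # opaque position inside the second line

    f.seek(line_mark)         # go back to a place we have been
    from_line_mark = f.readline()

    f.seek(mid_mark)
    from_mid_mark = f.readline()

print(repr(from_line_mark))
print(repr(from_mid_mark))
```

Neither mark was computed by hand; both came straight from tell(), which is what makes them safe to pass to seek().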

The basic deal is that unless you are (a) in binary mode with no
translation of bytes or (b) doing detailed and accurate computation
about text encoding yourself, then you should consider the file pointers
in a text file to be opaque values. You can use them to go to places in
the file you’ve been before, but shouldn’t infer much from any
arithmetic.
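
One way to live with opaque positions is to record them as you go and only ever seek to recorded values. Below is a sketch of that idea; `build_line_index` and `read_line_at` are hypothetical helper names, not anything from the standard library, and the scratch file is invented for the demo:

```python
import os
import tempfile


def build_line_index(path, encoding="utf-8"):
    """Return a list of tell() values, one per line start (hypothetical helper)."""
    offsets = []
    with open(path, "r", encoding=encoding) as f:
        # Use readline() rather than "for line in f": text files disable
        # tell() while they are being iterated.
        while True:
            pos = f.tell()
            if not f.readline():
                break
            offsets.append(pos)
    return offsets


def read_line_at(path, offsets, n, encoding="utf-8"):
    """Jump straight to line n using a previously recorded position."""
    with open(path, "r", encoding=encoding) as f:
        f.seek(offsets[n])
        return f.readline()


# Demo file with CRLF endings and a non-ASCII character.
path = os.path.join(tempfile.mkdtemp(), "indexed.txt")
with open(path, "w", encoding="utf-8", newline="\r\n") as f:
    f.write("caf\u00e9\nsecond line\nthird\n")

offsets = build_line_index(path)
third = read_line_at(path, offsets, 2)
print(len(offsets), repr(third))
```

The values in `offsets` are never added to or subtracted from; they are only handed back to seek().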

Cheers,
Cameron Simpson cs@cskk.id.au

tjryan · September 19, 2022, 1:18pm

Thank you both, Skip and Cameron. I’m fairly new to Python, although I have been an EE for over 50 years (which should be a clue that I predate the uP). When I saw tutorials on seek that didn’t make sense, I just had to track down why they didn’t. I looked all around seek, read, write, file I/O, but didn’t think that the answer was in the open statement. As usual, it was Windows that was screwing things up again. As they say, it takes a Community to make things work. Thanks again. Tim Ryan.