Will open() function change line-endings in files?

Hi

I’m reading the docs for the open() function in Python 3. The following statement (from 7. Input and Output — Python 3.9.1 documentation) makes me wonder whether reading a file in r mode or rb mode can corrupt line endings, or whether only writing to a file can corrupt them. It is not clear to me:

In text mode, the default when reading is to convert platform-specific line endings (\n on Unix, \r\n on Windows) to just \n. When writing in text mode, the default is to convert occurrences of \n back to platform-specific line endings. This behind-the-scenes modification to file data is fine for text files, but will corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files.

For example, I have a small program that iterates over a directory and its subdirectories to look for a string in text files. The directories can also contain images etc. The following version reads the data in the default mode (r).

import sys
import os

keyword = '134'
root_dir = r"C:\Temp"  # path to the root directory to search (raw string keeps the backslash literal)


for root, dirs, files in os.walk(root_dir, onerror=None):  # walk the root dir
    for filename in files:  # iterate over the files in the current dir
        file_path = os.path.join(root, filename)  # build the file path
        try:
            with open(file_path) as f:  # open the file for reading
                # read the file line by line
                try:
                    for line in f:  # use: for i, line in enumerate(f) if you need line numbers 
                        if keyword in line:  # if the keyword exists on the current line...
                            print(file_path)  # print the file path
                            break  # no need to iterate over the rest of the file
                except UnicodeDecodeError:  # the file is not valid text in the default encoding
                    print("Funny characters that can't be read in: " + file_path)
                    
        except (IOError, OSError):  # ignore read and permission errors
            pass

Another version of the program reads the files in rb-mode.

import sys
import os

keyword = '134'
root_dir = r"C:\Temp"  # path to the root directory to search (raw string keeps the backslash literal)


for root, dirs, files in os.walk(root_dir, onerror=None):  # walk the root dir
    for filename in files:  # iterate over the files in the current dir
        file_path = os.path.join(root, filename)  # build the file path
        try:
            with open(file_path, "rb") as f:  # open the file for reading
                # read the file line by line
                for line in f:  # use: for i, line in enumerate(f) if you need line numbers
                    try:
                        line = line.decode("utf-8")  # try to decode the contents to utf-8
                    except ValueError:  # decoding failed, skip the line
                        continue
                    if keyword in line:  # if the keyword exists on the current line...
                        print(file_path)  # print the file path
                    
                        #break  # no need to iterate over the rest of the file
        except (IOError, OSError):  # ignore read and permission errors
            pass

Will either of the programs corrupt line endings in, e.g., JPEG or EXE files or other file types, or is it only the write modes of the open() function that can affect line endings?

Potential data corruption when reading a binary file in text mode is only about the data that’s read, not the file on disk. If you open a regular file with just read access, the file object and underlying kernel file handle do not allow writing to the file.
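To illustrate the point above, here is a minimal sketch. The file name sample.bin, the sample bytes and the function name are made up for the demonstration, and latin-1 is used only because it can decode any byte. It reads a binary file in text mode, checks that the bytes on disk are unchanged, and shows that a file object opened for reading refuses to write.

```python
import io
import os
import tempfile


def read_leaves_file_untouched():
    """Read binary data in text mode and report what happened.

    Returns (text_read, disk_unchanged, write_rejected).
    """
    original = b"\xff\xd8\r\n\x00\n"  # hypothetical binary payload
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.bin")  # hypothetical file name
        with open(path, "wb") as f:
            f.write(original)

        # Text mode, read only. latin-1 maps every byte to a character,
        # so decoding cannot fail here.
        with open(path, "r", encoding="latin-1") as f:
            text = f.read()
            try:
                f.write("x")
                write_rejected = False
            except io.UnsupportedOperation:
                write_rejected = True

        # Look at the raw bytes on disk again.
        with open(path, "rb") as f:
            disk_unchanged = f.read() == original
    return text, disk_unchanged, write_rejected


if __name__ == "__main__":
    print(read_leaves_file_untouched())
```

The string that comes back has its CRLF collapsed to LF, so the data in memory differs from the file, but the file itself is not touched.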


Hi Eryk
Thank you very much for your answer.
So can the line endings of the underlying file only be corrupted if the open() function is used with the ‘w’ mode?

Mode “w” overwrites (truncates) an existing file. Modes “r+” (read and write), “a+” (append and read), and “a” (append) allow writing to an existing file.
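A small sketch of how these modes treat an existing file, in case it helps. The helper name apply_mode, the file name demo.txt and the sample contents are invented for the example; newline="" is passed only to keep line-ending translation out of the picture.

```python
import os
import tempfile


def apply_mode(mode, payload="XY"):
    """Write payload to a file that already contains 'abcdef', using the
    given mode, and return the resulting file contents."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.txt")  # hypothetical file name
        with open(path, "w", newline="") as f:
            f.write("abcdef")
        with open(path, mode, newline="") as f:
            f.write(payload)
        with open(path, "r", newline="") as f:
            return f.read()


if __name__ == "__main__":
    for mode in ("w", "r+", "a", "a+"):
        print(mode, repr(apply_mode(mode)))
    # w   'XY'        (existing content truncated)
    # r+  'XYcdef'    (overwrites from the start of the file)
    # a   'abcdefXY'  (appended at the end)
    # a+  'abcdefXY'  (appended at the end)
```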

Thank you. Will mode r+, a+ or a modify line endings?

For instance if I have a csv-file or txt-file that I want to modify with a structure like this:

Column1;Column2;Column3(CRLF to mark line ending)
ABC;DEF;GHI(CRLF)
JKL;MNO;PQR(CRLF)

If I use r+, a+ or a to modify the file, will they change the line endings?

Assuming that you have CRLF endings in your file, the logic is:

  • When reading, you will get a string with just LF.
  • When you write that string back into the file in text mode on Windows, LF will be converted to CRLF again.

That should be fine for CSV. In general, this behavior is nice whenever working with plain text files. What you get is a string ready for processing that is independent of the platform, since all line endings are LF. After you modified the contents, when you write the file, it has line endings typical of your platform.
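The round trip described above can be sketched roughly like this. The function name, the file name data.csv and the GHI to XYZ edit are invented for the example. Passing newline="\r\n" when writing is an assumption made so that the sketch behaves the same on every platform; on Windows the default text mode does that conversion by itself.

```python
import os
import tempfile


def round_trip_csv():
    """Read a CRLF file in text mode, edit it, write it back.

    Returns (text_as_read, bytes_on_disk_after_writing).
    """
    crlf_bytes = (b"Column1;Column2;Column3\r\n"
                  b"ABC;DEF;GHI\r\n"
                  b"JKL;MNO;PQR\r\n")
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.csv")  # hypothetical file name
        with open(path, "wb") as f:
            f.write(crlf_bytes)

        # Default text mode: CRLF is translated to LF while reading.
        with open(path, "r", encoding="utf-8") as f:
            text = f.read()

        # newline="\r\n" emulates the Windows default on any platform:
        # every LF in the string is written as CRLF.
        with open(path, "w", encoding="utf-8", newline="\r\n") as f:
            f.write(text.replace("GHI", "XYZ"))

        with open(path, "rb") as f:
            return text, f.read()


if __name__ == "__main__":
    text, written = round_trip_csv()
    print(repr(text))
    print(written)
```

If you would rather have no translation at all, open() also accepts newline="", which hands you the line endings exactly as they are in the file and writes them back unchanged.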

For images, you definitely want binary mode. For one thing, binary data will most often not be decodable as UTF-8. For another, if it happens to contain the byte that means LF (0x0A), that will stay LF when you read it, but if you write it back in text mode on Windows, it will become CRLF, making the file unreadable. Any CRLF pair or lone CR in the data is also turned into LF when read in text mode.
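Here is a rough sketch of that kind of damage. The function name, file names and sample bytes are invented. latin-1 is used only so that every byte can be decoded, and newline="\r\n" again stands in for what text mode does by default on Windows.

```python
import os
import tempfile


def text_mode_round_trip(data):
    """Copy binary data through text mode and return the bytes that end
    up on disk in the copy."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "in.bin")   # hypothetical file names
        dst = os.path.join(tmp, "out.bin")
        with open(src, "wb") as f:
            f.write(data)

        # Reading in text mode: CR, LF and CRLF all become LF.
        with open(src, "r", encoding="latin-1") as f:
            text = f.read()

        # Writing in text mode the Windows way: each LF becomes CRLF.
        with open(dst, "w", encoding="latin-1", newline="\r\n") as f:
            f.write(text)

        with open(dst, "rb") as f:
            return f.read()


if __name__ == "__main__":
    original = b"\xff\xd8\n\x00\r\x01\r\n\x02"
    copy = text_mode_round_trip(original)
    print(original)
    print(copy)
    print("identical:", copy == original)
```

The copy comes out two bytes longer than the original, which is enough to break an image or an executable. Opening both files with "rb" and "wb" instead would copy the bytes as they are.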

The meaning you assign to “corrupt” is not entirely clear to me. I hope this answers your question. Please see details at Built-in Functions — Python 3.9.1 documentation.