Handling files: RFC

I’ve been writing up my notes over the past few days, this being one such note in my files. It kind of speaks for itself, but if someone could please review this and let me know if what I have here is correct and suggest any improvements.

I’m building my own library of functions and useful code, so that I don’t have to ‘reinvent the wheel’ each time I start a new project.

I know that this is not a short read and as such is not a small ask, but the other side of the coin is that it may help others who are, like me, less experienced.
Thanks.


Handling files

Avoid using os.chdir and relative paths (remembering PEP 20: Explicit is better than implicit.). Instead construct the full (absolute) paths:

#!/usr/bin/python3

import os.path

#---------------------------------------------------#
def process_file(filename, path=None):
    if path is not None:
        filename = os.path.join(path, filename)
    return(filename)
#---------------------------------------------------#

path = '/home/rob/Desktop'
file = 'days.txt'

f = open((process_file(file, path)), 'r')
output = f.read()

print(output)

f.close()


File processing takes place in the following order:

  • Open a file that returns a file object (fileHandle)

  • Use the fileHandle to perform read or write actions

  • Close the file

Open a file

file syntax

file_object = open(file_name, access_mode)

There are six access modes:

  1. 'r' Read only: the default
  2. 'w' Write: Create a new file or overwrite the contents of an existing file.
  3. 'a' Append: Write data to the end of an existing file.
  4. 'r+' Both read and write to an existing file. The file pointer will be at the beginning of the file.
  5. 'w+' Both read and write to a new file or overwrite the contents of an existing file.
  6. 'a+' Both read and write to an existing file or create a new file. The file pointer will be at the end of the file.
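
A short sketch of two of these modes in use, following the plain open()/close() style of the example above (days.txt is just the example file from earlier; any writable text file will do):

f = open('days.txt', 'a')      # 'a': append; creates the file if it is missing
f.write('Sunday\n')
f.close()

f = open('days.txt', 'r')      # 'r': read only (the default)
print(f.read())
f.close()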

Read a file

file syntax

file_object.read()

There are three functions with which we can read the data in a file.

  • .read()
    • By default, returns all the characters in a file, until EOF or until the value of B.
      • Can be used in a loop function to read a set number of characters or bytes.
  • .readline()
    • By default, returns all the characters up to EOL (e.g ‘\n’) or until the value of B.
      • Can be used in a loop function to read a set number of lines.
  • .readlines()
    • By default, returns all the lines in a file in the format of a list where each element is a line in the file.
      • Can be used in a loop to read a set number of lines.

The constructor (B) can be used to set the maximum number of bytes (where B is a positive integer value) to be returned for any of the above (except .readlines()), by using the file pointer as a ‘stop’, starting again with the next character, if another ‘read’ is called before the working file is closed.

With .readlines() the file pointer, as set by the value of B, increments to the next \n position, rather than the next character. Any value of B that is less than the position of the next \n is ignored.

If B is omitted or has the value of zero, then the function will perform its default action.

The file pointer will not be reset until the working file has been closed and will not read past the EOF.
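
A small sketch of the three read methods in a single session, again using the open/close style from above and the example file days.txt (the sizes shown are arbitrary):

f = open('days.txt', 'r')

chunk = f.read(4)         # at most 4 characters; the file pointer moves on by 4
line = f.readline()       # the rest of the current line, including its '\n'
rest = f.readlines()      # every remaining line, as a list of strings

print(repr(chunk), repr(line), rest)
f.close()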

Write a file

file syntax

file_object.write(<some_data>)

A method to write data to a file. Existing data may or may not be overwritten depending on the access mode used when said file was opened.

Said file may also be protected by the OS if it has a ‘read-only’ attribute set.

<some_data> could be anything from a single byte to multiple lines of text, formatted any way you choose.
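
A minimal sketch of writing, using a made-up file name (notes.txt) so as not to touch anything important:

f = open('notes.txt', 'w')     # 'w': any existing notes.txt is overwritten
f.write('line one\n')
f.write('line two\n')
f.close()

f = open('notes.txt', 'a')     # 'a': new data goes after the existing lines
f.write('line three\n')
f.close()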

Close a file

file syntax

file_object.close()

By Rob via Discussions on Python.org at 19Jun2022 21:58:

I’ve been writing up my notes over the past few days, this being one
such note in my files. It kind of speaks for itself, but if someone
could please review this and let me know if what I have here is correct
and suggest any improvements.

I’m building my own library of functions and useful code, so that I don’t have to ‘reinvent the wheel’ each time I start a new project.

This is a very good thing. It also means that if you change/improve your
practices, a lot of that can happen once in your library.

I know that this is not a short read and as such is not a small ask, but the other side of the coin is that it may help others who are, like me, less experienced.

I’ll add some random comments below…

Handling files

Avoid using os.chdir and relative paths (remembering PEP 20:
Explicit is better than implicit.). Instead construct the full
(absolute) paths:

Yes and no. We avoid chdir() because it is
process-wide/programme-wide. If you chdir, relative paths will now
mean something else.

If you don’t chdir, you can use relative paths fairly widely because
the programme context is the same as that of the person who invoked the
programme, so relative paths will retain their meaning.

We definitely try not to chdir in a library, unless we also undo the
chdir before we return. But in a multithreaded programme, that is not
enough.

Constructing full paths is a robust way of avoiding difficulty, and
also as you say, it makes the paths very clear in messages etc.

#!/usr/bin/python3

I usually use:

#!/usr/bin/env python3

myself. This causes the script to use the python3 in my $PATH
instead of some system supplied python3 such as /usr/bin/python3,
which may not exist. The /usr/bin/env form is also recognised on
Windows by the Python launcher.

import os.path

#---------------------------------------------------#
def process_file(filename, path=None):
    if path is not None:
        filename = os.path.join(path, filename)
    return(filename)
#---------------------------------------------------#

Personally I like:

from os.path import join as joinpath

This is a personal foible, but I find it makes the code easier to read,
as it has less os.path. visual noise.

path = '/home/rob/Desktop'
file = 'days.txt'

f = open((process_file(file, path)), 'r')

You should have a look at pathlib too - it makes the path joining
easier.

output = f.read()
print(output)
f.close()

The usual idiom for reading (or writing) a file is like this:

with open(filename) as f:
    output = f.read()
    print(output)

This closes the file reliably, even if there is an exception or you
break out of the code by hand.
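
And a minimal sketch of the pathlib suggestion above, reusing the same example path and file name:

from pathlib import Path

path = Path('/home/rob/Desktop')
full_name = path / 'days.txt'      # the / operator joins path components

with full_name.open() as f:        # Path objects can be opened directly
    output = f.read()
print(output)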

file syntax

file_object = open(file_name, access_mode)

As mentioned, we prefer the context manager form:

with open(file_name, access_mode) as file_object:

because it does reliable close.

There are three functions with which we can read the data in a file.

  • .read()
    • By default, returns all the characters in a file, until EOF or until the value of B.
      • Can be used in a loop function to read a set number of characters or bytes.
  • .readline()
    • By default, returns all the characters up to EOL (e.g ‘\n’) or until the value of B.
      • Can be used in a loop function to read a set number of lines.
  • .readlines()
    • By default, returns all the lines in a file in the format of a list where each element is a line in the file.
      • Can be used in a loop to read a set number of lines.

These days, all of these are uncommon when reading text, if you care
about lines-of-text. Files are iterable, yielding lines. So:

for line in file_object:
    ... work with line here ...

Working with binary data is different, because the definition of a
“record” (such as “line” with text) is far more arbitrary. Also, there
are text formats which are not naturally in “lines”, eg XML. It is
usually better to use a presupplied parser with these instead of using
the basic “text file” methods.
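
A concrete version of the line-by-line idiom, as a sketch (the file name is again just the example from earlier):

count = 0
with open('days.txt') as f:
    for line in f:            # iterating a text file yields one line at a time
        if line.strip():      # skip blank lines
            count += 1
print(count, 'non-blank lines')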

Cheers,
Cameron Simpson cs@cskk.id.au


Indeed; that’s the whole idea. 🙂

Thanks for your input. I’ll take on board what you’ve said and update my notes this side, leaving my post as is.

Thank you.

I find it really useful to have similar notes. Maybe later you will notice that you can find the information in the official documentation, and then you will note down just the hard-to-find, or important but hard-to-remember, information. Currently I play with small parts of code in JupyterLab notebooks, then I keep the interesting code examples there and add some notes around them.

Here are just some notes about the code style.

  • Try to use identifier names that really describe the purpose or function of the code. process_file() does not process a file, it prepares a file name (with path). …I have no good idea how to name this function 🙁
  • Start using docstrings. It is great when just a glance at the first line of the docstring gives the essential idea and you do not need to analyze the function body.
  • return is not a function. It is a statement. It is very confusing to write it with the redundant parentheses.
  • I noticed other redundant parentheses. They can make the code harder to read.
  • There is no reason not to use with and open() as a context manager.
  • The whole code looks like a script. It has a shebang. If you plan to use it as a script:
    • It could be beneficial to put the example code into a function. You can prepare it as a test. Later this preparation will make it easier to start using testing tools like pytest.
    • Multiple examples can be in separate functions.
    • After adding the if __name__ == '__main__': guard you can start using the file as a module just by importing it.
#!/usr/bin/python3

import os.path

#---------------------------------------------------#
def process_file(filename, path=None):
    """Construct file name by adding optional path."""
    if path is not None:
        filename = os.path.join(path, filename)
    return filename
#---------------------------------------------------#

def test_process_file():
    path = '/home/rob/Desktop'
    file_name = 'days.txt'

    with open(process_file(file_name, path), 'r') as file:
        output = file.read()

    print(output)

if __name__ == '__main__':
    test_process_file()

Just a small note: the print() does not need to be inside the with block, because at that point all the operations on the file are done and the file should be closed:

with open(filename) as f:
    output = f.read()
print(output)

Thank you for your input.

I’m not sure where I’ve picked up this bad habit of using return(), but it’s one that I need to break.

Docstrings and the other (what I consider to be advanced) features that you suggest are definitely on my list of concepts I need to get my head around in order to level-up my coding skills.

Yes, the idea is that my function will be importable, so that I don’t have to C&P it at the top of any code that needs to handle files. This is another skill that’s on my ‘things to learn’ list.

I’ve seen this line of code if __name__ == '__main__' many times and I’ve never really understood it, tbh; I’ll add it to my (growing) list, above, along with pytest.

Thank you.

Further to what others have said:

file_object = open(file_name, access_mode)

open takes many more arguments than that, but most of them are quite specialised and for advanced usage.

Perhaps the main one that you might care about is the optional encoding parameter. If you find yourself getting mojibake or Unicode errors when reading from an existing file, you may need to set the encoding.
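
For example, a minimal sketch with an explicit encoding (the file name and encoding are just placeholders):

with open('days.txt', encoding='utf-8') as f:   # don't rely on the platform default
    text = f.read()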

Unicode is a big topic, but here are two simple introductions to it:

You have missed some access modes. Probably the most important is “x”, which opens a new file in exclusive mode, erroring if the file already exists.
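
A small sketch of 'x' mode with the error handled (report.txt is a made-up name):

try:
    with open('report.txt', 'x') as f:   # 'x' refuses to clobber an existing file
        f.write('fresh file\n')
except FileExistsError:
    print('report.txt already exists; leaving it alone')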

Your descriptions of file_object.read() and friends are a bit odd.

You refer to “the value of B”, but I think what you mean is until B characters are read. E.g. fp.read(1024) will read no more than 1024 characters from the file object fp.

Likewise readline() will read:

  • up to EOF (there’s nothing left to read);
  • EOL (e.g. a newline); or
  • a maximum of B characters (or bytes in binary mode).

You refer to “The constructor (B) can be used…” but that is the wrong word, I think you mean argument. There is no constructor involved in calling fp.read() and friends. Any call to the constructor of the file object was long ago, when you called open().

Now you are calling a method on the object, which has a parameter, and you are supplying an argument for that parameter. In the case of text file objects, the method read has a parameter called “size” in the docs, and the argument you pass might be (say) 1024, as in the example above.

Note that this is a positional-only parameter, so the method will not accept being called like this: fp.read(size=1024).

Also, a value of 0 for the size will read zero bytes. To read an unlimited number of bytes, use -1 for the argument, or just leave it blank: fp.read().
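
Putting the size argument to work, a rough sketch of reading a text file in fixed-size pieces (the file name is just the example from earlier in the thread):

with open('days.txt') as fp:
    while True:
        piece = fp.read(1024)    # at most 1024 characters per call
        if piece == '':          # an empty string signals end of file
            break
        print(len(piece), 'characters read')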

Regarding writing to files, files can be protected by the OS under many circumstances, which we can summarize as “you don’t have permission to write to the file”. There can be many reasons for that, not just a read only attribute.

In 2022, there is almost never any need to call fp.close() manually. 99.9% of the time you should use the with statement:

with open("myfile.txt", "w") as fp:
    fp.write("something interesting\n")

When execution leaves the indented block, the file is automatically closed.


By Rob via Discussions on Python.org at 19Jun2022 23:43:

I’ve seen this line of code if __name__ == '__main__' many times and
I’ve never really understood it, tbh; I’ll add it to my (growing) list,
above, along with pytest.

The purpose of this boilerplate is that you can write a module which is
both importable and also runnable directly as a script.

When you import a module, it is executed. That execution is what
defines the things within it.

When you import a module, the module global __name__ is the name of
the module. Eg, when I go:

import cs.x

then inside the cs.x module, __name__ == 'cs.x'.

When you run a module directly, either:

python3 cs/x.py ...arguments... # my cs.x module source file

or:

python3 -m cs.x ...arguments...

then inside the module __name__ == '__main__'.

This lets you detect whether your module is being run as a command line
programme or being imported for use by something else.

A “pure module” (not a technical term) which just defines things won’t
“do” anything when you run it. It just defines things.

However, often there can be a good use for running a module at the
command line. If I’m developing a module, command line mode might run
some test code to exercise the new stuff. Later, it might run unit
tests for the module (a suite of simple tests to check correctness). Or
perhaps the module provides a real facility which makes sense as a
command; then it should run as a command.

Whatever case above applies, sticking this:

if __name__ == '__main__':
    ... command line mode here ...

lets the module detect that it is being run from the command line and
act accordingly when it is.

Cheers,
Cameron Simpson cs@cskk.id.au


Thank you very much, Steven.

Again, I will update things my side, leaving my initial post as is: I feel the thread will read better that way.

My note on this topic (as well as my other notes) is an amalgamation of a few different sources and it seems that I’ve mangled things a little, not least the terminology, which I hope is not too confusing for others who choose to read this thread.

I will study what you have posted; much appreciated.

By Cameron Simpson via Discussions on Python.org at 20Jun2022 04:07:

Whatever case above applies, sticking this:

if __name__ == '__main__':
    ... command line mode here ...

lets the module detect that it is being run from the command line and
act accordingly when it is.

I should have said: this goes at the bottom.

Because, as previously mentioned, things are defined by executing them.
If you put this at the top nothing will yet be defined, so whatever is
in it probably won’t work.

My personal habit, aside from when the __main__ stuff is just ad hoc
tests, is to just use:

if __name__ == '__main__':
    sys.exit(main(sys.argv))

That goes at the bottom, but the main() function I put right up the
top where it’s obvious. That works because it doesn’t get called until
all the definitions are done, down the bottom.
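
A minimal sketch of that layout (the module contents are made up; only the shape matters):

#!/usr/bin/env python3
"""Example module: main() near the top, the __main__ guard at the bottom."""

import sys

def main(argv):
    """Command line mode: echo the arguments back, one per line."""
    for arg in argv[1:]:
        print(arg)
    return 0

# ... the module's other functions and classes are defined here ...

if __name__ == '__main__':
    sys.exit(main(sys.argv))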

Cheers,
Cameron Simpson cs@cskk.id.au

Thank you very much, Cameron.

I will use this information to create a new note file on this topic at my end. I’ll need to get my head around this, as much of it is new to me, and I’ll post a new topic about it if I come unstuck.

Your help (along with the help from the other posters here) is much appreciated and I’ll digest all this over the next few days.

Again, many thanks.

I think the description in Glossary — Python 3.12.1 documentation is a good start:

A string literal which appears as the first expression in a class, function or module. While ignored when the suite is executed, it is recognized by the compiler and put into the __doc__ attribute of the enclosing class, function or module. Since it is available via introspection, it is the canonical place for documentation of the object.

In other words: if the first expression in a module, function body or class body is a string literal (a single text value between quotes), then it is a docstring. It is automatically set as a property of the object, and you can access it in the Python REPL (interactive mode) using help(the_object). Editors (like VS Code) make it even easier to show the docstring.

It is good to start with one-line docstrings: PEP 257 – Docstring Conventions | peps.python.org

The [one-line] docstring is a phrase ending in a period. It prescribes the function or method’s effect as a command (“Do this”, “Return that”), not as a description; e.g. don’t write “Returns the pathname …”.
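
For example, a one-line docstring in that command form, applied as a sketch to the process_file() function from earlier in the thread:

import os.path

def process_file(filename, path=None):
    """Return filename joined onto the optional path."""
    if path is not None:
        filename = os.path.join(path, filename)
    return filename

help(process_file)    # shows the signature and the docstring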


Thanks for the details and links: I recall reading a few posts in this Forum with regard to Docstrings, which I will review, along with the info you’ve posted.

My post has opened a real ‘can of worms’ for me, but that’s a good thing for me, as I’ve gaps in my knowledge that I could get a Truck through.

As always, your help is much appreciated.