File backup app [SOLVED]

As a coding project, I’m working on a file backup application, but I can’t see how to get a list of files. The idea is (and this could be the wrong approach) to build an operational manifest (the format for which I have yet to establish), but I can’t get a list of file names.

This is some dev code that reads the file system based on a starting point set by path1.

import easygui
import os

path1 = easygui.diropenbox(msg='Source dir')

filepaths = [os.path.join(path1, f) for f in os.listdir(path1)]

for file in os.listdir(path=path1):
    print(file)
print()
for file in filepaths:
    print(file)

Note: I’m using easygui just for the dev and may migrate this to a full Tkinter front-end, if needs be. Some of the above code is from here.

The output (on my system: a Debian based OS):

dev1.py
read_me
test_files
test_bu
dev2.py

/home/rob/Desktop/python/file_system_backup/dev1.py
/home/rob/Desktop/python/file_system_backup/read_me
/home/rob/Desktop/python/file_system_backup/test_files
/home/rob/Desktop/python/file_system_backup/test_bu
/home/rob/Desktop/python/file_system_backup/dev2.py

The issue is…

dev1.py    <- a file
read_me    <- a file
test_files <- a directory
test_bu    <- a directory
dev2.py    <- a file

As you can see, on the face of it, there seems to be no way to distinguish the files from the directories. I understand that technically, and so far as the file system is concerned, there is no difference; it’s simply how the names are flagged by the file Mode that makes them so. Please correct me if I’m wrong.

Is there a way to read said ‘Mode’ so that I can code the script to distinguish between different file modes and thus build said manifest?


Related to: Read/write byte objects

There are tonnes of ways to distinguish directories from files. Here are seven:


Thank you. I’ll look at each and see which would be best for my use case.

I don’t know why I didn’t think to RTFM.

[Update]
The scandir() function seems to offer all the functionality I need for this project, so far at least.
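A minimal sketch of how os.scandir() can separate files from directories, and how the 'Mode' asked about above can be read directly. The helper names split_entries and is_dir_by_mode are made up for illustration; the calls they use (os.scandir, DirEntry.is_dir, DirEntry.is_file, os.lstat, stat.S_ISDIR) are standard library.

```python
import os
import stat


def split_entries(path):
    """Return (files, dirs): sorted names of the entries directly under path."""
    files, dirs = [], []
    with os.scandir(path) as entries:
        for entry in entries:
            # follow_symlinks=False: a symlink is not reported as its target
            if entry.is_dir(follow_symlinks=False):
                dirs.append(entry.name)
            elif entry.is_file(follow_symlinks=False):
                files.append(entry.name)
    return sorted(files), sorted(dirs)


def is_dir_by_mode(path):
    """Check the file mode bits directly, without following a symlink."""
    return stat.S_ISDIR(os.lstat(path).st_mode)
```

Symlinks land in neither list here, which may or may not be what a backup tool wants.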


I kinda like pathlib.Path.rglob()

In [29]: from pathlib import Path
In [30]: list(Path('.').rglob('*'))

much more useful if you want to sub-select certain filenames, but useful in any case.
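As a rough illustration of the sub-selecting, something like the following would collect only the files matching a pattern. The function name find_files is hypothetical; rglob() also returns directories, hence the is_file() filter.

```python
from pathlib import Path


def find_files(root, pattern="*"):
    """Recursively collect regular files under root whose names match pattern."""
    return sorted(p for p in Path(root).rglob(pattern) if p.is_file())
```

For example, find_files('.', '*.py') would list only the Python files below the current directory.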

-CHB

I wrote a file backup engine: Backshift

Instead of having python traverse the hierarchy, it expects the user to pipe ‘find / -print0’ (or similar) into it. So the Python code just reads null-terminated lines from stdin.
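This is not Backshift's actual code, just a sketch of the general idea under my own assumptions: read a binary stream in chunks and split on the NUL byte, so that filenames containing newlines survive intact. The name read_nul_terminated and the chunk size are invented for the example.

```python
def read_nul_terminated(stream, chunk_size=65536):
    """Yield names (as bytes) from a binary stream of NUL-terminated names."""
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last NUL is complete; keep the remainder
        *complete, buffer = buffer.split(b"\0")
        yield from complete
    if buffer:
        # A final name that was not followed by a NUL terminator
        yield buffer
```

In a script fed by 'find . -print0', the stream would be sys.stdin.buffer, and os.fsdecode() can turn each bytes name into a str.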

Be careful to watch out for symlink races, especially if your program will be running with privileges.

Symlinks are perhaps primarily an issue on Linuxes and Unixes (including macOS). Recent Windows has Junctions, which are similar.

My approach is to generate a list of paths (rooted at a given source), then check each filename in each path against the target (backup) path/filenames. If a filename match is found, generate an MD5 hash for each. If the returned hashes are the same, leave the target as is; if not, rename the target file by adding .bak, then copy the source to the target.

Clearly, if the source file is not found in the target path, simply copy source to target.
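Roughly, the per-file decision could look like the sketch below. It is only an outline of the steps described, not my finished code: the names backup_file and _md5 and the returned status strings are placeholders, and I've used shutil.copy2 for the copy step.

```python
import hashlib
import os
import shutil


def _md5(path):
    """Hash a whole file in one read (fine for small files only)."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def backup_file(src, dst):
    """Copy src to dst if needed; return 'copied', 'unchanged' or 'replaced'."""
    if not os.path.exists(dst):
        # Source not present in the target: create the folder and copy
        os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
        shutil.copy2(src, dst)
        return "copied"
    if _md5(src) == _md5(dst):
        return "unchanged"
    # Keep the old target as .bak (overwrites any earlier .bak), then copy
    os.replace(dst, dst + ".bak")
    shutil.copy2(src, dst)
    return "replaced"
```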

I have this function (one of many that I have tried):

def get_source_paths(source):
    path_list = []
    for dirname, dirnames, filenames in os.walk(source):
        for subdirname in dirnames:
            path_list.append(os.path.join(dirname, subdirname))
    return path_list

… which seems to be doing what I need and may serve for the target files also (in which case I’ll rename it). As you can see, I’m not using the filenames object right now (but the option is there) because I have another function that gathers the filenames. That said, I think that the above could make the other function redundant.

I’m working with the principle that Two is one and one is none.

It’s all very much in the early stages right now. I don’t intend to follow symlinks or run it with elevated privileges: the backup is for user generated files only.

I’ll take a look at Backshift.

Thank you for your time, advice and the link.


My MD5 hash function:

...
from hashlib import md5
...

def hash_file(filename):
    # make a hash object
    h_obj = md5()
    # open file for reading in binary mode
    with open(filename, mode='rb') as file:
        rBytes = file.read()
        h_obj.update(rBytes)
    return h_obj.hexdigest()
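One caveat with the above is that file.read() loads the whole file into memory. A variant that feeds the hash in chunks might look like this; the name hash_file_chunked and the 1 MiB chunk size are arbitrary choices for the sketch, and the := operator needs Python 3.8 or later.

```python
from hashlib import md5


def hash_file_chunked(filename, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it chunk_size bytes at a time."""
    h_obj = md5()
    with open(filename, mode='rb') as file:
        # read() returns b'' at end of file, which ends the loop
        while chunk := file.read(chunk_size):
            h_obj.update(chunk)
    return h_obj.hexdigest()
```

The digest is the same as hashing the whole file in one go, since update() calls are cumulative.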

Thank you for the reply.

I want to ‘auto select’ each file, in each directory, starting at a given root and pass that information to another function for processing.

Having tested a number of ways to do that, I’ve hit upon this:

...
import os
...
def get_source_manifest(source):
    for dirname, dirnames, filenames in os.walk(source):
        print(f"path: {dirname}")
        if filenames:
            print("files...")
            for name in filenames:
                print(name)
        else:
            print("No files")
        print()

Clearly, the above is simply displaying the information as is, but I can pass said to another function.

Note that following symlinks is the default for many operations (eg
open() etc) and that you may need an option to avoid doing so. Using
os.lstat to explicitly check if a path is a symlink will let you
sidestep them.

Oh, and you may want to compute relative paths if you’re duplicating
your file tree to another directory as your backup process. See
os.path.relpath.
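A small sketch of both points, assuming a hypothetical helper called
map_to_target: it skips symlinked files via os.path.islink (which uses
lstat under the hood) and uses os.path.relpath to work out where each
file would live under the backup root.

```python
import os


def map_to_target(src_root, dst_root):
    """Return (source_path, target_path) pairs for non-symlink files under src_root."""
    pairs = []
    # os.walk does not descend into symlinked directories by default
    for dirpath, dirnames, filenames in os.walk(src_root):
        for filename in filenames:
            src_path = os.path.join(dirpath, filename)
            if os.path.islink(src_path):
                continue  # sidestep symlinked files
            rel = os.path.relpath(src_path, start=src_root)
            pairs.append((src_path, os.path.join(dst_root, rel)))
    return pairs
```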

Cheers,
Cameron Simpson cs@cskk.id.au

Thank you Cameron.

The way that I create my user files (which is what I want to backup) never involves symlinks.

Yes, I will want to preserve the file tree, which is what I’m working on right now, as coincidence would have it. I was thinking about constructing (or duplicating) the structure in a dictionary object, but maybe that’s not necessary and I should be using os.path.relpath, as per your suggestion – thanks for that.

Python makes that easy:

from shutil import copytree

And you are done.
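For completeness, a sketch of what the call might look like. The wrapper name mirror_tree is made up; dirs_exist_ok needs Python 3.8 or later, and copytree uses shutil.copy2 for each file by default.

```python
import shutil


def mirror_tree(src, dst):
    """Copy the whole tree at src into dst and return the destination path."""
    # dirs_exist_ok=True lets the copy run again over an existing backup
    return shutil.copytree(src, dst, dirs_exist_ok=True)
```

Note that this copies every file every time, whether or not it has changed.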

Thank you for the suggestion.

I want to be a little more selective, rather than an indiscriminate operation, such as if the src file is identical to the dst file, then leave it as is, if not, then make a copy. On that last part, I think that using shutil.copy2() may be a better option than my read/write byte objects function.

This function:

def get_manifest(src):
    manifest = []
    for dirname, dirnames, filenames in os.walk(src):
        if filenames:
            for filename in filenames:
                manifest.append(f"{dirname}/{filename}")
    return manifest

… is generating one side of the operation, with a preserved tree structure.

I want to be a little more selective, rather than an indiscriminate
operation, such as if the src file is identical to the dst file, then
leave it as is, if not, then make a copy. On that last part, I think
that using shutil.copy2() may be a better option than my read/write
byte objects function.

Yes. It depends how much you want to implement yourself for learning or
fine-tuned control purposes. I’m assuming you want that, or you’d just
be using rsync :slight_smile:

This function:

def get_manifest(src):
   manifest = []
   for dirname, dirnames, filenames in os.walk(src):
       if filenames:

You don’t need this if-statement; your for-loop will just append 0 files
if filenames is empty, which is fine.

           for filename in filenames:
               manifest.append(f"{dirname}/{filename}")

A purist might suggest that portable code might just use os.path.join
here, avoiding special hardwired knowledge of the path separator. Though
/ works in windows these days.

I tend to use dirpath instead of dirname to remind myself that it
looks like /path/to/the/subdir. Leaving dirname free if you wanted
to iterate on dirnames.

Cheers,
Cameron Simpson cs@cskk.id.au

Yes, the os.path. methods could be the way to go. I have some code that uses .split to isolate a sub-path, which I then use with os.makedirs(target), so I’ll revisit that and yes, dirname is a full path and should be named accordingly.

I’ve not done any testing to see if any of the metadata that .copy2() attempts to preserve (which my function does not) is of importance to me. I’m simply spit-balling right now and I’ll go with whatever I find to be the most useful.

Yes, this project is just as much a coding exercise as it is to have a customized file copying app. I know that there are tonnes of these apps already out there, but I’ve never been 100% happy with any of them, save for one (the name of which escapes me) that I used years back, when I ran MS Windows 95 through Windows 7 systems.

Thank you and have a good day.

Or, using the more modern pathlib approach:

import os
from pathlib import Path

def get_manifest(src):
    manifest = []
    for dirpath, dirnames, filenames in os.walk(src):
        for filename in filenames:
            # Path / str joins using the platform's own separator
            manifest.append(Path(dirpath) / filename)
    return manifest

Of course, you’ll need to make sure your consumer(s) doesn’t choke on Paths (otherwise, wrap in str() or convert when needed).

Also, that inner loop is just asking for a comprehension:

def get_manifest(src):
    manifest = []
    for dirpath, dirnames, filenames in os.walk(src):
        manifest += [Path(dirpath) / filename for filename in filenames]
    return manifest

With the new Path.walk() added in 3.12 thanks to the hard work of Barney & friends, the pathlib solution gets even easier, as you can do:

def get_manifest(src):
    manifest = []
    for dirpath, dirnames, filenames in Path(src).walk():
        manifest += [dirpath / filename for filename in filenames]
    return manifest

Thank you. I’ll consider that.

So it is – old habits die hard :slightly_smiling_face:

Update: my app is all but finished now, so a final ‘thank you’ to everyone who has taken the time and trouble to read and respond.

The finishing touches still to do include creating a log file and a config file: the log file will document the changes to the backup files and the config file will contain the source and target directories.
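I haven't settled on the details, but one possible shape for those two pieces, using only the standard library (configparser and logging), is sketched below. The section name [backup], the keys source and target, the logger name and the function names are all assumptions for illustration.

```python
import configparser
import logging


def load_settings(config_path):
    """Read the source and target directories from an INI-style config file."""
    parser = configparser.ConfigParser()
    if not parser.read(config_path):
        raise FileNotFoundError(config_path)
    section = parser["backup"]  # assumed section name
    return section["source"], section["target"]


def make_logger(log_path):
    """Return a logger that appends timestamped lines to log_path."""
    logger = logging.getLogger("backup")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding a second handler on repeat calls
        handler = logging.FileHandler(log_path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

The config file would then hold a [backup] section with one source = ... line and one target = ... line, and each change to the backup could be recorded with something like logger.info("replaced %s", path).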