os.walk and copying files

I’m attempting to create a script to back up all directories and files from my Linux home folder to a USB drive, excluding hidden ones and those which are already saved on the USB. My code successfully copies any new folders and sub-folders from the source (src) directory to the destination (dst) directory, but it is not copying any files. I believe something’s not right in the “for file in files” section inside the os.walk loop, but I can’t figure out what. It’s likely something simple, but this novice doesn’t see it! Thanks for any suggested fixes you may have!

Below is the current code:

'''Check if all subdirectories in a directory exist and then copy any new subdirectories and files
   to usb drive while excluding hidden files
'''
import os
import shutil

def check_and_copy(src, dst):
    # Check if source and destination directories exist
    if not os.path.exists(src):
        raise FileNotFoundError(f"The source directory {src} does not exist.")
    if not os.path.exists(dst):
        #os.makedirs(dst)
        raise FileNotFoundError(f"The Destination USB drive {dst} does not exist.")

    # Walk through the source directory
    for root, dirs, files in os.walk(src):
        # Skip hidden directories
        dirs[:] = [d for d in dirs if not d.startswith('.')]

        # Create corresponding directory in destination
        for d in dirs:
            src_dir = os.path.join(root, d)
            dst_dir = os.path.join(dst, os.path.relpath(src_dir, src))
            if not os.path.exists(dst_dir):
                os.makedirs(dst_dir)
                print(f"Creating directory {d}")
            else:
                print(f"Directory {d} already exists in destination, skipping.")

        # Copy files from source to destination
        for file in files:
            if not file.startswith('.'):
                # Check if the file does not already exist in the destination
                if not os.path.exists(src):
                    src_file = os.path.join(root, file)
                    dst_file = os.path.join(dst, os.path.relpath(src_file, src))
                    shutil.copy2(src_file, dst_file)
                    print(f"Copied file named {file}")
                else:
                    print(f"File {file} already exists in destination, skipping.")

    print(f"********** END OF BACKUP **********")

check_and_copy('/home/[anonymized]', '/run/media/[anonymized]')

You might like to look into the rsync tool, which already implements all your backup requirements. It is likely already installed on your system. A web search for “rsync backup” should turn up lots of usage examples.

When debugging a script like this I add lots of print() calls to show the flow of the code and what is in the variables.
For example, in the for loop, start by adding a print at the top of the loop showing the files and directories returned from os.walk.
Next, print what is in dirs after your filter line.

Also, assigning to dirs[:] = can just be dirs =.

So, rsync.

This is the part that draws my attention:

    for file in files:
        if not file.startswith('.'):
            # Check if the file does not already exist in the destination
            if not os.path.exists(src):

Should it be checking if src exists here? More plausible here would be dst_file (which would need to be brought up out of the if block), I think.

There’s an important difference here - if you mutate the directory list, it will reduce the number of directories that os.walk() traverses. This is a documented feature, as long as the walk is being done top-down (which is the default and is being done here).

Details: os — Miscellaneous operating system interfaces — Python 3.13.2 documentation

Possibly to belabor the point here but: dirs is a reference to the
list os.walk will be using to descend the subdirectories. The
dirs[:] incantation is necessary to modify that list. A plain dirs=
just repoints the local variable, and makes no change to the list
os.walk is using.
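
The difference can be seen in a small, self-contained sketch. The helper name `walk_names` and the temporary directory layout (`docs`, `.cache/inner`) are made up for illustration; only `os.walk` and the filter expression come from the thread.

```python
import os
import tempfile

def walk_names(top, prune_in_place):
    """Return the sorted base names of every directory os.walk visits below top."""
    visited = []
    for root, dirs, files in os.walk(top):
        if root != top:
            visited.append(os.path.basename(root))
        visible = [d for d in dirs if not d.startswith('.')]
        if prune_in_place:
            dirs[:] = visible   # mutates the very list os.walk will descend into
        else:
            dirs = visible      # only rebinds the local name; os.walk never sees it
    return sorted(visited)

with tempfile.TemporaryDirectory() as top:
    # Hypothetical layout: one visible directory, one hidden directory with a child.
    os.makedirs(os.path.join(top, "docs"))
    os.makedirs(os.path.join(top, ".cache", "inner"))
    pruned = walk_names(top, prune_in_place=True)
    unpruned = walk_names(top, prune_in_place=False)

print(pruned)    # ['docs']
print(unpruned)  # ['.cache', 'docs', 'inner']
```

With the plain assignment, the walk still descends into `.cache` and everything below it, so the rest of the loop body runs for hidden directories too.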

Thanks for the pointer on rsync. I had recently come across that option, but did not attempt to implement anything as I wanted to develop a script which I could recycle later for saving specific files. Thought I’d start with all items in my home folder and if that worked I could alter it as needed in the future.

On the [:] after dirs: when it is removed, only the first directory in the root is created, but no subdirectories or files, and I get this error:

  File "/home/[anonymized]/BackUpToUSB42.py", line 44, in <module>
    check_and_copy('/home/[anonymized]', '/run/media/[anonymized]/')
    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/[anonymized]/BackUpToUSB42.py", line 25, in check_and_copy
    os.makedirs(dst_dir)

NOTE: line 25 is the “os.makedirs(dst_dir)” line

If I take that check out and dedent everything else, like this:

        for file in files:
            if not file.startswith('.'):
                # Check if the file does not already exist in the destination
                src_file = os.path.join(root, file)
                dst_file = os.path.join(dst, os.path.relpath(src_file, src))
                shutil.copy2(src_file, dst_file)
                print(f"Copied file named {file}")
            else:
                print(f"File {file} already exists in destination, skipping.")

… it copies everything, even if a copy exists in the “dst” already.
Is this what you were suggesting?

I was suggesting checking if the destination already exists. You’re currently checking that the source exists, which it always will.
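
As a minimal sketch of that suggestion: the test should be on the destination file path, not on the top-level `src` (or the top-level `dst`, which also always exists once the function's initial checks have passed). The helper name `copy_if_missing` and the file name `notes.txt` are hypothetical, not from the original script.

```python
import os
import shutil
import tempfile

def copy_if_missing(src_file, dst_file):
    """Copy src_file to dst_file unless dst_file already exists; return True if copied."""
    if os.path.exists(dst_file):
        return False
    shutil.copy2(src_file, dst_file)
    return True

with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    src_file = os.path.join(src, "notes.txt")
    with open(src_file, "w") as fh:
        fh.write("hello")
    dst_file = os.path.join(dst, "notes.txt")

    # The top-level source directory always exists inside the loop,
    # so "if not os.path.exists(src)" can never be true.
    src_exists = os.path.exists(src)

    first = copy_if_missing(src_file, dst_file)   # destination missing: copies
    second = copy_if_missing(src_file, dst_file)  # destination present: skips

print(src_exists, first, second)  # True True False
```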

Yes. I didn’t mention that I tried changing that to “dst” also, and it still didn’t copy files, but copied folders.

In general, always include the final line of the traceback, which states the reason for the error.
In this case we can make an educated guess that it is “file already exists”.
But we have to guess.
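
To illustrate why that last line matters, here is a small sketch that provokes the guessed error on purpose and keeps only the final traceback line. It is an assumption that this is the error the script actually hit; the directory name `existing` is made up.

```python
import os
import tempfile
import traceback

with tempfile.TemporaryDirectory() as top:
    target = os.path.join(top, "existing")
    os.makedirs(target)
    try:
        os.makedirs(target)  # second call fails because the directory is already there
    except OSError:
        # The last line of the formatted traceback names the exception and its reason.
        last_line = traceback.format_exc().strip().splitlines()[-1]
    # exist_ok=True makes the call a no-op when the directory already exists.
    os.makedirs(target, exist_ok=True)

print(last_line)
```

The printed line begins with the exception class name followed by the reason, which is exactly the part missing from the traceback excerpt quoted above.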

Thanks to all who gave suggestions. I managed to find the problems. I restructured the file copying section similar to the structure in the directory section. Also, Chris, you were correct that it was also a problem of referencing the “src” and not the “dst”.

I also learned something about files in Python:
I had some folders whose names I began with an underscore (e.g., “_MyFolder”), which I used often and wanted at the top of my Home directory. When I ran this script I noticed that it was not giving me a “Copied” or “skipping” message about files in those folders, though it was copying them. Not sure why that is. Anyone have an answer?

Here’s the new code:

'''Check if all subdirectories in a directory exist and then copy any new subdirectories and files
   to usb drive while excluding hidden files
'''
import os
import shutil

def check_and_copy(src, dst):
    # Check if source and destination directories exist
    if not os.path.exists(src):
        raise FileNotFoundError(f"The source directory {src} does not exist.")
    if not os.path.exists(dst):
        #os.makedirs(dst)
        raise FileNotFoundError(f"The Destination USB drive {dst} does not exist.")

    # Walk through the source directory
    for root, dirs, files in os.walk(src):
        # Skip hidden directories
        dirs[:] = [d for d in dirs if not d.startswith('.')]

        # Create corresponding directory in destination
        for d in dirs:
            src_dir = os.path.join(root, d)
            dst_dir = os.path.join(dst, os.path.relpath(src_dir, src))
            if not os.path.exists(dst_dir):
                os.makedirs(dst_dir)
                print(f"Creating directory {d}")
            else:
                print(f"Directory {d} already exists in destination, skipping.")

        # Copy files from source to destination
        for file in files:
            src_file = os.path.join(root, file)
            dst_file = os.path.join(dst, os.path.relpath(src_file, src))
            # Skip hidden files
            if not file.startswith('.'):
                # Check if the file does not already exist in the destination
                if not os.path.exists(dst_file):
                    shutil.copy2(src_file, dst_file)
                    print(f"Copied file named {file}")
                else:
                    print(f"File {file} already exists in destination, skipping.")

    print("********** END OF BACKUP **********")

check_and_copy('/home/[anon]', '/run/media/[anon]/')

Cool cool, glad that worked.

That sounds very odd. I have no idea at this stage what the problem is, so I would use my standard debugging technique: If In Doubt, Print It Out! For example, right at the top of the main for root, dirs, files loop, add: print("Walking", root) so that you can see the directories it’s checking. This should pair nicely with your “Creating directory” // “Directory already exists” lines; you should see the initial root directory, followed by the creation of any needed children, and then you walk into those subdirectories.
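
A rough sketch of what that tracing could look like, run against a throwaway directory rather than a real home folder. The function name `trace_walk` and the sample layout are hypothetical; `_MyFolder` is borrowed from the example name in the question. It does not explain the missing messages, it only shows what each step of the walk sees.

```python
import os
import tempfile

def trace_walk(src):
    """Print what os.walk yields at each step, before and after the hidden-directory filter."""
    roots = []
    for root, dirs, files in os.walk(src):
        print("Walking", root)
        print("  dirs before filter:", dirs)
        dirs[:] = [d for d in dirs if not d.startswith('.')]
        print("  dirs after filter: ", dirs)
        print("  files:", files)
        roots.append(os.path.relpath(root, src))
    return sorted(roots)

with tempfile.TemporaryDirectory() as src:
    # Hypothetical layout: an underscore-prefixed folder with one file, plus a hidden folder.
    os.makedirs(os.path.join(src, "_MyFolder"))
    os.makedirs(os.path.join(src, ".hidden"))
    open(os.path.join(src, "_MyFolder", "a.txt"), "w").close()
    walked = trace_walk(src)

print(walked)  # ['.', '_MyFolder']
```

A leading underscore is not matched by `startswith('.')`, so `_MyFolder` is walked like any other visible directory; the trace output should show whether the file loop is reached for it.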