Better way to traverse two directories

I’m new to python. I’m trying to create a program to check two directories and remove all the files in one directory that have the same size and modification date as the matching files in the other.
This is my attempt.

import os
from os import path
import sys
print "Source " + sys.argv[1] + " Destination" + sys.argv[2]
path = sys.argv[1]
#print 'checking', path, os.path.isdir("path")
for root, dirs, files in os.walk(sys.argv[1]):
    for file in files:
        file1=root + os.sep + file
        l=len(path)
        file2 = sys.argv[2] + root[l:] + os.sep + file
        if os.path.isfile(file2):
            sst = os.stat(file1)
            dst = os.stat(file1)
            if sst.st_size == dst.st_size :
                if sst.st_mtime == dst.st_mtime:
                    pass
                    print sst.st_size , sst.st_ctime, file1
                    print dst.st_size , sst.st_ctime, file2
                else:
                    print "file different " + file2
        else:
            print file2 + " missing"

I’m new to python. I’m trying to create a program to check two
directories and remove all the files in one directory that have the
same size and modification date.
This is my attempt.

Looks basically fine to me. Since you’re only interested in the
intersection of the 2 directories, walking one is sufficient.

I have a few remarks, inline in the code:

import os
from os import path

This is usually written “import os.path”. What you have imports os.path
as the local name “path”, meaning you can call eg path.isdir(…). It is
more normal to either:

import os.path
os.path.isdir(...)

or:

from os.path import isdir
isdir(...)

import sys
print "Source " + sys.argv[1] + " Destination" + sys.argv[2]

Looks like you’re using Python 2. I strongly recommend using Python 3;
Python 2 is end of life. That would mean writing print() with brackets,
as it is now a function call, not a statement:

print("Source " + sys.argv[1] + "  Destination" + sys.argv[2])

You can do this in Python 2 by making sure that this:

from __future__ import print_function

is the first line of your script. Then print() works as a function in
Python 2 and Python 3.

Also, print() accepts multiple arguments, so you can write this:

print("Source", sys.argv[1], "Destination", sys.argv[2])

path = sys.argv[1]

Usually we pull things off the command line and don’t refer to it
afterwards. Eg:

cmd, srcpath, dstpath = argv
print("Source", srcpath, "Destination", dstpath)

and likewise in the rest of the programme. More readable, easier to
debug.

#print 'checking', path, os.path.isdir("path")
for root, dirs, files in os.walk(sys.argv[1]):
    for file in files:
        file1=root + os.sep + file

This is better written:

file1 = os.path.join(root, file)

        l=len(path)
        file2 = sys.argv[2] + root[l:] + os.sep + file

Probably better written:

file2 = os.path.join(dstpath, os.path.relpath(file1, srcpath))

The os.path module has lots of useful things for working with file
paths. Look it up.
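
As a rough sketch of that mapping (the helper name dest_path_for is made up here for illustration, and srcpath/dstpath are the names suggested earlier for the two command line arguments), the relative path is taken from the top of the source tree so that subdirectories carry over to the destination:

```python
import os.path

def dest_path_for(file1, srcpath, dstpath):
    """Map a file found under srcpath to the matching path under dstpath."""
    # relpath() gives file1's location relative to the top of the source
    # tree (for example "sub/a.txt"); join() then re-roots it under dstpath.
    return os.path.join(dstpath, os.path.relpath(file1, srcpath))

# Example: a file one level down in the source tree.
print(dest_path_for(os.path.join("src", "sub", "a.txt"), "src", "dst"))
```

This avoids the manual slicing with len(path), which breaks if the source path is given with a trailing separator.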

        if os.path.isfile(file2):

Also check if file2 is a file.

            sst = os.stat(file1)
            dst = os.stat(file1)
            if sst.st_size == dst.st_size :
                if sst.st_mtime == dst.st_mtime:
                    pass

You’re not removing file2 yet. That is good. At least put a print()
statement here to make clear what will happen.
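
A possible shape for that dry run, as a sketch only (Python 3 assumed; find_candidates is an invented name, not something from your script). It reports what would be removed and deletes nothing. Note that it stats file2 for the destination side; the code as posted stats file1 twice, so its size and mtime comparison would always succeed:

```python
import os

def find_candidates(srcpath, dstpath):
    """Yield (file1, file2) pairs whose size and mtime match. Removes nothing."""
    for root, dirs, files in os.walk(srcpath):
        for name in files:
            file1 = os.path.join(root, name)
            file2 = os.path.join(dstpath, os.path.relpath(file1, srcpath))
            if not os.path.isfile(file2):
                print(file2, "missing")
                continue
            sst = os.stat(file1)
            dst = os.stat(file2)  # the destination file, not file1 again
            if sst.st_size == dst.st_size and sst.st_mtime == dst.st_mtime:
                yield file1, file2
            else:
                print("file different", file2)

def report(srcpath, dstpath):
    """Print the files a real run would remove."""
    for file1, file2 in find_candidates(srcpath, dstpath):
        print("would remove", file2, "(matches " + file1 + ")")
```

Keeping the search in a generator means the eventual os.remove() call can be added in one place once the printed report looks right.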

ALSO, very very important, use os.path.samefile() to check you’re not
removing the original. What would happen if you went:

my-remove-script.py dirpath dirpath

i.e. compare a directory with itself? A disaster!
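
A minimal guard along those lines might look like this (check_distinct is a hypothetical helper name, and exiting via sys.exit() is just one way to bail out):

```python
import os.path
import sys

def check_distinct(srcpath, dstpath):
    """Stop the program if both paths refer to the same directory."""
    # samefile() compares what the paths actually point at, so it also
    # catches symlinks and differently spelled paths to one directory.
    # It raises an OSError if either path does not exist.
    if os.path.samefile(srcpath, dstpath):
        sys.exit("refusing to compare " + srcpath + " with itself")
```

Call it once before walking anything. The same os.path.samefile() test can also be applied to each file1/file2 pair just before a removal, which would catch hard links to the same file.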

Finally, a size/mtime check is a fast way to compare files, but it is
only conclusive when they differ: a mismatch means the files are
different, while a match still requires comparing the file contents to
be sure.
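
One way to do that, sketched with the standard filecmp module (same_file_contents is an invented name): run the cheap size/mtime test first and only read the files when it passes.

```python
import filecmp
import os

def same_file_contents(file1, file2):
    """Cheap size/mtime check first, then a full content comparison."""
    sst = os.stat(file1)
    dst = os.stat(file2)
    if sst.st_size != dst.st_size or sst.st_mtime != dst.st_mtime:
        return False
    # shallow=False makes filecmp.cmp() read and compare the contents
    # instead of trusting the os.stat() signatures alone.
    return filecmp.cmp(file1, file2, shallow=False)
```

The full comparison costs a read of both files, so it only runs for pairs that already look identical.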

Cheers,
Cameron Simpson cs@cskk.id.au

Thanks for the changes. I’m using python3 with python2 syntax. Most of the code
was found through googling different statements and then using the ones that work. Most
examples don’t say whether they are python2 or python3. In fact, sometimes the code
from a google search does not work. The walk dir syntax was fairly easy to understand, but
trying to find the subdirectory names was a problem.
The only python code that I’ve played with was the project autopoweroff. I wanted to be able
to power off when there was no disk activity after a certain time. It was fun learning to understand the code. I will try to make these changes tomorrow.
Again, thanks

I tried a 4 line program and got argv not defined. I know that this is python2 because I’m testing this
in the Windows 10 bash window, which has python2. Is argv a python3 feature?

import os
import sys
cmd, srcpath, dstpath = argv
print cmd, srcpath,dstpath

Ah, sorry. In this context that will be sys.argv. Usually I’ve got a
main programme function and have passed argv in. Like this:

def main(argv):
    .... do stuff with argv ...

main(sys.argv)

So we define main() accepting the "argv" parameter, and we pass it
sys.argv, which is where we obtain the command line arguments.

You’re not using a function, so just access sys.argv directly.

So:

import os
import sys
cmd, srcpath, dstpath = sys.argv
print(cmd, srcpath, dstpath)
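
For completeness, here is a sketch of the main(argv) pattern with a basic argument count check. Python 3 is assumed, and the usage message and return codes are my own choices for illustration:

```python
import sys

def main(argv):
    # argv[0] is the script name; the two directory arguments follow it.
    if len(argv) != 3:
        print("Usage:", argv[0], "srcdir dstdir", file=sys.stderr)
        return 2
    cmd, srcpath, dstpath = argv
    print("Source", srcpath, "Destination", dstpath)
    return 0

# At the bottom of the script you would then run:
#     sys.exit(main(sys.argv))
```

Without the length check, the three-way unpacking raises a ValueError whenever the script is run with the wrong number of arguments.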

Cheers,
Cameron Simpson cs@cskk.id.au

I didn’t mention that the folders that I am checking are duplicate copies kept on two different systems. I’m
in the process of having both systems access the same folders rather than separate copies. The copies were made because both systems were not always online. Also, most of these folders are only being kept for backup. But I like the idea of using os.path.samefile().

samefile is pretty important when removing things, it is too easy to
compare a directory with itself and destroy it.

Since you’re using size/mtime as an heuristic, I’m sure you’re aware of
rsync. You can get a quick summary from rsync like this:

rsync -n -iaO dir1/ dir2/

and so on.

Cheers,
Cameron Simpson cs@cskk.id.au