Identifying duplicate files where speed is a concern

I need to process arbitrary files or folders only once and avoid any dupes. For simplicity, when I say “files” I mean a path that satisfies Path.is_dir() or Path.is_file()

Issues I need to deal with:

  • These files can be on Windows, Linux, and macOS, so my solution needs to be OS-agnostic
  • Files are of arbitrary size and name, with no standard or meaning. They can have identical names or not so I can’t just rely on name
  • Needs to be fast, hashing files would take egregiously long considering they can be insanely big (>100 GiB)

My current solution:
I’m logging the filename (file.name) and its size (file.stat().st_size) and skipping any files that match both. This seems to work quite well for me, but I’m wondering if it can be improved.
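
A minimal sketch of what I mean (the seen (name, size) pairs are just kept in a set in memory, and the function name is made up):

from pathlib import Path

seen = set()  # (name, size) pairs already handled

def should_process(path: Path) -> bool:
    # Skip anything whose name AND size match something we've already seen.
    key = (path.name, path.stat().st_size)
    if key in seen:
        return False
    seen.add(key)
    return True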

Unfortunately, there’s no great solution for this issue, which I have also run into (with the kicker that the names didn’t even have to match exactly!)

Any two files could differ by just one byte somewhere in the middle.

I did what you did. The one different thing I did was to also hash the first 4k and the last 4k of the file - the first 4k will include headers and perhaps a checksum, so it might make things more reliable.
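
A rough sketch of that (the hash function is whatever you like; sha256 and the function name here are just for illustration):

import hashlib
from pathlib import Path

SAMPLE = 4096  # 4 KiB from each end of the file

def quick_fingerprint(path: Path):
    # (size, hash of first 4 KiB, hash of last 4 KiB).
    # Different fingerprints prove the files differ; equal fingerprints
    # only mean "probably the same".
    size = path.stat().st_size
    with path.open("rb") as f:
        head = f.read(SAMPLE)
        f.seek(max(size - SAMPLE, 0))
        tail = f.read(SAMPLE)
    return size, hashlib.sha256(head).hexdigest(), hashlib.sha256(tail).hexdigest()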

I did not, in fact, get burnt by this, as far as I know…

3 Likes

It really depends how well you want to define “duplicate”. If you’re unwilling to hash the entire content of the files, it is fundamentally impossible to be absolutely 100% sure they are identical, but there are a number of clues that will tell you that they are definitely different - that might be sufficient.

The first thing to do is definitely to look at the directory entries. Checking the file’s size is cheap and easy. I would normally suggest checking for symlinks and duplicated inode numbers (i.e. hardlinks) here too, but you want this to be OS-agnostic so maybe not.
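
For what it’s worth, that check might look something like this on POSIX-y systems (the (device, inode) pair isn’t meaningful on every filesystem, which is the portability caveat):

import os

def definitely_same(path_a, path_b):
    # Same (device, inode) pair means both names point at the same underlying
    # file (a hardlink, or a symlink resolving to it), so it's trivially a "duplicate".
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)

os.path.samefile() does essentially the same comparison for you.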

After that, I’d say just read in some portion of the file and hash that. It’s then up to you to determine what sort of probability of collision you’re willing to accept. If you know something about the files’ structures, you might be able to choose which parts to hash (for example, a ZIP archive might be better recognized by the last bytes rather than the first), but otherwise, just grab a big ol’ chunk of data, hash it, and hope that different files are likely to be different at the beginning.

What you may want to consider is a multi-pass approach. Use just information from the directory entries (file size etc) to get a list of candidates. This will give you a number of sets of files. For each set, go through each file and hash the first X bytes of data. Any that still look identical, go back to the files and hash a different part. Etcetera. Eventually, you’ll either notice a difference, or decide that they’re likely enough to be identical that it counts. Or maybe, once you get it down to a small enough set of files, it’ll actually be worth hashing the entire file (ie something like “if the first 1MB of the files is identical and the sizes are the same, get the SHA256 on the entire file”).
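
A rough sketch of that kind of pipeline (the 1 MiB threshold and the names are just illustrative):

import hashlib
import os
from collections import defaultdict

def group_by(paths, key):
    # Bucket paths by key(path); only buckets with 2+ members can hold duplicates.
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]

def sha256_of(path, limit=None):
    # SHA-256 of the first `limit` bytes, or of the whole file if limit is None.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        remaining = limit
        while True:
            size = 1 << 20 if remaining is None else min(1 << 20, remaining)
            if size <= 0:
                break
            chunk = f.read(size)
            if not chunk:
                break
            h.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return h.hexdigest()

all_paths = []  # ... fill in with the file paths to check ...

# Pass 1: cheap directory information only.
groups = group_by(all_paths, os.path.getsize)
# Pass 2: hash just the first 1 MiB of each surviving candidate.
groups = [g2 for g in groups for g2 in group_by(g, lambda p: sha256_of(p, 1 << 20))]
# Pass 3: full hash for whatever is still indistinguishable.
duplicate_sets = [g2 for g in groups for g2 in group_by(g, sha256_of)]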

But it all depends how confident you want to be that they’re the same. There will be plenty of ways to be completely certain that files are different, but no way to be completely certain they’re the same.

4 Likes

I’d broadly agree with this, although I would grab more than 4KB at the start of a file. Depending on the file format, there could easily be a lot of commonality in a small chunk. (For example, imagine a large number of PDFs generated naively by a program that embeds the entire font used at the start of the file. No, I’ve never had to wade through oodles of blob like that, why do you ask?)

2 Likes

You can conclude that if the sizes are different then the contents are different.

But you will need to check the contents of all files of the same size to be sure.
Using a strong hash is the best way to do this; I’d use SHA-256 at the moment.

If you run this often, you could keep a database of (path, size, timestamps, hash) and,
when the timestamps have not changed, assume the contents are unchanged.
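
A minimal sketch of that cache, using shelve for the database and st_mtime_ns as the timestamp (both just convenient choices; the file names are placeholders):

import hashlib
import os
import shelve

def cached_sha256(path, db):
    # Reuse the stored hash when size and mtime are unchanged since the last run.
    st = os.stat(path)
    entry = db.get(path)
    if entry and entry["size"] == st.st_size and entry["mtime"] == st.st_mtime_ns:
        return entry["hash"]
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(1 << 20):
            h.update(chunk)
    db[path] = {"size": st.st_size, "mtime": st.st_mtime_ns, "hash": h.hexdigest()}
    return db[path]["hash"]

with shelve.open("hash_cache") as db:            # placeholder database name
    print(cached_sha256("some_big_file.bin", db))  # placeholder path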

The shortcuts you can take depend on how important it is not to miss any duplicates.

1 Like

As OP pointed out:

hashing files would take egregiously long considering they can be insanely big

Very good point. In my application, the files were audio, so the headers were small.

If the strategy I described above had failed, my plan was to identify potential duplicates using the simple, fast strategy, and then hash only the duplicates at the end. That would be certain to be correct, at the cost of at least some full-file hashing.

Luckily, it didn’t come to that.

1 Like

Lots of helpful advice here, thank you!

I didn’t even consider the fact that I could hash a portion of the file instead of the entire thing

2 Likes

What do you consider “egregiously long”? How long does hashing a 100 GiB file take, and how long would be acceptable?

Do I understand correctly that your current solution has both false positives and false negatives? You skip files that happen to have the same name and size, even if their contents differ, and you don’t skip files with the same content if they happen to have different names?

The work of hashing is relatively cheap compared to reading that much from mass storage. If that’s on rusty iron (and given the file sizes here I wouldn’t be surprised), transfer rates are of the order of 100MB/second, so 100GB would cost you 1000 seconds. That’s a quarter hour, per file. [1]


  1. These are very rough numbers, but I’m just looking for a Fermi estimate here. ↩︎

1 Like

Yeah I had measured 300 MB/s with SHA256 and 600 MB/s with SHA1. I asked to make sure they had measured as well and weren’t just assuming something much slower, and to get their actual speed. Quite possibly it’s truly too slow for them, I’d just like to make sure.

1 Like

Yeah, the actual hashing can easily be blazingly fast (that’s why a naive “just hash the password” isn’t sufficient, and something like bcrypt will deliberately do 2**N rounds of a slow hashing step - it’s too easy to hash large amounts of text). But the overall task of “hash this file” requires fetching the data. Maybe that data is stored on some kind of ginormous RAID array, but if it isn’t, the bottleneck will likely be the disk reads.

Anyhow, depending on the files, there’s a good chance they will differ in the first N bytes for some relatively small value of N.

1 Like

Figured I’d get some actual measurements rather than just guessing at it.

import hashlib
data = b"a" * 1024 * 1024 * 1024  # 1 GiB buffer, kept in memory so no disk I/O is involved
hash = hashlib.sha256()
for _ in range(100):  # hash 100 GiB in total
    hash.update(data)
print(hash.hexdigest())

On my system, 41 seconds for the hashing without disk access, or a couple of gigabytes a second.

rosuav@sikorsky:/home/git/media/archive$ ll -L *.mov
-r--r--r-- 1 rosuav rosuav 421253411 Aug 29  2022 giftsub_alert_raw.mov
-r--r--r-- 1 rosuav rosuav 481632349 Aug 25  2022 raid_alert_anim_raw.mov
-r--r--r-- 1 rosuav rosuav 264385999 Aug 25  2022 sub_alert_anim_raw.mov
rosuav@sikorsky:/home/git/media/archive$ echo 3 | sudo tee /proc/sys/vm/drop_caches
3
rosuav@sikorsky:/home/git/media/archive$ time sha256sum *.mov
bc8d67ab2a326e23e72d7b4b38a4e2e5b75d3669c41a46cf0b0f98864b85e7b2  giftsub_alert_raw.mov
10d438d35bf4d4c19a293c9585d31eab2544dedeb54f90f349f1d35b0eb34004  raid_alert_anim_raw.mov
a5158b70f7c6069db0168ddacd8fb1cfa7965961facd11a169c317548a1291eb  sub_alert_anim_raw.mov

real	0m7.811s
user	0m3.296s
sys	0m0.360s

That’s about 150MB/sec for both hashing and reading from disk.

1 Like

I think hashing may be a waste of time. It makes great sense if you have a mostly unchanging collection of files and want to see whether a new file duplicates one of them. Then you maintain a persistent database of file hashes and just check the new file’s hash against that.

But if you’re starting from scratch, you may as well skip the hashing and check bytes directly. Here’s a variant of what I’ve used for this. It breaks the filenames into equivalence classes by file size first. Singleton classes are thrown away.

Each equivalence class is then refined again, by checking whether the first 4K bytes match. That yields one or more refined equivalence classes. Singleton classes are again thrown away.

This continues along the same lines, but doubling the number of fresh bytes to check (up to a max of 2**20).

There are lots of things that can be fiddled here. Last time I used this, I only checked about 128 bytes at first, and that weeded out “almost” all possible duplicates.

EDIT: moved the redefinition of keyfunc out of the innermost loop.

from collections import defaultdict
import os

# Generate lists of filenames with the same keyfunc(fn).
# Singleton lists (only one file has this keyfunc value)
# are suppressed.
def refine(fns, keyfunc):
    e = defaultdict(list)
    for fn in fns:
        e[keyfunc(fn)].append(fn)
    for cands in e.values():
        if len(cands) > 1:
            yield cands

fns = []  # ... list of file paths to be checked ...
classes = list(refine(fns, os.path.getsize))
CHUNKSIZE = 4096
start = 0
while classes:

    def keyfunc(fn, start=start, CHUNKSIZE=CHUNKSIZE):
        with open(fn, "rb") as f:
            f.seek(start)
            return f.read(CHUNKSIZE)

    newclasses = []
    for eqsofar in classes:
        if os.path.getsize(eqsofar[0]) <= start:
            print("identical files:", eqsofar)
            continue

        newclasses.extend(refine(eqsofar, keyfunc))

    classes = newclasses
    start += CHUNKSIZE
    CHUNKSIZE = min(2 * CHUNKSIZE, 1 << 20)
5 Likes

Potentially, since Python will implicitly hash the files. It does depend a bit on how many files are going to be compared, and how many bytes from each file. If it’s (say) 1MB from each file, and 10,000 files, then that’s 10GB in memory; but if you only store the hashes, it’s still easily manageable. It all depends how much of each file it takes to be confident enough of similarity.

I wonder how long that takes, comparing two 100 GB files 1 MB at a time, always opening and seeking for each chunk. (I have no idea how fast/slow seeking is.)

Season to taste. I’ve never lived in a world where that’s a realistic possibility, but if I did, I might just keep CHUNKSIZE fixed at 4K for a very quick (if slower than possible) fix. Now you can object that, well, they may have 10 billion files to check. That’s almost as plausible to me :wink:.

1 Like

BTW, there’s nothing to stop you from writing a keyfunc in my code that, say, reads the next 500 million bytes, and returns a (say) 256-bit hash of that. The framework couldn’t care less what keyfunc() returns, beyond that it can be used as a dict key.

Indeed, if I did live in a world where it was plausible that I’d have tens of thousands of files that are identical in the first billion bytes, I’d move to a keyfunc like that after CHUNKSIZE got “too big”.
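
For example, a keyfunc along those lines, meant to replace the one defined inside the while classes: loop above (so start is still bound through the default argument), might look like this - the 500 MB figure and sha256 are just for illustration:

import hashlib  # add at the top of the script

def keyfunc(fn, start=start, size=500_000_000):
    # Digest of the next `size` bytes starting at `start`.  refine() only needs
    # something usable as a dict key, and a 32-byte digest is far cheaper to
    # hold on to than the raw chunk it summarizes.
    h = hashlib.sha256()
    with open(fn, "rb") as f:
        f.seek(start)
        remaining = size
        while remaining > 0:
            chunk = f.read(min(1 << 20, remaining))  # read in 1 MiB pieces
            if not chunk:
                break
            h.update(chunk)
            remaining -= len(chunk)
    return h.digest()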

1 Like

I guess if the problem is “here are 1000 files, please find the duplicates”, then hashing works faster, as each file is hashed once.

If the problem is “are these 2 files the same?”, then compare contents.

The trick of only hashing the start of each file is a nice way of speeding up finding the files that cannot be duplicates.

I do like the idea of looking at chunks of the files that match, to avoid a whole-file hash.