Identifying duplicate files where speed is a concern

I think hashing may be a waste of time. It makes great sense if you have a mostly unchanging collection of files and want to see whether a new file duplicates one of them. Then you maintain a persistent database of file hashes and just check the new file’s hash against that.
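For that use, a minimal sketch could look like this (SHA-256, shelve, and the “hashes.db” name are just choices picked for illustration, not anything prescribed):

import hashlib
import shelve

# Hash a file's contents incrementally, so huge files don't need to fit in RAM.
def file_hash(fn, blocksize=1 << 20):
    h = hashlib.sha256()
    with open(fn, "rb") as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

# Persistent mapping from content hash to a path already known to have it.
def seen_before(fn, dbname="hashes.db"):
    with shelve.open(dbname) as db:
        key = file_hash(fn)
        if key in db:
            return True
        db[key] = fn
        return False

With a cryptographic hash, a match is as good as byte equality for practical purposes.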

But if you’re starting from scratch, you may as well skip the hashing and check bytes directly. Here’s a variant of what I’ve used for this. It breaks the filenames into equivalence classes by file size first. Singleton classes are thrown away.

Each equivalence class is then refined again, by checking whether the first 4K bytes match. That yields one or more refined equivalence classes. Singleton classes are again thrown away.

This continues along the same lines, but doubling the number of fresh bytes to check (up to a max of 2**20).

There are lots of things that can be fiddled with here. Last time I used this, I only checked about 128 bytes at first, and that weeded out “almost” all possible duplicates.
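For example, a cheap first cut along those lines could feed the refine() helper in the code below a key that reads only a short prefix (the 128 here, and the helper’s name, are just illustrative):

# Sketch: key on only the first 128 bytes of a file. Most non-duplicates
# already differ this early, so later, bigger reads only hit the survivors.
def small_prefix(fn, n=128):
    with open(fn, "rb") as f:
        return f.read(n)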

EDIT: moved the redefinition of keyfunc out of the innermost loop.

from collections import defaultdict
import os

# Generate lists of filenames with the same keyfunc(fn).
# Singleton lists (only one file has this keyfunc value)
# are suppressed.
def refine(fns, keyfunc):
    e = defaultdict(list)
    for fn in fns:
        e[keyfunc(fn)].append(fn)
    for cands in e.values():
        if len(cands) > 1:
            yield cands

fns = # ... list of file paths to be checked ...
classes = list(refine(fns, os.path.getsize))
CHUNKSIZE = 4096
start = 0
while classes:

    # Key for this pass: the next CHUNKSIZE bytes starting at offset `start`.
    # The default arguments freeze the values this pass should use, since both
    # names are rebound at the bottom of the loop.
    def keyfunc(fn, start=start, CHUNKSIZE=CHUNKSIZE):
        with open(fn, "rb") as f:
            f.seek(start)
            return f.read(CHUNKSIZE)

    newclasses = []
    for eqsofar in classes:
        # All files in a class have the same size (the first refinement was by
        # size), so once `start` reaches that size every byte has matched.
        if os.path.getsize(eqsofar[0]) <= start:
            print("identical files:", eqsofar)
            continue

        newclasses.extend(refine(eqsofar, keyfunc))

    classes = newclasses
    start += CHUNKSIZE
    # Double the amount of fresh data compared next pass, capped at 2**20 bytes.
    CHUNKSIZE = min(2 * CHUNKSIZE, 1 << 20)
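
For completeness, the only missing piece above is the fns line; something like this would scan a whole tree (the root path is just a placeholder):

# Example driver: collect every regular file under a root directory, using the
# os module already imported above. Replace "some/root" with the tree to scan.
fns = []
for dirpath, dirnames, filenames in os.walk("some/root"):
    for name in filenames:
        path = os.path.join(dirpath, name)
        if os.path.isfile(path):  # skip broken symlinks, directories, etc.
            fns.append(path)

Each group of identical files is then printed as the loop above finds it.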