Identifying duplicate files where speed is a concern

Picking a slice of bytes at a fixed position in a file is “a hash code”, just one that’s entirely trivial to compute. If two files are the same, the byte slices are necessarily equal. That’s the only fundamental property “a hash code” has to satisfy: if objects are equal, their hash codes are equal.

The point is that duplicates are usually rare, and it’s a hell of a lot cheaper to read, say, a few thousand bytes than a billion bytes when that suffices for “the hash codes” to differ. It doesn’t matter to this whether “the hash code” is the raw bytes themselves or a fancy hash (like SHA).

And it works for that purpose regardless of which kind of hash function (trivial or fancy) you use.
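
To make that concrete, here is a tiny sketch (hypothetical helper names, not code from this thread) of both kinds of key. Equal files necessarily produce equal keys either way; the “fancy” one just costs more CPU per byte read:

import hashlib

def trivial_key(path, n=4096):
    # "The hash code" is simply the first n raw bytes of the file.
    with open(path, "rb") as f:
        return f.read(n)

def fancy_key(path, n=4096):
    # The same bytes run through SHA-256: equal files still get equal
    # keys; it just costs extra CPU to compute.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(n)).digest()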

I used to routinely check a folder with tens of thousands of files for duplicates. They were JPEG files, and were usually only a few megabytes large at worst (back in the days when disks were much smaller). The job typically finished finding a few dozen duplicates, but did so after reading less than 1% of the JPEG bytes stored on disk. No “fancy hash” function used.

As noted above, writing a keyfunc() that goes on to apply a “fancy hash” to the raw bytes is certainly possible and principled, and may be helpful in some circumstances I’ve personally never been in. So, e.g., could writing a keyfunc() that reads some number of trailing file bytes.
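
A tail-reading keyfunc along those lines might look like this (a sketch only; the helper name is made up):

import os

def tail_key(path, n=4096):
    # Read the last n bytes (or the whole file, if it's shorter than n).
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(max(0, size - n))
        return f.read(n)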

The key idea is successive refinement of equivalence classes, using a sequence of increasingly expensive hash functions. The initial os.path.getsize() is a hash function too, and the cheapest by far.
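
In outline, the refinement loop amounts to this (a minimal sketch with made-up names, separate from the fuller code posted later in the thread):

from collections import defaultdict
import os

def split_by_key(paths, keyfunc):
    # Partition paths by keyfunc value; singleton groups can't hold duplicates.
    groups = defaultdict(list)
    for p in paths:
        groups[keyfunc(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]

def prefix_key(path, n=4096):
    # A cheap byte-slice "hash code".
    with open(path, "rb") as f:
        return f.read(n)

def candidate_duplicates(paths, keyfuncs):
    # Apply ever more expensive keyfuncs, but only within groups that all
    # previous (cheaper) keyfuncs failed to tell apart.
    groups = [list(paths)]
    for keyfunc in keyfuncs:
        groups = [sub for g in groups for sub in split_by_key(g, keyfunc)]
    # Still only candidates, unless the last keyfunc reads everything.
    return groups

# Cheapest first: file size, then a small slice of the raw bytes.
# groups = candidate_duplicates(paths, [os.path.getsize, prefix_key])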

3 Likes

For fun, I ran the code I posted against my \Windows\System32 directory:

           4868 File(s)  2,647,776,853 bytes

So about 2.5 gig in about 5K files. They range in size from 8 bytes to a few hundred meg.

The code ran fast and found about 15 clusters of identical files. Like:

identical files: ['dxmasf.dll', 'msdxm.ocx']
identical files: ['dpnaddr.dll', 'dpnathlp.dll', 'dpnet.dll', 'dpnhpast.dll', 'dpnhupnp.dll', 'dpnlobby.dll', 'dpnsvr.exe']

The grand total of bytes read from disk was 17_408_787, so under 1%. Computing fancy whole-file hashes would require reading up over 100 times more bytes.

If I change the initial CHUNKSIZE from 4096 to 128, total bytes read drops to 6_204_288. If instead I raise it to 64K, rises to 152_943_003. If those numbers don’t seem to make sense, here’s the trick: lots of files are under 64K in size, and I’m counting only the number of bytes actually read (not the total number of bytes requested of f.read()).
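
That bookkeeping just amounts to summing the length of whatever read() actually returned rather than the amount requested; something along these lines (the counter name is made up):

bytes_read = 0  # running total of bytes actually read from disk

def read_chunk(path, start, chunksize):
    global bytes_read
    with open(path, "rb") as f:
        f.seek(start)
        chunk = f.read(chunksize)   # may return fewer bytes near EOF
    bytes_read += len(chunk)        # count what came back, not chunksize
    return chunk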

YMMV.

1 Like

Would be curious to know how the total time changed between these options—smaller chunksize would seem to imply less total I/O bandwidth and execution time per-chunk, but more IOPS and execution bandwidth; so I’d wonder whether and at what point one set of factors vs. the other would be dominant. Also, the former would seem to perform better the fewer duplicates and the more dissimilar files there are (which seems to be generally the case in your example).

File systems have minimum allocation sizes that need to be taken into account. 4k would be common, so any read smaller than 4k will not change IOPS at all. Further, the OS may do read-ahead, which means larger reads do not increase IOPS either.

2 Likes

Yep! These files are mostly on HDDs, where disk speed is a bottleneck.

This is quite amazing, thanks for the helpful code example!

2 Likes

Here’s another version of the code, which falls back to an SHA256 digest of the remaining tail if the chunk size ever exceeds MAX_CHUNKSIZE. It uses a new generator to create the sequence of key functions. That’s very flexible.

To support this cleanly, key functions have to return a tuple, (hash, resolved), where resolved is True if and only if all the data in the file has been processed - there’s no point in ever calling a keyfunc on that file again. In return, the main loop needs no knowledge of chunk and file sizes (etc). Those are specific to the keyfuncs that do partial reads.

Note: just comment out the

     yield lambda fn: (os.path.getsize(fn), False)

line if you want to see how much the initial refinement by file size helps. I was surprised to see that it helps less than I thought in my Windows system directory test. It does help, but not that much. Turns out that about 60% of these files share their size with at least one other file.

from collections import defaultdict
import os
import hashlib

# Generate lists of filenames with the same keyfunc(fn).
# Singleton lists (only one file has this keyfunc value)
# are suppressed.
# A keyfunc returns a (hash, resolved) pair, where resolved
# is True iff all the data in the file has been processed.
# The values generated by `refine` are (resolved, list_of_fns)
# tuples. When `resolved` is True, all the files in the `fns`
# list have been found to be identical.
def refine(fns, keyfunc):
    e = defaultdict(list)
    for fn in fns:
        e[keyfunc(fn)].append(fn)
    for (key, resolved), cands in e.items():
        if len(cands) > 1:
            yield resolved, cands

CHUNKSIZE = 1 << 12
MAX_CHUNKSIZE = 1 << 16

def keyfunc_generator():
    yield lambda fn: (os.path.getsize(fn), False)

    start = 0
    chunksize = CHUNKSIZE
    while chunksize < MAX_CHUNKSIZE:
        def keyfunc(fn, start=start, chunksize=chunksize):
            with open(fn, "rb") as f:
                f.seek(start)
                chunk = f.read(chunksize)
                resolved = not f.read(1) # i.e., at EOF
            return chunk, resolved

        yield keyfunc

        start += chunksize
        chunksize <<= 1

    def keyfunc(fn, start=start):
        with open(fn, "rb") as f:
            f.seek(start)
            digest = hashlib.file_digest(f, "sha256").digest()
        return digest, True

    yield keyfunc

os.chdir("/Windows/System32")
fns = [fn for fn in os.listdir() if not os.path.isdir(fn)]
classes = [fns] # NB: at the start this is a singleton list,
                # whose sole entry is a list of paths to check
for keyfunc in keyfunc_generator():
    print(sum(map(len, classes)), "files in", len(classes), "classes")
    newclasses = []
    for eqsofar in classes:
        for resolved, fns in refine(eqsofar, keyfunc):
            if resolved:
                print("identical files:", fns)
            else:                
                newclasses.append(fns)
    classes = newclasses
    if not classes:
        break
assert not classes # else final keyfunc didn't always say "resolved"
2 Likes

Yeah, it’s going to depend hugely on the exact job at hand. Some file types have massive headers with very little that changes, even if the rest does; some have chunked formats that tend to result in similar file sizes. Which is why it’s so much fun to design these sorts of programs!

1 Like

I don’t have enough use for this to pursue it, but if I did :wink: … after the first refinement (by file size), keyfuncs are applied to batches of files of the same size. Before such a batch starts, we could use SystemRandom to create a list of “random” file offsets. Say 50 of ’em. Then a keyfunc could seek madly to just those offsets and paste together the single bytes at those offsets. The result is “the hash”.

Systematic regularity near the start, or the middle, or the end, or … wouldn’t fool that. As the world moves to SSDs, the expense of seeking falls dramatically too.

1 Like

Yeah, but then, as the world moves to SSDs, the expense of just hashing the entire file also falls dramatically.

I can’t WAIT for the day when all these tricks about how to reduce disk read times become obsolete.

2 Likes

Looks over at his users’ tape storage systems

If only. :wink:

2 Likes

Not to be a pessimist, but they differ in this respect: the time to do (say) 50 seeks will drop no matter how large files grow to be, but the time to hash “the entire file” only falls reliably if file sizes don’t grow.

I have individual files now that vastly exceed the total capacity of the hard drives I had some time ago. However capable hardware becomes, ambitions have so far grown faster.

I remember thinking “wow! that should be enough for years and years” the first time I got a computer with a megabyte of RAM. Vastly larger than the maxed-out 32KB of RAM on my beloved TRS-80 Model 100 laptop.

Now Microsoft recommends a minimum of 4 thousand times as much just to bring up Windows 10. You can boot it with half that much, but it would be a painful experience trying to use it.

As the world moves to SSDs, the one thing we can be sure of is that applications will evolve to make them appear small and slow to the next generation :wink:.

1 Like

For what it’s worth…

The number of applications where a programmer has to care will be fewer, but I’m sure there will always be some where you have to know how to tune for acceptable performance.

I’ve worked in SSD only environments for a few years already and been tuning I/O to get performance.

Who is producing the files, and is it possible to “crowdsource” the hashing to them? That is, if whoever provides the file also provides an associated file containing a precomputed hash, do you trust the provided hashes to accurately identify duplicates?

Files are from a third party which we have no control over.

A quick stab at that yielded surprisingly (to me) intriguing results. I won’t post the full code. A new keyfunc was added, which uses globals to associate the most recently seen file size with a random subset of file offsets (an ugly hack just to get results ASAP). The keyfunc generator was also changed to yield a brief description of what each keyfunc does.

The new keyfunc:

def keyfunc(fn):
    # needs `import random`; indsize is initialized to -1 at global level
    global indsize, inds
    size = os.path.getsize(fn)
    if size != indsize:
        indsize = size
        inds = sorted(random.sample(range(size), RANDBYTES))
    bs = []
    with open(fn, "rb") as f:
        for i in inds:
            f.seek(i)
            bs.append(f.read(1)[0])
    return bytes(bs), False

yield keyfunc, f"{RANDBYTES} random bytes"

Some globals were changed:

CHUNKSIZE = 1 << 5
MAX_CHUNKSIZE = CHUNKSIZE + 1
RANDBYTES = 4

Yes, only 4 random bytes. The others were changed so that only one refinement by raw byte slice is tried. This is just 32 bytes, so that Chris can sleep at night knowing the hashes used are never larger than SHA256 delivers :wink:.

On my Windows sysdir case:

starting with 4_868 files
refining by size ...
    leaves 3000 files in 629 classes
refining by 32 bytes starting at 0 ...
    leaves 2890 files in 621 classes
refining by 4 random bytes ...
    leaves 237 files in 79 classes
refining by SHA256 digest ...
   ...

So refining by 4 random bytes was far more effective than refining by the first 32 bytes.

For a different directory with over 30K JPEG files, with no duplicates:

starting with 32_932 files
refining by size ...
    leaves 2985 files in 1447 classes
refining by 32 bytes starting at 0 ...
    leaves 286 files in 142 classes
refining by 4 random bytes ...
    leaves 0 files in 0 classes

There, refinement by size and by the initial 32 bytes were both effective, and 4 random bytes finished everything else. The SHA256 keyfunc was never invoked, so no more than 36 bytes of any file were ever looked at.

Since the offset order “is random”, results can vary across runs. For example, on the 6th try of the JPEG example, a pair of files did sneak by the randomized keyfunc, and so the SHA digest keyfunc was invoked.

Again, YMMV. But from all I’ve seen so far, even setting RANDBYTES to 1 is effective.

Contrived “bad cases” can certainly be constructed. For example, create 2**32 files by taking each int in range(2**32) and slamming a billion 0 bytes before it and also after it. No scheme can detect that two such files are different short of reading up the 4 differing bytes “in the middle”. The random scheme is very unlikely to pick a relevant offset either, unless RANDBYTES is impractically large.
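
A single adversarial pair of that kind is easy to build for testing (a sketch; seeking past the end leaves the gap zero-filled, and it runs quickly on filesystems with sparse-file support):

import struct

PAD = 1_000_000_000  # a billion zero bytes on each side

# Two files of identical size that differ only in the 4 "middle" bytes.
for name, value in (("bad_a.bin", 0), ("bad_b.bin", 1)):
    with open(name, "wb") as f:
        f.seek(PAD)
        f.write(struct.pack(">I", value))  # the only bytes that may differ
        f.seek(2 * PAD + 4 - 1)
        f.write(b"\0")                     # force both files to the same length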

1 Like

Not surprising for Windows .dll and .exe files, because of the mess of PE files having a basically-standardized DOS code header. I would expect that if you list the sizes of those classes that are left over, there are going to be 1 or 2 massive ones specifically for all dll/exes compiled by the same compiler with the same settings, since the first few bytes have no relation to the content, but just to what the linker decides to do.

The first 40-ish bytes are identical for .dll/.exe files; it’s only after that DOS header that you see variations in content. I think looking at the first 64 bytes would yield differences; the first 128 certainly will.

You could use the MIME type to specialise the offset to a small byte range that is likely to be different.

For example the COFF header contains fields likely to be unique for each .dll/.exe.

struct_coff_header = namedstruct.namedstruct( 'COFF Header', '<'
    'H:Machine '
    'h:NumberOfSections '
    'l:TimeDateStamp '
    'l:PointerToSymbolTable '
    'l:NumberOfSymbols '
    'h:SizeOfOptionalHeader '
    'h:Characteristics '
    )
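
For what it’s worth, such fields can also be pulled out with just the standard library (a sketch; it assumes a valid PE file and reads the e_lfanew pointer at offset 0x3c to locate the COFF header):

import struct

def coff_key(path):
    # Return (Machine, NumberOfSections, TimeDateStamp) from the COFF
    # header; TimeDateStamp in particular tends to differ per .dll/.exe.
    with open(path, "rb") as f:
        f.seek(0x3C)
        (pe_offset,) = struct.unpack("<I", f.read(4))
        f.seek(pe_offset)
        sig, machine, nsections, timestamp = struct.unpack("<4sHHI", f.read(12))
    return machine, nsections, timestamp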

Folks, the entire point of trying a random subset of possible offsets is to see whether reading up a very small number of bytes can be effective without knowing anything about the files in advance.

At this point, my real question is whether it would pay to run the randomized keyfunc before the first-32-bytes one. For the JPEG directory, the randomized one then finishes the job:

starting with 32_932 files
refining by size ...
    leaves 2985 files in 1447 classes
refining by 4 random bytes ...
    leaves 0 files in 0 classes

For the Windows sysdir:

starting with 4_868 files
refining by size ...
    leaves 3000 files in 629 classes
refining by 4 random bytes ...
    leaves 319 files in 67 classes # NB: may vary across runs
refining by 32 bytes starting at 0 ...
    leaves 215 files in 59 classes
refining by SHA256 digest ...
    ...

So only 319 files survived the randomized pass, and first-32-bytes threw out over 100 of those. Most executable files were already weeded out (but there are still, e.g., 80 .png files). No matter how HW evolves, for those 100+ files it’s going to be cheaper to just ask for the first 32 bytes than to ask a cryptographic hash function to consume the entire file.

1 Like