Identifying duplicate files where speed is a concern

So you don’t have files smaller than 4 bytes …


:laughing: As I said, this was a quick, hacky POC. My actual randomized keyfunc now is:

        def keyfunc(fn):
            global indsize, inds
            size = os.path.getsize(fn)
            if size != indsize:
                # new size: pick a fresh set of random offsets, reused
                # for every file of this size
                indsize = size
                inds = sorted(random.sample(range(size),
                                            min(RANDBYTES, size)))
            bs = []
            with open(fn, "rb") as f:
                for i in inds:
                    f.seek(i)
                    bs.append(f.read(1)[0])
            return bytes(bs), size <= RANDBYTES

EDIT: BTW, it didn’t matter in the original post because, at that time, the first-32-bytes keyfunc ran first. No file of size <= 32 bytes survived that pass.

Why don’t you build a Merkle tree for each file when storing the files on storage media?
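For reference, the idea is roughly: hash the file in fixed-size blocks, then repeatedly pair-and-hash the digests until a single root remains. A minimal sketch (the block size and names here are just placeholders):

    import hashlib

    BLOCKSIZE = 1 << 16  # placeholder block size

    def merkle_root(fn):
        # leaf level: one SHA256 digest per fixed-size block
        with open(fn, "rb") as f:
            level = []
            while block := f.read(BLOCKSIZE):
                level.append(hashlib.sha256(block).digest())
        if not level:  # empty file
            level = [hashlib.sha256(b"").digest()]
        # pair-and-hash upward until a single root digest remains
        while len(level) > 1:
            if len(level) % 2:
                level.append(level[-1])  # duplicate the odd node out
            level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                     for i in range(0, len(level), 2)]
        return level[0]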

I guess that also explains the global instead of nonlocal? :slight_smile: Or does global have a benefit here?

Yes, & I said the use of globals was “an ugly hack just to get results ASAP” in the original post. This is about ideas, not about writing production code. If I were doing this “for real”, an equivalence class of file paths would be modeled by an instance of a Python class. That would be the principled and natural place to store the list of randomized offsets specific to an all-members-known-to-be-of-the-same-size equivalence class.

Sorry, didn’t realize that referred to the globals.

How about using “middle 32 bytes”? Might avoid common file prefixes/suffixes while being simpler than multiple random positions.

The “ugly hack” was in a parenthesized comment in a clause that was specifically about the use of globals.

If you want to try other ideas … try them :wink:. You already have working code, to which you can easily add a new “try middle bytes” keyfunc yourself.
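For concreteness, such a keyfunc, in the same return-(key, fully-resolved?) style as the others, might look something like this sketch (MIDBYTES is a made-up name):

    import os

    MIDBYTES = 32

    def keyfunc(fn):
        size = os.path.getsize(fn)
        with open(fn, "rb") as f:
            f.seek(max(size // 2 - MIDBYTES // 2, 0))
            # a file no larger than MIDBYTES is read in its entirety,
            # so the key fully resolves it
            return f.read(MIDBYTES), size <= MIDBYTES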

I’m disinclined to make assumptions about file structure. For example, I once worked in an environment where there were many files on disk storing “large” persistent stacks. This was on Windows, where HDD fragmentation can easily become “a real problem”. So an empty stack was initialized by creating a largish file and defragmenting it at once, before anything was written to it. Entirely filled with zero bytes. Pushes and pops were done by opening the file in binary mode, seeking to the current “virtual end” offset (stored in the first 8 bytes of the file), and writing/reading/updating-“the-end” appropriately. No fragmentation was possible because nothing after creation could change the file size.

Until the stack was nearly full, looking at trailing bytes was useless; until it was nearly half full, looking at middle bytes was useless too.

I don’t want to have to out-think every “clever” file structure in existence. The only way to reliably fool a random-offset scheme is to ensure that, across files of the same size, the bulk of all offsets have bytes with the same value…

That said, “middle bytes” is a worthy contender too. On my test directories:

JPEG:

starting with 32_932 files
refining by size ...
    leaves 2985 files in 1447 classes
refining by middle 32 bytes ...
    leaves 0 files in 0 classes

Windows sysdir:

starting with 4_868 files
refining by size ...
    leaves 3000 files in 629 classes
refining by middle 32 bytes ...
    leaves 284 files in 77 classes
refining by 8 random bytes ...
    leaves 185 files in 44 classes
refining by 32 bytes starting at 0 ...
    leaves 85 files in 38 classes
refining by last 32 bytes ...
    leaves 74 files in 33 classes
refining by SHA256 digest ...

Maybe; but every time I’ve needed any sort of duplicate file detection, there’s a specific context, like a large collection of images or audio files or something. If knowing the file type makes for a faster and more reliable dupe detector, I’m all for it.

This appears to be the happiest ordering of keyfuncs so far, with no “hash code” ever being larger than 32 bytes. Stefan’s “middle bytes” is the most effective of the dirt simple ones, so it’s tried first:

JPEG:

starting with 32_932 files
refining by size ...
    leaves 2985 files in 1447 classes
refining by middle 32 bytes ...
    leaves 0 files in 0 classes

Windows sysdir:

starting with 4_868 files
refining by size ...
    leaves 3000 files in 629 classes
refining by middle 32 bytes ...
    leaves 284 files in 77 classes
refining by 32 bytes starting at 0 ...
    leaves 181 files in 70 classes
refining by last 32 bytes ...
    leaves 161 files in 61 classes
refining by 8 random bytes ...
    leaves 75 files in 32 classes
refining by SHA256 digest ...

“Different code for different file types” is, on the face of it, opposed to the goal of reliability - unless you have only one type of file :wink:.

I’ll almost always prefer uniform code with no special cases, and happily settle for “fast enough”. In fact, I have no promising ideas for any scheme that could process my JPEG or Windows sysdir directories noticeably faster than the 100% file-type-agnostic code I have now.

This has been fun to play with, at least for me :wink:.

I made more changes to account for some other observations made during the course of this.

Primarily, various I/O subsystems have various minimum transfer sizes, so that, e.g., at the lower levels it’s really no faster to ask for 32 bytes than for 4K. So I changed the “refine by middle” keyfunc to check the middle 4K bytes. Bigger (4K) “hashes” then, but I don’t care. The gain is that all files no larger than that are wholly resolved by that keyfunc (which is still the first run after classifying by size). This can be a real help when dealing with many smaller files.

The “middle” keyfunc was also changed to wrap try/except around the open(), and if PermissionError is raised it returns (None, True). When refine() sees a None key at the end, it reports that all the file paths in the list raised PermissionError and throws them out.
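In refine(), that bookkeeping amounts to something like this sketch (not the actual code; the dict keys are the full (key, resolved) tuples the keyfuncs return):

    from collections import defaultdict

    def refine(fns, keyfunc):
        # bucket paths by keyfunc result, then peel off the (None, True)
        # bucket - the files whose open() raised PermissionError
        e = defaultdict(list)
        for fn in fns:
            e[keyfunc(fn)].append(fn)
        denied = e.pop((None, True), None)
        if denied:
            print(f"PermissionError on {len(denied)} files - dropping them")
        return e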

Relatedly, the “random” keyfunc was beefed up a bit to catenate 8 4-byte chunks starting at random offsets. It makes no difference to speed.

My Windows sysdir case only looked at the top level. Recursing into the whole subtree gives very different behavior. Duplicates abound! The worst by count:

identical files: 21_618 files of size 20_509 for 443_363_562 bytes
    .\CatRoot\{F750E6C3-38EE-11D1-85E5-00C04FC295EE}\100.CAT
    .\CatRoot\{F750E6C3-38EE-11D1-85E5-00C04FC295EE}\1000.CAT
    .\CatRoot\{F750E6C3-38EE-11D1-85E5-00C04FC295EE}\1001.CAT
    .\CatRoot\{F750E6C3-38EE-11D1-85E5-00C04FC295EE}\1002.CAT
    .\CatRoot\{F750E6C3-38EE-11D1-85E5-00C04FC295EE}\1003.CAT
    ...

On and on. At the very end:

635 clusters of duplicate files found
629_077_375 bytes in those would be consumed with no duplicates
an extra 1_150_758_369 bytes are consumed by 22_804 duplicates

The SHA256 keyfunc had to do the vast bulk of the work (only duplicates of size <= 4096 bytes can be identified before then).
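(For reference, the SHA256 pass amounts to something like this sketch, using hashlib.file_digest from Python 3.11+; the real code has more bookkeeping:)

    import hashlib

    def keyfunc(fn):
        # last resort: digest the entire file, so the key is definitive
        # and the class is fully resolved afterward
        with open(fn, "rb") as f:
            return hashlib.file_digest(f, "sha256").digest(), True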

Restricted to the root directory, the SHA256 keyfunc had little to do:

16 clusters of duplicate files found
2_602_304 bytes in those would be consumed with no duplicates
an extra 2_750_784 bytes are consumed by 22 duplicates

Three more points, and I’m retiring :wink:.

First, the initial by-size keyfunc is slightly improved like so:

    def keyfunc(fn, ec):
        size = os.path.getsize(fn)
        return size, not size
    yield keyfunc, "size"

That is, if (& only if) the size is 0 bytes, we’re done with that file. BTW, keyfuncs changed to take an equivalence class (ec) object too, which stores the list of file paths, their common file size, and the randomized list of seek offsets for that size.
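Roughly like this sketch (the attribute names match how ec is used in the snippets here; everything else is guesswork, and NRAND is a made-up name for however many random offsets are wanted):

    import random

    NRAND = 8  # made-up count of random offsets

    class EquivClass:
        def __init__(self, fns, size):
            self.fns = fns    # file paths known to share this size
            self.size = size  # their common file size
            # one randomized list of seek offsets, shared by every file
            # in the class so their keys stay comparable
            self.inds = sorted(random.sample(range(size),
                                             min(NRAND, size)))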

Second, on my Win10 Pro system, with an SSD, opening a file is a bigger expense than seeking multiple times. So rather than a pile of keyfuncs, there’s only one more run before the SHA256 keyfunc. It takes bytes “from all over the place” (start, end, most from the middle, and 8 4-byte chunks “at random”).

    def keyfunc(fn, ec):
        try:
            f = open(fn, "rb")
        except PermissionError:
            return None, True
        bs = bytearray(f.read(CHUNKSIZE))   # first CHUNKSIZE bytes
        size = ec.size
        start = max(size // 2 - SMALL // 2, 0)
        f.seek(start)
        bs.extend(f.read(SMALL))            # SMALL bytes from the middle
        f.seek(-min(CHUNKSIZE, size), 2)
        bs.extend(f.read(CHUNKSIZE))        # last CHUNKSIZE bytes
        for i in ec.inds:                   # this class's randomized offsets
            f.seek(i)
            bs.extend(f.read(RANDBYTES))
        f.close()
        return bytes(bs), size <= SMALL
    yield keyfunc, f"{3*CHUNKSIZE + SMALL} various bytes"

Third, the most interesting test on this box was recursing down "/Program Files (x86)". That has files from all sorts of different sources, with no particular regularity. Small and large, text files, executables, redistributable libraries, fonts, various config formats, HTML, graphics files in all known formats, PDF files, vendor-specific formats, …

refining by size ...
...
found 0 duplicate bytes in 28 files # i.e., 28 empty files
leaves 42_599 files in 10_632 classes for 5_696_455_122 bytes

refining by 4192 various bytes ...
...
found 6_446_825 duplicate bytes in 3_901 files
leaves 6_749 files in 2_574 classes for 4_265_139_450 bytes

refining by SHA256 digest ...
...
found 4_201_549_718 duplicate bytes in 6_516 files
leaves 0 files in 0 classes for 0 bytes

4_004 clusters of duplicate files found
1_484_830_481 bytes in those would be consumed with no duplicates
an extra 2_723_166_062 bytes are consumed by 6_441 duplicates

So duplicates were abundant. Surprisingly (to me), the most duplicated bytes were found in “small” (<= 4096 bytes) files. EDIT: oops! That’s an “off by a factor of 1000” error :wink:. The duplicate bytes in the small files were in the millions, but billions for the large files.

“Seeking” in those is next to free: I’m using buffered I/O, and the underlying library has the entire file in RAM. Seeking is just doing add/subtract on its buffer pointer.

Note that the keyfunc that found them also eliminated “almost all” the non-identical larger files (the SHA256 keyfunc started with only 6749 files to look at, of which 6516 turned out to be dups).

Recursing down a Python git checkout tree has similarities:

refining by size ...
...
found 0 duplicate bytes in 97 files
leaves 8_617 files in 2_187 classes for 139_668_009 bytes

refining by 4192 various bytes ...
...
found 3_694_092 duplicate bytes in 3_558 files
leaves 1_485 files in 525 classes for 105_417_346 bytes

refining by SHA256 digest ...
...
found 105_180_177 duplicate bytes in 1_476 files
leaves 0 files in 0 classes for 0 bytes

1_731 clusters of duplicate files found
49_459_920 bytes in those would be consumed with no duplicates
an extra 59_414_349 bytes are consumed by 3_400 duplicates

For a Python-on-Windows checkout, though, many of the “large file” dups are actually due to external projects. Like:

identical files: 3 files of size 178_842 for 536_526 bytes
    .\externals\tcltk-8.6.13.1\amd64\include\tclDecls.h
    .\externals\tcltk-8.6.13.1\arm64\include\tclDecls.h
    .\externals\tcltk-8.6.13.1\win32\include\tclDecls.h
identical files: 5 files of size 7_331 for 36_655 bytes
    .\externals\zlib-1.2.13\contrib\vstudio\vc10\zlibvc.def
    .\externals\zlib-1.2.13\contrib\vstudio\vc11\zlibvc.def
    .\externals\zlib-1.2.13\contrib\vstudio\vc12\zlibvc.def
    .\externals\zlib-1.2.13\contrib\vstudio\vc14\zlibvc.def
    .\externals\zlib-1.2.13\contrib\vstudio\vc9\zlibvc.def

Trying this on a hard drive instead of an SSD made clear that randomized seeking is painful to endure.

So I dropped randomization. The keyfunc now reads the whole file if the size <= SMALL, else catenates the CHUNKSIZE bytes at the start, middle, and end.

    CHUNKSIZE = 1 << 9
    SMALL = 10 * CHUNKSIZE

The “hashes” can therefore be as large as SMALL bytes, or 5120 with those specific values. The primary point of this keyfunc is to spare SHA256 from needing to read up entire large files when they’re not duplicates. Since it’s reading up bytes anyway, it may as well weed out duplicates among files no larger than SMALL at once. Those specific values were effective at weeding out huge unique files early.

For my “Program Files” tree:

refining by <= 5120 bytes, or 512 at ends & middle ...
...
unique files: 32_020 files for 1_475_841_413 bytes
dup files: 4_319 files in 1_713 classes for 8_358_774 bytes; 3_580_208 canonical
PermissionError files: 0 files in 0 classes for 0 bytes
leaves 6_258 files in 2_330 classes for 4_109_809_991 bytes

The 8+ meg it found in 4K small-file duplicates is minor compared to the 1.5 gig it weeded out from 32K files. The vast bulk of the files SHA256 had to deal with were dups, so it was really necessary to read up almost all the remaining bytes:

refining by SHA256 digest ...
...
unique files: 162 files for 12_617_166 bytes
dup files: 6_096 files in 2_289 classes for 4_097_192_825 bytes; 1_430_027_801 canonical
PermissionError files: 0 files in 0 classes for 0 bytes
leaves 0 files in 0 classes for 0 bytes

For completeness, the grand summary:

totals
total files: 85_760
total bytes: 21_001_943_328
unique files: 75_317 files for 16_896_391_729 bytes
dup files: 10_443 files in 4_003 classes for 4_105_551_599 bytes; 1_433_608_009 canonical
PermissionError files: 0 files in 0 classes for 0 bytes

The new hybrid keyfunc is straightforward:

    def keyfunc(fn, ec):
        try:
            with open(fn, "rb") as f:
                size = ec.size
                if size <= SMALL:
                    return f.read(size), True
                else:
                    bs = bytearray(f.read(CHUNKSIZE))
                    mid = size // 2 - CHUNKSIZE // 2
                    f.seek(mid)
                    bs.extend(f.read(CHUNKSIZE))
                    f.seek(-CHUNKSIZE, 2)
                    bs.extend(f.read(CHUNKSIZE))
                    return bytes(bs), False
        except PermissionError:
            return None, True
    yield keyfunc, f"<= {SMALL} bytes, or {CHUNKSIZE} at ends & middle"

Did you try with just the middle bytes?
I wonder if the end bytes are adding anything to the solution.

For media files I suspect that the middle bytes are going to be a very good key.

For small files the key is up to 5120 bytes but if larger you only use 1024 bytes.
Why not two 2560 reads from mid and end?

I have never tried anything without refining by file size first. But I already showed you output from running my sysdir test refining by file size and then by middle only. Middle was quite effective, but the 284 files it left behind were cut down to 74 by subsequent pre-SHA256 passes.

In isolation (but running file-size first), the pre-SHA256 keyfuncs were, from most to least effective (all with 32-byte slices):

  • 4-byte chunks from 8 randomized offsets
  • middle
  • last
  • first

Alas, the random one is the most costly, especially on hard drives (not so much on SSD), due to the 8 seeks it requires. Cut the number of seeks, and it gets less effective.

As before, I still have 0 interest in catering to specific file formats :wink:.

There’s no inherent virtue in making hashes the same length, so “why?” comes to mind. There is some potential advantage, though, in keeping them smaller. In “large file” cases, there are CHUNKSIZE bytes from each of start, middle, and end, for 512*3 = 1536 bytes. If they’re identical in all those bytes, why would you expect that picking even more bytes from the same regions would do significantly better?

While the first bytes are least effective overall, they still help in some cases, and it appears close to costless to read them up: we just opened the file, and the OS has already fired off the seek to reach the start of the file.

One more trick, but again more effective with an SSD: parallelism. keyfuncs are I/O heavy, and several can run fruitfully in parallel in Python threads, even with the GIL.

It was amazingly easy to add this. The original code:

def dupdetect(root):
    ...
    for fn in ec.fns:
        e[keyfunc(fn, ec)].append(fn)
    ...

And after:

import concurrent.futures as cf
from itertools import repeat

NTHREADS = 4

def dupdetect(root, tex):
    ...
    fns = ec.fns
    for res, fn in zip(tex.map(keyfunc,
                               fns,
                               repeat(ec)),
                       fns):
        e[res].append(fn)
    ...

with cf.ThreadPoolExecutor(NTHREADS) as tex:
    dupdetect(root, tex)

That’s all. On a directory tree with over 80K files, on SSD, it even reached 100% usage of 3 CPUs at times.

It gets less effective over time, as equivalence classes shrink (the number of files in the current equivalence class is an upper bound on how many threads can be simultaneously active). It would be better overall to refine multiple equivalence classes in parallel, but that would require heavier code changes and adding locks (keyfuncs on their own don’t mutate any state, or access any mutable shared state, so running any number of those in parallel creates no problems for the Python code).

It helps with a tree on HDD too, but not as much. I’m not quite clear on why not. Best offhand guess is that reading from multiple files at once overwhelms the drive, and we’re spending lots of time blocked in the kernel waiting for bytes to show up. Indeed, on a HDD there are significant stretches of time with barely any visible load on the CPU.

Nope! That’s even easier than what I did :wink:. Amounts to changing

            for ec in classes:
                # build key->file_list dict from keyfunc and ec

to

            for ec, e in tex.imap_unordered(refine, classes):

where tex is now an instance of multiprocessing.pool.ThreadPool and refine() is fiddled to return its equivalence class input too:

        def refine(ec):
            e = defaultdict(list)
            for fn in ec.fns:
                e[keyfunc(fn, ec)].append(fn)
            return ec, e

concurrent.futures doesn’t appear to have a close work-alike for imap_unordered, but the latter is wanted here. We don’t care in which order classes get refined, and don’t want to wait to ensure (as map() does) strict first-to-last order.
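cf.as_completed() can be pressed into service, though, at the cost of submitting everything up front instead of lazily. A sketch, assuming the same refine() as above and a cf.ThreadPoolExecutor:

    import concurrent.futures as cf

    def refine_all(executor, classes):
        # submit one refine() job per equivalence class, then yield the
        # (ec, e) results in whatever order they finish - like
        # imap_unordered, except all the work is queued at once
        futs = [executor.submit(refine, ec) for ec in classes]
        for fut in cf.as_completed(futs):
            yield fut.result()

For a list of equivalence classes that's already in memory, the lack of laziness doesn't matter much.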

With 4 threads, on a large SSD tree this kept 4 CPUs mostly busy for the duration. On an HDD it again bogged down, and especially in the SHA256 pass. Apparently doing just one large-file digest exhausts this particular HDD’s read bandwidth.

But the order changed when boosting from 32-byte to 512-byte chunks, and using the larger & more varied “Program Files” tree. The random one was changed to use 8 64-byte chunks, so the total number of bytes examined stayed at 512 in all cases.

Running each alone (after refining by file size) is now, again best to worst:

  • last (left 887 unique files for SHA256 to detect)
  • random (left 1_221)
  • first (left 1_951)
  • middle (left 2_615)

Doing all 4 left just 98. Doing all except random left 162. Doing just “first” and “last” left 612, so “middle” is definitely helping despite falling to last place.

I didn’t expect those results. There’s really no substitute for doing test drives :wink:.

My conclusion is that first, middle, and last are all worth doing, but, given those, “random” isn’t worth the extra expense. People already explained why “first” was far less effective in the sysdir tree when using just 32 bytes. In that tree now, using 512-byte first, middle, and last, SHA256 finds just 16 unique files (and 23_125 duplicates).

No, this never ends :wink:. But I’ve come full circle, so this is pretty much out of steam: since first+middle+last is very effective at larger chunk sizes, boost CHUNKSIZE to 2**12. File I/O subsystems probably don’t transfer less than that in one gulp, so there’s scant real expense in boosting it.

But I’ve come to take more seriously Chris’s cautions about hash sizes. Apps are dealing with far larger numbers of far larger files than when I was a kid, so I’m using slim SHA256 hashes for everything now. SHA256 is fast; the thing we’re trying to avoid is running it on gigantic files in their entirety. 99% of the point is to weed out unique files early.

Here’s the new keyfunc:

def keyfunc(fn, ec):
    try:
        with open(fn, "rb") as f:
            if ec.resolved:
                h = hashlib.file_digest(f, "sha256")
            else:
                h = hashlib.new("sha256")
                for s, e in ec.spans:
                    f.seek(s)
                    h.update(f.read(e - s))
            return h.digest(), ec.resolved
    except PermissionError:
        return None, True
descr = f"<= {SMALL} bytes"
if FB:
    descr += f" {FB=:}"
if MB:
    descr += f" {MB=:}"
if LB:
    descr += f" {LB=:}"
if RB > 0 and RN > 0:
    descr += f" & {RN} random {RB}-byte chunks"
yield keyfunc, descr

This looks so involved because nothing about which slices to read up is hard-coded anymore. Instead the method of the equivalence-class class that records the common file size for a class also computes a sequence of (start, end) offset pairs reflecting the number of first, middle, and last bytes wanted, and the number of randomized chunks of a given length. Any of these can be 0, in which case no offset pair is recorded. The pairs are normalized in increasing order of starting offset, and adjusted so that no adjacent pairs specify overlapping or “touching” slices. The class’s .resolved member is set to True if and only if this sequence collapses to the single offset pair (0, file_size) (we’re looking at every byte in the file iff that’s so).

Which is overkill now, but made it very easy to switch from one scheme to the next when experimenting.
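For the curious, that span computation amounts to something like this sketch (parameter names mirror the FB/MB/LB/RN/RB constants used in the descr code above; the real method lives on the equivalence-class class):

    import random

    def compute_spans(size, fb, mb, lb, rn, rb):
        # fb/mb/lb = bytes wanted from the start/middle/end; rn random
        # chunks of rb bytes each; any of these may be 0
        spans = []
        if fb:
            spans.append((0, min(fb, size)))
        if mb:
            start = max(size // 2 - mb // 2, 0)
            spans.append((start, min(start + mb, size)))
        if lb:
            spans.append((max(size - lb, 0), size))
        if rn > 0 and rb > 0:
            for start in random.sample(range(size), min(rn, size)):
                spans.append((start, min(start + rb, size)))
        # normalize by increasing start offset, merging spans that
        # overlap or touch so no byte is asked for twice
        spans.sort()
        merged = []
        for s, e in spans:
            if merged and s <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], e)
            else:
                merged.append([s, e])
        spans = [tuple(p) for p in merged]
        # resolved: the spans collapse to reading the whole file
        return spans, spans == [(0, size)]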

Is a CHUNKSIZE of 4096 effective? Yup! In the large “Program Files” tree, it left only 13 unique files for full-blown SHA to discover; in the full sysdir tree, only 2 unique files; in a directory with about 40K unique JPEG files, they were all detected before the SHA256 pass (which was skipped, since no files remained!).

FYI, since the relative effectiveness of the first/middle/last strategies on their own changed when moving from 32 to 512 bytes, what happened when moving from 512 to 4096? Their relative order did not change again. For the relatively large (over 80K files) and varied “Program Files” tree, best to worst:

  • last (left 73 unique files for SHA256 to detect)
  • first (left 123)
  • middle (left 596)

Using all 3 in the keyfunc left 13 unique files behind. Doubling CHUNKSIZE again (to 2**13) reduces that to 2. The same at 2**14. At 32K bytes, no unique files survived.

I think 4K is a good tradeoff for all the directories I have, but I have few files as large as one GB, and no two that large of the same size (so they’re all weeded out by the file-size refinement at the start).

YMMV.