Algorithmic complexity of difflib

Even better: if k is significantly smaller than n, on many inputs the small heap rarely requires “a heap operation” at all, leaving the algorithm more like O(n).

Because each incoming value is compared first to the current top-of-heap (constant time), which weeds out the “nope, can’t be a new top-k” cases at once. New tops typically get pushed ever more rarely as the scan goes along.

However, sorting the whole input can run a lot faster when k isn’t significantly smaller than n. The key to the heap-based approach isn’t so much that a k-heap is faster than n-sorting, but that, in context, heap operations can often be avoided entirely.

In the worst case, the input is sorted in increasing order and each new value is a new largest, requiring a full sift for every value. I’m not sure what the average case is.
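A minimal sketch of the bounded-heap scheme described above (a simplification of the idea behind `heapq.nlargest`, not the actual stdlib code): keep a min-heap of the k largest values seen so far, and let a cheap comparison against the heap's smallest member filter out most inputs before any heap operation happens.

```python
import heapq

def k_largest(iterable, k):
    # Min-heap of the k largest values seen so far.
    it = iter(iterable)
    heap = [v for _, v in zip(range(k), it)]
    heapq.heapify(heap)                      # linear in k
    if heap:
        for v in it:
            # Cheap O(1) guard: most values can't displace the current
            # k-th largest, so they trigger no heap operation at all.
            if v > heap[0]:
                heapq.heapreplace(heap, v)   # O(log k) sift
    return sorted(heap, reverse=True)

print(k_largest(range(1000), 3))             # [999, 998, 997]
```

On sorted ascending input every value passes the guard (the worst case above); on reverse-sorted input none do after the initial heapify (the best case).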

I believe the new CPython REPL uses its own internal implementation of Levenshtein.

I was always surprised that nobody stepped up to tackle sequence alignment more generally. difflib had one clear goal: produce better (“make sense to humans”) results for hand-edited code and text files than the Unixy diff tools at the time.

Which I think it did (& still does) well at, and quadratic time is just a theoretical fantasy for its intended uses.

Which, of course, people don’t stick to :wink: It’s very easy to provoke poor behavior by, e.g., feeding it four-letter ACGT genome strings. But then there’s nothing that can be written that will “run fast” in Python for such highly demanding specialized domains, which people new to that field quickly pick up on: they need to use special libraries developed for that field. Merely moving from an inappropriate algorithm coded in Python to the same algorithm coded in C doesn’t get them the speed they need either.

That said, I’ve played with changing difflib to use suffix arrays instead, which, with enough pain, can solve “longest contiguous subsequence between two sequences” problems in linear time (note that “contiguous” seems to be a key part of “makes sense to humans”!). Always the same: the worst cases got better, but the typical intended cases got slower, and the code as a whole got much more complicated.

2 Likes

Yeah, I realised that and alluded to it in my comment about “deciding if it’s included”.

In that case (where k is a significant fraction of n) I would guess a quickselect-like approach would be the optimal solution, giving you the k smallest (or largest) elements without sorting them (or the other partition) more than necessary.

Not really common enough to warrant inclusion in the stdlib compared to just sorting fully.
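For illustration, the quickselect idea can be sketched like this (a hypothetical helper, not stdlib code; expected linear time, quadratic worst case): partition around a random pivot and narrow in on only the side containing the k-th element, so neither partition is ever fully sorted.

```python
import random

def k_smallest(xs, k):
    # Quickselect-style search: shrink the window [lo, hi) around the
    # k-th smallest element.  Assumes 1 <= k <= len(xs).  Returns the
    # k smallest values, in arbitrary order.
    xs = list(xs)
    lo, hi = 0, len(xs)
    while hi - lo > 1:
        pivot = xs[random.randrange(lo, hi)]
        less    = [x for x in xs[lo:hi] if x < pivot]
        equal   = [x for x in xs[lo:hi] if x == pivot]
        greater = [x for x in xs[lo:hi] if x > pivot]
        xs[lo:hi] = less + equal + greater
        if k <= lo + len(less):
            hi = lo + len(less)              # k-th lies in "less"
        elif k <= lo + len(less) + len(equal):
            return xs[:k]                    # k-th lies among the pivots
        else:
            lo += len(less) + len(equal)     # k-th lies in "greater"
    return xs[:k]
```

The invariant is that `xs[:lo]` always holds the `lo` smallest elements, so once the window collapses, `xs[:k]` is exactly the answer (unsorted).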

Sure! And in the best case the input is reverse sorted, and no heap operation is ever done after making the initial heap from the first k values (which takes time linear in k, via heapify()).

However, in both those cases Python’s full-blown sort also takes linear time, with a smaller constant factor.

Across all permutations of n distinct elements, going left to right to find the max finds a new one about log(n) times. Things get more complicated when finding the k largest, but the high-order bit doesn’t change much so long as k is much smaller than n: reaching into new extremes is rare, for data following a uniform random distribution.
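That expectation is easy to check empirically: in a random permutation, position i holds a new maximum with probability 1/i, so the expected count is the harmonic number H(n) ≈ ln(n). A small sketch (hypothetical helper names):

```python
import math
import random

def new_max_count(xs):
    # Count how many times a left-to-right scan sees a new maximum.
    count = 0
    best = None
    for x in xs:
        if best is None or x > best:
            best = x
            count += 1
    return count

random.seed(0)                     # deterministic demo
n = 100_000
xs = list(range(n))
random.shuffle(xs)
# Expected count is H(n) ~ ln(n) + 0.577, about 12 for n = 100_000.
print(new_max_count(xs), round(math.log(n), 1))
```

So even for very large n, a heap-top guard in a k-largest scan loses only O(log n)-ish times on random data.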

Possibly even better, generalize quickselect to partition the data into more than just 2 buckets, then recurse only on the single bucket of interest. Of course there are tradeoffs to balance - at the extreme, with as many buckets as inputs, it’s just a clumsy way to spell “sort it” :wink:.

Agreed. And it’s a bottomless pit :wink:. There’s a package on PyPI anyway that’s already quite good at this:

It uses quickselect under the covers, incrementally sorting a list, and reusing partitions across calls to speed later searches.

1 Like

From my investigations, the current difflib algorithm is only faster than the Automaton for very short strings, ~n <= 5.

However, I think short strings are pretty much what difflib sees with its default parameters: autojunk flags pretty much all whitespace as junk, so the maximum unit of work is a single word.

An Automaton implemented in Python can solve 2 strings of length 1M in ~4 seconds.

So yeah, maybe not for billions, but can cover fairly big space of problems.

1 Like

In context, I was talking about an alternative based on constructing suffix arrays. Hairy code, but worst-case asymptotic linear time.

I don’t know what “Automaton” means here, and a few minutes of web searching didn’t find anything relevant. A link or two, please?

As mentioned, the primary use case for difflib was for comparing files of hand-edited code. It’s used at two levels then: comparing by line, and later possibly comparing lines by character. The former is what I’m aiming at. The “alphabet” then has potentially unbounded size: from a differencing view, an entire line is “a character”. That’s the only part the suffix array was aimed at.

When the alphabet is known in advance, and especially when it’s small, there are better ways to proceed. But since I don’t know what Automaton means, can’t say more about that.

As above, “string” is ambiguous without knowing the alphabet. In my context, a file with a million lines (however long they may be on their own) is a “string of length 1M”, but from a very large alphabet.

I think you have in mind the more usual case of a million characters from a small alphabet (like a Python string).

Can’t say without knowing more. Note that difflib can be used on sequences of any hashable type. For example, I’ve sometimes used it to find differences between lists of float results. “The alphabet” then contains on the order of 2^{64} “characters”.

That could be done instead by comparing lists of the floats’ string representations, but preserving the line structure again becomes important then, to avoid goofy matches between, e.g., the end of one float string and the start of another. Each float needs to be viewed as an indivisible unit - as “a character”.
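To illustrate the point about hashable sequences, SequenceMatcher treats each list element as one “character”, whether that element is a whole line or a float (standard difflib API, illustrative data):

```python
from difflib import SequenceMatcher

# Comparing by line: each whole line is one "character".
a = ["def f():", "    return 1", "", "print(f())"]
b = ["def f():", "    return 2", "", "print(f())"]
tags = [tag for tag, *_ in SequenceMatcher(None, a, b, autojunk=False).get_opcodes()]
print(tags)   # ['equal', 'replace', 'equal']

# Comparing lists of floats: "the alphabet" is all 2**64 doubles,
# and each float is an indivisible unit.
r = SequenceMatcher(None, [1.0, 2.5, 3.0], [1.0, 2.6, 3.0], autojunk=False).ratio()
print(r)
```

The float comparison matches two of three elements in each list, so `ratio()` is 2·2/6 ≈ 0.667; there's no chance of goofy sub-float matches because no string flattening happens.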

1 Like

I have only started dabbling in this direction lately, so not sure how accurate this is, but:

  1. Suffix automaton - Wikipedia
  2. Suffix tree - Wikipedia via Ukkonen's algorithm - Wikipedia

AI says that the former is simpler to implement correctly and is a bit faster in practice.
Both approaches are O(n + m) and have similar memory consumption.

I have the former implemented - it is indeed simpler than I expected.
And I haven’t touched the latter.

Maybe we both are talking about the same thing…?
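For reference, here's a compact version of the suffix-automaton idea (the classic online construction, not @dg-pb's gist): build the automaton for one sequence, then stream the other through it while tracking the current match length. It works on any sequence of hashables, and since transitions are dicts, alphabet size barely matters. Note this variant only reports a match length and end position; it makes no “leftmost” promise on its own.

```python
def lcsub(s, t):
    """Return (length, end_in_t) of a longest common contiguous run
    of s and t, in O(len(s) + len(t)) expected time."""
    # --- build the suffix automaton of s (classic construction) ---
    length, link, trans = [0], [-1], [{}]
    last = 0
    for c in s:
        cur = len(length)
        length.append(length[last] + 1)
        link.append(-1)
        trans.append({})
        p = last
        while p != -1 and c not in trans[p]:
            trans[p][c] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = trans[p][c]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                clone = len(length)           # split state q
                length.append(length[p] + 1)
                link.append(link[q])
                trans.append(dict(trans[q]))
                while p != -1 and trans[p].get(c) == q:
                    trans[p][c] = clone
                    p = link[p]
                link[q] = link[cur] = clone
        last = cur
    # --- stream t through it, tracking the current match length ---
    v = cur_len = best = best_end = 0
    for i, c in enumerate(t):
        while v and c not in trans[v]:
            v = link[v]                       # fall back along suffix links
            cur_len = length[v]
        if c in trans[v]:
            v = trans[v][c]
            cur_len += 1
        if cur_len > best:
            best, best_end = cur_len, i + 1
    return best, best_end

print(lcsub("abcde", "xbcdy"))   # (3, 4): "bcd" ends at t[4]
```

Using dicts for transitions is what makes the unbounded-alphabet case (lines as “characters”) work without a log(alphabet) factor in practice.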

CASES AND TIMINGS

These are the cases:

    import random  # needed by the cases below; N and case are set elsewhere

    if 0 <= case <= 5:
        ubs = [1, 2, 8, 255, 1024, 1024*8]
        ub = ubs[case]
        s1 = ''.join([chr(random.randint(0, ub)) for _ in range(N)])
        s2 = ''.join([chr(random.randint(0, ub)) for _ in range(N)])
    elif case == 6:
        s1 = s2 = 'A' * N
    elif case == 7:
        s1 = 'B' * N
        s2 = 'A' * N
    elif case == 8:
        s1 = 'A' * (N // 2) + 'B' * (N // 2)
        s2 = 'A' * (N // 2)
    elif case == 9:
        s1 = ''.join([chr(random.randint(0, 256)) for _ in range(N)])
        s2 = 'a' * 100 + s1[:N//2] + 'b' * (N // 2)
    elif case == 10:
        s1 = [chr(random.randint(0, 1024*8)) for _ in range(N)]
        s2 = [chr(random.randint(0, 1024*8)) for _ in range(N)]
        s1[200:300] = chr(1024*8 + 1) * 100
        s2[300:400] = chr(1024*8 + 1) * 100
        s1 = ''.join(s1)
        s2 = ''.join(s2)

And these are timings.
n = 10 (case 10 is not really of size 10)

-------------------------------------------------------------------------
| macOS-11.7.10-x86_64-i386-64bit-Mach-O | CPython: 3.13.4
| 7 repeats, 1 times | 2026-01-21T20:01:01
-------------------------------------------------------------------------
| Units: µs | lcsub_difflib(s1, s2)[2] | LCSUBAutomaton(s2).find(s1)[2] |
| --------- | -----------------------: | -----------------------------: |
| case=0    |                       27 |                             20 |
| case=1    |                       26 |                             20 |
| case=2    |                       20 |                             18 |
| case=3    |                       25 |                             17 |
| case=4    |                       21 |                             20 |
| case=5    |                       19 |                             16 |
| case=6    |                       33 |                             15 |
| case=7    |                       19 |                             12 |
| case=8    |                       20 |                             12 |
| case=9    |                       33 |                             83 |
| case=10   |                    1,495 |                            164 |

and n=1000

-------------------------------------------------------------------------
| macOS-11.7.10-x86_64-i386-64bit-Mach-O | CPython: 3.13.4
| 7 repeats, 1 times | 2026-01-21T20:01:46
-------------------------------------------------------------------------
| Units: µs | lcsub_difflib(s1, s2)[2] | LCSUBAutomaton(s2).find(s1)[2] |
| --------- | -----------------------: | -----------------------------: |
| case=0    |                   86,184 |                          2,727 |
| case=1    |                   65,408 |                          1,725 |
| case=2    |                   20,007 |                          1,551 |
| case=3    |                    1,237 |                          1,290 |
| case=4    |                      806 |                          1,382 |
| case=5    |                      666 |                          1,004 |
| case=6    |                  178,940 |                            911 |
| case=7    |                      420 |                            784 |
| case=8    |                   39,853 |                            513 |
| case=9    |                    1,272 |                          1,508 |
| case=10   |                    2,528 |                          1,070 |

I ran two 1M strings for case 1 (thus an alphabet of size 3) in ~4 seconds.

In general for O(n + m) automaton alphabet size doesn’t matter that much.
Maybe it can fluctuate a bit (1x - 5x), but in comparison to difflib’s algorithm it is fairly flat. :slight_smile:


I haven’t done any testing on actual difflib.
However, I think the Automaton has a decent chance to make things faster for line-by-line comparison.

It handles junk as well.
See the code

If someone has some difflib use cases to share I can see what the actual benefit is (if any).

1 Like

Preliminary timings look very promising! I can’t make time to dig deeper now, but it also gave correct results in the few cases I tried.

Not sure what “handles junk” means to you - the gist doesn’t appear to contain any variation of that string. Is, perhaps, “void_el” your code’s spelling of “junk”? It’s not obvious to me, because “void_el” doesn’t appear to be a collection or function of any kind, just a single “character”.

“junk” was introduced to improve quality of diffs, for users motivated to work for the very best results in their domain. It was later abused to make things “go a lot faster” in pathological cases - which I regret (making autojunk=True the default was a losing idea).

I agree this is simpler than code for suffix arrays. Thanks!

2 Likes

It should be correct.
I have cross-tested it against several implementations.

Exactly. This is subject to change if needed, but currently to handle junk, one would just preprocess:

new_string = [_JUNK_SENTINEL if isjunk(c) else c for c in string]

And give _JUNK_SENTINEL as void_el.

Initially I made it this way because I thought it would be performant to do:

new_string = re.sub(r'[\t \n.ad]', chr(0), string)
...
result = ...(..., junk_el=chr(0))

Created issue: Implement Suffix Automaton for difflib's Longest Common Substring · Issue #144132 · python/cpython · GitHub

Let’s drop it for now. I don’t think this will last:

  • Too much work for the user.

  • Will definitely change results. The code now looks for the longest “wholly interesting match”, meaning one free of any “junk characters”. But it’s not done then: it goes on to try to extend that match, on both ends, to capture junk characters that just happen to be adjacent to the wholly interesting match. That’s in fact pretty common for things like blank lines.

    But to do that, it has to have the original junk characters to compare. They can’t be collapsed into a catch-all character class; a blank line isn’t the same as, e.g., a bare else: on its own line.

A higher-level issue: finding a maximal match is just the start. Then each input is partitioned into 3 areas: before the match, the match, and after the match. The “before” and “after” parts are then smaller instances of the same basic problem to be solved.

This was really clumsy with suffix arrays (it would have been easier with suffix trees, but they consume much more space). In general, we want a way to find the longest match within the virtual slices a[alo : ahi] and b[blo : bhi].

The code now relentlessly visits indices from lower to higher values, which makes this easy although a bit clumsy.

And it allows difflib to meet the doc’s promise: of all maximal matching blocks, one that starts earliest in a will be found, and of all maximal blocks starting at that position in a, the one that starts earliest in b will be returned.

I don’t know that there’s practical value to the promise, it was just a way to define “the” result crisply.
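The partition-and-recurse scheme can be sketched on top of the existing API (this is essentially what `SequenceMatcher.get_matching_blocks()` does internally, written here recursively for clarity; `blocks` is a hypothetical helper name):

```python
from difflib import SequenceMatcher

def blocks(sm, alo, ahi, blo, bhi, out):
    # Find the longest match in the current slices, then recurse on the
    # "before" regions and the "after" regions on both sides of it.
    m = sm.find_longest_match(alo, ahi, blo, bhi)
    if m.size:
        blocks(sm, alo, m.a, blo, m.b, out)
        out.append(m)
        blocks(sm, m.a + m.size, ahi, m.b + m.size, bhi, out)

a, b = "private Thread", "private volatile Thread"
sm = SequenceMatcher(None, a, b, autojunk=False)
out = []
blocks(sm, 0, len(a), 0, len(b), out)
print([(m.a, m.b, m.size) for m in out])   # [(0, 0, 8), (8, 17, 6)]
```

This is why `find_longest_match` accepts the slice bounds `(alo, ahi, blo, bhi)` at all: the recursion repeatedly asks the same question of ever-smaller virtual slices.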

2 Likes

Trick of the trade for string algorithms: Fibonacci strings often provoke bad cases! Here’s how to generate them:

def fib(a='a', b='b'):
    yield a
    yield b
    while True:
        a, b = b, a + b
        yield b

f = fib()
xs = [next(f)]
while len(xs[-1]) < 1_000_000:
    xs.append(next(f))

That creates 31 strings, with the last having 1_346_269 characters. By construction, lots & lots & lots of “self-similar” regions within and across strings.

Using your gist:

>>> import auto #  @dg-pb's suffix automata gist
>>> k = auto.LCSUBAutomaton(xs[-1])
>>> for sub in xs:
...     print(k.find(sub))
(0, 0, 1)
(0, 1, 1)
(0, 0, 2)
(0, 2, 3)
(0, 0, 5)
(0, 5, 8)
(0, 0, 13)
(0, 13, 21)
(0, 0, 34)
(0, 34, 55)
(0, 0, 89)
(0, 89, 144)
(0, 0, 233)
(0, 233, 377)
(0, 0, 610)
(0, 610, 987)
(0, 0, 1597)
(0, 1597, 2584)
(0, 0, 4181)
(0, 4181, 6765)
(0, 0, 10946)
(0, 10946, 17711)
(0, 0, 28657)
(0, 28657, 46368)
(0, 0, 75025)
(0, 75025, 121393)
(0, 0, 196418)
(0, 196418, 317811)
(0, 0, 514229)
(0, 514229, 832040)
(0, 0, 1346269)

runs very fast, considering. Less than 3 seconds to finish all of them.

It’s a timing disaster for difflib, though:

>>> from time import perf_counter as now
>>> from difflib import SequenceMatcher
>>> k = SequenceMatcher(None, b=xs[-1], autojunk=False)
>>> for sub in xs:
...     start = now()
...     k.set_seq1(sub)
...     print(k.find_longest_match(), now() - start)
Match(a=0, b=0, size=1) 0.15416760000152863
Match(a=0, b=1, size=1) 0.22853810000015073
Match(a=0, b=0, size=2) 0.36407519999920623
Match(a=0, b=2, size=3) 0.7399129000004905
Match(a=0, b=0, size=5) 1.1449673999995866
Match(a=0, b=5, size=8) 1.9096559999998135
Match(a=0, b=0, size=13) 3.03649360000054
Match(a=0, b=13, size=21) 4.875035299999581
Match(a=0, b=0, size=34) 8.257377900001302
Match(a=0, b=34, size=55) 12.996435400000337
Match(a=0, b=0, size=89) 20.356891400002496
Match(a=0, b=89, size=144) 34.38862720000179
Match(a=0, b=0, size=233) 55.32236929999999
Match(a=0, b=233, size=377) 87.39998639999976
Match(a=0, b=0, size=610) 143.34997649999787
Match(a=0, b=610, size=987) 232.15511960000003

and I gave up waiting then. The elapsed times also roughly grow like a Fibonacci sequence from one line to the next.

So there’s massive potential here. While the results of the runs above are the same, I don’t believe “leftmost maximal match” is guaranteed by the automaton construction.

2 Likes

I think it is. I looked into it at the beginning, and I came away convinced that, in terms of results, all guarantees are the same as for difflib’s algorithm.


Yes, anything with contiguous matches of half-decent length will run much faster.

However, difflib’s usual case is matches of length 2-10 separated by junk. The timing in the issue was off: the current implementation runs 10-20% slower for a file diff with autojunk.

Which is both disappointing, as I really hoped that by some miracle it would be faster with little effort, and not so bad at the same time: I think 10-20% slower is quite good, and it might be possible to close the gap.

1 Like

I don’t understand the algorithm well enough to say, and can’t make time now. Chatbots assured me it’s not the case.

Just asked Gemini (Google’s free AI):

change suffix automaton to return leftmost match when searching for longest maximal match between two strings

and it gave a very detailed reply on how to change the construction and search, starting with

But I don’t know. Bots always sound supremely confident :wink:

I’m not much concerned. As mentioned before, enabling autojunk by default was an expedient “go faster” hack, abusing the purpose of junk (to improve diff quality, not to “run faster”). It often improves both, but too often destroys diff quality (many cases of long inputs from small alphabets).

I am convinced now that alphabet size doesn’t much matter to the newer approach, even though chatbots insist that, if the size isn’t known in advance, a log(size) factor is introduced to the runtime. When pressed on that, Gemini conceded that using a hash map to record transitions makes that more of a theoretical fine point than a practical concern. “The usual” analyses seem to assume transitions are recorded in some kind of balanced binary search tree when the alphabet size isn’t known in advance.

1 Like

And they are often very wrong on the first try. It often takes a bit of time to steer them in the right direction before they start getting a grasp of what is going on.

Backtracked it. It ensures leftmost first in s1 and then in s2. Leftmost in s2 is ensured by copying q_state[3], which propagates the first best as opposed to overriding it with the current position:

states.append([
    p_length_p1,
    q_state[1],
    q_state[2].copy(),
    q_state[3],
])

I pushed Claude to reach conviction and it slowly went from:
“leftmost match in s2 is not ensured”
to
“60-70% → 95% → 100%, that leftmost match in s2 is ensured”

Does Gemini agree? :slight_smile:


So what is acceptable here?
One issue with the Automaton is that it needs to be rebuilt for every new (blo, bhi), and building is the major cost.

I ran the test suite, and with the Automaton it is 3x slower… from 0.3 s to 0.9 s.
Thus, I implemented it conditionally, so it is only used when autojunk=False.

I do agree that it would be great to be able to choose to run a theoretically exact solution.

Now a 2000-LOC file diff (with autojunk=True) runs in 2.3 ms.

With the Automaton (regardless of whether autojunk is True/False): 2.9 ms.

However, the test suite suggests the performance penalty can get as high as 3x.

1 Like

Don’t know and don’t want to spend more time arguing with bots :wink:.

This is something where I need to understand the algorithm at a deeper level myself - but that also takes time I don’t have much of these days.

I have no doubt that rebuilding would be convenient, but I’m not convinced it’s necessary. The automaton knows about every suffix in b. Similarly for my old experiments with suffix arrays (a suffix array builds knowledge of every suffix in both inputs; it basically catenates both strings into a single input). Instead of rebuilding for each new (alo, ahi) and (blo, bhi) range, it required tedious & delicate code to ensure that searches were cut off “early” on reaching a slice boundary. No rebuilding needed. But rebuilding would have been much more convenient.

Don’t know whether that’s possible here.

Can’t say anything meaningful without knowing what, exactly, “test suite” refers to.

It’s also possible to introduce a new class implementing the new approach, and let the user decide. The current .find_longest_match() method is the heart of it all, and most other functionality can be inherited from SequenceMatcher.

Alas, because nobody ever stepped up to offer other sequence alignment approaches, the module as a whole isn’t as “pluggable” as it should be.
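A hypothetical sketch of that shape: a subclass that overrides only `find_longest_match` and inherits everything else. The automaton search itself is elided; this override just delegates, to show the plug-in point (the class name and structure are my invention, not existing difflib API):

```python
from difflib import SequenceMatcher

class AutomatonMatcher(SequenceMatcher):
    """Hypothetical plug-in point: only find_longest_match changes;
    ratio(), get_opcodes(), get_matching_blocks(), etc. are inherited
    and keep working because they all call find_longest_match."""

    def find_longest_match(self, alo=0, ahi=None, blo=0, bhi=None):
        # A real version would consult a suffix automaton of b here,
        # honoring the (alo, ahi, blo, bhi) slice bounds.
        return super().find_longest_match(alo, ahi, blo, bhi)

sm = AutomatonMatcher(None, "abcXdef", "abcYdef", autojunk=False)
print(sm.ratio())
```

Since the rest of SequenceMatcher is built entirely on `find_longest_match`, swapping that one method is enough to change the whole engine, which is about as “pluggable” as the current design allows.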

1 Like

Probably so, or continued in private.

About “leftmost”, @dg-pb, which algorithm did you start from? Like just about everything else, there are many ways to approach building a suffix automaton (and I lost count of how many radically different ways are out there to build suffix arrays). Perhaps you started from a base that did address “leftmost”?

I ask because “leftmost guaranteed” doesn’t seem common. The bot wasn’t just making things up. For example, this appears to be an excellent introduction written by an actual human :wink:

Scroll way down to the “First occurrence position” section and you’ll find a briefer version of the detailed account Gemini gave me.

Which is fine. “Worst case” means worst case, and hash maps have horrid worst cases. Using a balanced tree assures a log-time worst case. But in real life a hash map typically runs circles around it.

One latent concern: a hash map per automaton state is quite a memory burden - dicts are fat objects. The current scheme has only one dict[Hashable, list[int]]. Just in the back of my mind - it doesn’t drive anything for me.

2 Likes

I am 99% convinced that this is the case; I will cover the remaining 1% in due time.

It is. But that would mean a non-trivial increase in complexity, and in memory and runtime too.

I think what I have added so far is reasonably simple.

test_difflib.py

But it was dominated by one single case.
However, it nudged me towards dynamic selection: the current approach simply cannot be beaten for some cases.

Well, a lot can be done.
E.g. LCS (non-contiguous) based approaches.
However, that is the next level of complexity: nothing remotely resembling O(n) there. To get anything decent one needs to combine two algorithms (Myers diff and Hunt’s), and ideally code them up in C.

I think current approach is quite neat - it avoids a lot of such complexity and gives good results in reasonable time.

And I think adding the Automaton kind of solves all the problems with it. Once it is merged in, a flag (e.g. exact=False) can be exposed on the methods, so that users can choose between the heuristic Gestalt pattern matching - Wikipedia and using SequenceMatcher to calculate the exact LCSUB.

I will post what I did in the issue shortly.

No, I explicitly had a bit of a battle with this in the beginning. I started from a version where this wasn’t guaranteed. Then I noticed that the results didn’t match difflib’s and smoothed this out during refinement iterations. It was maybe a month ago, so I did not remember it straight away.

This is useful.
I will see if I can get some value from it.
However, I quite like the current simplicity.
If there are no practical issues with it, I think it would be best not to ramp up complexity too quickly.

Yes, this is true: memory consumption is much higher. One could use a single dictionary with (node_id, key) tuples as keys. But maybe that can be considered later, when needed.

1 Like