Algorithmic complexity of difflib

Ah, that explains a lot! Also why your code looked different than anything else I’ve seen :wink:

FWIW, I’ve run close to 100 million randomized cases on strings from a 2-character alphabet, and your results always matched difflib’s.

Performance is my concern here - but worst-case performance. Not for a single call to “get the longest match”, but across the whole process of calling that possibly many times to generate a full diff. It seems easy to contrive inputs whose first, third, fifth, …, elements match, but whose other elements never match (thinking of “large alphabet” here, like files viewed as sequences of lines).

So it finds a match on the first elements. Then the slice bounds change to restrict to what remains. b is rebuilt. Then the first remaining elements match again. Slice bounds change again, and b is again rebuilt.

Lather, rinse, repeat. Now it takes, in all, quadratic time to rebuild b roughly len(b)/2 times.
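To make the scenario concrete, here’s one hypothetical construction of such inputs (the names and sizes are mine, just for illustration): even positions hold “sync” elements 0, 1, 2, … that always match, while odd positions never match, so every longest match has length 1 and the slice bounds have to shrink ~n times.

```python
import difflib

n = 50  # number of single-element sync points
# Even positions hold matching "sync" elements 0, 1, 2, ...;
# odd positions hold elements unique to each sequence, which never match.
a = [i // 2 if i % 2 == 0 else f"a{i}" for i in range(2 * n)]
b = [i // 2 if i % 2 == 0 else f"b{i}" for i in range(2 * n)]

sm = difflib.SequenceMatcher(None, a, b, autojunk=False)
blocks = sm.get_matching_blocks()
# n single-element matching blocks, plus the zero-length sentinel at the end
```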

Yes, I was most surprised that nobody ever acted on that. Driver: difflib’s idea of “ratio” isn’t symmetric. Swap a and b and the result can change. “Proper” similarity measures are symmetric, and “longest common subsequence (not necessarily contiguous)” approaches are symmetric. But difflib was aiming to generate diffs; a notion of similarity fell out of that as an incidental consequence.

I’m not a fan of premature optimization either :smiley: Carry on!

Hm… This is very much case dependent. Each rebuild is O(n). However, if the LLCSUB (Length of Longest Common Substring) findings are long, then fewer rebuilds are needed. Also, if the matches land in the middle, then total complexity heads in the direction of O(n log n).

So yes, in case of single-element matches at extreme ends all throughout the computation, the complexity will be O(n^2).


Well, I guess this can not be avoided given current framework.

I think this is an improvement on the current O(n^4) worst case with autojunk=False. Is it O(n^4)? There would need to be a very small alphabet, but with very short consecutive matches, distributed so that the largest ones come at the beginning.


Theoretically this can be addressed, but then it explodes at the other end.

To support arbitrary ranges, each node would need to store all possible end positions. This means O(n^2) worst case memory - and you have raised some concerns about the current memory situation already…

I think it is quite good nevertheless. Its awareness of contiguousness results in good suggestions - a quality that resembles affine gap penalties in non-contiguous longest-common-subsequence-based ratios. And at least for user input suggestions I prefer it to simple Levenshtein.

1 Like

Sure. I was contriving a worst case.

Can’t say - don’t know the code well enough. In the end, you do report the starting indices and length of a maximal match, so you have some way of getting at those already.

Yes, that’s easy for me to say :wink:.

Me too, but not a popular view. I never saw a reason to imagine that, for humans, changing sequence A to B is necessarily exactly as hard as B to A.

Order can matter. For example, what’s 31% of 50? It’s exactly the same question as 50% of 31, but almost everyone finds the latter easier.

2 Likes

FYI, Gemini says:

This uses the first_pos that an earlier reference introduced to solve “leftmost”.

An additional int per state is minor memory bloat.

2 Likes

It is not a single int per state.

E.g.:

s2 = 'bcbcab'
s1 = 'abc'
Automaton(s2).find(s1, stop2=5)

It matches a, then expects to match b after it.
The automaton is built in such a way that after b is matched it needs no recollection of the bc and bc prefixes, as they will not give a better answer.

Now for different ranges, it needs to store 2 additional pointers. Depending on a query range it might need either of the 2.

This results in quadratic memory worst case.

Or even simpler case:

s2 = 'aaaaaaaaaaaaaaaaa'

If the range is not known in advance, it pretty much needs to store indices of all possible contiguous substrings. And the reason Automaton works is exactly that it is able to construct an efficient and unambiguous traversal path where it can discard most of these.


I am not 100% sure yet, will give it a bit more thought.

But I’m at roughly 70% that it isn’t possible without a worst-case memory explosion. And even if it is possible, it’s unlikely to be worth it - let’s say 70% that it isn’t. Thus the probability of this bearing fruit is ~ 9%.

Will stop when it is less than 5%. :slight_smile:

1 Like

I don’t know the code well enough to discuss specifics. So I’ll shut up now until I do :wink:.

Don’t know what stop2=5 might mean.

??? Add a million new ints to each state and it’s still O(len(sequence)) memory burden - the number of states doesn’t increase, just the amount of storage taken per state. Which remains linear.

Fair enough :smiley:.

1 Like
s2[:5]
1 Like

Just finished a session with ChatGPT-5, which I usually use (Gemini was an experiment).

It went great :wink:. It’s very pessimistic about the practicality of not rebuilding the automaton when its slice bounds change. It came up with several approaches anyway, the sanest of which would require adding, to each state, a set of all indices at which suffixes could end at that state.

More interestingly, it had genuine insights into why similar problems don’t arise when using suffix arrays instead. At heart, they abstract away nothing about suffix positions. Instead they show all possible matches everywhere across the strings, and you’re left to weed out the nonsense yourself (“no, I don’t want to see 7 repetitions of ‘xyz’ in a single input, nor do I want to see all the ways ‘xxxx’ can overlap with itself, nor do I want to see matches outside the current slice bounds” - all easy enough to answer, but a small mountain of tedious and error-prone coding).

BTW, easiest way to build a suffix array:

def sa(s):
    return sorted(range(len(s)), key=lambda i: s[i:])

Then:

>>> s = "defabc"
>>> sa(s)
[3, 4, 5, 0, 1, 2]

Very compact! And ridiculously slow and memory-hogging. It’s not at all obvious it can be done in linear time, and doing so has given birth to some of the subtlest algorithms I’ve seen. The fastest I know of is called DivSufSort, and occupies a few thousand(!) lines of C.

To be useful for the purpose at hand, the suffix array needs to be augmented with an LCP (longest common prefix) array, another list[int] where

LCP[i] = length of longest prefix common to s[sa[i-1]:] and s[sa[i]:]

Is your head exploding yet? Suffix arrays are “like that”.

It took a bunch of years to discover that an LCP can be computed from a suffix array and the string in linear time too (the code is brief, but subtle). Which also requires a rank: list[int]

rank[i] = index of i in sa

That’s trivial, just the inverse of sa.
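For the curious, the linear-time LCP construction alluded to here is Kasai’s algorithm, and it’s short enough to show; a sketch:

```python
def lcp_kasai(s, sa):
    """LCP array from the string and its suffix array, in O(n) (Kasai)."""
    n = len(s)
    rank = [0] * n
    for r, p in enumerate(sa):
        rank[p] = r            # rank is just the inverse of sa
    lcp = [0] * n              # lcp[0] is 0 by convention
    h = 0
    for i in range(n):         # walk positions in *text* order
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1         # key trick: h can drop by at most 1 per step
        else:
            h = 0
    return lcp
```

E.g. for "banana" with sa = [5, 3, 1, 0, 4, 2] this yields [0, 1, 3, 0, 0, 2].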

Anyway, given sa, lcp, and rank, all linear-time to build “in theory”, and all lists of little ints, a universe of string problems can be solved quickly. Coding is delicate, though!
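As one example of that universe: longest common substring of two strings falls out directly. A sketch reusing the naive sa() builder from above (the NUL separator is my assumption; and note this relies on the fact, valid only without slice bounds, that the best pair is always adjacent in the suffix array):

```python
def sa(s):
    # the naive O(n^2 log n) builder from earlier in the thread
    return sorted(range(len(s)), key=lambda i: s[i:])

def longest_common_substring(s1, s2):
    sep = "\x00"               # assumed to occur in neither input
    s = s1 + sep + s2
    suf = sa(s)
    best_len, best_pos = 0, 0
    for a, b in zip(suf, suf[1:]):
        # Only adjacent entries drawn from *different* inputs are candidates.
        if (a < len(s1)) == (b < len(s1)):
            continue
        h = 0
        while a + h < len(s) and b + h < len(s) and s[a + h] == s[b + h]:
            h += 1             # sep is unique, so a match never crosses it
        if h > best_len:
            best_len, best_pos = h, min(a, b)
    return s[best_pos:best_pos + best_len]
```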

Except for their hostility to bounding search slices, suffix automatons are much easier to build in linear time and to use for our purpose. But they also require more RAM.

2 Likes

Did some investigation to get a feel whether it could be a better candidate.

Short story long:

  1. Its ability to query arbitrary ranges is impressive
  2. Its memory consumption is much lower.
  3. Python implementation is 4-5x slower than Automaton. Various reasons for that. Among which:
    1. Automaton only needs to build 1 sequence. This is a nice feature that makes things much faster. It can do it both ways though: e.g. for K sequences, it can either build a joint K-sequence automaton and query without input, or build a (K - 1)-sequence automaton and query with the new input. The latter is much faster, especially for the 2-sequence case.
    2. Suffix array O(n) methods are heavy in array operations. In C they would be blazing fast. In Python they take significant time to compute.
  4. Building a suffix array in O(n) is more complex than the Automaton.
  5. Automaton’s ability to build on sequence A and query sequence B aligns with how SequenceMatcher is currently structured. The general expectation is that setting b prepares for querying any a. Not ideal (having to rebuild for different ranges), but a somewhat ok fit. A suffix array would need a slight change in mental model.

Long story short:

  1. Implemented in C it would be the fastest option and is well in line with the need for several queries for same sequences in different ranges.
  2. Python implementation is several times slower and most (if not all) benefits of (1) would be offset, while complexity is several times higher.

I have no good ideas how to make it more memory efficient.

1 Like

Having played with that over the years, I doubt it. Their great theoretical attraction is that matches are always worst-case linear time after building the arrays once, no matter how often slice bounds change.

But it’s very delicate. I doubt anyone has coded this correctly at first try. Even ChatGPT-5 fell into the trap of believing that, because without slice bounds “the answer” is always gotten from a pair of adjacent suffix array entries, the same must hold when bounds are in play.

But it’s not. Instead it needs a “sliding window” approach, marching over slices of the suffix array. You have to find a window that contains slices from both inputs in range, then use the minimum lcp value within that window. Doing that efficiently in turn requires building a special queue that magically keeps track of the smallest lcp value in the window across arbitrary sequences of pushes and pops.

Whether or not a new best is found, then you need to move “the left” of the window to the right, one at a time, until (if ever) the current minimum value in the queue is a new plausible best. If that empties the window, back to the top to find a new plausible window by moving “the right” of the window again.
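That “magic” min-tracking queue is a standard trick - a monotonic deque giving O(1) amortized window minimum. A sketch of how it might look:

```python
from collections import deque

class MinQueue:
    """FIFO queue that also reports its current minimum in O(1).

    Amortized O(1) push/popleft via a secondary deque of increasing
    minimum candidates (the classic sliding-window-minimum trick).
    """
    def __init__(self):
        self._items = deque()
        self._mins = deque()   # increasing candidates; _mins[0] is the minimum

    def push(self, x):
        self._items.append(x)
        while self._mins and self._mins[-1] > x:
            self._mins.pop()   # x outlives and undercuts these candidates
        self._mins.append(x)

    def popleft(self):
        x = self._items.popleft()
        if x == self._mins[0]:
            self._mins.popleft()
        return x

    def min(self):
        return self._mins[0]
```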

This is typical of “advanced” applications of suffix arrays: blazing fast but really tricky to get right.

Very important in large-scale bioinformatics, but I think not so much for us. Suffix arrays were invented to save RAM (suffix trees are easier to work with, but require significantly more memory - suffix arrays capture some of trees’ most important properties more compactly).

Depends on many details. The fastest suffix array builder I know of (DivSufSort) pushes thousands of lines of C for a reason, and it’s not even O(n). It does very sophisticated analysis of the string to take shortcuts when possible, and resorts to “just sort” across some unfavorable regions.

In CPython. Different under PyPy. That specializes lists of “small enough” ints, under the covers, to use native machine ints, so no “Python object” time or space overheads. PyPy is also good at speeding simple list-indexing code.

Much more complex. There are bearably straightforward ways that reduce a sorting approach to mostly sorting small ints instead of strings, and that speeds things up a whole lot. I used that at times, and at other times a “bucket refinement” approach (basically a form of radix sort).

I didn’t really care about O(n) building time because, in context, the “build once” cost is amortized over a potentially large number of slice-bounded searches.

Sure you do! As you already noted, the per-state dicts could be replaced by a single dict indexed by (state_id, character) tuples. Somewhat slower, but should slash memory burden. I don’t think anything can be done to eliminate any of the few ints stored per state, or to reduce the number of states to (as suffix arrays do) len(seq1 + seq2). But there are never more than about twice len(seq2) states, so no clear potential for material savings there regardless.

2 Likes

I suppose you are talking about the doubling rank thing.
It is opt='nlogn' below.
And opt='n' is AS-IS.
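For readers following along, the “doubling rank” construction (my understanding of what opt='nlogn' refers to) can be sketched like this:

```python
def sa_doubling(s):
    """O(n log n) suffix array via prefix doubling: repeatedly sort by
    2k-length prefixes, using the ranks of the two k-length halves as key."""
    n = len(s)
    if n < 2:
        return list(range(n))
    rank = [ord(c) for c in s]          # initial ranks: single characters
    sa = list(range(n))
    k = 1
    while True:
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for t in range(1, n):
            new_rank[sa[t]] = new_rank[sa[t - 1]] + (key(sa[t]) != key(sa[t - 1]))
        rank = new_rank                 # new_rank was computed with the old ranks
        if rank[sa[-1]] == n - 1:       # all ranks distinct: done
            break
        k *= 2
    return sa
```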

xs = make_fib_xs(n=1_000_000)   # your fibonacci sequences
string = xs[-10]
# len(string) == 28_000

%timeit lcsub_sa(string, string)
%timeit lcsub_sa(string, string, opt='nlogn')
%timeit lcsub_sa(string, string, opt='n')
%timeit _LCSUBAutomaton(string).find(string)

------------------------------------------------------------------
| macOS-11.7.10-x86_64-i386-64bit-Mach-O | CPython: 3.13.4
| 7 repeats, 1 times | 2026-01-23T22:01:18
------------------------------------------------------------------
| Units: ms                             | Time |                 |
| ------------------------------------- | ---: | --------------- |
| lcsub_sa(string, string)              |  544 | ███████████████ |
| lcsub_sa(string, string, opt='nlogn') |  327 | █████████       |
| lcsub_sa(string, string, opt='n')     |   97 | ███             |
| _LCSUBAutomaton(string).find(string)  |   17 |                 |

And the additional complexity and runtime penalty are non-negligible.
Cannot do this easily at runtime, as dict-state needs to be copied over to cloned nodes.
Thus it would need an extra traversal routine.
If the structure were a list of nodes, it would be a bit easier.
But currently (for performance reasons) I made nodes hold direct references, thus it is a nested structure with circular references.

Finally, with an extra tuple and int per key (not per node), there are no guarantees that memory would not be larger.

Without venturing into restricted alphabets and similar paths that lead to losing generality, I think the current situation offers few opportunities for worthwhile improvements.

For comparison, memory peak is only twice as big as for suffix array. Even if it was made to work, I doubt dict removal would eliminate more than 20% of it.

string = list(make_fib_xs(n=1_000_000))[-10]
# len(string) == 28_000

| Units: kB                            |  Peak |
| ------------------------------------ | ----: |
| _LCSUBAutomaton(string).find(string) | 9,257 |
| lcsub_sa(string, string, opt='n')    | 4,530 |
1 Like

Pretty much. “Bucket refinement” improves on it, by partitioning the input into buckets. At the start they’re all in the same bucket. Thereafter, one bucket is refined at a time. Buckets can then immediately benefit from the additional accuracy of each bucket refined so far, if the first character of the unresolved prefix of the remaining suffix refers to a bucket that’s already undergone its next refinement step.

I realize that isn’t clear enough to follow :frowning:, but it’s of so little relevance here that it’s not worth putting more effort into fleshing it out. The upshot is that “more than just double” often obtains.

Which is just one of the kinds of advanced analysis DivSufSort employs.

Impressive! The peanut gallery should note that the current difflib isn’t shown there: It would take so long @dg-pb would still be waiting for the result.

Good point!

That’s OK. Everything doesn’t need to be resolved at once :wink:.

I really don’t know what you’re measuring. The close-to-28_000-character strings in the fib generator I posted were of lengths 17_711 and 28_657. Maybe you’re rounding for display?

Regardless, multiple megabytes of RAM is quite high for a 28K input string. Under the same circumstances (which I can’t guess at from here), what’s the memory use for

SequenceMatcher(None, b=string, autojunk=False)

? Disabling autojunk to make memory use as high as possible.

2 Likes

Ah, also, suffix arrays require comparability of elements - it would not fit into difflib, as the current approach doesn’t need that.

Yes, just rounding.

string = list(make_fib_xs(n=1_000_000))[-15]
# len(string) == 1597 (Here is without rounding:)

| Units: kB                            | Peak |                 |
| ------------------------------------ | ---: | --------------- |
| _LCSUBSimple(string).find(string)    |  162 | ███             |
| lcsub_sa(string, string, opt='n')    |  403 | ███████         |
| _LCSUBAutomaton(string).find(string) |  819 | ███████████████ |

So compared to current approach, the memory is 5x.
(You can find _LCSUBSimple and _LCSUBAutomaton in PR.)

1 Like

By using a slotted class for nodes, memory can be slashed by 9%.
At the expense of ~15% slower runtime…

1 Like

Yes! Thanks for pointing that out. I should have already, but it slipped my mind. The current scheme only requires that sequence elements be usable as dict keys, not also that they be orderable.

If it worked out, I intended to propose it as an alternative scheme in difflib, not as a replacement for the current scheme.

Although I’ve never heard of a case where SequenceMatcher was used on elements that weren’t sortable.

Bingo. Now I know exactly what you’re measuring.

Mixed bag, eh? It doesn’t make this easier :wink:

Way back when, typical memories were much smaller, and lots of Python’s earlier code (like difflib) was written with that in mind. It was intentional that the code used only one “big” data structure (the b2j dict) with very simple structure (dict[Hashable, list[int]]).

But typical memories are more than 10x larger now, so a factor of 5 bloat isn’t unreasonable on the face of it. Then again, people are applying computers to ever-more demanding tasks too, so it isn’t reasonable on the face of it either :wink:

I can live with it. But it would be prudent to add an option to force falling back to the current scheme, complementing your current good efforts to auto-detect when that’s likely the better approach.

2 Likes

What would default be?
With what sort of memory footprint would you say the hybrid approach could be left as “one and only”?


I thought about creating a separate sequence matcher, leaving SequenceMatcher unchanged, but this seems to have little value - two matchers doing exactly the same thing would be unnecessary confusion. I think having one which just works is a better option.

A new matcher would be appropriate for a new method - e.g. non-contiguous LCS.


Finally, there is a high-level issue.
I think it is best left for a different PR.
But it would be good to tackle it a bit now.

Most of the methods that make use of SequenceMatcher do not support any parameter propagation - they just hard-code its instantiation.

I think the cleanest approach would be to introduce one extra argument, matcher=DEFAULT_MATCHER, in all appropriate places. This is pretty much straightforward everywhere.

The only modification needed to the MatcherBase (current SequenceMatcher) class would be to add a set_junk(self, isjunk) method (the culprit for needing this is Differ, which takes in 2 arguments that are passed down to hard-coded Matcher instantiations).

1 Like

I don’t think any level is achievable. Users overwhelmingly aren’t interested in advancing our algorithms’ state of the art, or even so much in speeding up their apps, as in not needing to spend time fighting “surprise!” performance degradations (whether in time and/or memory), which often show up in code they didn’t even write (but rather inherited via some other package they happen to be using).

Even a factor of two would cause real pain for a bunch of users.

Remember that Python has millions of users now. So I’m thinking a more resource-intensive algorithm has to be made “opt-in”, no matter how much faster it runs in bad - or even typical - cases.

In some ways yes, in others no.

Right. As I mentioned before, difflib is woefully “unpluggable”.

I believe we should think bigger here. There are many approaches that could be adopted over time. For example, git offers 4 differencing engines. One extreme (“minimal”) is aimed at minimizing the size of diffs, the other (“histogram”) at maximizing comprehensibility for human readers. Both perfectly sensible goals, for different purposes.

Where difflib aims at ignoring junk when finding synch points, “histogram” works from the other end: focus on using rare characters (really lines, in git’s context) as synch points. It runs longest common subsequence (not contiguous substring) on the rarest characters to find a way to partition the input files, and then gives up. Those partitions are then compared by something more like Myers.

So hybrid methods are also reasonable to offer.

Indeed, git offers yet another twist: for comparing individual lines in a diff display, --word-diff can be used to specify what “junk” means to it (a regexp, defaulting to a sequence of whitespace). The lines are split by the regexp, and Myers applied to the words. Kind of :wink: I’m not entirely clear on the details.

Funny that you mention Differ :wink: That was the point of difflib: a way to compare lines of code. SequenceMatcher was a consequence, packaging the guts of what Differ needed to do at two levels: comparing files as sequences of strings (many characters from an unbounded alphabet), and comparing lines as sequences of characters (rarely more than 100, from a small alphabet - this was before Unicode caught on, and 8-bit Latin-1 ruled the Python world).

But there’s no good reason to imagine the same difference engine is the best for both levels. Both levels of Differ “should be” pluggable. Not just their different notions of “junk”, but also the differencing engine to use.

Picturing the future, it seems difflib may need to grow a thread-local context object, which would allow specifying a gazillion choices just once. Difflib would, of course, consult the context object for what to use, rather than grow a gazillion arguments everywhere.

The decimal module may be a good model here. “Almost everything” can be accessed as a function attribute of the context object, but the methods on Decimal objects also sometimes allow overriding various context choices (e.g., the optional rounding= argument to the .quantize() method overrides the context’s idea of current rounding mode).
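To illustrate with decimal (standard library behavior, so this one isn’t speculative):

```python
from decimal import Decimal, getcontext, ROUND_DOWN, ROUND_HALF_UP

# The context supplies a default rounding mode...
getcontext().rounding = ROUND_DOWN
d = Decimal("2.675")
truncated = d.quantize(Decimal("0.01"))                        # Decimal('2.67')

# ...but quantize() can override it per call.
rounded = d.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)  # Decimal('2.68')
```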

random has a simpler model: everything is internally a method of a Random class, but the methods of a single instance of that class are exposed as module-level functions. Users can use any underlying PRNG they like instead, by subclassing Random, overriding a handful of basic methods, and using their own class’s methods. There is no attempt to help them out using those methods via a functional interface instead.

But “almost nobody” does that, apart from the core itself offering SystemRandom to use the platform’s crypto-strength RNG at the bottom level.
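A minimal sketch of that subclassing pattern (the LCG and its constants are my toy example, not anything in the stdlib):

```python
import random

class LCGRandom(random.Random):
    """Plug a toy 64-bit linear congruential generator into random's
    machinery by overriding the handful of basic methods."""
    def __init__(self, seed=0):
        super().__init__(seed)          # calls self.seed(seed)

    def seed(self, a=0, version=2):
        self._state = a if isinstance(a, int) else 0

    def random(self):
        # Constants from Knuth's MMIX LCG; NOT cryptographically strong.
        self._state = (6364136223846793005 * self._state
                       + 1442695040888963407) % 2**64
        return self._state / 2**64

    def getstate(self):
        return self._state

    def setstate(self, state):
        self._state = state

rng = LCGRandom(42)
# The derived methods (randrange, choice, shuffle, ...) now all run
# on top of the toy generator.
sample = [rng.randrange(10) for _ in range(5)]
```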

difflib seems to live somewhere between random's bare-bones simplicity and decimal’s quite elaborate context facilities.

At the start, perhaps best to expose a new engine via a subclass of SequenceMatcher, and enough new optional arguments (as you said) to module-level functions to make them aware of it. That’s as “opt-in” as it gets. Should SequenceMatcher itself attempt to guess at which engine is best to use? With only one other possibility in play, I realize that’s tempting.

But it shouldn’t guess at all at the start (unless possibly a new argument asks it to), and as engine possibilities grow, it would be better to offer a factory function to guess and return an instance of the appropriate class.

Ideally, SequenceMatcher itself shouldn’t be abused to “act like” a factory function - it’s a class with its own ideas of what to do.

Sorry for the rambling - it’s clear that I’m not clear on what best to do either :wink: Time will clarify things.

1 Like

My latest best is to expose matcher: Callable arguments. Differ would get 2 of these.

Provided callable would be:

def get_matcher(isjunk: Callable | None, a: Sequence, b: Sequence) -> MatcherBase:

This way, current SequenceMatcher could be sourced as it is.


Then, by default SequenceMatcher just runs current algorithm.

If user wanted to enable hybrid Automaton behaviour with autojunk=False:

matcher = partial(SequenceMatcher, autojunk=False, hybrid=True)

result = unified_diff(..., matcher=matcher)

This way, one would be able to flexibly source matchers as appropriate with only 1 argument.

Not as neat as it could be, but given the state of affairs, this is my best idea now.


Refactoring would be fairly simple:

from abc import ABC, abstractmethod

class MatcherBase(ABC):
    @abstractmethod
    def get_matching_blocks(self):
        ...

    # Copy over everything down from SequenceMatcher


class SequenceMatcher(MatcherBase):
    def find_longest_match(self, alo=0, ahi=None, blo=0, bhi=None):
        # Specific to this implementation
        ...

    def get_matching_blocks(self):
        ...

I thought about it, and ratios would work sensibly with pretty much anything that implements get_matching_blocks.
Maybe they wouldn’t be something encoded into theory via a wiki page, but theoretically they would make sense.

1 Like

Ok, so I might have overkilled here a bit, but here we are.

2 Test Cases len(s1), len(s2) == (1597, 1597):

def get_case(case):
    if case == 'fib':
        s1 = s2 = list(make_fib_xs())[-15]
    elif case == 'rng':
        import random
        random.seed(1)
        string = list(make_fib_xs())[-15]
        N = len(string)
        s1 = ''.join([chr(random.randint(1, 2)) for _ in range(N)])
        s2 = ''.join([chr(random.randint(1, 2)) for _ in range(N)])
    else:
        raise ValueError
    return s1, s2

Automatons:

  1. Automaton v1 - current one in PR
  2. Automaton v2 - instead of creating many node lists, use 4 lists (one per field) and navigate by indices
  3. Automaton v3 - in addition to (2), instead of size-1-dict, store key and value.
  4. Automaton v4 - in addition to (3), do the same for size-2-dicts

So in practice, at least 50% of dicts are of size 1.
For binary alphabets size-2-dicts are rare.
As alphabet size increases size-2-dicts go up to 25%.


Results:

  1. Baseline for memory is Simple (difflib)
  2. Baseline for performance is Automaton v1
  • cints=1 means array.array('q') is used instead of list for 2 appropriate node fields. This naturally results in less memory, but slower performance (due to conversions).
| Algorithm | Complexity | “fib”: Memory (kB) | “fib”: Speed (ms) | “rng”: Memory (kB) | “rng”: Speed (ms) |
| --- | ---: | ---: | ---: | ---: | ---: |
| Simple (difflib) | 1 :green_circle: | 160 (1.00×) :black_circle: | 237.35 (126.4×) :red_circle: | 160 (1.00×) :black_circle: | 232.00 (102.5×) :red_circle: |
| Suffix Array | 20 :red_circle: | 499 (3.12×) :orange_circle: | 13.67 (7.28×) :red_circle: | 430 (2.69×) :orange_circle: | 12.30 (5.43×) :red_circle: |
| Automaton v1 | 4 :green_circle: | 818 (5.11×) :red_circle: | 1.84 (1.00×) :black_circle: | 954 (5.96×) :red_circle: | 2.26 (1.00×) :black_circle: |
| Automaton v2 | 6 :orange_circle: | 772 (4.83×) :red_circle: | 1.90 (1.04×) :yellow_circle: | 871 (5.44×) :red_circle: | 2.71 (1.20×) :orange_circle: |
| Automaton v2 (cints=1) | 6 :orange_circle: | 661 (4.13×) :red_circle: | 2.41 (1.31×) :orange_circle: | 789 (4.93×) :red_circle: | 3.88 (1.71×) :red_circle: |
| Automaton v3 | 7 :orange_circle: | 326 (2.04×) :orange_circle: | 1.48 (0.80×) :green_circle: | 530 (3.31×) :orange_circle: | 2.24 (0.99×) :green_circle: |
| Automaton v3 (cints=1) | 7 :orange_circle: | 215 (1.34×) :green_circle: | 1.56 (0.85×) :green_circle: | 448 (2.80×) :orange_circle: | 2.67 (1.18×) :orange_circle: |
| Automaton v4 | 10 :red_circle: | 325 (2.03×) :orange_circle: | 1.46 (0.80×) :green_circle: | 418 (2.61×) :orange_circle: | 2.50 (1.10×) :orange_circle: |
| Automaton v4 (cints=1) | 10 :red_circle: | 213 (1.33×) :green_circle: | 2.06 (1.12×) :orange_circle: | 336 (2.10×) :green_circle: | 3.67 (1.62×) :red_circle: |
1 Like

Offhand, I think nothing visible about SequenceMatcher should change (rearranging the internal implementation is fine, though).

There should be a different class to force use of the suffix automaton engine.

And for Hybrid, a third class, although probably with a different name (“hybrid” may become ambiguous in the future when more than 2 engines are available - or perhaps its constructor could accept a sequence of engines to consider - and then be a factory function instead, returning the best guess at which engine class to use).

“First do no harm” :wink:

Sure! For Levenshtein, I believe it uses its own ratio calculation, which can differ depending on whether the definition of Levenshtein in use allows “substitute” as an alternative to “delete+insert”. The former adds 1 to “the distance”, the latter adds 2.

For example, the “edit distance” between “apple” and “xyz” is 5 if substitution is allowed (worst case is max of the input lengths) but 8 if not (worst case is sum of the inputs’ lengths).
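A sketch of both variants via the classic DP, to make the two notions concrete (my own illustration, not difflib code):

```python
def edit_distance(a, b, allow_substitution=True):
    """Edit distance via the classic two-row DP. If substitution is
    disallowed, replacing a character costs 2 (delete + insert)."""
    sub_cost = 1 if allow_substitution else 2
    prev = list(range(len(b) + 1))        # distances from "" to b[:j]
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = min(
                prev[j] + 1,                                       # delete a[i-1]
                cur[j - 1] + 1,                                    # insert b[j-1]
                prev[j - 1] + (0 if a[i - 1] == b[j - 1] else sub_cost),
            )
        prev = cur
    return prev[len(b)]
```

Here edit_distance("apple", "xyz") gives 5, and edit_distance("apple", "xyz", allow_substitution=False) gives 8, matching the worst cases described above.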

A class to implement that should supply its own ratio methods, if it cares.

2 Likes