Algorithmic complexity of difflib

“Theoretically optimal” linear-time suffix array construction runs like a pig in pure Python. Faster to use “theoretically worse” approaches based on sorting. The old “rank-doubling” O(n log n) approach works well in Python, leveraging CPython’s sort speed and relatively fast comparison of small ints. “Bucket refinement” is even faster, sorting on just small ints rather than 2-tuples of ints (partitioning one bucket at a time implicitly takes the place of the first int in the “sort everything every time” approaches, which sort on 2-tuples of ints). To get started on bucket refinement, just group all suffixes by their first character. No other character comparisons are ever needed.
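The rank-doubling idea can be sketched in a few lines of pure Python. This is an illustration only, not difflib code: each round sorts suffix indices by a 2-tuple of int ranks, doubling the compared prefix length per round (O(n log² n) with comparison sorting, though CPython's sort keeps it fast in practice):

```python
def suffix_array(s):
    # Prefix-doubling: each round sorts suffix indices by a 2-tuple of int
    # ranks (rank of the first k chars, rank of the next k), doubling k.
    n = len(s)
    if n == 0:
        return []
    rank = [ord(c) for c in s]
    sa = list(range(n))
    k = 1
    while True:
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        new = [0] * n
        for t in range(1, n):
            new[sa[t]] = new[sa[t - 1]] + (key(sa[t]) != key(sa[t - 1]))
        rank = new
        if rank[sa[-1]] == n - 1:  # all ranks distinct: fully sorted
            break
        k *= 2
    return sa
```

For example, `suffix_array("banana")` gives `[5, 3, 1, 0, 4, 2]`.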

SAM construction is linear time “for real”, though, without massive hidden constant factors due to elaborate code and excruciating analysis.

1 Like

Ok, SAM it is.

There is one more tiny detail that I am trying to address.

What happens when only substrings of length 1 are left?

The longest-substring approach loses all its edge.
There is no hierarchy of different lengths to capture with priority.

The impact on actual test data? Small. Very.

How much I am unsettled about it? Moderately.

What is the ideal solution here? Standard LCS (non-contiguous).

What have I done so far? Made a check for whether it can be solved in O(n log n). If not, continue the recursion; if yes, solve it via LIS.

That delivered an identical result to just force-using the exact algorithm.

(each algorithm column: match count, Δ vs Myers)

case              (lenA,  lenB)    Myers   Patience Myers   Gestalt(aj=0)   Gestalt SkewFix   Gestalt SkewFix P1
[L] diff_key      (   15,    15)       8       8       0       5      -3        8       0          8       0
[L] diff_mid1     (   28,    27)      11      11       0      11       0       11       0         11       0
[L] diff_mid2     (  128,   128)      73      63     -10      72      -1       72      -1         72      -1
[L] diff_mid3     (  131,   131)      63      62      -1      50     -13       57      -6         57      -6
[L] difflib.py    ( 2041,  2334)    1883    1882      -1    1879      -4     1881      -2       1883       0
[L] linux_sched.c (11373, 11832)   10844   10839      -5   10837      -7    10844       0      10844       0
[L] react.js      ( 3555,  3337)    2631    2581     -50    2581     -50     2582     -49       2582     -49
[L] three.js      (36282, 36793)   30688   30620     -68   30588    -100    30626     -62      30631     -57
[L] webpack.js    ( 2327,  3429)     929     886     -43     899     -30      912     -17        915     -14
-----
Total:                             47130   46952    -178   46922    -208    46993    -137      47003    -127
Avg % diff:                           0%   2.46%            7.25%           1.65%              1.61%
Runtime (s):                      29.953   0.214            0.292           0.372              0.352

The impact is small: ~9% fewer mismatches.
And we know there are actual mismatches without any benefit.
It makes the code a bit faster as well, by ~7-9%.
And perfectly solves [L] difflib.py case :slight_smile:

1 Like

I’m unclear on what that means. Concrete example?

I don’t think I am. “Lexicographically smallest (a_start, b_start)” probably defines the result no matter what the endcase is you have in mind. That’s at least easy for users to understand.

I think you may be falling into a trap of “let’s optimize something I can measure, for its own sake”. I have no reason to believe “minimal edit distance” is of any particular relevance to human comprehensibility, and the latter is what I want to optimize.

If a local change can match two blocks of length 4 over a single block of length 5, ya, that can matter to human eyeballs. Beyond that, not so much. Humans suck at recognizing non-contiguous matching subsequences.

So I don’t really care how many characters Myers can align, let alone how many fewer other approaches can reach. Unless there’s strong locality too, it’s measuring something that seems to me irrelevant to diff quality for human consumption.

Which is another way of saying “blindly try to reach the Myers bound, regardless of whether it matters to humans” :wink:

LCS is of value as a preliminary step in patience/histogram, by identifying a relatively small number of high value synch points to separate regions within which to restrict “longest substring” matching, preventing accidental matches of boilerplate code across such regions. That it’s a “minimal edit distance” computation is irrelevant to that, though: it’s “high value synch point” that matters.

LIS there meant LCS, yes?

Don’t know what “exact algorithm” means here. Myers?

Does the new complication lead to better diffs? I don’t accept that approaching the Myers bound more closely is a measure of that. Although it’s related when locality is also in play.

1 Like
s1 = 'a_b_a_c_ddd'
s2 = 'bcaddd'

We find ddd, then partition. The left side is now:

s1 = 'a_b_c_'
s2 = 'bca'

Maximum length of substring is now 1.
And there might be many of these.
Now block oriented approach has no edge anymore.
It will find “a” (being leftmost) in s1. And that will be it.

This only concerns the situation above.
This is simply for the situation where there are no blocks larger than 1 anymore and nothing else can be done apart from trying to maximize the number of matches.

ok…

Longest Increasing Subsequence.

Myers / Hunt–Szymanski / DP - anything that finds the Longest Common Subsequence (non-contiguous). They will all give the same number of match points (just different variations if more than one exists).

I believe so.

At this point, there are no blocks anymore.

And there is nothing much to be done for readability - nothing to prioritize.

If there is nothing to prioritize anymore, then the best that can be done is maximizing match count.

Let’s get back small example from above:

s1 = 'a_b_c_'
s2 = 'bca'

Current approach will match “a” and nothing else.
GestaltFix will correct some simple cases a bit (e.g. this one).
But more complex ones could just use plain maximization.
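For comparison, the exact maximization (plain non-contiguous LCS) on this example can be checked with the textbook dynamic program. A minimal sketch, not part of any matcher:

```python
def lcs_len(a, b):
    # Classic O(len(a) * len(b)) dynamic program for the length of the
    # longest common (non-contiguous) subsequence.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

print(lcs_len('a_b_c_', 'bca'))  # 2 ('bc'), vs. the single 'a' greedy keeps
```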

1 Like

A theoretical note.

Let’s assume that we can solve anything with sufficient performance.
Then what would we solve for?
What is the underlying intent behind what “greedy largest block matching” tries to do?

Is it exactly that? It doesn’t seem so.

The fact that the lookahead skew-fix adjustment (which “prevents matching large blocks where they put the algorithm into a suboptimal deadlock”) likely delivers an improvement suggests that it isn’t the complete story.

So my attempt to formalise it:

\text{max} \sum_{i} f(\text{length}_i)

over all combinations of non-overlapping contiguous common substrings, where:

  • \text{length}_i is the length of the i-th contiguous block match
  • f(\text{length}) is some scoring function that takes the length of each substring as an argument (convex, to prioritise larger blocks).

Although this isn’t necessarily the holy grail, I think in this space of intuitive judgement on “what a good diff is and what the block-based approach is aiming for”, this can give some quantitative common ground.
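As a toy illustration of how the choice of f changes the optimum, compare a single block of length 5 against two blocks of length 4, using the two scoring functions discussed later in the thread (the decomposition lists here are made up for illustration):

```python
import math

def score(block_lengths, f):
    # Total score of one candidate set of non-overlapping matching blocks.
    return sum(f(L) for L in block_lengths)

# Two candidate decompositions (made up for illustration):
one_big = [5]      # a single block of length 5
two_mid = [4, 4]   # two blocks of length 4

greedy_like = lambda L: L ** 10                 # very convex: mimics "pick the longest"
gentler = lambda L: L * math.log2(1 + L) ** 3   # rewards total coverage more

print(score(one_big, greedy_like) > score(two_mid, greedy_like))  # True
print(score(one_big, gentler) > score(two_mid, gentler))          # False
```

The very convex function keeps the single length-5 block; the gentler one prefers the two length-4 blocks because they cover more matched characters in total.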

1 Like

I never tried to quantify it. I have no reason to presume that “a number” can quantify human comprehensibility. I’m sure your formulation captures part of it, just as sure as I am that minimal edit distance doesn’t capture it. The latter is why difflib exists :wink:

I stared and stared at actual diffs, and tweaked things, until they “stopped sucking so badly”.

I earlier qualified my endorsement of skew, which aims at maximizing your expression, with the caution provided the difference fits on a screen. Nobody remembers what a diff did two screens before the one they’re currently looking at.

I’m opposed to complication without clear justification. “Maximizing a theoretical number” doesn’t make that cut to me without compelling justification beyond that, sure, it can be done.

Code has costs too, of many kinds. The SAM code is already substantially more involved than anything difflib currently does. The “cognitive load” on reviewers, readers, and maintainers is already taking a big hit. In return,

  • We get enormous improvement in worst-case time.
  • Which is routinely hit in small-alphabet cases unless the autojunk hack throws out most of the characters, massively harming diff quality in order to complete in bearable time.
  • And renders autojunk in general so useless there’s no need to offer it.

Highly compelling on all counts.

Now you’re proposing to add yet another entirely different approach (longest non-contiguous subsequence), with its own unique subtle code. To what good end? Compared to what’s already been achieved, very little at best.

So “take the giant wins and move on” is my preference.

If this were a tiny tweak to code that’s already been written, I wouldn’t much care, beyond concern over complicating the explanation for curious users.

Then again, if you’re convinced this has real value, by all means pursue it now. If it’s not there from the start, “backward compatibility” will hamper any attempt to add it later.

2 Likes

At heart, difflib is trying to maximize that but under the additional constraint that \text{length}_i \gt 1 for all i. Without that constraint, it’s just some flavor of “minimal edit distance”.

I don’t know how to solve that efficiently, though. difflib approximates it greedily (“pick the leftmost longest remaining block at each step”); Skew can improve on that in some cases by considering local variations (other ways to group in the region of “the leftmost longest”).

But for a technical meaning of “best” that appears to have scant relevance - as you said, “there is nothing much to be done for readability” anymore when all remaining matching blocks are of length 1.

It wouldn’t matter to me if it were cheap and easy. But it isn’t. Best I can guess, you’re keen to exploit that, in some cases, a solution to Longest Increasing Subsequence can be used to compute Longest Common Subsequence in worst case O(n log n) time, better than quadratic time.

A pile of new subtle code relying on obscure knowledge (that LIS sometimes can solve LCS is very specialized knowledge), of no use anywhere else in the algorithm.
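For the curious, the LIS-to-LCS reduction works by listing, for each element of b, the positions where it occurs in a (in decreasing order), then taking the length of a longest strictly increasing subsequence of that position stream. A minimal sketch (a hypothetical helper, not the PR's code); it is only attractive when matches are sparse, since the stream holds one entry per matching pair:

```python
import bisect
from collections import defaultdict

def lcs_length_via_lis(a, b):
    # Positions of each symbol of `a`, emitted in DECREASING order so that
    # two matches of the same `b` element can never chain together.
    pos = defaultdict(list)
    for i, ch in enumerate(a):
        pos[ch].append(i)
    stream = []
    for ch in b:
        stream.extend(reversed(pos[ch]))
    # Longest strictly increasing subsequence via patience sorting.
    tails = []
    for x in stream:
        k = bisect.bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails)
```

With r matching pairs this runs in O(r log r), which beats the quadratic DP only when r is small.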

I’d prefer to stay out of that. Future maintainers will have challenge enough wrapping their heads around the details of SAMs, which are further obscured from “textbook” versions by various optimizations (for example, last I saw you were creating your own “list overallocation” strategy, not content to rely on Python’s built-in strategy, or to create enough room for the largest possible number of states just once at the start - and I'm not saying you “shouldn’t”, just that every deviation from “textbook” creates a new head-scratcher for readers to work out).

2 Likes

Simple case, where all matching blocks are of length 1:

>>> SequenceMatcher(None, "diet", "tide").ratio()
0.5
>>> SequenceMatcher(None, "tide", "diet").ratio()
0.25

In the original order, “d” is leftmost, and matches first:

  diet
tide

Which leaves e in common in the trailing parts, so total matches adds to 2. Which is also a maximal common subsequence, although that’s accidental (“ie” and “de” are the two maximal common subsequences).

In the other order, “t” is leftmost and matches first:

   tide
diet

Nothing can match across the blocks remaining (the “before” block is empty for “tide”, and the “after” block is empty for “diet”), so total matches is 1.

Any of these can be used to create a correct diff, but none appears “better” than any other to me. At least “leftmost” captures something human eyes favor.

In the context of the difflib.Differ class, “tide” and “edit” on their own lines are considered too unalike to bother showing any diff between them. Just delete one line and insert the other.
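The asymmetry is easy to see directly with get_matching_blocks():

```python
from difflib import SequenceMatcher

# 'd' is leftmost in "diet" and matches first; 'e' then matches in the
# trailing parts, so 2 characters align in total.
print(SequenceMatcher(None, "diet", "tide").get_matching_blocks())

# 't' is leftmost in "tide" and matches first, leaving empty opposing
# blocks on both sides, so only 1 character aligns.
print(SequenceMatcher(None, "tide", "diet").get_matching_blocks())
```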

2 Likes

First, just to note a fundamental special case. With

f(\text{length})=\text{C} \times \text{length}

for any constant C, it is simply Longest Common Subsequence (non-contiguous): the match point count is identical to any exact algorithm’s.


difflib’s greedy approach is a special case of this, with scoring function:

f(\text{length})=\text{length}^{p},\ p\rightarrow\infty

when all matching substrings are of different lengths.


If the above is not the case (there are matching substrings of the same length, or the scoring function is not that aggressive), then the behaviour differs in several ways:

  1. The above algorithm might give up longer matches for the sake of more shorter ones.
  2. For the sake of longer matches, it might intentionally give up a lot of short matches. difflib doesn’t take the structure of all the other, shorter substrings into account. This, interestingly, sometimes leads to difflib having more individual match points: the fully greedy approach has accidentally left good alignments for shorter matches, while the global optimization routine has intentionally sacrificed them.
  3. If there are many substrings of the same length, they get weighted equally and solved for the combination with the largest count. There is a tendency towards this (if the scoring function is sufficiently convex), but it isn’t exactly that either, as all variations are considered at the same time, with all lengths included.

I have tested it with different scoring functions on a set of data, and this is how the current difflib approach relates to it:

  1. Results of the purely greedy difflib approach are similar to those delivered by the above algorithm with scoring function f(\text{length}) = \text{length}^{10}.
  2. Results of the SkewFix-adjusted approach are similar to those delivered by the above algorithm with scoring function roughly f(\text{length}) = \text{length} \times (\log_2(1 + \text{length}))^3.

Similarity means:

  1. ~70% of individual match points are the same.
  2. The total count of match points differs by ~20% across all cases.
  3. ~60% of the new match points that SkewFix delivers (compared to difflib without this adjustment) are the same as the new match points that length * log2(1 + length)^3 gives compared to length^10.

So not exact equivalence, but parallels can be drawn.
(At least when looking at a set of data with a similar distribution of common substrings.)
And I think the formulation above is a good tool for investigation.


I named this algorithm “snake diff”.
Not 100% sure of correctness, but I did a few iterations and it seems that this dynamic programming solution manages to do it correctly: Snake Diff · GitHub


With f(\text{length}) = \text{length} \times \log_2(1 + \text{length}) it gives beautiful diffs.

It is, in a way, a gradual transition from difflib to Myers as block sizes get shorter and shorter, with a reasonable block-size window within which things get sensibly optimized.

I picked a bit simpler case for comparison. You can see the output of different approaches (with _fancy_replace disabled for clarity) here:

(You can open 2 browsers and compare different approaches side-by-side)

1 Like

Agree with you on all counts here.

I gave up on this.
At least in the sense of considering adding it to difflib.
Apart from the obvious, the main issue is that there is no ideal option: each candidate is lacking in one way or another.

And eventually, it would be nice to keep an algorithm that can be generalized to k strings while retaining performance.

And now it can!
A suffix automaton for k strings can be built, and it is linear time.
And this is the only component really needed.


Instead, I will make a bit of an effort to add an entry point so that it is possible to intercept the divide-and-conquer routine with arbitrary logic.
This way one can easily do a patience routine of depth 1 at the beginning, optimize length-1 blocks, or anything really.
All while keeping the rest of the steps the same.
Similar to how string.Formatter does it, where different steps can be replaced.

1 Like

Your pursuing “the math” uncovered some interesting ideas to explore! Thanks for that.

It shouldn’t be a surprise that suffix arrays can do that too. I have a function for that, like so:

>>> longest(["aaabcde", "aabcdcd", "aacdeaa"])
(2, [[(0, 1), (0,), (0, 5)], [(4,), (3, 5), (2,)]])

The result is

max_length, list_of_list_of_index_k_tuples

each entry in the big list is a list of k elements (k is the number of input strings), each a tuple giving the starting index (or indices) at which a maximal match starts in its corresponding string.

So

[(0, 1), (0,), (0, 5)]

says “aa” is a maximal match, starting at either index 0 or 1 in the first string, at index 0 in the second string, and at 0 or 5 in the third string. In all, then, 2*1*2 = 4 ways to align them all on “aa”. Similarly for the second list, which records the 2 ways to align on the equally long “cd”.

This requires 2 linear-time passes over a Generalized Suffix Array, using a “sliding window” approach. I believe it could be done in one pass, but the code got so hairy I gave up on that: the first pass now just finds the max length, and beyond that only remembers the earliest index in the suffix array at which a window containing the max length was found. The second pass then starts at that index to record all the maximal match indices.

The code is harder and slower than using suffix automatons, but in return all the index info is available directly. And I expect you’d want all the possibilities if, e.g., you wanted to use it to guide a “snake diff” kind of approach.
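For illustration only, the sliding-window idea can be sketched over a naively built generalized suffix array (quadratic suffix sorting and per-pair LCP, so nothing like the linear-time passes described above, and it recovers only the max length, not the index info; `longest_common_substring_k` is a hypothetical name, assuming k >= 2 strings):

```python
from collections import Counter

def longest_common_substring_k(strings):
    # Generalized suffix array: every suffix of every string, tagged by
    # the id of the string that owns it, in sorted order.
    k = len(strings)
    suffixes = sorted((s[i:], sid)
                      for sid, s in enumerate(strings)
                      for i in range(len(s)))

    def lcp(x, y):
        n = 0
        while n < len(x) and n < len(y) and x[n] == y[n]:
            n += 1
        return n

    best = 0
    counts = Counter()   # string ids present in the current window
    lo = 0
    for hi in range(len(suffixes)):
        counts[suffixes[hi][1]] += 1
        while len(counts) == k:  # window covers all k strings
            # Common prefix of the window = min LCP of adjacent pairs.
            best = max(best, min(lcp(suffixes[j][0], suffixes[j + 1][0])
                                 for j in range(lo, hi)))
            counts[suffixes[lo][1]] -= 1
            if counts[suffixes[lo][1]] == 0:
                del counts[suffixes[lo][1]]
            lo += 1
    return best
```

On the example above it agrees with the max length: `longest_common_substring_k(["aaabcde", "aabcdcd", "aacdeaa"])` returns `2`.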

My one nod in that direction was to expose find_longest_match() as an advertised API function, so subclasses could override it to use any differencing engine they wanted instead. But its contract remains to find the longest match; even back then I had in mind maybe moving to suffix arrays, but nothing more general than that.

I agree it would be good to make difflib “more pluggable”.

2 Likes

Regarding this: that was part of a memory optimization where, instead of having one list of nodes, I chose to have 5 lists, one for each field, leading to an overhead of 5 list objects instead of N.

I completely trust Python’s built-in strategy.
But the above meant a very large number of append calls, which decreased performance.

Initially I had just 1 initial allocation, but optimized after remembering that this has worked well in the past.

Agree that this isn’t ideal from a complexity POV.
Maybe it is possible to tell how many nodes are going to be needed in advance…
Will look into it.

1 Like

Sure! Not a real problem. It was just illustrating that any deviation from “textbook” creates new cognitive loads. I knew what you were doing, and why, but it did consume some time I could have spent instead looking at more important issues.

I also know, e.g., why you save away endpos instead of firstpos (which is used in all accounts I’ve seen on the web for locating the starting index of the maximal match). It’s a little less computation while the code is running, requiring just one adjustment at the end to return the result. But it’s so obvious to you, it didn’t merit a clarifying comment.

Little things add up. That’s all.

As I recall, you initially allocate space for 4n/3 states. The maximum possible is 2n-1, about 1.5 times larger. As such things go, ya, I would have just allocated the max possible and been done with it. But I’m fine too with what you did.

The number of states is equal to the number of endpos equivalence classes - I doubt there’s a cheap way to guess that short of running the algorithm. I expect strings of the form ab^n will reach the max - but without the leading a, b^n needs instead the minimum number of states (n+1).
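Both state-count claims are easy to check against a plain textbook suffix automaton (this is the standard online construction, not the optimized version from the PR):

```python
def sam_state_count(s):
    # Textbook online suffix automaton construction; returns the number
    # of states (one per endpos equivalence class, counting the root).
    states = [{'len': 0, 'link': -1, 'next': {}}]
    last = 0
    for ch in s:
        cur = len(states)
        states.append({'len': states[last]['len'] + 1, 'link': -1, 'next': {}})
        p = last
        while p != -1 and ch not in states[p]['next']:
            states[p]['next'][ch] = cur
            p = states[p]['link']
        if p == -1:
            states[cur]['link'] = 0
        else:
            q = states[p]['next'][ch]
            if states[p]['len'] + 1 == states[q]['len']:
                states[cur]['link'] = q
            else:
                # Split: clone q with the shorter length.
                clone = len(states)
                states.append({'len': states[p]['len'] + 1,
                               'link': states[q]['link'],
                               'next': dict(states[q]['next'])})
                while p != -1 and states[p]['next'].get(ch) == q:
                    states[p]['next'][ch] = clone
                    p = states[p]['link']
                states[q]['link'] = clone
                states[cur]['link'] = clone
        last = cur
    return len(states)
```

For `'b' * 9` (length 9) this gives 10 = n+1 states, the minimum; for `'a' + 'b' * 9` (length 10) it gives 19 = 2n-1, the maximum.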

2 Likes

PR is ready to play around with: gh-144132: Implement Suffix Automaton for difflib's Longest Common Substring by dg-pb · Pull Request #144164 · python/cpython · GitHub

As it is pure Python, can just copy-paste to try it out.

I think all is as planned, with the exception that matcher arguments to various functions can be callables (not necessarily subclasses of SequenceMatcherBase). Otherwise, one could not use functools.partial to set different parameters…


Also, here is an example of how one can intercept the algorithm.
This simple case runs a Myers diff when the largest remaining block size is 1.

class MyersForOnesMatcher(difflib.GestaltSequenceMatcher):
    def _modifier(self, depth, block, alo, ahi, blo, bhi):
        length = block[2]
        if length == 1:
            # myers_diff() is a user-supplied function returning aligned
            # (i, j) index pairs for the two slices.
            aligns = myers_diff(self.a[alo:ahi], self.b[blo:bhi])
            blocks = [(alo + i, blo + j, 1) for i, j in aligns]
            return difflib.RESULTBLOCKS, blocks

Also, you can have a look at the diffs of the PR produced with all 3 variations:

  1. Original matcher with autojunk=True
  2. New matcher (same as the original with autojunk=False, though)
  3. New matcher with balancing turned on.

1 Like

Decided to exclude skew fix from the scope. Reasons being:

  1. I am not convinced that it is the best thing to commit to. For its simplicity-vs-benefit trade-off it is quite good, but there might be better paths towards quality once more tools are available. E.g., I managed to build quite an efficient heuristic for the snake-diff concept (now called “unified diff”, as it has a unique bonus included - so 3 in 1: Myers + block + patience), but it needs Aho–Corasick and some other machinery. In short, I think committing to the skew fix is unnecessary for the time being.
  2. More importantly, I figured out a way to eliminate the O(n^2) behaviour. Not completely - I am sure it can still be broken with strategic effort - but to a degree that would solve pretty much all the issues people are having. Thus, I would rather move complexity from the skew fix to here.

Regarding (2). This is the idea.

If there are many different lengths, then there is no issue, as every iteration manages to brush off a large block. The issue only occurs with many small blocks. And you cannot have many small blocks of different lengths: if there are many blocks and they all have different lengths, then some of them must be large.

So, for the case of many matches of equal size, it is very often possible to collect many matches in one go. E.g.:

seq1 = 'a-b-a-b'
seq2 = 'b+a+b+c'

With the automaton we can get all non-overlapping matches of maximum length. So we have these matches from the above:

(0, 2, 1)
(2, 0, 1)
(4, 2, 1)
(6, 0, 1)

Usually we take just the 1st one. But instead we can iterate and, using a simple search, find a matching position that comes after the last block in seq2.

(0, 2, 1) - we take ('a')

(2, 0, 1) - j=0 comes earlier than the previous j+k = 2+1 = 3
pattern = seq1[2:2+1] = 'b'
new_j = find(pattern, seq2, start=3) = 4
We instead pick:
(2, 4, 1)

Now we have:
(4, 2, 1) - ('a' again)

We stop at the first failure to find a later j, as overlap issues do not allow us to continue.
In this case we cannot find any 'a' after the last match.

So final result is:
(0, 2, 1)
(2, 4, 1)

Which took 1 automaton run and collected all blocks.
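The walk-through above can be sketched as a small standalone helper (`chain_matches` is a hypothetical name; the real implementation would work with the automaton's output directly):

```python
def chain_matches(seq1, seq2, matches):
    # Greedily chain equal-length matches left to right: keep a match if
    # its j lands after the previous accepted block in seq2; otherwise
    # re-search seq2 for the same pattern further right, stopping at the
    # first failure (overlap issues do not allow continuing).
    result = []
    next_j = 0
    for i, j, k in matches:
        if j < next_j:
            j = seq2.find(seq1[i:i + k], next_j)
            if j == -1:
                break
        result.append((i, j, k))
        next_j = j + k
    return result

print(chain_matches('a-b-a-b', 'b+a+b+c',
                    [(0, 2, 1), (2, 0, 1), (4, 2, 1), (6, 0, 1)]))
# [(0, 2, 1), (2, 4, 1)]
```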

For the usual large-block case this has little impact.
However it solves all pathological cases that I have encountered, such as:

seq1 = 'a+b+c+d+e+f'
seq2 = 'a-b-c-d-e-f'

seq1 = 'a+a+a+a+a+a'
seq2 = 'a-a-a-a-a-a'

# Or when running with char junk
seq1 = '[0, 0, 0, 0, 0, 0, ...'
seq2 = ' [0, 0, 0, 0, 0, 0, ...'

The last one (2 sequences of 70K chars from the issue on GH) takes forever to run with both the old and new matchers (when charjunk=' \t').
With this adjustment it evaluates in 200ms.

So the above needs 1 more component: single-pattern search.
A simple 50-LOC search does the trick; it is added complexity nevertheless, but I think it is more beneficial than the skew fix.


So current scope is:

  1. A new sequence matcher with functionality identical to SequenceMatcher(autojunk=False), but with polished performance.
  2. Pluggability of matchers/differs for different functions - pretty much done. Just thinking maybe it would be good to allow a string argument as a preset for the new matcher. What do you think?
  3. Making the new matcher modifiable, so that people can experiment with different ideas. Maybe some new good ones will surface over time that are better than the skew fix.

How does this sound?

1 Like

My apologies. For personal health-related reasons, I’m utterly overwhelmed, with no clear end in sight (doctors are still testing, testing, testing … with lots of time in doctors’ offices). I can’t keep up with you :frowning:

That’s good, and especially so cutting off simple predictably quadratic-time cases.

I don’t understand what “allow string argument for preset with new Matcher” might mean to you, and don’t have plausible guesses. Sure, optional new arguments could be added to anything - but to what end? The motivation for “a preset” wasn’t spelled out that I could see.

You know more about this than me, since you’re living the experience of trying to make things more pluggable. Do what you think best!

Agreed “skew” is best put on hold for now - there’s potential for better diff quality there, but the space of possible approaches remains unclear. Docs should explicitly say that the precise diffs generated using the new matcher are subject to change across releases, as approaches to higher quality diffs are implemented. That is, be proactive about not allowing “backward compatibility” to kill progress.

2 Likes

No worries at all.
And hope things sort out in the best possible way for you.

I will be brief then.


Good news.

I think the theoretical worst-case complexity is O(n sqrt(n)) now. It is likely that

a-aa-aaa-aaaa-aaaaa-...
a+aa+aaa+aaaa+aaaaa+...

is the worst that can be done.
And 100K of this runs in 6 seconds.

That wasn’t clear at all…

All I meant was to allow string selection for convenience:

difflib.Differ(..., linematcher='exact')
1 Like