Trick of the trade for string algorithms: Fibonacci strings often provoke bad cases! Here’s how to generate them:
def fib(a='a', b='b'):
yield a
yield b
while True:
a, b = b, a + b
yield b
f = fib()
xs = [next(f)]
while len(xs[-1]) < 1_000_000:
xs.append(next(f))
That creates 31 strings, with the last having 1_346_269 characters. By construction, lots & lots & lots of “self-similar” regions within and across strings.
Using your gist:
>>> import auto # @dg-pb's suffix automata gist
>>> k = auto.LCSUBAutomaton(xs[-1])
>>> for sub in xs:
... print(k.find(sub))
(0, 0, 1)
(0, 1, 1)
(0, 0, 2)
(0, 2, 3)
(0, 0, 5)
(0, 5, 8)
(0, 0, 13)
(0, 13, 21)
(0, 0, 34)
(0, 34, 55)
(0, 0, 89)
(0, 89, 144)
(0, 0, 233)
(0, 233, 377)
(0, 0, 610)
(0, 610, 987)
(0, 0, 1597)
(0, 1597, 2584)
(0, 0, 4181)
(0, 4181, 6765)
(0, 0, 10946)
(0, 10946, 17711)
(0, 0, 28657)
(0, 28657, 46368)
(0, 0, 75025)
(0, 75025, 121393)
(0, 0, 196418)
(0, 196418, 317811)
(0, 0, 514229)
(0, 514229, 832040)
(0, 0, 1346269)
runs very fast, considering. Less then 3 seconds to finish all of them.
It’s a timing disaster for difflib, though:
>>> from time import perf_counter as now
>>> from difflib import SequenceMatcher
>>> k = SequenceMatcher(None, b=xs[-1], autojunk=False)
>>> for sub in xs:
... start = now()
... k.set_seq1(sub)
... print(k.find_longest_match(), now() - start)
Match(a=0, b=0, size=1) 0.15416760000152863
Match(a=0, b=1, size=1) 0.22853810000015073
Match(a=0, b=0, size=2) 0.36407519999920623
Match(a=0, b=2, size=3) 0.7399129000004905
Match(a=0, b=0, size=5) 1.1449673999995866
Match(a=0, b=5, size=8) 1.9096559999998135
Match(a=0, b=0, size=13) 3.03649360000054
Match(a=0, b=13, size=21) 4.875035299999581
Match(a=0, b=0, size=34) 8.257377900001302
Match(a=0, b=34, size=55) 12.996435400000337
Match(a=0, b=0, size=89) 20.356891400002496
Match(a=0, b=89, size=144) 34.38862720000179
Match(a=0, b=0, size=233) 55.32236929999999
Match(a=0, b=233, size=377) 87.39998639999976
Match(a=0, b=0, size=610) 143.34997649999787
Match(a=0, b=610, size=987) 232.15511960000003
and I gave up waiting then. The elapsed times also roughly grow like a Fibonacci sequence from one line to the next.
So there’s massive potential here. While the results of the runs above are the same, I don’t believe “leftmost maximal match” is guaranteed by the automata construction.