Improvement on current `difflib.SequenceMatcher` algorithm

Took a step back from implementing new Matcher with Automaton and did some work to improve the current one.

For the worst case with autojunk=True (which this improvement is targeted at):

difflib’s current new algorithm
size 10K 10.6 s 0.2 s
size 15K 24.1 s 0.4 s
size 20K 60.5 s 0.8 s
size 50K ~ 13 m 6.6 s
size 100K ~ 1.5 h 31 s
  • Memory increase is reasonable.
  • Additional complexity is average.

As the default Matcher isn’t changing, maybe the improvement is worth it?

I think it might pay off in the long run - it would solve 99% of related issues.

4 Likes