Decided to exclude the skew fix from the scope, for two reasons:
- I am not convinced it is the best thing to commit to. For its simplicity-to-benefit ratio it is quite good, but there might be better paths towards quality once more tools are available. E.g. I managed to build a quite efficient heuristic for the snake diff concept (now called unified diff, as it includes a unique bonus, so it is 3 in 1: Myers + block + patience), but it needs Aho-Corasick and some other machinery. In short, I think committing to the skew fix is unnecessary for the time being.
- More importantly, I figured out a way to eliminate the O(n^2) behaviour. Not completely (I am sure it can still be broken with strategic effort), but to a degree that would solve pretty much all the issues people are having. Thus, I would rather move complexity from the skew fix to here.
Regarding the second point, this is the idea.
If the matching blocks have many different lengths, there is no issue, as every iteration removes a large block. The problem only occurs with many small blocks, and you cannot have many small blocks of pairwise different lengths: with n elements, blocks of distinct lengths number at most O(sqrt(n)), since 1 + 2 + ... + k = k(k+1)/2. So if there are many blocks of different lengths, some of them must be large.
So, for the case of many matches of equal size, it is very often possible to get many matches in 1 go. e.g.:
seq1 = 'a-b-a-b'
seq2 = 'b+a+b+c'
With an automaton we can get all non-overlapping matches of maximum length. So from the above we have matches (i, j, k):
(0, 2, 1)
(2, 0, 1)
(4, 2, 1)
(6, 0, 1)
Now, usually we would take just the 1st one. But instead we can iterate and, using a simple single-pattern search, find a matching position that comes after the last block in seq2.
(0, 2, 1) - we take ('a')
(2, 0, 1) - j=0 comes earlier than the previous j+k = 2+1 = 3
pattern = seq1[2:2+1] = 'b'
new_j = find(pattern, seq2, start=3) = 4
We instead pick:
(2, 4, 1)
Now we have:
(4, 2, 1) - ('a' again)
We stop this at the first failure to find a later j, as overlap issues do not allow us to continue.
In this case we cannot find any 'a' after the last match.
So final result is:
(0, 2, 1)
(2, 4, 1)
Which took 1 automaton run and collected all blocks.
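The chaining step above can be sketched roughly as follows (a minimal sketch: `chain_matches` is a hypothetical helper, not an existing difflib API, and it assumes the automaton already produced `(i, j, k)` matches sorted by `i`):

```python
def chain_matches(seq1, seq2, matches):
    """Greedily relocate automaton matches so their seq2 positions are
    strictly increasing, using a plain substring search; stop at the
    first block that cannot be placed after the previous one."""
    if not matches:
        return []
    result = [matches[0]]                  # always keep the 1st match
    for i, j, k in matches[1:]:
        _, prev_j, prev_k = result[-1]
        if j >= prev_j + prev_k:
            result.append((i, j, k))       # already comes after the last block
            continue
        # j comes earlier than the previous j+k: search for the same
        # pattern later in seq2
        pattern = seq1[i:i + k]
        new_j = seq2.find(pattern, prev_j + prev_k)
        if new_j == -1:
            break                          # first failure: overlap issues, stop
        result.append((i, new_j, k))
    return result
```

On the example above, `chain_matches('a-b-a-b', 'b+a+b+c', [(0, 2, 1), (2, 0, 1), (4, 2, 1), (6, 0, 1)])` returns `[(0, 2, 1), (2, 4, 1)]`, as in the walkthrough.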
For usual large block case this has little impact.
However it solves all pathological cases that I have encountered, such as:
seq1 = 'a+b+c+d+e+f'
seq2 = 'a-b-c-d-e-f'
seq1 = 'a+a+a+a+a+a'
seq2 = 'a-a-a-a-a-a'
# Or when running with char junk
seq1 = '[0, 0, 0, 0, 0, 0, ...'
seq2 = ' [0, 0, 0, 0, 0, 0, ...'
The last one (2 sequences of 70K chars from the issue on GH) takes forever to run with both the old and new matchers (when charjunk=' \t').
With this adjustment it evaluates in 200ms.
So the above needs 1 more component: a single-pattern search. A simple ~50 LOC search does the trick, but it is added complexity nevertheless, and I think it is more beneficial than the skew fix.
So current scope is:
- A new sequence matcher with functionality identical to SequenceMatcher(autojunk=False), but with polished performance.
- Plug-ability of matchers/differs into different functions - pretty much done. I am just thinking it might be good to allow a string argument as a preset for the new Matcher. What do you think?
- Making the new Matcher modifiable, so that people can experiment with different ideas. Maybe some new good ones will surface over time that are better than the skew fix…
How does this sound?