A comment.
a) Algorithm
The majority of common cases will be much faster than re.
A useful example:
import re

text = 'a' * 50_000 + '\n' + 'b' * 50_000 + '\r\n'
%timeit re.findall(r'(\r\n|\n)', text)   # 629 µs
# VS
%timeit text.find('\r\n')                # 93 µs
%timeit text.find('\n')                  # 853 ns
%timeit ['\n', '\r\n']                   # 60 ns
# So using this approach, the total would be roughly:
# 93 µs + 2 × ~1 µs (the '\n' finds) + ~0 (list setup) ≈ 95 µs
# So approximately 6.6x faster.
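For illustration, here is a minimal pure-Python sketch of this find-based approach (the helper name and structure are hypothetical, not the actual implementation). It caches each separator's next occurrence and re-runs str.find only for separators that have already been passed, so each separator scans the text roughly once in total:

def split_on_any(text, seps):
    # Sketch only; assumes non-empty separators.
    seps = sorted(set(seps), key=len, reverse=True)  # longest wins ties
    nxt = {sep: text.find(sep) for sep in seps}      # next hit per separator
    parts, start = [], 0
    while True:
        best_i, best_sep = -1, None
        for sep in seps:                             # pick the earliest hit
            i = nxt[sep]
            if i != -1 and (best_i == -1 or i < best_i):
                best_i, best_sep = i, sep
        if best_sep is None:                         # no more separators
            parts.append(text[start:])
            return parts
        parts.append(text[start:best_i])
        start = best_i + len(best_sep)
        for sep in seps:                             # refresh stale caches only
            if nxt[sep] != -1 and nxt[sep] < start:
                nxt[sep] = text.find(sep, start)

split_on_any(text, ['\n', '\r\n'])   # ['aaa...', 'bbb...', '']

On the example above this does exactly one full find('\r\n') pass plus two cheap find('\n') calls, which is where the ~95 µs estimate comes from.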
This way, a large number of cases can be handled more efficiently and without re. From what I have seen, the vast majority of cases are faster than re for up to 20 search items.
b) Implementation
Although the algorithm does the job fairly well, the current implementation should ideally be improved.
IMO, it should be pushed down to fastsearch.c, where algorithms of this sort belong. This way, if the time comes when someone wants to improve on it (e.g. implement Aho-Corasick), the infrastructure will allow doing it in place.
Also, pushing it to fastsearch.c would result in further performance improvements, as gh-119702 (python/cpython PR #120025) factors out the preparation step, while the current approach has to do both preparation and search for every chunk. There might also be further optimizations available by making use of other methods in fastsearch.c.
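As a rough Python analogy of why factoring out preparation matters (re caches compiled patterns internally, so the gap is smaller here than in C, but the principle is the same):

import re

chunks = ['a' * 1_000 + '\r\n'] * 100

# Preparation redone per chunk (analogous to the current approach,
# which re-prepares the search for every chunk):
hits = [re.findall(r'\r\n|\n', chunk) for chunk in chunks]

# Preparation factored out (analogous to what gh-119702 enables in
# fastsearch.c: preprocess the needle once, then search many times):
pattern = re.compile(r'\r\n|\n')
hits = [pattern.findall(chunk) for chunk in chunks]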
To sum up.
IMO, this is a worthwhile endeavour.
gh-119702 (python/cpython PR #120025) should be merged first, and this could then be implemented in a way that adheres to the resulting infrastructure. It can be done now, but the issue is that the implementation would look non-negligibly different now versus after gh-119702.