Add tuple support to more str functions

I’ve generalised the alternatives. This allows for more complex tests. e.g.

find-tuple/python.exe -m timeit -s "import find_tuple, re; string = 'a' * 999_999 + 'b'; pattern = re.compile('a{999}(?:b|c)')"   "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'a' * 999_999 + 'b'; seps = 999 * 'a' + 'b', 999 * 'a' + 'c'" "find_tuple.find3(string, seps)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'a' * 999_999 + 'b'; seps = 999 * 'a' + 'b', 999 * 'a' + 'c'" "find_tuple.find4(string, seps)"

Is there a specific algorithm you want me to compare with? Or some specific arguments?
(I stopped manually calculating how much slower regex is, you can always check yourself)

Done, we’re back on topic for a while now.

If we add tuple support to [r]find() and [r]split(), I belive count and [r]index should have it too.
They have exactly same issue with find and split: the best algorithm needs setup cost.

Now I am +0 on this. Maybe, this issue needs PEP. Should we add “works poorly on complex input” methods to builtin type and warn it in documents? Or should we restrict only for multiple charactors, not multiple words? (e.g. count_chars, find_chars, split_by_chars, etc…)

On the other hand, removeprefix() and removesuffix() don’t have this issue.
I think adding tuple support to them doesn’t require PEP. It just fixes consistency with startswith() and endswith().

I’ve already added support for [r]index().

[r]split() and count() don’t have an obvious implementation, unlike find(subs) which is conceptually just a loop over the substrings (with some optimisations). And that’s the answer we’ll give if someone suggests to support it there too after this is merged.

For consistency, [r]partition() won’t be supported as well. (Unless you disagree)


I think we should refer to re for more complex patterns.


That should be an easy fix. I’ll create a separate pull request when I have time.

I’ll run a benchmark with 100 tuples soon, I’m currently trying to handle integer overflow.

It’s not clear what problems this thread addresses. Your benchmarks seem to be based on searching for "\/" in paths, which are very small strings.

If you are considering a specialized algorithm related to paths, you should first think about its proper location. I don’t think it has anything to do with str .

If you are thinking about “why not just add tuple support,” you first have to identify real-world problems. I wouldn’t use any of the proposed implementations in this thread. They can be efficiently solved using a producer and consumer model in a distributed system, achieving significant speedup.

That’s my feedback for this thread.


If your primary usage involves working with paths, note that a chunk size of 10,000 is too large.

2 Likes

Addressed in larger benchmark:

find chars best case - 2.40x faster
2000000 loops, best of 5: 176 nsec per loop
1000000 loops, best of 5: 221 nsec per loop
1000000 loops, best of 5: 279 nsec per loop
1000000 loops, best of 5: 225 nsec per loop
5000000 loops, best of 5: 73.4 nsec per loop
find chars mixed case - 1.07x slower
2000000 loops, best of 5: 176 nsec per loop
1000000 loops, best of 5: 220 nsec per loop
1000000 loops, best of 5: 277 nsec per loop
20000 loops, best of 5: 16.2 usec per loop
2000000 loops, best of 5: 188 nsec per loop
find chars worst case - 1.09x faster
5 loops, best of 5: 42.1 msec per loop
5 loops, best of 5: 53.4 msec per loop
50 loops, best of 5: 4.32 msec per loop
10000 loops, best of 5: 32.1 usec per loop
10000 loops, best of 5: 29.5 usec per loop
find subs best case - 2.96x faster
1000000 loops, best of 5: 214 nsec per loop
1000000 loops, best of 5: 314 nsec per loop
1000000 loops, best of 5: 221 nsec per loop
5000000 loops, best of 5: 72.4 nsec per loop
find subs mixed case - 33.5x slower
1000000 loops, best of 5: 221 nsec per loop
1000000 loops, best of 5: 289 nsec per loop
500 loops, best of 5: 731 usec per loop
50000 loops, best of 5: 7.41 usec per loop
find subs worst case - no difference
5 loops, best of 5: 54.6 msec per loop
50 loops, best of 5: 5.29 msec per loop
200 loops, best of 5: 1.46 msec per loop
200 loops, best of 5: 1.47 msec per loop
find many prefixes - 82.1x slower
1 loop, best of 5: 509 msec per loop
1000 loops, best of 5: 302 usec per loop
10 loops, best of 5: 36.7 msec per loop
10 loops, best of 5: 24.8 msec per loop
find many infixes - 5.20x slower
1 loop, best of 5: 508 msec per loop
50 loops, best of 5: 4.33 msec per loop
10 loops, best of 5: 33.1 msec per loop
10 loops, best of 5: 22.5 msec per loop

rfind chars best case - 1.24x faster
2000000 loops, best of 5: 112 nsec per loop
1000000 loops, best of 5: 296 nsec per loop
50 loops, best of 5: 8.54 msec per loop
1000000 loops, best of 5: 229 nsec per loop
5000000 loops, best of 5: 90 nsec per loop
rfind chars mixed case - 54.9x slower
2000000 loops, best of 5: 112 nsec per loop
1000000 loops, best of 5: 304 nsec per loop
50 loops, best of 5: 8.52 msec per loop
500 loops, best of 5: 597 usec per loop
50000 loops, best of 5: 6.15 usec per loop
rfind chars worst case - 1.03x slower
5 loops, best of 5: 51.6 msec per loop
5 loops, best of 5: 56.2 msec per loop
10 loops, best of 5: 22.7 msec per loop
200 loops, best of 5: 1.19 msec per loop
200 loops, best of 5: 1.22 msec per loop
rfind subs best case - 2.32x faster
1000000 loops, best of 5: 358 nsec per loop
50 loops, best of 5: 8.53 msec per loop
1000000 loops, best of 5: 230 nsec per loop
5000000 loops, best of 5: 99.2 nsec per loop
rfind subs mixed case - 20.4x slower
1000000 loops, best of 5: 363 nsec per loop
50 loops, best of 5: 8.53 msec per loop
500 loops, best of 5: 724 usec per loop
50000 loops, best of 5: 7.4 usec per loop
rfind subs worst case - no difference
5 loops, best of 5: 52.3 msec per loop
10 loops, best of 5: 20.9 msec per loop
200 loops, best of 5: 1.45 msec per loop
200 loops, best of 5: 1.45 msec per loop
rfind many suffixes - 1.06x slower
1 loop, best of 5: 508 msec per loop
1 loop, best of 5: 217 msec per loop
10 loops, best of 5: 24.6 msec per loop
10 loops, best of 5: 26.1 msec per loop
rfind many infixes - no difference
1 loop, best of 5: 511 msec per loop
1 loop, best of 5: 218 msec per loop
10 loops, best of 5: 22.5 msec per loop
10 loops, best of 5: 22.4 msec per loop

That’s a bit overkill, I would rather have find_chars() than that, as it can be used by more people.


If you’re talking about the alternatives, I wouldn’t use any of them either. How do you think to speed this up? Without an implementation this claim is meaningless.


Smaller chunks don’t speed up the performance that much for early matches. There’s too much overhead for that.

Should we add a note about re to [r]find() and [r]index()? If so, what should be in it?

Hmm, upon closer inspection it’s indeed too large for the mixed case. I’ve lowered it to 1000.
That does make the worst case a bit slower, but I think that’s acceptable.

The only cases where it’s significantly slower are covered by the re module.

1 Like

If it so happens and this gets implemented. If I see you using this, do you promise to remove it from your code on my request?

2 Likes

If anyone is interested in this the work on finite-set-string-search this is still happening in gh-118184: Support tuples for `find`, `index`, `rfind` & `rindex` by nineteendo · Pull Request #2 · nineteendo/cpython · GitHub

Sorry for bumping this. Below is the summary from that pull request. Right now I want to know just one thing: can I create a pull request, or should I just delete my branch? I don’t think any further changes I make will change someone’s opinion.

Motivation

For finding multiple substrings, there’s currently no single algorithm that outperforms others in most cases. Their performance varies significantly between the best-case and worst-case, making it difficult to choose one:

algorithm loop in loop startswith re[1] find str unit
find chars best case 1.00 1.19 1.56 1.37 x 180 nsec
find chars mixed case 1.00 1.22 1.54 91.01 x 178 nsec
find chars worst case 1262.20 1597.56 131.71 1.00 x 32.80 usec
find subs best case 1.00 1.33 1.17 x 212 nsec
find subs mixed case 1.00 1.30 3327.27 x 220 nsec
find subs worst case 35.82 3.62 1.00 x 1.46 msec
find many prefixes 1760.80 1.00 122.26 x 301.00 usec
find many infixes 122.17 1.00 7.71 x 4.33 msec
rfind chars best case 1.00 2.61 4.34 1.96 x 114 nsec
rfind chars mixed case 1.00 2.66 4.34 4561.40 x 114 nsec
rfind chars worst case 50.10 55.25 6.94 1.00 x 1.01 msec
rfind subs best case 1.58 2.55 1.00 x 229 nsec
rfind subs mixed case 1.00 1.59 1921.05 x 380 nsec
rfind subs worst case 38.07 4.97 1.00 x 1.45 msec
rfind many suffixes 1094.46 1.00 50.51 x 487.00 usec
rfind many infixes 54.70 1.00 2.41 x 9.69 msec

That’s why I’m suggesting a dynamic algorithm that doesn’t suffer from these problems:

algorithm loop in loop startswith re[1:1] find str find tuple unit
find chars best case 2.67 3.19 4.15 3.64 1.00 x 68 nsec
find chars mixed case 2.29 2.80 3.53 208.23 1.00 x 78 nsec
find chars worst case 1489.21 1884.89 155.40 1.18 1.00 x 27.80 usec
find subs best case 3.07 4.07 3.59 1.00 x 69 nsec
find subs mixed case 2.32 3.02 7713.38 1.00 x 95 nsec
find subs worst case 35.82 3.62 1.00 1.01 x 1.46 msec
find many prefixes 1760.80 1.00 122.26 82.06 x 301.00 usec
find many infixes 122.17 1.00 7.71 5.22 x 4.33 msec
rfind chars best case 1.33 3.45 5.76 2.59 1.00 x 86 nsec
rfind chars mixed case 1.08 2.86 4.67 4905.66 1.00 x 106 nsec
rfind chars worst case 50.10 55.25 6.94 1.00 1.10[2] x 1.01 msec
rfind subs best case 3.70 5.98 2.34 1.00 x 98 nsec
rfind subs mixed case 3.42 5.43 6576.58 1.00 x 111 nsec
rfind subs worst case 38.07 4.97 1.00 1.04[3] x 1.45 msec
rfind many suffixes 1094.46 1.00 50.51 50.31 x 487.00 usec
rfind many infixes 54.70 1.00 2.41 2.39 x 9.69 msec
algorithms
# find_tuple.py
def find0(string, chars):
    for i, char in enumerate(string):
        if char in chars:
            break
    else:
        i = -1
    return i

def find1(string, subs):
    for i in range(len(string)):
        if string.startswith(subs, i):
            break
    else:
        i = -1
    return i

def find2(string, pattern):
    match = pattern.search(string)
    i = match.start() if match else -1
    return i

def find3(string, subs):
    i = -1
    for sub in subs:
        new_i = string.find(sub, 0, None if i == -1 else i + len(sub))
        if new_i != -1:
            i = new_i
    return i

def find4(string, subs):
    i = string.find(subs)
    return i

def rfind0(string, chars):
    i = len(string) - 1
    while i >= 0 and string[i] not in chars:
        i -= 1
    return i

def rfind1(string, subs):
    for i in range(len(string), -1, -1):
        if string.startswith(subs, i):
            break
    else:
        i = -1
    return i

rfind2 = find2

def rfind3(string, subs):
    i = -1
    for sub in subs:
        new_i = string.rfind(sub, 0 if i == -1 else i)
        if new_i != -1:
            i = new_i
    return i

def rfind4(string, subs):
    i = string.rfind(subs)
    return i
benchmark script
# find_tuple.sh
echo find chars best case
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'ab' + '_' * 999_998; chars   = 'ab'"               "find_tuple.find0(string, chars)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'ab' + '_' * 999_998; subs    = tuple('ab')"        "find_tuple.find1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, re; string = 'ab' + '_' * 999_998; pattern = re.compile('[ab]')" "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'ab' + '_' * 999_998; subs    = 'ab'"               "find_tuple.find3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'ab' + '_' * 999_998; subs    = tuple('ab')"        "find_tuple.find4(string, subs)"
echo find chars mixed case
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'b' + '_' * 999_999; chars   = 'ab'"               "find_tuple.find0(string, chars)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'b' + '_' * 999_999; subs    = tuple('ab')"        "find_tuple.find1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, re; string = 'b' + '_' * 999_999; pattern = re.compile('[ab]')" "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'b' + '_' * 999_999; subs    = 'ab'"               "find_tuple.find3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'b' + '_' * 999_999; subs    = tuple('ab')"        "find_tuple.find4(string, subs)"
echo find chars worst case
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; chars   = 'ab'"               "find_tuple.find0(string, chars)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = tuple('ab')"        "find_tuple.find1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, re; string = '_' * 1_000_000; pattern = re.compile('[ab]')" "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = 'ab'"               "find_tuple.find3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = tuple('ab')"        "find_tuple.find4(string, subs)"
echo find subs best case
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'abcd' + '_' * 999_996; subs    = 'ab', 'cd'"          "find_tuple.find1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, re; string = 'abcd' + '_' * 999_996; pattern = re.compile('ab|cd')" "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'abcd' + '_' * 999_996; subs    = 'ab', 'cd'"          "find_tuple.find3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'abcd' + '_' * 999_996; subs    = 'ab', 'cd'"          "find_tuple.find4(string, subs)"
echo find subs mixed case
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'cd' + '_' * 999_998; subs    = 'ab', 'cd'"          "find_tuple.find1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, re; string = 'cd' + '_' * 999_998; pattern = re.compile('ab|cd')" "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'cd' + '_' * 999_998; subs    = 'ab', 'cd'"          "find_tuple.find3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'cd' + '_' * 999_998; subs    = 'ab', 'cd'"          "find_tuple.find4(string, subs)"
echo find subs worst case
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = 'ab', 'cd'"          "find_tuple.find1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, re; string = '_' * 1_000_000; pattern = re.compile('ab|cd')" "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = 'ab', 'cd'"          "find_tuple.find3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = 'ab', 'cd'"          "find_tuple.find4(string, subs)"
echo find many prefixes
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = tuple(f'prefix{i}' for i in range(100))"                "find_tuple.find1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, re; string = '_' * 1_000_000; pattern = re.compile('|'.join(f'prefix{i}' for i in range(100)))" "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = tuple(f'prefix{i}' for i in range(100))"                "find_tuple.find3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = tuple(f'prefix{i}' for i in range(100))"                "find_tuple.find4(string, subs)"
echo find many infixes
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = tuple(f'{i}infix{i}' for i in range(100))"                "find_tuple.find1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, re; string = '_' * 1_000_000; pattern = re.compile('|'.join(f'{i}infix{i}' for i in range(100)))" "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = tuple(f'{i}infix{i}' for i in range(100))"                "find_tuple.find3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = tuple(f'{i}infix{i}' for i in range(100))"                "find_tuple.find4(string, subs)"

echo ---

echo rfind chars best case
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_998 + 'ba'; chars   = 'ab'"                      "find_tuple.rfind0(string, chars)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_998 + 'ba'; subs    = tuple('ab')"               "find_tuple.rfind1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, regex; string = '_' * 999_998 + 'ba'; pattern = regex.compile('(?r)[ab]')" "find_tuple.rfind2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_998 + 'ba'; subs    = 'ab'"                      "find_tuple.rfind3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_998 + 'ba'; subs    = tuple('ab')"               "find_tuple.rfind4(string, subs)"
echo rfind chars mixed case
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_999 + 'b'; chars   = 'ab'"                      "find_tuple.rfind0(string, chars)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_999 + 'b'; subs    = tuple('ab')"               "find_tuple.rfind1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, regex; string = '_' * 999_999 + 'b'; pattern = regex.compile('(?r)[ab]')" "find_tuple.rfind2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_999 + 'b'; subs    = 'ab'"                      "find_tuple.rfind3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_999 + 'b'; subs    = tuple('ab')"               "find_tuple.rfind4(string, subs)"
echo rfind chars worst case
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; chars   = 'ab'"                      "find_tuple.rfind0(string, chars)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = tuple('ab')"               "find_tuple.rfind1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, regex; string = '_' * 1_000_000; pattern = regex.compile('(?r)[ab]')" "find_tuple.rfind2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = 'ab'"                      "find_tuple.rfind3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = tuple('ab')"               "find_tuple.rfind4(string, subs)"
echo rfind subs best case
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_996 + 'cdab'; subs    = 'ab', 'cd'"                 "find_tuple.rfind1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, regex; string = '_' * 999_996 + 'cdab'; pattern = regex.compile('(?r)ab|cd')" "find_tuple.rfind2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_996 + 'cdab'; subs    = 'ab', 'cd'"                 "find_tuple.rfind3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_996 + 'cdab'; subs    = 'ab', 'cd'"                 "find_tuple.rfind4(string, subs)"
echo rfind subs mixed case
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_998 + 'cd'; subs    = 'ab', 'cd'"                 "find_tuple.rfind1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, regex; string = '_' * 999_998 + 'cd'; pattern = regex.compile('(?r)ab|cd')" "find_tuple.rfind2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_998 + 'cd'; subs    = 'ab', 'cd'"                 "find_tuple.rfind3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_998 + 'cd'; subs    = 'ab', 'cd'"                 "find_tuple.rfind4(string, subs)"
echo rfind subs worst case
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = 'ab', 'cd'"                 "find_tuple.rfind1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, regex; string = '_' * 1_000_000; pattern = regex.compile('(?r)ab|cd')" "find_tuple.rfind2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = 'ab', 'cd'"                 "find_tuple.rfind3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = 'ab', 'cd'"                 "find_tuple.rfind4(string, subs)"
echo rfind many suffixes
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = tuple(f'{i}suffix' for i in range(100))"                            "find_tuple.rfind1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, regex; string = '_' * 1_000_000; pattern = regex.compile(f'(?r){'|'.join(f'{i}suffix' for i in range(100))}')" "find_tuple.rfind2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = tuple(f'{i}suffix' for i in range(100))"                            "find_tuple.rfind3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = tuple(f'{i}suffix' for i in range(100))"                            "find_tuple.rfind4(string, subs)"
echo rfind many infixes
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = tuple(f'{i}infix{i}' for i in range(100))"                            "find_tuple.rfind1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, regex; string = '_' * 1_000_000; pattern = regex.compile(f'(?r){'|'.join(f'{i}infix{i}' for i in range(100))}')" "find_tuple.rfind2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = tuple(f'{i}infix{i}' for i in range(100))"                            "find_tuple.rfind3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = tuple(f'{i}infix{i}' for i in range(100))"                            "find_tuple.rfind4(string, subs)"

Examples

>>> "0123456789".find(("a", "b", "c"))
-1
>>> "0123456789".find(("0", "1", "2"))
0
>>> "0123456789".rfind(("7", "8", "9"))
9

Use cases

While I haven’t found a lot of use cases, this new addition would improve the readability and performance for all of them:

>2K files with /max\(\w+\.rfind/

Python implementation

The implementation written in Python is clear and simple (in C overflow and exceptions need to be handled manually):

MIN_CHUNK_SIZE = 32
MAX_CHUNK_SIZE = 16384
EXP_CHUNK_SIZE = 2

def find_tuple(string, subs, start=0, end=None):
    end = len(string) if end is None else end
    result = -1
    chunk_size = MIN_CHUNK_SIZE
    chunk_start = start
    while True:
        chunk_end = min(chunk_start + chunk_size, end)
        if chunk_end < end:
            chunk_end -= 1
        for sub in subs:
            sub_end = min(chunk_end + len(sub), end)
            new_result = string.find(sub, chunk_start, sub_end)
            if new_result != -1:
                if new_result == chunk_start:
                    return new_result
                chunk_end = new_result - 1 # Only allow earlier match
                result = new_result
        if result != -1 or chunk_end >= end:
            return result # Found match or searched entire range
        chunk_start = chunk_end + 1
        chunk_size = min(chunk_size * EXP_CHUNK_SIZE, MAX_CHUNK_SIZE)

def rfind_tuple(string, subs, start=0, end=None):
    end = len(string) if end is None else end
    result = -1
    chunk_size = MIN_CHUNK_SIZE
    chunk_end = end
    while True:
        chunk_start = max(start, chunk_end - chunk_size)
        if chunk_start > start:
            chunk_start += 1
        for sub in subs:
            sub_end = min(chunk_end + len(sub), end)
            new_result = string.rfind(sub, chunk_start, sub_end)
            if new_result != -1:
                if new_result == chunk_end:
                    return new_result
                chunk_start = new_result + 1 # Only allow later match
                result = new_result
        if result != -1 or chunk_start <= start:
            return result # Found match or searched entire range
        chunk_end = chunk_start - 1
        chunk_size = min(chunk_size * EXP_CHUNK_SIZE, MAX_CHUNK_SIZE)

Explanation

The search is split up in chunks, overlapping by the length of a substring. After the first match, we search for the next substring in the part before (or after for reverse search). Each chunk will be twice as large as the previous one, but capped at 16384. The dynamic size ensures good best- and worst-case performance.

C call hierarchy

find_sub() and find_subs() are called based on the argument type using an inline function. find_sub() is called for tuples of length 1, chunk_find_sub() is called for tuples with more than 1 element:

Calibration

MIN_CHUNK_SIZE and MAX_CHUNK_SIZE were calibrated on this benchmark:

  • 32 was the highest mimimum size beating out all other algorithms in the best-case, setting it any lower would hurt performance for substrings after the first chunk.
  • 16384 was the highest maximum size with a measurable improvement in the worst case, setting it any higher would only hurt performance in the average case.
MIN_CHUNK_SIZE 4 8 16 32 64 128 256 512 1024 unit
find chars best case 1.01 1.00 1.07 1.09 1.06 1.10 1.06 1.10 1.10 x 66.7 nsec
find chars mixed case 1.00 1.01 1.10 1.14 1.10 1.17 1.09 1.23 1.24 x 75.3 nsec
find subs best case 1.00 1.01 1.02 1.05 1.00 1.04 1.01 1.01 1.01 x 70.9 nsec
find subs mixed case 1.00 1.01 1.06 1.22 1.39 2.22 3.31 5.51 10.0 x 84.1 nsec
rfind chars best case 1.02 1.01 1.02 1.00 1.01 1.00 1.00 1.03 1.02 x 92.9 nsec
rfind chars mixed case 1.00 1.01 1.06 1.12 1.29 1.68 2.34 3.69 6.12 x 98.8 nsec
rfind subs best case 1.00 1.01 1.02 1.01 1.03 1.00 1.00 1.02 1.02 x 96.9 nsec
rfind subs mixed case 1.00 1.00 1.06 1.08 1.29 1.89 2.75 4.54 8.05 x 106 nsec
MAX_CHUNK_SIZE 1024 2048 4096 8192 16384 32768 65536 unit
find chars worst case 1.63 1.27 1.11 1.09 1.01 1.00 1.05 x 27.7 usec
find subs worst case 1.03 1.01 1.00 1.00 1.00 1.03 1.00 x 1.47 msec
find many prefixes 1.08 1.04 1.02 1.00 1.00 1.47 1.47 x 24.7 msec
find many infixes 1.08 1.03 1.01 1.00 1.00 1.45 1.44 x 22.6 msec
rfind chars worst case 1.20 1.00 1.17 1.17 1.01 1.01 1.15 x 1.03 msec
rfind subs worst case 1.04 1.02 1.01 1.01 1.01 1.01 1.00 x 1.44 msec
rfind many suffixes 1.09 1.04 1.02 1.01 1.00 1.00 1.00 x 24.4 msec

Previous discussion


  1. Using the regex module for reverse search ↩︎ ↩︎

  2. memrchr() is not available on macOS ↩︎

  3. expected as find tuple does more work in the worst case ↩︎

1 Like

@storchaka, could you give your opinion, please? The overhead is largely eliminated by doubling the chunk size in every iteration, making the algorithm perform much better.

I do realise this feature isn’t very exciting, but that’s mostly because there are already slower ways to achieve the same goal. I was hoping we could use it to slightly speed up and simplify path operations that are too complicated for a full C implementation. I would have also made use of it when searching for the start or end of a line with universal newline support.

If this needs a PEP because of the complexity, I’ll assist in writing it.