Add tuple support to more str functions

Nineteendo · May 26, 2024, 7:59pm

I’ve generalised the alternatives. This allows for more complex tests. e.g.

find-tuple/python.exe -m timeit -s "import find_tuple, re; string = 'a' * 999_999 + 'b'; pattern = re.compile('a{999}(?:b|c)')"   "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'a' * 999_999 + 'b'; seps = 999 * 'a' + 'b', 999 * 'a' + 'c'" "find_tuple.find3(string, seps)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'a' * 999_999 + 'b'; seps = 999 * 'a' + 'b', 999 * 'a' + 'c'" "find_tuple.find4(string, seps)"

Is there a specific algorithm you want me to compare with? Or some specific arguments?
(I stopped manually calculating how much slower regex is, you can always check yourself)

Nineteendo · May 27, 2024, 8:49am

Done, we’re back on topic for a while now.

methane · May 27, 2024, 8:49am

If we add tuple support to [r]find() and [r]split(), I belive count and [r]index should have it too.
They have exactly same issue with find and split: the best algorithm needs setup cost.

Now I am +0 on this. Maybe, this issue needs PEP. Should we add “works poorly on complex input” methods to builtin type and warn it in documents? Or should we restrict only for multiple charactors, not multiple words? (e.g. count_chars, find_chars, split_by_chars, etc…)

On the other hand, removeprefix() and removesuffix() don’t have this issue.
I think adding tuple support to them doesn’t require PEP. It just fixes consistency with startswith() and endswith().

Nineteendo · May 27, 2024, 9:01am

I’ve already added support for [r]index().

[r]split() and count() don’t have an obvious implementation, unlike find(subs) which is conceptually just a loop over the substrings (with some optimisations). And that’s the answer we’ll give if someone suggests to support it there too after this is merged.

For consistency, [r]partition() won’t be supported as well. (Unless you disagree)

I think we should refer to re for more complex patterns.

That should be an easy fix. I’ll create a separate pull request when I have time.

Nineteendo · May 27, 2024, 10:46am

I’ll run a benchmark with 100 tuples soon, I’m currently trying to handle integer overflow.

elis.byberi · May 27, 2024, 11:23am

It’s not clear what problems this thread addresses. Your benchmarks seem to be based on searching for "\/" in paths, which are very small strings.

If you are considering a specialized algorithm related to paths, you should first think about its proper location. I don’t think it has anything to do with str .

If you are thinking about “why not just add tuple support,” you first have to identify real-world problems. I wouldn’t use any of the proposed implementations in this thread. They can be efficiently solved using a producer and consumer model in a distributed system, achieving significant speedup.

That’s my feedback for this thread.

If your primary usage involves working with paths, note that a chunk size of 10,000 is too large.

Nineteendo · May 27, 2024, 11:53am

Addressed in larger benchmark:

find chars best case - 2.40x faster

2000000 loops, best of 5: 176 nsec per loop
1000000 loops, best of 5: 221 nsec per loop
1000000 loops, best of 5: 279 nsec per loop
1000000 loops, best of 5: 225 nsec per loop
5000000 loops, best of 5: 73.4 nsec per loop

find chars mixed case - 1.07x slower

2000000 loops, best of 5: 176 nsec per loop
1000000 loops, best of 5: 220 nsec per loop
1000000 loops, best of 5: 277 nsec per loop
20000 loops, best of 5: 16.2 usec per loop
2000000 loops, best of 5: 188 nsec per loop

find chars worst case - 1.09x faster

5 loops, best of 5: 42.1 msec per loop
5 loops, best of 5: 53.4 msec per loop
50 loops, best of 5: 4.32 msec per loop
10000 loops, best of 5: 32.1 usec per loop
10000 loops, best of 5: 29.5 usec per loop

find subs best case - 2.96x faster

1000000 loops, best of 5: 214 nsec per loop
1000000 loops, best of 5: 314 nsec per loop
1000000 loops, best of 5: 221 nsec per loop
5000000 loops, best of 5: 72.4 nsec per loop

find subs mixed case - 33.5x slower

1000000 loops, best of 5: 221 nsec per loop
1000000 loops, best of 5: 289 nsec per loop
500 loops, best of 5: 731 usec per loop
50000 loops, best of 5: 7.41 usec per loop

find subs worst case - no difference

5 loops, best of 5: 54.6 msec per loop
50 loops, best of 5: 5.29 msec per loop
200 loops, best of 5: 1.46 msec per loop
200 loops, best of 5: 1.47 msec per loop

find many prefixes - 82.1x slower

1 loop, best of 5: 509 msec per loop
1000 loops, best of 5: 302 usec per loop
10 loops, best of 5: 36.7 msec per loop
10 loops, best of 5: 24.8 msec per loop

find many infixes - 5.20x slower

1 loop, best of 5: 508 msec per loop
50 loops, best of 5: 4.33 msec per loop
10 loops, best of 5: 33.1 msec per loop
10 loops, best of 5: 22.5 msec per loop

rfind chars best case - 1.24x faster

2000000 loops, best of 5: 112 nsec per loop
1000000 loops, best of 5: 296 nsec per loop
50 loops, best of 5: 8.54 msec per loop
1000000 loops, best of 5: 229 nsec per loop
5000000 loops, best of 5: 90 nsec per loop

rfind chars mixed case - 54.9x slower

2000000 loops, best of 5: 112 nsec per loop
1000000 loops, best of 5: 304 nsec per loop
50 loops, best of 5: 8.52 msec per loop
500 loops, best of 5: 597 usec per loop
50000 loops, best of 5: 6.15 usec per loop

rfind chars worst case - 1.03x slower

5 loops, best of 5: 51.6 msec per loop
5 loops, best of 5: 56.2 msec per loop
10 loops, best of 5: 22.7 msec per loop
200 loops, best of 5: 1.19 msec per loop
200 loops, best of 5: 1.22 msec per loop

rfind subs best case - 2.32x faster

1000000 loops, best of 5: 358 nsec per loop
50 loops, best of 5: 8.53 msec per loop
1000000 loops, best of 5: 230 nsec per loop
5000000 loops, best of 5: 99.2 nsec per loop

rfind subs mixed case - 20.4x slower

1000000 loops, best of 5: 363 nsec per loop
50 loops, best of 5: 8.53 msec per loop
500 loops, best of 5: 724 usec per loop
50000 loops, best of 5: 7.4 usec per loop

rfind subs worst case - no difference

5 loops, best of 5: 52.3 msec per loop
10 loops, best of 5: 20.9 msec per loop
200 loops, best of 5: 1.45 msec per loop
200 loops, best of 5: 1.45 msec per loop

rfind many suffixes - 1.06x slower

1 loop, best of 5: 508 msec per loop
1 loop, best of 5: 217 msec per loop
10 loops, best of 5: 24.6 msec per loop
10 loops, best of 5: 26.1 msec per loop

rfind many infixes - no difference

1 loop, best of 5: 511 msec per loop
1 loop, best of 5: 218 msec per loop
10 loops, best of 5: 22.5 msec per loop
10 loops, best of 5: 22.4 msec per loop

That’s a bit overkill, I would rather have find_chars() than that, as it can be used by more people.

If you’re talking about the alternatives, I wouldn’t use any of them either. How do you think to speed this up? Without an implementation this claim is meaningless.

Smaller chunks don’t speed up the performance that much for early matches. There’s too much overhead for that.

Nineteendo · May 27, 2024, 12:25pm

Should we add a note about re to [r]find() and [r]index()? If so, what should be in it?

Nineteendo · May 27, 2024, 1:14pm

Hmm, upon closer inspection it’s indeed too large for the mixed case. I’ve lowered it to 1000.
That does make the worst case a bit slower, but I think that’s acceptable.

The only cases where it’s significantly slower are covered by the re module.

dg-pb · May 27, 2024, 1:19pm

If it so happens and this gets implemented. If I see you using this, do you promise to remove it from your code on my request?

dg-pb · June 28, 2024, 4:56pm

If anyone is interested in this the work on finite-set-string-search this is still happening in gh-118184: Support tuples for `find`, `index`, `rfind` & `rindex` by nineteendo · Pull Request #2 · nineteendo/cpython · GitHub

Nineteendo · September 1, 2024, 3:24pm

Sorry for bumping this. Below is the summary from that pull request. Right now I want to know just one thing: can I create a pull request, or should I just delete my branch? I don’t think any further changes I make will change someone’s opinion.

Motivation

For finding multiple substrings, there’s currently no single algorithm that outperforms others in most cases. Their performance varies significantly between the best-case and worst-case, making it difficult to choose one:

algorithm	loop in	loop startswith	re^[1]	find str	unit
find chars best case	1.00	1.19	1.56	1.37	x 180 nsec
find chars mixed case	1.00	1.22	1.54	91.01	x 178 nsec
find chars worst case	1262.20	1597.56	131.71	1.00	x 32.80 usec
find subs best case		1.00	1.33	1.17	x 212 nsec
find subs mixed case		1.00	1.30	3327.27	x 220 nsec
find subs worst case		35.82	3.62	1.00	x 1.46 msec
find many prefixes		1760.80	1.00	122.26	x 301.00 usec
find many infixes		122.17	1.00	7.71	x 4.33 msec
rfind chars best case	1.00	2.61	4.34	1.96	x 114 nsec
rfind chars mixed case	1.00	2.66	4.34	4561.40	x 114 nsec
rfind chars worst case	50.10	55.25	6.94	1.00	x 1.01 msec
rfind subs best case		1.58	2.55	1.00	x 229 nsec
rfind subs mixed case		1.00	1.59	1921.05	x 380 nsec
rfind subs worst case		38.07	4.97	1.00	x 1.45 msec
rfind many suffixes		1094.46	1.00	50.51	x 487.00 usec
rfind many infixes		54.70	1.00	2.41	x 9.69 msec

That’s why I’m suggesting a dynamic algorithm that doesn’t suffer from these problems:

algorithm	loop in	loop startswith	re^[1:1]	find str	find tuple	unit
find chars best case	2.67	3.19	4.15	3.64	1.00	x 68 nsec
find chars mixed case	2.29	2.80	3.53	208.23	1.00	x 78 nsec
find chars worst case	1489.21	1884.89	155.40	1.18	1.00	x 27.80 usec
find subs best case		3.07	4.07	3.59	1.00	x 69 nsec
find subs mixed case		2.32	3.02	7713.38	1.00	x 95 nsec
find subs worst case		35.82	3.62	1.00	1.01	x 1.46 msec
find many prefixes		1760.80	1.00	122.26	82.06	x 301.00 usec
find many infixes		122.17	1.00	7.71	5.22	x 4.33 msec
rfind chars best case	1.33	3.45	5.76	2.59	1.00	x 86 nsec
rfind chars mixed case	1.08	2.86	4.67	4905.66	1.00	x 106 nsec
rfind chars worst case	50.10	55.25	6.94	1.00	1.10^[2]	x 1.01 msec
rfind subs best case		3.70	5.98	2.34	1.00	x 98 nsec
rfind subs mixed case		3.42	5.43	6576.58	1.00	x 111 nsec
rfind subs worst case		38.07	4.97	1.00	1.04^[3]	x 1.45 msec
rfind many suffixes		1094.46	1.00	50.51	50.31	x 487.00 usec
rfind many infixes		54.70	1.00	2.41	2.39	x 9.69 msec

algorithms

# find_tuple.py
def find0(string, chars):
    for i, char in enumerate(string):
        if char in chars:
            break
    else:
        i = -1
    return i

def find1(string, subs):
    for i in range(len(string)):
        if string.startswith(subs, i):
            break
    else:
        i = -1
    return i

def find2(string, pattern):
    match = pattern.search(string)
    i = match.start() if match else -1
    return i

def find3(string, subs):
    i = -1
    for sub in subs:
        new_i = string.find(sub, 0, None if i == -1 else i + len(sub))
        if new_i != -1:
            i = new_i
    return i

def find4(string, subs):
    i = string.find(subs)
    return i

def rfind0(string, chars):
    i = len(string) - 1
    while i >= 0 and string[i] not in chars:
        i -= 1
    return i

def rfind1(string, subs):
    for i in range(len(string), -1, -1):
        if string.startswith(subs, i):
            break
    else:
        i = -1
    return i

rfind2 = find2

def rfind3(string, subs):
    i = -1
    for sub in subs:
        new_i = string.rfind(sub, 0 if i == -1 else i)
        if new_i != -1:
            i = new_i
    return i

def rfind4(string, subs):
    i = string.rfind(subs)
    return i

benchmark script

# find_tuple.sh
echo find chars best case
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'ab' + '_' * 999_998; chars   = 'ab'"               "find_tuple.find0(string, chars)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'ab' + '_' * 999_998; subs    = tuple('ab')"        "find_tuple.find1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, re; string = 'ab' + '_' * 999_998; pattern = re.compile('[ab]')" "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'ab' + '_' * 999_998; subs    = 'ab'"               "find_tuple.find3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'ab' + '_' * 999_998; subs    = tuple('ab')"        "find_tuple.find4(string, subs)"
echo find chars mixed case
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'b' + '_' * 999_999; chars   = 'ab'"               "find_tuple.find0(string, chars)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'b' + '_' * 999_999; subs    = tuple('ab')"        "find_tuple.find1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, re; string = 'b' + '_' * 999_999; pattern = re.compile('[ab]')" "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'b' + '_' * 999_999; subs    = 'ab'"               "find_tuple.find3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'b' + '_' * 999_999; subs    = tuple('ab')"        "find_tuple.find4(string, subs)"
echo find chars worst case
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; chars   = 'ab'"               "find_tuple.find0(string, chars)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = tuple('ab')"        "find_tuple.find1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, re; string = '_' * 1_000_000; pattern = re.compile('[ab]')" "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = 'ab'"               "find_tuple.find3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = tuple('ab')"        "find_tuple.find4(string, subs)"
echo find subs best case
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'abcd' + '_' * 999_996; subs    = 'ab', 'cd'"          "find_tuple.find1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, re; string = 'abcd' + '_' * 999_996; pattern = re.compile('ab|cd')" "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'abcd' + '_' * 999_996; subs    = 'ab', 'cd'"          "find_tuple.find3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'abcd' + '_' * 999_996; subs    = 'ab', 'cd'"          "find_tuple.find4(string, subs)"
echo find subs mixed case
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'cd' + '_' * 999_998; subs    = 'ab', 'cd'"          "find_tuple.find1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, re; string = 'cd' + '_' * 999_998; pattern = re.compile('ab|cd')" "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'cd' + '_' * 999_998; subs    = 'ab', 'cd'"          "find_tuple.find3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = 'cd' + '_' * 999_998; subs    = 'ab', 'cd'"          "find_tuple.find4(string, subs)"
echo find subs worst case
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = 'ab', 'cd'"          "find_tuple.find1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, re; string = '_' * 1_000_000; pattern = re.compile('ab|cd')" "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = 'ab', 'cd'"          "find_tuple.find3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = 'ab', 'cd'"          "find_tuple.find4(string, subs)"
echo find many prefixes
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = tuple(f'prefix{i}' for i in range(100))"                "find_tuple.find1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, re; string = '_' * 1_000_000; pattern = re.compile('|'.join(f'prefix{i}' for i in range(100)))" "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = tuple(f'prefix{i}' for i in range(100))"                "find_tuple.find3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = tuple(f'prefix{i}' for i in range(100))"                "find_tuple.find4(string, subs)"
echo find many infixes
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = tuple(f'{i}infix{i}' for i in range(100))"                "find_tuple.find1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, re; string = '_' * 1_000_000; pattern = re.compile('|'.join(f'{i}infix{i}' for i in range(100)))" "find_tuple.find2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = tuple(f'{i}infix{i}' for i in range(100))"                "find_tuple.find3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;     string = '_' * 1_000_000; subs    = tuple(f'{i}infix{i}' for i in range(100))"                "find_tuple.find4(string, subs)"

echo ---

echo rfind chars best case
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_998 + 'ba'; chars   = 'ab'"                      "find_tuple.rfind0(string, chars)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_998 + 'ba'; subs    = tuple('ab')"               "find_tuple.rfind1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, regex; string = '_' * 999_998 + 'ba'; pattern = regex.compile('(?r)[ab]')" "find_tuple.rfind2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_998 + 'ba'; subs    = 'ab'"                      "find_tuple.rfind3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_998 + 'ba'; subs    = tuple('ab')"               "find_tuple.rfind4(string, subs)"
echo rfind chars mixed case
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_999 + 'b'; chars   = 'ab'"                      "find_tuple.rfind0(string, chars)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_999 + 'b'; subs    = tuple('ab')"               "find_tuple.rfind1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, regex; string = '_' * 999_999 + 'b'; pattern = regex.compile('(?r)[ab]')" "find_tuple.rfind2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_999 + 'b'; subs    = 'ab'"                      "find_tuple.rfind3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_999 + 'b'; subs    = tuple('ab')"               "find_tuple.rfind4(string, subs)"
echo rfind chars worst case
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; chars   = 'ab'"                      "find_tuple.rfind0(string, chars)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = tuple('ab')"               "find_tuple.rfind1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, regex; string = '_' * 1_000_000; pattern = regex.compile('(?r)[ab]')" "find_tuple.rfind2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = 'ab'"                      "find_tuple.rfind3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = tuple('ab')"               "find_tuple.rfind4(string, subs)"
echo rfind subs best case
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_996 + 'cdab'; subs    = 'ab', 'cd'"                 "find_tuple.rfind1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, regex; string = '_' * 999_996 + 'cdab'; pattern = regex.compile('(?r)ab|cd')" "find_tuple.rfind2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_996 + 'cdab'; subs    = 'ab', 'cd'"                 "find_tuple.rfind3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_996 + 'cdab'; subs    = 'ab', 'cd'"                 "find_tuple.rfind4(string, subs)"
echo rfind subs mixed case
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_998 + 'cd'; subs    = 'ab', 'cd'"                 "find_tuple.rfind1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, regex; string = '_' * 999_998 + 'cd'; pattern = regex.compile('(?r)ab|cd')" "find_tuple.rfind2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_998 + 'cd'; subs    = 'ab', 'cd'"                 "find_tuple.rfind3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 999_998 + 'cd'; subs    = 'ab', 'cd'"                 "find_tuple.rfind4(string, subs)"
echo rfind subs worst case
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = 'ab', 'cd'"                 "find_tuple.rfind1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, regex; string = '_' * 1_000_000; pattern = regex.compile('(?r)ab|cd')" "find_tuple.rfind2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = 'ab', 'cd'"                 "find_tuple.rfind3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = 'ab', 'cd'"                 "find_tuple.rfind4(string, subs)"
echo rfind many suffixes
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = tuple(f'{i}suffix' for i in range(100))"                            "find_tuple.rfind1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, regex; string = '_' * 1_000_000; pattern = regex.compile(f'(?r){'|'.join(f'{i}suffix' for i in range(100))}')" "find_tuple.rfind2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = tuple(f'{i}suffix' for i in range(100))"                            "find_tuple.rfind3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = tuple(f'{i}suffix' for i in range(100))"                            "find_tuple.rfind4(string, subs)"
echo rfind many infixes
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = tuple(f'{i}infix{i}' for i in range(100))"                            "find_tuple.rfind1(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple, regex; string = '_' * 1_000_000; pattern = regex.compile(f'(?r){'|'.join(f'{i}infix{i}' for i in range(100))}')" "find_tuple.rfind2(string, pattern)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = tuple(f'{i}infix{i}' for i in range(100))"                            "find_tuple.rfind3(string, subs)"
find-tuple/python.exe -m timeit -s "import find_tuple;        string = '_' * 1_000_000; subs    = tuple(f'{i}infix{i}' for i in range(100))"                            "find_tuple.rfind4(string, subs)"

Examples

>>> "0123456789".find(("a", "b", "c"))
-1
>>> "0123456789".find(("0", "1", "2"))
0
>>> "0123456789".rfind(("7", "8", "9"))
9

Use cases

While I haven’t found a lot of use cases, this new addition would improve the readability and performance for all of them:

github.com/python/cpython

Lib/ntpath.py

73906d5c9


      
          i = len(p)
          while i and p[i-1] not in seps:
              i -= 1

github.com/python/cpython

Lib/ntpath.py

f90ff0367


      
          i, n = 1, len(path)
          while i < n and path[i] not in seps:
              i += 1

github.com/python/cpython

Lib/genericpath.py

0d42ac947


      
          sepIndex = p.rfind(sep)
          if altsep:
              altsepIndex = p.rfind(altsep)
              sepIndex = max(sepIndex, altsepIndex)

>2K files with /max\(\w+\.rfind/

Python implementation

The implementation written in Python is clear and simple (in C overflow and exceptions need to be handled manually):

MIN_CHUNK_SIZE = 32
MAX_CHUNK_SIZE = 16384
EXP_CHUNK_SIZE = 2

def find_tuple(string, subs, start=0, end=None):
    end = len(string) if end is None else end
    result = -1
    chunk_size = MIN_CHUNK_SIZE
    chunk_start = start
    while True:
        chunk_end = min(chunk_start + chunk_size, end)
        if chunk_end < end:
            chunk_end -= 1
        for sub in subs:
            sub_end = min(chunk_end + len(sub), end)
            new_result = string.find(sub, chunk_start, sub_end)
            if new_result != -1:
                if new_result == chunk_start:
                    return new_result
                chunk_end = new_result - 1 # Only allow earlier match
                result = new_result
        if result != -1 or chunk_end >= end:
            return result # Found match or searched entire range
        chunk_start = chunk_end + 1
        chunk_size = min(chunk_size * EXP_CHUNK_SIZE, MAX_CHUNK_SIZE)

def rfind_tuple(string, subs, start=0, end=None):
    end = len(string) if end is None else end
    result = -1
    chunk_size = MIN_CHUNK_SIZE
    chunk_end = end
    while True:
        chunk_start = max(start, chunk_end - chunk_size)
        if chunk_start > start:
            chunk_start += 1
        for sub in subs:
            sub_end = min(chunk_end + len(sub), end)
            new_result = string.rfind(sub, chunk_start, sub_end)
            if new_result != -1:
                if new_result == chunk_end:
                    return new_result
                chunk_start = new_result + 1 # Only allow later match
                result = new_result
        if result != -1 or chunk_start <= start:
            return result # Found match or searched entire range
        chunk_end = chunk_start - 1
        chunk_size = min(chunk_size * EXP_CHUNK_SIZE, MAX_CHUNK_SIZE)

Explanation

The search is split up in chunks, overlapping by the length of a substring. After the first match, we search for the next substring in the part before (or after for reverse search). Each chunk will be twice as large as the previous one, but capped at 16384. The dynamic size ensures good best- and worst-case performance.

C call hierarchy

find_sub() and find_subs() are called based on the argument type using an inline function. find_sub() is called for tuples of length 1, chunk_find_sub() is called for tuples with more than 1 element:

Calibration

MIN_CHUNK_SIZE and MAX_CHUNK_SIZE were calibrated on this benchmark:

32 was the highest mimimum size beating out all other algorithms in the best-case, setting it any lower would hurt performance for substrings after the first chunk.
16384 was the highest maximum size with a measurable improvement in the worst case, setting it any higher would only hurt performance in the average case.

MIN_CHUNK_SIZE	4	8	16	32	64	128	256	512	1024	unit
find chars best case	1.01	1.00	1.07	1.09	1.06	1.10	1.06	1.10	1.10	x 66.7 nsec
find chars mixed case	1.00	1.01	1.10	1.14	1.10	1.17	1.09	1.23	1.24	x 75.3 nsec
find subs best case	1.00	1.01	1.02	1.05	1.00	1.04	1.01	1.01	1.01	x 70.9 nsec
find subs mixed case	1.00	1.01	1.06	1.22	1.39	2.22	3.31	5.51	10.0	x 84.1 nsec
rfind chars best case	1.02	1.01	1.02	1.00	1.01	1.00	1.00	1.03	1.02	x 92.9 nsec
rfind chars mixed case	1.00	1.01	1.06	1.12	1.29	1.68	2.34	3.69	6.12	x 98.8 nsec
rfind subs best case	1.00	1.01	1.02	1.01	1.03	1.00	1.00	1.02	1.02	x 96.9 nsec
rfind subs mixed case	1.00	1.00	1.06	1.08	1.29	1.89	2.75	4.54	8.05	x 106 nsec

MAX_CHUNK_SIZE	1024	2048	4096	8192	16384	32768	65536	unit
find chars worst case	1.63	1.27	1.11	1.09	1.01	1.00	1.05	x 27.7 usec
find subs worst case	1.03	1.01	1.00	1.00	1.00	1.03	1.00	x 1.47 msec
find many prefixes	1.08	1.04	1.02	1.00	1.00	1.47	1.47	x 24.7 msec
find many infixes	1.08	1.03	1.01	1.00	1.00	1.45	1.44	x 22.6 msec
rfind chars worst case	1.20	1.00	1.17	1.17	1.01	1.01	1.15	x 1.03 msec
rfind subs worst case	1.04	1.02	1.01	1.01	1.01	1.01	1.00	x 1.44 msec
rfind many suffixes	1.09	1.04	1.02	1.01	1.00	1.00	1.00	x 24.4 msec

Previous discussion

Using the regex module for reverse search ↩︎ ↩︎
memrchr() is not available on macOS ↩︎
expected as find tuple does more work in the worst case ↩︎

Nineteendo · September 5, 2024, 12:00pm

@storchaka, could you give your opinion, please? The overhead is largely eliminated by doubling the chunk size in every iteration, making the algorithm perform much better.

I do realise this feature isn’t very exciting, but that’s mostly because there are already slower ways to achieve the same goal. I was hoping we could use it to slightly speed up and simplify path operations that are too complicated for a full C implementation. I would have also made use of it when searching for the start or end of a line with universal newline support.

If this needs a PEP because of the complexity, I’ll assist in writing it.

dg-pb · October 8, 2024, 4:47pm

A comment.

a) Algorithm

Majority of common cases will be much faster than re.

A useful example:

text = 'a' * 50_000 + '\n' + 'b' * 50_000 + '\r\n'
%timeit re.findall(r'(\r\n|\n)', text)    # 629 μs
# VS
%timeit text.find('\r\n')    #  93 μs
%timeit text.find('\n')      # 853 ns
%timeit ['\n', '\r\n']       #  60 ns

# So using this approach in total it would be:
#     93 + 1 x 2 + 0 = 95 µs
# So approximately 6 x faster.

This way, there will be a large number of cases that can be done more efficiently and without re. From what I have seen a very large majority of cases is faster than re for number of search items up to 20.

b) Implementation.

Although the algorithm does the job fairly well, the current implementation should ideally be improved.

IMO, it should be pushed down to fastsearch.c, where algorithms of this sort belong. Also, this way, if the time comes when someone wants to improve on it (e.g. implement Aho Corasick), the infrastructure will allow to do it inplace.

Also, pushing it to fastsearch.c would result in further performance improvements as https://github.com/python/cpython/pull/120025 factors out preparation, while current approach needs to do both: preparation & search for every chunk. Also, there might be other possible optimizations by making use of other methods in fastsearch.c.

To sum up.

IMO, this is a worthwhile endeavour.

https://github.com/python/cpython/pull/120025 should be merged first and this then could be implemented in a way that adheres to the current infrastructure. It can be done now, but the issue is that there would be a non-negligible difference between how implementation would look now and after gh-119702.

Nineteendo · October 8, 2024, 5:38pm

I think we should wait with pushing it down until there’s consensus as I’ve already spend way too much time on this single feature (I might even be able to write a paper about it).

I also don’t think people’s opinion on this feature will change a lot when your PR is merged, so I’m not sure if it’s worth waiting for it.

Maybe we should do a poll, there’s honestly not a lot left to say here:

+1, I like this
+/-0, I’m neutral
-1, I don’t like this

0 voters

If you’re new here, you don’t need to read the entire thread, my summary above should suffice.

bwoodsend · October 8, 2024, 7:40pm

Are we voting for just tuple support or specifically putting the implementation in fastsearch.c?

Nineteendo · October 8, 2024, 8:10pm

We’re voting for tuple support for str.[r]find() & str.[r]index() just like str.startswith() and str.endswith() (other functions can wait). Sorry that I didn’t make that clear. The implementation is here for who’s interested: https://github.com/nineteendo/cpython/pull/2.

dg-pb · October 8, 2024, 9:39pm

So this poll is for implementing it as it is?

Then my personal vote is conditional on it being implemented in fastsearch.c along the lines of Support tuples for `[r]find()` & `[r]index()` by nineteendo · Pull Request #2 · nineteendo/cpython · GitHub

Nineteendo · October 8, 2024, 9:54pm

I need to gain traction before I can make a PR. Then I’ll focus on polishing. From @erlendaasland:

I suggest you continue your experiments on your own fork. If you manage to get an improved version up and running, try first to gain traction for the feature on Discord, before you create a new issue/PR.

(Assuming he meant Discourse and not Discord).

ZeroIntensity · October 8, 2024, 10:57pm

In that case, I’m unfortunately -1 for this.

This may seem blunt, but tuple support doesn’t really seem useful for find and index (though, I do kind of like it for split, but that’s not what we’re voting on) – if you’re searching for a substring, you generally want to know what that substring is. By passing multiple things to search for, you have to index the string to differentiate what you found. Also, index is available on lists and tuples as well, so this adding this for strings actually creates more inconsistency.

Improving performance is an interesting goal here, but one I think we should be very cautious about. In my experience, optimizations are a leading cause for regressions. Note that even if tests are all passing, these methods are especially hot, so people are very likely doing all sorts of crazy things with them that we’re not testing for. Though, as far as I can see, this has only been benchmarked on some hyper-specific cases with timeit – what’s the effect when running on the full pyperformance benchmarks?