def find1(p):
    seps = r"\/"
    i, n = 0, len(p)
    while i < n and p[i] not in seps:
        i += 1
    if i == n:
        i = -1
    return i
def find4(p):
    seps = r"\/"
    i = p.find(tuple(seps))  # the proposed tuple-accepting find
    return i
The first approach is just ridiculous (It’s not C).
Does someone want to run my benchmark on their PC?
If you don’t want to write specialised functions for this, maybe you want to consider the possibility of dividing the search into segments of a certain size?
I.e.:
def find(s, subs):
    i, n = 0, len(s)
    while i < n:
        for sub in subs:
            any_find_slice(s, i, min(n, i + constant_partition))
            ...
        if found:
            break
        else:
            i += constant_partition
If you tune constant_partition, it is possible that the cost of the extra C function calls would be justified by the benefit of stopping early, leading to an overall better-balanced solution.
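For what it’s worth, here is a runnable sketch of that idea. Everything below is my own filling-in: `chunk` plays the role of `constant_partition`, and a per-substring `str.find` limited to the current chunk stands in for `any_find_slice`.

```python
def find_chunked(s, subs, chunk=64):
    # sketch only: scan the string chunk by chunk, stopping at the first
    # chunk that contains a match for any substring
    i, n = 0, len(s)
    while i < n:
        hi = min(n, i + chunk)
        # extend the window by len(sub) - 1 so a match that starts inside
        # the chunk but ends after it is still seen
        hits = [p for p in (s.find(sub, i, hi + len(sub) - 1) for sub in subs)
                if p != -1]
        if hits:
            return min(hits)  # earliest starting match in this chunk
        i = hi
    return -1

print(find_chunked("a" * 100 + "/b\\c", ("\\", "/")))  # 100
```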
I’m unclear why “this API is faster” is being used to justify “it needs to be in CPython”. Plenty of fast APIs are provided by third-party libraries. Performance should be completely irrelevant.
I would argue the burden this proposal needs to shoulder is explaining why this API is of sufficiently wide use, and would make code sufficiently more readable, that it will benefit from being a CPython battery. Nothing else is relevant.
Create a thread in Help (not in Ideas) if you are willing to pursue this issue. Explain the issue, and provide a Python script to conduct the benchmark rather than a bash script.
I ran it for strings of 1 million characters (rfind() should only be about 144x slower with memrchr()):
find best case - 1.42x faster - regex - 3.65x slower
2000000 loops, best of 5: 109 nsec per loop
1000000 loops, best of 5: 281 nsec per loop
2000000 loops, best of 5: 122 nsec per loop
5000000 loops, best of 5: 77 nsec per loop
2000000 loops, best of 5: 111 nsec per loop
1000000 loops, best of 5: 280 nsec per loop
20000 loops, best of 5: 16.2 usec per loop
20000 loops, best of 5: 16 usec per loop
find worst case - no difference - regex - 151x slower
5 loops, best of 5: 50.9 msec per loop
50 loops, best of 5: 4.81 msec per loop
10000 loops, best of 5: 32.1 usec per loop
10000 loops, best of 5: 31.8 usec per loop
rfind best case - 1.42x faster - regex - 8739x slower
2000000 loops, best of 5: 118 nsec per loop
500 loops, best of 5: 728 usec per loop
2000000 loops, best of 5: 152 nsec per loop
5000000 loops, best of 5: 83.3 nsec per loop
rfind mixed case - 5034x slower - regex 6161x slower
2000000 loops, best of 5: 118 nsec per loop
500 loops, best of 5: 727 usec per loop
500 loops, best of 5: 596 usec per loop
500 loops, best of 5: 594 usec per loop
rfind worst case - no difference - regex 4.66x slower
5 loops, best of 5: 51.7 msec per loop
50 loops, best of 5: 5.54 msec per loop
200 loops, best of 5: 1.19 msec per loop
200 loops, best of 5: 1.19 msec per loop
OK, re is very bad at this (mostly because it can’t search in reverse).
Speed is compared against the fastest algorithm (that doesn’t cheat).
I was suggesting that this could be the function you are looking for. It is approximately 1.2 times faster than your find1 function. Its time complexity is 𝑂(𝑛) in the worst case, where 𝑛 is the length of the string.
The time complexity of the algorithm proposed in the pull request is 𝑂(𝑛𝑘) in the worst case, where 𝑛 is the length of the string and 𝑘 is the length of the tuple.
I am curious about the speed difference for a longer seps tuple. At what tuple length does the proposed algorithm become slower than the find1 function?
def find1(p):
    seps = {'\\', '/'}  # what length?
    for i, c in enumerate(p):
        if c in seps:
            return i
    return -1
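One quick way to probe that crossover is a sketch like the one below. This is my own micro-benchmark, not the PR’s: it compares a single-pass character scan (O(n), independent of the number of separators) against one C-level `str.find` per separator (O(nk) worst case), in the worst case where no separator is present.

```python
import timeit

def find_loop(p, seps):
    # single pass over the string; cost independent of len(seps) (set lookup)
    for i, c in enumerate(p):
        if c in seps:
            return i
    return -1

def find_per_sep(p, seps):
    # one C-level str.find per separator; O(n*k) in the worst case
    hits = [i for i in (p.find(c) for c in seps) if i != -1]
    return min(hits, default=-1)

p = "a" * 10_000  # worst case: no separator present
for k in (2, 4, 8, 16):
    seps = {chr(0x2500 + j) for j in range(k)}  # k characters absent from p
    t_loop = timeit.timeit(lambda: find_loop(p, seps), number=100)
    t_find = timeit.timeit(lambda: find_per_sep(p, seps), number=100)
    print(f"k={k:2d}  loop/per-sep time ratio: {t_loop / t_find:.2f}")
```

The tuple length at which the ratio drops below 1 is the crossover point on that machine.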
Regex gets absolutely smoked (yes that’s 100 thousand times slower).
If there was a flag to find the last match…
find best case - 1.76x faster - regex - 4.13x slower
1000000 loops, best of 5: 177 nsec per loop
1000000 loops, best of 5: 277 nsec per loop
2000000 loops, best of 5: 118 nsec per loop
5000000 loops, best of 5: 67.1 nsec per loop
2000000 loops, best of 5: 177 nsec per loop
1000000 loops, best of 5: 7 nsec per loop
20000 loops, best of 5: 16 usec per loop
5000000 loops, best of 5: 67.1 nsec per loop
5 loops, best of 5: 40.9 msec per loop
50 loops, best of 5: 4.82 msec per loop
10000 loops, best of 5: 32.6 usec per loop
10000 loops, best of 5: 29.2 usec per loop
rfind best case - 1.20x faster - regex - 100,544x slower
2000000 loops, best of 5: 110 nsec per loop
50 loops, best of 5: 9.24 msec per loop
2000000 loops, best of 5: 153 nsec per loop
5000000 loops, best of 5: 91.9 nsec per loop
2000000 loops, best of 5: 110 nsec per loop
50 loops, best of 5: 9.25 msec per loop
500 loops, best of 5: 598 usec per loop
5000000 loops, best of 5: 85.1 nsec per loop
5 loops, best of 5: 49.2 msec per loop
10 loops, best of 5: 22.3 msec per loop
200 loops, best of 5: 1.19 msec per loop
200 loops, best of 5: 1.22 msec per loop
Integration. Given that str.startswith and str.endswith already accept a tuple of substrings, this extension is not something completely new, but follows the design of a couple of other string methods.
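To illustrate the precedent: the first two calls below work in current CPython, while the tuple form of `str.find` is exactly what is being proposed (and currently raises `TypeError`).

```python
p = "C:\\dir/file.txt"

# str.startswith / str.endswith already accept a tuple of alternatives
print(p.endswith((".txt", ".md")))     # True
print(p.startswith(("/", "./", "~")))  # False

# str.find currently does not accept a tuple
try:
    p.find(("\\", "/"))
except TypeError:
    print("str.find rejects tuples today")
```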
Functionality. There were two options for this extension: limit substrings to single characters (the way Go has it, as suggested by @methane: strings package - strings - Go Packages), or support multi-character strings. It so happened that there is an easy and effective way to implement multi-character string support while keeping the API (i.e. the tuple argument) in line with str.startswith. Thus, this provides complete functionality compared to a single-character implementation, where, if multi-character support became needed in the future, it would lead to additional extensions, with all the issues those bring, on top of keeping backwards compatibility with the existing incomplete addition.
Maintenance. Although there have been comments that the implementation is complex, it does not seem so to me. There were details to sort out along the way, as the bucketing implementation was a spontaneous process without any example to follow. The final result is a well-defined, simple algorithm which handles every case properly and efficiently, and makes use of existing functions for the single-substring case. So the maintenance cost is much lower than that of an algorithm newly written from scratch, such as Aho–Corasick algorithm - Wikipedia. The implementation written in Python is clear and simple:
def find_tuple(s, subs, start=None, end=None):
    CHUNK_SIZE = 10_000
    start = 0 if start is None else start
    end = len(s) if end is None else end
    result = -1
    while result == -1 and start <= end:
        cur_end = start + CHUNK_SIZE
        # `- 1` & `cur_end + len(sub)` down below
        # mean a minimum of 1 character in the current chunk
        cur_end -= 1
        for sub in subs:
            sub_end = cur_end + len(sub)
            sub_end = min(sub_end, end)
            pos = s.find(sub, start, sub_end)
            if pos != -1:
                if pos == start:
                    return start
                # a tie at `pos` would be no better;
                # `- 1` so only a strictly earlier match can update
                cur_end = pos - 1
                result = pos
        start += CHUNK_SIZE
    return result
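As a sanity check, here is a naive reference the chunked version can be compared against. This oracle is my own addition: one full `str.find` per substring, no chunking and no early stop, but the same “earliest starting index” semantics.

```python
def find_tuple_ref(s, subs, start=0, end=None):
    # naive oracle: earliest starting position over all substrings
    end = len(s) if end is None else end
    hits = [p for p in (s.find(sub, start, end) for sub in subs) if p != -1]
    return min(hits, default=-1)

print(find_tuple_ref("C:\\dir/file.txt", ("\\", "/")))    # 2
print(find_tuple_ref("no separators here", ("\\", "/")))  # -1
```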
Performance. Although performance wasn’t the main target, the optimization above ensures that this implementation is at least as fast as any other solution currently available in Python, for any case (it is possible, with some effort, to beat it by a small percentage, but unlikely in practice). This optimization ensures that the implemented functionality strongly satisfies “There should be one-- and preferably only one --obvious way to do it.”, while the status quo is that different solutions need to be used in different cases for optimal performance. The suggested implementation performs very well in practice, and more complex implementations would only provide marginal improvement at significantly greater maintenance cost.
Readability. All of the alternative solutions are less readable than the suggested implementation; there are a number of cases and examples where this extension would improve readability.
Calibration. CHUNK_SIZE was set to 10_000 because a Python call, with all of its overhead, takes about as much time as an actual scan of a 10_000-character string. Lower sizes bring no significant improvement in short-running cases, while a negative performance impact, although small (<5% for CHUNK_SIZE=1000), is observed for full scans of 1M-character strings. Larger sizes, on the other hand, hurt the performance of mixed cases (where one substring is found early but full scans are done for all the remaining ones). The selected number is a well-balanced starting value; fine-tuning is straightforward and can always be done later if the need arises.
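That calibration claim can be probed with a micro-benchmark along these lines. This is my own sketch; the absolute numbers are machine-dependent, so no particular ratio is asserted.

```python
import timeit

s = "a" * 10_000
n = 10_000
# full 10_000-character scan with no hit
t_scan = timeit.timeit(lambda: s.find("b"), number=n)
# same call restricted to 1 character: roughly pure call overhead
t_call = timeit.timeit(lambda: s.find("b", 0, 1), number=n)
print(f"scan: {t_scan / n * 1e9:.0f} ns/call, "
      f"overhead: {t_call / n * 1e9:.0f} ns/call")
```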
Additional Information. This solution provides “the first/last match with the lowest/highest STARTING index”. If there is a need, it is easy to get (at extra cost) “the first/last match with the lowest/highest ENDING index” by:
s[::-1].rfind((sub1[::-1], sub2[::-1])) # first lowest ending index
s[::-1].find((sub1[::-1], sub2[::-1])) # last highest ending index
So it also provides simple solutions for two more cases, as opposed to a design where these two were directly reversible and the reversal above would just map them onto each other.
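To make the start-vs-end distinction concrete, here are two hypothetical helpers of my own (not the reversal trick itself), applied to a case where the two answers differ:

```python
def find_lowest_start(s, subs):
    # earliest STARTING index over all substrings
    hits = [p for p in (s.find(x) for x in subs) if p != -1]
    return min(hits, default=-1)

def find_lowest_end(s, subs):
    # earliest ENDING index over all substrings
    # (for each substring, its first occurrence also ends earliest)
    hits = [s.find(x) + len(x) for x in subs if x in s]
    return min(hits, default=-1)

s = "xabc"
print(find_lowest_start(s, ("abc", "b")))  # 1 -> "abc" starts first
print(find_lowest_end(s, ("abc", "b")))    # 3 -> "b" ends first (2 + 1)
```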
To sum up:
This addition doesn’t rely too heavily on any single point from the above, but is rather balanced in the benefits it draws from each.
def find_oneliner(p):
    subs = ('\\', '/')
    i = min((i for i in map(lambda x: p.find(x), subs) if i != -1), default=-1)
    return i

print(find_oneliner("/\\"))    # 0
print(find_oneliner("0/\\"))   # 1
print(find_oneliner("01/\\"))  # 2
# min((i for i in map(lambda x: p.find(x), subs) if i != -1), default=-1)
oneliner1: Mean +- std dev: 759 ns +- 7 ns
# min([i for i in map(p.find, subs) if i != -1], default=-1)
oneliner2: Mean +- std dev: 560 ns +- 9 ns
# min([i for s in subs if (i := p.find(s)) != -1], default=-1)
oneliner3: Mean +- std dev: 482 ns +- 6 ns
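For completeness, a quick check (my addition; the one-liners are wrapped as functions so they share a signature) that the three benchmarked variants really compute the same thing:

```python
def oneliner1(p, subs):
    return min((i for i in map(lambda x: p.find(x), subs) if i != -1), default=-1)

def oneliner2(p, subs):
    return min([i for i in map(p.find, subs) if i != -1], default=-1)

def oneliner3(p, subs):
    return min([i for s in subs if (i := p.find(s)) != -1], default=-1)

for p in ("/\\", "0/\\", "01/\\", "no match"):
    results = {f(p, ('\\', '/')) for f in (oneliner1, oneliner2, oneliner3)}
    assert len(results) == 1  # all three variants agree
```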