Add tuple support to more str functions

dg-pb · May 26, 2024, 2:35pm

gh-118184: Support tuples for `find`, `index`, `rfind` & `rindex`

python:main ← nineteendo:find-tuple

opened 11:04AM - 24 May 24 UTC

From [@pfmoore on Discourse](https://discuss.python.org/t/add-tuple-support-to-m…ore-str-functions/50628/24):

> One other option is for someone just to submit one or more PRs implementing the proposed feature(s). The PRs will either get accepted or rejected, and then you have your answer. The lack of response might just be because there’s not a lot that’s interesting to say.
>
> I don’t personally think this is worth the effort to implement, and I’m not convinced I’d find it very useful in practice. But I also don’t think it’s such a big deal that it needs a big debate, or community consensus, or a PEP. So if you want to put in the effort, just go for it.

### Benchmark for 1,000,000 characters

<details><summary>script</summary>

```python
# find_tuple.py
import re

PATTERN = re.compile(r"[\\/]")
RPATTERN = re.compile(r"^[\s\S]*(?P<sub>[\\/])")
SEPS = frozenset(r'\/')

def find1(p):
    for i, c in enumerate(p):
        if c in SEPS:
            break
    else:
        i = -1
    return i

def find2(p):
    match = PATTERN.search(p)
    i = match.start() if match else -1
    return i

def find3(p):
    sep = "\\"
    altsep = "/"
    i = p.find(sep)
    new_i = p.find(altsep)
    if new_i != -1 and (new_i < i or i == -1):
        i = new_i
    return i

def find4(p):
    seps = ("\\", "/")
    i = p.find(seps)
    return i

def rfind1(p):
    i = len(p) - 1
    while i >= 0 and p[i] not in SEPS:
        i -= 1
    return i

def rfind2(p):
    match = RPATTERN.search(p)
    i = match.start('sub') if match else -1
    return i

def rfind3(p):
    sep = "\\"
    altsep = "/"
    i = max(p.rfind(sep), p.rfind(altsep))
    return i

def rfind4(p):
    seps = ("\\", "/")
    i = p.rfind(seps)
    return i
```

```sh
# find_tuple.sh
echo find best case
find-tuple/python.exe -m timeit -s "import find_tuple; string = r'\\/' + 'a' * 999_998" "find_tuple.find1(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = r'\\/' + 'a' * 999_998" "find_tuple.find2(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = r'\\/' + 'a' * 999_998" "find_tuple.find3(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = r'\\/' + 'a' * 999_998" "find_tuple.find4(string)"
echo find mixed case
find-tuple/python.exe -m timeit -s "import find_tuple; string = '/' + 'a' * 999_999" "find_tuple.find1(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = '/' + 'a' * 999_999" "find_tuple.find2(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = '/' + 'a' * 999_999" "find_tuple.find3(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = '/' + 'a' * 999_999" "find_tuple.find4(string)"
echo find worst case
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 1_000_000" "find_tuple.find1(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 1_000_000" "find_tuple.find2(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 1_000_000" "find_tuple.find3(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 1_000_000" "find_tuple.find4(string)"
echo rfind best case
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 999_998 + '/\\\\'" "find_tuple.rfind1(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 999_998 + '/\\\\'" "find_tuple.rfind2(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 999_998 + '/\\\\'" "find_tuple.rfind3(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 999_998 + '/\\\\'" "find_tuple.rfind4(string)"
echo rfind mixed case
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 999_999 + '/'" "find_tuple.rfind1(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 999_999 + '/'" "find_tuple.rfind2(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 999_999 + '/'" "find_tuple.rfind3(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 999_999 + '/'" "find_tuple.rfind4(string)"
echo rfind worst case
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 1_000_000" "find_tuple.rfind1(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 1_000_000" "find_tuple.rfind2(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 1_000_000" "find_tuple.rfind3(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 1_000_000" "find_tuple.rfind4(string)"
```

</details>

<details><summary>find best case - 1.87x faster - regex - 3.94x slower</summary>

```none
2000000 loops, best of 5: 182 nsec per loop
1000000 loops, best of 5: 283 nsec per loop
2000000 loops, best of 5: 134 nsec per loop
5000000 loops, best of 5: 71.8 nsec per loop
```

</details>

<details><summary>find mixed case - no difference - regex - 1.51x slower</summary>

```none
1000000 loops, best of 5: 189 nsec per loop
1000000 loops, best of 5: 285 nsec per loop
20000 loops, best of 5: 16.7 usec per loop
2000000 loops, best of 5: 190 nsec per loop
```

</details>

<details><summary>find worst case - 1.07x faster - regex - 161x slower</summary>

```none
5 loops, best of 5: 40.4 msec per loop
50 loops, best of 5: 4.84 msec per loop
10000 loops, best of 5: 32.2 usec per loop
10000 loops, best of 5: 30 usec per loop
```

</details>

<details><summary>rfind best case - 1.11x faster - regex - 93,366x slower</summary>

```none
2000000 loops, best of 5: 112 nsec per loop
50 loops, best of 5: 9.43 msec per loop
2000000 loops, best of 5: 143 nsec per loop
2000000 loops, best of 5: 101 nsec per loop
```

</details>

<details><summary>rfind mixed case - 55.7x slower - regex - 84,505x slower</summary>

```none
2000000 loops, best of 5: 111 nsec per loop
50 loops, best of 5: 9.38 msec per loop
500 loops, best of 5: 597 usec per loop
50000 loops, best of 5: 6.18 usec per loop # memrchr() isn't available on macOS
```

</details>

<details><summary>rfind worst case - 1.03x slower - regex - 18.9x slower</summary>

```none
5 loops, best of 5: 48.5 msec per loop
10 loops, best of 5: 22.5 msec per loop
200 loops, best of 5: 1.19 msec per loop
200 loops, best of 5: 1.22 msec per loop
```

</details>

* Issue: gh-118184

----

📚 Documentation preview 📚: https://cpython-previews--119501.org.readthedocs.build/

Summary of what has been done and the current case for this change:

Integration. str.startswith and str.endswith already accept a tuple of substrings, so this extension is not something completely new; it follows the design of a couple of existing string methods.
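For reference, the existing tuple behaviour of these two methods looks like this. It is only a minimal illustration, and the file name is made up:

```python
# str.startswith and str.endswith accept a tuple of candidates and
# return True if any one of them matches.
name = "archive.tar.gz"  # hypothetical file name, for illustration only

has_known_suffix = name.endswith((".zip", ".gz"))
has_known_prefix = name.startswith(("backup", "archive"))

print(has_known_suffix, has_known_prefix)
```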
Functionality. There were 2 options for this extension: limit substrings to single characters (the way Go has it in its strings package, as suggested by @methane) or support multi-character strings. It so happened that there is an easy and effective way to implement multi-character support while keeping the API (i.e. a tuple argument) in line with str.startswith. This provides complete functionality from the start. With a single-character implementation, any later need for multi-character support would require further extensions, with all the issues that come with them, on top of keeping backwards compatibility with the existing, incomplete addition.
Maintenance. Although there have been comments that the implementation is complex, it does not seem so to me. There were details to sort out along the way, as the bucketing implementation was developed spontaneously without any example to follow. The final result is a well-defined, simple algorithm which handles every case properly and efficiently and reuses the existing functions for the single-substring case. So the maintenance cost is much lower than for an algorithm newly written from scratch, such as Aho–Corasick. The implementation, written in Python, is clear and simple:

def find_tuple(s, subs, start=None, end=None):
    CHUNK_SIZE = 10_000
    start = 0 if start is None else start
    end = len(s) if end is None else end
    result = -1
    while result == -1 and start <= end:
        cur_end = start + CHUNK_SIZE
        # `cur_end` is the last index at which a match may start:
        # the `- 1` here and `cur_end + len(sub)` below keep at
        # least 1 candidate position in the current chunk
        cur_end -= 1
        for sub in subs:
            sub_end = cur_end + len(sub)
            sub_end = min(sub_end, end)
            pos = s.find(sub, start, sub_end)
            if pos != -1:
                if pos == start:
                    return start
                # a later sub matching at the same `pos` would be no
                # better, so `- 1` only lets it match strictly earlier
                cur_end = pos - 1
                result = pos
        start += CHUNK_SIZE
    return result
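One way to gain some confidence in the chunked search is to compare it against a naive reference on random inputs. The sketch below restates the function above with the chunk size as a parameter so that chunk boundaries get exercised on short strings. That parameter and the helper names `find_naive`, `random_text` and `cross_check` are my own assumptions for testing and are not part of the PR.

```python
import random

def find_tuple(s, subs, start=None, end=None, chunk_size=10_000):
    """Chunked search as above; `chunk_size` is a parameter only for testing."""
    start = 0 if start is None else start
    end = len(s) if end is None else end
    result = -1
    while result == -1 and start <= end:
        cur_end = start + chunk_size - 1  # last index where a match may start
        for sub in subs:
            pos = s.find(sub, start, min(cur_end + len(sub), end))
            if pos != -1:
                if pos == start:
                    return start
                cur_end = pos - 1  # later subs must match strictly earlier
                result = pos
        start += chunk_size
    return result

def find_naive(s, subs, start, end):
    """Reference: one full find per substring, then the smallest hit."""
    hits = [p for p in (s.find(sub, start, end) for sub in subs) if p != -1]
    return min(hits, default=-1)

def random_text(rng, low, high, alphabet="ab/\\"):
    """Random string over a small alphabet so that matches are frequent."""
    return "".join(rng.choice(alphabet) for _ in range(rng.randint(low, high)))

def cross_check(trials=2000, seed=0):
    """Return True if both functions agree on every random trial."""
    rng = random.Random(seed)
    for _ in range(trials):
        s = random_text(rng, 0, 40)
        subs = tuple(random_text(rng, 1, 3) for _ in range(rng.randint(1, 3)))
        start = rng.randint(0, len(s))
        end = rng.randint(0, len(s))
        chunk = rng.randint(1, 8)
        if find_tuple(s, subs, start, end, chunk) != find_naive(s, subs, start, end):
            return False
    return True

print(cross_check())
```

The check only covers non-negative `start` and `end` and non-empty substrings; negative indices are not normalised in the Python version above.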

Performance. Although performance wasn’t the main target, the optimization above ensures that this implementation is at least as fast as any other solution currently available in Python, in every case (with some effort it is possible to beat it by a small percentage, but that is unlikely in practice). It also means that the implemented functionality strongly satisfies "There should be one-- and preferably only one --obvious way to do it.", whereas the status quo is that different solutions have to be used for different cases to get optimal performance. The suggested implementation performs very well in practice, and more complex implementations would only provide marginal improvement at significantly greater maintenance cost.
Readability. All of the alternative solutions are less readable than the suggested implementation. Cases and examples where this extension would improve readability:
- CPython cases
- GitHub: max(*.rfind(...
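To make the readability point concrete, here is a sketch of the status quo idioms. The function names and the default separators are my own choices for illustration. Under the proposal these would be spelled `path.rfind(seps)` and `path.find(seps)`, which as far as I know released versions of Python do not accept.

```python
def last_separator(path, seps=("\\", "/")):
    """Status quo idiom: one rfind per separator, combined with max()."""
    # -1 (not found) is smaller than any valid index, so max() works directly.
    return max(path.rfind(sep) for sep in seps)

def first_separator(path, seps=("\\", "/")):
    """The forward direction needs more care: -1 would win a plain min()."""
    hits = [i for i in (path.find(sep) for sep in seps) if i != -1]
    return min(hits, default=-1)

print(last_separator("a/b\\c"), first_separator("a/b\\c"))
```

Note the asymmetry: the rfind case fits in one expression, while the find case has to filter out the -1 results first.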
Calibration. CHUNK_SIZE was set to 10_000 because a Python call, with all of its overhead, takes about as much time as the actual scan of a string of length 10_000. Lower values bring no significant improvement in short-running cases, while a negative performance impact, although small (<5% for CHUNK_SIZE=1000), is observed for full scans of strings of length 1M. Larger values negatively impact the performance of mixed cases (where one substring is found early but full scans are done for all the remaining ones). The selected number is a well-balanced starting value. Fine-tuning is straightforward and can always be done later if the need arises.
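A rough way to look at this trade-off on your own machine is to time a failing str.find over strings of different lengths. This is only a sketch: it measures a Python-level call through a lambda rather than the C implementation in the PR, the helper name `per_call_seconds` is made up, and the numbers will vary by platform.

```python
import timeit

def per_call_seconds(length, number=20_000):
    """Average time of one failing str.find over a string of `length` chars."""
    s = "a" * length
    return timeit.timeit(lambda: s.find("/"), number=number) / number

# Length 0 approximates pure call overhead; larger lengths add scan time.
for n in (0, 1_000, 10_000, 100_000):
    print(f"{n:>7} chars: {per_call_seconds(n) * 1e9:8.0f} ns per call")
```

If the claim above holds on a given machine, the time for 10_000 characters should be in the same range as roughly twice the time for length 0.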
Additional Information. This solution provides “the first/last match with the lowest/highest STARTING index”. If there is a need, it is easy to get (at extra cost) “the first/last match with the lowest/highest ENDING index” by:

s[::-1].rfind((sub1[::-1], sub2[::-1]))    # first lowest ending index
s[::-1].find((sub1[::-1], sub2[::-1]))    # last highest ending index

So it provides simple solutions for 2 more cases. This would not be so if find and rfind were exact mirror images of each other, because then the reversal above would just turn one into the other.
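The reversal trick can be tried out today with stand-ins for the tuple-accepting methods. In the sketch below `find_any` and `rfind_any` are hypothetical helpers that emulate the proposed `s.find(subs)` and `s.rfind(subs)`. The index returned for the reversed string is relative to that reversed string, so it has to be converted back with `len(s) - pos` to get the (exclusive) ending index in the original string.

```python
def find_any(s, subs):
    """Stand-in for the proposed s.find(subs): lowest starting index or -1."""
    hits = [i for i in (s.find(sub) for sub in subs) if i != -1]
    return min(hits, default=-1)

def rfind_any(s, subs):
    """Stand-in for the proposed s.rfind(subs): highest starting index or -1."""
    return max(s.rfind(sub) for sub in subs)

def lowest_end(s, subs):
    """Smallest exclusive end index of any match in s, or -1."""
    pos = rfind_any(s[::-1], tuple(sub[::-1] for sub in subs))
    return -1 if pos == -1 else len(s) - pos

def highest_end(s, subs):
    """Largest exclusive end index of any match in s, or -1."""
    pos = find_any(s[::-1], tuple(sub[::-1] for sub in subs))
    return -1 if pos == -1 else len(s) - pos

s, subs = "xabcx", ("abc", "b")
print(find_any(s, subs), lowest_end(s, subs), highest_end(s, subs))
```

In this example the match with the lowest starting index is "abc" (it starts at 1 and ends at 4), while the match with the lowest ending index is "b" (it ends at 3), which is why the two questions need different calls.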

To sum up:
This addition does not rely heavily on any single point above; rather, it draws a balanced benefit from each of them.