Add tuple support to more str functions

Summary of what has been done and current case:

  1. Integration. Given str.startswith & str.endswith is able to accept tuple substring argument, this extension is not something completely new, but follows a design of couple of other string methods.
  2. Functionality. There were 2 options for this extension. limit substrings to single characters (the way that “go” has it as suggested by @methane strings package - strings - Go Packages) or support multi-character strings. It so happened that there is an easy and effective way to implement multi-character string support, while keeping API (i.e. tuple argument) in line with str.stasrtswith. Thus, this provides complete functionality in comparison to implementation of single-character substrings, where if multi-string support becomes needed in the future it would lead to additional extensions with all the issues that comes with it in addition to keeping backwards compatibility with existing incomplete addition (if this was single-string implementation).
  3. Maintenance. Although there have been comments that implementation is complex, this does not seem so to me. There were details to sort out along the way as bucketing implementation was a spontaneous process without any example to follow. The final result is a well defined simple algorithm, which handles any case properly and efficiently and makes use of existing functions for single substring case. So the maintenance cost is much lower in comparison to newly written algorithm from scratch such as Aho–Corasick algorithm - Wikipedia. The implementation written in python is clear and simple:
def find_tuple(s, subs, start=None, end=None):
    CHUNK_SIZE = 10_000
    start = 0 if start is None else start
    end = len(s) if end is None else end
    result = -1
    while result == -1 and start <= end:
        cur_end = start + CHUNK_SIZE
        # `- 1` & `cur_end + len(sub)` down below
        # means minimum 1 character in current chunk
        cur_end -= 1
        for sub in subs:
            sub_end = cur_end + len(sub)
            sub_end = min(sub_end, end)
            pos = s.find(sub, start, sub_end)
            if pos != -1:
                if pos == start:
                    return start
                # match with `cur_end = pos` would be as good
                # `- 1` here to only allow for later match
                cur_end = pos - 1
                result = pos
        start += CHUNK_SIZE
    return result
  1. Performance. Although performance wasn’t the main target, but the optimization above ensured that this implementation is at least as fast as any other solution currently available in python for any case (it is possible with certain effort to beat it by a small percentage, but unlikely in practice). This optimization ensured that implemented functionality strongly satisfies There should be one-- and preferably only one --obvious way to do it., while the status quo is that different solutions for different cases need to be used for optimal performance. The suggested implementation performs very well in practice and more complex implementations would only provide marginal improvement at significantly greater maintenance cost.
  2. Readability. All of alternative solutions are less readable than the suggested implementation. Cases and examples where this extension would improve readability:
  3. Calibration. CHUNK_SIZE was selected to be 10_000 as python call with all of the overhead takes as much time as actual scan of 10_000 length string. Lower sizes lead to no significant improvements in short running cases, while negative performance impact, although small (<5% for CHUNK_SIZE=1000), but is observed for full scans of 1M length strings. While larger sizes negatively impact performance of mixed cases (where one substring is found early, but full scans are done for all remaining ones). Selected number is a well balanced starting value. Fine-tuning is straightforward and can always be done later if such need arises.
  4. Additional Information. This solution provides: “The first/last match with the lowest/highest STARTING index”. And if there is a need it is easy to get (with extra cost): “The first/last match with the lowest/highest ENDING index” by:
s[::-1].rfind((sub1[::-1], sub2[::-1]))    # first lowest ending index
s[::-1].find((sub1[::-1], sub2[::-1]))    # last highest ending index

So it does provide simple solutions for 2 more cases as opposed to the case where these 2 were directly reversible and reversion as above would just lead to each other.

To sum up:
This addition doesn’t overly rely on any single point from the above, but is rather balanced in benefits from each.

1 Like