def find1(p):
    seps = r"\/"
    i, n = 0, len(p)
    while i < n and p[i] not in seps:
        i += 1
    if i == n:
        i = -1
    return i
def find4(p):
    seps = r"\/"
    i = p.find(tuple(seps))  # the proposed tuple-accepting find
    return i
The first approach is just ridiculous (It’s not C).
Does someone want to run my benchmark on their PC?
If you don’t want to write specialised functions for this, maybe you want to consider the possibility of dividing the search into segments of a certain size?
I.e.:
def find(s, subs):
    i, n = 0, len(s)
    while i < n:
        for sub in subs:
            any_find_slice(s, i, min(n, i + constant_partition))
            ...
        if found:
            break
        else:
            i += constant_partition
If you tune constant_partition, it is possible that the cost of the extra C function calls would be justified by the benefit of stopping early, leading to an overall better-balanced solution.
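For what it’s worth, here is a runnable sketch of that idea. Everything below is my own filling-in: `chunk` plays the role of `constant_partition`, and a per-substring `str.find` limited to the current chunk stands in for `any_find_slice`.

```python
def find_chunked(s, subs, chunk=64):
    # sketch only: scan the string chunk by chunk, stopping at the first
    # chunk that contains a match for any substring
    i, n = 0, len(s)
    while i < n:
        hi = min(n, i + chunk)
        # extend the window by len(sub) - 1 so a match that starts inside
        # the chunk but ends after it is still seen
        hits = [p for p in (s.find(sub, i, hi + len(sub) - 1) for sub in subs)
                if p != -1]
        if hits:
            return min(hits)  # earliest starting match in this chunk
        i = hi
    return -1

print(find_chunked("a" * 100 + "/b\\c", ("\\", "/")))  # 100
```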
I’m unclear why “this API is faster” is being used to justify “it needs to be in CPython”. Plenty of fast APIs are provided by third-party libraries. Performance should be completely irrelevant.
I would argue the burden this proposal needs to shoulder is explaining why this API is of sufficiently wide use, and would make code sufficiently more readable, that it will benefit from being a CPython battery. Nothing else is relevant.
Create a thread in Help (not in Ideas) if you are willing to pursue this issue. Explain the issue, and provide a Python script to conduct the benchmark rather than a bash script.
I ran it for strings of 1 million characters (rfind() should only be about 144x slower with memrchr()):
find best case - 1.42x faster - regex - 3.65x slower
2000000 loops, best of 5: 109 nsec per loop
1000000 loops, best of 5: 281 nsec per loop
2000000 loops, best of 5: 122 nsec per loop
5000000 loops, best of 5: 77 nsec per loop
2000000 loops, best of 5: 111 nsec per loop
1000000 loops, best of 5: 280 nsec per loop
20000 loops, best of 5: 16.2 usec per loop
20000 loops, best of 5: 16 usec per loop
find worst case - no difference - regex - 151x slower
5 loops, best of 5: 50.9 msec per loop
50 loops, best of 5: 4.81 msec per loop
10000 loops, best of 5: 32.1 usec per loop
10000 loops, best of 5: 31.8 usec per loop
rfind best case - 1.42x faster - regex - 8739x slower
2000000 loops, best of 5: 118 nsec per loop
500 loops, best of 5: 728 usec per loop
2000000 loops, best of 5: 152 nsec per loop
5000000 loops, best of 5: 83.3 nsec per loop
rfind mixed case - 5034x slower - regex 6161x slower
2000000 loops, best of 5: 118 nsec per loop
500 loops, best of 5: 727 usec per loop
500 loops, best of 5: 596 usec per loop
500 loops, best of 5: 594 usec per loop
rfind worst case - no difference - regex 4.66x slower
5 loops, best of 5: 51.7 msec per loop
50 loops, best of 5: 5.54 msec per loop
200 loops, best of 5: 1.19 msec per loop
200 loops, best of 5: 1.19 msec per loop
OK, re is very bad at this (mostly because it can’t search in reverse).
Speed is compared against the fastest algorithm (that doesn’t cheat).
I was suggesting that this could be the function you are looking for. It is approximately 1.2 times faster than your find1 function. Its time complexity is 𝑂(𝑛) in the worst case, where 𝑛 is the length of the string.
The time complexity of the algorithm proposed in the pull request is 𝑂(𝑛𝑘) in the worst case, where 𝑛 is the length of the string and 𝑘 is the length of the tuple.
I am curious about the speed difference for a longer seps tuple. At what tuple length does the proposed algorithm become slower than the find1 function?
def find1(p):
    seps = {'\\', '/'}  # what length?
    for i, c in enumerate(p):
        if c in seps:
            return i
    return -1
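One quick way to probe that crossover is a sketch like the one below. This is my own micro-benchmark, not the PR’s: it compares a single-pass character scan (O(n), independent of the number of separators) against one C-level `str.find` per separator (O(nk) worst case), in the worst case where no separator is present.

```python
import timeit

def find_loop(p, seps):
    # single pass over the string; cost independent of len(seps) (set lookup)
    for i, c in enumerate(p):
        if c in seps:
            return i
    return -1

def find_per_sep(p, seps):
    # one C-level str.find per separator; O(n*k) in the worst case
    hits = [i for i in (p.find(c) for c in seps) if i != -1]
    return min(hits, default=-1)

p = "a" * 10_000  # worst case: no separator present
for k in (2, 4, 8, 16):
    seps = {chr(0x2500 + j) for j in range(k)}  # k characters absent from p
    t_loop = timeit.timeit(lambda: find_loop(p, seps), number=100)
    t_find = timeit.timeit(lambda: find_per_sep(p, seps), number=100)
    print(f"k={k:2d}  loop/per-sep time ratio: {t_loop / t_find:.2f}")
```

The tuple length at which the ratio drops below 1 is the crossover point on that machine.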
Regex gets absolutely smoked (yes that’s 100 thousand times slower).
If there was a flag to find the last match…
find best case - 1.76x faster - regex - 4.13x slower
1000000 loops, best of 5: 177 nsec per loop
1000000 loops, best of 5: 277 nsec per loop
2000000 loops, best of 5: 118 nsec per loop
5000000 loops, best of 5: 67.1 nsec per loop
2000000 loops, best of 5: 177 nsec per loop
1000000 loops, best of 5: 7 nsec per loop
20000 loops, best of 5: 16 usec per loop
5000000 loops, best of 5: 67.1 nsec per loop
5 loops, best of 5: 40.9 msec per loop
50 loops, best of 5: 4.82 msec per loop
10000 loops, best of 5: 32.6 usec per loop
10000 loops, best of 5: 29.2 usec per loop
rfind best case - 1.20x faster - regex - 100,544x slower
2000000 loops, best of 5: 110 nsec per loop
50 loops, best of 5: 9.24 msec per loop
2000000 loops, best of 5: 153 nsec per loop
5000000 loops, best of 5: 91.9 nsec per loop
2000000 loops, best of 5: 110 nsec per loop
50 loops, best of 5: 9.25 msec per loop
500 loops, best of 5: 598 usec per loop
5000000 loops, best of 5: 85.1 nsec per loop
5 loops, best of 5: 49.2 msec per loop
10 loops, best of 5: 22.3 msec per loop
200 loops, best of 5: 1.19 msec per loop
200 loops, best of 5: 1.22 msec per loop
Integration. Given that str.startswith and str.endswith already accept a tuple of substrings, this extension is not something completely new, but follows the design of a couple of other string methods.
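To illustrate the precedent: the first two calls below work in current CPython, while the tuple form of `str.find` is exactly what is being proposed (and currently raises `TypeError`).

```python
p = "C:\\dir/file.txt"

# str.startswith / str.endswith already accept a tuple of alternatives
print(p.endswith((".txt", ".md")))     # True
print(p.startswith(("/", "./", "~")))  # False

# str.find currently does not accept a tuple
try:
    p.find(("\\", "/"))
except TypeError:
    print("str.find rejects tuples today")
```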
Functionality. There were two options for this extension: limit substrings to single characters (the way Go has it, as suggested by @methane: strings package - strings - Go Packages), or support multi-character strings. It so happened that there is an easy and effective way to implement multi-character string support while keeping the API (i.e. the tuple argument) in line with str.startswith. Thus, this provides complete functionality compared to a single-character implementation, where, if multi-character support became needed in the future, it would lead to additional extensions, with all the issues those bring, on top of keeping backwards compatibility with the existing incomplete addition.
Maintenance. Although there have been comments that the implementation is complex, it does not seem so to me. There were details to sort out along the way, as the bucketing implementation was a spontaneous process without any example to follow. The final result is a well-defined, simple algorithm which handles every case properly and efficiently, and makes use of existing functions for the single-substring case. So the maintenance cost is much lower than that of an algorithm newly written from scratch, such as Aho–Corasick algorithm - Wikipedia. The implementation written in Python is clear and simple:
def find_tuple(s, subs, start=None, end=None):
    CHUNK_SIZE = 10_000
    start = 0 if start is None else start
    end = len(s) if end is None else end
    result = -1
    while result == -1 and start <= end:
        cur_end = start + CHUNK_SIZE
        # `- 1` & `cur_end + len(sub)` down below
        # mean a minimum of 1 character in the current chunk
        cur_end -= 1
        for sub in subs:
            sub_end = cur_end + len(sub)
            sub_end = min(sub_end, end)
            pos = s.find(sub, start, sub_end)
            if pos != -1:
                if pos == start:
                    return start
                # a tie at `pos` would be no better;
                # `- 1` so only a strictly earlier match can update
                cur_end = pos - 1
                result = pos
        start += CHUNK_SIZE
    return result
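As a sanity check, here is a naive reference the chunked version can be compared against. This oracle is my own addition: one full `str.find` per substring, no chunking and no early stop, but the same “earliest starting index” semantics.

```python
def find_tuple_ref(s, subs, start=0, end=None):
    # naive oracle: earliest starting position over all substrings
    end = len(s) if end is None else end
    hits = [p for p in (s.find(sub, start, end) for sub in subs) if p != -1]
    return min(hits, default=-1)

print(find_tuple_ref("C:\\dir/file.txt", ("\\", "/")))    # 2
print(find_tuple_ref("no separators here", ("\\", "/")))  # -1
```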
Performance. Although performance wasn’t the main target, the optimization above ensures that this implementation is at least as fast as any other solution currently available in Python, for any case (it is possible, with some effort, to beat it by a small percentage, but unlikely in practice). This optimization ensures that the implemented functionality strongly satisfies “There should be one-- and preferably only one --obvious way to do it.”, while the status quo is that different solutions need to be used in different cases for optimal performance. The suggested implementation performs very well in practice, and more complex implementations would only provide marginal improvement at significantly greater maintenance cost.
Readability. All of the alternative solutions are less readable than the suggested implementation; there are a number of cases and examples where this extension would improve readability.
Calibration. CHUNK_SIZE was set to 10_000 because a Python call, with all of its overhead, takes about as much time as an actual scan of a 10_000-character string. Lower sizes bring no significant improvement in short-running cases, while a negative performance impact, although small (<5% for CHUNK_SIZE=1000), is observed for full scans of 1M-character strings. Larger sizes, on the other hand, hurt the performance of mixed cases (where one substring is found early but full scans are done for all the remaining ones). The selected number is a well-balanced starting value; fine-tuning is straightforward and can always be done later if the need arises.
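That calibration claim can be probed with a micro-benchmark along these lines. This is my own sketch; the absolute numbers are machine-dependent, so no particular ratio is asserted.

```python
import timeit

s = "a" * 10_000
n = 10_000
# full 10_000-character scan with no hit
t_scan = timeit.timeit(lambda: s.find("b"), number=n)
# same call restricted to 1 character: roughly pure call overhead
t_call = timeit.timeit(lambda: s.find("b", 0, 1), number=n)
print(f"scan: {t_scan / n * 1e9:.0f} ns/call, "
      f"overhead: {t_call / n * 1e9:.0f} ns/call")
```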
Additional Information. This solution provides “the first/last match with the lowest/highest STARTING index”. If there is a need, it is easy to get (at extra cost) “the first/last match with the lowest/highest ENDING index” by:
s[::-1].rfind((sub1[::-1], sub2[::-1])) # first lowest ending index
s[::-1].find((sub1[::-1], sub2[::-1])) # last highest ending index
So it also provides simple solutions for two more cases, as opposed to a design where these two were directly reversible and the reversal above would just map them onto each other.
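To make the start-vs-end distinction concrete, here are two hypothetical helpers of my own (not the reversal trick itself), applied to a case where the two answers differ:

```python
def find_lowest_start(s, subs):
    # earliest STARTING index over all substrings
    hits = [p for p in (s.find(x) for x in subs) if p != -1]
    return min(hits, default=-1)

def find_lowest_end(s, subs):
    # earliest ENDING index over all substrings
    # (for each substring, its first occurrence also ends earliest)
    hits = [s.find(x) + len(x) for x in subs if x in s]
    return min(hits, default=-1)

s = "xabc"
print(find_lowest_start(s, ("abc", "b")))  # 1 -> "abc" starts first
print(find_lowest_end(s, ("abc", "b")))    # 3 -> "b" ends first (2 + 1)
```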
To sum up:
This addition doesn’t rely too heavily on any single point from the above, but is rather balanced in the benefits it draws from each.
def find_oneliner(p):
    subs = ('\\', '/')
    i = min((i for i in map(lambda x: p.find(x), subs) if i != -1), default=-1)
    return i

print(find_oneliner("/\\"))    # 0
print(find_oneliner("0/\\"))   # 1
print(find_oneliner("01/\\"))  # 2
# min((i for i in map(lambda x: p.find(x), subs) if i != -1), default=-1)
oneliner1: Mean +- std dev: 759 ns +- 7 ns
# min([i for i in map(p.find, subs) if i != -1], default=-1)
oneliner2: Mean +- std dev: 560 ns +- 9 ns
# min([i for s in subs if (i := p.find(s)) != -1], default=-1)
oneliner3: Mean +- std dev: 482 ns +- 6 ns
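For completeness, a quick check (my addition; the one-liners are wrapped as functions so they share a signature) that the three benchmarked variants really compute the same thing:

```python
def oneliner1(p, subs):
    return min((i for i in map(lambda x: p.find(x), subs) if i != -1), default=-1)

def oneliner2(p, subs):
    return min([i for i in map(p.find, subs) if i != -1], default=-1)

def oneliner3(p, subs):
    return min([i for s in subs if (i := p.find(s)) != -1], default=-1)

for p in ("/\\", "0/\\", "01/\\", "no match"):
    results = {f(p, ('\\', '/')) for f in (oneliner1, oneliner2, oneliner3)}
    assert len(results) == 1  # all three variants agree
```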