Ah, that’s WAY more complicated. I mean, I love messing around with LALR grammars and building expression evaluators with full flexibility (it’s hugely fun!), but that’s really not a replacement for regex.
In the long run, why not? An ideal `compile()` function could detect that the grammar can be implemented as a regular expression, and it would just work like RE.
Once you’ve put all the work into building a nice RE syntax, you may as well build a parser on top of it.
I think we are on topic. This may not be the solution you prefer, but I think it is a reasonable solution.
What you are writing as `str.find((x, y))` can just as easily be written as something like `humre.find(str, humre.either(x, y))`. Why are you opposed to the `humre` solution?
I made a few arguments why I think it’s better:
- it’s more general: you can do anything regular expressions can do, including capturing patterns (which was mentioned earlier in the thread), searching for complicated patterns, etc.
- it fits with the OO principle that methods should be reserved for doing things that cannot be done with the public interface.
I also agree with Serhiy’s arguments against expanding the standard library for this.
Regexes can be compiled to deterministic finite state machines, leading to linear-time performance. (RE2 from Google does that, I believe.)
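To illustrate the idea, here is a hand-rolled sketch (a hypothetical transition table, not RE2’s actual machinery) of the regex `a(b|c)` compiled to a DFA. Matching becomes a single linear pass over the input with no backtracking:

```python
# Sketch: the regex a(b|c) as a DFA transition table.
TRANSITIONS = {
    (0, "a"): 1,
    (1, "b"): 2,
    (1, "c"): 2,
}
ACCEPTING = {2}

def dfa_fullmatch(s):
    state = 0
    for ch in s:  # one linear pass, no backtracking
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no transition: reject immediately
            return False
    return state in ACCEPTING

assert dfa_fullmatch("ab") and dfa_fullmatch("ac")
assert not dfa_fullmatch("ad") and not dfa_fullmatch("a")
```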
Right, exactly. Which is why regular expressions are a fine solution to your problem, right?
It is “inevitable”.
As much as I appreciate the theory and long-term possibilities, it would be nice to stay pragmatic and realistic here.
Unless, of course, you have a clearly defined goal and a plan of action for how you are going to improve `re` so that its performance will “almost” match the implementations proposed in this thread.
Current implementation: GitHub - nineteendo/cpython at find-tuple
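For reference, a sketch of the proposed semantics (exercised by `find4`/`rfind4` in the benchmark below; the tuple argument only works on that branch, not on stock CPython):

```python
# Proposed behavior on the find-tuple branch, mirroring str.startswith(tuple):
p = r"C:\Users/nineteendo"
assert p.find(("\\", "/")) == 2    # lowest index of any separator
assert p.rfind(("\\", "/")) == 8   # highest index of any separator
```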
Benchmark (compared against the fastest alternative, which is a bit unfair, because that alternative might not be the fastest in other cases):
script

```python
# find_tuple.py
import re

PATTERN = re.compile(r"[\\/]")

def find1(p):
    # Manual linear scan.
    seps = "\\/"
    i, n = 0, len(p)
    while i < n and p[i] not in seps:
        i += 1
    if i == n:
        i = -1
    return i

def find2(p):
    # Compiled regex search.
    match = PATTERN.search(p)
    i = match.start() if match else -1
    return i

def find3(p):
    # Two find() calls, keep the lowest hit.
    sep = "\\"
    altsep = "/"
    i = p.find(sep)
    new_i = p.find(altsep)
    if new_i != -1 and (new_i < i or i == -1):
        i = new_i
    return i

def find4(p):
    # Proposed tuple API (find-tuple branch only).
    seps = ("\\", "/")
    i = p.find(seps)
    return i

def rfind1(p):
    # Manual linear scan from the end.
    seps = "\\/"
    i = len(p) - 1
    while i >= 0 and p[i] not in seps:
        i -= 1
    return i

def rfind2(p):
    # Regex search on the reversed string.
    match = PATTERN.search(p[::-1])
    i = len(p) - 1 - match.start() if match else -1
    return i

def rfind3(p):
    # Two rfind() calls, keep the highest hit.
    sep = "\\"
    altsep = "/"
    i = max(p.rfind(sep), p.rfind(altsep))
    return i

def rfind4(p):
    # Proposed tuple API (find-tuple branch only).
    seps = ("\\", "/")
    i = p.rfind(seps)
    return i
```
```bash
# find_tuple.sh
echo find - best case
find-tuple/python.exe -m timeit -s "import find_tuple; string = r'\/' + 'a' * 998" "find_tuple.find1(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = r'\/' + 'a' * 998" "find_tuple.find2(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = r'\/' + 'a' * 998" "find_tuple.find3(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = r'\/' + 'a' * 998" "find_tuple.find4(string)"
echo find - mixed case
find-tuple/python.exe -m timeit -s "import find_tuple; string = '/' + 'a' * 999" "find_tuple.find1(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = '/' + 'a' * 999" "find_tuple.find2(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = '/' + 'a' * 999" "find_tuple.find3(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = '/' + 'a' * 999" "find_tuple.find4(string)"
echo find - worst case
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 999" "find_tuple.find1(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 999" "find_tuple.find2(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 999" "find_tuple.find3(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 999" "find_tuple.find4(string)"
echo rfind - best case
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 998 + r'\/'" "find_tuple.rfind1(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 998 + r'\/'" "find_tuple.rfind2(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 998 + r'\/'" "find_tuple.rfind3(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 998 + r'\/'" "find_tuple.rfind4(string)"
echo rfind - mixed case
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 999 + '/'" "find_tuple.rfind1(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 999 + '/'" "find_tuple.rfind2(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 999 + '/'" "find_tuple.rfind3(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 999 + '/'" "find_tuple.rfind4(string)"
echo rfind - worst case
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 999" "find_tuple.rfind1(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 999" "find_tuple.rfind2(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 999" "find_tuple.rfind3(string)"
find-tuple/python.exe -m timeit -s "import find_tuple; string = 'a' * 999" "find_tuple.rfind4(string)"
```
```text
find - best case - 1.40x faster
2000000 loops, best of 5: 108 nsec per loop
1000000 loops, best of 5: 280 nsec per loop
2000000 loops, best of 5: 122 nsec per loop
5000000 loops, best of 5: 77 nsec per loop
find - mixed case - 1.27x faster
2000000 loops, best of 5: 109 nsec per loop
1000000 loops, best of 5: 279 nsec per loop
2000000 loops, best of 5: 135 nsec per loop
5000000 loops, best of 5: 85.8 nsec per loop
find - worst case - 1.48x faster
5000 loops, best of 5: 48.5 usec per loop
50000 loops, best of 5: 5 usec per loop
2000000 loops, best of 5: 143 nsec per loop
5000000 loops, best of 5: 96.7 nsec per loop
rfind - best case - 1.42x faster
2000000 loops, best of 5: 117 nsec per loop
200000 loops, best of 5: 1.25 usec per loop
2000000 loops, best of 5: 151 nsec per loop
5000000 loops, best of 5: 82.2 nsec per loop
rfind - mixed case - 6.01x slower
2000000 loops, best of 5: 116 nsec per loop
200000 loops, best of 5: 1.25 usec per loop
500000 loops, best of 5: 743 nsec per loop
500000 loops, best of 5: 697 nsec per loop
rfind - worst case - 1.03x faster
5000 loops, best of 5: 47.3 usec per loop
50000 loops, best of 5: 5.84 usec per loop
200000 loops, best of 5: 1.34 usec per loop
200000 loops, best of 5: 1.3 usec per loop
```
`rfind()` seems to be consistently slower than `find()`, which is outside of my control (I haven’t written the helper functions).
That might be one step ahead, but one feature could be useful in `re`: a `REVERSE` flag.
```python
import re
s = 'a' * 50000 + '_' + 'a' * 50000 + '_' + 'a' * 50000

%timeit s.find('_')                               # 807 ns
%timeit s.rfind('_')                              # 20.7 µs
%timeit next(re.finditer(r'_', s)).span()         # 14.7 µs
%timeit next(re.finditer(r'(?s:.*)_', s)).span()  # 36.4 µs
```
`str.rfind` is 20x slower than `str.find` (is this the same for everyone? I am a bit surprised). And there is no straightforward way to do a reverse search in `re`. Is there?
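One workaround (a sketch, not a true reverse search: it still scans the whole string left to right) is to exhaust `finditer()` and keep the last match:

```python
import re

s = 'a' * 50000 + '_' + 'a' * 50000 + '_' + 'a' * 50000

last = None
for last in re.finditer(r'_', s):
    pass  # keep iterating; `last` ends up bound to the final match
print(last.start() if last else -1)  # 100001, same as s.rfind('_')
```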
The third-party `regex` module has the flag, but I’m not sure how it functions.
Are there common use cases for find/rfind/split with multiple separator strings?
Maybe line endings are the most common case. But Python provides `str.splitlines()` already.
If we can focus on single-character separators, we can ignore the Aho-Corasick vs brute force question.
For example, Go provides simple APIs by supporting only chars, e.g. `strings.IndexAny` and `strings.LastIndexAny`.
Well, yeah, Windows paths for example, see here:
And we can still ignore Aho-Corasick when `tuple(seps)` is passed.
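A rough sketch of why the algorithmic question disappears for single-character separators (the `find_any_char` helper is hypothetical; the real implementation would be a C loop):

```python
# Membership in a small set of characters is O(1) per position, so one
# linear pass suffices -- no Aho-Corasick automaton needed.
def find_any_char(p, seps):
    seps = frozenset(seps)
    for i, ch in enumerate(p):
        if ch in seps:
            return i
    return -1

assert find_any_char(r"C:\Users/nineteendo", ("\\", "/")) == 2
```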
- Because it’s an extra package I have to install and learn, and everyone reading the code also has to learn it.
- It is currently implemented on top of `re`, meaning from a performance perspective it is always going to be worse (and based on the chosen syntax, I am not sure it can ever compete).
- Personally, I much prefer something far less verbose than what humre offers.
This is not an OOP principle I have ever heard of, and IMO it’s a bad one that leads to incomplete APIs missing essential features for day-to-day use, and scatters the needed functions across multiple places.
It also goes against past design decisions for Python; see the history of the `string` module.
I think you’ll find it in many OO books. Herb Sutter discusses it here, specifically in the context of a string class.
The suggestion isn’t that essential features be in multiple places. The suggestion is that they should be implemented in free functions rather than methods, e.g. `string.find` rather than `str.find`. And once you’re already importing `string`, you could instead import `re` or something like it.
That’s fair, but then you can use `re` if you don’t want to learn something new.
Post-compile, it is identical. I also really don’t understand this interest in optimization. Are you running on ten megabyte strings? If not, who cares about speed? Nearly all of the time is going to be in setting up the call no matter how fast you make the call.
One very common case is the case-change problem: snake_case to Title Case, etc.
a) split the sentence on multiple separators
b) the separators must be kept in place to be able to join back
Note, `str.split` already offers this, but only for one specific case: whitespace. It checks internally whether a character is one of many. Also, it lacks a `keepsep` flag.
Currently this is usually solved with `re`, but the solution is fairly involved. There is a whole library for it: case-converter · PyPI
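For what it’s worth, `re` can cover requirements (a) and (b) in one call: a capturing group in the pattern makes `re.split()` keep the separators in the result, so the pieces can be joined back. A minimal sketch:

```python
import re

s = "snake_case to Title Case"
# The capturing group makes re.split() include each separator in the result.
parts = re.split(r"([_\s])", s)
print(parts)  # ['snake', '_', 'case', ' ', 'to', ' ', 'Title', ' ', 'Case']
assert "".join(parts) == s  # separators kept, so the string round-trips
```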
Yes, exactly! That is where humre is going to fail for such simple cases compared to `re` or builtins (unless I have a predefined pattern somewhere else, scattering the relevant code throughout the module).
I am unable to reproduce that.
```python
import timeit

setup = "s = 'a' * 50000 + '_' + 'a' * 50000 + '_' + 'a' * 50000"
number = 1000000

# Benchmark for find
find_time = timeit.timeit(
    "s.find('_')",
    setup=setup,
    number=number
)

# Benchmark for rfind
rfind_time = timeit.timeit(
    "s.rfind('_')",
    setup=setup,
    number=number
)

print(f"Time taken for find: {find_time} seconds")
print(f"Time taken for rfind: {rfind_time} seconds")
```
Output:

```text
Time taken for find: 0.711215250999885 seconds
Time taken for rfind: 0.7262046799987729 seconds
```
I mean the setup in the surrounding code, which has to happen no matter what. Whatever setup humre does is still going to be tiny by comparison.
And anyway, the humre compilation typically only happens once for a piece of code, unless you’re searching for different things each time, which isn’t something that’s been illustrated in any examples.
I think in practice these kinds of micro-optimizations don’t matter. Python isn’t the right language if they do matter to you.
Many thanks! I’m running macOS; it could be a bug. Tested on 3.11 & 3.12 with the same result.
Why? Could you elaborate?
Very strange:
script

```bash
# find_tuple.sh
echo best case
find-tuple/python.exe -m timeit -s "string = '/' + 'a' * 999" "string.find('/')"
find-tuple/python.exe -m timeit -s "string = 'a' * 998 + '/'" "string.rfind('/')"
echo worst case
find-tuple/python.exe -m timeit -s "string = 'a' * 1000" "string.find('/')"
find-tuple/python.exe -m timeit -s "string = 'a' * 1000" "string.rfind('/')"
```

```text
best case
10000000 loops, best of 5: 32.3 nsec per loop
5000000 loops, best of 5: 40.1 nsec per loop
worst case
5000000 loops, best of 5: 44.8 nsec per loop
500000 loops, best of 5: 639 nsec per loop
```
If you’re worried about squeezing out every bit of speed, use a language like Rust or C. In Python, I don’t think that optimization is an important consideration compared with the other considerations mentioned in this thread: good interface design and keeping the stdlib simple.
We (not me) were recently investigating whether we could avoid `re` imports because of the import time. Also, because of the setup cost, alternatives beat it for short strings, making it not objectively better. `str.find(tuple)` doesn’t suffer from those problems.
`re` is already 15x slower than `str.find`. How much slower is it going to be with humre overhead?
I think it is very wrong to argue “it is either speed or convenience”.
A good product is both fast and convenient.
Following your logic, three abstractions down nothing will be computable in time anymore.
And that is OK, because it is the wrong language.
So who is going to build the right one?
It is always easy to argue with such clichés.
It is much harder to come up with solutions that are good in all aspects.
Finally, yes, Python is slower than compiled languages.
Does that mean performance does not matter anymore?
Or does it mean that extra effort is needed for it to be competitive?