Probably because this is one of those actually very useful, but not very exciting features.
I for one would love to see this added for find and rfind! I've lost count of how many times I've had to write some long-winded, 'clunky' code just to check for a few different possible strings, when it feels like I ought to just be able to pass a tuple, which would be much more readable.
Even if it didn't bring any performance benefit at all, it'd be fantastic to have this built in to Python.
I don't like using re for this because I find that even more unreadable: the whole thing ends up inside a string literal, probably littered with escapes, and checking the result is more complex too.
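To illustrate the kind of 'clunky' workaround being described (a hypothetical sketch, not code from this thread): finding the earliest occurrence of any of several substrings currently takes several lines, where a tuple argument would take one.

```python
s = "error: disk full"
needles = ("warning", "error", "fatal")

# Today: run a separate find() for each candidate, drop the misses,
# and take the earliest hit.
hits = [i for i in (s.find(n) for n in needles) if i != -1]
first = min(hits) if hits else -1  # 0, because "error" starts the string

# Hoped for: first = s.find(("warning", "error", "fatal"))
```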
That's not why - it's because this opinion on the change, which some people find self-evident, has not been demonstrated convincingly to others. There are already explanations of this earlier in the thread.
One other option is for someone just to submit one or more PRs implementing the proposed feature(s). The PRs will either get accepted or rejected, and then you have your answer. The lack of response might just be because there's not a lot that's interesting to say.
I don't personally think this is worth the effort to implement, and I'm not convinced I'd find it very useful in practice. But I also don't think it's such a big deal that it needs a big debate, or community consensus, or a PEP. So if you want to put in the effort, just go for it.
I, personally, would definitely find an occasional use for this. And in my opinion it would be a good addition to the language - a fancy feature that you don't commonly find in other languages.
However, at first glance, it looks like a fair bit of work.
So if I were to implement this, I would like someone with an interest backing me up and helping out a bit, as there would be decisions to be made.
E.g., what is the return value of str.find(sub: tuple)? Surely it has to be (index, index_of_substring_in_tuple); otherwise a big part of the performance benefit would be killed by having to find out which substring matched (at least in the case of short substrings). Is it OK that the single-substring case returns one value and the tuple case returns two?
I really don't like the feel of that. It sounds like two separate functions jammed under the same name.
I'm not convinced. Consider the single-character case first: if someone really wants the position of the first of any of certain characters, it's not especially likely to matter which character was found. If it is needed, finding it is just a matter of applying that index - another Python call, but one that does O(1) work after having just searched a string. And on the other side of the equation, building a tuple isn't free either, and neither is unpacking and interpreting it from Python. And when the search substrings are longer, you'll have a harder time convincing people that it shouldn't be a regex task.
Does the same not equally apply to str.find? If not, that sounds to me more like there's something suboptimal in the re implementation that could be patched.
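For comparison, a sketch of the re route (names here are illustrative): an alternation compiled from the substrings does a single left-to-right scan and yields both the position and the matched alternative - the same two pieces of information a tuple-accepting str.find would have to return.

```python
import re

s = "spam, eggs; ham"
needles = ("; ", ", ")

# One pass over s; the Match object carries both the index
# and which substring matched.
pattern = re.compile("|".join(map(re.escape, needles)))
m = pattern.search(s)
idx, sep = (m.start(), m.group()) if m else (-1, None)
# idx == 4, sep == ", "
```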
Yes, that's 3.16x faster in my benchmarks, but only in the worst case. If one substring is found at the start, it will still search for the other substrings all the way to the end.
It should definitely be possible to improve the performance, but I need a very cheap tailmatch() for that, which we don't have at the moment.
In theory. In practice it can be expensive compared to an almost-linear search over a relatively short string (say, <200 characters) plus one Python call.
In the case where it is just one character - yes, fair enough. But significant overhead builds up when you have to work out which substring matched given only its starting index.
E.g.:
s = 'a' * 198 + '__'
seps = ('__', ',,')
idx = s.find(seps[0])
%timeit s.find(seps[0]) # 300 ns
# So the upper bound is 600 ns
remainder = s[idx:idx+max(len(sep) for sep in seps)]
sep = [sep for sep in seps if remainder.startswith(sep)]
%timeit [sep for sep in seps if remainder.startswith(sep)] # 462 ns
So in simpler cases it can take as much time to figure out which separator matched as to find the index in the first place.
If str.find & str.rfind aren't going to return the sep that matched, at least a lower-level utility should be able to return it. Then a multi-separator str.split can be implemented effortlessly.
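The find_many used below is not an existing function; here is a naive pure-Python stand-in for the proposed lower-level utility (it runs one find() per substring, so it has none of the single-pass performance benefits argued for above):

```python
def find_many(s, subs, start=0):
    """Return (index, substring) for the leftmost match of any substring
    in subs at or after start, or (-1, None) if none occurs.
    On a tie, the substring listed first in subs wins."""
    best, best_sub = -1, None
    for sub in subs:
        i = s.find(sub, start)
        if i != -1 and (best == -1 or i < best):
            best, best_sub = i, sub
    return best, best_sub
```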
import sys

def split_many(s, sep: tuple, maxsplit=-1, keepsep=False):
    if maxsplit == -1:
        maxsplit = sys.maxsize
    res = []
    i, j = 0, 0
    # Loop until no separator remains, so that a trailing separator
    # still yields the final empty field, as str.split does.
    while True:
        if j >= maxsplit:
            res.append(s[i:])
            break
        idx, ss = find_many(s, sep, i)
        if idx == -1:
            res.append(s[i:])
            break
        res.append(s[i:idx])
        if keepsep:
            res.append(ss)
        i = idx + len(ss)
        j += 1
    return res
According to OO theory, many of the methods on str should have been free functions by the principle that they can be implemented using the public interface. Methods are not just about saving characters.
Therefore, in the name of language idealism, you should be importing some library of functions and calling those instead.
However, I totally agree with the aversion to using re:
In my opinion, what we need is a more human-readable regular expression library. It shouldn't be in the standard library until it's stable and popular. I believe there might already be some decent ones. Using that library, you could then build the tools you want (split, find, etc.).
I think this would be a far more productive route than trying to hammer everything you want into the standard library.
Generally, I agree. Either:
a) Forget string methods and concentrate on re
b) Do a final polish on a couple of string methods and then concentrate on re. If the methods mentioned in this thread are in sufficient demand, this could also be beneficial, given that they will inevitably be more performant than any re implementation and will be close at hand without any extra effort.
While that may be true, in my experience every language ends up with some combination of 'simple' string methods/functions and a regex API. It's very hard to imagine that this is a coincidence - rather, it seems to me that, like it or not, regular expressions are in some sense the best 'general string manipulation' approach anyone has managed to invent over the years. They aren't perfect (far from it!) but they fill a gap that nothing else manages to.
It's true that there are likely alternative approaches on PyPI somewhere. If you really don't like regular expressions, by all means hunt something out and use it. But it'll probably be a long time before anything better is in the stdlib. (Grep, awk and perl haven't found anything better yet, and they have been around longer than Python.)
It's the same with the bash command syntax. From many angles, it is AWFUL. Wouldn't it be so much cleaner to have a proper 'array of strings' syntax like subprocess.run has? And yet… it's an incredibly usable syntax. I've never yet met a syntax that I would actually prefer, even if (from a purely design-aesthetic standpoint) they are 'better'.
Sorry, I think you've misunderstood. I'm not arguing against regular expressions. I'm simply agreeing with the above comment that Python's implementation of regular expressions produces strings that are hard to read. I think that's a pretty uncontroversial viewpoint.
The first web hit for human-readable regular expressions in Python gives humre. Something like that would be a lot easier to use than Python's re.
In an ideal world, you'd have something like that instead of re. And in an even more ideal world, it would be a more powerful parsing library instead.
I don't think that's 'inevitable'. Regular expressions are compiled, and in a post-JIT world anything can be fast. It's very possible that a compiled regular expression is just as fast as any hand-crafted solution. Also, most people don't care that much about a few percentage points of optimization. If you really care about string parsing to that degree, then you should hand-craft something even better (e.g., see what Kitty did with string processing on the GPU).
Personally, I think it would be better to forget string methods, forget re, and focus on something nicer.
If people want an actual alternative to regular expressions, particularly as regards performance, my push would be for something sscanf-based. It's less flexible than regex, but it guarantees that it will parse from left to right ONCE, thus ensuring that there's no exponential backtracking or anything like that. What you'd end up with would be something like this:
data = "1969-07-20: Neil Armstrong walks on the moon"
year, month, day, event = sscanf(data, "%d-%d-%d: %s")
I still wouldn't use this for simple operations that work better as string methods, but when there are multiple fields to be pulled out of something, it often is worth it.
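The sscanf above is hypothetical - Python has no such builtin - but a minimal version of the idea can be sketched on top of re (supporting only %d and %s here; a real version would handle width specifiers, %f, and so on):

```python
import re

# Map each conversion to a capturing regex (assumed, minimal spec set).
_SPECS = {"%d": r"([+-]?\d+)", "%s": r"(.+)"}

def sscanf(text, fmt):
    """Parse text against fmt; literals must match exactly."""
    pattern = re.escape(fmt)
    for spec, rx in _SPECS.items():
        pattern = pattern.replace(re.escape(spec), rx)
    m = re.fullmatch(pattern, text)
    if m is None:
        raise ValueError(f"{text!r} does not match {fmt!r}")
    # Recover the conversion order from the format string.
    order = re.findall(r"%[ds]", fmt)
    return [int(g) if spec == "%d" else g
            for spec, g in zip(order, m.groups())]

data = "1969-07-20: Neil Armstrong walks on the moon"
year, month, day, event = sscanf(data, "%d-%d-%d: %s")
# year == 1969, month == 7, day == 20
```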