Add tuple support to more str functions

Same conclusion as yours. There is a significant amount of work in making an optimal implementation that does not do any unnecessary work.

However, once find and rfind exist with tuple arguments, implementing str.split(sep: tuple, keepsep=True) is not hard :slight_smile:

I think making re more efficient instead is a better use of time after all.

radio silence

Probably because this is one of those actually very useful, but not very exciting features.

I for one would love to see this added for find and rfind! I’ve lost count of how many times I’ve had to write some long-winded, ā€œclunkyā€ code just to check for a few different possible strings, when it feels like I ought to just be able to pass a tuple, which would be much more readable.

Even if it didn’t bring any performance benefit at all, it’d be fantastic to have this built in to Python.

I don’t like using re for this because I find that even more unreadable: the whole thing ends up inside a string literal, probably littered with escapes, and checking the result is more complex too.
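For illustration, a sketch of the kind of workaround being described, next to the proposed spelling (the tuple-accepting find is hypothetical, and the names are mine):

s = "some text with bar in it"
# Today: position of the first occurrence of any candidate substring.
positions = (s.find(sub) for sub in ("foo", "bar", "baz"))
first = min((i for i in positions if i != -1), default=-1)
print(first)  # 15

# The proposed (hypothetical) spelling:
# first = s.find(("foo", "bar", "baz"))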


That’s not why–it’s because this opinion on the change, which some people find self-evident, has not been demonstrated to others in a convincing manner. There are already explanations of this in this thread.

One other option is for someone just to submit one or more PRs implementing the proposed feature(s). The PRs will either get accepted or rejected, and then you have your answer. The lack of response might just be because there’s not a lot that’s interesting to say.

I don’t personally think this is worth the effort to implement, and I’m not convinced I’d find it very useful in practice. But I also don’t think it’s such a big deal that it needs a big debate, or community consensus, or a PEP. So if you want to put in the effort, just go for it.


I, personally, would definitely find an occasional use for this. And in my opinion it would be a good addition to the language - a fancy feature that you don’t commonly find in other languages.

However, at first glance, it is a bit of work.

So if I were to go ahead and implement this, I would like someone interested to back me up and help out a bit, as there would be decisions to be made.

E.g. what is the return value of str.find(sub: tuple)? Surely it has to be (index, index_of_string_in_a_tuple). Otherwise a big part of the performance benefit would be lost by having to find out which string matched (at least in the case of short strings). Is it OK that in the single-substring case it returns one value and in the tuple case two?


I really don’t like the feeling of that. That sounds like two separate functions jammed under the same name.

I’m not convinced. Consider the single-character case first: if someone really wants the position of the first of several characters, it’s not especially likely to matter which character was found. If it is needed, recovering it is just a matter of indexing at that position, which is another Python call, but one that does O(1) work after having just searched a string. And on the other side of the equation, building a tuple isn’t free either, and neither is unpacking and interpreting it from Python. And when the search substrings are longer, you’ll have a harder time convincing people that it shouldn’t be a regex task.

Does the same not equally apply to str.find? If not, that sounds to me more like there’s something suboptimal in the re implementation that could be patched.

Yes, that’s 3.16x faster in my benchmarks, but only in the worst case. If one substring matches at the start, it will still search for the other substrings all the way to the end.

It should definitely be possible to improve the performance, but I need a very cheap tailmatch() for that, which we don’t have at the moment.

In theory. In practice it can be expensive compared to an almost-linear search on a relatively short string (say, under 200 characters) plus one Python call.

In the case where it is just one character, fair enough, but significant overhead builds up in finding out which substring matched given only the starting index.

E.g.:

s = 'a' * 198 + '__'
seps = ('__', ',,')
idx = s.find(seps[0])
%timeit s.find(seps[0])    # 300 ns
# Searching for both separators separately costs roughly twice that,
# so the upper bound is about 600 ns.
remainder = s[idx:idx + max(len(sep) for sep in seps)]
sep = [sep for sep in seps if remainder.startswith(sep)]
%timeit [sep for sep in seps if remainder.startswith(sep)]    # 462 ns

So in simple cases, figuring out which separator matched can take as much time as finding the index itself.

Hmm, let’s keep it consistent with startswith(), even though returning the match would’ve been useful.
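For reference, startswith() already accepts a tuple today, but it only reports whether any of the prefixes matched, not which one:

>>> "foobar".startswith(("foo", "spam"))
True
>>> "foobar".startswith(("eggs", "spam"))
False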

You should consider the Aho–Corasick algorithm and similar algorithms. This is a well-known problem.
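For anyone curious, here is a minimal pure-Python sketch of the Aho–Corasick idea (all names are mine; a real implementation would live in C): build a trie of the patterns with failure links, then find every occurrence of every pattern in one linear pass. Note it reports matches by end position, so mapping it onto str.find’s leftmost-start semantics would still need a tie-breaking rule.

from collections import deque

def build_automaton(patterns):
    # One trie node per pattern prefix: transitions, failure link, outputs.
    trie, fail, out = [{}], [0], [[]]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in trie[node]:
                trie.append({}); fail.append(0); out.append([])
                trie[node][ch] = len(trie) - 1
            node = trie[node][ch]
        out[node].append(pat)
    # Breadth-first pass computes failure links; outputs propagate along them.
    queue = deque(trie[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in trie[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in trie[f]:
                f = fail[f]
            fail[child] = trie[f].get(ch, 0)
            out[child] += out[fail[child]]
    return trie, fail, out

def find_all(s, patterns):
    # Yields (start_index, pattern) for every occurrence in one pass over s.
    trie, fail, out = build_automaton(patterns)
    node = 0
    for i, ch in enumerate(s):
        while node and ch not in trie[node]:
            node = fail[node]
        node = trie[node].get(ch, 0)
        for pat in out[node]:
            yield i - len(pat) + 1, pat

>>> list(find_all("ushers", ("she", "he", "hers")))
[(1, 'she'), (2, 'he'), (2, 'hers')]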


These algorithms are well designed for their particular use case.

Sometimes a brute-force algorithm can be faster than a specialized one for small inputs.


I am -1 on adding tuples to other string methods. str.startswith() and str.endswith() are straightforward.


I would be +1 for a str.split upgrade as well.

If str.find & str.rfind weren’t to return the sep that matched, at least a lower-level utility should be able to return it. Then str.split could be effortlessly implemented:

import sys

def split_many(s, sep: tuple, maxsplit=-1, keepsep=False):
    if maxsplit == -1:
        maxsplit = sys.maxsize
    i, j, n = 0, 0, len(s)
    res = []
    while i < n:
        if j >= maxsplit:
            res.append(s[i:])
            break
        idx, ss = find_many(s, sep, i)  # (index, matched separator)
        if idx == -1:
            res.append(s[i:])
            break
        res.append(s[i:idx])
        if keepsep:
            res.append(ss)
        i = idx + len(ss)
        j += 1
    else:
        # No break: s was empty or ended with a separator, so emit the
        # final (empty) chunk, matching str.split's behaviour.
        res.append(s[i:])
    return res
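The split_many() above leans on a find_many() helper, which is exactly the lower-level utility mentioned: it returns both the index and the separator that matched. A naive pure-Python sketch (leftmost match wins; ties go to the separator listed first in the tuple):

def find_many(s, seps, start=0):
    # Hypothetical helper: (index, sep) of the leftmost match,
    # or (-1, None) if no separator occurs in s[start:].
    best_idx, best_sep = -1, None
    for sep in seps:
        i = s.find(sep, start)
        if i != -1 and (best_idx == -1 or i < best_idx):
            best_idx, best_sep = i, sep
    return best_idx, best_sep

>>> split_many("a__b,,c", ("__", ",,"), keepsep=True)
['a', '__', 'b', ',,', 'c']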

I agree with Paul and Serhiy.

A couple of things that haven’t been mentioned.

According to OO theory, many of the methods on str should have been free functions by the principle that they can be implemented using the public interface. Methods are not just about saving characters.

Therefore, from the standpoint of language idealism, you would import some library of functions and call those instead.

However, I totally agree with the aversion to using re:

In my opinion, what we need is a more human-readable regular expression library. It shouldn’t be in the standard library until it’s stable and popular. I believe there might already be some decent ones. Using that library, you can then build the tools you want (split, find, etc.)

I think this would be a far more productive route than trying to hammer everything you want into the standard library.

Generally, I agree. Either:
a) Forget string methods and concentrate on re
b) Do a final polish on a couple of string methods and then concentrate on re. If the methods mentioned in this thread are in sufficient demand, this could also be beneficial, given that they will inevitably be more performant than any re implementation and will be available close at hand without any extra effort.


While that may be true, in my experience, every language ends up with some combination of ā€œsimpleā€ string methods/functions and a regex API. It’s very hard to imagine that this is a coincidence - rather, it seems to me that like it or not, regular expressions are in some sense the best ā€œgeneral string manipulationā€ approach that anyone has managed to invent over the years. They aren’t perfect (far from it!) but they fill a gap that nothing else manages to.

It’s true that there are likely alternative approaches on PyPI somewhere. If you really don’t like regular expressions, by all means hunt something out and use it. But it’ll probably be a long time before anything better is in the stdlib. (Grep, awk and perl haven’t found anything better yet and they have been around longer than Python).


It’s the same with the bash command syntax. From many angles, it is AWFUL. Wouldn’t it be so much cleaner to have a proper “array of strings” syntax like subprocess.run has? And yet… it’s an incredibly usable syntax. I’ve never yet met a syntax that I would actually prefer, even if (from a purely design-aesthetic standpoint) they are “better”.


Sorry I think you’ve misunderstood. I’m not arguing against regular expressions. I’m simply agreeing with the above comment that Python’s implementation of regular expressions produces strings that are hard to read. I think that’s a pretty uncontroversial viewpoint.

The first web hit for human readable regular expressions in Python gives humre. Something like that would be a lot easier to use than Python’s re.

In an ideal world, you’d have something like that instead of re. And in an even more ideal world, it would be a more powerful parsing library instead.

I don’t think that’s “inevitable”. Regular expressions are compiled, and in a post-JIT world, anything can be fast. It’s very possible that a compiled regular expression is just as fast as any hand-crafted solution. Also, most people don’t care that much about a few percentage points of optimization. If you really care about string parsing to that degree, then you should hand-craft something even better (e.g., see what Kitty did with string processing on the GPU).

Personally, I think it would be better to forget string methods, forget re, and focus on something nicer.

If people want an actual alternative to regular expressions, particularly as regards performance, my push would be for something sscanf-based. It’s less flexible than regex, but it guarantees that it will parse from left to right ONCE, thus ensuring that there’s no exponential backtracking or anything like that. What you’d end up with would be something like this:

data = "1969-07-20: Neil Armstrong walks on the moon"
year, month, day, event = sscanf(data, "%d-%d-%d: %s")

I still wouldn’t use this for simple operations that work better as string methods, but when there are multiple fields to be pulled out of something, it often is worth it.
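For what it’s worth, here is a rough emulation of that hypothetical sscanf() on top of today’s re, purely as an illustration (only %d and %s are handled, and %s greedily takes the rest of the line):

import re

_SPEC = {'%d': r'(\d+)', '%s': r'(.+)'}

def sscanf(s, fmt):
    # Translate the format into a regex: escape literals, expand specifiers.
    pattern = re.escape(fmt)
    for spec, rx in _SPEC.items():
        pattern = pattern.replace(re.escape(spec), rx)
    m = re.fullmatch(pattern, s)
    if m is None:
        raise ValueError('input does not match format')
    # Convert %d fields to int, in the order they appear in fmt.
    specs = re.findall(r'%[ds]', fmt)
    return tuple(int(g) if sp == '%d' else g
                 for sp, g in zip(specs, m.groups()))

>>> sscanf("1969-07-20: Neil Armstrong walks on the moon", "%d-%d-%d: %s")
(1969, 7, 20, 'Neil Armstrong walks on the moon')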

That makes sense. Maybe you could pass an argument like no_backtrack=True to the compile call, and the RE compiler could raise for the catastrophic cases.

If someone really did work on this, the compiler could support LALR or LR(1) or even more ambitious models eventually.

Regexes can be compiled into deterministic finite state machines, giving linear-time matching. (Google’s RE2 does that, I believe.)

But note that we’re focussing on very simple patterns here. Please try to stay on topic.