Add tuple support to more str functions

find_oneliner will always be slower than find3, for two reasons:

  1. find3 has no extra function calls or loops. It is a hard-coded two-substring search with no flexibility.
  2. It includes an extra optimization.

find3 is the lower bound for this approach. There is no need to include something that will always be slower.
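For reference, a hard-coded two-separator search in that spirit might look like this (a sketch of mine; the actual find3 may differ):

```python
def find_two(p):
    # Hard-coded search for '\' and '/': one find() call each,
    # with the second call bounded by the first hit so it can
    # stop scanning early.
    i = p.find("\\")
    j = p.find("/", 0, None if i == -1 else i)
    return j if j != -1 else i
```

There is nothing to loop over and nothing to dispatch, which is why a generic version carries unavoidable overhead by comparison.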


You can’t benchmark a specialized algorithm like find3 against a general-purpose algorithm.

I introduced find_oneliner only for completeness.

You can easily make it general-purpose (so it isn’t as specialised as you think):

def find3(p):
    seps = r"\/"  # backslash and forward slash
    i = -1
    for sep in seps:
        # Search only up to the best index found so far.
        new_i = p.find(sep, 0, None if i == -1 else i)
        if new_i != -1:
            i = new_i
    return i

I’ll benchmark against that; it will make the performance difference a bit larger.

One way to do this is to test on a wide range of examples.

Another way is to find extreme corner cases between which lies the volume of all other cases.

If you have a case which captures some aspect not currently covered by the benchmark, I would be happy to add it.

Ubuntu benchmark:

find best case
1000000 loops, best of 5: 303 nsec per loop
1000000 loops, best of 5: 357 nsec per loop
1000000 loops, best of 5: 260 nsec per loop
2000000 loops, best of 5: 129 nsec per loop
find mixed case
1000000 loops, best of 5: 301 nsec per loop
1000000 loops, best of 5: 370 nsec per loop
10000 loops, best of 5: 23.9 usec per loop
1000000 loops, best of 5: 240 nsec per loop
find worst case
5 loops, best of 5: 47.2 msec per loop
20 loops, best of 5: 17.1 msec per loop
5000 loops, best of 5: 45.5 usec per loop
10000 loops, best of 5: 34.2 usec per loop
rfind best case
1000000 loops, best of 5: 209 nsec per loop
20 loops, best of 5: 13.7 msec per loop
1000000 loops, best of 5: 265 nsec per loop
2000000 loops, best of 5: 163 nsec per loop
rfind mixed case
1000000 loops, best of 5: 217 nsec per loop
20 loops, best of 5: 13.6 msec per loop
10000 loops, best of 5: 23.8 usec per loop
1000000 loops, best of 5: 275 nsec per loop
rfind worst case
5 loops, best of 5: 85.9 msec per loop
10 loops, best of 5: 37.8 msec per loop
5000 loops, best of 5: 44.6 usec per loop
10000 loops, best of 5: 35.4 usec per loop

Why don’t you just return here? What exactly are you trying to benchmark?

I was talking about different language constructs.


My only concern is the worst case. find shouldn’t become a performance trap; it shouldn’t be naive.
Take a look at the use case above. It would make perfect sense to use string.find(english_words) there.

Because there can be an earlier match: 'abc'.rfind(tuple('abc')).
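Concretely, returning from find3’s loop on the first hit gives the wrong answer whenever a separator listed later in seps occurs earlier in the string. A minimal sketch (function names are mine):

```python
def find_early_return(p, seps):
    # Buggy: returns on the first separator that matches at all,
    # even if a separator listed later occurs earlier in p.
    for sep in seps:
        i = p.find(sep)
        if i != -1:
            return i
    return -1

def find_lowest(p, seps):
    # Correct: keep the running minimum, narrowing the search window.
    i = -1
    for sep in seps:
        new_i = p.find(sep, 0, None if i == -1 else i)
        if new_i != -1:
            i = new_i
    return i

print(find_early_return("a/b\\c", "\\/"))  # 3 -- wrong, '/' is at 1
print(find_lowest("a/b\\c", "\\/"))        # 1 -- correct
```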


Well, we would have to compare it against the alternatives. If they’re slow too, that’s no argument against this function.

Apologies, I quoted wrong text there.

Thank you, you did find another extreme corner case; it is a valuable test case.


I’ve already provided the alternative in the linked thread. But that’s irrelevant; simply put, find should handle these cases if you’re going to allow multiple substrings.

It is relevant: we’re comparing against alternatives. If I have no alternatives to compare it with, we can’t say how much faster or slower it is, only that it appears inefficient, which is subjective.
I’m not saying we shouldn’t benchmark this, but I need those alternatives.

This problem is well-known. It has edge cases like the thread I linked, where brute force is going to win.

I repeat, if find were to support multiple substrings as an input, it should handle these edge cases. We can’t just say, ā€œLook, find is naive, don’t use it.ā€

Could you point to the exact problem you are referring to? The problem in the linked thread just doesn’t match the specification of what find does here. In other words, find aims to find the lowest index of any of the substrings, while the problem you are referring to only needs to exit once ANY substring is found.

I would be more than happy to include the case you have in mind if you wrote it down simply and clearly here. If the problem is a subset of what find does (which is the case for the problem in the linked thread), then you need to provide a solution against which to benchmark.
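One readily available alternative to benchmark against is re with an alternation of escaped literals; it also returns the lowest matching index. A sketch (the function name is mine):

```python
import re

def find_any(text, substrings):
    # Build an alternation of the literal substrings; re.escape keeps
    # metacharacters such as '\' and '.' literal. re.search scans left
    # to right, so the first match position is the lowest index.
    pattern = re.compile("|".join(map(re.escape, substrings)))
    m = pattern.search(text)
    return m.start() if m else -1
```

For large tuples the alternation itself can become the bottleneck, so this is only a baseline, not a proven winner.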


Don’t you need all matches in that case? Not just the first one? Or am I thinking of something else?

The OP tries to write a function contains_english_word(text: str) -> bool.
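For reference, the naive shape of that function (the word list here is a placeholder):

```python
def contains_english_word(text: str,
                          english_words=("cat", "dog", "hat")) -> bool:
    # Naive: one C-level substring scan per word. With tuple support
    # this could become text.find(english_words) != -1.
    return any(word in text for word in english_words)
```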

I have not read all of the thread. @elis.byberi was participating in it, maybe he has some other problem in mind that was discussed there?

This is true for the proposed find as well, isn’t it? You can easily return the index though.

Yes, thank you, you are right. It is a valid case.

Can you provide a method which solves the problem efficiently so I can benchmark against it?

Also, do you have exact inputs in mind that would be good to test it with?

You (we) should work on creating decent test cases first.

  1. A very large string and a small tuple.
  2. A small string and a large tuple.
  3. A very large string and a large tuple.

The case in the linked thread involves a small string and a very large tuple.
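Those three shapes can be generated mechanically; a sketch with arbitrary sizes:

```python
# Hypothetical generators for the three shapes above; sizes are mine.
large_string = "a" * 1_000_000 + "/"   # very large string, hit at the end
small_string = "a/b"                   # small string
small_tuple = ("/", "\\")              # small tuple
large_tuple = tuple("a" * n + "/" for n in range(1000))  # large tuple

cases = [
    (large_string, small_tuple),   # 1. very large string, small tuple
    (small_string, large_tuple),   # 2. small string, large tuple
    (large_string, large_tuple),   # 3. very large string, large tuple
]
```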


Here’s a brute force algorithm I was brainstorming. It might need tests.
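One plausible shape for such a brute force search (my sketch, not necessarily the code that was posted):

```python
def find_brute(text, substrings):
    # Brute force: at each position, test every substring directly.
    # O(len(text) * total substring length) in the worst case.
    for i in range(len(text)):
        for sub in substrings:
            if text.startswith(sub, i):
                return i
    return -1
```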

I have reconsidered. Although it might seem that those cases are not covered by the current benchmark, they actually are. There are three cases when searching for two substrings in a very long string:

  1. Both substrings are near the start of the long string. This is effectively the same as a short string and a large tuple; although the tuple here is not that large, it captures the nuance well, and a longer tuple would just scale the cost linearly.
  2. One substring is hit early, while the other is at the back of the very long string. This tests early termination. Again, more items in the tuple would just scale the cost linearly.
  3. There are no hits in the very long string, so the algorithm needs to run a full scan. Again, more substrings in the tuple would just scale the cost linearly.

So the benchmark suite is complete. And the alternative methods are chosen so that for each benchmark case there is at least one alternative which in theory should deal with it well.

If you have a specific case which you can write down (without imports, please) and an alternative method against which to benchmark, I can run it.

But the method needs to be efficient for the benchmark to have actual value. An algorithm written in Python which uses loops has no chance of outperforming the current implementation.
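The gap is easy to demonstrate with timeit (a sketch; absolute numbers vary by machine, so none are claimed here):

```python
import timeit

haystack = "a" * 10_000 + "/"

def py_loop_find(s, ch):
    # Pure-Python character scan: one bytecode dispatch per character.
    for i, c in enumerate(s):
        if c == ch:
            return i
    return -1

# str.find does the same scan in C, typically orders of magnitude faster.
t_py = timeit.timeit(lambda: py_loop_find(haystack, "/"), number=100)
t_c = timeit.timeit(lambda: haystack.find("/"), number=100)
print(f"python loop: {t_py:.4f}s, str.find: {t_c:.4f}s")
```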

But I would require code which I can run to do this.

IIUC, this should be correct:

Chunking may help in mixed cases, but the worst-case scenario would still be the same.
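For concreteness, a chunked scan might look like this (names and chunk size are mine). Windows overlap by the longest needle length minus one so a match straddling a boundary is still seen, but a no-hit input still forces a full scan, which is why the worst case doesn’t change:

```python
def find_chunked(text, substrings, chunk_size=4096):
    # Each window extends into the next chunk by `overlap` characters,
    # so a match that starts near the end of a chunk is fully visible.
    overlap = max(map(len, substrings)) - 1
    for start in range(0, len(text), chunk_size):
        window = text[start:start + chunk_size + overlap]
        best = -1
        for sub in substrings:
            i = window.find(sub)
            # Matches starting inside the overlap belong to the next
            # window; skipping them here keeps the result leftmost.
            if i != -1 and i < chunk_size and (best == -1 or i < best):
                best = i
        if best != -1:
            return start + best
    return -1
```

The potential win is in mixed cases: an early hit lets the scan stop after a few windows instead of touching the whole string once per substring.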


I’m out of free time, so I can’t give any ETA for providing algorithms.