# 1. Work
I managed to combine all the good tricks I found in the library into one dynamic solution, which seems to perform well and eliminates the hard-coded boundaries for algorithm selection.
1. Instead of 3 different implementations, only one (`horspool_find`) is now called, for both forward and reverse search. It dynamically falls back to the solution with assured linear complexity (`two_way_find`) when it predicts that the latter will perform better (see the sketch after this list).
2. Direction-agnostic logic allows `rfind` to use the exact same code as `find`.
3. Added a special case for `n == m` that uses `memcmp`.
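To illustrate the dispatch, here is a minimal Python-level sketch. The real implementation is C code in `Objects/stringlib/fastsearch.h`; the cost model and threshold below are hypothetical, the Horspool scan is reduced to a naive alignment loop, and `two_way_find` is stubbed with the built-in `str.find`/`str.rfind`:

```python
# Hypothetical sketch of the dynamic dispatch, not the actual C code.

def two_way_find(haystack, needle, reverse):
    # Stand-in for the guaranteed-linear Crochemore-Perrin search.
    return haystack.rfind(needle) if reverse else haystack.find(needle)

def horspool_find(haystack, needle, reverse):
    n, m = len(haystack), len(needle)
    work, budget = 0, 4 * n          # hypothetical cost model
    positions = range(n - m, -1, -1) if reverse else range(n - m + 1)
    for i in positions:
        work += m                    # pessimistic cost of this alignment
        if work > budget:
            # Predicted to degrade: hand off to the linear algorithm.
            return two_way_find(haystack, needle, reverse)
        if haystack[i:i + m] == needle:
            return i
    return -1

def find(haystack, needle, reverse=False):
    n, m = len(haystack), len(needle)
    if m > n:
        return -1
    if m == n:                       # new special case: one memcmp-style compare
        return 0 if haystack == needle else -1
    return horspool_find(haystack, needle, reverse)
```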
# 2. Results
The aggregate impact of this change seems to be net positive. It delivers a non-trivial average performance increase, brings the more advanced search algorithms to reverse search, smooths out the performance "surface", and improves the general code structure and documentation.
Benefits:
1. The performance "surface" is much smoother now. Only one piece of logic can cause a step change, and it is dynamic, as opposed to the many hard-coded step changes of the current logic.
2. The direction-agnostic logic works well and removes the strain of keeping two separate implementations in sync.
3. Benchmarks:
* Average **75%** performance increase of `find` on the artificial shuffled-alphabet benchmark.
* Average **34%** performance increase of `find` for real-file searches across different slice lengths.
* Average **247%** performance increase of `rfind` on the artificial shuffled-alphabet benchmark.
Worth noting:
1. Splitting the two directions (forward and reverse) into two separate implementations would give another 10-30% of performance on the tested benchmarks. However, I think the single direction-agnostic implementation is a good trade-off, given the maintainability advantages of this approach.
2. There are areas and cases where the new algorithm performs worse (see the benchmarks). However, they are either not clustered, or where they are clustered, the performance decrease is not substantial.
# 3. Benchmarks
Each benchmark value is computed as follows:
```python
current_runtime = ...  # run time of the current CPython build
new_runtime = ...      # run time of this PR's build
result = (new_runtime - current_runtime) / min(new_runtime, current_runtime)
```
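A negative value therefore means the new build is faster. For example (numbers made up for illustration):

```python
current_runtime, new_runtime = 1.0, 0.8  # seconds, illustrative numbers
result = (new_runtime - current_runtime) / min(new_runtime, current_runtime)
print(result)  # -0.25, i.e. the new build is 25% faster
```

Dividing by the minimum keeps the scale symmetric: a 2x speedup and a 2x slowdown both report as ±100%.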
### 3.1. Artificial dataset via randomized alphabet.
<details><summary>Case Generation Code</summary>
<p>
```python
import random

# shuffled alphabet with Zipf-distributed letter frequencies
alphabet = 'DHUXYEZQCLFKISBVRGNAMWPTOJ'
zipf = [1/x for x in range(1, 1 + len(alphabet))]

def zipf_string(length, seed):
    letters = random.Random(seed).choices(alphabet, weights=zipf, k=length)
    return ''.join(letters)
NLS = [  # needle lengths
    2, 3, 4, 6,
    8, 12, 16, 24,
    32, 48, 64, 96,
    128, 192, 256, 384,
    500, 1000, 10000, 100_000,
]
HSS = [  # haystack sizes
    500, 750, 1000, 1500,
    2_000, 3_000, 4_000, 6_000,
    8_000, 12_000, 16_000, 24_000,
    32_000, 48_000, 64_000, 96_000,
    1_000_000,
]
def generate_benchmarks():
    output = []
    for m in NLS:
        for n in HSS:
            if n < m:
                continue
            for s in (1, 2, 3):
                seed = (s * n + m) % 1_000_003
                needle = zipf_string(m, seed)
                haystack = zipf_string(n, seed ** 2)
                name = f"needle={m}, haystack={n}, seed={s}"
                output.append((name, needle, haystack))
    # PATH: output directory for the generated benchmark module
    with open(f"{PATH}/_generated.py", 'w') as f:
        print("benches = [", file=f)
        for name, needle, haystack in output:
            print(f"    {(name, needle, haystack)!r},", file=f)
        print("]", file=f)
```
</p>
</details>
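The generated cases can then be timed with a harness along these lines (a hypothetical sketch; the module and runner names are assumptions, not part of the PR):

```python
import timeit
from _generated import benches  # module written by generate_benchmarks()

def run(cases, number=100, repeat=5):
    results = {}
    for name, needle, haystack in cases:
        timer = timeit.Timer(lambda: haystack.find(needle))
        # best-of-repeat, normalized to seconds per call
        results[name] = min(timer.repeat(repeat=repeat, number=number)) / number
    return results

for name, seconds in run(benches).items():
    print(f"{name}: {seconds * 1e9:.0f} ns/call")
```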
<details><summary>1.a. Results. Current vs new `str.find/str.count`.</summary>
<p>
<img width="963" alt="Screenshot 2024-06-27 at 15 06 37" src="https://github.com/python/cpython/assets/3577712/812f7a33-3118-460b-aa88-bc6710cde333">
</p>
</details>
Comparison for `len(haystack) == 1000` for `str.find`. The x-axis is `"{needle_len}:{seed}"`; the upper chart is run time, the lower chart is the percentage difference. It depicts the issue this PR is addressing, i.e. big sub-optimal step changes in performance for small input changes.
<img width="607" alt="Screenshot 2024-06-27 at 14 12 55" src="https://github.com/python/cpython/assets/3577712/7a5ea910-3f61-4fcf-9a74-57791c72d788">
<details><summary>1.b. Results. Current vs new `str.rfind/str.rcount`.</summary>
<p>
<img width="597" alt="Screenshot 2024-06-08 at 23 23 17" src="https://github.com/python/cpython/assets/3577712/d00493a1-c58b-4479-9bd1-8f7d2425e5fa">
</p>
</details>
### 3.2. Search for arbitrary chunks in real files.
<details><summary>Case Generation Code</summary>
<p>
```python
from pathlib import Path

CPYTHON_PATH = Path(".")  # assumed: path to a CPython checkout/build
FILES = {
    "c": (CPYTHON_PATH / "Objects" / "unicodeobject.c").read_text(),
    "py": (CPYTHON_PATH / "Lib" / "_pydecimal.py").read_text(),
    "en": (CPYTHON_PATH / "Doc" / "library" / "stdtypes.rst").read_text(),
    "bin": (CPYTHON_PATH / "python.exe").read_bytes(),
}
MS = [10, 15, 20, 30, 40, 60, 80, 120, 160, 240, 320, 640, 1280]
MR = range(12)
def generate_benchmarks():
    results = dict()
    for file_label, haystack in FILES.items():
        n = len(haystack)
        for m in MS:
            for i in MR:
                # deterministic pseudo-random slice start within the file
                stt = (1_000_003 * i) % (n - m)
                needle = haystack[stt:stt + m]
                results[(m, file_label, i)] = haystack, needle
    return results
```
</p>
</details>
<details><summary>2.a. Results. Current vs new `str.find/str.count`.</summary>
<p>
<img width="735" alt="Screenshot 2024-06-27 at 15 06 54" src="https://github.com/python/cpython/assets/3577712/609500cd-7fd9-4276-9569-2c97114aefba">
</p>
</details>
* Issue: gh-119702