Optimize str.find aka FASTSEARCH

I propose a change in the algorithm selection in Objects/stringlib/fastsearch.h.

Instead of the current 3-clause if statement in that file, I propose modifying adaptive_find so that it adapts well to any situation.
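
For context, the existing dispatch has roughly this shape (a Python paraphrase; the thresholds are quoted from memory and may not match the current C source exactly):

# Python paraphrase of the current 3-clause dispatch in fastsearch.h.
# Thresholds are approximate and from memory; see the C source for exact values.
def select_algorithm(n, m):
    """n = haystack length, m = needle length."""
    if n < 2500 or (m < 100 and n < 30000) or m < 6:
        return 'default_find'   # small problem: two-way preprocessing not worth it
    if m * 3 < n:               # ~33% threshold (the C code uses shifts to avoid overflow)
        return 'two_way_find'   # needle is a small fraction of the haystack
    return 'adaptive_find'      # default_find that switches to two-way if it risks quadratic time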

What has been done:

  1. Constants for both algorithms (default_find and two_way_find) were measured, such as time_per_iteration, initialization_time_per_m, etc. Some of these “constants” were not constant across different cases, while others had very similar values in all cases. For the varying “constants”, extreme (sometimes limiting) values were chosen so that default_find is favoured. (A rough way such a constant could be estimated is sketched just after this list.)
  2. Based on this information, a condition was derived to replace the current check in cpython/Objects/stringlib/fastsearch.h.
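
For illustration only, a per-character cost constant of the kind mentioned in item 1 could be estimated along these lines (a hypothetical helper, not the measurement code actually used):

import timeit

# Hypothetical helper: rough estimate of the per-character scanning cost in the
# no-match case (a 2-character needle avoids the single-character fast path).
def time_per_iteration(n=1_000_000, repeats=20):
    s = '_' * n
    t = min(timeit.repeat(lambda: s.find('.!'), number=1, repeat=repeats))
    return t / n * 1e9   # nanoseconds per haystack character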

Results:

  1. Improvements over current performance in the targeted areas are up to 20x. These improvements span all 3 clauses of the if statement referred to above.
  2. No significant performance decrease has been detected for any of the test cases.
  3. The new logic smooths out performance and eliminates the spikes that currently appear for small changes in the inputs. These spikes can sometimes be fairly significant, which makes it difficult to optimize methods that build on str.find.

A couple of requests:

  1. If anyone remembers which cases were important when the “bells and whistles” (as per the fastsearch.h docstring) were introduced, I would appreciate that information.
  2. Also, if anyone wants me to add other cases that are important to them, let me know.

Benchmark

Notes:

  1. There is a constant overhead of ~160 ns.
  2. Some small optimizations are still to be made, but nothing that would change the results below by much.

Set-up

N = 1_000       # haystack length (varied across the tables below)
m = 10          # needle length (varied across the rows below)
best = True     # False gives the worst case

s = '_' * N
if best:        # best case: the needle's filler character never occurs in the haystack
    ch = '.'
else:           # worst case: the needle's filler character matches the haystack
    ch = '_'
ss = ch * (m - 2) + '1' + ch
def test_func():
    s.find(ss)
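
The numbers below were presumably collected with a harness along these lines (hypothetical; the actual driver isn't shown):

import timeit

# Hypothetical harness: best of several repeats, reported in ns per call.
def bench(func, number=1_000, repeat=5):
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number * 1e9

# e.g. bench(test_func) after running the set-up above for a given N, m and best/worst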

Results

macOS-11.7.10-x86_64-i386-64bit-Mach-O | 3.14.0a0
------------------------------------------------------
N=10         Current         Proposed
m    (ns) | best | worst | best | worst
------------------------------------------------------
        2 |  286 |   366 |  224 |   221
        3 |  224 |   225 |  211 |   222
        4 |  218 |   249 |  212 |   226
        5 |  238 |   225 |  248 |   231
        6 |  285 |   232 |  213 |   227
        7 |  222 |   258 |  215 |   250
        8 |  360 |   245 |  242 |   237
        9 |  222 |   248 |  215 |   224
------------------------------------------------------
N=100          Current         Proposed
m    (ns) | best | worst | best | worst
------------------------------------------------------
        2 |  376 |   398 |  314 |   468
        3 |  317 |   422 |  304 |   448
        4 |  350 |   444 |  311 |   417
        5 |  269 |   372 |  251 |   461
        6 |  246 |   409 |  236 |   424
        9 |  235 |   456 |  229 |   476
       10 |  232 |   491 |  226 |   439
       30 |  238 |   825 |  236 |   687
       40 |  252 | 1,061 |  452 |   663
       90 |  279 |   620 |  377 |   726
------------------------------------------------------
N=1K              Current           Proposed
m    (ns) |  best |  worst |  best | worst
------------------------------------------------------
        2 | 1,098 |  1,525 | 1,094 | 1,798
        3 |   872 |  1,725 |   856 | 1,877
        4 |   769 |  1,765 |   733 | 2,245
        5 |   657 |  1,871 |   849 | 2,289
        6 |   597 |  1,835 |   617 | 2,081
        9 |   491 |  2,845 |   484 | 2,097
       10 |   467 |  3,144 |   453 | 2,050
      100 |   296 | 23,885 |   278 | 2,387
      300 |   403 | 47,235 |   378 | 2,828
      400 |   441 | 57,311 |   434 | 3,105
      900 |   884 | 19,768 |   720 | 4,250
------------------------------------------------------
N=10K              Current           Proposed
m    (ns) |   best |     worst |  best |  worst
------------------------------------------------------
        2 |  9,262 |    13,309 | 9,023 | 15,964
        3 |  6,951 |    13,907 | 6,889 | 17,373
        4 |  5,574 |    15,146 | 5,774 | 18,952
        5 |  4,789 |    16,044 | 4,989 | 20,035
        6 |  3,845 |    16,172 | 4,482 | 18,943
        9 |  2,904 |    26,679 | 3,036 | 19,452
       10 |  2,677 |    28,419 | 2,570 | 18,721
      100 |    894 |    18,799 |   544 | 18,462
     1000 |  3,459 |    21,939 |   753 | 20,564
     3000 | 10,670 |    23,477 | 1,821 | 24,998
     4000 |  2,827 |    28,243 | 2,389 | 27,754
     9000 |  4,935 | 1,722,265 | 4,973 | 38,100
------------------------------------------------------
N=100K             Current         Proposed
m    (ns) |    best |    worst |    best |    worst
------------------------------------------------------
        2 |  91,451 |  136,300 |  96,473 |  166,890
        3 |  73,523 |  149,025 |  71,222 |  182,560
        4 |  57,877 |  160,103 |  59,756 |  185,634
        5 |  50,188 |  163,632 |  46,594 |  183,795
        6 |  57,090 |  198,212 |  41,886 |  194,464
        9 |  36,564 |  185,232 |  30,860 |  195,669
       10 |  33,569 |  199,484 |  26,517 |  194,941
      100 |   6,013 |  186,479 |   4,410 |  190,645
     1000 |   4,994 |  205,996 |   1,105 |  191,947
    10000 |  40,565 |  214,843 |   5,619 |  213,163
    30000 | 105,320 |  232,888 |  17,065 |  267,391
    40000 |  21,611 |  274,259 |  23,156 |  282,290
    90000 |  55,079 |  402,467 |  47,837 |  372,822
-----------------------------------------------------
N=1M                 Current          Proposed
m    (ns) |    best |     worst |    best |     worst
------------------------------------------------------
        2 | 941,007 | 1,448,639 | 954,215 | 1,719,287
        3 | 717,159 | 1,566,532 | 722,924 | 1,666,423
        4 | 568,244 | 1,652,256 | 552,982 | 1,870,495
        5 | 489,075 | 1,617,497 | 446,916 | 1,871,483
        6 | 561,883 | 2,024,259 | 406,523 | 1,841,601
        9 | 382,779 | 1,992,450 | 288,651 | 1,805,480
       10 | 343,236 | 1,988,768 | 256,256 | 1,822,459
      100 |  50,959 | 1,903,812 |  42,702 | 1,884,445
     1000 |  34,911 | 1,879,474 |   7,173 | 1,807,063
    10000 |  68,449 | 1,868,857 |   5,844 | 1,822,787
   100000 | 339,663 | 2,060,019 |  53,211 | 2,016,727
   300000 | 951,868 | 2,208,496 | 158,703 | 2,446,954
   400000 | 212,162 | 2,681,579 | 215,990 | 2,668,347
   900000 | 495,428 | 3,989,225 | 479,384 | 3,810,628

How does this affect the pyperformance benchmarks?

https://pyperformance.readthedocs.io/


Hi @dg-pb,

I wrote the line in question, and I’d be happy to help look at this more in the coming weeks.

In your investigations, I would suggest trying many different test cases; it’s unlikely for there to be any “free lunch” here. Some strategies will help some cases at the cost of others. You can see some discussion at this issue, and this PR is where those thresholds originated. The benchmarking gist is linked at that PR; you can look at some of the benchmark scripts I used there.

The bulk of my tuning was on randomly-generated text and patterns, using a Zipf distribution on characters, which should hopefully roughly approximate a lot of kinds of texts. In that PR there’s a colorful table (copied below for convenience) that shows the performance ratio of the two-way algorithm to the default algorithm for various haystack and needle lengths, and several random seeds per length combination. The cutoffs were chosen based on the table.

In particular, I saw that when the needle was a large percentage of the length of the haystack, the expensive two-way preprocessing wasn’t typically worth it, but is necessary to avoid quadratic time in the worst case. This is why the adaptive algorithm exists: to try the default fast algorithm first, but to keep track of how many characters we match and switch over to two-way in case we’re at the risk of becoming quadratic. The default algorithm stuck around in my effort not to slow down the other cases at all.
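
In pseudocode, that adaptive strategy looks roughly like this (a simplified Python sketch, not the actual C implementation in fastsearch.h):

# Simplified sketch of the adaptive idea described above (not the real C code).
def adaptive_find(haystack, needle, switch_factor=10):
    n, m = len(haystack), len(needle)
    chars_matched = 0
    for i in range(n - m + 1):
        j = 0
        while j < m and haystack[i + j] == needle[j]:
            j += 1
        if j == m:
            return i                              # full match found
        chars_matched += j
        if chars_matched >= m * switch_factor:
            # We've already done O(m) work without finding the needle, so the
            # O(m) preprocessing of the two-way algorithm is likely worth it.
            # str.find stands in for two_way_find here.
            return haystack.find(needle, i + 1)
    return -1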

To prevent overfitting for that Zipf case, in that benchmarking gist there’s also a script I used to look at some sample strings from a .py file, a .rst file, a .c file, and an .exe file, and make sure some sample of “real-world” string searches don’t get hurt too bad. I’m more concerned with a greater number of random and real-world-ish cases than with searching for "a"*1_000 in "b"*1_000_000 and the like, as long as nothing ever gets quadratic. We can further discuss the kinds of cases that are hurt or helped by some of the tricks in fastsearch.h, but the main thing I’d suggest is just to try lots of different cases.
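
For instance, Zipf-ish random test strings of the kind described above can be generated along these lines (an illustrative sketch only, not the actual script from the gist):

import random
import string

# Illustrative generator of Zipf-distributed random text (not the gist's script).
def zipf_text(length, alphabet=string.ascii_lowercase + ' ', seed=0):
    rng = random.Random(seed)
    weights = [1.0 / rank for rank in range(1, len(alphabet) + 1)]
    return ''.join(rng.choices(alphabet, weights=weights, k=length))

haystack = zipf_text(100_000)
needle = zipf_text(20, seed=1)
print(haystack.find(needle))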

Best,
Dennis


Thank you, will look into it. Will probably use the same table you shared for comparison.

It does; the “worst case” in my set-up always goes maximally quadratic:

HS = '_' * n
needle = '_' * m + '1' + '_'

default_find checks the last character of the window first, then loops from the first character; here the last character always matches and the loop runs up to the '1' before failing, so every alignment costs about m comparisons. Hence the difference between the best and worst cases. I think these two are the extremes (disregarding the nuances of two-way find).

EDIT:
The difference could exist for other reasons, but I compiled `default_find` separately and explicitly ensured that it goes quadratic.

Running the pyperformance suite now, but I doubt it will have any significant impact; I suspect it will be well within 1 standard deviation.

“disregarding nuances of two-way find”

That’s what I meant though: the thresholds ensure that default_find (which is worst-case O(n*m)) isn’t used in the worst cases, so overall str.find has runtime O(n + m).

I went through the implementation and analysed the decisions that were made. So far it seems that my solution handles all of them implicitly.

It essentially tracks the running ratio of hits to iterations. From that it knows how long it would take to finish with default_find at that hit rate, and how long it would take to initialize and run two_way_find; if the threshold is crossed, it switches to two_way_find.

This approach improves a lot on the hard-coded short sizes, since initializing two_way_find for short needles is actually not that expensive, while the problem, although small, may be getting a lot of false hits.

At the same time, for large haystacks and small needles there are cases with no hits at all, where the current code still runs two_way_find.

So far, at least, it seems to make a well-calculated decision.

The main cost is that for large haystacks with large needles it needs to do at least one iteration of default_find, which in very bad cases adds up to 30%. But I don’t think such cases are common in real-world problems.
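
A stripped-down sketch of that decision rule might look like this (hypothetical names and constants, not the actual patch):

# Hypothetical sketch of the cost-model heuristic described above; the names
# and timing constants are illustrative, not the values used in the patch.
def should_switch_to_two_way(false_hits, iterations, n, m,
                             cost_per_iter=1.0, cost_per_hit=4.0,
                             two_way_init_per_m=3.0, two_way_cost_per_iter=1.5):
    """Called periodically from the default_find loop."""
    hit_ratio = false_hits / iterations      # average false hits per window so far
    remaining = n - iterations               # haystack positions left to scan
    default_cost = remaining * (cost_per_iter + hit_ratio * cost_per_hit)
    two_way_cost = m * two_way_init_per_m + remaining * two_way_cost_per_iter
    return two_way_cost < default_cost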

The approach sounds plausible in the abstract. Just so we can talk more concretely, you’re welcome to open a draft PR and request me for review, though it might take a while to get through it.

I am halfway through the benchmarks. I will post the results here, and if they look good, I will open a PR.

Benchmark:

Worse than I had hoped, but still pretty good for a first cut, especially given that it wasn’t calibrated on this data.

I have investigated the poorly performing area, and it is behaving as it should.

I had taken a lower bound on one variable, hoping for the best, but apparently I need to account for it properly.


The standard pyperf suite shows little difference: str_fastsearch: pyperf_standard · GitHub

Average 16% improvement. What suffers most are the cases where no skips are possible, due to the additional overhead. However, where things get more complex, the benefits begin to kick in.

The red area should ideally be covered by the naive algorithm, which is implemented at memchr efficiency.

Also, note that there are a number of things that will still make it faster and better.

Alpha version results are as follows:

Latest results - average 65% improvement:
