Sorting adaptivity improvement

A possibility.

Add a Cartesian Tree Sort step to sorted.

Along the lines of:

while start < stop:
    n = _count_run(seq, start, stop)
    # If the natural run is short, try extending it with Cartesian tree
    # sort - but only while its running score says it is paying off.
    if n < min_run and cartesian_score > threshold:
        n = _cartesiansort(seq, start, stop, n, dynamic_early_exit_param)
    if n < min_run:
        cartesian_score -= penalty    # still short: demerit the approach
        n = _binarysort(seq, start, stop, n)
    ...

_cartesiansort needs 3 x min_run extra memory.
dynamic_early_exit_param can be tuned to trade extra cost for larger gains.

Decreases the comparison count on medium-to-highly pre-sorted data, at a slight cost in some other cases.

The main improvement is for random walks: up to a 50% decrease in comparisons.
It also captures a couple of more exotic datasets.

3 Likes

Let me assure people that if you can’t make head or tail of that extremely “busy” table, you’re normal :wink:. I’ll sketch some high-order bits on @dg-pb’s behalf:

  • Python’s sort is extremely effective on stretches of already-sorted (whether increasing or decreasing) data.

  • It’s also one of the most effective approaches known for exploiting “large scale” partial order (even if nowhere close to being already-sorted).

  • But in between it effectively relies on just binary insertion sort, which is aimed not at peak speed but at best worst-case behavior.

  • Cartesian tree sort is a novel kind of binary-tree-based sort, which is very effective (fast) at sorting data with “few inversions”: data that’s “close to” being already sorted, and again whether in increasing or decreasing order. The idea here is apparently to try to use that instead of binary insertion sort for the “in between” area. Which has potential!

  • But plain old linear insertion sort is also very effective at sorting data with few inversions, but only in the increasing direction (decreasing data is its worst case). A fancier variant could presumably detect whether a region is generally going “up” or “down”, and, if the latter, put the data in non-increasing order instead, reversing the slice at the end. (A sketch of such a variant follows this list.)

  • CPython’s internal count_run() already figures out the difference between non-increasing and non-decreasing runs, so another idea is to make that fancier, trying to extend a natural run via increasing or decreasing linear insertion sort steps too.

  • All such things are expected to lose (slow things down at least slightly) if the input is randomly ordered (meaning uniformly random - skewed randomness is exploitable). “The trick” is to cut them off fast if they’re not obviously paying off. When they are paying off, they can win big.
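
To make the “fancier variant” concrete, here is a minimal sketch (my own illustration, with assumed names and a first-pair direction guess - not CPython code):

def updown_insertion_sort(seq, lo, hi):
    # Guess the direction from the first pair, insertion sort in that
    # direction, and reverse at the end if we sorted "down".
    if hi - lo < 2:
        return
    if seq[lo + 1] < seq[lo]:        # looks like it's heading down
        for i in range(lo + 1, hi):
            v = seq[i]
            j = i
            # '<=' slots v in front of equal elements, so the final
            # reversal restores their original order (stability).
            while j > lo and seq[j - 1] <= v:
                seq[j] = seq[j - 1]
                j -= 1
            seq[j] = v
        seq[lo:hi] = seq[lo:hi][::-1]    # undo the direction flip
    else:                                # heading up: plain stable version
        for i in range(lo + 1, hi):
            v = seq[i]
            j = i
            while j > lo and v < seq[j - 1]:
                seq[j] = seq[j - 1]
                j -= 1
            seq[j] = v

Either branch sorts correctly even when the first-pair guess is wrong; the guess only decides which direction gets the cheap linear scans.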

6 Likes

Thank you Tim.


Yes, insertionsort is another option and its basic version is very simple.

However, from my experiments it is much harder to calibrate for a good early exit, and to tell whether it is doing well.

E.g. in pdqsort, it sorts until N values are found out of place.

The issue is that this is not a good indicator: five values that are very close to where they belong might be much better than one that belongs in the middle.

Then adding efficient insertion at both ends is further complexity.
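
To illustrate the kind of rule being discussed, here is a minimal sketch of a partial insertion sort with an early-exit budget in the pdqsort spirit (the budget and details are assumptions, not a faithful port of the C++ code):

def partial_insertion_sort(seq, limit=8):
    # Insertion sort that gives up once too many elements are found
    # out of place. Returns True if seq ended up fully sorted.
    displaced = 0
    for i in range(1, len(seq)):
        v = seq[i]
        j = i
        while j > 0 and v < seq[j - 1]:
            seq[j] = seq[j - 1]
            j -= 1
        seq[j] = v
        if j != i:
            displaced += 1
            if displaced > limit:
                return False   # too much disorder: bail out early
    return True

The weakness described above is visible here: displaced is blind to how far each element had to travel.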


Not to say that Cartesian Tree Sort is simpler - it can be quite involved, but it has advantages:

  1. It is much easier to come up with a simple stopping rule. My last one is very simple and does well: “stop if the number of nodes with 2 children is >= N”. But it can be improved further. (A sketch follows this list.)
  2. Handles both ends - see “trumpet” dataset
  3. Does better on adversarial data, thus lower penalties on suboptimal applications
  4. It has scope for further development.
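
For readers who haven’t met it: a minimal sketch of Cartesian tree sort (my own illustration, not the implementation behind these results). Build a min-oriented Cartesian tree with the standard O(n) stack algorithm, then repeatedly pop the smallest frontier node; the two-children count used by the stopping rule in point 1 falls out of the build for free:

import heapq

def cartesian_tree_sort(seq):
    if not seq:
        return []
    n = len(seq)
    left = [-1] * n    # index of each node's left child, -1 if none
    right = [-1] * n   # index of each node's right child
    stack = []         # indices on the tree's right spine
    for i in range(n):
        last = -1
        while stack and seq[i] < seq[stack[-1]]:
            last = stack.pop()
        left[i] = last
        if stack:
            right[stack[-1]] = i
        stack.append(i)
    root = stack[0]

    # Nodes with two children measure how "bushy" (disordered) the tree
    # is - the quantity the stopping rule in point 1 counts. An early-
    # exit version would track this during the build and bail out.
    two_children = sum(left[i] >= 0 and right[i] >= 0 for i in range(n))

    # Pop in heap order: the root is the minimum; emitting a node makes
    # its children candidates. On nearly sorted input the frontier stays
    # tiny, so few comparisons are spent.
    out = []
    frontier = [(seq[root], root)]    # (value, index); index breaks ties
    while frontier:
        v, i = heapq.heappop(frontier)
        out.append(v)
        for c in (left[i], right[i]):
            if c >= 0:
                heapq.heappush(frontier, (seq[c], c))
    return out

On already-sorted input the tree degenerates to a single spine, the frontier never holds more than one node, and the whole thing costs about n - 1 comparisons.
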
1 Like

That said, I think a much simpler and better solution would be to add adaptivity to binarysort.

The main advantage of Cartesian Tree sort is data movement; however, for small runs like these it doesn’t matter.

Thus, with a simple modification to binarysort plus dynamic switching:

The benefit naturally decreases as data size increases - a larger and larger portion of the sorting is done by the merge functions.

1 Like

Number of comparisons for data size 64

[mono inc, mono dec, random, presorted-ish inc]
insertionsort      : [ 63, 2016, 1078, 143]
mergesort          : [192,  192,  308, 214]
binarysort         : [264,  321,  298, 276]
sorted             : [ 63,   63,  304, 240]

abinarysort        : [ 63,   63,  338, 163]
ctreesort          : [ 63,   63,  367, 173]
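
For reproducibility, comparison counts like these are typically gathered with a counting wrapper; a minimal sketch (my own, not necessarily the harness used here):

import random

class Counted:
    # Wrap a value and count __lt__ calls; sorting a list of these
    # reveals how many comparisons an algorithm performs.
    __slots__ = ('v',)
    count = 0
    def __init__(self, v):
        self.v = v
    def __lt__(self, other):
        Counted.count += 1
        return self.v < other.v

random.seed(0)
data = [Counted(random.random()) for _ in range(64)]
Counted.count = 0
sorted(data)
print(Counted.count)   # comparisons sorted() spent on this input
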
1 Like

Things are complicated, alas.

There are potentially log n passes over the data, but since sorting goes “left to right” just once at the top level, it’s easy to miss that finding runs accounts for only one of those passes (distributed over time) - and that’s the only pass in which binarysort is used. The more passes, the less it can matter in context.

Then there’s the fact that the number of compares matters less now than when the sort was first written: the sort now specializes on type-homogeneous lists (very common), and in some cases (like floats and small ints) uses custom C comparison functions that are much faster than anything float.__lt__ and int.__lt__ can achieve even at the C API level. Extra code to cut the number of compares can actually backfire now (a whole lot so for floats - although cutting the number of compares that call out to a user-defined Python __lt__ is still hugely beneficial).
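
The effect of that type-homogeneity specialization is easy to observe from pure Python (numbers vary by machine and build; the single stray int is just an assumed way to defeat the all-float fast path):

from random import random
from timeit import timeit

floats = [random() for _ in range(100_000)]
mixed = floats[:-1] + [0]   # one int disables the all-float fast path

# sorted() copies first, so every trial sees the same unsorted input.
t_float = timeit(lambda: sorted(floats), number=50)
t_mixed = timeit(lambda: sorted(mixed), number=50)
print(f"all floats: {t_float:.2f}s   one int mixed in: {t_mixed:.2f}s")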

Data movement doesn’t appear to matter much at all anymore. In context, these pointers easily fit in a corner of L1 cache on modern boxes, and moving pointers in L1 cache is blazing fast.

So, alas, what’s needed to actually “sell” a putative improvement isn’t theoretical improvements, or counting data movement or # of compares, but demonstrating actual speed improvements in plausible cases.

Although I don’t always follow that myself :wink: For example, @Stefan2’s new way of picking a variable MINRUN was so achingly principled and effective “in theory” that I checked it in despite not measuring any speed difference. It replaced a hacky heuristic with something provably well behaved in all cases.
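
For context, the hacky heuristic being replaced was the long-documented minrun rule (a Python transliteration of CPython's old merge_compute_minrun, per listsort.txt):

def merge_compute_minrun(n):
    # Take the top 6 bits of n, and add 1 if any remaining bits are set.
    # Aims for n / minrun to be near, but no larger than, a power of 2,
    # so the final merges are balanced.
    r = 0
    while n >= 64:
        r |= n & 1
        n >>= 1
    return n + r

It usually gets close to balanced merges; the replacement achieves that provably, in all cases.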

But that’s very rare. I usually don’t bother unless there’s at least a 10% win in “important” cases, without significantly slowing any important cases.

That’s not easy either. Speed differences can (& do) vary a lot across HW, compiler, compiler releases, and compiler options used.

So it can be a long, hard slog. I encourage you to open a PR with your best attempt so far, and solicit feedback there on actual “before and after” timings on various platforms and workloads. But that can be hard to get too.

But that’s enough good news for one post :wink:.

1 Like

This is why I am certain now that Adaptive Binary beats Cartesian Tree for this.
It does even a bit better on adaptivity, and has minimal additional complexity.
It is now much much simpler than the one I posted earlier:

  1. No arrays
  2. No additional hefty operations
  3. Just a couple of extra running variables
  4. And one if-else statement before going into binary search.

The only advantage of Cartesian Tree is data movement. If runs were at least size 200, it might warrant further investigation, but with the status quo Adaptive Binary is a clear winner.

Lists or Tuples? We were wrapping integers in lists to match the performance precisely because sorted doesn’t specialise them?

In this case I think it will not backfire.
The overhead is minimal.

And although compares don’t matter much for specialized comparisons, for binary sort this reduces not only the comparison count but all the operations that come with each comparison - essentially the whole body of one iteration. And although each comparison weighs less, the other parts start weighing more, so the benefits in the specialised and non-specialised cases should balance out to a certain degree.

        # Standard binary search for v's insertion point in seq[idx:end]:
        while idx < end:
            mid = (idx + end) // 2
            if v < seq[mid]:
                end = mid
            else:
                idx = mid + 1
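
Filling in context around that loop, here is a hedged guess at the whole routine, including the “one if-else statement before going into binary search” from the list above (the probe-the-top adaptive step is my assumption; dg-pb’s actual version is in the gist linked later):

def adaptive_binary_insertion(seq, start, end, nsorted):
    # Sort seq[start:end], given that seq[start:start+nsorted] is
    # already sorted. Returns the comparison count so a caller can
    # drive a dynamic switch-off like the one in the next post.
    nsorted = max(nsorted, 1)      # a single element is trivially sorted
    ncmp = 0
    for i in range(start + nsorted, end):
        v = seq[i]
        # Assumed adaptive step: on nearly sorted data most elements
        # belong at the top, so spend one comparison on the current
        # maximum before paying for a full binary search.
        ncmp += 1
        if not (v < seq[i - 1]):
            continue                       # already in place
        idx, hi = start, i - 1             # insertion point is <= i - 1
        while idx < hi:
            mid = (idx + hi) // 2
            ncmp += 1
            if v < seq[mid]:
                hi = mid
            else:
                idx = mid + 1
        seq[idx + 1:i + 1] = seq[idx:i]    # shift right, drop v in
        seq[idx] = v
    return ncmp

On already-sorted input this spends one comparison per element, consistent with the 63 reported for abinarysort on monotonic data in the table above.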

I am able to make a good prediction of the outcome before doing this.

Ok, so let’s take this apart.

1. “without significantly slowing any important cases”

From the results above it can be seen that the damage to other cases is minimal. This is because:

  1. Adaptive Binary does well on random data as well. Much better than Cartesian Tree. (see later)
  2. It is easy to identify a good turn-off condition. It can be adjusted further, but currently:
from math import log

N = min_run
# Integral approximation underestimates the number by ~10%
# Thus works well as a cut-off condition as it is
# without extra scaling down
cmr = (N * log(N) - N) / log(N)
cd = 1    # skip interval; grows on failures (initial value assumed)
cs = 0
while ...:
    ...
    if cs:
        _binarysort(seq, start, end, nsorted)
        cs -= 1    # count down the runs being skipped
    else:
        # Try adaptive
        n = nsorted
        nc = _abinarysort(seq, start, end, n)
        if nc >= (cmr - (n * log(n) - n) / log(n)):
            # if it fails, do every 3rd, 5th, ...
            cd += 2
            # at minimum do every 10th
            if cd > 10:
                cd = 10
            cs = cd

2. ““important” cases”

So let’s take 2 cases:

  1. Almost sorted data (something that insertion sort would nail down perfectly)
  2. Stock price data

Are these important enough?

Also, add random data to evaluate adversarial impact of this.

def get_data(show=False, wrapped=False):
    import random
    import itertools as itl
    import yfinance as yf
    random.seed(0)
    NUMS1 = [random.random() * 100 for _ in range(1000)]
    NUMS2 = list(itl.accumulate([-1 + 4 * random.random() for _ in range(1000)]))
    data = yf.download("AAPL", start="2020-01-01", end="2025-01-01")
    NUMS3 = [float(i) for i in data.values[:1000, 0]]    # Get close
    if show:
        import matplotlib.pyplot as plt
        plt.plot(NUMS1, label='random')
        plt.plot(NUMS2, label='almost-sorted')
        plt.plot(NUMS3, label='AAPL')
        plt.legend()
        plt.show()
    else:
        if wrapped:
            NUMS1 = [[i] for i in NUMS1]
            NUMS2 = [[i] for i in NUMS2]
            NUMS3 = [[i] for i in NUMS3]
        return NUMS1, NUMS2, NUMS3

get_data(show=True)

TBC in next post.

1 Like

3. “there’s at least a 10% win”

Ok, so now timings.
For each dataset there are 2 cases:

  1. Size 64 prefix (to see the performance of the “Small Run Function”)
  2. Full data of 1000 values.

1. Data size 64

I have 2 Cython-compiled functions that special-case float comparisons - one with an “a” prefix, meaning adaptive. They are still slower than the C code of list.sort, but the difference between them can be used to infer what the improvement for list.sort could possibly be.

Results for [float] and specialised float elements:

WRAND, WPSRT, WAAPL = this.get_data(wrapped=True)

                                # RAND | AAPL | PSRT
%timeit binarysort(DATA[:64])   # 14   | 17   | 11 µs
%timeit abinarysort(DATA[:64])  # 16   | 15   |  4 µs

%timeit DATA[:64].sort()        # 12   | 14   |  7 µs
%timeit DATA[:64].asort()       # 14   | 12   |  2.5 µs (PROJECTED)
                                # +17% | -15% | -64%    (PROJECTED %)

RAND, PSRT, AAPL = this.get_data(wrapped=False)
                                # RAND | AAPL | PSRT
%timeit binarysort(DATA[:64])   # 2.7  | 3.0  | 2.4 µs
%timeit abinarysort(DATA[:64])  # 3.4  | 2.7  | 1.2 µs

%timeit DATA[:64].sort()        # 1.7  | 1.7  | 1.1 µs
%timeit DATA[:64].asort()       # 2.1  | 1.5  | 0.6 µs (PROJECTED)
                                # +23% | -12% | -46%   (PROJECTED %)

2. Data size 1000

And 2 Cython-compiled functions implementing TimSort - one with an “a” prefix, meaning it uses adaptive binary sort. These do not specialise floats, so results are for [float] elements only:

WRAND, WPSRT, WAAPL = this.get_data(wrapped=True)
                                # RAND | AAPL | PSRT
%timeit timsort(DATA[:])        # 490  | 410  | 200 µs
%timeit atimsort(DATA[:])       # 490  | 370  | 110 µs

%timeit DATA[:].sort()          # 370  | 300  | 180 µs
%timeit DATA[:].asort()         # 370  | 270  | 100 µs (PROJECTED)
                                # 0%   | -10% | -45%   (PROJECTED %)

3. Conclusions

So one part is missing - specialized float elements at data size 1000.

However, it can clearly be seen that the benefits of Adaptive Binary sorting on 64 [float] elements carry over to larger runs:

  1. No visible impact on random data, thanks to the dynamic switch-off.
  2. AAPL data at size 64 had a 15% improvement; at size 1000 the improvement decreased to 10%.
  3. For almost-sorted data the size-64 improvement was 64%, and its benefit within TimSort resulted in 45% faster computation.
  4. Naturally, benefits deteriorate as data size increases, but not by much - the comparison heatmaps above show they stay significant for large sizes. Given that galloping greatly reduces the cost of the merge step (which is the case for large almost-sorted data), binary sort remains a significant cost.

Thus, it can be inferred that this would extend to specialized comparisons in a similar manner. Even at half the impact, it would still be a significant one.

4. To sum up:

  1. For non-specialized comparisons the improvement is visible and significant.
  2. For specialized values the observed benefit is lower, but still significant. It was only estimated at size 64, but there is enough data to make an educated guess that the benefits would propagate through TimSort in a similar manner as they do for non-specialized comparison data.

I think there is enough data to make a decision on whether this is a desirable thing to add.

Additional complexity is small. See: https://gist.github.com/dg-pb/674a4b7f6f8ae7d8fc2cd8bd35569639
Computationally, it is 3-4 extra operations per element.
The most expensive addition is the comparison counter nc (for the dynamic switch-off), as it sits in the O(n log n) part.

1 Like

Sorry, but I can’t make the time to discuss at length. It’s time for a PR - you’ve been doing somewhat related things off-and-on for months behind the scenes, and it’s time to fish or cut bait :wink:.

Fine by me!

The inputs to Python’s sort are always lists. Those are the lists I’m talking about. The types of an input list’s elements are a different matter. It’s true that specialization to elements also of list type is generally far less effective than specialization to elements of tuple type.

Timing of actual implementation is far more impressive to me than head arguments.

That’s subjective for all of us. What I hope for is that if you open a PR, and ask for data there, people will volunteer real-life outcomes on workloads important to them. I expect you already learned that just asking in a comment on Discourse typically gets no useful response. People for whom sort speed is very important often have mountains of data, and/or proprietary data, and simply won’t share it outside their company.

Give them a version of Python they can just use on their own instead, and that can make all the difference.

Absolutely not. Nothing gets added without timing an actual CPython implementation, which always turns up surprises. There is enough data to justify people spending time taking a PR seriously. It looks promising to me. But I’ve been wrong before.

1 Like