This is why I am now certain that Adaptive Binary beats Cartesian Tree for this. It even does a bit better on adaptivity, and it has minimal additional complexity.
It is now much, much simpler than the one I posted earlier:
- No arrays
- No additional hefty operations
- Just a couple of extra running variables
- And one if-else statement before going into binary search (sketched below).
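For illustration, a minimal sketch of the shape I mean, assuming the extra running state is the previous insertion point `prev` (my simplification, not the exact code):

```python
def abinary_insert(seq, i, prev):
    # Insert seq[i] into the sorted prefix seq[:i], using the previous
    # insertion point prev to pick the initial search window.
    v = seq[i]
    # The one extra if-else before the binary search:
    if v < seq[prev]:
        idx, end = 0, prev       # v belongs at or before prev
    else:
        idx, end = prev + 1, i   # v belongs after prev
    while idx < end:             # plain stable binary search
        mid = (idx + end) // 2
        if v < seq[mid]:
            end = mid
        else:
            idx = mid + 1
    seq[idx + 1:i + 1] = seq[idx:i]  # shift the tail right by one
    seq[idx] = v
    return idx                       # becomes prev for the next element
```

On nearly sorted data the remembered point lands close to the new element, so that single extra comparison throws away most of the prefix before the binary search even starts.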
The only advantage of Cartesian Tree is data movement. If runs were at least 200 elements long, it might warrant further investigation, but with the status quo Adaptive Binary is a clear winner.
Lists or Tuples? We were wrapping integers in lists to match the performance precisely because sorted doesn't specialise them?
In this case I think it will not.
The overhead is minimal.
And although comparisons don't matter much for specialised types, for binary sort this reduces not only the comparison count but all the operations that come with each comparison: essentially the whole body of one search iteration. And while each comparison weighs less, the other parts start weighing more, so the benefits of specialised vs non-specialised should balance out to a certain degree.
```python
# Find the insertion point for v in the sorted slice seq[idx:end]
# (bisect-right, so equal elements keep their order):
while idx < end:
    mid = (idx + end) // 2
    if v < seq[mid]:
        end = mid
    else:
        idx = mid + 1
```
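As a rough check of that balancing claim, one can time sorted() on the same floats raw and wrapped (a sketch of mine; wrapping forces sorted() off CPython's specialised float comparison onto the generic path):

```python
import random
import timeit

random.seed(0)
raw = [random.random() for _ in range(1000)]
wrapped = [[x] for x in raw]  # defeats the specialised float compare

# sorted() copies its input, so each run sees the same unsorted data.
t_raw = timeit.timeit(lambda: sorted(raw), number=2000)
t_wrapped = timeit.timeit(lambda: sorted(wrapped), number=2000)
print(f"specialised: {t_raw:.3f}s, generic: {t_wrapped:.3f}s")
```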
I am able to make a good prediction of the outcome before doing this.
Ok, so let's take this apart.
1. "without significantly slowing any important cases"
From the results above, it can be seen that the damage to other cases is minimal. This is because:
- Adaptive Binary does well on random data too, much better than Cartesian Tree (see later).
- It is easy to identify a good turn-off condition. It can be tuned further, but currently:
```python
from math import log

N = min_run
# Integral approximation underestimates the number by ~10%,
# thus it works well as a cut-off condition as it is,
# without extra scaling down.
cmr = (N * log(N) - N) / log(N)
cs = 0
cd = 1  # skip distance; grows by 2 on every failed adaptive attempt
while ...:
    ...
    if cs:
        _binarysort(seq, start, end, nsorted)
        cs -= 1  # decrease adaptivity counter
    else:
        # Try adaptive
        n = nsorted
        nc = _abinarysort(seq, start, end, n)
        if nc >= (cmr - (n * log(n) - n) / log(n)):
            # if it fails, do every 3rd, 5th, ...
            cd += 2
            # at minimum do every 10th
            if cd > 10:
                cd = 10
            cs = cd
```
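To make the resulting cadence concrete, here is a standalone toy trace of just the counter logic, under the assumption that every adaptive attempt fails the cut-off test (hypothetical, not the real sort loop):

```python
cd, cs = 1, 0
schedule = []
for run in range(30):
    if cs:
        schedule.append("plain")
        cs -= 1
    else:
        # adaptive attempt; assume it always fails here
        schedule.append("adaptive")
        cd = min(cd + 2, 10)
        cs = cd
print(schedule)
# adaptive, 3x plain, adaptive, 5x plain, adaptive, 7x plain, ...
```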
2. "'important' cases"
So let's take 2 cases:
- Almost sorted data (something that insertion sort would nail down perfectly)
- Stock price data
Are these important enough?
Also, add random data to evaluate the adversarial impact of this change.
```python
def get_data(show=False, wrapped=False):
    import random
    import itertools as itl
    import yfinance as yf
    random.seed(0)
    NUMS1 = [random.random() * 100 for _ in range(1000)]
    NUMS2 = list(itl.accumulate([-1 + 4 * random.random() for _ in range(1000)]))
    data = yf.download("AAPL", start="2020-01-01", end="2025-01-01")
    NUMS3 = [float(i) for i in data.values[:1000, 0]]  # Get close prices
    if show:
        import matplotlib.pyplot as plt
        plt.plot(NUMS1, label='random')
        plt.plot(NUMS2, label='almost-sorted')
        plt.plot(NUMS3, label='AAPL')
        plt.legend()
        plt.show()
    else:
        if wrapped:
            # Wrap in lists to force the generic comparison path.
            NUMS1 = [[i] for i in NUMS1]
            NUMS2 = [[i] for i in NUMS2]
            NUMS3 = [[i] for i in NUMS3]
        return NUMS1, NUMS2, NUMS3

get_data(show=True)
```
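For completeness, a minimal timing harness one could point at these inputs (my own sketch using timeit; not the exact setup behind the results quoted above):

```python
import timeit

def bench(nums, label, number=1000):
    # sorted() copies its input, so every run sees the same order.
    t = timeit.timeit(lambda: sorted(nums), number=number)
    print(f"{label:>13}: {t / number * 1e6:8.1f} us per sort")

NUMS1, NUMS2, NUMS3 = get_data(wrapped=True)
bench(NUMS1, "random")
bench(NUMS2, "almost-sorted")
bench(NUMS3, "AAPL")
```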
TBC in next post.