Array should have a sort method

Is sorting already (reverse-)sorted data like this a common case for you? Otherwise that doesn’t seem like an appropriate benchmark, given sorted’s huge advantage there.

2 Likes

Or just:

    def add(self, element):
        insort(self, element)  # from bisect import insort
3 Likes

No. I agree that this is a very bad benchmark. I will run another one tomorrow.

As already noted, the benchmark you used (reverse-ordered data) is one sorted() excels at (linear time). But the conclusion won’t qualitatively change regardless of benchmark: yes, heapsort coded in Python will always run much slower than the mountain of optimized C code sorted() invokes.

The solution remains the same: use numpy. It already put a lot of effort into sorting arrays of simple numeric types. It’s unlikely anyone will volunteer to reproduce that effort in CPython. I won’t, and I’m one of the only ones who might :wink:.
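
For the record, a minimal sketch of that route (assuming a 'q' array matches numpy’s int64 on the platform; np.frombuffer returns a writable zero-copy view of the array’s buffer, so sorting the view sorts the array in place, with no per-element boxing):

    import array
    import random

    import numpy as np

    a = array.array('q', (random.randint(0, 2**32) for _ in range(1_000_000)))
    v = np.frombuffer(a, dtype=np.int64)  # zero-copy, writable view of a's buffer
    v.sort()                              # numpy's C sort; a is now sorted too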

6 Likes

But you’re avoiding sorted presumably because you have an array so huge (several GBs?) that calling sorted on it results in MemoryError, in which case even a C implementation of heap sort will likely take a really long time to finish despite its O(n log n) time complexity.

So it’ll help if you can elaborate on your actual use case to show that there are realistic scenarios where a built-in sort method for array.array is warranted.

3 Likes

Time to make a pitch for one of the true gems of the Python world, the sortedcontainers package by Grant Jenks.

Its SortedList type maintains a list in always-sorted order, and just about everything is at worst about log time (add an element, delete an element, bisect to find where an element would belong, find the median (etc)).

It’s coded in pure Python, and is more efficient (both time and space) than fancy C-coded tree structures. It’s especially effective under PyPy, which can exploit its simple coding style.

“The trick” is that list.insert(i, x) and del list[i] are linear-time “in theory”, but in practice you can barely notice unless i is quite small compared to len(list). Otherwise the slice being moved fits in L1 cache, and moving contiguous slices within cache is extremely efficient on most modern boxes.

Of course that can be pushed too far, and the implementation actually uses a list of lists under the covers, so that each individual sublist doesn’t get “too large”. These sublists are invisible to the user, and are merged and split as needed, by magic.

“Log time” isn’t technically accurate, but it “acts like that” at least for sorted lists up to about a billion elements. Those are the largest I’ve actually used. Grant has pushed testing beyond 10 billion elements, and reports that it still scales well at that size.

Note that, unlike C-coded tree structures packed with pointers, Python lists are memory-efficient and enjoy superb data locality. There’s nothing not to like about this package. Use it :smile:.
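
A minimal taste of the API:

    from sortedcontainers import SortedList

    sl = SortedList([10, 3, 7])
    sl.add(5)                  # insert, keeping sorted order
    print(sl)                  # SortedList([3, 5, 7, 10])
    print(sl.bisect_left(6))   # 2 - where 6 would be inserted
    print(sl[len(sl) // 2])    # 7 - order statistics by plain indexing
    sl.remove(3)               # delete by value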

13 Likes

Here is a better benchmark:

Benchmarking code
import array
import bisect
import random
import sys
import time

sys.modules['_heapq'] = None # avoid the C version, which requires a true list
from heapq import _heapify_max, _siftup_max


class SortedArray(array.array):
    def __new__(cls, typ, data):
        return array.array.__new__(cls, typ, sorted(data))

    def add(self, element):
        i = bisect.bisect_left(self, element)
        self.insert(i, element)


def heapsort(arr):
    _heapify_max(arr)  # max-heapify the whole array in place
    view = memoryview(arr)
    for size in reversed(range(1, len(arr))):
        arr[0], arr[size] = arr[size], arr[0]  # move the current max to the end
        _siftup_max(view[:size], 0)  # restore the heap on the shrunken prefix


src = [random.randint(0, 2**32) for i in range(1_000_000)]

a = SortedArray('q', [])

start = time.process_time()

for i in src:
    a.add(i)

print("Adding 1 000 000 random elements to SortedArray took:", time.process_time() - start)

b = array.array('q', src)

start = time.process_time()

c = sorted(b)

print("Sorting 1 000 000 random elements with sorted took:", time.process_time() - start)

start = time.process_time()

heapsort(a)

print("Sorting 1 000 000 random elements with heapsort took:", time.process_time() - start)
Adding 1 000 000 random elements to SortedArray took: 75.878228764
Sorting 1 000 000 random elements with sorted took: 0.4819394679999931
Sorting 1 000 000 random elements with heapsort took: 6.518191674999997

This doesn’t change the conclusions.

1 Like

I edited the comment to include the code.

1 Like

A variation, seems about equally fast:

def heapsort3(arr):
    _heapify_max(arr)
    view = memoryview(arr)
    while vie := view[:-1]:  # a view one element shorter, from the right
        view[0], view[-1] = view[-1], view[0]  # move the current max to the end
        view = vie
        _siftup_max(view, 0)

2 Likes

That heapsort(a) is sorting the already-sorted array. And not a plain array but a subclass instance, which could be slower.

Min-heap version using a reverse view (and shorter names):

def heapsort4(arr):
    v = memoryview(arr)[::-1]  # reversed view: a min-heap here yields ascending order
    heapify(v)
    while w := v[:-1]:
        v[0], v[-1] = v[-1], v[0]  # move the current min to the end of the view
        v = w
        _siftup(v, 0)
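
For reference, a quick correctness check of these variants (assuming the same sys.modules['_heapq'] = None setup as in the benchmark above, so heapify and _siftup come from the pure-Python heapq):

    import array
    import random
    import sys

    sys.modules['_heapq'] = None  # force the pure-Python heapq
    from heapq import heapify, _siftup

    a = array.array('q', (random.randint(0, 2**32) for _ in range(10_000)))
    expected = sorted(a)
    heapsort4(a)
    assert list(a) == expected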

These aren’t fair comparisons: you’re not comparing insertion at a sorted location against inserting an item and sorting afterward; you’ve handily removed the insertion from 2 of the measurements here.

This is exactly the kind of thing I was getting at earlier.

Additionally, the premise for this class was in response to an earlier comment of yours, in which you mention space complexity, not time complexity. You’ve not measured what it was presented to optimize for, and 2 of the options you measured do exactly what you said you wanted to avoid. (Edit: to be clear, if all 3 were appropriate for the premise, then comparing other characteristics would be fine, but 2 of the benchmarked options shouldn’t even be candidates if the premise for this feature is valid.)

That said, just use an established library here, especially if you aren’t going to benchmark with the real code you actually want the proposal to speed up.

1 Like

Keep in mind that calling sorted on an array will not merely copy its data around, doubling the memory usage. It will box each of the items, which are 4 or 8 bytes in size, into full Python int or float objects weighing tens of bytes (plus 8 bytes for a pointer to each one).
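
A rough way to see the difference (a sketch; the getsizeof sum is an estimate, since small ints are shared objects):

    import array
    import sys

    a = array.array('q', range(1_000_000))
    raw = len(a) * a.itemsize                              # ~8 MB of packed machine ints
    L = a.tolist()                                         # boxes every element
    boxed = sys.getsizeof(L) + sum(map(sys.getsizeof, L))  # list + int objects
    print(raw, boxed)  # roughly 8 MB vs. 35+ MB on a 64-bit CPython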

4 Likes

You are right. I meant to run:

heapsort(b)

Sorry for the confusion. I should have named the variables better.

This partly applies even to the heapsort versions: for an array.array a, something as simple as

    a[i] = a[j]

ends up boxing a[j], then unboxing it again to store into a[i]. Lots of “hidden” cycles and dynamic memory churn then.

That’s part of why I suggested radix sorting instead: no boxing or unboxing needed then. numpy already does that for integer types.
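
That numpy detail, for the curious: for integer dtypes, passing kind='stable' dispatches to a radix sort in recent numpy versions:

    import numpy as np

    a = np.random.randint(0, 2**32, size=1_000_000, dtype=np.uint64)
    a.sort(kind='stable')  # radix sort for integer dtypes in recent numpy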

Perhaps surprising: the smallest CPython int object consumes 32 bytes on a 64-bit box:

  • 8 bytes for a type pointer
  • 8 bytes for a refcount
  • 8 bytes for the sign bit and a count of (30-bit) “digits”
  • 4 bytes for a “digit”
  • 4 bytes to pad to 16-byte heap alignment
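
You can check the struct part of that yourself; sys.getsizeof reports the 28 bytes of the object, and the allocator’s 16-byte alignment accounts for the rest:

    import sys

    print(sys.getsizeof(1))  # 28 on a typical 64-bit CPython: 8 + 8 + 8 + 4
    # pymalloc rounds allocations up to a multiple of 16, so the object really
    # occupies 32 bytes of heap - four times the size of an 8-byte 'q' item.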

The OP should really benchmark using numpy instead, to get a feel for how much they’re missing by refusing to use it :wink:.

7 Likes

Yes, agreed. The space and performance overhead of boxing alone may be worth writing a sort method for array.array. And although the robustness of numpy.array has rendered array.array redundant in many ways, as Tim says, I guess there’s still value in array.array for those who need the most lightweight application possible.

Yes, but boxing and unboxing can be avoided if the sorting is implemented in C with direct memory access to the array. So depending on how often people actually need to sort an array.array (while avoiding the heavyweight numpy package), the proposal may be worth implementing.

1 Like

Very clean looking. Slicing a memoryview with [:-1] to achieve a popping effect is very clever indeed. Thanks for sharing!

1 Like

I have an array.array sorter written in pure Python now that’s faster than list.sort() on larger arrays. It’s a novel mix of an MSB radix sort and list.sort(), relying heavily on memoryviews.

  • All numeric array codes are supported. “u” and “w” (Unicode gimmicks) are not, because memoryview doesn’t support them.

  • The code may not work at all on a big-endian machine - I haven’t tried, though the code does intend to support it.

  • It doesn’t work under PyPy, for reasons still unknown. I suspect a bug in PyPy’s support for fancy chaining of views with non-unit strides.

  • Extra temp memory is required, basically bounded by the amount of RAM required to build and list.sort() a list with no more than 200_000 elements (the value of the optional maxsort argument). Later: a faster version also requires temp memory equal to the amount taken by the original array.array.

  • Speedups depend on the typecode of the array. It’s generally most effective on 4-byte types. It’s significantly faster on all types at about 5 million elements, and its advantage increases the larger the array (because the longest list CPython’s sort is handed is bounded, the radix sort retains its linear-time behavior as the array grows).

  • It’s least effective for “d” (double floats). Boxing/unboxing those is relatively cheap (C and CPython use the raw bits unchanged - no conversion costs), and list.sort() specializes to a cheap native machine comparison function.

  • A pure radix sort coded in Python is much slower. list.sort() is doing the bulk of the work here, but on lists of bounded size, mitigating its O(n log n) behavior. In effect, the radix layer acts like a 256-way merge on independently sorted sublists, where the pieces are pasted together - end to end - with no need for comparisons.
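
Below is a greatly simplified sketch of that structure (not the real code - it ignores the memoryview tricks, signed types, floats, endianness, and the maxsort bounding, and it boxes elements into plain bucket lists):

    import array

    def msb_radix_then_sort(xs):
        """Sort an array.array of unsigned ints: bucket on the most
        significant byte, list.sort() each bucket, then paste the
        buckets back together end to end."""
        shift = xs.itemsize * 8 - 8            # bit position of the top byte
        buckets = [[] for _ in range(256)]
        for x in xs:                           # the radix pass
            buckets[x >> shift].append(x)
        out = array.array(xs.typecode)
        for b in buckets:
            b.sort()                           # optimized C sort, on a smaller piece
            out.extend(b)                      # no comparisons needed to combine
        xs[:] = out

    a = array.array('Q', [837, 5, 2**60, 19, 2**33])
    msb_radix_then_sort(a)
    assert list(a) == sorted([837, 5, 2**60, 19, 2**33])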

Here’s timing output for arrays of around 168 million elements [1], so large that speedups are dramatic. The columns are: typecode, item size in bytes, seconds to fill with random bits, seconds for the best current method, seconds for the new code, the new/old time ratio, and one final detail:

N=167_772_160
    b 1 1.32 35.46 13.99 0.39 0
    B 1 1.31 22.32 10.59 0.47 0
    h 2 2.64 98.21 51.87 0.53 0
    H 2 2.60 96.63 51.06 0.53 0
    i 4 4.85 207.39 73.28 0.35 0
    I 4 4.96 196.43 60.39 0.31 0
    l 4 4.89 206.50 77.17 0.37 0
    L 4 5.55 197.43 59.97 0.30 0
    q 8 10.30 226.71 103.02 0.45 0
    Q 8 10.26 220.39 83.98 0.38 0
    d 8 25.35 124.65 90.45 0.73 0
    f 4 23.00 125.03 79.63 0.64 0

Let’s pick a line:

    Q 8 10.26 220.39 83.98 0.38 0
  • It’s a “Q” array (64-bit unsigned int on this box).
  • Each item takes 8 bytes on this box.
  • It took ~~18.26~~ 10.26 seconds (wall clock) to fill the buffer with random bits.
  • It took ~~228.39~~ 220.39 seconds to sort the array with the best currently available method, where xs is an array.array('Q'):
L = xs.tolist()
del xs[:]
L.sort()
xs.fromlist(L)
del L
  • It took 83.98 seconds for the new code to sort xs directly.
  • The new time divided by the old time is 0.38 - nearly 3(!) times faster. The speedup is (1 - 0.38) * 100 = 62%.
  • The trailing 0 is a detail I won’t even try to explain.

What next? Probably nothing :wink:. It’s an interesting puzzle, but - as things often do - it’s turning into a time sink, raising ever more questions in the details. So this is all just FYI, to give a sense of what’s possible even sticking to Python.


  1. Why that number? Accident - I just started with arrays of length 5, and kept doubling it across tries ↩︎

6 Likes

Are 18.26 and 228.39 both typos, or why are they 8 more than the values shown before?

1 Like

Yup! Sorry about that.

The values in the full quoted lines are correct (zero unit digit, not 8). I’m afraid “0” and “8” are visually confusable to me in Discourse’s fixed-width font under dark mode :frowning_face:.