Array should have a sort method

Hello,

Arrays are a space-efficient way of storing numeric values in Python. However, there doesn’t seem to be any space-efficient way to order them. It’s possible to use the built-in sorted function, but it returns a list, which cancels out the space-efficiency gains of arrays. I think arrays should have a sort() method, just like lists do.
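For instance, the obvious workaround round-trips through a list, which is exactly the intermediate that arrays are meant to avoid:

from array import array

a = array('i', [3, 1, 2])
# sorted() materializes a full list of Python int objects before the
# result can be packed back into an array
a = array('i', sorted(a))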

3 Likes

I assume numpy isn’t an option?

>>> import array
>>> import numpy as np
>>> a = array.array('i', [0, 3, 1, 2])
>>> b = np.frombuffer(a, dtype='i4')
>>> b.sort()
>>> a
array('i', [0, 1, 2, 3])
1 Like

That’s a solution, but it requires adding a 16MB dependency just to be able to sort an array efficiently.

2 Likes

You probably want something like this if your use case frequently needs sorted arrays, rather than waiting on it being added to the stdlib. Sorting in place would be useful for some applications, though, and seems like a good addition to the stdlib to me.

import array
import bisect

class SortedArray(array.array):
    def __new__(cls, typ, data):
        return array.array.__new__(cls, typ, sorted(data))

    def add(self, element):
        i = bisect.bisect_left(self, element)
        self.insert(i, element)
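
A quick usage sketch, assuming the class above:

sa = SortedArray('i', [3, 1, 2])
sa.add(0)
print(sa)  # elements are kept in sorted order: ('i', [0, 1, 2, 3])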

The stdlib array module is more of a dead battery than some of the recently removed ones…

2 Likes

Won’t adding elements to your SortedArray have a complexity of n^2?

Take my word for it: sorting is an involved topic :wink:.

If it were easy it would already have been done. If it were needed, then it’s surprising that, e.g., there’s no PyPI package I can find to sort array.array objects.

If someone wants to pursue it, some hints:

  • Stability doesn’t matter for pure numeric types.

  • Comparisons are cheap for simple numeric types.

  • list.sort() is far too involved to even think about adapting for this purpose. It goes to enormous lengths to minimize comparisons and ensure stability, neither of which matter here.

  • Heapsort is the quickest way to get there from here. Simple code, no extra memory needed, worst case O(n log n). But also no “really good” cases, and is generally the slowest of the major O(n log n) methods.

  • LSB radix sort is suitable, relatively easy to code, and worst-case linear-time (a sketch follows below). Requires a scratch array as large as the original, and a table of 256 counters. Different code may be needed depending on the endianness of the HW.

  • What to do with floats is always a problem, thanks to NaNs and signed zeros. Try to be compatible with list.sort()? With numpy’s sort? Impose a total order?

Probably enough work to merit a PEP.
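
To make the radix-sort hint above concrete, here’s a minimal pure-Python sketch for an array.array of unsigned integers. It’s a toy illustration under the assumptions stated in the comments, not a tuned implementation; a fast version would work on the raw bytes and account for endianness:

import array

def radix_sort(arr):
    # Sketch of an LSB radix sort for an array.array of unsigned ints
    # ('B', 'H', 'I', 'Q', ...): one counting pass per byte, linear time,
    # one scratch array the size of the input, and 256 counters per pass.
    scratch = array.array(arr.typecode, arr)
    src, dst = arr, scratch
    for shift in range(0, arr.itemsize * 8, 8):
        counts = [0] * 256
        for x in src:
            counts[(x >> shift) & 0xFF] += 1
        total = 0
        for byte in range(256):  # prefix sums: where each bucket starts
            counts[byte], total = total, total + counts[byte]
        for x in src:
            byte = (x >> shift) & 0xFF
            dst[counts[byte]] = x
            counts[byte] += 1
        src, dst = dst, src
    if src is not arr:  # odd number of passes: copy the result back
        arr[:] = src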

18 Likes

It is provided by numpy, but for numpy’s own ndarray type. Maintaining a third-party package to provide this functionality for array would be a hard sell compared to just using numpy, which has this and much more.

5 Likes

No, that’s not n^2 insertion complexity. That’s also not the only method you might want to implement, and doing this memory-efficiently with just the standard library for any of the multi-insertion methods involves using the buffer APIs and memoryview objects.
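
As a rough, hypothetical illustration of one multi-insertion approach (an add_many helper, not from any library: it merges a sorted batch from the back so each element moves at most once, O(n + k) rather than k separate O(n) inserts):

import array

def add_many(arr, items):
    # Hypothetical sketch: bulk-insert into an already-sorted array.array.
    # Grow the array once, then merge from the back; every slot past the
    # merge point is overwritten, so no element moves more than once.
    batch = sorted(items)
    i = len(arr) - 1               # last element of the original data
    arr.extend(batch)              # placeholder values, overwritten below
    j = len(batch) - 1
    for k in reversed(range(len(arr))):
        if j < 0:
            break                  # remaining prefix already in place
        if i >= 0 and arr[i] > batch[j]:
            arr[k] = arr[i]
            i -= 1
        else:
            arr[k] = batch[j]
            j -= 1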

I tend to agree with others’ assessment that it’s uncommon to both need anything more than what I wrote there [1] and have a case where depending on numpy is inappropriate.


  1. I’m aware of two libraries that use a variation of that, without implementing more than ensuring sort on creation and ensuring single insertion remains sorted ↩︎

3 Likes

Just wanted to add a little context here for anyone like myself who only learned this recently —

This is the same Tim: Timsort (Wikipedia)

He won’t brag about it, but those are some well-founded implementation tips…

5 Likes

Except that appeal to authority is fallacious :wink:.

Still, I am worth listening to on the topic. I wrote all of CPython’s sorts since its start, and threw away at least 20 for each one I checked in. The most recent was novel in several ways, but is quite aged now, and widely adopted outside CPython too (in contexts where partially ordered inputs are common, and/or compares are very expensive compared to moving data, it shines).

I still pay some attention to new developments, but it’s faded to more of a background interest now. Suffice it to say that, despite real progress in many areas, I haven’t yet seen another approach more promising for CPython’s mix of needs.

Most research seems to be going into ever more esoteric ways to squeeze every last cycle out of sorting native scalar types (like machine ints and floats), which isn’t really CPython’s niche.

But that is array.array’s niche.

At a higher level, Orson Peters (no relation) recently introduced “glidesort”, which aims to combine the best characteristics of list.sort()'s merging with a “pdqsort” flavor of stable quicksort, for faster handling of inputs with many duplicates and for better speed when comparisons are cheaper.

Which they have become in CPython over the years, dramatically so after code was introduced to specialize the comparison functions used when the input happens to be a homogeneous list of a suitable type (like float or “small enough” int). Some of list.sort()'s initial decisions have become counterproductive when comparisons are cheap, although only very mildly so to date.

So while the “sorting story” will never end, it has been on a multi-year pause :wink:.

7 Likes

As @tim.one pointed out, heapsort would be your best option if you want the best space efficiency (O(1) extra space) while maintaining a worst-case O(n log n) time complexity.

In my answer to “Python Sort in Constant Space O(1)” on Stack Overflow, I showed several possible ways of using the heapq module in the stdlib to implement an in-place heapsort. Here’s one of them:

import sys
sys.modules['_heapq'] = None # avoid the C version, which requires a true list
from heapq import _heapify_max, _siftup_max
from array import array

def heapsort(arr):
    _heapify_max(arr)
    view = memoryview(arr)
    for size in reversed(range(1, len(arr))):
        arr[0], arr[size] = arr[size], arr[0]
        _siftup_max(view[:size], 0)

a = array('i', [5, 2, 1, 4, 3])
heapsort(a)
print(a) # array('i', [1, 2, 3, 4, 5])
2 Likes

No, that’s not n^2 insertion complexity.

I believe that insert() needs to move all the elements in the array after the insertion point, so each insert is O(n). Adding all n elements to SortedArray one by one would then have a total complexity of n^2 (roughly n(n-1)/2 element moves). If that’s not the case, what’s the complexity of adding all elements to SortedArray then?

1 Like

It’s worth pointing out that NumPy’s ndarray can wrap an array (or any other chunk of memory):

>>> import numpy
>>> import array
>>> a = array.array('i', [5, 1, 9])
>>> numpy.array(a, copy=False).sort()
>>> a
array('i', [1, 5, 9])

But of course, if you already depend on the NumPy power plant, there’s not much reason for using the stdlib’s coin-cell battery.

4 Likes

  • Are arrays sorted much now?
  • Could data be sorted before creating the array?

Essentially, would it be used much?

1 Like

Finding the index to insert at: O(log n)
Moving some number of items: O(n)

These aren’t multiplicative terms, so the complexity of a single insertion is usually written as O(n).[1]

It also removes any later call to some sorting method, so you can choose to view this as amortizing the sort cost if sorting would always be needed later on in the application.

There are reasons to and not to use a subclass like this. This is great if you would always want the array in its sorted form, change it multiple times after construction, and it is long-lived.

You should also profile real use before and after adding something like this: not just a microbenchmark, but the actual time in the application you want this for. You might find it surprising, but lists in Python are often a better choice than arrays when you don’t check all of the boxes for array use.[2]


  1. Big-O notation can be misleading with regard to performance: on small arrays, the bisect might take more time than the actual insertion because of the constant terms hidden there. ↩︎

  2. There are fewer boxes to check for array use over lists when using an array library such as numpy, because the array gains other capabilities. Compare your options based on what they actually provide that is relevant to your use, and on the performance within your real application, not just a theoretical one. Constant terms can dwarf algorithmic complexity in small cases, and usage patterns can change which algorithms are best and how underlying details like memmove calls perform. ↩︎

2 Likes

Thanks for sharing that! Very clever, and solves the problem in a uniform and general way. Bravo!

A curiosity I’ll leave for others to dig into, if so inclined: for an array of a million random.random() floats, it’s substantially slower under PyPy than under CPython. No idea why. heapq.py is pretty much the same under both, although PyPy uses a faster algorithm to implement heapq.merge() (one that doesn’t use heaps at all).

1 Like

Thanks for this very elegant solution. I noticed that importing a module containing heapsort will affect the heapq imports in the entire program, but that can be mitigated by adding the following lines:

sys.modules.pop('_heapq')  # drop the None placeholder installed earlier
sys.modules.pop('heapq')   # so later imports get a fresh (C-accelerated) heapq

I still believe that sorting is something that should natively be supported by array.array though.

1 Like

It also seems that heapsort is much slower than using the built-in sorted. Here is a very simple benchmark:

import time
from array import array
from heapsort import heapsort

a = array('q', range(1_000_000, 0, -1))

start = time.process_time()
b = sorted(a)
print("Sorting took:", time.process_time() - start)

start = time.process_time()
heapsort(a)
print("Sorting took:", time.process_time() - start)
Sorting took: 0.03236380999999999
Sorting took: 6.134912384

Adding the same number of elements to SortedArray was even slower:

import array
import bisect
import time

class SortedArray(array.array):
    def __new__(cls, typ, data):
        return array.array.__new__(cls, typ, sorted(data))

    def add(self, element):
        i = bisect.bisect_left(self, element)
        self.insert(i, element)

a = SortedArray('q', [])

start = time.process_time()
for i in range(1_000_000, 0, -1):
    a.add(i)
print("Adding took:", time.process_time() - start)
Adding took: 159.908296744
1 Like

I sort of agree but that is because I would say that the stdlib should basically include numpy arrays. I think that originally part of the motivation to have numpy as a separate package from scipy was that it might be plausible for stdlib inclusion. That didn’t happen so instead numpy became the most widely required dependency in the 3rd party packaging ecosystem and now there is little incentive to change that because “why not just use numpy?”.

Anyone who wants this already uses numpy for it. You rejected installing numpy above because of the wheel size, even though it is the obvious solution. It is rare that I would have a Python installation without numpy, so from my perspective it is a de facto part of the stdlib that I just need to install into each venv.

Your objection was to the size of numpy, but you could say that about any feature from any package: I only want <10% of that package, so why do I have to install the whole thing? The answer is that this is the way it all got divided up into manageable, maintainable packages. The stdlib has a lot of stuff that I never use and takes up more space than numpy, but I have to have it in every Python installation. It is a much bigger “package” of which I only want <10%, and you propose to make it bigger still by adding functionality to array that I would never use, while the array module would still be dwarfed by the features of numpy.

The answer here is simple: just install numpy. It consumes some disk space, just like the stdlib, and some of that disk space is wasted, just like for all the stdlib modules that you don’t use.

7 Likes