Arrays are a space-efficient way of storing numeric values in Python. However, there doesn’t seem to be any space-efficient way to order them. It’s possible to use the built-in sorted() function, but it returns a list, which cancels out the space savings of using an array in the first place. I think that arrays should have a sort() method just like lists do.
You probably want something like the sketch below if your use case frequently needs sorted arrays, rather than waiting on it being added to the stdlib. Sorting in place would be useful for some applications, though, and seems like a good addition to the stdlib to me.
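A minimal sketch of that idea (my own illustration; the class and method names are assumptions, not the exact code from the post): keep an array.array sorted at construction time and keep single insertions sorted via bisect.

```python
import array
import bisect

class SortedArray(array.array):
    """array.array that is sorted on creation and supports sorted insertion."""

    def __new__(cls, typecode, initializer=()):
        # Sort once up front; array.array handles the typecode checking.
        return super().__new__(cls, typecode, sorted(initializer))

    def insert_sorted(self, value):
        # O(log n) to find the position, O(n) for the underlying memmove.
        self.insert(bisect.bisect_right(self, value), value)
```

Usage would look like `sa = SortedArray('d', [3.0, 1.0, 2.0])` followed by `sa.insert_sorted(2.5)`.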
Take my word for it: sorting is an involved topic.
If it were easy, it would already have been done. And if it were needed, it’s surprising that, e.g., there’s no PyPI package I can find to sort array.array objects.
If someone wants to pursue it, some hints:
Stability doesn’t matter for pure numeric types.
Comparisons are cheap for simple numeric types.
list.sort() is far too involved to even think about adapting for this purpose. It goes to enormous lengths to minimize comparisons and ensure stability, neither of which matter here.
Heapsort is the quickest way to get there from here. Simple code, no extra memory needed, worst-case O(n log n). But it also has no “really good” cases, and is generally the slowest of the major O(n log n) methods. (A sketch follows this list.)
LSB radix sort is suitable, relatively easy to code, and worst-case linear time. It requires a scratch array as large as the original and a table of 256 counters. Different code may be needed depending on the endianness of the hardware. (Also sketched below.)
What to do with floats is always a problem, thanks to NaNs and signed zeros. Try to be compatible with list.sort()? With numpy’s sort? Impose a total order? (See the key-function sketch below.)
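For the heapsort hint, a minimal in-place sketch (my own illustration, not code from this thread). It works on any mutable sequence, including array.array, and needs O(1) extra memory:

```python
def heapsort(a):
    """In-place heapsort: O(n log n) worst case, O(1) extra memory."""
    def sift_down(root, end):
        # Restore the max-heap property for the subtree rooted at `root`,
        # considering only positions < `end`.
        while True:
            child = 2 * root + 1
            if child >= end:
                return
            if child + 1 < end and a[child] < a[child + 1]:
                child += 1                      # pick the larger child
            if a[root] >= a[child]:
                return
            a[root], a[child] = a[child], a[root]
            root = child

    n = len(a)
    for i in range(n // 2 - 1, -1, -1):         # heapify: build a max-heap
        sift_down(i, n)
    for end in range(n - 1, 0, -1):             # repeatedly move the max to the tail
        a[0], a[end] = a[end], a[0]
        sift_down(0, end)
```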
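And a sketch of the LSB radix sort hint, restricted to unsigned integer typecodes to keep it short (my illustration, using the scratch array and 256-counter table mentioned above; signed ints and floats would need a key transformation first). Extracting digits with shifts, as here, sidesteps the endianness concern that a byte-view implementation would have:

```python
import array

def radix_sort(a):
    """LSB radix sort for an array.array of unsigned ints ('B', 'H', 'I', 'L', 'Q')."""
    n = len(a)
    scratch = array.array(a.typecode, a)     # scratch array as large as the input
    src, dst = a, scratch
    for shift in range(0, a.itemsize * 8, 8):
        counts = [0] * 256                   # one counter per byte value
        for x in src:
            counts[(x >> shift) & 0xFF] += 1
        total = 0
        for b in range(256):                 # prefix sums -> starting offsets
            counts[b], total = total, total + counts[b]
        for x in src:                        # stable scatter into the other buffer
            b = (x >> shift) & 0xFF
            dst[counts[b]] = x
            counts[b] += 1
        src, dst = dst, src
    if src is not a:                         # odd number of passes: copy back
        a[:] = src
```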
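On the floats question, one way to impose a total order is the usual IEEE-754 bit trick (a sketch assuming 64-bit doubles; not something the thread settled on):

```python
import struct

def total_order_key(x):
    """Map a float to an int whose natural ordering is a total order on floats."""
    u = struct.unpack('<Q', struct.pack('<d', x))[0]
    # Negative floats: flip all bits (reverses their order and puts them first).
    # Non-negative floats: set the sign bit (shifts them above all negatives).
    return u ^ 0xFFFF_FFFF_FFFF_FFFF if u >> 63 else u | 0x8000_0000_0000_0000
```

With this key, negative NaNs sort below -inf, positive NaNs above +inf, and -0.0 before +0.0; whether that is the behavior to standardize on is exactly the open question above.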
This is provided by numpy, but using numpy’s own ndarray type. Maintaining a third-party package to provide this functionality for array would be a hard sell compared to just using numpy, which has this and much more.
No, that’s not O(n^2) insertion complexity. That’s also not the only method you might want to implement, and doing this memory-efficiently with just the standard library for any of the multi-insertion methods involves using the buffer APIs and memoryview objects.
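For context, one memory-efficient multi-insertion approach (a sketch of my own, not the posted code): grow the array once, then merge the sorted batch in from the right, so each existing element moves at most once. The per-element loop here is where bulk moves over memoryview slices and the buffer APIs could be substituted:

```python
import array

def insort_many(a, values):
    """Merge a batch of values into an already-sorted array.array, in place."""
    batch = sorted(values)
    old = len(a)
    a.extend(batch)              # grow once; the new tail is overwritten below
    i = old - 1                  # last element of the original data
    j = len(batch) - 1           # last element of the batch
    for k in range(len(a) - 1, -1, -1):
        if j < 0:
            break                # remaining originals are already in place
        if i >= 0 and a[i] > batch[j]:
            a[k] = a[i]
            i -= 1
        else:
            a[k] = batch[j]
            j -= 1
```

This is O(n + m) for m insertions into n elements, versus O(n·m) for repeated single insertions.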
I tend to agree with the assessment of others that it is uncommon to need anything more than what I wrote there [1] while also having a case where depending on numpy is inappropriate.
I’m aware of two libraries that use a variation of that, without implementing more than ensuring the array is sorted on creation and that single insertions keep it sorted. ↩︎
Still, I am worth listening to on the topic. I wrote all of CPython’s sorts since its start, and threw away at least 20 for each one I checked in. The most recent was novel in several ways, but is quite aged now, and widely adopted outside CPython too (in contexts where partially ordered inputs are common, and/or compares are very expensive compared to moving data, it shines).
I still pay some attention to new developments, but it’s faded to more of a background interest now. Suffice it to say that, despite real progress in many areas, I haven’t yet seen another approach more promising for CPython’s mix of needs.
Most research seems to be going into ever more esoteric ways to squeeze every last cycle out of sorting native scalar types (like machine ints and floats), which isn’t really CPython’s niche.
But that is array.array’s niche.
At a higher level, Orson Peters (no relation) recently introduced “glidesort”, which aims to combine the best characteristics of list.sort()'s merging with a “pdqsort” flavor of stable quicksort, for faster handling of inputs with many duplicates and for better speed when comparisons are cheaper.
Which they have become in CPython over the years, dramatically so after code was introduced to specialize the comparison functions used when the input happens to be a homogeneous list of a suitable type (like float or “small enough” int). Some of list.sort()'s initial decisions have become counterproductive when comparisons are cheap, although only very mildly so to date.
So while the “sorting story” will never end, it has been on a multi-year pause.
As @tim.one pointed out, heapsort would be your best option if you want the best space efficiency (O(1) extra space) while maintaining a worst-case O(n log n) time complexity.
I believe that insert() needs to move all the elements in the array after x. Adding all elements to SortedArray would then have a complexity of n^2. If that’s not the case, what’s the complexity of adding all elements to SortedArray?
Finding the index to insert at: O(log n)
Moving some number of items: O(n)
It also removes any later calls to some sorting method, so you can choose to view this as amortizing sort performance if sorting would always be needed later in the application.
These aren’t multiplicative terms, so the complexity of a single insertion is usually written as O(n).[1]
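To spell out how the two views fit together (my summary, not a quote from the thread): a single insertion into a sorted array that already holds i elements costs O(log i) for the bisect plus O(i) for the shift, i.e. O(i) overall. Building the whole structure one element at a time therefore costs sum_{i=1..n} O(i) = O(n^2) in total, versus O(n log n) for sorting once at the end; the O(n) claim is about one insertion, not about filling the array.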
There are reasons both to use and not to use a subclass like this. It is great if you always want the array in its sorted form, change it multiple times after construction, and it is long-lived.
You should also profile real use before and after adding something like this: not just a microbenchmark, but the actual time spent in the application you want this for. You might find it surprising, but lists in Python are often a better choice than arrays when you don’t check all of the boxes for array use.[2]
Big-O notation can be misleading with regard to performance; on small arrays, the bisect might take more time than the actual insertion because of the size of the constant factors hidden there. ↩︎
There are fewer boxes to check for using an array over a list when using an array library such as numpy, because the array gains other capabilities. Compare your options based on what they actually provide that is relevant to your use, and on the performance within your real application, not a theoretical one. Constant factors can dwarf algorithmic complexity in small cases, and usage patterns can change which algorithms are best and how certain underlying details like memmove calls perform. ↩︎
Thanks for sharing that! Very clever, and solves the problem in a uniform and general way. Bravo!
A curiosity I’ll leave for others to dig into, if so inclined: for an array of a million random.random() floats, it’s substantially slower under PyPy than under CPython. No idea why. heapq.py is pretty much the same under both, although PyPy uses a faster algorithm to implement heapq.merge() (one that doesn’t use heaps at all).
Thanks for this very elegant solution. I noticed that importing a module containing this heapsort affects the heapq import for the entire program, but that can be mitigated by adding the following lines:
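The exact lines depend on how the heapsort module forces the pure-Python heapq; assuming it blocks the C accelerator with `sys.modules['_heapq'] = None` before importing heapq, a sketch of the cleanup would be:

```python
import sys

# Undo the block once this module has captured the pure-Python heapq
# functions it needs, so that `import heapq` elsewhere in the program gets
# the normal, C-accelerated version again (assumes the module previously
# set sys.modules['_heapq'] = None before its own `import heapq`).
del sys.modules['_heapq']
del sys.modules['heapq']
```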
I sort of agree, but that is because I would say that the stdlib should basically include numpy arrays. I think that originally part of the motivation for having numpy as a separate package from scipy was that it might be plausible for stdlib inclusion. That didn’t happen, so instead numpy became the most widely required dependency in the third-party packaging ecosystem, and now there is little incentive to change that because “why not just use numpy?”.
Anyone who wants this already uses numpy for it. You rejected installing numpy above, even though it is the obvious solution, because of the wheel size. It is rare that I would have a Python installation without numpy, so from my perspective it is a de facto part of the stdlib that I just need to install into each venv.
Your objection was to the size of numpy, but you could say that about any feature from any package: I only want <10% of that package, so why do I have to install the whole thing? The answer is that this is the way it all got divided up into manageable, maintainable packages. The stdlib has a lot of stuff that I never use and takes up more space than numpy, but I have to have it in every Python installation. It is a much bigger “package” of which I only want <10%, and you propose to make it bigger by adding more functionality to array that I would never use, while in total the array module would still be dwarfed by the features of numpy.
The answer here is simple: just install numpy. It consumes some disk space, just like the stdlib, and some of that disk space is wasted, just like for all the stdlib modules that you don’t use.