A "sorted" that returns a generator (mine implemented with quicksort)

WOW! Thanks so much, @Stefan2!! I had no idea this had been sped.

I used to capture bound methods in locals “for speed” a lot, but got out of the habit some years ago when I switched to PyPy for running most “speed actually matters” programs (PyPy has long optimized this).

A similar fate befell my old habit of capturing globals and builtins in function locals via abusing default arguments def f(...,, len=len, abs=abs, ...): But using your code with

setup = '''\
x = -3

def f():
    return abs(x)

def f1(abs=abs, x=x):
    return abs(x)
'''        
funcs = ['f()', 'f1()']

suggests CPython has made great progress there too:

 42.2 ± 0.0 ns  f()
 49.0 ± 0.1 ns  f1()
Python: 3.13.0 (tags/v3.13.0:60403a5, Oct  7 2024, 09:38:07) [MSC v.1941 64 bit (AMD64)]

It remains something of a puzzle to me how it manages to make these faster than retrieving a local. That’s quite a trick!

Which is coming back to me now. The OP then didn’t care about “latency” (time to first result) at all. They were solely worried about RAM. They had a large list in memory, but not enough RAM for list.sort() to allocate the rather modest temp memory it needed (on a 64-bit box, 4 * len(array) bytes, half the RAM needed to copy it).

I’m not sure why, but it didn’t occur to me at the time to give them code for an in-place sort coded _in _ Python.

Instead I gave them a God-awful mess using heapq.nsmallest to brute-force a form of batched selection sort as an iterator.

So now it’s becoming clearer too why I suppressed these memories :wink:.

BTW, heapq.nsmallest is a delight if you know in advance how long a prefix of a sorted sequence you need. Call that k. Provided k is small compared to the length of the sequence, it’s fast, essentially O(n * log(k)) time, and memory use proportional to k no matter how long the sequence. And it’s stable too, despite that it’s implemented using a heap.

For example, on a million-element randomized array of ints just now, nsmallest() found the 10 smallest elements in about twice the time it took for min() to find just the smallest (and well under a tenth of a second).

2 Likes

I don’t really know what you mean by “a line of partitioning” or “merge a row”, so couldn’t follow the argument. But I can give you some hard data on the quoted part: Python’s sort has no tricks at all up its sleeve to help sorting randomized data, and Python code in that case is more competitive than you might think. On inputs with various forms of pre-existing order, though, Python’s sort can be thousands of times faster. Exploiting order is what it excels at, and no flavor of quicksort or heapsort can compete with that, regardless of implementation language.

Anyway, if I fiddle my quicksort sorting generator to never use sorted() (always recourses until len(xs) is 1, or it’s yielding an eq partition), it’s not even 3x slower than sorted() on my “usual” million-element randomized list of ints. I’m certain it’s not calling sorted() now because I added sorted = None at the top of the function :laughing:.

sort 1.439 1.457 1.460
qs   3.228 3.573 3.608

sort shows the times for sorted(), and qs for my generator to run to completion without ever using Python’s sort.

Those are seconds for 3 runs, sorted fastest to slowest. gc is disabled for the duration of each run, so the variance is just due to “machine noise”. Which is high! Even on a relatively quiet box. There are nevertheless about 250 processes running on the box, almost all owned by the OS, Chrome, or a handful of apps I started (like two editors just sitting there). They’re mostly idle, but there’s no guessing when they’ll briefly wake up to interfere with the caches and the cores.

Anyway, there’s some combination here of Python being faster than you might think, and/or its C sort being slower, on randomly ordered data.

1 Like

Costly parameter handling? This way is still fast:

def f2(abs=abs, x=x):
    def f2():
        return abs(x)
    return f2
f2 = f2()
 55.7 ± 0.5 ns  f()
 55.8 ± 0.5 ns  f2()
 68.3 ± 0.7 ns  f1()
Python: 3.13.0 (main, Nov  9 2024, 10:04:25) [GCC 14.2.1 20240910]

Although if I declare global x before x = -3 (otherwise it’s local in a function that timeit builds):

 49.9 ± 0.3 ns  f()
 54.7 ± 0.1 ns  f2()
 67.3 ± 0.4 ns  f1()
Python: 3.13.0 (main, Nov  9 2024, 10:04:25) [GCC 14.2.1 20240910]

Now I’m puzzled. Attempt This Online!

1 Like

No kidding :wink: This is getting too off-topic here, so I’ll just state that, on my box, one of these 4 is clearly fastest, one clearly slowest, and the other two in the middle pretty close to each other. Makes no “sense” at all to me :smile:

setup = '''\
x = -3

def f():
    return abs(x)

def f1(abs=abs, x=x):
    return abs(x)

def f2(x=x):
    return abs(x)

def f3(abs=abs):
    return abs(x)
'''
funcs = ['f()', 'f1()', 'f2()', 'f3()']

And pretty much O(n) for random data, as most values are larger than the largest of the k in the heap and thus get dismissed after a single comparison.

2 Likes

Ah - that’s a good clue! That changes the results of what I posted a lot. Although it makes f() even faster than before. Guess we should be using globals a lot more often :wink:.

1 Like

Maybe. Though Raymond’s var_access_benchmark.py still shows me they’re slower:

Variable and attribute read access:
   4.1 ns	read_local
   4.1 ns	read_nonlocal
   6.0 ns	read_global
   9.8 ns	read_builtin
1 Like

As usual, “benchmarks” seem to say less about Python than about which platform you’re running them on. On my box just now:

Variable and attribute read access:
   3.7 ns       read_local
   4.8 ns       read_nonlocal
   4.8 ns       read_global
   6.6 ns       read_builtin

But ideally the different benchmarks run on the same box shouldn’t tell us opposite things. Using dis I see an extra COPY_FREE_VARS and PUSH_NULL in my f2 compared to f. I guess that explains why f is faster despite using LOAD_GLOBAL instead of LOAD_FAST, and explains the opposite results in the benchmarks: We used the variables only once per function call, whereas Raymond used them many times, making those two extra instructions insignificant.

1 Like

Could you elaborate on this a bit?

1 Like

If all the user wanted was to minimise memory, an in-place sort is ideal (no extra memory apart from a few working variables). It can be in Python, because speed in Python is perfectly acceptable (again, because the requirement is memory usage, not maximal speed).

Or am I missing something?

2 Likes

Was it maybe because they wanted a stable sort? (I’m not aware of a simple good one.)

2 Likes

Ok, so I have combined all that has been shared here.

And added my own unifying twist on it, which is initial data analysis and partitioning algorithm, which runs in O(n).

This is the balanced Python version.
There are couple of really bad ones, but see sorted column, which shows that they are the ones where sorted is shamelessly fast.

It is easily calibratable.
So these are results with skew towards laziness.



Now the better part. I have factored out few key bits and optimized by compiling in cython “Pure Python” mode. Here are 2 plots from above with cython optimizations:

So the above could be achieved with few utility functions in C:

  1. It would be even faster.
  2. Large differences such as “inc_stairs” and “large_rep” should disappear.
  3. More aggressive “laziness” could be set, given some parts are not as expensive.


Finally, there is an option of looking into possibility to incorporate this into main sort.

I believe there are overlaps in my initial data analysis step and what sorted does.

Also, there could be opportunities to share information between this and sorted.

Full implementation in C would provide maximum performance.

It is quite a long shot, but I believe it might be possible to achieve both at the same time:

  1. Make sorted lazy
  2. Improve total performance

Then given isorted function existed, sorted would just be: list(isorted(. But as I said, this is fairly long shot and would be a bit of work.

3 Likes

in a single-pivot quick sort to lg n levels of partitioning, then sorted as the catch-all:

  • at each partitioning branch of quick sort, the partition will be divided in two. at least one of the two resulting sub-partitions is less than or equal to 1/2 of the items partitioned. if we follow these branches at each step, eventually it will result in producing values at most lg n depth in the recursion tree.
  • in a perfectly split quick sort tree, every branch at depth lg n contains one value. for every failed partition (eg. a single item placed) leading up to a quick sort partitioning node at depth lg n, the minimum size at that node scales at O(2**num_failed_partitions).

sorting a small number of items can be considered O(1). it might even fit into cache.

from a hypothetical standpoint:

  • the first case in the list above can be handled by sorted when the size of the partition to be sorted reaches the threshold where sorting is considered O(1) time.
  • in the second case, if the number of failed partitions reaches a certain amount, the best case size at quick sort terminal depth lg n along that branch (assuming equal splits for “best case”) crosses the threshold for considering sort an O(1) operation a depth lg n. this is known at the time a certain number of failed partitions have happened relative to lg n. i believe this is where partitioning should be abandoned at that node in favor of sorted.

example:

  • 1024 items to be sorted
  • lg (1024) = 10 levels max budget before sorted is called (to have 2 * n lg n) worst case)
  • 5 worst-case partitions occur (1 item placed each)
  • 1019 items are remaining in the worst partition to be sorted
  • lg (1019) = 10 levels remaining best-case for quick sort
  • 5 levels have already been spent out of the initial budget of lg (1024) = 10, leaving 5.
  • best case size (assuming equal splits) from 1019 and using the remaining 5 splits is 32 at depth lg n. meaning: if quick sort perfectly partitions this node hereafter, at level lg n there will be nodes of size 32.
  • if 32 is greater than the threshold for considering sorted an O(1) operation, stop quick sort partitioning and perform sorted on this partition of size 1019.

note: 16 lg 16 = 64 which is O(1).

It would give us more insight in the benchmarks. This is just comparing apples to oranges.

Obviously all the algorithms would be written in C for maximum performance, but that’s just too much work during the algorithm design phase.

Could you elaborate please?

You repeatedly mention that sorted() has an advantage (because it’s written in C).
Hence why you’re using values wrapped in lists. All this is demonstrating is the fact that Python is slow.

It would be better to compare the algorithms to timsort written in Python, although you could still compare it to sorted() for completion sake.

Not sure how easy it would be to implement Timsort with all the quirks and bits that sorted has. Maybe someone has one? Well in either case, this would be inevitable step along the way in case incorporation possibility is considered. For now it is just a possibility, a speculation.

Not really. It was a long time ago, and memory has faded.

Perhaps that when you ask “an expert” for advice, they may be so obsessed with novel approaches that they miss obvious things :wink:.

I don’t believe they did. Just brain freeze on my part.

Merge sort can be made in-place simply enough but at the cost of another lg n factor. Given adjacent runs A and B, search for the largest non-negative int K such that the first K elements of B belong before the last K elements of A. Binary search can be used for this, although getting it exactly right is delicate, and exponential search would probably be better for general use (K is usually small for runs built from randomly ordered data).

Then you have two adjacent, equal-sized, blocks to swap. How to do that in place should be obvious.

Now you’re left with two merges to do, each of which follows the same process.

Easiest to follow with an example:

A: 2 4 6  B: 1 3 5

1 < 6, so K >= 1. Then 3 < 4, so K >= 2. But it’s not the case that 5 < 2, so K=2 is the result. Swap the last 2 elements of A with first 2 of B, leaving two smaller pairs of runs to merge:

A:2  B:1 3
A:4 6  B:5

K happens to be 1 for both pairs. For the first, 1 is swapped with 2, leaving the 2 new pairs to do:

A:empty  B:1
A:2  B:3

and K=0 for both, so both finish at once. Similarly for the other original pair.

OK - the coding isn’t really “simple” after all :wink:. But it’s the simplest in-place sub-quadratic-worst-case stable comparison-based sort I know of.

In a merge sort that already works, it just requires replacing the “merge two adjacent runs” part. Easiest to write it recursively, but with the usual quicksort trick to limit recursion depth:

# merge adjacent pairs A and B
while A and B: # if either is empty, there's nothing to do:
    # as sketched, create new A1, B1 and A2, B2 pairs in place

    # recurse on the pair with the smallest total length

    # set A and B to the larger pair and loop

Of course if one of the pairs is a singleton, it would pay to search for its final position direclty instead. As is, e.g.,

A:2 3 4 5 6 7 8 9  B:1

acts like a slow insertion sort, moving 1 “one to the left” each time. Still linear-time, but slothful. Etc.

4 Likes