Adding unstable sort to cpython

alonme · December 30, 2022, 1:35pm

Pythons sorting functions use timsort which is a stable sort.
For some use cases the “stability” feature is unneeded, in which case an unstable sort can be used.

Unstable sorting algorithms can be faster - and hence are useful in some use cases ( see GitHub - orlp/pdqsort: Pattern-defeating quicksort. benchmarks for example)

For example, rust has an unstable sort - which was proposed in the following RFC which i believe explains the need and possible solutions in a great way.

https://rust-lang.github.io/rfcs/1884-unstable-sort.html

ericvsmith · December 30, 2022, 2:07pm

It seems to me that this is a case for a package on PyPI. If it becomes super popular we can consider it for the stdlib.

smontanaro · December 30, 2022, 2:08pm

I’ll leave detailed explanations to @tim.one, but I don’t think Python’s list sorting was guaranteed to be stable at the time. That suggests that had an unstable sort been faster, that’s what timsort would have become.

Rosuav · December 30, 2022, 2:39pm

Hack on CPython and replace the sorting algorithm with a specific one that you want to recommend. Benchmark it. Unless it’s faster, there’s no proposal here.

If it IS faster, then the next question is: how much faster is it? In order for an unstable sort to be valuable, it has to be worth the complexity of having two different ways to sort things - one that works all the time, and one that only works some of the time. That’s going to need a LOT of speed difference, so there’s going to need to be some very good-looking numbers along with the proposal.

alonme · December 30, 2022, 2:59pm

@Rosuav

Both work “all the time”, they just do different work.

I agree that its only worth the hassle if the speed difference is a major one,
However i have reasons to believe it will be like that,
rust’s “stable” sort is using timsort, and they wrote about pdqsort being 45% faster (See RFC)

Is there any benchmark for sorting that CPython uses?

Rosuav · December 30, 2022, 3:31pm

Well, no, because an unstable sort can’t be used when you need stability, but a stable sort can be used when stability isn’t necessary. So the sort method that lists already have can be used for ALL situations, but the one you’re proposing can only be used when stability doesn’t matter. So unless it’s sufficiently fast, there’s no point adding the complexity of a second sort method.

Check the pyperformance suite. I’m not sure if there’s anything specific to sorting, but if you build CPython twice, once with the existing algorithm and once with a different one, you should be able to get some decent comparative data.

Without any real hard data to work with, I would say that the simplicity of knowing that things “just work” is incredibly helpful. For example, knowing that dictionaries remember their insertion order (without having to explicitly select an order-remembering dictionary) is extremely handy, since it means there’s no agonizing decision about whether it’s worth paying a performance cost for convenience. Same goes for stable sorting. Working across languages and having to remember to warp an algorithm against one language’s lack of promise of something is a lot more hassle than “oh, this is 3% slower”, which doesn’t change the way you code at all.

So… how good can you make this?

pf_moore · December 30, 2022, 5:18pm

The arguments in the Rust RFC seem to be mostly about the fact that unstable sorts allow no-allocation strategies, and the fact that an unstable sort can be used with no-std. So pdqsort is more attractive in a systems programming context. That’s not to say that performance isn’t important, but so is stability (for some applications) and the trade-offs in Python are different.

Let’s wait to see some real-world performance data before deciding.

tjreedy · December 30, 2022, 7:08pm

A major goal of timsort was to exploit non-random structure in data to be sorted. For performance, it exploits the fact that ‘worse’ O(NN) algorithms can be real-time faster than academically ‘better’ O(NlogN) algorithms for small N. Notice that O(NlogN) algorithms run fast for large N by splitting the problem into multiple small N pieces that can eventually run faster with NN insert sort. It turned out that for mergesort stability was possible with little loss of performance. Tim Peters included in the repository the following, which explains the benchmarks used for comparisons with the previous sort and variations of what became timsort.

github.com

python/cpython/blob/main/Objects/listsort.txt

Intro
-----
This describes an adaptive, stable, natural mergesort, modestly called
timsort (hey, I earned it <wink>).  It has supernatural performance on many
kinds of partially ordered arrays (less than lg(N!) comparisons needed, and
as few as N-1), yet as fast as Python's previous highly tuned samplesort
hybrid on random arrays.

In a nutshell, the main routine marches over the array once, left to right,
alternately identifying the next run, then merging it into the previous
runs "intelligently".  Everything else is complication for speed, and some
hard-won measure of memory efficiency.


Comparison with Python's Samplesort Hybrid
------------------------------------------
+ timsort can require a temp array containing as many as N//2 pointers,
  which means as many as 2*N extra bytes on 32-bit boxes.  It can be
  expected to require a temp array this large when sorting random data; on
  data with significant structure, it may get away without using any extra

This file has been truncated. show original

The github.io blog post linked above says

Q: How much faster can unstable sort be?
A: Sorting 10M 64-bit integers using pdqsort (an unstable sort implementation) is 45% faster than using slice::sort. Detailed benchmarks are here.

The repository readme, also linked above, makes it clear that machine ints are a special case that can be sped up with special tricks that I understood to not be applicable to strings and structures (which would include Python’s boxed ints).

I don’t know to what extent slice::sort is timsort, as ‘timsort’ is not mentioned among the sources of pdqsort. If it were timsort, then the answer here:

Q: Is slice::sort ever faster than pdqsort?
A: Yes, there are a few cases where it is faster. For example, if the slice consists of several pre-sorted sequences concatenated one after another, then slice::sort will most probably be faster. Another case is when using costly comparison functions, e.g. when sorting strings. slice::sort optimizes the number of comparisons very well, while pdqsort optimizes for fewer writes to memory at expense of slightly larger number of comparisons. But other than that, slice::sort should be generally slower than pdqsort.

should consider all the cases where timsort excels. Notice that pdqsort is slower with ‘costly comparison functions’ including strings. I presume this would even include python ints.

My guess is that pdqsort might be of more interest to the numpy people, for sorting arrays of machine numbers. (I don’t know what it does now.)

da-woods · December 30, 2022, 8:23pm

FWIW here’s a trivial Cython wrapping of the C++ standard library sort and stable_sort functions:

# distutils: language=c++

from libcpp.algorithm cimport sort, stable_sort as stable_sort_
from libcpp cimport bool as cppbool
from cpython.sequence cimport PySequence_Fast_ITEMS
from cpython cimport PyObject

cdef cppbool lt(lhs, rhs) noexcept:
    return lhs < rhs

def stable_sort(o):
    l = list(o)
    cdef PyObject** items = PySequence_Fast_ITEMS(l)
    cdef Py_ssize_t length = len(l)
    stable_sort_(items, items+length, lt)
    return l

def unstable_sort(o):
    l = list(o)
    cdef PyObject** items = PySequence_Fast_ITEMS(l)
    cdef Py_ssize_t length = len(l)
    sort(items, items+length, lt)
    return l

It doesn’t handle exceptions from the comparison function correctly (although that wouldn’t be too hard to do if you took the comparison function out of Cython), or do custom key functions/reverse like the real “sorted” function does.

On my PC

>>> x = [ random.random() for _ in range(10000) ]
>>> print(timeit.timeit("sort2.stable_sort(x)", globals=globals(), number=1000))
2.6741872249986045
>>> print(timeit.timeit("sort2.unstable_sort(x)", globals=globals(), number=1000))
3.218431173998397
>>> print(timeit.timeit("sorted(x)", globals=globals(), number=1000))
1.5250315239973133

So Python comfortably beats the C++ stable sort, which actually beats the C++ unstable sort(?).

So the conclusion is that Python is competitive and that unstable sort may not be faster.

Obviously more time and effort could be spent on this, but it isn’t an obvious win using what should be a decent external implementation.

alonme · December 30, 2022, 8:36pm

regarding the stable sort being timsort,
see here Vec in std::vec - Rust

da-woods · December 30, 2022, 8:40pm

With that said, I think my test-case (all the keys being floats) might have hit one of the optimized paths in the Python sort… But it’s easy enough to test further if other people want to

alonme · December 30, 2022, 9:07pm

Haven’t looked deeply into libcpp, but i am getting very different results in a small benchmark i wrote.

Here is the code for a pyo3 module that implements both a stable and unstable sorting functions.

use pyo3::prelude::*;

#[pyfunction]
fn sort_unstable_list_of_numbers(mut a: Vec<i32>) -> PyResult<Vec<i32>> {
    a.sort_unstable();
    Ok(a)
}
#[pyfunction]
fn sort_list_of_numbers(mut a: Vec<i32>) -> PyResult<Vec<i32>> {
    a.sort();

    Ok(a)
}

/// A Python module implemented in Rust.
#[pymodule]
fn rust_sort(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(sort_unstable_list_of_numbers, m)?)?;
    m.add_function(wrap_pyfunction!(sort_list_of_numbers, m)?)?;
    Ok(())
}

and a short python script to compare these to pythons and numpy’s sorting (which is unstable in this specific configuration)

from timeit import Timer
from rust_sort import sort_list_of_numbers, sort_unstable_list_of_numbers
import random

import numpy as np


def python_sort(l: list):
    return sorted(l)

def numpy_sort(l: list):
    return np.sort(l)

def check_timing(lname:str,l: list, timeit_number: int = 1):
    print("\n\n")
    print(lname)
    python_sort_result = Timer(lambda : python_sort(l)).timeit(timeit_number)
    print(f"python sort:        {python_sort_result}")
    numpy_sort_result = Timer(lambda : numpy_sort(l)).timeit(timeit_number)
    print(f"numpy sort:         {numpy_sort_result}")
    rust_sort_result = Timer(lambda : sort_list_of_numbers(l)).timeit(timeit_number)
    print(f"rust stable sort:   {rust_sort_result}")
    rust_sort_unstable_result = Timer(lambda : sort_unstable_list_of_numbers(l)).timeit(timeit_number)
    print(f"rust unstable sort: {rust_sort_unstable_result}")
    print(f"rust vs python:         \t\t{rust_sort_result / python_sort_result}")
    print(f"rust vs numpy:          \t\t{rust_sort_result / numpy_sort_result}")
    print(f"rust unstable vs python:\t\t{rust_sort_unstable_result / python_sort_result}")
    print(f"rust unstable vs numpy: \t\t{rust_sort_unstable_result / numpy_sort_result}")


if __name__ == "__main__":
    # Set seed to get reproducible results
    random.seed(0)
    randomlist1 = random.choices(range(1_000_000), k=500)
    random.seed(0)
    randomlist2 = random.choices(range(500), k=500)
    random.seed(0)
    randomlist3 = random.choices(range(1_000_000), k=50000)
    random.seed(0)
    randomlist4 = random.choices(range(500), k=50000)

    lists_to_sort = {
        "randomlist1": randomlist1,
        "randomlist2": randomlist2,
        "randomlist3": randomlist3,
        "randomlist4": randomlist4,
    }

    for lname, l in lists_to_sort.items():
        check_timing(lname, l, timeit_number=500)

which results in the following

randomlist1
python sort:        0.011974836932495236
numpy sort:         0.011842277133837342
rust stable sort:   0.015231002122163773
rust unstable sort: 0.01168325706385076
rust vs python:                         1.27191728856303
rust vs numpy:                          1.2861548458989962
rust unstable vs python:                0.9756506188528349
rust unstable vs numpy:                 0.9865718334244848



randomlist2
python sort:        0.012899479130282998
numpy sort:         0.012305153999477625
rust stable sort:   0.01314361416734755
rust unstable sort: 0.00944886589422822
rust vs python:                         1.0189259608546066
rust vs numpy:                          1.0681389414472602
rust unstable vs python:                0.7324998008676126
rust unstable vs numpy:                 0.7678787193260109



randomlist3
python sort:        4.0684269110206515
numpy sort:         2.2548892840277404
rust stable sort:   2.451281855814159
rust unstable sort: 1.7647692309692502
rust vs python:                         0.602513430725293
rust vs numpy:                          1.0870963258274113
rust unstable vs python:                0.43377189011035233
rust unstable vs numpy:                 0.782641189290223



randomlist4
python sort:        3.123894941061735
numpy sort:         1.9065817210357636
rust stable sort:   1.9543205290101469
rust unstable sort: 0.9952141779940575
rust vs python:                         0.6256037945840526
rust vs numpy:                          1.0250389518831895
rust unstable vs python:                0.31858119327656026
rust unstable vs numpy:                 0.5219887335610249

Edit:
by pre converting the list to an np.ndarray, the results of numpy are getting much better, however that creates an unfair comparison to the rust code (which does some conversions to pass the lists to rust)

da-woods · December 30, 2022, 9:39pm

The key difference I think is that you’re converting everything to fixed-width integers and sorting those. I’m using a comparison between Python objects. So my code would support a mixture of different Python types and yours wouldn’t.

alonme · December 30, 2022, 9:48pm

Right, i think that pyo3 doesn’t support “Generics” so that is what i could quickly do.

However - numpy is doing the same - and is still slower, but i think it also uses an unstable sort, so that is another story.

BTW - i now see that there are some specializations in the python implementation (see here https://github.com/python/cpython/pull/582)
and this also contains some benchmarks that were used in the past, which we can possibly compare to

ajoino · December 30, 2022, 10:14pm

I do not think that is an unfair comparison, since np.sort is made to sort numpy arrays. By not providing a numpy array as input I think you are testing an unrealistic scenario.

alonme · December 30, 2022, 10:25pm

I get what you are saying,
Just for completeness:
comparing the speed of the pre-converted ndarray to those of rust

this isn’t run in a loop,
the “rust side” prints are done in the rust module, measuring the time it took to run the sort itself, and nothing more

randomlist1
numpy ndarray sort: 0.00011563091538846493
stable sort (rust side): 1.7032e-5
rust stable sort:   7.940409705042839e-05
unstable sort (rust side): 1.6366e-5
rust unstable sort: 3.959890455007553e-05

randomlist2
numpy ndarray sort: 2.8833048418164253e-05
stable sort (rust side): 1.6622e-5
rust stable sort:   3.784080035984516e-05
unstable sort (rust side): 1.1679e-5
rust unstable sort: 3.1874049454927444e-05

randomlist3
numpy ndarray sort: 0.0027110970113426447
stable sort (rust side): 0.00246804
rust stable sort:   0.005548004060983658
unstable sort (rust side): 0.001281326
rust unstable sort: 0.00433477689512074

randomlist4
numpy ndarray sort: 0.001977666048333049
stable sort (rust side): 0.002575172
rust stable sort:   0.004511518171057105
unstable sort (rust side): 0.000523034
rust unstable sort: 0.0025977911427617073

Rosuav · December 30, 2022, 10:44pm

You’re timing some incredibly short operations here. Seeing results saying e-5 is a good indication that you aren’t actually getting any useful data at all. Here’s an example of literally doing nothing in several languages:

$ python3 -c 'import time; start = time.time(); print(time.time() - start)'
2.384185791015625e-07
$ python2 -c 'import time; start = time.time(); print(time.time() - start)'
9.53674316406e-07
$ pike -e 'int start = gethrtime(1); write("%d ns\n", gethrtime(1) - start);'
172 ns
$ ruby -e 't = Time.new; print(Time.new - t, "\n")'
8.82e-07
$ node -e 'let t = +new Date; console.log(+new Date - t)'
0

From which you can deduce some utterly meaningless things, like that Python 2 is four times the speed of Python 3. Things get even more interesting when you try to measure the time cost of discovering the time of day. For example:

$ python3 -m timeit -s 'import time' 'time.time()'
5000000 loops, best of 5: 58.1 nsec per loop

So attempting to calculate the cost of a single sort by “record time of day, do the sort, record time of day, subtract” runs the risk that the time cost of record time of day is at least as much as the time cost of do the sort. That’s not good for the measurement.

That’s why these kinds of things are ALWAYS done in a loop. For something like this, where it may well depend on the order of elements, I’d recommend doing two separate tests, something like this:

$ python3 -m timeit -s 'import random; items = [random.randrange(10000) for _ in range(10)]' 'tmp = items[:]'
5000000 loops, best of 5: 63.5 nsec per loop
$ python3 -m timeit -s 'import random; items = [random.randrange(10000) for _ in range(10)]' 'tmp = items[:]; tmp.sort()'
2000000 loops, best of 5: 158 nsec per loop

And then you can easily see the true cost of the sort - and also how roughly half of the cost is there even if you aren’t sorting.

Performance measurement is HARD, which is why we have tools that try to make it easier for us. Never trust single measurements that come in with crazily low values.

alonme · December 30, 2022, 11:00pm

I disagree that this example is too small (sorting 50K numbers in the last one), nor does it fall to some of the pitfalls in your examples.
But anyway - that is exactly why i pointed out this is a single run, and that this is just for completeness

Its really off topic by this point but here we go with running a bigger test in a loop

randomlist_million = random.choices(range(10_000_000), k=1_000_000)

ran 10 times in a loop results in following

np.sort pre-converted  - average: 0.069096 std: 0.001393
rust internal unstable - average: 0.029869 std: 0.000492

Lets drop this specific issue now, Comparing rust to numpy is not what this is about

Rosuav · December 30, 2022, 11:07pm

It’s not so much the number of objects being sorted, but the fact that you’re only doing it once. Try commenting out the actual sort step from the early examples (the ones that showed e-5 times) and see how much the times change by.

Sorting enormous lists is important, but so is sorting small lists. So properly benchmarking those is just as valuable - it just happens to be a bit harder to do

tim.one · December 31, 2022, 1:02am

Terry already linked to listsort.txt, which was intended to be my last word on the topic .

As it says, the disadvantage of merge sorts is that they do much more data movement than members of the quicksort family (of which pdqsort is one). When comparisons are cheap (as for native machine ints or floats on modern boxes), saving moving a memory word may be as valuable as saving a comparison. Quicksort variants shine then - quicksort is stellar at reducing memory movement.

But, in Python, and especially at the time the current sort was written, comparisons are enormously more expensive than moving an object pointer in memory. That’s where merge sorts shine: they get very much closer to doing a minimal number of compares than quicksorts get. See listsort.txt for brief quantification of that. But they do a lot more data movement.

There is no reason in general that stable vs non-stable is important to speed. That’s not really the issue. But it can be, in some cases. For example, consider this list:

xs = [2] + [1] * 1_000_000_000

That is, 2 followed by a billion 1s. A sufficiently clever non-stable sort could just swap the endpoints to get the list in sorted order.

But that violates stability (the 1 at the end moves to the start). A stable sort cannot do better than moving all billion plus one elements, rotating the entire list one place to the left.

I still pay attention to developments in sorting, and I’ve seen nothing yet that would plausibly be an improvement over Python’s general-purpose sort for CPython’s needs.

For specific data types, sure. There’s no end of ways to exploit regularities. But the only thing Python’s sort requires of objects is that they implement __lt__(). It makes no assumptions of any kind about how the objects are stored, are represented, or what they “mean”.

If you’re always sorting lists of floats, by all means, use numpy’s sort instead.

Or if you’re always sorting lists of strings, there are faster ways to sort those too (although, offhand, I don’t know of an extension module geared to that specific case).

The builtin sort works for any types that implement a sane __lt__(), and strives to invoke that method as seldom as possible for common, real-life cases.

I have no doubt that pdqsort can be faster for some specific data types, but also no doubt that it wouldn’t be a suitable replacement for CPython’s current sort (because it’s not stable, and because quicksort variants do more comparisons, which are still significantly slower than pointer copies).

Topic		Replies	Views
Add a #define UNSTABLE_ABI for tools and extensions like Cython and Numpy C API	14	571	February 8, 2023
PEP 689 -- Unstable C API tier PEPs	16	1559	February 15, 2023
Add binarytree to stdlib Ideas	2	1186	July 20, 2021
Adding sortedcontainers package into the standard library? Ideas	5	2688	January 10, 2021
Python vs C++ on sorting dict Python Help	7	1062	March 29, 2022

Adding unstable sort to cpython

Related Topics