Soliciting sorters

I want to publicize a new PR and solicit help:

Issue: `list.sort` enhancement proposal: Adaptivity for `binarysort` · Issue #138946 · python/cpython · GitHub

PR: gh-138946: `list.sort` enhancement proposal: Adaptivity for `binarysort` by dg-pb · Pull Request #138947 · python/cpython · GitHub

It’s about sorting in Python. It would be extremely helpful if people who care about sort speed tried the code and reported timing results on their workloads. It’s always hard to get real-world data, and for sorting those who care often have very large and/or proprietary data they can’t share. Let’s try to change that! You don’t have to share your company’s data. Just your experience.

Not kidding about this: if nobody chips in, I’ll take it as evidence that nobody cares enough to justify more time pursuing this.

Mile-high overview: Python’s sort was a pioneer in “adapting” to significant input order, of many kinds. But “in the small” it does the opposite: it relies on binary insertion sort, which effectively does the same number of compares to sort N elements (for “small” N) regardless of how close or far they already are to being sorted. It aims to minimize the worst-case cost, an engineering tradeoff in tension with the more conditionally optimistic “in the large” strategies & tactics.
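To make the "non-adaptive in the small" point concrete, here's an illustrative sketch (my own toy code, not CPython's actual implementation): binary insertion sort does roughly the same number of compares whether the input is already sorted or random, while a plain linear insertion sort adapts.

```python
import random

class Counted:
    """Toy int wrapper that counts how many times __lt__ runs."""
    calls = 0
    __slots__ = ("v",)
    def __init__(self, v):
        self.v = v
    def __lt__(self, other):
        Counted.calls += 1
        return self.v < other.v

def binary_insertion_sort(a):
    # Binary-search the insertion point, then shift: ~log2(i) compares
    # per element no matter how ordered the input already is.
    for i in range(1, len(a)):
        x = a[i]
        lo, hi = 0, i
        while lo < hi:
            mid = (lo + hi) // 2
            if x < a[mid]:
                hi = mid
            else:
                lo = mid + 1
        a[lo + 1:i + 1] = a[lo:i]
        a[lo] = x

def linear_insertion_sort(a):
    # Scan left until the slot is found: 1 compare per element on
    # already-sorted input, ~i/2 compares per element on random input.
    for i in range(1, len(a)):
        x = a[i]
        j = i - 1
        while j >= 0 and x < a[j]:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x

def compares(sort, data):
    Counted.calls = 0
    sort([Counted(v) for v in data])
    return Counted.calls

n = 64
in_order = list(range(n))
shuffled = random.Random(0).sample(range(n), n)

bin_sorted = compares(binary_insertion_sort, in_order)
bin_random = compares(binary_insertion_sort, shuffled)
lin_sorted = compares(linear_insertion_sort, in_order)
lin_random = compares(linear_insertion_sort, shuffled)
print(f"binary: {bin_sorted} vs {bin_random}   linear: {lin_sorted} vs {lin_random}")
```

On 64 elements the binary version spends about the same number of compares either way, while the linear version drops to n-1 compares on sorted input (at the price of a quadratic worst case). Regaining that kind of adaptivity without giving up the worst-case guarantee is, roughly, what the PR is after.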

@dg-pb aims to make that part adaptive too. To my eyes, it has real potential. And not just for small lists. Even on favorable huge lists where the current sort is “astonishingly fast”, a very significant part of the total time is spent in binary insertion sort to build up shorter runs to start with. Adapting to favorable “in the small” order speeds those significantly too.

So all real-life data would be valuable.

Of course sorting has changed to make type-homogeneous lists go much faster (especially lists of floats and small integers), so “does no harm?” is also of interest. Python-defined `__lt__` comparisons remain very expensive.
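For a feel of that cost gap, here's a quick sketch (toy code; `Key` is a hypothetical stand-in for a user-defined class, not anything from the PR):

```python
import random
import time

class Key:
    """Hypothetical user class whose __lt__ is ordinary Python code,
    so every comparison re-enters the interpreter."""
    __slots__ = ("v",)
    def __init__(self, v):
        self.v = v
    def __lt__(self, other):
        return self.v < other.v

rng = random.Random(42)
floats = [rng.random() for _ in range(100_000)]
objects = [Key(v) for v in floats]

t0 = time.perf_counter()
sorted(floats)          # type-homogeneous floats: specialized compare path
t_float = time.perf_counter() - t0

t0 = time.perf_counter()
sorted(objects)         # falls back to calling Python __lt__ per compare
t_obj = time.perf_counter() - t0

print(f"floats {t_float:.4f}s, Python __lt__ {t_obj:.4f}s, ratio {t_obj / t_float:.1f}x")
```

The exact ratio is platform-dependent, which is part of why real before-and-after timings from real machines matter.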

I have no opinion on whether feedback “should be” shared on the issue or the PR. I don’t even find that distinction helpful for cases like this (quite the contrary). Suit yourself, although I’ll stick to just using the PR. But it absolutely does not belong in replies to a Discourse post :wink:.


Sebastian Wild had solicited real-life sorting data when working on Powersort (see How to contribute your inputs to the Adaptive Sorting Benchmark ). Maybe he could share any data he collected with you and @dg-pb?


(Let me also publicly reply here.)
In Track B of the Powersort Competition this was exactly my goal, but there weren’t any useful submissions. (Hint: I’ve still got prizes left; you can earn some vouchers by submitting :wink: )

Track A was more popular, but the data is engineered to show a large difference between old and new merge policies; not likely representative.

For both tracks, I only collect a “rank-reduced” version of the data, i.e., a list of integers in the same order as the original data, so something is lost there. On the other hand, the hope was that this conversion would lower concerns about contributing data, as no actual data ever leaves the contributors.
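A sketch of what such a rank reduction might look like (my illustration of the idea, not Sebastian's actual pipeline):

```python
def rank_reduce(data):
    # Replace each value by its rank in sorted order (stable: ties keep
    # their original relative order, matching how sorted() treats them).
    # The integer list compares element-for-element in the same order as
    # the original values, but carries none of their content.
    order = sorted(range(len(data)), key=data.__getitem__)
    ranks = [0] * len(data)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks

print(rank_reduce(["cherry", "apple", "banana"]))   # → [2, 0, 1]
```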


Hi, @sebawild! Been a long time. Thank you for chiming in :smile:.

The only thing we want to measure here is elapsed time on real workloads on users’ platforms, because that’s the only thing end users see. And that’s messy.

  • Sorting has evolved to specialize on the element type of type-homogeneous lists. Some kinds of comparisons are very cheap now (like in lists of floats), while others remain very expensive (like `__lt__` methods coded in Python).

  • Some workloads never sort a list with 100s of millions of elements, but instead sort lists of at most a few hundred elements, millions of times. Merge costs can approach irrelevance then; total time is instead dominated by binary insertion sort building the initial runs.

  • As you know too well :wink:, modern architectures are very hard to out-think. Even when “timsort” was first introduced, we had specific test data that ran significantly faster on some platforms (combination of HW and compiler), but significantly slower on others. No satisfying explanation was found (HW-specific “cache effects” was the best vague guess).

    But that’s again what end users see, so we want to know about that too. The specific platforms most in use by those who care can matter too. We don’t have the people-power to use different code on different platforms, unless it can key off of portable C #defines that capture relevant HW differences. Fat chance.

So it’s very messy. But also very simple: if lots of users contribute timing results they get, we can just follow the data.
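As a rough sketch of the kind of measurement that helps (a hypothetical harness, not the PR's official benchmark script): run the same workload under the stock build and the PR build, and report both numbers.

```python
import random
import time

def time_many_small_sorts(n_lists=10_000, size=200, seed=0):
    # The "sort small lists, many times" regime: merge costs barely
    # matter here, and building initial runs dominates the total time.
    rng = random.Random(seed)
    data = [[rng.random() for _ in range(size)] for _ in range(n_lists)]
    start = time.perf_counter()
    for lst in data:
        lst.sort()
    return time.perf_counter() - start

# Run this same script under both interpreters and compare the numbers.
print(f"{time_many_small_sorts():.3f}s")
```

Swap in your own list shapes and key types; the absolute number is meaningless on its own, only the before/after ratio on your machine is.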

But, ya, that’s unlikely to happen :frowning:.


For what it’s worth, I’ve put the Track A submissions in a quick repo: GitHub - sebawild/powersort-benchmark: Benchmark of sorting instances from Track A of the Powersort Competition.
Use at own risk :wink:


@tim.one On a fundamental level, it’s annoying that our (fully man-made!) machines are too complicated to predict what happens, but there we are.
As you know, I’m always trying to isolate smaller sub-questions where we can give more satisfying answers.


@diegor See above for something where ARM might do something useful for future generations. Just ask Tim questions.


It would be very helpful if this could get run on a few more machines: abinary_minitmr.py · GitHub

```
.../main/python.exe abinary_minitmr.py /output/dir base
.../new/python.exe abinary_minitmr.py /output/dir new

.../any/python.exe abinary_minitmr.py /output/dir compare
```

Here is some real world data, in the hope it might be useful. This is from an asset tracking system. The database contains the items in creation order but typically displays a table of assets in sorted order. The sort keys are customizable but the default is to sort on the equipment IDs (usually serial numbers but can be other schemes). In order to censor the data, I sorted the list of strings and then replaced the string values with their original position in the underlying data table.

asset_id_seq.py · GitHub
asset_id_seq2.py · GitHub
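If I'm reading the censoring step right, it amounts to something like this sketch (hypothetical serial numbers; assumes the IDs are unique):

```python
def censor(ids):
    # Sort the ID strings, then replace each with its position in the
    # original table: the result is an integer sequence that preserves
    # the ordering structure without exposing any real ID.
    position = {s: i for i, s in enumerate(ids)}
    return [position[s] for s in sorted(ids)]

print(censor(["SN-9", "SN-1", "SN-5"]))   # → [1, 2, 0]
```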

If there is a better way to share this data, let me know.


@sebawild may find that useful, but it’s not really what the PR this topic is about needs:

gh-138946: list.sort enhancement proposal: Adaptivity for binarysort by dg-pb · Pull Request #138947 · python/cpython · GitHub

We’re looking for actual before-and-after timings, on your actual apps, running on your platform with your real data. We don’t need to see the data, just quantification of relative before-and-after speed. A brief description would be helpful, like:

  • Platform (HW, compiler, compiler options used).
  • Nature of the sort keys (small ints, bigints, strings, mixed, custom Python objects, …).
  • How many elements were being sorted.

“Theoretical questions” have already been analyzed as best we can, but users don’t run theory :wink:. We want to know the actual results they’ll actually see if the PR is committed. Any such results should be added to the PR (or the related issue); they’ll just get lost on Discourse.

Guido, thanks for the ping. I will have a look at the problem, and if I have questions I will bother @tim.one :slight_smile:

Thank you for the data. I tested it, and from the algorithm’s point of view its nature is the same as (or at least very close to) purely random data.
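One rough way to see that kind of thing (my own sketch, not necessarily how it was actually checked): count ascending runs. A uniformly random permutation has about n/2 of them, while data with exploitable existing order has far fewer.

```python
import random

def ascending_runs(a):
    # Count maximal non-descending runs: already-sorted data gives 1,
    # a random permutation gives roughly len(a) / 2.
    runs = 1
    for x, y in zip(a, a[1:]):
        if y < x:
            runs += 1
    return runs

n = 10_000
perm = random.Random(1).sample(range(n), n)
print(ascending_runs(sorted(perm)), ascending_runs(perm))
```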

The results are in line with the “WORST” dataset present in micro benchmarks in `list.sort` enhancement proposal: Adaptivity for `binarysort` · Issue #138946 · python/cpython · GitHub.

If you have concerns regarding the impact this change would have on your specific HW, please run Soliciting sorters - #9 by dg-pb and share results in the GH issue.

You can also add your own data to the list (I left a place for it: `EXTRA = {'data_name': [data_list]}`); just trim it to 10K elements for a reasonable runtime.