Note on implementation: The test branch uses key: callable | list for
prototyping. The proposed API uses keylist= to avoid type ambiguity.
1. Background
So there already have been some discussions around this:
- I saw quite explicit request for this, but can not find it now. But here is stack: python - Sorting list according to corresponding values from a parallel list - Stack Overflow, which has 651 upvotes
- Add a dict.sort() method - #32 by Marco_Sulla
- Indexing, formatting, sorting (4) has hinted at this
So (1) is a generic problem, while (2) and (3) is a special case.
Motivation is 2 parts:
- Convenience
- Performance
2. Convenience
Often one has 2 lists:
- one holds values of final interest
- another holds keys according to which sorting is needed
Without this functionality, one needs to zip or/and key=:
pairs = zip(keys, values)
sorted_pairs = sorted(pairs, key=opr.itemgetter(0))
keys, values = zip(*sorted_pairs)
With this, things are much more straight forward:
values.sort(key=keys)
3. Performance
So sorting dict, unfortunately, is not going to benefit much in convenience.
But where it will benefit is performance.
For 2 cases below, performance benefit is 30-40%
Setup:
S="
import random
import operator as opr
N = 100
# N = 100_000
# keys = range(N, 0, -1) # DEC
keys = random.sample(range(N), N) # RND
dct = dict(zip(keys, range(N)))
"
A="
d = dct.copy()
items = sorted(d.items())
d.clear()
d.update(items)
"
B="
d = dct.copy()
items = sorted(d.items(), key=opr.itemgetter(0))
d.clear()
d.update(items)
"
C="
d = dct.copy()
keys = list(d.keys())
values = list(d.values())
values.sort(key=keys)
d.clear()
d.update(zip(keys, values))
"
Results:
# DEC # 100 | 100K
./python.exe -m timeit -s "$S" "$A" # 9.0 µs | 15.0 ms
./python.exe -m timeit -s "$S" "$B" # 9.0 µs | 15.0 ms
./python.exe -m timeit -s "$S" "$C" # 8.0 µs | 10.0 ms
# RND # 100 | 100K
./python.exe -m timeit -s "$S" "$A" # 18 µs | 80 ms
./python.exe -m timeit -s "$S" "$B" # 13 µs | 55 ms
./python.exe -m timeit -s "$S" "$C" # 11 µs | 34 ms
(A) although often used, is not correct solution which works for all cases - it will use 2nd item in case of duplicates.
(B) is robust solution that one should would use now if generic solution applicable to all cases is needed.
(C) is using proposed functionality.
It can be clearly seen that avoiding copies and more importantly calling keyfunc can result in large performance gains.
And proposed functionality is offering straight and most optimal path for cases that can exploit this (or another way to structure the problem where extra performance can be beneficial).
4. API
API that I have tested with is:
def sort(..., key: callable | list, ...)
It works nicely and unambiguously.
But it can just as well be:
def sort(..., key: callable, keylist: list):
def sort(..., key: callable, keys: list):
assert not (key and keys)
I thin the best would be to exclude sorted from this and only do this for list.sort.
The variation I am currently at:
def list.sort(*, key=None, keylist=None, reverse=False)
"""
Sort list in place.
keylist: list or None
If provided, sort this list by the corresponding elements in keylist.
Both this list and keylist are mutated in lockstep to maintain
correspondence. keylist must be a list of equal length.
Cannot be used with key parameter.
To preserve keylist order, use: values.sort(keylist=list(keys))
"""
keylist gets mutated, because it is easy to wrap it in list for non-mutability:
list.sort(keylist=list(keys))
but there is no way to achieve mutability if it is restricted.
5. Any thoughts?
So would others find this useful?
Do you have any cases in mind?
I can test performance benefit for them.
Or even better - you can test it yourself - GitHub - dg-pb/cpython at sort_keys
One final benchmark:
S='
import random, operator
N = 100_000
keys = random.sample(range(N), N)
values = list(range(N))
# Baseline: zip + sort
def baseline():
pairs = list(zip(keys, values))
pairs.sort(key=operator.itemgetter(0))
return [v for _, v in pairs]
# This proposal
def proposed():
values.sort(key=list(keys))
return values
'
./python.exe -m timeit -s "$S" "baseline()" # 60 ms
./python.exe -m timeit -s "$S" "proposed()" # 20 ms