Thank you
I like getting a bit of stats. (even if that might not be the primary indicator)
Ok, let’s look at the current situation…
Setup
N = 100_000
import random
from operator import itemgetter
keys = random.sample(range(N), N)
columns = [list(range(N)) for _ in range(100)]
def sort_zip(ncols):
if ncols == 1:
col = columns[0]
pairs = sorted(zip(keys, col), key=itemgetter(0))
new_keys, new_col = map(list, zip(*pairs))
else:
keys_rows = zip(keys, *columns)
new_keys_rows = sorted(keys_rows, key=itemgetter(0))
new_keys, *new_cols = map(list, zip(*new_keys_rows))
def sort_new_or_zip(ncols):
new_keys = list(keys)
if ncols == 1:
new_col = sorted(columns[0], key=new_keys)
else:
new_rows = sorted(zip(*columns), key=new_keys)
new_cols = list(map(list, zip(*new_rows)))
def sort_itemgetter(ncols):
idxs = sorted(range(N), key=keys.__getitem__)
reorderer = itemgetter(*idxs)
new_keys = list(reorderer(keys))
new_cols = list(map(list, map(reorderer, columns[:ncols])))
def sort_itemgetter_new(ncols):
new_keys = list(keys)
idxs = sorted(range(N), key=new_keys)
reorderer = itemgetter(*idxs)
new_cols = list(map(list, map(reorderer, columns[:ncols])))
./python.exe -m timeit -s "$S" "sort_zip(30)"
./python.exe -m timeit -s "$S" "sort_new_or_zip(30)"
./python.exe -m timeit -s "$S" "sort_itemgetter(30)"
./python.exe -m timeit -s "$S" "sort_itemgetter_new(30)"
# ncols | 1 | 2 | 3 | 30 |
# ------------------- |----|-----|-----|------|
# sort_zip | 70 | 150 | 150 | 3600 |
# sort_new_or_zip | 20 | 130 | 150 | 3600 |
# sort_itemgetter | 50 | 65 | 75 | 435 |
# sort_itemgetter_new | 35 | 50 | 60 | 420 |
'
So:
1a. if key list does not need to be mutated, this provides performance benefit regardless of whether keylist gets mutated or not. (Clearly visible in benchmarks of the main thread)
1b. Mutating keylist provides very fast path to sort 1 key list and reorder 1 value list (e.g. dict sorting).
2. If there is more than 1 value list, then operator.itemgetter is the way. this provides performance benefit, which decreases as number of value lists grows (as more and more time is spent on reordering while sorting step stays constant).
How would code look like?
Scenario 1. keylist IS mutated
(1a) sort values, but do NOT sort keys
sorted(values, keylist=list(keys))
(1b) sort both values and keys
sorted(values, keylist=keys)
(2a) sort many values, but DO NOT sort keys
idxs = sorted(range(N), keylist=list(keys))
# reorder values with itemgetter
(2b) sort many values and keys as well
idxs = sorted(range(N), keylist=keys)
# reorder values with itemgetter
Scenario 2. keylist is NOT mutated
(1a) sort values, but DO NOT sort keys
sorted(values, keylist=keys)
(1b) sort both values and keys
# if `keylist` is NOT mutated, this is not available. Use slower (2b)
(2a) sort many values, but DO NOT sort keys
idxs = sorted(range(N), keylist=keys)
# reorder values with itemgetter
(2b) sort many values and keys as well
idxs = sorted(range(N), keylist=keys) # if `keylist` is NOT mutated
# reorder keys with itemgetter
# reorder values with itemgetter
Tradeoffs:
- Scenario 1 mutates the
keylist - without good documentation some might trip too often
- Scenario 1 has 1 extra path for maximum performance in case both 1
keys and 1 values need to be sorted based on keys.
- Aggregate verbosity is approximately the same for both cases (scenario 2 does not have (1b), but the (2b) is more verbose)
Path Forward
If non-mutating keylist is preferred, then problem is solved - there is not much to discuss.
However, I think dict sorting (1b case) is quite fundamental and 2.5x faster result compared to 1.4x in case of non-mutation is quite compelling. Also, mutation provides extra performance for (2b), which can be felt for small number of value lists.
Thus, if performance benefits justify keylist mutation, then I see several paths:
- Just do mutating
keylist in list.sort(keylist=keys) / sorted(values, keylist=keys) and focus on documentation and education
- Mutate for
values.sort(keylist=keys), but do NOT mutate sorted(values, keylist=keys). In a certain way it could be less confusing to remember that list.sort mutates everything and sorted mutates nothing. However, this brings inconsistency in another dimension - different argument treatment for 2 different functions, which still confuses people, just in different way.
- Let a bit of
Pike into Python and do keys.sort(parallel=values) and sorted(keys, parallel=values). This way, it is perfectly intuitive that list which is being sorted in parallel is being mutated. In terms of performance, there is no loss here.