Allow `sorted` to take list for `key` argument

I think it is the other way round:

  • keyiter + keybuffer - less verbose for user code
  • keymode - more verbose in user code, but could be more convenient to write derivatives as per above

Although I am not convinced about the above…

e.g. what would this do?
argsort([2, 1, 3], [3, 1, 2], 'iter')
And how does that make sense?

but I appreciate that one more argument needs to be cascaded and early error detection, for cases where sorted gets called far down the line is a bit more involved:

assert sum([v is not None for v in [key, keyiter, keybuffer]]) <= 1

Not catastrophic, but:

assert keymode in {'keyfunc', 'iter', 'inplace'}

is a bit quicker.


However, what you are trying to do with argsort will be tricky with both variations - it is not 1-to-1 relationship and mutability needs to be handled and adapted.

In essence, there are no xs. Only range(N) and key. However, key can be either:

  1. Iterable
  2. Iterable + keyfunc
  3. list to be mutated in-place

I think production version of such will have to give up attempt to make strict equivalence with list.sort and just code it as the concept dictates and make use of list.sort efficiently:

def argsort(xs, key='iter', reverse=False):
    if key != 'inplace':
        if key != 'iter':
            xs = map(key, xs)
        xs = list(xs)
    else:
        assert isinstance(xs, list)
    indices = range(len(xs))

    # keymode
    return sorted(indices, key=xs, keymode='inplace', reverse=reverse)
    # keyiter / keybuffer
    return sorted(indices, keybuffer=xs, reverse=reverse)

However, key as keyfunc in relation to list.sort is unnecessary confusion.
For list.sort, key makes sense as it has values to be sorted which are different than decorated key version.

While here, values don’t exist - they are range(N), thus:

def argsort(xs, reverse=False, inplace=False):
    if not inplace:
        xs = list(xs)

    # keymode
    return sorted(range(len(xs)), key=xs, keymode='inplace', reverse=reverse)

    # keyiter / keybuffer
    return sorted(range(len(xs)), keybuffer=xs, reverse=reverse)

If one wants to use keyfunc, then:

argsort(map(keyfunc, keys))

which leaves implementation simpler, usage more explicit and not much worse than previous:

argsort(keys, key=keyfunc)
2 Likes

Yeah, I think it’s good.

2 Likes

About the “2 for 1” sorting…
To me, the keymode way would be a fine syntax as long as it states operational modes internal to the function. But here, the keymode value determines if the function has impact outside of its object’s scope, and this feels odd to me.

I think the default behavior being the keys are not mutated is fine at first. Then, mutating the keys is an added complexity (not regarding the computational aspect, but the scope impact) → Here, I think added verbosity to reflect this increased complexity is actually justified and desirable.

If you read the code without having the knowledge of what the keymodes mean, you might miss the difference between one line making an outside mutation of the keys and another not doing it.
Thus I would opt for a way where the mutation of the keys clearly stands out not only in the syntax content but in the syntax shape, and complete explicitness should be prioritized to verbosity minimisation, such as this

values.sort(keys=keys)
values.sort(keys=keys, sort_keys=True)  # outside mutation should clearly stand out
2 Likes

It is subtle, and fooled me twice along the way :smile:.

When the user passes precomputed keys, xs is in fact irrelevant to what argsort() computes, beyond just its length (to find its index set), The user passed their own keys, and that’s the only input needed to compute the index permutation. In fact, passing xs in any form then is at best just wasted effort. The function is only trying to sort the index set, and the content of xs is irrelevant to that when an explicit list of sort keys is provided.

Certainly not in any version of argsort() I write :wink:. I see no need to have its signature differ in any way from sorted()'s. Changing in any way is just more for the user to learn, and to trip over. “Use it exactly the same way as sorted(), except it returns an index permutation” is ideal.

The code I posted was a POC prototype, intended to work in cases where the arguments make sense. Making it blow up on that senseless input is just a matter of changing the signature to match whatever sorted()'s turns out to be after PEP adoption. Then it would complain about the above for passing too many positional arguments.

There’s also a hole in this:

if func := kws.get('key'):
    kws['keybuffer'] = list(map(func, xs))
    del kws['key']

That would hide the error if passed both key= and keybuffer=.

Adding more error-checking just wasn’t interesting to me at this time. Getting it to work in all non-error cases was.

For the rest, the level of introspection this needs is easy to code no matter what spelling the PEP ends up using.

I quite like that! No appeal to internal mechanisms or data types, just making it explicit whether whatever type of key argument is passed should be sorted too. That’s future-proof and at a high semantic level.

At least for now, passing sort_keys=True would raise an exception unless a list object were passed to key=, and that’s fine.

One imperfection is that it could then be surprising if a user passed an instance of some other kind of mutable sequence type (other than list - like array.array), There’s no conceptual problem with asking for that to be sorted too,

It would be a restriction of the implementation that, for now, only a list could be used.

But much the same is true under any proposal that asks for inplace mutation.

One actual problem that may kill it, though: some objects are both callable and iterable. A key= that accepts both types is then back to guessing the intent, with no way to know. That was also true of some earlier ideas, though.

The keymode and multiple-arguments ideas are explicit about that distinction.

But that’s a quite different function. There is no possibility for numpy’s argsort() to mutate xs, and so neither for my version. If you’re that keen on making “2-for-1” sorting easy to spell, please use a function name other than “argsort”?

You also appear not to care that this version doesn’t allow for the possibility of using all of sorting’s new possibilities. I care a whole lot about that. For example, for a dict d

ixs = argsort(d, keybuffer=list(d.values()))

is a perfectly sensible thing to want in a post-PEP world - because

keys_by_values = sorted(d, keybuffer=list(d.values()))

is a perfectly sensible thing to want. Ditto for keyiter=d.values(), or key=d.get or key=d__getitem__

The new spellings can be unboundedly faster, though, because they avoid all dict lookups, and and both __eq__ and (especially) __hash__ can be unboundedly costlly.

1 Like

A bit of pedagogy: I expect the PEP would benefit by pointing out that

xs.sort(keybuffer=list(map(func, xs)))

would act almost identically to the current

xs.sort(key=func)

Same results, runtime likely the same within noise, and RAM use almost the same too (key= doesn’t actually create a new list object, just the C array of PyObject *, which typically accounts for the vast bulk of the bytes consumed by a list object - the rest is for a list object header of fixed and small size).

Ir’s emphasizing that the PEP is exposing something the (any, really) implementation already does, but at a lower level, in a more flexible way..

I don’t think “2-for-1” sorting will score many points (people who want “sort many” typically want more than just 2, and the change would only make it easier to implement an appropriate, but distinct, approach to “sort many”),

The sort-dict-keys-by-values example will score points, because it can be unboundedly faster than what’s possible today, by eliminating the need to compute the sort keys via arbitrarily expensive dict lookups.

Which is a lie :wink: In fact I’ve done that at times, via materializing list(dict.items()), sorting that on the 2nd tuple element, then extracting the 1st tuple elements. So it’s possible.

But the new abilities give a way to avoid slower and tedious such DSU (decorate-sort-undecorate) dances in favor of a more efficient direct approach; much as adding key= made it possible to avoid many simpler DSU dances.

2 Likes

I think I understand where you are coming from.

However, what I meant is that by keeping equivalent signature with sorted introduces some gaps. I am having a bit of trouble reconciling the range of intents with provided functionality. E.g.:

argsort([2, 1, 3], key=[3, 1, 2], keymode='iter')

I only see 3 distinct intents:

  1. Iterable
  2. Iterable + keyfunc
  3. list - mutated in-place

Distinct variation space of sorted is much larger.

This is why I am not sure that keeping the same signature for the sake of it is the best path given that large proportion of argument permutations do not serve a unique and valid intent.

1 Like

Ok, I see. Now it makes more sense.
Still unsure regarding the above, but I appreciate the design.

I agree, it is good for it to be clearly seen.
The above has been a part of comparison in Allow `sorted` to take list for `key` argument - #107 by dg-pb.
Results strongly suggested that keymode approach achieves most of that while has qualities that arguably make it superior to 2 extra arguments (one of which is a boolean flag).

Would you say:

values.sort(key=keys, keymode='sorted_inplace')

(or some other descriptive string) is sufficient indication?

1 Like

My intent is very simple: argsort(...) returns the index permutation that sorted(...) applies to get its result, whatever that may be, and whatever they replace ... with.

It would return [1, 2, 0], And, in fact, would do so regardless of which 3-element iterable were passed as the first argument.

That’s what they asked for, and I don’t judge them for it :wink:. It would be simpler & faster for them to just call argsort() are their list of keys Which could be explained in the docs, but I neither want to forbid it, nor “optimize” it for them (I expect it will be uncommon).

argsort(), as in numpy’s, is purely functional, so #3 is off the table for me. It’s not intended to be useful on its own, unless you only want a list of rankings. “The smallest element is at ixs[0], second smallest at ixs[1], …, largest at ixs[-1]”. Then it’s an end in itself.

I agree that the “really good” uses do not involve supplying precomputed keys. They may be confused if they do. Or they may not care about optimality and just want the index permutation with as little thought and/or respelling as possible.

I sometimes use an interactive chess app that sets up puzzling positions, and sometimes when I try to drag one of my pieces to a new square, it refuses, saying “there is a better move” despite that it knows I’d still have a winning position. I don’t want to be that app :wink:.

I could add an optional raise_if_suboptima;= boolean flag argument, though. Python doesn’t really have a way to inform users of “merely dubious” things functions detect.

1 Like

Thanks to everybody who contributed so far!

All the input was valuable and I think some good progress has been made.

I think I am ready to pick a favourite.


So the design space of sort specific extension has been thoroughly explored and both:

def keymode_sort(self, /, *, key=None, reverse=False, keymode='keyfunc'):
def keyiter_keybuffer_sort(self, /, *, key=None, keyiter=None, keybuffer=None, reverse=False):

were deemed to be “good enough” solutions.

Conditional that there will be no further extensions in this dimension.
(i.e. different types of key input)

keyiter_keybuffer_sort is easier to use.
However, no one has objected keymode_sort for being “too hard to use”.
Thus, both of them remain valid solutions.


As this local space has been thoroughly explored, I took a step back.
And it seems that this has larger implications than I initially thought.

This might not be and is very likely not limited to sorted.

If this is implemented, this paradigm should ideally be transferable to all other places where this could be of benefit.
Immediate ones are min, max (and there are many more).

Given that this is transferrable (or ideally should be), this relaxes the assumption of “no further extensibility”.
The range of extensions across all current and future places of applicability is hard to evaluate.
While probability of them occurring is not insignificant.
(I even have an extension idea, which I have been processing from POV of min/max for quite a while now.)

This also strengthens user experience aspect.
If keymode argument is met in different places, user experience is not worse than keyiter_keybuffer_sort.
Looking up docstring for keymode supported options becomes the standard and it is not much worse than browsing docstring for meanings of different arguments.
And in fact might be even better as it will be more localised in docstring than separate arguments.


Thus, if it was a design for this specific sorted case, keyiter_keybuffer_sort would likely be slightly better option.
However, this pattern is not specific to sorted.
From my POV, scales are tipped sufficiently for me to favour keymode_sort:

# Applicable to `min`/`max`
sorted(vs, key=keys, keymode='iter')
max(vs, key=keys, keymode='iter')       # NOTE: future work
min(vs, key=keys, keymode='iter')       # NOTE: future work

# sorted specific
sorted(vs, key=keys, keymode='buffer')  # NOTE: string value is still in question

As usual, all feedback is welcome and I am open to further maneuvers if necessary.

2 Likes

And, in fact, that’s exactly what I did in my

ixs = argsort(d, keyiter=d.values())

example. I know darned well that d is irrelevant in this call (apart from its length), but I want it to look as much as possible like

keys = sorted(d, keyiter=d.values())

In fact, it occurs to me now that since argsort() doesn’t even use d in this (or similar) case(s), it’s not even materially faster, or consume less RAM, to give argsort a “more optimal” spelling, like argsort(d.values()) All sorting sees in either case is list(range(len(d))) and list(d.values()).

1 Like

Are you sure? Offhand, I think sorting may be unique in internally materializing a list of keys, which it does because the same sort key may be referenced O(log n) times. In contrast, e.g., the bisect functions recompute each key result when needed (which can become quite apparent in clock time when using bisection many times on a given list of values). And can be applied to far larger sequences than anyone has RAM to materialize a list of key values (e.g., pass it range(2, 1_000_000_000_000, 2)).

Different contexts grow their own idioms. For example, if I want the value in xs corresponding to the position of the largest value in ys,

>>> import random
>>> ys = random.choices(range(100), k=10)
>>> ys
[63, 8, 48, 12, 6, 22, 86, 20, 33, 4]
>>> xs = list(range(200, 190, -1))
>>> xs
[200, 199, 198, 197, 196, 195, 194, 193, 192, 191]
>>> i = max(range(10), key=ys.__getitem__)
>>> i
6
>>> xs[i]
194
>>> ys[i]
86

It’s “the argsort trick” at very small scale :wink:

But, if expanded to these contexts, it seems to me that doesn’t grow the set of possible key modes, but rather reduces it (having a hard time imagining another context in which ‘inplace’ would make good sense).

Worth some thought! And if the set of key modes can expand, then a single new keymode argument does become ever-more attractive.

Since the only place I recall wanting a more flexible key scheme is in bisect, here’s what’s ideal for that:

  • The module search functions work on indexable sequences, and return indices.
  • The key function, if present, is passed the value at an internal index, not the index itself.
  • But it could save a whale of a lot of time when called repeatedly on the same base sequence if it could use a canned table of key values instead.
    • Which would be retrieved by index, not by value.
    • And also be of any indexable type. For example, for large cases I may well be forced to use an array.array to even fit in RAM. Indexing time is usually insignificant compared to the time to compute key values.

So a keymode of “by-index” would be ideal for that, requiring a key= argument of an indexable sequence type. Merely iterable types seem useless in this context.

Hmm. Also of value: “lazy-by-index”. Meaning that if the key sequence has None at the given index, fill the slot with function(base_sequence[i]). And now you need two key arguments. Haha - can’t win :wink:

But, seriously, bisection only makes \log_2{n} key retrievals per call, and precomputing for the entire range of possible indices may be unbearably expensive.

And lest it not be obvious, the implication is that desirable “key modes” may in fact explode across each context’s unique quirks.

1 Like

Yes, unlikely worth it.
But could be.
e.g. n=1000 - with 100 (10%) calls of pre-computed keys it breaks even.

heapq - could be as well.
Then, itertools.groupby.

Also, possible to extend to predicates…

inplace is very likely sort specific.

Would become:

i = max(range(10), key=ys, keymode='iter')
x = xs[i]

# or a bit more conveniently:
x = max(xs, key=ys, keymode='iter')

Where performance percentage-wise would be even greater (in comparison to sorted).
As only one linear scan is done (not O(nlogn)), key application becomes even larger portion of computation:

xs = list(range(100_000))
%timeit min(xs)                            # 1.52 ms
%timeit min(xs, key=lambda x: x)           # 6.18 ms
%timeit min(xs, key=lambda x: x*2)         # 9.14 ms
# If `keyiter` is precomputed I predict:
%timeit min(xs, key=keys, keymode='iter')  # <2.00 ms
1 Like

To be fair, I don’t believe I’ve ever wanted to find the value in one list corresponding to the max/min value in another list of the same length. My key= args to max/min are always “real” functions, so removing call overhead by merely iterating instead wouldn’t be possible.

sortedcontainers.SortedKeyList is another kind of puzzle. The key argument is a function applied to values, not indices. Like bisect does. But, unlike bisect, there are no fixed indices at work. A SortedKeyList can mutate freely, with insetions/deletions at arbitrary positions between calls to its bisect_xxx() methods.

The only clear way I can think of to speed that is for them to change the implementation, to save the key function’s return value in list entries. They don’t now, presumably to save space, and despite that they need to do a bisection on every insert and delete of a value (which is much more common than insert/delete by index in that context - there isn’t even an .append() method, since all flavors of SortedList impose their own order - and there’s no .insert() (by index) method either - you can access and delete by index, though, which makes them great for double-ended priority queues - well, priority deques - which almost nobody ever needs :wink: ).

As far as I remember, they do.
They store keys in similar way to list.sort. python-sortedcontainers/src/sortedcontainers/sortedlist.py at master · grantjenks/python-sortedcontainers · GitHub

Me neither.
Except argmin / argmax where “one list” is range.

1 Like