Allow `sorted` to take list for `key` argument

About the inplace multisort …

I feel several concerns about homogeneity of how the args and kwargs are processed.
I would find it easier to process cognitively with these 2 rules :
-No kwarg is mutated
-All args are mutated

This is consistent with the ‘hidden argsort’ process :

sort_many(*sequences, base:Sequence=None)

First, argsort(base) is computed. Second, sequences are permuted.
To permute the base, just include it within the sequences.
If no base is provided, just use first sequence or whatever.

We don’t have a sort_many() kind of function, and don’t intend to add one {certainly not so as part of this PEP).

So users will roll their own. “The obvious” way to sort B, C, and D the same way as A would be:

B.sort(key=A)
C.sort(key=A)
D.sort(key=A)

Which only works for B if the list A is mutated as a side effect of sorting B. That would leave A in sorted order, so the attempts to sort C and D would have no effect.

So they don’t want A mutated in general. OTOH, if they want only to sort B the same way as A, they do want A mutated.

Now in an “intelligent” approach to sort_many() (which I’ve spelled out in previous messages - one that deduces the cycle structure of sorted(A)'s index permutation just once and applies it, in linear time, and in-place, to all the lists, but one at a time), adopting this PEP would allow the implementation to save substantial RAM and complication, and has no semantic need to avoid mutating the one list it constructs at the start when needed (when sorting A on its own wants to specify a key= function). But it wants mutation anyway to save the expense (RAM and time) of making a useless copy of the temp list it makes.

Few users will reproduce all that, though. They’ll go with the code at the start, and get surprised when it “doesn’t work”. “In the face of ambiguity, refuse the temptation to guess.” It’s not possible to guess what the user intended (mutate or not) merely from that they give a list.

Now if it were plausible that the new functionality would be heavily used by many people, “practicality beats purity” could make a case for pragmatic guesswork.

But that’s not plausible to me. I expect this will always have a niche audience, and used lightly.. To great effect when it is used, but rarely enough that “make it all explicit” will benefit the code’s readers and writers overall.

You went on to quote pieces of messages that weren’t at all about “multisort”, which makes it hard to follow. @dg-pb in particular is only talking about his (conceptually tiny) variation of the current sort when he says “inplace_sort”. That’s at most “2 for 1” sorting, nothing beyond that. In “inplace_sort”, it’s only possible to pass a list, which will be mutated. If you want anything else, you have to make your own list to pass.

That’s fine by me so far as it goes, but it doesn’t go far enough: sorting base may very well want to use key= and/or reverse= keyword args of its own.

Which I have code for now, but never posted. It’s become off-topic in this thread. If you want to continue talking about multisort, that’s fine, but please start a new topic for it.

Except you’re both right :smile: It’s elegant for its minimality. But not so minimal for plausible use cases, where it’s predictable that the user doesn’t have “a list” to pass, and life is easier for them if the implementation constructs one for them (like sorted by order of dict.values()),

In that sense, I’m more on @Stefan2’s side here. Under the covers, the implementation needs “a list” to work on. But why should that internal need be enforced by the user-facing API? For peak speed they need to pass a mutable list, but making that an API requirement goes too far. Some users value convenience more, and I expect that the SC in particular will weight catering to common use cases at least as heavily as exposing a way to achieve peak speed in ideal-for-the-implementation use cases.

2 Likes

FYI, as of now I like the keymode= variant best. It’s as flexible as any, the least novel, and while it’s the most to type, I don’t care about “code golf” here. It’s not really verbose, just a few more characters than alternatives.

And I’m -1 on any way of incorporating the word “list”. That’s just an implementation accident. What sorting really needs under the covers is a PyObject ** (pointer to an array of pointers to Python objects), and it just so happens that the list type is the only obvious way to get such a thing now. No list methods are actually needed, just the C-level pointer to the C array, and the number of pointers in the array. That may change in the future.

For a bad :wink: example, it’s dead easy to get those from a tuple object today. But tuples are immutable, so unsuitable for ‘inplace’ use.

The keymode= string should stick to a higher-level view, like ‘call’ vs ‘inplace’ vs ‘iiter/copy’. Version-specific doc notes can spell out which specific types can be used for ‘inplace’ mode (initially only list).

Premature depression. The 1-liner can still work, but by adding 6 lines before it :wink:

def argsort(xs, **kws):
    if 'key' in kws:
        def newfunc(i, oldfunc=kws['key']):
            return oldfunc(xs[i])
        kws['key'] = newfunc
    else:
        kws['key'] = xs.__getitem__
    return sorted(range(len(xs)), **kws)

That’s today, and assumes that key= must be callable.

But after the PEP is accepted, it would need to be changed to leave key= (if any) alone unless keymode is ‘call’ (or not given at all, and so defaults to ‘call’)..

Because when we pass precomputed keys in, the range(len(xs)) argument is purely passive - it’s just going along for the ride as the keys get sorted. When ‘call’ is in effect, though, it plays an active role in computing the keys first.

That’s the good news. The bad news is that this really can’t be used as an example of where the new abilities would pay off.. There’s no longer any memory bloat, or extra expense, for building a temp list of decorated tuples.

I agree with @tim.one . mixed_sort is appropriate candidate for final solution. But I simply argue that if final solution does not satisfy constraints to large degree, it might be much more elegant to keep the raw and simple solution which does the straight minimum, which exposes everything at the cost to the user (who doesn’t get it perfect in mixed_sort anyway).

If we were to choose among these 2, then mixed_sort is most likely better. But this was a counter argument against mixed_sort versus 2 current contenders, not an argument for inplace_sort.

def argsort(xs, key=None, reverse=False):
    if key is not None:
        xs = map(key, xs)
    xs = list(xs)

    return sorted(range(len(xs)), keylist=xs)
    # or:
    return sorted(range(len(xs)), key=xs, keytype='inplace')

I would like to double check the situation.

While @tim.one seems to be the big proponent of keymode solution, I am still doing my best to stay balanced, as I still see them both potentially valid depending on what turns out to be more important for the community.

Maybe there is no battle anymore and I am wasting my energy.
So I just want to get some input to see where we are.


So we have 2 solutions.

  • Both are robust and satisfy constraints to sufficient degree.
  • Both can be handled by typing.
  • Both are similarly easy to learn.

def keymode_sort(self, /, *, 
                 key=None,
                 keymode='call',
                 reverse=False):
  1. Clear precedent in Python. open(mode)
  2. 1 new argument
  3. Needs 2 arguments to make use of new functionality

def keyiter_keylist_sort(self, /, *,
                         key=None,
                         keyiter=None,
                         keylist=None,
                         reverse=False):
  1. Precedent in Python is arguably a bit rarer (although coincidentally same open has similar aspects)
  2. 2 new arguments
  3. Needs 1 argument to make use of new functionality

Argsort with both:

# keymode
indices = sorted(range(len(keys)), key=keys, keymode='iter')

# keyiter_keylist
indices = sorted(range(len(keys)), keyiter=keys)

Accidental mutation:

  1. keymode - no risk
  2. keyiter_keylist - risk well within “consenting adults” range

keymode is a safe option.
However, keyiter_keylist has non-trivial advantages for user experience:

Discoverability is better.
keyiter argument can be guessed easily and unambiguously from signature.
For keylist usage, one would still need to read documentation.
In contrast to keymode, where nothing can be inferred from signature.

IDE / ipython suggestions are much better:

sorted(y, k<tab>
    key=
    keyiter=
    keylist=

Both in signature and from POV of alphabetic priority, keyiter gets precedence over keylist due to recognition that this is more common.


Finally, let’s recap the journey that led us here:

keymode keyiter_keylist
@pf_moore has raised concerns about accidental mutations. I have taken them seriously - neiher of these 2 final contenders have this issue. :white_check_mark: No accidental mutation risk. :white_check_mark: Accidental mutation risk is minimal: keylist requires explicit parameter choice and is unlikely to be used without reading docs. Well within “consenting adults” territory.
@Stefan2 has pointed out the importance of user experience - frequent case (as seen in stack) is when the user just wants to source any iterable without the need for conversion and get the answer safely and easily. :warning: sorted(range(len(keys)), key=keys, keymode='iter') I would guess is a bit too verbose compared to what @Stefan2 is advocating for? :white_check_mark: sorted(range(len(keys)), keyiter=keys) is as good as it can get for those who just want to source any iterable without needing to think about inplace behaviours and other undesirable nuances.
@tim.one has stressed the importance of keeping equivalence between sorted and list.sort signatures and behaviour. :white_check_mark: Equivalence between sorted and list.sort is kept. :white_check_mark: Equivalence between sorted and list.sort is kept.
My concern is to keep unambiguous possibiliy to provide a list which gets mutated in-place - this has non-trivial performance benefits in many scenarios (including argsort, dict_sort, sort_many). :white_check_mark: All performance benefits that were initially intended are exposed. :white_check_mark: All performance benefits that were initially intended are exposed.

  1. @pf_moore, in your opinion, does keyiter_keylist sufficiently manages accidental mutation risk?
  2. @Stefan2, does keymode provide sufficiently good user experience?
  3. @tim.one, how do advantages of keyiter_keylist (the IDE discoverability coupled with 1 less argument on use) stand against advantages of keymode (1 less argument in argspec and a bit clearer precedence in Python) for you?

I believe both solutions are viable and would be happy to implement either.
I’m seeking clarity on which tradeoffs the community finds more acceptable.

1 Like

I’m unclear on what this is intended to show. My “after the PEP” means that argsort() would have to change to learn about sort() arguments the user wants to use to sort xs that don’t even exist today.

But you’re ignoring the new “after the PEP” possibilities for what the user may want, catering only to argument possibilities that already exist. And while you allow them to pass reverse to argsort(), the code ignores it :wink:.

My argsort() intends to change over time to allow users to specify anything whatsoever sort() itself offers.

However, I too overlooked that, after the PEP., I’ll have to go back to a longer-winded “sort decorated tuples” dance, because the user may want to pass their own precomputed keys, and my (and your) use of key= for argsort()s internal purpose kneecaps that possibility.

1 Like

I have a mild preference here. Best would be to key on type (e.g., zipfile.ZipFile()'s file argument means different things depending on whether it’s “a string, a file-like object or a path-like object” - that’s Pythonic),.But lists don’t come in “mutable and immutable” subtypes, and it would be crazy to try to invent such a distinction for the tiny purpose here.

Between keymode_sort and keyiter_keylist_sort I see very little difference, after we seem to have agreed that, no, the list of new arguments in the latter won’t grow over time.

However, I have come to object to anything with “list” in its name (as in keylist=None), because the universe of ways to pass an array of PyObject ** may expand over time. It won’t be “list vs not list” then, it will be “copy or mutate”. Keep the names agnostic wrt specific data types, please. They’re just accidents of the current implementation. Long-lived accidents, but for no fundamental technical reasons.

If you want to immortalize implementation accidents :wink. change the keymode string to 'list". And if that sounds silly, the name keylist= is ill-advised to me for the same reasons. It’s not really about list-or-not-list.

I like the discoverability, but that one gets listed before the other is really a stretch. Change keylist= to, say, keycopy=, and the alphabetic order changes. I give this 0 weight, and especially because only 3 possibilities are shown.

My concern is to keep unambiguous possibiliy to provide a list which
gets mutated in-place

And I’m 100% with you on that.

Mainly now that keymode never used the word “list” anywhere :-wink:. Change that & I’m happy. Happy either way, but do indeed like that auto-complete is more helpful with distinct argument names.

1 Like

I think I was the first to suggest mixed_sort, but I’ve since disowned it. We simply cannot guess whether a user does or does not want a list of precomputed args to mutate in place, and “read the docs” isn’t an adequate counter. People blunder all the time, but forcing some way to be explicit about intent reduces the frequency of resulting damage - for code writers and code readers (who may well have never even seen the relevant docs).

We don’t need heavy verbosity to be explilcit about intent. Using different argument names, or different short “mode” strings, is siufficient.

I think suffix is ideally a noun, otherwise it doesn’t read with singular key prefix. e.g. keyinplace reads awkwardly.

Changing it to keysinplace is possible, but I would much rather keep these 3 in sync with singular prefix.

It should suggest that it is a Sequence.

And should signal mutability.

Asked AI to throw some options and there is one that I think could be a good fit keybuffer.

  • Is commonly used to indicate an array which will be mutated. Common in sequence algorithms (sorting included).
  • Forces less experienced programmers to read the docs as some amount of experience is necessary to make a guess with enough certainty.

Agreed. :slight_smile:

sorted(y, k<tab>
    key=
    keybuffer=
    keyiter=

I discard that one and make a new stretch.
Argument name length represents use frequency.
Reason why I like keybuffer more than keybuf shorthand.

Also, if the choice is keymode, I think better string for default mode would be keyfunc.
It will never be typed, while I think has more encoded information and is more to the point.
Also, in line with how many people think about it already. e.g. from argument clinic of list.sort:

list.sort

    *
    key as keyfunc: object = None
    reverse: bool = False

Huh! I never thought of “iter” as a noun before, but “keyiter” is perfect anyway. You think of it as short for “iterable”? I thought of it more as operational: we’ll apply (some equivalent to) iter() to the argument.

My chatbot also suggested that one, and I like it too. And it adds:

Fine by me! And it’s less a stretch than caring about the order - but, ya, a stretch all the same :rofl:.

1 Like

+1. OK, Discourse insists on more characters or it won’t post it. Is this enough?

EDIT: yes, that was more than enough.

1 Like

No, I won’t :smile: If they pass their own precomputed keys, the original list is irrelevant to what argsort() computes. argsort() can pass just range(len(xs)) as the first argument, and it will be sorted directly in the same order as the precomputed keys.

The only “tricky case” here is working with a user-supplied key function, and there are multiple easy-enough ways to get the same effect.

So argsort() will be in the business of analyzing the keyword arguments to see which variant is needed. Which is another possible basis on which to judge different APIs.

A keymode argument is a natural fit to the match/case statement - or a dict mapping the mode string to an appropriate helper function.

Dealing with multiple mutually exclusive arguments seems bound to be more ad hoc.

But in this particular application, the only real questions are:

  • Did they pass a non-None key function?
    • If so, we have to tweak it.
    • If not, did they pass any kind of precomputed keys?
      • If so, there’s nothing to do.
      • If not, inject our own key=xs.__getitem__ into kws.

and they’re trivial to answer regardless of which API is picked..

While I can’t run this yet, and there may even be typos, here’s what I think is at least very close to what’s needed for a post-PEP argsort() allowing users to pass anything they can pass to sort. And also using the PEP’s new features for simplicity and speed. This is assuming the multiple-new-arguments flavor:

def argsort(xs, **kws):
    if func := kws.get('key'):
        kws['keybuffer'] = list(map(func, xs))
        del kws['key']
    elif kws.get('keybuffer') or kws.get('keyiter'):
        pass
    else:
        # numpy's argsort doesn't mutate the array;
        # making a copy is bound to be quicker than
        # passing key=xs.__getitem__, or a keyiter
        # argument iterating one at a time
        kws['keybuffer'] = xs.copy()
    return sorted(range(len(xs)), **kws)
1 Like

Not sure about this branch.
In essence, xs are the keys, so this would be providing extra set of keys and discarding xs (only using it for len).

1 Like

I was originally in favour of having key=, keybuffer=, keyiter= (or similar); but the verbosity of this makes me reconsider.

I think the direct equivalent of your function with keymode sort would be (unless I’ve missed something):

def argsort(xs, key, keymode = None, **kwds):
    if keymode != 'iter':
        if keymode == 'call':
            key = list(map(key, xs))
        elif keymode != 'buffer':
            # numpy's argsort doesn't mutate the array;
            # making a copy is bound to be quicker than
            # passing key=xs.__getitem__, or a keyiter
            # argument iterating one at a time
            key = xs.copy()
        keymode = 'buffer'
    return sorted(range(len(xs)), key=key, keymode=keymode, **kwds)

Feels simpler to me? But maybe the additional verbosity of the separate arguments is a good thing in normal user code (i.e. when you’re not trying to directly wrap the signature of sorted()).

2 Likes