Adding random.shuffled to the random module (renamed thread)

Gouvernathor · September 3, 2022, 6:45pm

Sorting an iterable can be done two ways in python : using list.sort, or sorted. The first one requires a list (or a subtype of list) and does the sorting in-place, the second takes any iterable (including non-sequences) and returns a new sequence.

Randomly shuffling a number of values, however, is a bit more complex. The random.shuffle function offers the equivalent of list.sort, but there’s no clear equivalent to sorted.
random.choices is not one, because it does a random pick with replacement, which is not what we’re looking for. The most similar function is random.sample, but with two main differences.

The first difference is that random.sample (just like other random functions) don’t take just any iterable, but only sequences, for reliability reasons I’m not going to talk about, that’s not the point and it’s unavoidable anyway.

The second difference is that random.sample has a second required argument, documented as k, and representing the number of elements being sampled from the original sequence. My question is simple : why not make that value be len(population), population being the sequence we’re sampling from ?
It would make a copy-shuffle (as opposed to the in-place shuffle done by random.shuffle) just as simple as one function call. It would also allow more use of random.sample in a functional programming fashion, by avoiding the need to create a variable containing the population variable prior to calling the function.

So, the signature would be random.sample(population, k=None, *, counts=None), and just as counts currently defaults to [1]*len(population), k would default to len(population).

iritkatriel · September 3, 2022, 7:33pm

I’m finding it difficult to parse this post. I think it would help if you open with the problem you’re trying to solve, then propose the solution.

Is sorting even relevant here?

pf_moore · September 3, 2022, 7:42pm

So basically you’re saying “let’s give k a default of len(population)”? I don’t see much benefit, but sure, if you think it’s worth it, feel free to create an issue and a PR. I don’t think there’s much debate needed here - a PR will either get accepted or not.

Gouvernathor · September 3, 2022, 8:41pm

I think it is, because shuffling a sequence is a bit like sorting it, only randomly. In a lot of contexts, in my experience, there’s a kind of symmetry between the two operations.

Anyway, I’ll PR that, as advised.

Gouvernathor · September 4, 2022, 12:13am

@rhettinger given your decision on the issue tracker : I disagree about the comparison with choices, because one is with replacement, the other is not, and the default is adapted from the purpose of the function. But I agree that it’s at best a wider reinterpretation of what the function does, and not really matching its name.
Seeing the function’s implementation, I also realized it’s not very much adapted to the way it works internally, aka it wouldn’t be the best way to implement a function that only or mainly shuffles an entire sequence.

So, what would you think of @pochmann’s proposal, a separate random.shuffled ?

rhettinger · September 4, 2022, 1:16am

It would be reasonable if we needed this often and it was considered too burdensome to write k=len(pop) with sample() and if the typical use case required making a full copy of the input rather than an in-place sort.

In my world (mostly teaching, consulting, and code review), none of those three are true. In your world, do you see commonly cases where all three are true?

Gouvernathor · September 4, 2022, 1:26am

Yep. I think I never used sample with anything else than len(partition) as the parameter, and in some instances it was so annoying that I ended up creating a wrapper function to calculate the len and call random.sample inside (by the way that’s not a good solution since it prevents from using random’s bookkeeping functions).
Writing k=len(pop) is only simple when you already have pop laying around, which was rarely the case in my experience.
A basic use case is when you need to show a series of options (in a discord game bot in my case, if you want to know), which is taken from a more indirect sequence origin such as a concatenation of several iterables, or something returned by filter or map, or used to iterate from in a comprehension.

But I think your first and third objections are answered by the existence of sorted in addition to str.sort anyway.

rhettinger · September 4, 2022, 1:29am

The use cases for sorting are profoundly different than for shuffling.

Gouvernathor · September 4, 2022, 1:34am

Possibly in some measure, but they also have much in common : they (re)order things, and that can be done in-place but should also be possible by copying the sequence; and in general these different orderings can have a similar use case.
There are also contexts when the two are used in proximity : when you want to sort winners by order of some score, without getting influenced by any previous sorting, you do sorted(random.shuffled(scores, len(scores)), key=scores.get).

rhettinger · September 4, 2022, 2:04am

I’ve never seen code like that, ever.

pochmann · September 4, 2022, 2:18am

@rhettinger I would use it often. My frequent use case would be for solution in shuffled(solutions):, which I do for benchmarking competing solutions. I dislike doing shuffle(solutions) beforehand, mostly because I really do mean shuffled iteration, and for solution in shuffled(solutions): expresses that much more clearly.

Another case from a few days ago:

tmp = parts[1::2]
shuffle(tmp)
parts[1::2] = tmp

I’d much prefer:

parts[1::2] = shuffled(parts[1::2])

which reads much more nicely (like one operation), doesn’t leave that tmp list lying around, and doesn’t push two other lines of code out of view.

Note I will never use sample in such situations. Because I don’t want to sample. I want to shuffle. As I argued on GitHub, I don’t think sampling generally includes shuffling. I’d find it abusive/misleading to use random.sample just because it happens to also shuffle (and I don’t like that the shuffle doc suggests doing that). Also, yes, I would loathe having to explicitly write the length. Which in the above case with len(parts[1::2]) would even make another copy of the slice.

Btw, I see that random.sample doesn’t optimize this “all” case, so it’s slower and takes much more memory than creating a list and calling random.shuffle on it would.

@Gouvernathor I wouldn’t necessarily call my comment a proposal… random.shuffled has been proposed and rejected a few times already.

Gouvernathor · September 4, 2022, 3:10am

Okay, then I guess we don’t work on the same kind of things. What’s your point ?
Going back to the matter at hand, I think I answered all the concerns you expressed, including the three you gave at the beginning, so, do you have any others ?

I think there could even be an implementation of shuffled() that’s faster and lighter in memory than duplicating the list then shuffling it : the branch of random.sample where counts are given. But I’m not 100% sure, to be checked. There is plenty of choice anyway.

steven.daprano · September 4, 2022, 8:07am

I don’t understand what you mean by “random’s bookkeeping functions”.

Could you do this?

import random

class MyRandom(random.Random):
    def sample(self, population, k=None, *, counts=None):
        if k is None: k = len(population)
        return super().sample(population, k, counts=counts)

Now instances of MyRandom have the full set of random methods, including your sample. (You can even add your own shuffled method as well.)

Personally, given the name “sample” (singular), I think that a default of k=1 makes more sense than defaulting to len(population), but I don’t feel strongly about this. If there is a consensus that defaulting to the length of the population is useful and sensible, I’d go along with it.

Hmmm… thinking as I write… wouldn’t k=1 be semantically identical to choice()? So maybe despite the name being singular, a default of k=len(population) makes sense, especially since counts has a default.

As far as shuffled goes:

on the plus side, I find myself nearly always wanting a shuffled copy of the list, or at least I am indifferent to whether it is a copy or not. If shuffled() existed, I would probably use it in preference to shuffle() almost always.
on the negative, that’s one more function to learn and maintain, and one more feature for beginners to get confused over.

Personally even when I would prefer shuffled() over in-place shuffle, I haven’t preferred it enough to bother writing a helper function.

So call me +0 on both these proposals: neutral leaning towards slightly in favour, but not enough to care one way or another.

(I reserve the right to change my mind in the future, according to my mood and/or experience.)

steven.daprano · September 4, 2022, 8:43am

Hmmm, well, no you certainly don’t do that, because random.shuffled doesn’t exist, and the signatures don’t match. Did you mean sample?

What’s scores? With a get method, is it a dict? But you’re treating it as a list or sequence in the call to shuffled. So I can’t work out what your code actually does and have to guess.

I think this use-case (sort without any influence from existing order in the event of ties) is very niche. I can’t really imagine many cases where it would make much of a difference, and being so niche it’s okay if you have to write your own function for it:

def sort_winners(scores):
    """Return a sorted list of players from input {player: score}."""
    winners = list(scores.keys())
    random.shuffle(winners)
    winners.sort(key=lambda x: scores.get(x))
    return winners

This gives you a nice function that you can test and ensure that it actually does what you intend, instead of hoping that it works.

Your use-case still leaves me neutral on the requested functionality.

Gouvernathor · September 4, 2022, 11:13am

That was my thoughts exactly. Moreover, the k=1 is a valid default for a choice with replacement, where the number of picks is a bit arbitrary or at the very least has no particular reason to scale up with the number of possible values.
Whereas sample’s without-replacement picking has a natural max value which is the length of the sequence, and that max value makes it identical in purpose with a shuffled (its name and implementation notwithstanding).

As for the maintenance cost argument, I mean… if it’s a three-liner function that either copies the sequence, shuffles it and return it, or calls sample with k=len(seq), then it’s just relying on the implementation of the other methods.

Yep I did a little mixup between the two functions… and this was primarily 3.9- python where passing dicts was ok. But anyway.
I do think that the ability to do sorted(shuffled(list(scores)), key=scores.get), with shuffled meaning either a new random.shuffled or a random.sample with a default k, is very handy and that having to create a function for each such case is bothersome.
It all comes down to functional programmin as a paradigm : only having functions which either work by side-effect like shuffle, or take redundant arguments making them bothersome to chain like sample, impedes that paradigm. My (initially bogus) example is just one example of chaining these functions, if I want to chain them in different ways at different points in my code, I either need to code a function like your example for each possible chaining, or come up with one in-and-out wrapper (what sorted is to str.sort) for each function.

That’s true. I hadn’t thought about that. I do think that having these function would be good, and reasonably cheap on the maintaining side, and would save devs the need to come up with such designs. But that’s one point in favor of “workarounds can be easy to make”.

By the way, I just thought about it, because before I never made a use of the counts arguments, but in fact it would be better if k defaulted to sum(counts), which sort of defaults to len(seq)… right, know what I mean ?

pf_moore · September 4, 2022, 12:09pm

Personally, I find that example baffling as a one-liner and I’d much prefer to see it written as a standalone function, with a meaningful name and documentation explaining what it’s trying to achieve. I’d also add tests because the logic seems tricky to me.

And of course as a separate function, a bit of verbosity having to specify k is less problematic

… and this demonstrates why working out what the default value should be is not as simple as it first seems.

Anyway, you raised a PR, and apparently it got rejected. Why are we back having the discussion here? It feels like you intend to just keep arguing until you get your way, which is not how these things work. The issue was closed by the core dev responsible for this code, with a clear explanation of the reasoning. There was no suggestion that further discussion would make any difference, so frankly it seems like you’re wasting people’s time here by prolonging the debate.

Gouvernathor · September 4, 2022, 12:32pm

Because we’re considering another proposition to make it a separate function, which would resolve some of the problems the first proposition had (we can always rename the thread to follow this, if it’s possible). From what Steven says there are pros and cons which haven’t been resolved either way, and the reservations rhettinger had about the separate function were all answered, so why should our version be the one leaving the discussion ?

Exactly, that’s what makes discussion so important. I’m not saying my initial proposition was perfect, I’m saying it shouldn’t be dismissed.

But if that particular chaining is only used once, you start littering your namespace with single-use functions which are easy to mixup with one another. I can have shuffled(filter(None, itrbl)) some lines later for example, or even need to use map.
I guess if you find it baffling that means functional programming is not your thing and defining single-use functions is. That’s fine, to each their own, but Stefan and I are showing you there is a use for what we’re proposing.

iritkatriel · September 4, 2022, 12:38pm

You could prove that by implementing it and publishing on PYPI. Merging code into the python stdlib is not the first step to thinking something through.

Gouvernathor · September 4, 2022, 12:46pm

I myself never use PYPI. Does the basic Python dev know how to look in there for the specific thing they want ? At the risk of spending more time parsing through a giant catalogue than it would take coding it… with possibly not less bugs ? And who would download a whole package for just one or two functions ?
I’m very doubtful about that use of PYPI as a kind of metric for Python enhancement ideas’ popularity, or as a snippet storage.

Although obviously I agree that we whouldn’t merge anything immediately, before stripping the idea apart and discussing all its merits.

pf_moore · September 4, 2022, 1:19pm

I’m very comfortable with functional programming, thank you very much. But even in Haskell or similar, I’d name complex bits of logic. I’ve said this elsewhere when you got insistent that one-use functions were somehow bad, I’d happily name a complex part of an expression, why should that only apply to the parts that aren’t functions?

I guess if you’d rather write complex one-liners that need careful thought to understand, then that’s fine, to each their own. But I’d rather make things readable. And more importantly, Python has a tradition of favouring readability over terseness.