What is the purpose of the third paragraph in this description of random.shuffle()?
Shuffle the sequence x in place.
To shuffle an immutable sequence and return a new shuffled list, use sample(x, k=len(x)) instead.
Note that even for small len(x), the total number of permutations of x can quickly grow larger than the period of most random number generators. This implies that most permutations of a long sequence can never be generated. For example, a sequence of length 2080 is the largest that can fit within the period of the Mersenne Twister random number generator.
It seems like this better belongs with itertools.permutations.
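For reference, a minimal sketch of the two calls the quoted docs contrast (the list contents here are just placeholders):

```python
import random

cards = ["spam", "eggs", "sausage", "ham"]

# shuffle() reorders the list in place and returns None.
random.shuffle(cards)

# sample(x, k=len(x)) leaves the original untouched and returns a new
# shuffled list, which also works for immutable sequences like tuples.
original = ("spam", "eggs", "sausage", "ham")
shuffled_copy = random.sample(original, k=len(original))
print(shuffled_copy)
```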
I assume it’s meant as a cautionary tip, as in “you won’t be able to exhaustively permute a big list with this function, and it’ll take forever to try”.
It might be even more relevant to itertools.permutations as it’s easy to make an (effectively infinite) generator that way. Perhaps a cross-reference would make sense.
Maybe not “needless”, but some rewording might help, to better indicate that shuffling a large sequence may not be able to select from among all possible permutations, due to limitations in the RNG.
Yeah thinking about it more, it’s slightly different from how the same warning applies to itertools. Itertools can generate any permutation from a sequence, it’ll just take more time than you have available[1]. Whereas the PRNG literally cannot sample uniformly from the space of all permutations, because the space grows so large that it exceeds the number of PRNG states.
So would sequence.sort(key=lambda _: random.random()) offer a more complete set of possible results? Or is it just the same underlying constraint, just coming from a different angle?
I’m probably not the person to ask, I’ve changed my mind a couple times writing this
I think it might be the same problem. You’re basically generating a list of floats between 0 and 1, equal in length to your list, and sorting by them. You’re hoping that the new order is some novel permutation of the sequence. But the RNG can only generate so many distinct sequences of floats before it starts to repeat… I don’t think you can generate all the permutations you need.
Shuffling is, ultimately, the selection of one random permutation. If you want to shuffle the list ["spam", "eggs", "sausage", "ham"] there are 4! = 24 possible permutations, and you would expect each one to have 1/24 chance of being selected. But if you had a longer sequence, there would be less certainty of this.
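To make that concrete, here’s a rough sketch that tallies how often each of the 24 permutations of a 4-element list turns up (the trial count is arbitrary):

```python
import random
from collections import Counter

items = ["spam", "eggs", "sausage", "ham"]
counts = Counter()

# Shuffle many times and tally which of the 4! = 24 permutations came up.
for _ in range(240_000):
    deck = items[:]            # copy so the original order is preserved
    random.shuffle(deck)
    counts[tuple(deck)] += 1

# With a fair shuffle, each permutation should appear roughly
# 240_000 / 24 = 10_000 times.
for perm, n in counts.most_common():
    print(perm, n)
```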
Why does this matter? Because a shuffled list is expected to have certain statistical properties, all of which are guaranteed if it’s truly selecting a permutation with equal probability, and not guaranteed if it isn’t. For example:
Every element is equally likely to be the first one listed.
For any pair of values in the list, they have equal probability of being in either order. (“spam” is as likely to come before “eggs” as “eggs” is to come before “spam”.)
Every pair of adjacent elements is equally likely to occur.
If you slice the first N elements from a shuffled list, they are a random selection of N elements from the original list.
You could try some of these expectations out using a list of 10,000 numbers and see how well it holds up. Maybe it won’t. Maybe it will. But it’s not guaranteed.
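For instance, a quick sketch of checking the first property (every element equally likely to end up first), using a smaller list so the tally stays readable:

```python
import random
from collections import Counter

population = list(range(10))
first_element = Counter()

# Tally which element lands in position 0 after each shuffle.
for _ in range(100_000):
    deck = population[:]
    random.shuffle(deck)
    first_element[deck[0]] += 1

# Each of the 10 values should land first roughly 10,000 times.
for value, n in sorted(first_element.items()):
    print(value, n)
```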
IMO this is relevant in the place it’s currently written.
It’s a very topical warning: when trying to find an unknown permutation with specific properties, you might be tempted to try all of them in random order. The warning says that isn’t feasible, as most RNGs will never be able to produce all permutations.
There is no theoretical limit on what itertools.permutations() can do. It produces all permutations “in theory”, although the universe may end before it finishes.
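A sketch of that contrast: itertools.permutations() will lazily hand you permutation after permutation, it just never gets anywhere near the end for a long input.

```python
import itertools
import math

seq = range(20)

# permutations() is a lazy iterator, so asking for a few results is cheap...
gen = itertools.permutations(seq)
for _ in range(3):
    print(next(gen))

# ...but exhausting it would mean visiting 20! results.
print(math.factorial(20))   # 2432902008176640000
```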
Shuffling is trying to produce a single permutation, and with each possible permutation equally likely. It cannot do that if the list is “too large”.
I agree the 3rd paragraph is of marginal value now. It was originally written a long time ago, when Python used the Wichmann–Hill generator, with a far smaller period (~6.95e12). That was so small that even most permutations of a deck of cards weren’t possible results. In fact, some permutations of list(range(16)) were out of reach then. This could lead to visibly biased outcomes even in “toy” programs.
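The arithmetic behind that history, as a quick sketch (6.95e12 being the approximate Wichmann–Hill period mentioned above):

```python
import math

wichmann_hill_period = 6.95e12   # approximate period of the old generator

# A deck of cards has 52! orderings, vastly more than that period.
print(math.factorial(52) > wichmann_hill_period)   # True: 52! is about 8.07e67

# Even 16 items were already out of reach: 16! is about 2.09e13.
print(math.factorial(16) > wichmann_hill_period)   # True

# 15! (about 1.31e12) still fit.
print(math.factorial(15) > wichmann_hill_period)   # False
```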
The warning could probably be made less academic if it just said that Python uses a randomness source that is not guaranteed to be appropriate when an even distribution over all possible outcomes is required.
There are endless recipes for shuffling a simulated deck of cards using an API that provides verifiable random noise, along with a way for clients to verify the hand of cards after the fact, so clearly this is a problem that those who need to handle it already know they need to. As long as the warning remains visible to those who may not know they need more than Python provides, and prompts them to think about it, I think it’s fine.
Nope. len(sequence) floats are generated as keys at the start, and it’s again the case that the internal state of the PRNG at the start of this wholly determines the outcome.
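A quick sketch of that point: with the PRNG put into a known state, the sort-by-random-key trick produces exactly the same “shuffle” every time, so the PRNG state on entry really does determine the whole outcome.

```python
import random

sequence = list(range(10))

def sort_shuffle(seq):
    # One random float per element is drawn as a sort key, so the result
    # is fully determined by the PRNG state at the moment of the call.
    return sorted(seq, key=lambda _: random.random())

random.seed(12345)
first = sort_shuffle(sequence)

random.seed(12345)
second = sort_shuffle(sequence)

print(first == second)   # True: same PRNG state, same "random" order
```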
Yes, and the random documentation does state that the module uses a pseudo-random generator. For this paragraph, I’m fine with taking it out, leaving it as is, or maybe rewording it to be a bit clearer and more concise:
“Note that a sequence of length n will have n! possible permutations, which may be greater than the period of the pseudo-random number generator. This means that some permutations may be impossible for random.shuffle() to return. If true randomness is required, use the random.SystemRandom class.”
Or something like that: it explains the facts, the implication, and points towards a solution. It also doesn’t tie the warning down to any specifics like naming the PRNG used. I’m assuming that a cryptographer, mathematician, or anyone who’s working with number spaces this big and needs true randomness would be aware of the issues with using a PRNG in the first place. (But maybe this is a bad assumption to make?)
As @eendebakpt pointed out earlier, there’s an enormous amount of discussion about this already here.
Fine by me if the 3rd paragraph were eliminated, as already covered there.
Short of that, it’s a decent balance between saying too much and too little.
Naming the specific PRNG used is important for those aware of the issues, because the period length of the specific PRNG used is crucial. For example, there’s no real reason to be concerned about sequences of length 40 today, under the Mersenne Twister. There were major reasons to be concerned about such sequences under the earlier Wichmann–Hill generator.
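For the record, a sketch of where the crossover sits for the Mersenne Twister (period 2**19937 - 1), which is where the “length 2080” figure in the quoted docs comes from:

```python
import math

mt_period = 2 ** 19937 - 1   # period of the Mersenne Twister

# 40! is nowhere near the period, so short sequences are not a concern.
print(math.factorial(40) < mt_period)   # True

# Find the smallest n whose n! exceeds the period.
n, fact = 1, 1
while fact <= mt_period:
    n += 1
    fact *= n
print(n)   # 2081, so 2080 is the largest length that still fits
```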
For those very concerned, using random.SystemRandom().shuffle() is worth considering, but Python has nothing to do with the platform’s implementation of the underlying system urandom(), so it can’t promise anything about the quality of results. If it’s “just” a crypto-strength PRNG, nothing material about the period-length problem changes. If it mixes in “real entropy” at times, that’s much better in that respect, but Python has no say in whether it does, the rate at which it does, or the quality of its entropy sources.
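A minimal sketch of that alternative, with the caveats above still applying since the entropy ultimately comes from the platform’s urandom():

```python
import random

# SystemRandom draws from os.urandom() instead of the Mersenne Twister,
# so it is not limited by the MT period. Python makes no promises about
# the quality of the platform's underlying randomness source, though.
sysrand = random.SystemRandom()

deck = list(range(52))
sysrand.shuffle(deck)
print(deck)
```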
The extreme length of the referenced issue reflects how very much there is to know, for those who really care.