What is the purpose of the third paragraph in this description of random.shuffle()?
Shuffle the sequence x in place.
To shuffle an immutable sequence and return a new shuffled list, use sample(x, k=len(x)) instead.
Note that even for small len(x), the total number of permutations of x can quickly grow larger than the period of most random number generators. This implies that most permutations of a long sequence can never be generated. For example, a sequence of length 2080 is the largest that can fit within the period of the Mersenne Twister random number generator.
It seems like this better belongs with itertools.permutations.
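For reference, a minimal sketch of the two calls the quoted docs contrast (the list contents here are just placeholders):

```python
import random

cards = ["spam", "eggs", "sausage", "ham"]

# shuffle() reorders the list in place and returns None.
random.shuffle(cards)

# sample(x, k=len(x)) leaves the original untouched and returns a new
# shuffled list, which also works for immutable sequences like tuples.
original = ("spam", "eggs", "sausage", "ham")
shuffled_copy = random.sample(original, k=len(original))
print(shuffled_copy)
```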
I assume it’s meant as a cautionary tip, as in “you won’t be able to exhaustively permute a big list with this function, and it’ll take forever to try”.
It might be even more relevant to itertools.permutations as it’s easy to make an (effectively infinite) generator that way. Perhaps a cross-reference would make sense.
Maybe not “needless”, but some rewording might help, to better indicate that shuffling a large sequence may not be able to select from among all possible permutations, due to limitations in the RNG.
Yeah thinking about it more, it’s slightly different from how the same warning applies to itertools. Itertools can generate any permutation from a sequence, it’ll just take more time than you have available[1]. Whereas the PRNG literally cannot sample uniformly from the space of all permutations, because the space grows so large that it exceeds the number of PRNG states.
So would sequence.sort(key=lambda _: random.random()) offer a more complete set of possible results? Or is it just the same underlying constraint, just coming from a different angle?
I’m probably not the person to ask, I’ve changed my mind a couple times writing this
I think it might be the same problem. You’re basically generating a list of floats between 0 and 1, equal in length to your list, and sorting by them. You’re hoping that the new order is some novel permutation of the sequence. But the RNG can only generate so many distinct sequences of floats before it starts to repeat… I don’t think you can generate all the permutations you need.
Shuffling is, ultimately, the selection of one random permutation. If you want to shuffle the list ["spam", "eggs", "sausage", "ham"] there are 4! = 24 possible permutations, and you would expect each one to have 1/24 chance of being selected. But if you had a longer sequence, there would be less certainty of this.
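To make that concrete, here’s a rough sketch that tallies how often each of the 24 permutations of a 4-element list turns up (the trial count is arbitrary):

```python
import random
from collections import Counter

items = ["spam", "eggs", "sausage", "ham"]
counts = Counter()

# Shuffle many times and tally which of the 4! = 24 permutations came up.
for _ in range(240_000):
    deck = items[:]            # copy so the original order is preserved
    random.shuffle(deck)
    counts[tuple(deck)] += 1

# With a fair shuffle, each permutation should appear roughly
# 240_000 / 24 = 10_000 times.
for perm, n in counts.most_common():
    print(perm, n)
```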
Why does this matter? Because a shuffled list is expected to have certain statistical properties, all of which are guaranteed if it’s truly selecting a permutation with equal probability, and not guaranteed if it isn’t. For example:
Every element is equally likely to be the first one listed.
For any pair of values in the list, they have equal probability of being in either order. (“spam” is as likely to come before “eggs” as “eggs” is to come before “spam”.)
Every pair of adjacent elements is equally likely to occur.
If you slice the first N elements from a shuffled list, they are a random selection of N elements from the original list.
You could try some of these expectations out using a list of 10,000 numbers and see how well it holds up. Maybe it won’t. Maybe it will. But it’s not guaranteed.
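For instance, a quick sketch of checking the first property (every element equally likely to end up first), using a smaller list so the tally stays readable:

```python
import random
from collections import Counter

population = list(range(10))
first_element = Counter()

# Tally which element lands in position 0 after each shuffle.
for _ in range(100_000):
    deck = population[:]
    random.shuffle(deck)
    first_element[deck[0]] += 1

# Each of the 10 values should land first roughly 10,000 times.
for value, n in sorted(first_element.items()):
    print(value, n)
```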
IMO this is relevant in the place it’s currently written.
It’s a very topical warning: when trying to find an unknown permutation with specific properties, you might be tempted to try all of them in random order. The warning says that isn’t feasible, as most RNGs will never be able to produce all permutations.
There is no theoretical limit on what itertools.permutations() can do. It produces all permutations “in theory”, although the universe may end before it finishes.
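A sketch of that contrast: itertools.permutations() will lazily hand you permutation after permutation, it just never gets anywhere near the end for a long input.

```python
import itertools
import math

seq = range(20)

# permutations() is a lazy iterator, so asking for a few results is cheap...
gen = itertools.permutations(seq)
for _ in range(3):
    print(next(gen))

# ...but exhausting it would mean visiting 20! results.
print(math.factorial(20))   # 2432902008176640000
```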
Shuffling is trying to produce a single permutation, and with each possible permutation equally likely. It cannot do that if the list is “too large”.
I agree the 3rd paragraph is of marginal value now. It was originally written a long time ago, when Python used the Wichmann–Hill generator, with a far smaller period (~6.95e12). That was so small that even most permutations of a deck of cards weren’t possible results. In fact, some permutations of list(range(16)) were out of reach then. This could lead to visibly biased outcomes even in “toy” programs.
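The arithmetic behind that history, as a quick sketch (6.95e12 being the approximate Wichmann–Hill period mentioned above):

```python
import math

wichmann_hill_period = 6.95e12   # approximate period of the old generator

# A deck of cards has 52! orderings, vastly more than that period.
print(math.factorial(52) > wichmann_hill_period)   # True: 52! is about 8.07e67

# Even 16 items were already out of reach: 16! is about 2.09e13.
print(math.factorial(16) > wichmann_hill_period)   # True

# 15! (about 1.31e12) still fit.
print(math.factorial(15) > wichmann_hill_period)   # False
```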
The warning could probably be made less academic if it just said that Python uses a randomness source that is not guaranteed to be appropriate when an even distribution over all possible outcomes is required.
There are endless recipes for shuffling a simulated deck of cards using an API that provides verifiable random noise, along with a way for clients to verify the hand of cards after the fact, so clearly this is a problem that those who need to handle it already know they need to. As long as the warning remains visible to those who may not know they need more than Python provides, and prompts them to think about it, I think it’s fine.
Nope. len(sequence) floats are generated as keys at the start, and it’s again the case that the internal state of the PRNG at the start of this wholly determines the outcome.
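A quick sketch of that point: with the PRNG put into a known state, the sort-by-random-key trick produces exactly the same “shuffle” every time, so the PRNG state on entry really does determine the whole outcome.

```python
import random

sequence = list(range(10))

def sort_shuffle(seq):
    # One random float per element is drawn as a sort key, so the result
    # is fully determined by the PRNG state at the moment of the call.
    return sorted(seq, key=lambda _: random.random())

random.seed(12345)
first = sort_shuffle(sequence)

random.seed(12345)
second = sort_shuffle(sequence)

print(first == second)   # True: same PRNG state, same "random" order
```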
Yes, and the random documentation does state that the module uses a pseudo-random generator. For this paragraph, I’m fine with taking it out, leaving it as is, or maybe rewording it to be a bit clearer and more concise:
“Note that a sequence of length n will have n! possible permutations, which may be greater than the period of the pseudo-random number generator. This means that some permutations may be impossible for random.shuffle() to return. If true randomness is required, use the random.SystemRandom class.”
Or something like that: it explains the facts, the implication, and points towards a solution. It also doesn’t tie the warning down to any specifics like naming the PRNG used. I’m assuming that a cryptographer, mathematician, or anyone who’s working with number spaces this big and needs true randomness would be aware of the issues with using a PRNG in the first place. (But maybe this is a bad assumption to make?)
As @eendebakpt pointed out earlier, there’s an enormous amount of discussion about this already here.
Fine by me if the 3rd paragraph were eliminated, as already covered there.
Short of that, it’s a decent balance between saying too much and too little.
Naming the specific PRNG used is important for those aware of the issues, because the period length of the specific PRNG used is crucial. For example, there’s no real reason to be concerned about sequences of length 40 today, under the Mersenne Twister. There were major reasons to be concerned about such sequences under the earlier Wichmann–Hill generator.
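For the record, a sketch of where the crossover sits for the Mersenne Twister (period 2**19937 - 1), which is where the “length 2080” figure in the quoted docs comes from:

```python
import math

mt_period = 2 ** 19937 - 1   # period of the Mersenne Twister

# 40! is nowhere near the period, so short sequences are not a concern.
print(math.factorial(40) < mt_period)   # True

# Find the smallest n whose n! exceeds the period.
n, fact = 1, 1
while fact <= mt_period:
    n += 1
    fact *= n
print(n)   # 2081, so 2080 is the largest length that still fits
```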
For those very concerned, using random.SystemRandom().shuffle() is worth considering, but Python has nothing to do with the platform’s implementation of the underlying system urandom(), so it can’t promise anything about the quality of results. If it’s “just” a crypto-strength PRNG, nothing material about the period-length problem changes. If it mixes in “real entropy” at times, that’s much better in that respect, but Python has no say in whether it does, the rate at which it does, or the quality of its entropy sources.
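A minimal sketch of that alternative, with the caveats above still applying since the entropy ultimately comes from the platform’s urandom():

```python
import random

# SystemRandom draws from os.urandom() instead of the Mersenne Twister,
# so it is not limited by the MT period. Python makes no promises about
# the quality of the platform's underlying randomness source, though.
sysrand = random.SystemRandom()

deck = list(range(52))
sysrand.shuffle(deck)
print(deck)
```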
The extreme length of the referenced issue reflects how very much there is to know, for those who really care.