Random.sample() should return an iterator (or new function `isample`)

boniest · May 27, 2025, 12:53am

here is the code.

i see:

one return statement that recursively calls the function without counts.
two branches where a result array is built up one-by-one in a for loop then finally returned at the end.

i might be wrong but this looks like an ideal candidate to also have a version that returns an iterator, i think.

it seems to me that a lot of numerical applications use sample() a lot. i could be wrong.

an example in the documentation names the ability to choose a ‘winner’ and ‘runner-up’ with sample(). there is a case where an iterator would be much more appropriate to return, say if there are billions of potential winners.

(edit: to clarify, the documentation talks about ‘winners’ from which there is a ‘grand prize winner’ and ‘runner-up’. so the sample returned would not be of size 2. it would be of a size for all potential winners, out of which a ‘grand prize winner’ and ‘runner up’ are chosen. the advantage to an iterator is a low latency to get the first item, so this example from the documentation was not a good choice on my part.)

thoughts?

    def sample(self, population, k, *, counts=None):
        if not isinstance(population, _Sequence):
            raise TypeError("Population must be a sequence.  "
                            "For dicts or sets, use sorted(d).")
        n = len(population)
        if counts is not None:
            cum_counts = list(_accumulate(counts))
            if len(cum_counts) != n:
                raise ValueError('The number of counts does not match the population')
            total = cum_counts.pop() if cum_counts else 0
            if not isinstance(total, int):
                raise TypeError('Counts must be integers')
            if total < 0:
                raise ValueError('Counts must be non-negative')
            selections = self.sample(range(total), k=k)
            bisect = _bisect
            return [population[bisect(cum_counts, s)] for s in selections]
        randbelow = self._randbelow
        if not 0 <= k <= n:
            raise ValueError("Sample larger than population or is negative")
        result = [None] * k
        setsize = 21        # size of a small set minus size of an empty list
        if k > 5:
            setsize += 4 ** _ceil(_log(k * 3, 4))  # table size for big sets
        if n <= setsize:
            # An n-length list is smaller than a k-length set.
            # Invariant:  non-selected at pool[0 : n-i]
            pool = list(population)
            for i in range(k):
                j = randbelow(n - i)
                result[i] = pool[j]
                pool[j] = pool[n - i - 1]  # move non-selected item into vacancy
        else:
            selected = set()
            selected_add = selected.add
            for i in range(k):
                j = randbelow(n)
                while j in selected:
                    j = randbelow(n)
                selected_add(j)
                result[i] = population[j]
        return result

edit: or add a new function isample. choices and (maybe) shuffle also are candidates for an iterator function.

Rosuav · May 27, 2025, 12:55am

Benchmark it. Under what circumstances is this advantageous?

boniest · May 27, 2025, 1:17am

i type faster with lower case.

it would be more appropriate because instead of creating a huge list of billions of entries, it would produce one.

boniest · May 27, 2025, 1:28am

well if i am not mistaken which i very well might be, the code i displayed above has a list created.

so if there are billions of potential winners, you have an entire list of potential winners.

but only one of them is winner.

which is why i thought an iterator might come in handy here, because the results array is built up one by one in a for loop and that seems like a good candidate for an iterator.

edit: meaning, i only need to know the winner. that whole list with billions of entries is not useful except for the winner at the very front of it that was put there the first go around during the for loop. am i wrong?

boniest · May 27, 2025, 2:52am

here is a repository with the random class now having an isample method that returns an iterator.

here is the line of printout from my computer comparing sample to isample with a population size of pow(2, 10) and a k=pow(2, 10) (ie. a list of pow(2, 10) is generated by sample), then accessing the first one.

the latency is lower. i think i changed three lines of code (hopefully without adding any bugs).

PS C:\Users\boniest\source\code> python .\sample.py
avg sample time: 0.0002298271283507347
avg isample time: 1.2665055692195892e-05

jamestwebber · May 27, 2025, 3:28am

What’s wrong with choosing a single random number from the desired range? This seems like you’re just using the wrong function for what you need.

boniest · May 27, 2025, 3:52am

you are correct, but i used the “winner” example only because it is used in the documentation for random.sample().

here is the exact text. i used “winner” in the sense that they use “grand prize winner”:

Returns a new list containing elements from the population while leaving the original population unchanged. The resulting list is in selection order so that all sub-slices will also be valid random samples. This allows raffle winners (the sample) to be partitioned into grand prize and second place winners (the subslices).

(source: random — Generate pseudo-random numbers — Python 3.13.3 documentation)

so let me simplify my sales pitch.

using an iterator in this case lowers the latency of receiving the first random sample.

for instance in a for loop, the latency of getting into the loop with the first entry is an order of magnitude faster with an iterator. meanwhile if you want the entire list then you just do the norm. so i don’t see a downside because it is basically just the standard sales pitch for iterators. or am i wrong?

for s in sample(population, k=n): # builds entire list first before emitting any
  print(s)

for s in isample(population, k=n): # emits the first as soon as it is determined
  print(s)

boniest · May 27, 2025, 4:10am

if i am not mistaken, random.choices() can easily be made to return an iterator, too. though it would probably be necessary to leave that alone and create a different method like ‘ichoices’.

edit: also random.shuffle() by emitting in reverse (i think). i will test these but am off to bed for now.

gerardw · May 27, 2025, 10:34am

What if population changes while iterating?

jamestwebber · May 27, 2025, 1:42pm

I get your point. It’s still not clear when I would care about this, though. Time-to-first-element is not particularly important in a lot of scenarios. A real example would be more convincing.

These functions can’t be changed easily. Returning an iterator would break existing code: users that wanted a sequence now need to wrap their call in list(). So this proposal is either to add new functions (like isample), or start a deprecation process to change the output. Either way, real use-cases for returning an iterator (preferably from existing code, not a hypothetical) would be helpful.

boniest · May 27, 2025, 3:22pm

What if population changes while iterating?

this should not happen because the function is called with a static population parameter. any new function isample that returns an iterator would have the same parameters, presumably, only return an iterator instead of a list.

boniest · May 27, 2025, 3:26pm

I get your point. It’s still not clear when I would care about this, though. Time-to-first-element is not particularly important in a lot of scenarios. A real example would be more convincing.

the sales pitch is the same for iterators in general. i get that latency is not important in a lot of scenarios, but in the remaining scenarios it is important.

i don’t have a real world example. whenever sample is created and used in code one-by-one. also choices and (maybe) shuffle could also return a generator.

These functions can’t be changed easily. Returning an iterator would break existing code: users that wanted a sequence now need to wrap their call in list(). So this proposal is either to add new functions (like isample), or start a deprecation process to change the output. Either way, real use-cases for returning an iterator (preferably from existing code, not a hypothetical) would be helpful.

you are correct that i should have suggested adding a function and not changing the current one. i am new here.

actually if i am not mistaken, a beginner could change sample (or create a new method that returns an iterator). if you look above i posted a repository with an added isample method that is identical to sample except for 4 lines that are commented out and 3 lines that now use iterators.

the list is created, then is populated one-by-one from the start of the list. it is simply a matter of yield -ing the value instead of putting it into a list.

jamestwebber · May 27, 2025, 3:36pm

Welcome! And don’t worry about it. The important lesson here is that it is never simple to modify the standard library, or CPython internals. Even trivial changes come with a cost in terms of maintenance and documentation. It’s also important to consider the cost for umpteen package maintainers, as they will get drive-by PRs to update to the “new” thing.

All of this adds up a fairly high bar for addition to the Python core. Small improvements are no longer small.

I don’t anyone would argue that this is possible. But if there’s no use for it, it’s very hard to justify changing or adding code. There’s plenty of useful stuff to do instead.

Perhaps a package on PyPI would make sense, if nothing like this exists already.

boniest · May 27, 2025, 3:44pm

alright, thanks for you consideration. it is an order of magnitude speed up in latency so i thought i’d mention it.

boniest · May 27, 2025, 4:06pm

apart from the real world example of generating a sample and iterating over it, an isample with k=len(population) would serve as a shuffle that returns an iterator instead of shuffling in place.

i really don’t see a downside. i expect to do this myself and put it on pypi.

tim.one · May 27, 2025, 5:45pm

If a user only wants a winner and a runner-up, they would pass k=2 to sample(), yes? For example, let’s get a winner and runner-up from positive ints with no more than 18 digits:

>>> import random
>>> random.sample(range(10**18), 2)
[304657231332148292, 259090992183583552]

That requires no appreciable time or RAM - the only list constructed is the 2-element result list.

An internal list is constructed if and only if the code believes it would require less RAM than using a set for an internal accept/reject method. For k=2, an internal set (tiny!) will always be used

boniest · May 27, 2025, 6:00pm

you are right Tim. i would not have picked this example if not for the documentation which talks about a pool of possible winners and then a grand prize winner.

a better example would be if we need ALL the winners from a larger subset of players (edit: AND a grand prize winner), but frankly this use case is marginal at best for an iterator.

where the iterator would come in handy, i think, is scientific applications where either sample or choices or even shuffle is used to create a sample which is iterated.

boniest · May 27, 2025, 6:07pm

with maximum respect, i could be wrong, but the code seems to create a list results of size n whenever the counts parameter is left unspecified, and when the counts parameter is not None, the function recursively calls itself without counts.

tim.one · May 27, 2025, 6:15pm

Try my example: random.sample(range(10**18), 2). There’s not a computer on the planet with enough RAM to construct a list with 10**18 elements. Instead the code uses an internal set whose maximum len is k (2).

boniest · May 27, 2025, 6:31pm

you are correct. however we are talking about different things. the ‘results’ list size is determined by ‘k’. i am on my phone. i probably should have used ‘k’ before and not ‘n’.

edit: in your example if we want to take a sample of 1/2 of range(pow(10, 18)) then the program would crash when creating the ‘results’ list, but an iterator should be able to handle it. i think.