Randomly select a line from a file using little memory

Right, that’s what it’s called. I’ve seen the name since, but I didn’t know it at the time I first saw the technique.

Not quite sure what you mean. Higher probabilities are used for the initial checks, but that balances against the fact that they could be overruled later. And vice-versa.

:man_facepalming: Of course.

Oh, even better. Although I don’t think it generalizes.

Mathematically proving exactly how much bias it has may be tricky. But it’s worth noting that, apart from being shorter, this isn’t particularly better than the previously-given algorithm; it still has to generate one random number for every line in the file. For a PRNG like random(), that’s probably fine, but for system entropy, that’s quite expensive.

Of course, it also might be unnecessary to use true entropy for every line. A PRNG seeded from true entropy might well be sufficient, and would be a whole lot faster.

You mean generalize to selecting k instead of 1? You could use heapq.nsmallest for that.
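For example, a rough sketch (choice_k is just a placeholder name; same idea as the min solution, keeping the k smallest random keys):

import heapq
from random import random

def choice_k(xs, k):
    # k items chosen uniformly without replacement:
    # the k items whose random keys happen to be the smallest
    return heapq.nsmallest(k, xs, key=lambda _: random())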

Well… random() is fairly fast. A little benchmark, choosing from 10000 items:

  1.05 ± 0.00 ms  choice_min
  6.11 ± 0.04 ms  choice_reservoir_2
 28.38 ± 0.58 ms  choice_reservoir

Where choice_reservoir_2 uses random.randrange.

Benchmark code:

from timeit import timeit
from statistics import mean, stdev
from itertools import repeat
from random import random, randrange
from secrets import randbelow

def choice_reservoir(xs):
    # reservoir sampling: keep the n-th item with probability 1/n
    for n, x in enumerate(xs, 1):
        if not randbelow(n):
            result = x
    return result

def choice_reservoir_2(xs):
    # same, but with the PRNG-based random.randrange instead of secrets.randbelow
    for n, x in enumerate(xs, 1):
        if not randrange(n):
            result = x
    return result

def choice_min(xs):
    # keep the item whose random key turns out to be the smallest
    return min(xs, key=lambda _: random())

solutions = choice_reservoir, choice_reservoir_2, choice_min

for s in solutions:
    print([s(range(3)) for _ in range(1000)].count(1))

times = {s: [] for s in solutions}
def stats(s):
    ts = [t * 1e3 for t in sorted(times[s])[:10]]
    return f'{mean(ts):6.2f} ± {stdev(ts):4.2f} ms '
for _ in range(100):
    for s in solutions:
        t = timeit(lambda: s(repeat(None, 10**4)), number=1)
        times[s].append(t)
for s in sorted(solutions, key=stats):
    print(stats(s), s.__name__)

Attempt This Online!

Ok, I think I somewhat get what you’re saying now. But I only see you talking about real numbers. I see nothing there reflecting the limited number of possible results from random() (2⁵³, if I remember correctly). Am I overlooking it? I knew that earlier lines have a higher probability due to equal random() values; that’s the tiny bias I meant. But I don’t see how to get that from what you’ve shown.

Actually, I computed the wrong random variable. I computed the distribution of the minimum (the first order statistic), but what you are using is the index/rank of the minimum. If we assume that random() is giving us samples from iid random variables, then the position of the minimum is uniformly distributed.

Hmm, that’s not true, and it’s also not what the linked page says. Maybe it would be if you added “continuous”, like they do there. But random() isn’t continuous. Like I said, it chooses from 2⁵³ or so values. Consider an extreme case: if it instead chose from just 2 values, then the min solution would result in the first line being chosen 50% of the time (namely whenever random() produced the smaller of the two values for it). The second line would have 25% probability. Etc. That would be terribly biased, not uniform. With 2⁵³ values it’s much better, but there’s still a tiny bias. The question is how tiny.
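A quick check of that extreme case, using getrandbits(1) to stand in for a hypothetical 2-value random():

from collections import Counter
from random import getrandbits

# min over 20 "lines" with a 1-bit random key:
# line 0 wins ~50% of the time, line 1 ~25%, line 2 ~12.5%, ...
counts = Counter(min(range(20), key=lambda _: getrandbits(1)) for _ in range(100_000))
print(sorted(counts.items()))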

The proofs don’t assume continuity, but they do assume no duplicates. We can work under the assumption that the file has far fewer than 2⁵³ lines.

Well, yes, but random() can produce duplicates. And yes, their file certainly has far fewer than 2⁵³ lines, likely even fewer than the ~2²⁷ lines that would give a 50% chance of any duplicate (birthday paradox), and for duplicates to be an issue, they’d have to be not just duplicates but the minimum value. Which is why I feel relatively confident about calling the bias “tiny”.
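For reference, the usual birthday approximation (my own back-of-the-envelope numbers):

from math import exp, log, sqrt

# number of lines giving a ~50% chance of any duplicate among 2**53 values
print(sqrt(2 * log(2) * 2**53))           # ≈ 1.1e8, i.e. roughly 2**26.7

# chance of any duplicate key for a 10-million-line file
n = 10**7
print(1 - exp(-n * (n - 1) / 2 / 2**53))  # ≈ 0.006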

Yes, that’s true, but remember that random() is a PRNG, not system entropy; the OP asked for something based on the secrets module. Try that again with secrets.randbelow and you’ll find much worse numbers.

That’s why I suggested the compromise of seeding a PRNG with entropy and then using that, which - if I’m not mistaken - is what random() actually is, and thus is usually “good enough”. It’s still not using true entropy though, and if you were able to inspect all of Python’s memory before asking for a random line from a file, you would be able to predict which one would be returned.
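Something like this sketch, i.e. one entropy-seeded generator reused for all the keys (the names are just placeholders):

import secrets
from random import Random

# a Mersenne Twister seeded once from ~128 bits of OS entropy;
# fast per call, but predictable to anyone who can read its state
rng = Random(secrets.randbits(128))

def choice_min_seeded(xs):
    return min(xs, key=lambda _: rng.random())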

I know, that’s why I included the randrange version in the benchmark, as that is also a PRNG. And it’s still much slower.

Yeah, that’s why I only mentioned it in parentheses and prefaced it with “if a tiny bias were acceptable”, because wanting to use secrets sounds to me like wanting perfection.

What do you mean? That’s already in there.

Edit: oh you mean try the min solution with randbelow? I don’t think that makes sense. I’d use secrets.randbits instead.

Benchmark for the min solution with various randomness functions, again the time for choosing from 10000 items:

  1.04 ± 0.02 ms  random.random
  1.51 ± 0.00 ms  random.getrandbits
  3.83 ± 0.03 ms  random.randbytes
  7.58 ± 0.05 ms  random.randrange
 12.43 ± 0.07 ms  os.urandom
 14.63 ± 0.16 ms  secrets.token_bytes
 16.55 ± 0.13 ms  secrets.randbits
 37.25 ± 0.50 ms  secrets.randbelow

I used 53 bits or 7 bytes of randomness with each, as 53 bits is indeed what random() uses. The expression 2**53 gets compiled to a constant, so it doesn’t add to the timings.

Code:

from timeit import timeit
from statistics import mean, stdev
from itertools import repeat
from random import random, randrange, getrandbits, randbytes, shuffle
from secrets import randbelow, randbits, token_bytes
from os import urandom

def random_random(xs):
    return min(xs, key=lambda _: random())

def random_randrange(xs):
    return min(xs, key=lambda _: randrange(2**53))

def random_getrandbits(xs):
    return min(xs, key=lambda _: getrandbits(53))

def random_randbytes(xs):
    return min(xs, key=lambda _: randbytes(7))

def secrets_randbelow(xs):
    return min(xs, key=lambda _: randbelow(2**53))

def secrets_randbits(xs):
    return min(xs, key=lambda _: randbits(53))

def secrets_token_bytes(xs):
    return min(xs, key=lambda _: token_bytes(7))

def os_urandom(xs):
    return min(xs, key=lambda _: urandom(7))


solutions = [
    random_random,
    random_randrange,
    random_getrandbits,
    random_randbytes,
    secrets_randbelow,
    secrets_randbits,
    secrets_token_bytes,
    os_urandom,
]

for s in solutions:
    print([s(range(3)) for _ in range(1000)].count(1))

times = {s: [] for s in solutions}
def stats(s):
    ts = [t * 1e3 for t in sorted(times[s])[:10]]
    return f'{mean(ts):6.2f} ± {stdev(ts):4.2f} ms '
for _ in range(100):
    shuffle(solutions)
    for s in solutions:
        t = timeit(lambda: s(repeat(None, 10**4)), number=1)
        times[s].append(t)
for s in sorted(solutions, key=stats):
    print(stats(s), s.__name__.replace('_', '.', 1))

Attempt This Online!

But again, their wanting to use the secrets module sounds to me like they want perfection, so using the min solution with any of the secrets functions doesn’t make sense to me (that’s probably the main reason I originally didn’t understand that that’s what you meant).

Oh! Sorry, I think I misread your imports - I thought you were using multiple strategies that were all built on the random module. My bad. Withdraw the objection.

Thanks for clarifying it in the followup; if anyone else made the same mistake I did, this will make it a bit easier to see what’s happening.

Ha, ok :-). Anyway, I added a few more to the min benchmark above; os.urandom(7) is now the fastest “secure” one.

In the title and first message you only said “little memory”. Which one is it? If just “little”, then you could do things like read chunks of 1000 lines into memory and select a random chunk, then a random line within that chunk. That would very much reduce how many random numbers you need. Then again, if I didn’t overlook it, you never said anything about speed…
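Roughly like this sketch (random_line_chunked is just an illustrative name; it needs about two random numbers per chunk instead of one per line, and stays uniform because each chunk is kept with probability proportional to its size):

import secrets
from itertools import islice

def random_line_chunked(path, chunk_size=1000):
    chosen = None
    total = 0
    with open(path) as f:
        while chunk := list(islice(f, chunk_size)):
            total += len(chunk)
            # keep this chunk with probability len(chunk)/total, then pick
            # one of its lines uniformly; overall each line has chance 1/total
            if secrets.randbelow(total) < len(chunk):
                chosen = chunk[secrets.randbelow(len(chunk))]
    return chosen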

I don’t really care about speed, as long as it doesn’t take 10 seconds to run the script (correct me if I should prioritize speed, and why). I’m using the script for secret purposes, so I didn’t release the bash source, but posted the pseudocode instead.

I’m thinking of making two functions. One for putting the lines in a list and then using secrets to pick a random line, if the file is small enough for memory (granted, I have to ask: what’s small? how do I know if there’s enough memory for n lines?), and two for using two passes over the file or the reservoir sampling algorithm if the file is large.
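Roughly what I have in mind (just a sketch, names made up):

import secrets

def random_line_small(path):
    # small file: load all lines, pick one with the secrets module
    with open(path) as f:
        return secrets.choice(f.read().splitlines())

def random_line_large(path):
    # large file, two passes: count the lines, then fetch the chosen one
    with open(path) as f:
        n = sum(1 for _ in f)
        k = secrets.randbelow(n)
        f.seek(0)
        for i, line in enumerate(f):
            if i == k:
                return line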

Wouldn’t checking a checksum handle most of those possibilities?

You could checksum it to ascertain whether or not it’s changed, but if the checksum is different, what do you do? You still have the same possible problems, with the same potential solutions. It’s really only the line count that matters here.

Of course, if you have a transactional file system (eg if your “file” is actually rows in a PostgreSQL database), you can guarantee this. But most actual files don’t have that promise.

Well, I’m more concerned about the security risks if you’re talking about the random algorithm.