How do I randomly select a line in a file in Python using the secret module that uses little memory?
That’s not really possible, at least with an arbitrary text file. To find a specific line, or even to learn how many lines there are, you have to read through the file, counting newline characters. You would really need a stored index - some compact data telling you at what position each line starts in the file. A robust approach would be a database file, but you could come up with something yourself.
If you can’t change the format, you have no choice other than to read the whole file in. If you really want to keep memory down, you could read it twice: the first time to count the lines, the second to find the selected one.
Maybe we could select a random byte of the file, assuming we know how many bytes the file contains, then figure out which line it belongs to by seeking backwards and forwards until newline characters are encountered. We would also have to decide what to do when the randomly selected byte is itself a newline character. An obvious drawback is that longer lines will have a higher probability of being selected.
I do not know if it is a better strategy than any other; it all depends on what kind of files we are working with: mostly long lines, short lines, lines of comparable lengths, inconsistent line lengths, long files, short files, etc. This strategy is probably best suited to long files containing only short lines of comparable lengths.
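A minimal sketch of this byte-seek idea (the function name is mine; it assumes a non-empty file with \n line endings, and it inherits the length bias just described):

import os
import secrets

def random_line_by_byte(path):
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        pos = secrets.randbelow(size)  # random byte offset in the file
        # scan backwards to the start of the line containing that byte
        while pos > 0:
            f.seek(pos - 1)
            if f.read(1) == b"\n":
                break
            pos -= 1
        f.seek(pos)
        return f.readline().decode()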
What is meant by “the secret module that uses little memory”? Do you have a specific library in mind?
No, sorry for the confusion over my bad English. The standard secrets library (secrets — Generate secure random numbers for managing secrets — Python 3.11.4 documentation) is what I mean, and I want the script to use as little memory as possible.
How about I use sum() to get the number of lines, then secrets’ randbelow() method + 1 to get the random line number, and finally loop through the file for that line number (or maybe use the linecache library)?
There is a standard algorithm for this that I have known about forever and learned about back when I still used Perl (so it must be something like 20 years now). Reading the entire file is unavoidable, but it is not necessary to store more than one line simultaneously (except, of course, that the system will buffer a full “page” of the file while reading it - you can check the size as io.DEFAULT_BUFFER_SIZE; on my system it is 8KB). The algorithm is as follows:
- Read the first line and initially “choose” it.
- Read the next line and, with probability 1/2, replace the selection with the newly read line.
- Read the next line and do so with probability 1/3, etc.
In general, when we read the nth line, we replace the result variable with probability 1/n. (The first line isn’t a special case; we assign it with probability 1/1 = 1, and it doesn’t matter that there wasn’t a value before.) Correctness can easily be proven by induction - left as an exercise.
In Python, this might look like:
with open('...') as f:
    for line in f:
        if not secrets.randbelow(n):  # result was 0, so 1/n chance was confirmed
            result = line
# result will not be defined if the file is empty;
# error handling is up to your individual needs.
The algorithm can also be generalized to select a sample of k out of the n lines: choose the first k in a list, and then for each subsequent line generate a random index from 0..n; if it’s in range, replace an element, and otherwise ignore it. This does not preserve the original order of the lines; that’s not hard to fix, but hurts performance.
BTW: the random standard library doesn’t implement these algorithms, and they’re among the things I’d like to see added.
While this does indeed minimize memory usage, it comes at the price of N individual calls to the RNG. With a PRNG such as the default random.random() function, this is probably not a problem, but with a large file and the secrets module the OP demanded, this will put a significant load on the system entropy source. Here’s some stats:
Starting with the whole thing in memory:
rosuav@sikorsky:~$ python3 -m timeit -s 'import random' 'random.choice(list(open("/usr/share/dict/british-english-insane")))'
5 loops, best of 5: 77 msec per loop
rosuav@sikorsky:~$ wc -l /usr/share/dict/british-english-insane
654299 /usr/share/dict/british-english-insane
rosuav@sikorsky:~$ python3 -m timeit -s 'import random' 'random.choice(range(654299))'
500000 loops, best of 5: 682 nsec per loop
Since I have enough memory to hold this word list and take a single selection, this is pretty cheap. Comparing with the secrets module:
rosuav@sikorsky:~$ python3 -m timeit -s 'import secrets' 'secrets.choice(list(open("/usr/share/dict/british-english-insane")))'
5 loops, best of 5: 75.6 msec per loop
rosuav@sikorsky:~$ python3 -m timeit -s 'import secrets' 'secrets.choice(range(654299))'
100000 loops, best of 5: 3.48 usec per loop
With the actual file, the difference between secrets and random is completely insignificant - not enough difference to make a difference, within the variation from one round to the next [1], and basically not worth considering. Even on the in-memory range, the time cost only went up from half a microsecond to a few microseconds - not really a lot. But what if we use the lower-memory algorithm?
rosuav@sikorsky:~$ python3 -m timeit -s 'import random' 'for n, datum in enumerate(range(654299), 1):' ' if not random.randrange(n):' ' selection = datum'
1 loop, best of 5: 358 msec per loop
rosuav@sikorsky:~$ python3 -m timeit -s 'import secrets' 'for n, datum in enumerate(range(654299), 1):' ' if not secrets.randbelow(n):' ' selection = datum'
1 loop, best of 5: 1.99 sec per loop
Ouch. System randomness is NOT CHEAP. To justify this, you’d need to have insufficient memory to store the whole file AND the cost of reading from the disk has to be high enough to not want to do this in two passes.
[1] Thanks, Paul Harrell.
I’m not really a math person, but I feel like this algorithm makes the secrets module kind of useless? Please correct me if I’m wrong so I can learn.
What about this code? Is it good enough?
import os
import secrets

def count_lines(filepath: os.PathLike) -> int:
    with open(filepath, "r", encoding="UTF-8") as file:
        return sum(1 for __ in file)

def random_line(filepath: os.PathLike) -> str:
    num_lines = count_lines(filepath)
    random_index = secrets.randbelow(num_lines)
    with open(filepath, "r", encoding="UTF-8") as file:
        for index, line in enumerate(file):
            if index == random_index:
                return line.strip()
    return ""  # Line not found
The secrets module offers a few very specific utilities for working with random numbers, which are not the same as the ones in the random module. They have been chosen to be ones that make the most sense in the cases where you need cryptographically secure randomness. But the point of the module is that cryptographic security.
To make sure you understand: this means that people who study the random results should never be able to figure out anything about the logic used to compute the random numbers (and therefore predict what values will come next). This may come at the expense of speed, and is usually not necessary - so it isn’t the default.
So, what do you mean by “make it kind of useless?”
If you do need to make absolutely sure that someone using your program can’t guess which line will be chosen, based on previous program output (like, because someone’s bank account could be compromised - if it’s “somebody could make a bot to cheat at a single-player videogame”, then probably nobody cares), then it will not matter that you ask for many random numbers. That’s the point, after all.
If you mean “I think the module should implement this algorithm for me already” - well, neither does the basic random module. Although that one does implement a lot of other stuff that is not in secrets, because it’s stuff that’s generally useful when you do simulations or modelling, and not useful when you try to e.g. invent an agreed-upon secret name for the person you’re communicating with that third parties don’t know (basically what an API token is). Meanwhile, secrets contains a couple of things that are more useful in those contexts (like token_hex).
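For example, a quick illustration (the output is random each time):

import secrets

token = secrets.token_hex(16)  # 16 random bytes as a 32-character hex string
print(token)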
This approach will avoid storing the whole file simultaneously in memory, yes. As you’re aware, it requires opening and reading through the file twice.
This algorithm would be built upon the secrets module. Broadly speaking, whenever a programmer talks about “probability”, there are two ways to get that:
- A pseudo-random number generator - basically a big ol’ number cruncher that makes “fairly random” numbers
- Or true entropy, which comes from spinning cylinders in fluid, or blackbody radiation
(Sometimes “true entropy” is augmented with “true unpredictability” which comes from sources such as keypress timings. From a program’s point of view, it’s all the same.)
A PRNG can, if you know enough about its internal state, be predicted. It is algorithmic after all. However, it’s also really REALLY fast, so a good hybrid is to get a little bit of true entropy and use that to seed a PRNG.
In Python, you can access an entropically-seeded PRNG using the random module, or you can get true entropy using the secrets module. For the vast majority of purposes, the random module is fine, but if you’re making actual cryptographic secrets, well, that’s where the secrets module got its name from.
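As a sketch, that hybrid could look like this (purely illustrative - the random module already seeds itself from OS entropy at startup, so you don’t normally need to do this by hand):

import random
import secrets

# seed a fast PRNG with a little true entropy
rng = random.Random(secrets.randbits(128))
print(rng.random())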
The algorithm described is one which depends on a source of randomness, so it can be built on either; however, since it requires so many separate questions of “should we replace the selection with this one?”, it is horrendously inefficient with true entropy.
Two passes over the file is an efficient way to do things in that it doesn’t require storing the whole file in memory, uses only one call to the random number generator, and has finitely bounded costs. However, what happens if the file changes while you’re reading it twice? There are a few possibilities:
- The file has been appended to. Not a big deal as you will just ignore the newly-added lines.
- The file has been prepended to. Uncommon but possible. You will ignore the trailing N lines, which could potentially bias your sample away from the tail of the file (imagine if you’re selecting from a file which is constantly being prepended to; you could have something that is biased against the oldest lines in the file).
- The file has been altered, but has the exact same number of lines. Not a problem.
- The file has had some lines deleted. You now have a big problem: what do you do? In your example here, you return an empty string, so your code is biased towards empty string returns. I personally would prefer an exception here; another option would be to use the final index as your new num_lines and repeat the exercise, although this can result in unbounded cost to select.
Whether it is cheaper to do two passes or to hold everything in memory depends on the relative costs of I/O and temporary storage, but if the choice is to use more I/O, it would be worth at least acknowledging these risks.
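A sketch of the exception-raising variant suggested above (the names are mine):

import secrets

def random_line_two_pass(path):
    with open(path) as f:
        num_lines = sum(1 for _ in f)  # pass 1: count lines
    if num_lines == 0:
        raise ValueError("empty file")
    target = secrets.randbelow(num_lines)
    with open(path) as f:              # pass 2: fetch the chosen line
        for index, line in enumerate(f):
            if index == target:
                return line
    raise RuntimeError("file shrank between passes")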
One option could be to generate a random number in [0,1], interpreted as the target r=\frac{\text{random line number}}{\text{total number of lines}}.
Then use two file handles, h1 and h2. h1 seeks towards the end of the file, and each time it advances a line, h2 is advanced only if the ratio of h2’s line number to h1’s line number would otherwise fall below r.
When h1 reaches the end of the file, return the line where h2 now is.
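If I understand the suggestion, a sketch might look like this (keeping h2 on line int(r*n)+1 while h1 reads line n; the helper name is mine):

import secrets

def random_line_trailing(path):
    r = secrets.randbelow(2**53) / 2**53   # uniform in [0, 1)
    with open(path) as h1, open(path) as h2:
        n1 = n2 = 0
        current = None
        for _ in h1:                       # leading handle counts lines
            n1 += 1
            while n2 < int(r * n1) + 1:    # trailing handle keeps pace
                current = h2.readline()
                n2 += 1
        if current is None:
            raise ValueError("empty file")
        return current

Like the reservoir approach, this holds only a line or two in memory at a time, but it needs just one random number.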
So you use the Reservoir Sampling algorithm?
And you are right, I mean “I think the module should implement this algorithm for me already”.
It just feels unfair that the second, third line and so on have more chances, but I just don’t understand the algorithm.
Let’s start from the last line in the file and work back. (For this example I’ll use a ten-line file, but the same reasoning applies at any size.) This line is only ever checked once, and since it’s the tenth line, there’s one chance in ten that it will be selected. This is perfect!
By simple arithmetic, we can see that there’s a 90% chance that one of the previous nine will be selected. Of those lines, the last one (the ninth line) was checked once, and at the time it had one chance in nine of being selected. That is to say, out of the 90% chance of it not being the very last, there’s a one-in-nine chance of it being the second last, and multiplying those together gives a 10% chance that it’s the second-last line. Also correct!
You can follow the same logic back all the way to the first one. Every line has the same chance, because of this repeated slicing and re-slicing.
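A quick empirical check of that argument, using the ordinary PRNG for speed:

import random
from collections import Counter

counts = Counter()
for _ in range(100_000):
    selection = None
    for n, line in enumerate(range(10), 1):  # stand-in for a ten-line file
        if not random.randrange(n):          # replace with probability 1/n
            selection = line
    counts[selection] += 1
print(counts)  # each of the ten "lines" comes up roughly 10,000 times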
You really make me think! Thank you very much!
But I’m really bad at percentages and decimals, and so at probability. Still, this is a nice way to challenge how I solve complex problems, even if it’s simple arithmetic.
Can you tell me why the method is called randbelow? Does it have something to do with the word “below”? This is the only guess that makes sense to me at the moment.
secrets’ randbelow method = 1/(n+1)? Since it also includes 0 in the equation or formula, it’s like a list or tuple starting at index 0: the probability of selecting the first line is 1/3 (0, 1, or 2)? If 0, then the first line is selected.
Note: I’m about to send this but I just forgot where I got the 1/(n+1). Math is so hard, man…
Yes, that’s exactly what it is. It returns a random whole number below the one you gave it. (A whole number is a non-negative integer, that is to say, 0, 1, 2, 3, 4, 5, 6, 7…) The random module offers this same functionality under the name randrange, as a single argument to range or randrange gives the same effect of “whole number below this limit”; so if you think about how range(6) gives you 0, 1, 2, 3, 4, and 5, it makes sense that randrange(6) gives you one of those six numbers.
Using randbelow as the fundamental primitive makes a lot of sense; randbelow(10) has precisely 10 possible options, randbelow(100) has precisely 100, etc.
Yep! The possible options for randrange/randbelow precisely match the indices of something with that length. So, for example:
items[random.randrange(len(items))]
items[secrets.randbelow(len(items))]
random.choice(items)
all select a random element from a list/tuple.
randbelow(6) might return 0, 1, 2, 3, 4, or 5. However, it won’t return 6, just like how range(6) doesn’t include 6. So the probability that randbelow(6) == 0 is 1/6.
BTW - Karl’s original description of the algorithm used if not secrets.randbelow(n):, which is effectively equivalent to asking “if randbelow returns zero”, or in a more human way, “if an n-sided dice comes up 1”. (Since dice tend to not come with zeroes on them, usually.)
I think for n, line in enumerate(f, 1): would be better, so that your code actually works (currently you don’t define/update n).
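With that fix applied, the earlier snippet might read:

import secrets

with open('...') as f:
    result = None                      # stays None for an empty file
    for n, line in enumerate(f, 1):
        if not secrets.randbelow(n):   # probability 1/n
            result = line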
More Itertools already has sample, using a more efficient algorithm (fewer random numbers needed), although I think it creates a tiny bias due to using floats. (And if a tiny bias were acceptable, this task could simply be done with result = min(f, key=lambda _: random()).)
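For instance, assuming the more-itertools package is installed (sample returns a list, and the min trick needs random imported as below):

from random import random
import more_itertools

with open('...') as f:
    [line] = more_itertools.sample(f, k=1)  # reservoir-style sample of one line

with open('...') as f:
    line = min(f, key=lambda _: random())   # the float-based trick, with its tiny bias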