Non-optimal bz2 reading speed

(Long time Python user, I have a few packages on PyPi, but first post here!)

I’ve noticed that opening a .bz2 file with the built-in module bz2 and iterating over its lines is probably non-optimal:

import bz2
with bz2.open(r"F:\test.bz2", 'rb') as f:
    for l in f:
        pass

I found that custom-built buffering like this seems to read at least 2x faster.

It seems strange that a "read by 1 MB blocks" method improves things, because Lib/bz2.py already wraps the stream in a BufferedReader. Shouldn't an io.BufferedReader handle this for us?
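A quick, self-contained check (it pokes at the private `_buffer` attribute, which may change between CPython versions) shows both halves of the puzzle: the BufferedReader is there, but line iteration does not go through its C fast path:

```python
import bz2, io, os, tempfile

# Create a tiny bz2 file so the check is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'probe.bz2')
with bz2.open(path, 'wb') as fw:
    fw.write(b'hello\nworld\n')

with bz2.open(path, 'rb') as f:
    # BZ2File already wraps its decompressed stream in a BufferedReader...
    is_buffered = isinstance(f._buffer, io.BufferedReader)

# ...but line iteration dispatches to BZ2File.readline(), which is
# defined in pure Python (module 'bz2'), not in the buffer's C code.
readline_module = bz2.BZ2File.readline.__module__
print(is_buffered, readline_module)
```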

I really think there is still room for a massive improvement because, during my tests on a 10 GB bz2 file:

  • time to decompress test.bz2 into test.tmp + time to process test.tmp line by line

is far smaller than:

  • time to iterate directly over the lines of bz2.open('test.bz2')

That shouldn't be the case, should it?



At a guess, having run into this kind of issue elsewhere in CPython, the default buffer sizes are probably too small for modern PCs and modern workflows.

BTW, this question belongs in the Users category, unless you’re specifically suggesting a way to improve it and looking for opinions on integrating it into the core runtime (in which case, it still probably belongs in Ideas, but I’d forgive a fully-formed idea showing up in this category).


Would you provide the test.bz2 too?

@steve.dower Thank you for your advice. (Noted for the categorization, I’ll do better next time.)

@methane I used files with between 0 and 25 characters per line. Then, for reproducible benchmarks, I generated a random file with these specifications, and it shows the same performance. So feel free to use this snippet, which generates a 93 MB bz2 file, enough for testing:

import bz2, random, string
with bz2.open('test.bz2', 'wb') as f:
    for i in range(10_000_000):
        line = ''.join(random.choices(string.ascii_letters + string.digits, k=random.randrange(25)))
        f.write(line.encode() + b'\n')

I get ~ 250k lines/sec read speed with

import bz2, time
t0 = time.time()
with bz2.open("test.bz2", 'rb') as f:  # 250k lines/sec
    for i, l in enumerate(f):
        if i % 100000 == 0:
            print('%i lines/sec' % (i/(time.time() - t0)))

whereas I get between 2 and 3 times that with the custom-buffering code shown further down, which is a really dirty hack (reinventing the wheel) that should ideally be avoided.


Thank you. I can reproduce it, and I confirmed that the overhead comes from BZ2File.readline(), which is a pure-Python function.

For the record, you don’t have to write custom buffering yourself: io.BufferedReader provides a readline() implemented in C.

import bz2, io

with bz2.open("test.bz2", 'rb') as f:
    for i, L in enumerate(io.BufferedReader(f)):
        pass

Or you can access the internal buffer directly, although it is a private member.

with bz2.open("test.bz2", 'rb') as f:
    for i, L in enumerate(f._buffer):
        pass

Is it safe to add this to BZ2File?

def __iter__(self):
    return iter(self._buffer)
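For what it’s worth, the proposal can be tried out without patching the stdlib, via a subclass (the class name is hypothetical, and it relies on the private `_buffer` attribute):

```python
import bz2, os, tempfile

class FastBZ2File(bz2.BZ2File):
    """Hypothetical subclass: delegate iteration to the internal
    BufferedReader, as the proposed __iter__ would do."""
    def __iter__(self):
        return iter(self._buffer)

# Sanity check: iteration must still yield the same lines.
path = os.path.join(tempfile.mkdtemp(), 'iter.bz2')
with bz2.open(path, 'wb') as f:
    f.write(b'a\nbb\nccc\n')

with FastBZ2File(path, 'rb') as f:
    lines = list(f)
print(lines)  # [b'a\n', b'bb\n', b'ccc\n']
```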

I think gzip and lzma have the same issue. But bz2 is special: it uses an RLock, which makes the overhead larger.

Why does only bz2 use an RLock?
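To get a rough feel for what an RLock costs per call, here is a micro-benchmark sketch (absolute numbers are machine-dependent):

```python
import threading, timeit

lock = threading.RLock()

def locked_noop():
    # roughly what each BZ2File.readline() call pays in 3.9:
    # an RLock acquire/release wrapped around otherwise tiny work
    with lock:
        pass

def plain_noop():
    pass

n = 100_000
t_locked = timeit.timeit(locked_noop, number=n)
t_plain = timeit.timeit(plain_noop, number=n)
print('extra cost per call: ~%.0f ns' % ((t_locked - t_plain) / n * 1e9))
```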

Thanks @methane, your 2 methods already improve it massively: 550k and 600k lines/sec, vs. 250k with the naive method.

This custom method seems to give 700k lines/sec; what do you think about it? Can we avoid this dirty hack by using built-in buffering solutions?

import bz2, time
t0 = time.time()
i = 0
s = b''
with bz2.open("test.bz2", 'rb') as f:
    while True:
        block = f.read(1024*1024)
        if not block:          # end of file
            break
        s += block
        L = s.split(b'\n')     # better than splitlines in case of weird end-of-line like "\r\r\n"
        for l in L[:-1]:       # the 1 MB block that we read might stop in the middle of a line ... (*)
            i += 1
            if i % 100000 == 0:
                print('%i lines/sec' % (i/(time.time() - t0)))
        s = L[-1]              # (*) ... so we keep the rest for the next iteration

You can already try to play with the buffer size.
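For context, BufferedReader’s default buffer is small compared to the 1 MB blocks used above:

```python
import io
# io.DEFAULT_BUFFER_SIZE has historically been just 8 KiB in CPython,
# far below the 1 MB block size of the hack shown earlier.
print(io.DEFAULT_BUFFER_SIZE)
```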


IMHO we could increase the defaults if it creates a better experience for users.


Thanks! I tried larger buffer sizes like io.BufferedReader(f, buffer_size=1024*1024), but it did not really help: still between 550k and 600k lines/sec for my example.

Side-question: why is it required to manually add a BufferedReader:

import bz2, io

with bz2.open("test.bz2", 'rb') as f:
    for i, L in enumerate(io.BufferedReader(f)):
        pass

when it seems that there is already one under the hood:
Lib/bz2.py at 3.9 · python/cpython · GitHub

Shouldn’t the library automatically do this when we do:

with bz2.open("test.bz2", 'rb') as f:
    for i, L in enumerate(f):
        pass
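Until the stdlib does this automatically, a tiny helper (the name `bz2_lines` is made up here) can hide the extra wrapping:

```python
import bz2, io, os, tempfile

def bz2_lines(path):
    """Hypothetical helper: iterate over the lines of a .bz2 file
    through an explicit BufferedReader, so readline() runs in C."""
    with bz2.open(path, 'rb') as f:
        yield from io.BufferedReader(f)

# Demo on a small self-made file.
path = os.path.join(tempfile.mkdtemp(), 'demo.bz2')
with bz2.open(path, 'wb') as f:
    f.write(b'one\ntwo\n')

lines = list(bz2_lines(path))
print(lines)  # [b'one\n', b'two\n']
```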

