Non-optimal bz2 reading speed

Hello,
(Long time Python user, I have a few packages on PyPi, but first post here!)

I’ve noticed that opening a .bz2 file with the built-in bz2 module and iterating over its lines is noticeably slower than it could be:

import bz2
with bz2.open(r"F:\test.bz2", 'rb') as f:  # raw string: avoids "\t" being read as a tab escape
    for l in f:
        pass

I found that a custom-built buffering scheme like this seems to give at least a 2x reading speed-up.

It seems strange that a “read in 1 MB blocks” method improves things, because Lib/bz2.py already uses a BufferedReader. Shouldn’t an io.BufferedReader handle this for us?

I really think there is still room for massive improvement because, in my tests on a 10 GB bz2 file:

  • time to decompress test.bz2 into test.tmp + time to process test.tmp line by line

is far smaller than:

  • time to iterate directly over lines of bz2.open('test.bz2')

That should not be the case, should it?
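For reference, here is a sketch of how I time the two-step method (decompress first, then iterate over the plain file). It uses a tiny generated sample as a stand-in for the real 10 GB file, so the absolute numbers differ, but the structure of the measurement is the same:

```python
import bz2, time

# Small stand-in for the big file (the real test uses a 10 GB test.bz2).
with bz2.open('test.bz2', 'wb') as f:
    for i in range(1000):
        f.write(b'line %d\n' % i)

# Step 1: decompress test.bz2 to a temporary file in 1 MB blocks.
t0 = time.time()
with bz2.open('test.bz2', 'rb') as src, open('test.tmp', 'wb') as dst:
    while True:
        block = src.read(1 << 20)
        if not block:
            break
        dst.write(block)

# Step 2: iterate over the plain decompressed file line by line.
n = 0
with open('test.tmp', 'rb') as f:
    for line in f:
        n += 1
print('%d lines in %.3f s (two-step)' % (n, time.time() - t0))
```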

Jo

1 Like

At a guess, having run into this kind of issue elsewhere in CPython, the default buffer sizes are probably too small for modern PCs and modern workflows.

BTW, this question belongs in the Users category, unless you’re specifically suggesting a way to improve it and looking for opinions on integrating it into the core runtime (in which case, it still probably belongs in Ideas, but I’d forgive a fully-formed idea showing up in this category).

2 Likes

Would you provide the test.bz2 too?

@steve.dower Thank you for your advice. (Noted for the categorization, I’ll do better next time.)

@methane I used files with between 0 and 25 characters per line. For reproducible benchmarks, I then generated a random file with these specifications, and it shows the same performance. So feel free to use this snippet, which generates a 93 MB bz2 file, enough for testing:

import bz2, random, string
with bz2.open('test.bz2', 'wb') as f:
    for i in range(10_000_000):
        # each line: a random run of letters/digits, up to 25 characters
        line = ''.join(random.choices(string.ascii_letters + string.digits, k=random.randrange(25)))
        f.write(line.encode() + b'\n')

I get ~250k lines/sec read speed with

import bz2, time
t0 = time.time()
with bz2.open("test.bz2", 'rb') as f: # 250k / sec
    for i, l in enumerate(f):
        if i % 100000 == 0:
            print('%i lines/sec' % (i/(time.time() - t0)))

whereas the code here is between 2 and 3 times faster. It is a really dirty hack with custom buffering (reinventing the wheel) that should ideally be avoided.

2 Likes

Thank you. I can reproduce it, and I confirmed that the overhead comes from BZ2File.readline(), which is a pure-Python function.

For the record, you don’t have to write custom buffering yourself: BufferedReader provides a readline() implemented in C.

import io, bz2

with bz2.open("test.bz2", 'rb') as f:
    for i, L in enumerate(io.BufferedReader(f)):
        pass

Or you can access the internal buffer directly, although it is a private member.

with bz2.open("test.bz2", 'rb') as f:
    for i, L in enumerate(f._buffer):
        pass

Would it be safe to add this to BZ2File?

def __iter__(self):
    return iter(self._buffer)
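As a quick sanity check, the same idea can be sketched as a hypothetical subclass (note that self._buffer is a private CPython implementation detail and may change between versions):

```python
import bz2

class FastBZ2File(bz2.BZ2File):
    """Hypothetical subclass sketching the proposed fix: iterate via the
    internal io.BufferedReader (whose readline() is C) instead of the
    pure-Python BZ2File.readline().  Relies on the private attribute
    self._buffer, which exists in read mode in current CPython."""
    def __iter__(self):
        return iter(self._buffer)

# Quick check with a tiny sample file.
with bz2.open('demo.bz2', 'wb') as f:
    f.write(b'a\nb\nc\n')
with FastBZ2File('demo.bz2', 'rb') as f:
    lines = list(f)
print(lines)
```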

I think gzip and lzma have the same issue. But bz2 is special: it uses an RLock, which makes the overhead larger.

Why does only bz2 use an RLock?
https://bugs.python.org/issue5863

Thanks @methane, your two methods already improve it massively: 550k and 600k lines/sec, vs. 250k with the naive method.

This custom method seems to give 700k lines/sec. What do you think about it? Can we avoid this dirty hack by using built-in buffering solutions?

import bz2, time
t0 = time.time()
i = 0
s = b''
with bz2.open("test.bz2", 'rb') as f:
    while True:
        block = f.read(1024*1024)
        if not block:          # EOF: an empty read means we are done
            break
        s += block
        L = s.split(b'\n')  # better than splitlines() in case of weird end-of-line like "\r\r\n", see https://stackoverflow.com/questions/65765343/splitlines-and-iterating-over-an-opened-file-give-different-results
        for l in L[:-1]:   # the 1 MB block that we read might stop in the middle of a line ... (*)
            i += 1
            if i % 100000 == 0:
                print('%i lines/sec' % (i/(time.time() - t0)))
        s = L[-1]          # (*) ... so we keep the rest for the next iteration
    if s:                  # count a final line that has no trailing newline
        i += 1
1 Like

You can already try to play with the buffer size.
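For instance, something like this (a sketch with a tiny sample file; buffer_size is a standard io.BufferedReader constructor argument):

```python
import bz2, io

# Tiny sample file standing in for test.bz2.
with bz2.open('demo.bz2', 'wb') as f:
    f.write(b'x\ny\n')

# Wrap the BZ2File in an explicit BufferedReader with a 1 MB buffer
# instead of the default (io.DEFAULT_BUFFER_SIZE, typically 8 KB).
with bz2.open('demo.bz2', 'rb') as f:
    reader = io.BufferedReader(f, buffer_size=1024 * 1024)
    n = sum(1 for line in reader)
print(n)
```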

1 Like

IMHO we could increase the defaults if it creates a better experience for users.

1 Like

Thanks! I tried with larger buffer sizes like io.BufferedReader(f, buffer_size=1024*1024), but it did not really improve: still between 550k and 600k lines/sec for my example.

Side question: why do we need to manually add a BufferedReader:

import io, bz2

with bz2.open("test.bz2", 'rb') as f:
    for i, L in enumerate(io.BufferedReader(f)):
        pass
when it seems that there is already one under the hood:
cpython/bz2.py at 3.9 · python/cpython · GitHub

Shouldn’t the library automatically do this when we do:

with bz2.open("test.bz2", 'rb') as f:
    for i, L in enumerate(f):
        pass

?

1 Like