Non-optimal bz2 reading speed

Hello,
(Long time Python user, I have a few packages on PyPi, but first post here!)

I’ve noticed that opening a .bz2 file with the built-in module bz2 and iterating over its lines is probably non-optimal:

import bz2

with bz2.open(r"F:\test.bz2", 'rb') as f:  # raw string, so \t is not interpreted as a tab
    for line in f:
        pass  # process the line

I found that custom buffering (like the block-read code I post further down in this thread) seems to give at least a 2x speed-up in reading.

It seems strange that a “read in 1 MB blocks” method improves things, because Lib/bz2.py already uses a BufferedReader internally. Shouldn’t io.BufferedReader handle this for us?
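(You can check at runtime that the wrapper is there; _buffer is a private attribute, so this is purely for illustration:)

import bz2, io

f = bz2.open(r"F:\test.bz2", 'rb')
print(isinstance(f._buffer, io.BufferedReader))  # True on 3.9: BZ2File already wraps a BufferedReader
f.close()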

I really think there is still room for a massive improvement because, during my tests on a 10 GB bz2 file:

  • time to decompress test.bz2 into test.tmp + time to process test.tmp line by line

is far smaller than:

  • time to iterate directly over lines of bz2.open('test.bz2')

That should not be the case, should it?
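For reference, the two-step version is roughly this (a sketch, not the exact script I timed; file names and block size are illustrative):

import bz2, shutil

# step 1: decompress the whole archive to a temporary plain file
with bz2.open('test.bz2', 'rb') as src, open('test.tmp', 'wb') as dst:
    shutil.copyfileobj(src, dst, length=1024 * 1024)  # copy in 1 MB blocks

# step 2: iterate over the decompressed file line by line
with open('test.tmp', 'rb') as f:
    for line in f:
        pass  # process the line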

Jo

1 Like

At a guess, having run into this kind of issue elsewhere in CPython, the default buffer sizes are probably too small for modern PCs and modern workflows.

BTW, this question belongs in the Users category, unless you’re specifically suggesting a way to improve it and looking for opinions on integrating it into the core runtime (in which case, it still probably belongs in Ideas, but I’d forgive a fully-formed idea showing up in this category).

2 Likes

Would you provide the test.bz2 too?

@steve.dower Thank you for your advice. (Noted for the categorization; I’ll do better next time.)

@methane My real files have between 0 and 25 characters per line. For reproducible benchmarks I generated a random file with the same characteristics, and it shows the same performance. So feel free to use this script, which generates a ~93 MB bz2 file, enough for testing:

import bz2, random, string
with bz2.open('test.bz2', 'wb') as f:
    for i in range(10_000_000):
        line = ''.join(random.choices(string.ascii_letters + string.digits, k=random.randrange(25)))
        f.write(line.encode() + b'\n')

I get ~ 250k lines/sec read speed with

import bz2, time

t0 = time.time()
with bz2.open("test.bz2", 'rb') as f:  # ~250k lines/sec
    for i, line in enumerate(f):
        if i and i % 100000 == 0:
            print('%i lines/sec' % (i / (time.time() - t0)))

whereas it is between 2 and 3 times faster with the custom-buffering code I post below, which is a really dirty hack (reinventing the wheel) that should ideally be avoided.

2 Likes

Thank you. I can reproduce it, and I confirmed that the overhead comes from BZ2File.readline(), which is a pure Python function.

For the record, you don’t have to write custom buffering yourself. io.BufferedReader provides a readline() implemented in C.

import bz2, io

with bz2.open("test.bz2", 'rb') as f:
    for i, line in enumerate(io.BufferedReader(f)):
        pass

Or you can access the internal buffer directly, although it is a private member.

with bz2.open("test.bz2", 'rb') as f:
    for i, L in enumerate(f._buffer):
        pass

Is it safe to add this to BZ2File?

def __iter__(self):
    return iter(self._buffer)

I think gzip and lzma have the same issue. But bz2 is special: it uses an RLock, which makes the overhead larger.
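As a rough illustration of that per-call cost, here is a toy microbenchmark (not the actual BZ2File code, just the price of taking a lock on every call):

import threading, timeit

rlock = threading.RLock()

def plain():
    pass                 # baseline: no lock at all

def with_rlock():
    with rlock:          # BZ2File takes its RLock around every readline() call
        pass

n = 1_000_000
print('no lock:', timeit.timeit(plain, number=n))
print('RLock  :', timeit.timeit(with_rlock, number=n))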

Why does only bz2 use an RLock?
https://bugs.python.org/issue5863

Thanks @methane, your two methods already improve it massively: 550k lines/sec and 600k lines/sec, vs. 250k with the naive method.

This custom method seems to give 700k lines/sec; what do you think about it? Can we avoid this dirty hack by using built-in buffering solutions?

import bz2, time

t0 = time.time()
i = 0
s = b''
with bz2.open("test.bz2", 'rb') as f:
    while True:
        chunk = f.read(1024 * 1024)
        if not chunk:               # EOF
            break
        s += chunk
        # split on b'\n' rather than splitlines(), to handle weird line endings like "\r\r\n",
        # see https://stackoverflow.com/questions/65765343/splitlines-and-iterating-over-an-opened-file-give-different-results
        lines = s.split(b'\n')
        for line in lines[:-1]:     # the 1 MB block we read might stop in the middle of a line ... (*)
            i += 1
            if i % 100000 == 0:
                print('%i lines/sec' % (i / (time.time() - t0)))
        s = lines[-1]               # (*) ... so we keep the remainder for the next iteration
    if s:                           # last line, if the file does not end with b'\n'
        i += 1
1 Like

You can already try to play with the buffer size.

1 Like

IMHO we could increase the defaults if it creates a better experience for users.

1 Like

Thanks! I tried larger buffer sizes, like io.BufferedReader(f, buffer_size=1024*1024), but it did not really improve things; still between 550k and 600k lines per second for my example.
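i.e. roughly:

import bz2, io

with bz2.open("test.bz2", 'rb') as f:
    # extra BufferedReader wrapping the bz2 file object, with a 1 MiB buffer
    for i, line in enumerate(io.BufferedReader(f, buffer_size=1024 * 1024)):
        pass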

Side question: why is it required to manually add a BufferedReader:

with bz2.open("test.bz2", 'rb') as f:
    for i, line in enumerate(io.BufferedReader(f)):
        pass

when it seems that there is already one under the hood:
cpython/bz2.py at 3.9 · python/cpython · GitHub

Shouldn’t the library automatically do this when we do:

with bz2.open("test.bz2", 'rb') as f:
    for i, line in enumerate(f):
        pass

?

1 Like

As it happens, I studied the sizes of these buffers a few days ago; I will reply here with more details later.

You are right. We fixed it: since Python 3.10, you get fast iteration automatically.

Thank you for reporting it here!

Thanks! I tried larger buffer sizes, like io.BufferedReader(f, buffer_size=1024*1024), but it did not really improve things; still between 550k and 600k lines per second for my example.

The default value of buffer_size in io.BufferedReader() is 8 KiB.
Changing only buffer_size may not help much.

If buffer_size is 100 KiB, a 420 KiB read request will be completed in these steps:

  1. Read 400 KiB from the underlying stream in block mode.
  2. Request 100 KiB from the underlying stream and store it in the buffer; once 20+ KiB has been read, it stops, to avoid blocking indefinitely [1].
  3. Return the 420 KiB of data to the caller.

BufferedReader.readline() has similar behavior: once it reads a ‘\n’, it stops [2].

The role of BufferedReader here is mostly to ensure that enough bytes can be read. The underlying stream’s .read(size) method returns at most size bytes, possibly fewer [3]. BufferedReader.read(300_000) will keep calling the underlying stream’s .read() method until it has gathered 300_000 bytes (or hits EOF).
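A quick way to see that difference (a sketch; _buffer and its .raw attribute are internal details, used here only for illustration):

import bz2

with bz2.open('test.bz2', 'rb') as f:
    raw = f._buffer.raw            # the underlying decompressing stream (private)
    print(len(raw.read(300_000)))  # usually much less than 300_000 bytes

with bz2.open('test.bz2', 'rb') as f:
    print(len(f.read(300_000)))    # 300_000 bytes, unless EOF is reached first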

_compression.BUFFER_SIZE (default 8 KiB) should also be changed; it’s the decompressor’s input chunk size [4]. If it stays at 8 KiB, maybe only a few dozen KiB can be decompressed per call, which can’t fill BufferedReader’s 1 MiB buffer.
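For a local experiment, both values can be bumped like this (a sketch; _compression is a private module, so this is for benchmarking only and may change between versions):

import _compression, bz2, io

_compression.BUFFER_SIZE = 128 * 1024  # decompressor input chunk size (default 8 KiB)

with bz2.open('test.bz2', 'rb') as f:
    for line in io.BufferedReader(f, buffer_size=128 * 1024):  # output-side buffer
        pass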

Even increasing these two values at the same time, the improvement is not big. Increasing them from 8 KiB to 128 KiB, .read(-1) only speeds up from 6.6x seconds to 6.5x seconds. I tested the bz2/lzma modules with the .readline()/.read(-1) methods; when I have time, I will look into the reasons.

At a guess, having run into this kind of issue elsewhere in CPython, the default buffer sizes are probably too small for modern PCs and modern workflows.

The OS usually does read prefetching; this paper, written in 2005, introduced the prefetching mechanisms in Linux/BSD.

For example:

decomp = Decompressor()            # e.g. bz2.BZ2Decompressor()
while True:
    input_data = ifp.read(READ_SIZE)
    if not input_data:             # EOF
        break
    output_data = decomp.decompress(input_data)
    ofp.write(output_data)

If the OS has prefetching and write cache, it may perform the tasks (read/decompress/write) in parallel to some degree.

Before changing the default buffer size, it’s best to ask OS experts and people who are familiar with various storage devices. If it is too large, it may be slower.

@methane @steve.dower For an unknown reason, my posts were marked as “spam” (which they definitely are not), probably because my account is rather new.
Can someone (or a moderator) undo this? Thank you in advance.

You’ll need @moderators for that; I don’t even see the alleged spam.

BTW, someone will need to turn this into a bug on https://bugs.python.org/ to get it fixed in the runtime. This is a discussion board for non-bug items, so nothing gets tracked here.