(Long-time Python user with a few packages on PyPI, but this is my first post here!)
I’ve noticed that opening a .bz2 file with the built-in bz2 module
and iterating over its lines is probably non-optimal:
```python
import bz2

# Raw string: in "F:\test.bz2" the "\t" would otherwise be read as a tab.
with bz2.open(r"F:\test.bz2", "rb") as f:
    for line in f:
        pass  # process each line
```
I found that custom buffering along these lines gives at least a 2x speedup in reading.
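Roughly, this is the idea (a minimal sketch; the 1 MB block size and the helper name `iter_lines_chunked` are mine, chosen for illustration, and note it yields lines without the trailing newline): read the decompressed stream in large blocks and split the lines manually.

```python
import bz2

def iter_lines_chunked(path, block_size=1024 * 1024):
    """Yield lines from a bz2 file, reading the decompressed
    stream in large blocks instead of line by line."""
    with bz2.open(path, "rb") as f:
        leftover = b""
        while True:
            block = f.read(block_size)
            if not block:
                break
            lines = (leftover + block).split(b"\n")
            leftover = lines.pop()  # last piece may be an incomplete line
            yield from lines  # note: lines come back without b"\n"
        if leftover:
            yield leftover

for line in iter_lines_chunked(r"F:\test.bz2"):
    pass  # process each line
```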
It seems strange that a “read by 1 MB blocks” approach improves anything, because Lib/bz2.py already wraps its decompressor in a BufferedReader. Shouldn’t io.BufferedReader handle this for us?
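For reference, this is what I mean (a sketch only: `_buffer` is a private CPython implementation detail and may differ on other versions or implementations):

```python
import bz2
import io

with bz2.open(r"F:\test.bz2", "rb") as f:
    print(type(f))  # <class 'bz2.BZ2File'>
    # In CPython, BZ2File wraps its decompressor in an io.BufferedReader:
    print(isinstance(f._buffer, io.BufferedReader))  # True on CPython 3.x
```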
I really think there is still room for a massive improvement because, during my tests on a 10 GB bz2 file:
- the time to decompress the whole file to `test.tmp` + the time to process `test.tmp` line by line

is far smaller than:

- the time to iterate directly over the lines of the bz2 file.
That shouldn’t be the case, should it?
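To make the comparison concrete, here is a sketch of the kind of benchmark I mean (the paths, block size, and timing method are just for illustration):

```python
import bz2
import shutil
import time

PATH = r"F:\test.bz2"

# Approach 1: decompress everything to test.tmp, then process it line by line.
t0 = time.perf_counter()
with bz2.open(PATH, "rb") as src, open("test.tmp", "wb") as dst:
    shutil.copyfileobj(src, dst, 1024 * 1024)  # copy in 1 MB blocks
with open("test.tmp", "rb") as f:
    for line in f:
        pass  # process each line
print("decompress + process:", time.perf_counter() - t0)

# Approach 2: iterate directly over the lines of the bz2 file.
t0 = time.perf_counter()
with bz2.open(PATH, "rb") as f:
    for line in f:
        pass  # process each line
print("direct iteration:", time.perf_counter() - t0)
```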