Multithreaded gzip reading and writing

It is possible to write a gzip file with multiple threads. The pigz implementation by Mark Adler contains comments on how to do so efficiently.
Gzip reading is limited to a single thread (although the CRC could be checked in a separate thread).

Since CPython’s zlib module releases the GIL while (de)compressing, multiple threads could be used to compress gzip files on multiple cores. For decompression, the work could be moved off the main Python thread onto a separate core. In both cases the main thread is freed to get actual work done in Python.
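As a rough sketch of the idea (this is not how pigz organizes its output stream, just an illustration that zlib work can run on multiple cores from Python threads), independent chunks can be compressed as separate gzip members and concatenated:

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

def compress_chunks(chunks, max_workers=4):
    # zlib releases the GIL while compressing, so each chunk can be
    # compressed on its own core. Each result is a complete gzip member;
    # concatenated members are still a valid gzip stream (RFC 1952).
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return b"".join(pool.map(gzip.compress, chunks))

chunks = [b"ACGT" * 100_000, b"TTGA" * 100_000]
blob = compress_chunks(chunks)
assert gzip.decompress(blob) == b"".join(chunks)
```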

Currently this is possible using the xopen library, but it escapes the GIL by spawning a subprocess that runs pigz (or igzip) for reading and writing. It would be much more elegant to have a threads keyword-only argument to gzip.open that allows this behaviour.
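For comparison, here is roughly what the current xopen approach and the proposed stdlib API would look like (the gzip.open call with threads is hypothetical and does not exist today; see the xopen documentation for its exact signature):

```python
from xopen import xopen

# Today: xopen pipes the data through a pigz (or igzip) subprocess when
# one is available, using the requested number of compression threads.
with xopen("reads.fastq.gz", mode="wb", threads=4) as f:
    f.write(b"@read1\nACGT\n+\nFFFF\n")

# Proposed (hypothetical, not part of CPython): the same behaviour via a
# keyword-only argument on gzip.open.
# with gzip.open("reads.fastq.gz", "wb", threads=4) as f:
#     f.write(b"@read1\nACGT\n+\nFFFF\n")
```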

2 Likes

It sounds to me that this would be an excellent feature for a 3rd party extension module, but it doesn’t feel like it belongs in the stdlib – it feels too much like an extreme corner case, and likely won’t work on all platforms.

Also keep in mind that even if it is an excellent idea, that doesn’t mean it’s going to be implemented by the core development team – this looks like a big coding project that requires someone really dedicated to making it work.

5 Likes

An analogue for JSON would be a simple stdlib json module and extremely fast, multithreaded, complex, C++-based third-party packages like pysimdjson. Great idea, but probably not for stdlib.

It sounds to me that this would be an excellent feature for a 3rd party extension module, but it doesn’t feel like it belongs in the stdlib – it feels too much like an extreme corner case,

An analogue for JSON would be a simple stdlib json module and extremely fast, multithreaded, complex, C++-based third-party packages like pysimdjson. Great idea, but probably not for stdlib.

Thanks for this feedback. In my field (bioinformatics) zlib-compressed formats with multiple files larger than a gigabyte are pretty common, but that indeed distorts my view of “normal” workloads.

Also keep in mind that even if it is an excellent idea, that doesn’t mean it’s going to be implemented by the core development team – this looks like a big coding project that requires someone really dedicated to making it work.

That would be me. I already maintain Python bindings for ISA-L and zlib-ng that allow for faster compression and decompression. My plan of attack would be to implement it in those libraries first and then backport the code to the Python stdlib. (All performance improvements I have found so far I have backported to CPython.)
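For context, the existing bindings already offer drop-in replacements for the gzip module; usage looks roughly like this (module names as published in python-isal and python-zlib-ng):

```python
from isal import igzip        # ISA-L backed drop-in for gzip
from zlib_ng import gzip_ng   # zlib-ng backed drop-in for gzip

# Same interface as gzip.open, only with faster (de)compression.
with igzip.open("sample.vcf.gz", "rt") as f:
    header = f.readline()

with gzip_ng.open("copy.vcf.gz", "wt") as f:
    f.write(header)
```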

I am glad that I pitched it here first though, because not having to worry about stdlib compatibility gives me more freedom in how I implement it in the extension modules. Thank you both!

4 Likes

I can imagine cases where threading would be slower.

Such as a web framework using gzip encoding to send an HTML page. The gains would probably be eaten up by the overhead of setting up the threads.

1 Like

I agree with that. The default would be to use no threads. In the end I think that allowing threaded behaviour is indeed too specialized for CPython.

For those interested, multithreaded gzip reading and writing implementations have been added to python-isal and python-zlib-ng.
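A minimal usage sketch (check the project READMEs for the exact module names and signatures):

```python
from isal import igzip_threaded
from zlib_ng import gzip_ng_threaded

# Compress with multiple worker threads behind a gzip.open-like interface.
with igzip_threaded.open("big.fastq.gz", "wb", threads=4) as f:
    f.write(b"@read1\nACGT\n+\nFFFF\n")

# Read with the decompression happening in a background thread, freeing
# the main thread for other Python work.
with gzip_ng_threaded.open("big.fastq.gz", "rb", threads=1) as f:
    data = f.read()
```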

In order to enable this, I had to rewrite the gzip reader implementation. This change also reduced decompression overhead significantly for single-threaded applications. I have proposed backporting this change into CPython: Rewrite gzip._GzipReader in C for more performance and less overhead · Issue #110283 · python/cpython · GitHub

5 Likes