Stdlib lzma module: expose MT stream en/decoder?

mistotebe · October 20, 2025, 4:02pm

Hi, had a few conversations during PyCon UK and it came up that there were capabilities in the MT stream encoder/decoder that were not available in the default implementation already exposed. While drafting a proposed implementation I also spoke with @hugovk about whether this is a good idea (threads) and he suggested I open a discussion thread here as well as a PR. And it looks like a PR needs an issue needs a discussion anyway?

So you ask, why would one want to do this?

* it does speed things up (you have to explicitly ask for it)
* this is the encoder the xz binary actually defaults to
* you have control over things like block size, and the stream encoder makes different decisions about how to encode stuff. This is important for some people (see Friday’s PyCon lightning talks for an interested user who wants to byte-for-byte recreate an existing archive: https://youtu.be/CouUftzuQVQ?t=2327)
* the MT encoder even sets different flags in the header, it always encodes the orig/compressed size, … (again for those who care about being able to generate files indistinguishable from xz et al.)
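To make the header difference concrete, here is a minimal sketch that reads the Block Flags byte of the first block in an .xz stream, following the layout in the .xz file format specification (12-byte stream header, then a block header whose second byte holds the flags). The helper name `first_block_flags` is made up for illustration and is not part of the `lzma` module or the draft branch.

```python
import lzma

XZ_MAGIC = b"\xfd7zXZ\x00"
STREAM_HEADER_SIZE = 12  # magic (6) + stream flags (2) + CRC32 (4)


def first_block_flags(xz_bytes: bytes) -> dict:
    """Decode the Block Flags byte of the first block in an .xz stream.

    Hypothetical helper for illustration only.
    """
    if xz_bytes[:6] != XZ_MAGIC:
        raise ValueError("not an .xz stream")
    size_byte = xz_bytes[STREAM_HEADER_SIZE]
    if size_byte == 0:
        # 0x00 at this position is the Index Indicator: no blocks present.
        raise ValueError("stream contains no blocks")
    flags = xz_bytes[STREAM_HEADER_SIZE + 1]
    return {
        "header_size": (size_byte + 1) * 4,
        "filter_count": (flags & 0x03) + 1,
        "has_compressed_size": bool(flags & 0x40),
        "has_uncompressed_size": bool(flags & 0x80),
    }


# Output of the existing single-threaded encoder exposed by the stdlib.
info = first_block_flags(lzma.compress(b"hello world" * 1000))
print(info)
```

With the encoder the stdlib exposes today, both size flags should come back unset. Running the same helper on a file written by a recent `xz` binary is one way to see the difference described above.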

The draft code is here, I assume the documentation is not up to scratch yet: Comparing python:main...mistotebe:lzma_mt · python/cpython · GitHub

moreati · October 20, 2025, 6:25pm

I’m the speaker in said PyCon UK lightning talk. My thanks to Ondřej for doing this, I’ll be trying it this week.

I’ve since created and released moreati/lzma-cf on GitHub ("Multithreaded LZMA & XZ compression on Python for control freaks"), based on PyPy’s CFFI bindings. The API differs, but I expect performance, memory usage, and output will be comparable. There are Linux, macOS & Windows wheels for anyone wishing to try it.

gpshead · October 20, 2025, 10:09pm

Just taking a quick glance at this, I suggest going ahead and making a github issue and PR. It looks straightforward enough and maintainable.

People who want things today or want a place to continually evolve features are encouraged to use things like moreati’s PyPI package though.

emmatyping · October 20, 2025, 10:18pm

Hi Ondřej! Thanks for working on this!

+1 to opening a PR, there is already an issue actually!

github.com/python/cpython

LZMA MultiThreading XZ compression support

opened 11:13AM - 03 Feb 24 UTC

mkomet

type-feature

# Feature or enhancement

### Proposal:

```python
import lzma

data = b'8'…*(2 << 30)

# Current API, using single-threaded pool in liblzma,
# using `lzma_easy_encoder` / `lzma_stream_encoder`
lzma.compress(data)

# Compress using XZ underlying 4 background threads, using `lzma_stream_encoder_mt`
lzma.compress(data, threads=4)

# Use thread pool based on nproc (`lzma_cputhreads`)
lzma.compress(data, threads=0)

# Support in `LZMAFile` class
with LZMAFile(BytesIO(data), threads=4) as f:
    pass

# Throws ValueError for negative `threads`
lzma.compress(data, threads=-1)
```

### Notes:

* This will extend the current API in `lzma.py`, by adding `threads=1`
* `threads` is a suggestion to the underlying liblzma invocations, and is a hardcap for the number of background threads to use. Some presets might still use less threads (the more "aggressive" compression presets).
* Using more background `threads` usually will cause deterioration in compression ratio, but will yield better performance time-wise.

### Has this already been discussed elsewhere?

No response given

### Links to previous discussion of this feature:

https://discuss.python.org/t/multi-threaded-lzma/26708/3

### Linked PRs

* gh-114954

That approach was slightly different, as it didn’t expose all of the lzma_mt options. I think it should be fine to do so however.
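For a rough feel of the ratio-versus-speed trade-off mentioned in the issue's notes, the sketch below works with today's stdlib. It is explicitly not `lzma_stream_encoder_mt`: it compresses fixed-size chunks in a thread pool and concatenates the resulting independent .xz streams, whereas the MT encoder writes multiple blocks inside one stream. The function name `compress_chunked` and its parameters are assumptions for illustration. It relies on `lzma.decompress()` accepting concatenated streams and on CPython releasing the GIL while liblzma runs.

```python
import lzma
from concurrent.futures import ThreadPoolExecutor


def compress_chunked(data: bytes, chunk_size: int = 1 << 20, workers: int = 4) -> bytes:
    """Compress chunks in parallel and concatenate the resulting .xz streams.

    Hypothetical helper: a stand-in technique, not the liblzma MT encoder.
    """
    if workers < 1:
        raise ValueError("workers must be >= 1")
    # Fall back to a single empty chunk so empty input still yields a stream.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so streams line up with the chunks.
        streams = list(pool.map(lzma.compress, chunks))
    return b"".join(streams)


data = bytes(range(256)) * 10_000  # about 2.5 MB of repetitive sample data
single = lzma.compress(data)
chunked = compress_chunked(data)

# lzma.decompress() handles a concatenation of several streams.
assert lzma.decompress(chunked) == data
print(len(single), len(chunked))
```

Each chunk starts with an empty dictionary and carries its own stream header, footer, and index, so the chunked output is typically somewhat larger than the single-stream result. The effect is similar in spirit to splitting the input into blocks, but the bytes on disk will not match what `xz` produces.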

moreati · October 20, 2025, 10:54pm

compression.zstd.CompressionParameter and the options parameter of compression.zstd.compress() et al. might serve as a model for the design of the API. From the examples:

from compression import zstd

options = {
    zstd.CompressionParameter.checksum_flag: 1
}
with zstd.open("file.zst", "w", options=options) as f:
    f.write(b"Mind if I squeeze in?")

emmatyping · October 27, 2025, 12:45am

FWIW, I think a dictionary-based approach is more consistent with the way users currently supply filters in lzma. While I like the CompressionParameter design, I think it is more important to be consistent within the module.
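For reference, this is how filter chains are passed to `lzma` today: a list of plain dicts keyed by strings such as `"id"`, `"dist"`, and `"preset"`. The `mt_options` dict at the end is purely hypothetical. It only illustrates what a dictionary-based spelling of the MT settings could look like and is not taken from the draft branch or any released API.

```python
import lzma

# Existing stdlib convention: each filter is a dict with an "id" plus options.
filters = [
    {"id": lzma.FILTER_DELTA, "dist": 4},
    {"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": 1 << 20},
]

data = b"Mind if I squeeze in?" * 100
packed = lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)

# The .xz container records the filter chain, so decompression needs no options.
assert lzma.decompress(packed) == data

# HYPOTHETICAL: one possible dict shape for MT settings, in the same style.
# The key names are assumptions and nothing in the stdlib accepts this dict.
mt_options = {"threads": 4, "block_size": 1 << 20}
print(len(data), len(packed), mt_options)
```

Keeping the same dict style would let MT settings sit next to the filter chain without adding a new enum type to the module.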