Stdlib lzma module: expose MT stream en/decoder?

Hi, I had a few conversations during PyCon UK, and it came up that the MT (multithreaded) stream encoder/decoder has capabilities that are not available in the default implementation the module already exposes. While drafting a proposed implementation, I also asked @hugovk whether this (threads) is a good idea, and he suggested I open a discussion thread here as well as a PR. And it looks like a PR needs an issue, which in turn needs a discussion anyway?

So, you ask, why would one want to do this?

  • it does speed things up (though you have to ask for the threads explicitly)
  • this is the encoder the xz binary actually defaults to
  • you have control over things like block size, and the stream encoder makes different decisions about how to encode data, which is important for some people (see Friday’s PyCon lightning talks for an interested user who wants to be able to recreate an existing archive byte for byte: https://youtu.be/CouUftzuQVQ?t=2327)
  • the MT encoder even sets different flags in the header and always encodes the original/compressed size, … (again, for those who care about being able to generate files indistinguishable from xz et al.)
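For comparison, here is a minimal sketch of what can be done today with only the existing API. It is not the MT stream encoder: it compresses fixed-size chunks in a thread pool and emits one independent .xz stream per chunk, whereas the MT encoder writes a single stream made of several blocks, so the output will not match what xz produces. The helper name `compress_chunked` and its `chunk_size` and `workers` parameters are made up for this illustration.

```python
import lzma
from concurrent.futures import ThreadPoolExecutor


def compress_chunked(data: bytes, chunk_size: int = 1 << 20, workers: int = 4) -> bytes:
    """Hypothetical helper: compress chunks in parallel, one .xz stream per chunk."""
    # Split the input into fixed-size chunks; keep one empty chunk for empty input
    # so that the result is still a valid .xz file.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the streams concatenate correctly.
        streams = pool.map(lzma.compress, chunks)
        return b"".join(streams)


payload = b"Mind if I squeeze in? " * 100_000
packed = compress_chunked(payload)

# lzma.decompress() accepts several concatenated streams and joins their output.
restored = lzma.decompress(packed)
```

Because every chunk becomes its own stream with its own header, footer, and index, this cannot reproduce an existing multi-block archive byte for byte, which is part of why exposing the real encoder is interesting.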

The draft code is here (I assume the documentation is not up to scratch yet): python:main...mistotebe:lzma_mt in python/cpython on GitHub

I’m the speaker in said PyCon UK lightning talk. My thanks to Ondřej for doing this, I’ll be trying it this week.

I’ve since created and released moreati/lzma-cf (on GitHub): multithreaded LZMA & XZ compression on Python for control freaks, based on PyPy’s CFFI bindings. The API differs, but I expect performance, memory usage, and output to be comparable. There are Linux, macOS & Windows wheels for anyone wishing to try it.

Just taking a quick glance at this, I suggest going ahead and making a GitHub issue and PR. It looks straightforward enough and maintainable.

People who want things today or want a place to continually evolve features are encouraged to use things like moreati’s PyPI package though.

Hi Ondřej! Thanks for working on this!

+1 to opening a PR; there is actually already an issue!

That approach was slightly different, as it didn’t expose all of the lzma_mt options. I think it should be fine to do so, however.

compression.zstd.CompressionParameter and the options parameter of compression.zstd.compress() et al. might serve as a model for the design of the API. From the examples:

from compression import zstd

options = {
   zstd.CompressionParameter.checksum_flag: 1
}
with zstd.open("file.zst", "w", options=options) as f:
    f.write(b"Mind if I squeeze in?")

FWIW, I think a dictionary-based approach is more consistent with the way users currently supply filters in lzma. While I like the CompressionParameter design, I think it is more important to be consistent within the module.
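To illustrate, this is how filters are supplied to lzma today: a filter chain is a list of plain dictionaries. The `mt_options` dictionary at the end is purely hypothetical. Its keys borrow field names from liblzma's lzma_mt struct and only sketch what a consistent, dictionary-based spelling might look like; it is not taken from the draft PR and is not passed to any function.

```python
import lzma

# Existing API: each filter in the chain is a dictionary with an "id" key
# plus filter-specific options.
filters = [
    {"id": lzma.FILTER_DELTA, "dist": 4},
    {"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": 1 << 20},
]

data = b"Mind if I squeeze in?" * 1000
packed = lzma.compress(data, format=lzma.FORMAT_XZ, check=lzma.CHECK_CRC64, filters=filters)
restored = lzma.decompress(packed)

# HYPOTHETICAL: what multithreading options could look like in the same style.
# The key names mirror lzma_mt fields; the stdlib accepts no such argument today.
mt_options = {"threads": 4, "block_size": 1 << 20, "timeout": 0}
```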
