It is very nice that Python 3.14 added Zstandard to the standard library (see the compression.zstd package documentation). I can use it with this rudimentary syntax:
from compression import zstd
from pathlib import Path
import shutil
outDir = r"E:\Personal Projects\tmp"
outDir = Path(outDir)
inTar = outDir / "chunk_0.tar"
zstdDir = outDir / "chunk_0.tar.zst"
with open(inTar, 'rb') as f:
    with zstd.open(zstdDir, 'wb') as g:
        shutil.copyfileobj(f, g)
Could you explain how to use it to compress a file in streaming and multi-threaded mode?
That way, we can utilize modern multi-core hardware to compress a file that does not fit in memory. Unfortunately, the documentation does not contain any examples of my use case.
from compression import zstd
from pathlib import Path
from multiprocessing import cpu_count
import shutil
outDir = r"E:\Personal Projects\tmp"
outDir = Path(outDir)
inTar = outDir / "chunk_0.tar"
zstdDir = outDir / "chunk_0.tar.zst"
# here I am using the number of processors on your device, but you may
# want to limit this to 4 or 8 workers depending on your use case.
compressionOptions = {
    zstd.CompressionParameter.nb_workers: cpu_count()
}
with open(inTar, 'rb') as f:
    with zstd.open(zstdDir, 'wb', options=compressionOptions) as g:
        shutil.copyfileobj(f, g)
Note that Zstandard sometimes does not support multi-threaded compression, depending on how the library was compiled. If you expect this code to run in multiple environments, you may wish to add error handling for the resulting ZstdError.
I expect people will want multi-threaded compression frequently enough that we should add an example to the docs. Maybe we can just add the nb_workers flag to the last example.
Thank you very much for your answer, which works very well. Does zstd.open(zstdDir, 'wb', options=compressionOptions) automatically handle bigger-than-memory compression?
shutil.copyfileobj takes a length parameter, which you can use to set the size of the buffer used to copy data. The documentation also notes that “by default the data is read in chunks to avoid uncontrolled memory consumption”.
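To make the chunking concrete, here is a minimal sketch using in-memory file objects and an explicit buffer size (the 1 MiB chunk size is only an illustration):

```python
import shutil
from io import BytesIO

CHUNK = 1024 * 1024  # copy 1 MiB at a time

# Stand-in for a large input file; with real files opened in binary
# mode, only one CHUNK-sized buffer is held in memory at a time.
src = BytesIO(b"x" * (3 * CHUNK + 123))
dst = BytesIO()
shutil.copyfileobj(src, dst, length=CHUNK)
```

So the with ... copyfileobj pattern above already streams the data: neither the whole .tar nor the compressed output needs to fit in memory at once.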