I would like to compress multiple NDJSON files into a single zstd file. For random seeking later on, I would like each file to be compressed independently as its own frame.
- With the Python binding `python-zstandard`, I can use its `copy_stream`:
```python
import zstandard as zstd
from pathlib import Path

file_to_compress = [r"E:\Personal Projects\tmp\chunk_0.ndjson",
                    r"E:\Personal Projects\tmp\chunk_0.ndjson"]
file_to_compress = [Path(p) for p in file_to_compress]
output_file = Path(r"E:\Personal Projects\tmp\dataset.zst")

cctx = zstd.ZstdCompressor(write_content_size=True,
                           write_checksum=True,
                           threads=5)
with open(output_file, "wb") as f_out:
    for src in file_to_compress:
        file_size = src.stat().st_size
        with open(src, "rb") as f_in:
            cctx.copy_stream(f_in, f_out, size=file_size)
```
- With the standard library module `compression.zstd` (new in Python 3.14), I can do the following:
```python
from compression import zstd
from pathlib import Path

file_to_compress = [r"E:\Personal Projects\tmp\chunk_0.ndjson",
                    r"E:\Personal Projects\tmp\chunk_0.ndjson"]
file_to_compress = [Path(p) for p in file_to_compress]
output_file = Path(r"E:\Personal Projects\tmp\dataset.zst")

options = {
    zstd.CompressionParameter.nb_workers: 10,
    zstd.CompressionParameter.content_size_flag: True,
    zstd.CompressionParameter.checksum_flag: True,
}
cctx = zstd.ZstdCompressor(options=options)
with open(output_file, "wb") as f_out:
    for src in file_to_compress:
        file_size = src.stat().st_size
        data = src.read_bytes()
        cctx.set_pledged_input_size(file_size)
        compressed_data = cctx.compress(data, mode=zstd.ZstdCompressor.FLUSH_FRAME)
        f_out.write(compressed_data)
```
One problem remains: in the second snippet with `compression.zstd`, `read_bytes()` loads each whole file into memory, which can cause an out-of-memory (OOM) error for large files.
How can we avoid the OOM problem while still writing one independent frame per file?
Hi @emmatyping, could you have a look at this question? Thank you for your help.