Multiframe ZSTD file: how to jump to and stream the second file?

I compressed two NDJSON files into a single multi-frame .zst file, with each NDJSON file stored as its own frame. I have the following metadata for the .zst file, stored as a list called meta_data:

import zstandard as zstd
from pathlib import Path

input_file  = Path(r"E:\Personal projects\tmp\test.zst")

meta_data = [{'name'                : 'chunk_0.ndjson',
              'uncompressed_size'   : 2147473321,
              'compressed_offset'   : 0,
              'uncompressed_offset' : 0,
              'compressed_size'     : 175631248},
             {'name'                : 'chunk_1.ndjson',
              'uncompressed_size'   : 2147473321,
              'compressed_offset'   : 175631248,
              'uncompressed_offset' : 2147473321,
              'compressed_size'     : 175631248}]

In Python, how can I use meta_data to seek to chunk_1.ndjson, start decompressing there, and stream it line by line? That way, I would not need to

  • decompress chunk_0.ndjson,
  • load the whole compressed chunk_1.ndjson into memory.

Thank you for your help.

Bit of a hack, but if the frames are stored in that order and zstd (or whichever decompression library you use) accepts any file object, open the file in binary mode and call f.seek with the sum of the compressed_size values of all the frames preceding the one you want. Your metadata already stores that sum as compressed_offset. Check whether the file format has a header, inter-frame spacer bytes, or some other wrapping whose length needs to be added to the offset too.