Get uncompressed file size without reading it?

smontanaro · May 7, 2024, 9:50pm

I don’t suppose any of the various file compression modules can tell me the uncompressed file size without reading the file’s content? I’m sure anything like that would need to be supported by the underlying compression scheme, stashed somewhere in the file header or footer. I’m thinking specifically of the gzip, lzma and bz2 modules. The application need is to provide progress feedback. os.path.getsize() doesn’t cut it for compressed files.

kknechtel · May 7, 2024, 10:22pm

It would, yes, since otherwise the uncompressed size has to be calculated by a process that is not really any easier than actually decompressing it. (I guess a really clever interface could allow decompression into an output stream, coupled with an output stream that counts what it receives without storing it - to save memory.)
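
A rough sketch of that counting idea, assuming a gzip file and a hypothetical helper name `count_uncompressed_bytes`. It still pays the full decompression cost; only the memory is saved, because each chunk is discarded once it has been counted:

```python
import gzip


def count_uncompressed_bytes(path, chunk_size=1024 * 1024):
    """Decompress `path` chunk by chunk and return the total size.

    The decompressed data is thrown away as soon as it is counted,
    so memory use stays around `chunk_size` bytes.
    """
    total = 0
    with gzip.open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += len(chunk)
    return total
```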

As far as I can tell, nothing in the standard library wraps up this functionality for you.

The gzip format stores this size at the end of the file, so it should be easy enough to seek there and grab it.
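
For illustration, a minimal sketch of reading that trailer field. In RFC 1952 it is called ISIZE: the last four bytes of a gzip member, holding the uncompressed size modulo 2**32 as a little-endian integer. The helper name `gzip_isize` is made up here. The result is only trustworthy for a single-member file smaller than 4 GiB:

```python
import os
import struct


def gzip_isize(path):
    """Return the ISIZE trailer field of a gzip file.

    This is the uncompressed size of the *last* member, modulo 2**32,
    so it is wrong for multi-member files and for data of 4 GiB or more.
    """
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)  # ISIZE is the final 4 bytes
        (isize,) = struct.unpack("<I", f.read(4))
    return isize
```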

LZMA is really just the algorithm, and there seem to be a variety of different containers. The standard library supports a legacy .lzma format that seems to have a 13-byte header, wherein the last 8 bytes are the uncompressed size (note that this site looks to me like it’s hosting AI-generated content), and .xz (yeah, that one). The latter apparently stores a separate optional uncompressed size for each block, and then also stores an “index” for each “stream” which contains a sequence of records corresponding to blocks, each of which includes an uncompressed size for the block (no longer optional). Also it looks like those size values are encoded as variable-length integers. And of course an .xz file can contain multiple streams. So there is quite a bit of work to do there.
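
A sketch of the legacy .lzma case only, assuming the 13-byte layout described above (1 properties byte, a 4-byte dictionary size, then an 8-byte little-endian uncompressed size). The helper name `lzma_alone_size` is hypothetical. Some encoders write an all-ones "unknown" value in the size field and rely on an end marker instead, in which case this returns None. It does not attempt .xz at all:

```python
import struct

# All-ones size field means "unknown; look for the end marker instead".
UNKNOWN_SIZE = 0xFFFFFFFFFFFFFFFF


def lzma_alone_size(path):
    """Return the uncompressed size from a legacy .lzma header, or None."""
    with open(path, "rb") as f:
        header = f.read(13)
    if len(header) < 13:
        raise ValueError("too short to be a legacy .lzma file")
    # "<" disables padding: 1 + 4 + 8 = 13 bytes, little-endian.
    _props, _dict_size, size = struct.unpack("<BIQ", header)
    return None if size == UNKNOWN_SIZE else size
```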

Bzip2 is reverse engineered, not formally specified. As far as I can tell, it doesn’t store such metadata at all.

cameron · May 7, 2024, 10:31pm

> I don’t suppose any of the various file compression modules can tell me
> the uncompressed file size without reading the file’s content? I’m sure
> anything like that would need to be supported by the underlying
> compression scheme, stashed somewhere in the file header or footer.
> I’m thinking specifically of the gzip, lzma and bz2 modules.

Not as far as I know. But I could be wrong.

> The application need is to provide progress feedback.
> os.path.getsize() doesn’t cut it for compressed files.

In terms of the file uncompressed size, no. But since progress bars are
for human consumption, you can get a pretty good bar by presenting the
progress through the compressed data as you feed it to the decompressor.
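
A minimal sketch of that approach for gzip, assuming you drive the decompressor yourself with `zlib.decompressobj` (`wbits=31` selects the gzip container). The generator name `decompress_with_progress` is made up. It restarts the decompressor whenever one member ends, so appended members are handled, but it does not cope with trailing padding the way the gzip module does:

```python
import os
import zlib


def decompress_with_progress(path, chunk_size=64 * 1024):
    """Yield (decompressed_bytes, fraction_of_compressed_file_consumed)."""
    total = os.path.getsize(path)  # size of the *compressed* file
    consumed = 0
    decomp = zlib.decompressobj(wbits=31)
    with open(path, "rb") as raw:
        while True:
            chunk = raw.read(chunk_size)
            if not chunk:
                break
            consumed += len(chunk)
            data = chunk
            while data:
                if decomp.eof:
                    # Previous member finished: start a fresh decompressor.
                    decomp = zlib.decompressobj(wbits=31)
                out = decomp.decompress(data)
                # Bytes past the end of a member belong to the next one.
                data = decomp.unused_data if decomp.eof else b""
                yield out, consumed / total
```

The fraction tracks compressed bytes read, not uncompressed bytes produced, so the bar moves unevenly if the compression ratio varies through the file.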

smontanaro · May 7, 2024, 11:37pm

Good point. At the moment I’m just using gzip.open() and similar front-end functions, not reading the raw compressed data and feeding it to the decompressor.

cameron · May 8, 2024, 12:14am

Except that a gzip file can contain multiple gzipped things appended
to each other. This is legal:

 gzip filename
 gzip <filename2 >>filename.gz

Seeking to the end would only get the info for filename2.

The above isn’t an archive format, it is just compressed data. The
output of gunzip < filename.gz is what you’d get from:

 cat filename filename2

before filename was replaced.
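
The same thing can be reproduced in Python, as a small illustration with made-up sample data. The trailer of the concatenated data reports only the size of the last member, while decompressing gives both:

```python
import gzip
import struct

first = b"a" * 1000
second = b"b" * 10

# Equivalent to appending a second gzip stream to an existing .gz file.
blob = gzip.compress(first) + gzip.compress(second)

# Last 4 bytes: little-endian uncompressed size of the final member only.
(trailer_size,) = struct.unpack("<I", blob[-4:])

# gzip.decompress() walks through every member in turn.
total = len(gzip.decompress(blob))

print(trailer_size, total)  # prints: 10 1010
```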

BTW, this is what makes saving new messages to a gzipped mbox file
feasible - you can just append!

kknechtel · May 8, 2024, 2:12am

Ah, fair enough. And it seems not to record the compressed size for the individual pieces, either - so you can’t just seek out the end of each piece. (I assume the uncompressed size is at the end because back in the day you wouldn’t care about that information until after decompressing.) How obnoxious.

blhsing · May 8, 2024, 9:06am

You can pass gzip.open a file wrapper instead of a filename.

The file wrapper in this StackOverflow answer should do:
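
The linked answer is not reproduced here. As a stand-in, a minimal sketch of such a wrapper, assuming a hypothetical `CountingFile` class (this is not the code from that answer). It subclasses `io.FileIO` and reports the position in the compressed file after every read, and `gzip.open()` accepts it because it is a file object:

```python
import gzip
import io
import os


class CountingFile(io.FileIO):
    """Binary file that reports its read position through a callback."""

    def __init__(self, path, callback):
        super().__init__(path, "rb")
        self._total = os.path.getsize(path)
        self._callback = callback  # called as callback(position, total)

    def read(self, size=-1):
        data = super().read(size)
        self._callback(self.tell(), self._total)
        return data


def read_gzip_with_progress(path, callback, chunk_size=64 * 1024):
    """Read all of a gzip file, calling `callback` as compressed bytes are consumed."""
    chunks = []
    with CountingFile(path, callback) as raw, gzip.open(raw, "rb") as gz:
        while chunk := gz.read(chunk_size):
            chunks.append(chunk)
    return b"".join(chunks)
```

Something like `read_gzip_with_progress("data.gz", lambda pos, total: print(f"{pos}/{total}"))` would print the progress through the compressed file. bz2.open() and lzma.open() also accept file objects, so the same wrapper should work there, though that is untested here.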