gzip.py: allow deterministic compression (without timestamp)

Is this the right forum for concerns about the standard library?

gzip compression, using class GzipFile from gzip.py, by default
inserts a timestamp into the compressed stream. If the optional
argument mtime is absent or None, then the current time is used [1].

This makes outputs non-deterministic, which can badly confuse
unsuspecting users: If you run “diff” over two outputs to see
whether they are unaffected by changes in your application,
then you would not expect that the *.gz binaries differ just
because they were created at different times.

I’d propose introducing a new constant NO_TIMESTAMP as a
possible value of mtime.

Furthermore, if policy about API changes allows, I’d suggest
that NO_TIMESTAMP become the new default value for mtime.

How should I proceed from here? Is this the kind of proposal that
has to go through a PEP?

- Joachim

[1] cpython/gzip.py at 6f1e8ccffa5b1272a36a35405d3c4e4bbba0c082 · python/cpython · GitHub

Could most of the benefit not be achieved by simply adding an explanation to the documentation, suggesting that if you need deterministic output you should pass mtime=0?

I have no opinion on the question of changing the default, as I’ve never needed gzip output to be deterministic.
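For illustration, here is a minimal sketch of the mtime=0 suggestion. The helper name gzip_bytes and the sample payload are made up for this example; only GzipFile and its mtime argument come from the standard library.

```python
import gzip
import io


def gzip_bytes(data, mtime=None):
    """Compress data in memory (gzip_bytes is a hypothetical helper name).

    mtime=None lets GzipFile fall back to the current time;
    an explicit integer fixes the timestamp written to the header.
    """
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=mtime) as f:
        f.write(data)
    return buf.getvalue()


payload = b"hello, world\n" * 100

# With a fixed mtime, two runs over the same input produce identical bytes.
first = gzip_bytes(payload, mtime=0)
second = gzip_bytes(payload, mtime=0)
assert first == second

# Different timestamps change the output even though the payload is the same.
assert gzip_bytes(payload, mtime=1) != gzip_bytes(payload, mtime=2)

# The payload still round-trips.
assert gzip.decompress(first) == payload
```

Passing an in-memory BytesIO also keeps a file name out of the header, which would otherwise be another source of differences between outputs.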

Thanks for mtime=0. I wasn’t aware of that possibility. Does it prevent the timestamp from being written, or does it expand to a constant (incorrect, but almost always inconsequential) date? Epoch = 1970?

According to the docs here, “All gzip compressed streams are required to contain this timestamp field”. So it inserts a timestamp of 0, because you can’t omit the timestamp altogether.

Indeed, this conforms with the gzip specification: “MTIME = 0 means no time stamp is available.”
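To check what actually ends up in the stream, one can read the MTIME field back out of the header. The sketch below assumes the header layout from the gzip specification (two magic bytes, compression method, flags, then a 4-byte little-endian MTIME); the helper names header_mtime and make_gzip are invented for this example.

```python
import gzip
import io
import struct


def make_gzip(data, mtime):
    """Compress data in memory with an explicit header timestamp."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=mtime) as f:
        f.write(data)
    return buf.getvalue()


def header_mtime(blob):
    """Return the MTIME field of a gzip stream (hypothetical helper).

    Assumed layout: bytes 0-1 magic, byte 2 method, byte 3 flags,
    bytes 4-7 MTIME as an unsigned little-endian 32-bit integer.
    """
    if blob[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    (mtime,) = struct.unpack_from("<I", blob, 4)
    return mtime


# mtime=0 is stored as four zero bytes: the field is present, but it
# signals "no time stamp available" rather than a real date.
assert header_mtime(make_gzip(b"data", mtime=0)) == 0

# Any other value is stored verbatim as seconds since the epoch.
assert header_mtime(make_gzip(b"data", mtime=1234567890)) == 1234567890
```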