`tarfile` vs `zipfile` timezone discrepancy in handling modification times

ncoghlan · August 16, 2024, 5:00am

I’m currently working on a project that involves creating reproducible tar and zip archives from various inputs (it’s the same project that prompted this open feature idea).

This thread isn’t about that idea; it’s about a much more specific discrepancy between tarfile and zipfile and their implicit assumptions about the way filesystems handle time zones.

Specifically:

tarfile reads the filesystem mtime value as a float from the stat result, then truncates it to an integer when writing it to the actual tar archive
zipfile reads the filesystem mtime from the stat result, converts it to a struct_time with time.localtime, and then encodes only the first 6 fields when writing the value to the zipfile (discarding the local timezone info entirely)
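As a rough illustration of the two behaviours just described, the sketch below stats one file through both modules. The helper name `describe_mtime_handling` and the example timestamp are made up for this illustration; `TarFile.gettarinfo` and `ZipInfo.from_file` are the existing stdlib entry points.

```python
import io
import os
import tarfile
import tempfile
import time
import zipfile


def describe_mtime_handling(path):
    """Return (tar_mtime, zip_date_time) roughly as each module would record them.

    Hypothetical helper, for illustration only.
    """
    # tarfile keeps the raw stat value on the TarInfo; int() mirrors the
    # truncation to whole seconds that happens when the header is written.
    with tarfile.open(fileobj=io.BytesIO(), mode="w") as tf:
        tar_mtime = int(tf.gettarinfo(path).mtime)
    # zipfile converts the stat value with time.localtime and keeps only
    # (year, month, day, hour, minute, second), with no timezone attached.
    zip_date_time = zipfile.ZipInfo.from_file(path).date_time
    return tar_mtime, zip_date_time


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w") as f:
        f.write("hello\n")
    demo_mtime = 1_700_000_000  # arbitrary example: 2023-11-14 22:13:20 UTC
    os.utime(demo_path, (demo_mtime, demo_mtime))
    tar_mtime, zip_date_time = describe_mtime_handling(demo_path)

print(tar_mtime)      # seconds since the epoch, independent of the local timezone
print(zip_date_time)  # wall-clock fields in whatever the local timezone is
```

Running this under different local timezone settings should leave the tar value unchanged while the zip tuple shifts.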

The way tarfile works is essentially assuming that the file mtime is in UTC, which is going to be a valid assumption for essentially every filesystem other than FAT or FAT32 (even NTFS stores timestamps in UTC, so this isn’t a Windows vs non-Windows discrepancy).

The way zipfile works presumably originates in the pre-NTFS Windows era, where file timestamps were genuinely stored in the local timezone.

The possible fix I’m considering is to just change zipfile to use time.gmtime instead of time.localtime when it converts the file mtime value to a struct_time (i.e. making the same assumption as tarfile, that the filesystem stores times in UTC, not the local timezone).
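To make the difference concrete, here is a minimal sketch of the two conversions side by side. The function `zip_date_time_fields` and its `assume_utc` flag are hypothetical names for this illustration, not anything zipfile exposes.

```python
import time


def zip_date_time_fields(mtime, assume_utc):
    """Return the six date/time fields a zip entry would record.

    Hypothetical helper: assume_utc=False mirrors the current
    time.localtime conversion, assume_utc=True the time.gmtime idea.
    """
    convert = time.gmtime if assume_utc else time.localtime
    return tuple(convert(mtime)[:6])


example_mtime = 1_700_000_000  # arbitrary example timestamp
utc_fields = zip_date_time_fields(example_mtime, assume_utc=True)
local_fields = zip_date_time_fields(example_mtime, assume_utc=False)
print(utc_fields)    # (2023, 11, 14, 22, 13, 20)
print(local_fields)  # depends on the machine's local timezone
```

The two tuples only agree when the local timezone is UTC, which is why a UTC CI machine and a local-time client machine record different entry timestamps today.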

For zipfiles generated on NTFS and other filesystems that use UTC timestamps, this would fix a subtle bug in the timestamps recorded when the local timezone is not UTC. That way archives produced on a modern Windows client system running in the user’s local timezone would get the same archive entry timestamps as those produced in a modern Windows CI environment running in UTC.

However, it would also introduce a corresponding bug if the filesystem really does store local times (such as FAT or FAT32).

Are there other downsides to making that change that I’m not seeing?

storchaka · August 16, 2024, 5:54am

What do other Windows implementations do?

storchaka · August 16, 2024, 6:00am

BTW, I have an unfinished patch to store date and time with higher precision and larger range. There are several extensions for this, and Windows implementations can already do this by default, so you should look in detail at what is written in the original date and time fields and what is written in the extra fields.

I guess I should revive it.

ncoghlan · August 16, 2024, 7:20am

The archive formats truncate the modification time resolution to a second at best anyway (I believe zipfile resolution isn’t even that good), so higher precision times elsewhere shouldn’t affect the archive formats.
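For reference, the basic zip date and time fields use the classic MS-DOS packed layout, which only has 5 bits for seconds and therefore a 2-second resolution. A small sketch of that packing (the function names are made up for this illustration):

```python
def pack_dos_datetime(date_time):
    """Pack (year, month, day, hour, minute, second) into MS-DOS date/time words."""
    year, month, day, hour, minute, second = date_time
    dosdate = (year - 1980) << 9 | month << 5 | day
    # Only 5 bits are available for seconds, so they are stored halved.
    dostime = hour << 11 | minute << 5 | (second // 2)
    return dosdate, dostime


def unpack_dos_datetime(dosdate, dostime):
    """Reverse pack_dos_datetime; odd seconds come back rounded down."""
    return (
        (dosdate >> 9) + 1980,
        (dosdate >> 5) & 0xF,
        dosdate & 0x1F,
        dostime >> 11,
        (dostime >> 5) & 0x3F,
        (dostime & 0x1F) * 2,
    )


round_tripped = unpack_dos_datetime(*pack_dos_datetime((2023, 11, 14, 22, 13, 21)))
print(round_tripped)  # (2023, 11, 14, 22, 13, 20)
```

The layout also explains the 1980 lower bound on zip entry years.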

For my original question, comparing with other implementations is a good idea.

The Windows native “Send to compressed folder” option looks like it produces a zipfile with local times. The “Host OS” metadata field is also reported by 7zip as “FAT” (the same as it is for Python), but cpython/Lib/zipfile.py at 3.11 · python/cpython · GitHub indicates that field is purely OS dependent in zipfile, whereas the use of time.localtime when reading filesystem timestamps is unconditional (so archives created on non-Windows systems will still use local timestamps, but will not be flagged as originating from a FAT style filesystem).

An archive created with 7-Zip is the same (storing local timestamps).

Given those examples, I’m withdrawing my original idea of suggesting switching to time.gmtime as the default behaviour, as storing local times is clearly a common convention across zipfile creation tools. Instead, I filed `zipfile` and `tarfile` docs should cover local timezone impact on entry timestamps · Issue #123059 · python/cpython · GitHub as a docs issue so we can provide some authoritative guidance on this topic.

After figuring out the full workaround for my current use case, I’ll consider whether or not to propose a standardised way to override zipfile’s timestamp processing (if the existing workaround is clean enough, it may not be worth making any changes to simplify it).

ncoghlan · August 16, 2024, 8:10am

My current workaround (for both this local time issue and for timestamp clamping) is to build the archive from a working directory and actively modify the file timestamps with os.utime.
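A minimal sketch of what such a workaround might look like follows. This is not the actual project code: the helper name `rewrite_mtime_for_utc_zip` and its `max_mtime` clamp parameter are assumptions made for illustration. The idea is to shift the file's mtime so that zipfile's time.localtime conversion yields the UTC wall-clock fields.

```python
import os
import tempfile
import time
import zipfile


def rewrite_mtime_for_utc_zip(path, max_mtime=None):
    """Shift a file's mtime so zipfile records its UTC wall-clock fields.

    Hypothetical helper. Local times that fall inside a DST transition
    may not survive the round trip exactly.
    """
    st = os.stat(path)
    mtime = int(st.st_mtime)
    if max_mtime is not None:
        mtime = min(mtime, max_mtime)  # optional clamp (assumed parameter)
    utc_fields = time.gmtime(mtime)
    # Reinterpret the UTC fields as a local time; tm_isdst=-1 asks mktime
    # to work out whether DST applies.
    shifted = time.mktime(utc_fields[:8] + (-1,))
    os.utime(path, (st.st_atime, shifted))
    return tuple(utc_fields[:6])


with tempfile.TemporaryDirectory() as tmp:
    work_path = os.path.join(tmp, "payload.txt")
    with open(work_path, "w") as f:
        f.write("payload\n")
    os.utime(work_path, (1_700_000_000, 1_700_000_000))
    expected_fields = rewrite_mtime_for_utc_zip(work_path)
    recorded_fields = zipfile.ZipInfo.from_file(work_path).date_time

print(expected_fields, recorded_fields)
```

This only makes sense on a scratch copy of the inputs, since it deliberately makes the filesystem timestamps wrong.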

Attempting to modify the ZipInfo objects themselves on the fly isn’t currently a nice option, since there’s no filter callback like the one offered by TarFile.add. Since ZipFile.write doesn’t accept ZipInfo objects as input, you have to choose between reimplementing that (e.g. in a ZipFile subclass) so you can still use shutil.copyfileobj for the data transfer, or else using ZipFile.writestr, which means loading the entire file into RAM rather than streaming it in chunks.

That means the simplest API improvement that could be made is to also accept ZipInfo objects in ZipFile.write, so timestamps can be customised by doing:

    zip_entry = ZipInfo.from_file(fs_path_to_add)
    zip_entry.date_time = _clamp_zip_mtime_as_utc(zip_entry.date_time)
    zf.write(zip_entry, arcname)
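With the current API, the reimplementation route mentioned above can be sketched without a subclass, since ZipFile.open in write mode already accepts a ZipInfo. The helper name `write_streamed` is made up for this illustration, and it only handles regular files, not directories.

```python
import os
import shutil
import tempfile
import zipfile


def write_streamed(zf, fs_path, arcname, date_time):
    """Stream a regular file into zf with a caller-chosen entry timestamp.

    Hypothetical helper, not part of zipfile.
    """
    zinfo = zipfile.ZipInfo.from_file(fs_path, arcname)
    zinfo.date_time = date_time
    # ZipInfo.from_file does not pick up the archive's default compression.
    zinfo.compress_type = zf.compression
    with open(fs_path, "rb") as src, zf.open(zinfo, "w") as dest:
        shutil.copyfileobj(src, dest)  # copies in chunks, not all at once
    return zinfo


with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "data.bin")
    with open(src_path, "wb") as f:
        f.write(b"x" * 1024)
    zip_path = os.path.join(tmp, "out.zip")
    fixed_date_time = (2023, 11, 14, 22, 13, 20)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        write_streamed(zf, src_path, "data.bin", fixed_date_time)
    with zipfile.ZipFile(zip_path) as zf:
        stored = zf.getinfo("data.bin")
        stored_date_time = stored.date_time
        stored_size = stored.file_size

print(stored_date_time, stored_size)
```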

A larger API enhancement project would be to add a ZipFile.add recursive inclusion function, along similar lines to TarFile.add (including a filter callback for customisation of ZipInfo entries).
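For comparison, this is roughly what the existing TarFile.add filter callback looks like when used to clamp timestamps. The `clamp_mtime` function and the `FIXED_MTIME` constant are illustrative names, not part of tarfile.

```python
import io
import os
import tarfile
import tempfile

FIXED_MTIME = 1_700_000_000  # arbitrary example clamp value


def clamp_mtime(tarinfo):
    """Filter callback: cap the entry mtime at FIXED_MTIME."""
    tarinfo.mtime = min(int(tarinfo.mtime), FIXED_MTIME)
    return tarinfo  # returning None instead would skip the entry


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "a.txt")
    with open(path, "w") as f:
        f.write("hi\n")
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        tf.add(path, arcname="a.txt", filter=clamp_mtime)
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r") as tf:
        stored_tar_mtime = tf.getmember("a.txt").mtime

print(stored_tar_mtime)
```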

barry-scott · August 16, 2024, 11:11am

That would potentially be a change in behavior that may break existing users of zipfile, right? For example, existing code may be expecting the local time to be stored.

ncoghlan · August 16, 2024, 12:18pm

Yeah, it would only be justifiable if the current behaviour could legitimately be considered a bug.

The fact other zip archiving tools behave the same way that Python does makes it clear this behaviour is a genuine difference in expected conventions between the two archive formats, so there’s no justification for changing it (but we can at least put some notes in the documentation about it for those cases where the difference actually matters).

gpshead · August 16, 2024, 4:55pm

Take a look at my stalled draft PR gh-113924: Allow pre-compressed data to be written into a zip via `zipfile`. by gpshead · Pull Request #113925 · python/cpython · GitHub which might allow for this even though the title and other stated reason for its existence might hide that fact. My ex-employer was creating canonical zip files; this would’ve been used to avoid the hoops they had to jump through to set fixed timestamps, including avoiding the nightmare of having to read multi-gigabyte inputs into RAM just to call writestr (not a friendly thing to do on a shared pool of distributed build workers each operating within resource limits, let alone in a more common rest of the world environment: on anyone’s laptop, which rarely has enough RAM). See line 1867 of Lib/zipfile/__init__.py adding zinfo= support to .write in that PR…

If anyone wants to pick that PR up and run with it, please feel free. Supporting zinfo= on zipfile.write is a good concept.

barry-scott · August 16, 2024, 5:18pm

The other zip format “bug” is that the paths in a zip file are just bytes and not in any specific or identified codec.

gpshead · August 16, 2024, 5:39pm

That is also true of most filesystems. Codecs are a high level abstraction put on top. The zip format specifies a couple of ways to treat names, but zip archives do exist that ignored that and put other values in the fields. We apparently don’t deal with those very well: zipfile: Corrupts filenames containing non-UTF8 characters · Issue #83042 · python/cpython · GitHub

(also, this is getting off topic for this thread)