Add copy between zipfiles without recompressing files

Motivation

Compression can take 80-90% of the time needed to create a zipfile. Caching this work can make zipfile creation up to 5-10x faster.

Design Problem

In zipfiles, each file is compressed individually and independently. Therefore, when copying individual files between zipfiles, decompressing and recompressing them is unnecessary work. However, the current zipfile stdlib implementation forces us to do this work.
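To make the wasted work concrete, here is a minimal sketch of what copying a member between archives looks like with today's stdlib (copy_member is a hypothetical helper name, not an existing API): read() fully decompresses the data, and writestr() compresses it all over again.

```python
import io
import zipfile

def copy_member(src: zipfile.ZipFile, dst: zipfile.ZipFile, name: str) -> None:
    # src.read() inflates the member and dst.writestr() deflates it
    # again -- the redundant work this proposal would eliminate.
    info = src.getinfo(name)  # filename, compression method, CRC, ...
    dst.writestr(info, src.read(name))

# Usage: copy "a.txt" between two in-memory archives.
buf_src, buf_dst = io.BytesIO(), io.BytesIO()
with zipfile.ZipFile(buf_src, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", b"hello " * 1000)
with zipfile.ZipFile(buf_src) as src, zipfile.ZipFile(buf_dst, "w") as dst:
    copy_member(src, dst, "a.txt")
```

Passing the source ZipInfo to writestr preserves the member's metadata, but the data itself still takes a round trip through the compressor.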

Proposed Solution

  • add the ability to read files from a zipfile without decompressing them
  • add the ability to write precompressed files to a zipfile

Proposed API Design

Individual files within a zipfile contain two sections:

  • zinfo, which contains metadata about the file, such as:
    • the filename
    • how the file was compressed (what algorithm was used)
    • a CRC to check the data integrity when it’s decompressed
  • the file data itself

A file’s zinfo provides all the metadata necessary for writing a precompressed file to a zipfile. The zipfile stdlib already includes a method, ZipFile.getinfo(name), that returns a file’s zinfo. Therefore, we need to add:

  • a way to get file data from a zipfile without decompressing the data
  • a way to write precompressed file data to a zipfile using the zinfo

To achieve this, I propose the following API changes.

APIs that affect both read and write

class ZipFile:
     # ZipFile.open can perform both reading and writing
-    def open(self, name, mode="r", pwd=None, *, force_zip64=False):
+    def open(self, name, mode="r", pwd=None, *, force_zip64=False,
+             precompressed=False):

APIs to read from ZipFile without decompressing files

class ZipExtFile(io.BufferedIOBase):
    def __init__(self, fileobj, mode, zipinfo, pwd=None,
-                 close_fileobj=False):
+                 close_fileobj=False, decompress=True):

class ZipFile:
-    def read(self, name, pwd=None):
+    def read(self, name, pwd=None, decompress=True):

APIs to write to ZipFile with precompressed files

Design option 1 - add two new methods to the ZipFile class

class ZipFile:
+   def write_precompressed(self, fileobject, zinfo):

+   def writestr_precompressed(self, data, zinfo):

Design option 2 - extend existing ZipFile class methods

class ZipFile:
    def writestr(self, zinfo_or_arcname, data,
-                compress_type=None, compresslevel=None):
+                compress_type=None, compresslevel=None, *,
+                precompressed=False):

-    def write(self, filename, arcname=None,
-             compress_type=None, compresslevel=None):
+    def write(self, file, arcname=None,
+             compress_type=None, compresslevel=None,
+             *, zinfo=None, precompressed=False):

A significant aspect of the changes in design option 2 is ZipFile.write’s filename parameter.

We could change ZipFile.write to accept a file-like object as well as a filename. Note that this design might be desirable independently of the changes proposed in this document.

However, if we accept a file-like object as the first positional argument, the parameter name filename is no longer accurate. filename could be renamed to something generic, such as file, but this would be a backwards-incompatible change for code that passes it as a keyword argument. To address this, we could keep a keyword-only filename parameter, but since the first argument is positional, this would still cause a signature mismatch. To avoid the mismatch, the file argument itself could default to None, but that seems like a significant change just to address the relatively small issue of an inaccurate parameter name, and it may also be confusing.

Alternatively, if we keep filename as is and simply add file or fileobj as a keyword argument, ZipFile.write would still expect the filename positional argument, which wouldn’t be necessary or used.

Another option is to allow the filename parameter to accept a zinfo (which does include the filename, among other things), although the parameter name would still be misleading. The file data could then be supplied in a new keyword-only argument such as precompressed_file or fileobj. However, if we want ZipFile.write to accept file objects even when not dealing with precompressed data, this signature wouldn’t support that, because it depends upon having the object’s zinfo.

Notes

The GitHub user gpshead proposed a similar, but different, solution here: Allow pre-compressed data to be written into zip files via the `zipfile` module. · Issue #113924 · python/cpython · GitHub. I have based my work on his PR draft.

In that GitHub issue, serhiy-storchaka commented that:

I thought that it would be more useful to add a method that copies data (one or several entries) from a ZIP file to other ZIP file. It can be used to remove from a ZIP archive (there were several requests for this feature). It can be used to merge several ZIP files in one.

I created and tested the changes serhiy-storchaka suggested, and it worked effectively for my caching needs.

I searched for similar issues and proposals to this one, but I couldn’t find any. This might be due to my inexperience, since this is my first attempt at contributing to CPython.

Feedback

I am posting this document to solicit feedback, especially for the API proposal, although other feedback is also welcome.


Thank you for the detailed proposal.

Personally I haven’t ever had to copy between zip files using Python, but I was curious whether you have a sense of how common this task is for a typical Python user? (if there is such a thing?!)

Clearly it would be a subjective estimate, but at least that would give a sense of how broadly beneficial overall an improvement like this might be.

It would also be interesting to think about where it’s most commonly used, whether people doing it currently just put up with the cost or use workarounds (e.g. calling out to more performant zip tools), and whether that might be “good enough” for most people (again subjective).


I anticipate that the most common use case for this would be to cache files used in builds.

Example Usage - Ren’Py

Ren’Py is a popular visual novel engine that’s written in Python. I recently posted a poll in Ren’Py’s Discord and, although the sample size is small, the early results suggest that the average Ren’Py developer builds 30 times per game and that building takes about 3 minutes. Most of this time is spent compressing files.

A single Ren’Py build might create four zipfiles for different platforms. Those zipfiles mostly share the same files. Since most files don’t change between releases, a user might compress the same file 100+ times as they iterate on their game.

In my test, using a cache made the compression part of the build 5.1x faster. In other caching tests, I’ve seen speed increases up to 52.8x.

Therefore, a cache could reduce the 30 x 3 minutes = 90 minutes that an average developer spends waiting on builds down to 2-18 minutes.

Other Performance Tests

I’ve also performed a couple of other tests:

Compressing several large images:

  • 67 seconds without a cache
  • 7 seconds with a full cache, a 9.6x speed increase

Compressing several csv files:

  • 106.841 seconds without cache
  • 2.024 seconds with a full cache, a 52.8x speed increase

Parallelized Compression

Additionally, there is another use case for this change: parallelized compression. Even if a user has an empty cache (or doesn’t want to retain a cache on disk), they could still parallelize the compression of the individual files to a temporary cache location on disk, and then combine the compressed files into one zipfile.

Since the temporarily cached files could be removed as they’re consumed, the space overhead this introduces is:

  • negligible metadata overhead from each cached file being stored as its own zipfile
  • the size of the largest individual compressed file, since it will temporarily exist both in the destination zipfile and in the cache

Here’s an example of what the code might do:

  • concurrently zip each file individually (one zipfile per file) via multiprocessing
  • wait for all files to be zipped
  • read and collate the zipped files in a single process

Notably, this doesn’t require multiprocessing/multithreading support from the zipfile module itself.
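The three steps above can be sketched with today's stdlib (zip_one and build_archive are hypothetical names, and in-memory buffers stand in for the temporary on-disk cache). Note that the collation step currently has to decompress and recompress each member; with the proposed API it could copy the compressed bytes verbatim. Threads are used here because zlib releases the GIL during compression.

```python
import io
import zipfile
from concurrent.futures import ThreadPoolExecutor

def zip_one(name: str, data: bytes) -> bytes:
    # Step 1: compress a single file into its own one-member zipfile.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, data)
    return buf.getvalue()

def build_archive(files: dict[str, bytes]) -> bytes:
    # Steps 1-2: compress each file concurrently and wait for all of
    # them; zlib releases the GIL, so threads parallelize the deflate.
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(zip_one, files.keys(), files.values()))
    # Step 3: collate in a single process.  Today this recompresses
    # every member; the proposed API would make it a raw byte copy.
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w") as dst:
        for part in parts:
            with zipfile.ZipFile(io.BytesIO(part)) as src:
                for info in src.infolist():
                    dst.writestr(info, src.read(info))
    return out.getvalue()
```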

My test results from this were:

  • 67 seconds without a cache
  • 14 seconds with an empty cache, a 4.8x speed increase

These arguments make sense to me. I’m not sure about the exact API changes, but the basic idea of direct manipulation of compressed data via a public API looks good.

Moreover, a library with a good API for parallel archive creation would be useful. For example, the user would only have to define a list of files and a maximum number of processes, while all of the parallel logic for the actual archive creation could be generic.


I’ve recently been thinking along the same lines. Lots of file formats, e.g. Microsoft Office, use zip as a kind of container for their data files. It’s fast enough, compression is reasonable, and the format is ubiquitous.

I’m planning to add documentation for an EditableZipFile to Openpyxl with the ability to delete and replace archive members. The ability to copy members between archives could also dramatically improve performance for large reports in Excel using templates; and it would also make it possible to write zips in parallel. But, as the main focus is helping users deal with broken workbooks, I’m unlikely to release any library.

I’m not keen on the proposed API here, particularly the method names, but then I don’t think that ZipFile, if it were developed now, would have the same API as it has. There is also a suggested pull request to allow member deletion, which I think makes a lot of sense, but, given the lack of work on it, it reminds me of how little love the standard library seems to get.

FWIW we’re also adding an incremental writer to xml.etree.

I agree that the proposed API could use some work. How about this?

Proposed External API Design - Version 2

# all methods are only available when mode="w"
class ZipFile:
+    def copy_file(self, source_zipfile, file):

     # while we can call copy_file for every file we want to copy
     # in a zipfile, if we add a batch operation, we can open
     # the zipfile just once
     # instead of opening it 10,000 times to read 10,000 files
+    def copy_files(self, source_zipfile, files):

+    def copy_all_files(self, source_zipfile):
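The intended semantics of these three methods can be prototyped today as a subclass (CopyingZipFile is a hypothetical name). This baseline decompresses and recompresses each member; under the proposal, the same methods would move the raw compressed bytes instead.

```python
import io
import zipfile

class CopyingZipFile(zipfile.ZipFile):
    # Baseline semantics only: each copy round-trips through the
    # compressor, which is exactly the work the proposal would avoid.
    def copy_file(self, source_zipfile, file):
        # `file` may be a member name or a ZipInfo ("zinfo") object.
        info = (file if isinstance(file, zipfile.ZipInfo)
                else source_zipfile.getinfo(file))
        self.writestr(info, source_zipfile.read(info))

    def copy_files(self, source_zipfile, files):
        for file in files:
            self.copy_file(source_zipfile, file)

    def copy_all_files(self, source_zipfile):
        self.copy_files(source_zipfile, source_zipfile.infolist())

# Usage: merge every member of one in-memory archive into another.
src_buf, dst_buf = io.BytesIO(), io.BytesIO()
with zipfile.ZipFile(src_buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("x.txt", b"abc")
    zf.writestr("y.txt", b"def")
with zipfile.ZipFile(src_buf) as src, CopyingZipFile(dst_buf, "w") as dst:
    dst.copy_all_files(src)
```

Opening the source archive once and iterating over infolist() is what makes the batch methods attractive, per the comment above about avoiding 10,000 separate opens.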

Besides the overall design, there are also two details that I’d like to hear opinions on.

The first is whether the API should only accept a single type for each parameter, or multiple types. For example, the file parameter could accept:

  • strings
  • path objects
  • zinfo

A second detail to consider is whether filename strings should be transformed internally within these methods. For example, when writing to a zipfile, the following transformations occur:

  • /User/myfile.txt becomes User/myfile.txt (the initial / is cut off)
  • mydirectory/ becomes mydirectory (the trailing / is cut off)

There might be other transformations, too. Should the API apply these transformations automatically, or should it be dumb instead?
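For reference, these transformations can be observed today through ZipInfo.from_file, which applies the same arcname normalization that ZipFile.write performs (POSIX path behaviour assumed; the temporary file merely stands in for a real input):

```python
import os
import tempfile
import zipfile

# Create a throwaway file to stat; from_file needs a real path.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"data")

# Leading "/" is cut off.
name1 = zipfile.ZipInfo.from_file(tmp.name, arcname="/User/myfile.txt").filename
# Trailing "/" is cut off (for a non-directory input).
name2 = zipfile.ZipInfo.from_file(tmp.name, arcname="mydirectory/").filename

os.unlink(tmp.name)
```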

I was thinking of using update(), which could be given an individual filename or a list of filenames, but I’d try to keep the API as simple as possible, because anyone who wants to do this should be considered a consenting adult. This also guarantees that the ZipInfo will be correct.

My own code was initially based on python - overwriting file in ziparchive - Stack Overflow, but I found the class-decorator approach a bit weird, so I simply subclassed. The way of finding the end of a member is also a bit weird; I optimised this by counting backwards from the end of the archive, but then I found the private attribute _end_offset in ZipInfo. This is important because entries can contain empty space, so you can’t simply rely on compressed size + header + extra :-/

I’m still working on the details for my own use cases, because there’s currently too much fiddling with self.fp for my liking, as is the need to update both self.filelist and self.NameToInfo. I’ll post a link to the updated docs when I’ve got something I’m happy with.

I’ve created a PR for this issue: gh-113924: Add copy from zipfile without decompress/recompressing data by sunrisesarsaparilla · Pull Request #125718 · python/cpython · GitHub


I’d like to add my support for this functionality. It is indeed useful. Other formats (more specifically for numerical data), such as HDF5, support operations like this, which makes copying data very fast since only the compressed version is copied.
