Motivation
Compression can account for 80-90% of the time spent creating a zipfile. Caching the compressed output can therefore make zipfile creation up to 5-10x faster.
Design Problem
In zipfiles, each file is individually and independently compressed. Therefore, when copying individual files between zipfiles, decompressing and recompressing is unnecessary work. However, the current `zipfile` stdlib implementation forces us to do this work anyway.
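To make the redundant work concrete, here is a minimal sketch of copying one entry between archives with today's public API. The helper name `copy_entry` and the sample data are my own illustration, not part of the stdlib.

```python
import io
import zipfile

def copy_entry(src, dst, name):
    """Copy one entry between open ZipFile objects using today's API."""
    zinfo = src.getinfo(name)
    data = src.read(name)  # decompresses the stored bytes
    # writestr compresses the same bytes all over again
    dst.writestr(name, data, compress_type=zinfo.compress_type)

payload = b"example data " * 1000
src_buf, dst_buf = io.BytesIO(), io.BytesIO()
with zipfile.ZipFile(src_buf, "w", zipfile.ZIP_DEFLATED) as src:
    src.writestr("data.txt", payload)

with zipfile.ZipFile(src_buf) as src, zipfile.ZipFile(dst_buf, "w") as dst:
    copy_entry(src, dst, "data.txt")

with zipfile.ZipFile(dst_buf) as dst:
    copied = dst.read("data.txt")
```

The decompress and recompress steps inside `copy_entry` are exactly the work this proposal aims to skip.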
Proposed Solution
- add the ability to read files from a zipfile without decompressing them
- add the ability to write precompressed files to a zipfile
Proposed API Design
Individual files within a zipfile contain two sections:
- zinfo, which contains metadata about the file, such as:
  - the filename
  - how the file was compressed (what algorithm was used)
  - a CRC to check the data integrity when it’s decompressed
- the file data itself
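The metadata fields listed above are already exposed on the existing `ZipInfo` object. The snippet below is a small illustration with made-up sample data; it only uses attributes that the stdlib documents today.

```python
import io
import zipfile
import zlib

payload = b"example data " * 1000
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data.txt", payload)

with zipfile.ZipFile(buf) as zf:
    zinfo = zf.getinfo("data.txt")

print(zinfo.filename)                               # name of the entry
print(zinfo.compress_type == zipfile.ZIP_DEFLATED)  # compression algorithm
print(zinfo.CRC == zlib.crc32(payload))             # CRC-32 of the uncompressed data
print(zinfo.file_size, zinfo.compress_size)         # uncompressed and compressed sizes
```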
A file’s zinfo provides all the metadata necessary for writing a precompressed file to a zipfile. The `zipfile` stdlib module already includes a method, `ZipFile.getinfo(name)`, to return a file’s zinfo. Therefore, we need to add:
- a way to get file data from a zipfile without decompressing the data
- a way to write precompressed file data to a zipfile using the zinfo
To achieve this, I propose the following API changes.
APIs that affect both read and write
 class ZipFile:
     # ZipFile.open can perform both reading and writing
-    def open(self, name, mode="r", pwd=None, *, force_zip64=False):
+    def open(self, name, mode="r", pwd=None, *, force_zip64=False,
+             precompressed=False):
APIs to read from ZipFile without decompressing files
 class ZipExtFile(io.BufferedIOBase):
     def __init__(self, fileobj, mode, zipinfo, pwd=None,
-                 close_fileobj=False):
+                 close_fileobj=False, decompress=True):
 class ZipFile:
-    def read(self, name, pwd=None):
+    def read(self, name, pwd=None, decompress=True):
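To illustrate what a call with `decompress=False` would hand back, here is a rough sketch that pulls the raw compressed bytes out of an archive with today's stdlib by skipping the local file header manually. The helper `read_compressed` is hypothetical, and the sketch assumes an unencrypted entry in an archive held fully in memory.

```python
import io
import struct
import zipfile
import zlib

def read_compressed(archive_bytes, name):
    """Hypothetical helper: return (zinfo, raw compressed bytes) for one entry."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        zinfo = zf.getinfo(name)
    # The local file header has 30 fixed bytes; the filename and extra-field
    # lengths are the two little-endian uint16 values at offsets 26 and 28.
    name_len, extra_len = struct.unpack_from(
        "<HH", archive_bytes, zinfo.header_offset + 26
    )
    start = zinfo.header_offset + 30 + name_len + extra_len
    return zinfo, archive_bytes[start:start + zinfo.compress_size]

payload = b"example data " * 1000
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data.txt", payload)

zinfo, raw = read_compressed(buf.getvalue(), "data.txt")
# ZIP_DEFLATED entries hold a raw deflate stream, hence wbits=-15.
restored = zlib.decompress(raw, -15)
```

With the proposed parameter, the manual header arithmetic would be replaced by a single `read` call, and the returned bytes would correspond to `raw` here.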
APIs to write to ZipFile with precompressed files
Design option 1 - add two new methods to the ZipFile class
 class ZipFile:
+    def write_precompressed(self, fileobject, zinfo):
+    def writestr_precompressed(self, data, zinfo):
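As a rough sketch of what `writestr_precompressed` might do internally, the free function below writes an already-compressed payload straight into an archive. Please treat it as an illustration only: it relies on private `ZipFile` attributes (`_lock`, `_writing`, `_seekable`, `_writecheck`, `_didModify`) that may change between Python versions, and the helper `precompress` is my own stand-in for a compression cache.

```python
import copy
import io
import zipfile
import zlib

def precompress(arcname, payload):
    """Hypothetical cache step: compress once, keep data plus metadata."""
    co = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate, as ZIP expects
    data = co.compress(payload) + co.flush()
    zinfo = zipfile.ZipInfo(arcname)
    zinfo.compress_type = zipfile.ZIP_DEFLATED
    zinfo.CRC = zlib.crc32(payload)
    zinfo.file_size = len(payload)
    zinfo.compress_size = len(data)
    return data, zinfo

def writestr_precompressed(zf, data, zinfo):
    """Sketch only: depends on private ZipFile internals (assumption)."""
    zinfo = copy.copy(zinfo)   # do not mutate the caller's (cached) zinfo
    zinfo.flag_bits &= ~0x08   # sizes and CRC go in the header, no data descriptor
    with zf._lock:
        if zf._writing:
            raise ValueError("another write handle is open on this archive")
        if zf._seekable:
            zf.fp.seek(zf.start_dir)
        zinfo.header_offset = zf.fp.tell()
        zf._writecheck(zinfo)
        zf._didModify = True
        zf.fp.write(zinfo.FileHeader())
        zf.fp.write(data)      # compressed bytes are written verbatim
        zf.start_dir = zf.fp.tell()
        zf.filelist.append(zinfo)
        zf.NameToInfo[zinfo.filename] = zinfo

payload = b"example data " * 1000
data, zinfo = precompress("data.txt", payload)

archives = []
for _ in range(2):  # the same compressed bytes are reused for both archives
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        writestr_precompressed(zf, data, zinfo)
    archives.append(buf)
```

The loop at the end shows the caching use case: compression happens once in `precompress`, and each archive only pays for the raw write.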
Design option 2 - extend existing ZipFile class methods
 class ZipFile:
     def writestr(self, zinfo_or_arcname, data,
-                 compress_type=None, compresslevel=None):
+                 compress_type=None, compresslevel=None, *,
+                 precompressed=False):
-    def write(self, filename, arcname=None,
-              compress_type=None, compresslevel=None):
+    def write(self, file, arcname=None,
+              compress_type=None, compresslevel=None,
+              *, zinfo=None, precompressed=False):
A significant aspect of the changes in design option 2 is the `filename` parameter of `ZipFile.write`
We could change `ZipFile.write` to accept a file-like object as well as a filename. Note that this design might be desirable independently of the changes proposed in this document.
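For comparison, a file-like object can already be streamed into an archive with the existing API by combining `ZipFile.open(..., mode="w")` with `shutil.copyfileobj`. The helper name `write_fileobj` below is hypothetical; it is only meant to show the behaviour that a file-accepting `ZipFile.write` would fold into one call.

```python
import io
import shutil
import zipfile

def write_fileobj(zf, fileobj, arcname):
    """Hypothetical helper: stream a readable binary file object into zf."""
    # Opening by name in write mode uses the archive's default compression.
    with zf.open(arcname, mode="w") as dest:
        shutil.copyfileobj(fileobj, dest)

payload = b"example data " * 1000
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    write_fileobj(zf, io.BytesIO(payload), "data.txt")

with zipfile.ZipFile(buf) as zf:
    stored = zf.read("data.txt")
```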
However, if we accept the file-like object as the first positional argument, then the parameter name `filename` is no longer accurate. `filename` could be changed to something generic, such as `file`. However, this would be a backwards-incompatible change for code that passes `filename` as a keyword argument. To address this, we could add a keyword-only `filename` parameter, but since the first argument is still a required positional one, a call that passes only `filename=` would not match the signature. To avoid this mismatch, the `file` argument itself could default to `None`, but this seems like a pretty significant change, especially to address the relatively small issue of a parameter name no longer being accurate. Additionally, it may be confusing.
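The keyword-argument problem can be shown with toy functions. These are stand-ins I wrote for illustration, not the real `ZipFile.write`; `write_shim` sketches the variant where `file` defaults to `None` and `filename` survives as a keyword-only alias.

```python
def write_old(filename, arcname=None):
    """Stand-in for today's signature."""
    return filename

def write_renamed(file, arcname=None):
    """Stand-in for the signature after a plain rename."""
    return file

def write_shim(file=None, arcname=None, *, filename=None):
    """Stand-in for the compatibility variant with a keyword-only alias."""
    if filename is not None:
        if file is not None:
            raise TypeError("pass either 'file' or 'filename', not both")
        file = filename
    if file is None:
        raise TypeError("missing required argument: 'file'")
    return file

# Existing callers that use the keyword break after a plain rename.
try:
    write_renamed(filename="a.txt")
    keyword_call_broke = False
except TypeError:
    keyword_call_broke = True

# The shim accepts both the old keyword form and the new positional form.
old_style = write_shim(filename="a.txt")
new_style = write_shim("a.txt")
```

The shim keeps old callers working, at the cost of a required argument that looks optional in the signature, which is the confusion mentioned above.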
Alternatively, if we keep `filename` as is and simply add `file` or `fileobj` as a keyword argument, `ZipFile.write` would still expect the `filename` positional argument, which wouldn’t be necessary or used.
Another option is to allow the `filename` parameter to accept a zinfo (which does include the filename, among other things), although the parameter name would still be misleading. The file data could then be supplied in a new keyword-only argument such as `precompressed_file` or `fileobj`. However, if we want `ZipFile.write` to accept file objects in general, not only when dealing with precompressed data, this signature wouldn’t support that, because it depends upon having the object’s zinfo.
Notes
The GitHub user gpshead proposed a similar, though not identical, solution in python/cpython issue #113924, "Allow pre-compressed data to be written into zip files via the `zipfile` module". I have based my work on his PR draft.
In that GitHub issue, serhiy-storchaka commented that:
I thought that it would be more useful to add a method that copies data (one or several entries) from a ZIP file to other ZIP file. It can be used to remove from a ZIP archive (there were several requests for this feature). It can be used to merge several ZIP files in one.
I created and tested the changes serhiy-storchaka suggested, and it worked effectively for my caching needs.
I searched for similar issues and proposals to this one, but I couldn’t find any. This might be due to my inexperience, since this is my first attempt at contributing to CPython.
Feedback
I am posting this document to solicit feedback, especially for the API proposal, although other feedback is also welcome.