PEP 784: Adding Zstandard to the standard library

I don’t see that the namespace vs namespace package thing is a factor. Nobody that I can see has said anything to insinuate they don’t like the compression namespace but would happily back it if you put in an empty compression/__init__.py.

What’s the rationale behind removing the current convenience of doing:
python -m gzip file1.txt

Will the new Zstd library come with a command-line utility, or will it no longer be as convenient?
python -m compression.zstd file2.txt?

Oh I see, you want compression to be a namespace package. I don’t think this is a good idea for a few reasons:

  1. Namespaces work best when managed by a single entity
  2. Allowing third parties to register e.g. compression.lz4 would take that name for the future, and we’d run into the exact same issue the compression package is trying to solve: the ideal name is already taken
  3. There are security and stability concerns. Many users trust the standard library more than third-party code. Allowing third parties to place modules in a compression namespace means I don’t know whether the module I’m importing is from the standard library until I’ve imported it and checked __file__. I think it is important that users are able to trust items under compression
5 Likes

The PEP does not remove this convenience. gzip, zipfile, and tarfile all provide these CLIs, but they are not compression libraries themselves; they are file formats that use compression. In the case of gzip, it defines a file format that uses DEFLATE (provided through zlib). None of the modules the PEP specifies to be moved currently has a CLI. That does not change with the PEP.

Just like lzma, bz2, and zlib, it will not come with a command-line interface in the initial implementation. But that is not something the PEP prohibits either. If someone proposes CLIs for compression formats, that can always be added.

The way to think about this: archive formats, i.e. gzip, zipfile, and tarfile (modules that define file formats that can contain multiple members), come with command lines to generate those archives. Compression modules, i.e. zlib, lzma, bz2, and with this proposal zstd, do not.
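To make the split concrete, here is a minimal sketch (assuming a POSIX shell and a `python3` on the PATH) of the CLI that already ships with the gzip module; the raw compression modules have no equivalent entry point:

```shell
# The archive/file-format modules ship small argparse-based CLIs;
# the raw compression modules (zlib, bz2, lzma) do not.
echo "hello" > example.txt
python3 -m gzip example.txt        # writes example.txt.gz next to the input
python3 -m gzip -d example.txt.gz  # decompresses back to example.txt
```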

5 Likes

TL;DR: I don’t think using the ‘compression’ namespace to group compression algorithms while leaving out gzip, zip, and tarfile would provide a good user experience.


From an end user’s perspective, this is becoming too complex. There is a subtle distinction between archiving and compressing. Users primarily care about how to handle files with extensions like .zip, .gz, .zst, etc.

If a user encounters a ‘compression’ library and gzip is not included, it either means that gzip does not perform compression, or the ‘compression’ library is not actually about compression.

I would personally expect compression to be able to handle files at the very least. I don’t think Python namespaces are about taxonomy, but rather about user experience. Today, when someone says ‘compress,’ they mean compressing a file.

I don’t think Python’s ‘http’ module would be useful if it were only about the HTTP protocol and not about creating a web server or making a web request.

3 Likes

Those three are not compression algorithms. Those are container file formats. One of which (tar) doesn’t even offer compression as part of its container file format.

1 Like

That is not what I mean when I say compress.

Compression means bytes go into an API and generally fewer bytes come out the other end. No files involved.

Regardless, of the trio you listed, gzip is single-file based and thus sits on the fence. It makes sense to me for inclusion in compression, as it is effectively the predecessor equivalent of BZ2File, LZMAFile, and ZstdFile; it just happens to have a top-level name matching the command-line utility for historical reasons, instead of living within zlib, which is the compression algorithm the gzip file format wraps.
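To put the bytes-in/bytes-out point in code, a minimal sketch using the existing stdlib modules (no files involved; compression.zstd would slot into the same shape under the PEP):

```python
import bz2
import lzma
import zlib

# Repetitive input so every codec can actually shrink it.
data = b"compression means bytes in, fewer bytes out" * 100

# Each compression module exposes the same one-shot, byte-oriented API.
for mod in (zlib, bz2, lzma):
    compressed = mod.compress(data)
    assert mod.decompress(compressed) == data
    assert len(compressed) < len(data)
```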

4 Likes

Including gzip opens the door to future improvements, a unified API, and more. Thanks for your understanding!

1 Like

Yeah, gzip is a weird in-between since its contents are supposed to be a single file/stream. To me it is still an archive/“container” format because it can contain extra metadata about its contents, which isn’t really under the purview of a compression algorithm. Furthermore, the RFC abstract for gzip states that while it uses DEFLATE, it could be extended to use other compression algorithms:

The format presently uses the DEFLATE method of compression but can be easily extended to use other compression methods.

To me that makes it much more of an archiving tool rather than a compression tool: it doesn’t define a particular algorithm for compression, but rather a file format that uses one or more compression algorithms.

3 Likes

Hah, that’s a weird bit of trivia I didn’t realize was there. For all practical purposes though, gzip is just zlib surrounded by metadata. Nobody has ever used it with another compression format and it is highly unlikely anyone ever would because 30+ years of software does not expect otherwise.

2 Likes

Very true! I don’t mean to imply that it would be used with another compression format, I merely point it out as an indicator of the design goals of the format. If people think gzip should be in compression I wouldn’t be opposed.

1 Like

No, I definitely don’t. I’m just saying that’s what the name insinuates. If I hadn’t read this PEP, I would be surprised to discover it’s from compression import gzip but not from compression import brotli. If I was writing a new compression library for PyPI and hadn’t read this PEP, I’d have probably tried to put it into the compression namespace.

Oh, sorry again for misunderstanding you. I think the double negatives confused me.

I don’t see how a package name implies whether or not it is a namespace. concurrent is similarly generic, but I don’t think I’ve ever seen someone expect it to be a namespace package. Regardless, the documentation for compression can make it very clear that it is only for stdlib modules.

1 Like

Actually, it can contain multiple chunks (section 2.2 of the RFC includes this). And this isn’t a theoretical thing: bgzip (block gzip) is widely used in the bioinformatics field to compress large data files in a way that can be indexed (i.e. you can store a pointer to a particular block). This is also the format of the CBCL[1] files from some of the larger Illumina sequencers.

I think that puts gzip firmly in the “archiving” camp, rather than just being compression.

Really, even though it doesn’t currently exist? If you wrote a new xml parser, would you try to put it in that namespace? Any name in the stdlib (package or module) takes precedence over PyPI packages that might want to use that name; that’s just how it works[2]. I don’t think it’s an argument against adding a name.


  1. concatenated base call ↩︎

  2. unless we someday move the stdlib under std or something similar ↩︎
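For what it’s worth, the stdlib gzip module already handles concatenated members transparently on the read side, which is exactly the bgzip-style layout described above. A small sketch:

```python
import gzip

# Two independently-compressed members, simply concatenated,
# the way block-gzip (bgzip) files are laid out.
stream = gzip.compress(b"block one ") + gzip.compress(b"block two")

# gzip.decompress reads through all members and joins their payloads.
assert gzip.decompress(stream) == b"block one block two"
```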

3 Likes

Now that it has been over 24 hours and it looks like most commenters have voted, I wanted to loop back to the poll on naming. It appears that compression.zstd has received a vote from the majority of voters (62%), so I think I will move forward with the current naming in the PEP. zstdlib and an experimental _zstd both received votes from just under a third of voters (28%). There were a handful of voters that chose “other” (5%) or “we should not add Zstandard” (4%).

I’ll work on expanding the rejected ideas to cover the discussion about module name in more detail. In addition, I will add a section discussing the rejection of including archive modules like zipfile and tarfile.

The main open question to me is whether or not we should include gzip under compression. Based on @jamestwebber’s most recent post:

I am inclined to say that we should not put gzip under compression, considering it is a container for multiple files, rather than a compression algorithm.

Once the discussion around gzip is settled, I hope to submit the PEP later this week barring other issues being raised. Hopefully before the next Steering Council meeting :grin:

Thank you again to everyone who has participated in the discussion thread, whether by voting or commenting.

7 Likes

I think gzip should be in the compression package.

Although the gzip file format can be used to store multiple files, that is uncommon,
and the Python API doesn’t support that use case well.

4 Likes

Here’s some background on how gzip came to replace compress:

Also (emphasis mine):
“gzip is a single-file/stream lossless data compression utility, where the resulting compressed file generally has the suffix .gz.”

That’s referring to the binary itself (i.e. the gzip you invoke on the command line). The second sentence is “gzip also refers to the associated compressed data format used by the utility.” Which, as mentioned above, can have multiple members.

edit: I should admit that I’ve been imprecise when describing block/multi-member gzip. I don’t think it’s used much for multiple files, in the sense that decompressing would lead to multiple different paths on the file system. But it does get used with multiple chunks quite a bit[1], because this allows for indexing[2].

The python module doesn’t support this, but it could if someone wanted to add that functionality. I don’t know that it would see a ton of use though, because there are more performant options out there.


  1. in my world, that is ↩︎

  2. and parallel decompression ↩︎

1 Like

Could you please explain what makes gzip unsuitable to be part of the compression namespace?

| Feature | gzip | zstd (Zstandard) |
| --- | --- | --- |
| Compression algorithm | DEFLATE (LZ77 + Huffman) | Zstd (LZ77 + FSE + Huffman) |
| File extension | .gz | .zst |
| Supports multiple files | No | No |
| Combined with tar | Yes (.tar.gz) | Yes (.tar.zst) |
| Compression speed | Moderate | Very fast |
| Decompression speed | Fast | Extremely fast |
| Compression ratio | Good | Better than gzip |
| Streaming support | Yes | Yes |
| Dictionary support | No | Yes |
| Parallel compression | Limited | Yes |
| Release year | 1992 | 2016 |
| Open source | Yes | Yes |
| Common in OS tools | Yes (ubiquitous) | Yes (newer systems) |

There’s a subtle distinction between command line tools (as described by the table you posted here) and compression algorithms. In common parlance “gzip” has been conflated to mean both things but it’s really just the former (the table makes this clear in the first row). If you read the code for gzip.py, you’ll see that it imports zlib for performing compression, because that’s the actual compression method.
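That relationship is easy to demonstrate in a small sketch: zlib can decode a gzip stream directly if told to expect the gzip wrapper (wbits=31), because the payload inside is plain DEFLATE:

```python
import gzip
import zlib

data = b"gzip is just DEFLATE plus header/trailer metadata"
blob = gzip.compress(data)

# The gzip container starts with its magic bytes 0x1f 0x8b.
assert blob[:2] == b"\x1f\x8b"

# wbits=31 tells zlib to expect the gzip framing around the DEFLATE stream.
assert zlib.decompress(blob, wbits=31) == data
```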

I don’t think it needs to move from the top level namespace, where lots of people are expecting it. And it doesn’t properly belong in a collection of compression algorithms. But I don’t really care if a shim is added as well, it’s basically free to do that. If the SC insisted on it, I’m not gonna argue.

1 Like