PEP 784: Adding Zstandard to the standard library

I think a compression namespace might be a good idea, but with a shorter name (easier to type) that also does not conflict with an existing 3rd-party module. Perhaps ‘compress’ or ‘shrink’. I sometimes wish tkinter were ‘tk’ and ‘idlelib’ were ‘idle’.

(Free) Threading is an area I need to poke at more. The branch I have currently has nominal support for threading, but I need to test it more. Right now the compressor/decompressor objects have per-object locks that check, at the method level, that there isn’t concurrent access to the (de)compression contexts. So it should be safe, even if not performant.
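For illustration, per-object, method-level locking of a compression context can be sketched roughly like this (this is not the actual branch's code; zlib stands in here for a libzstd context, and LockedCompressor is a hypothetical name):

```python
import threading
import zlib  # stand-in codec; the real module would wrap a libzstd context


class LockedCompressor:
    """Sketch of a compressor whose methods serialize access to one context."""

    def __init__(self, level=3):
        self._lock = threading.Lock()
        self._ctx = zlib.compressobj(level)  # placeholder for a zstd context

    def compress(self, data):
        # Take the per-object lock before touching the shared context, so
        # concurrent calls from multiple threads serialize instead of
        # corrupting internal state.
        with self._lock:
            return self._ctx.compress(data)

    def flush(self):
        with self._lock:
            return self._ctx.flush()
```

This is safe but not parallel: two threads compressing through the same object simply take turns, which matches the "safe, even if not performant" description above.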

I think for purposes of the PEP, even having single threaded Zstandard would be a compelling addition, but I am hopeful to have time to add multithreaded support. Note that all other compression modules are single threaded (and using locks) as far as I am aware (e.g. LZMA MultiThreading XZ compression support · Issue #114953 · python/cpython · GitHub).

3 Likes

The compression name has the benefit that it is much less likely to conflict with 3rd party code, as the package with that name was reserved for CPython. There is a chance someone is using it for some private/internal code. But in that scenario I don’t think compression is any more or less likely to be used than compress. I think shrink won’t be nearly as obvious, and compress is claimed: compress · PyPI

1 Like

This is a case where we maybe could’ve proposed just using that name, but it is always a risk to do so: you never want any divergence between the API of the PyPI package you’ve adopted and what the stdlib version now offers. That is not something we are good at guaranteeing unless we own the upstream PyPI package and take on that guarantee and future maintenance ourselves. We’ve often chosen separate names regardless, to avoid confusion.

It is also important to have distinct names in order to reason about code. Which module is import zstd importing? You can’t tell by reading the code when it might be the stdlib module but sometimes is not. (Strictly speaking you can’t assume anything when it comes to Python, but for practical purposes all of us do, and can, in normal code environments.)

If I write from compression import zstd I know exactly what I’m getting. It allows users to explicitly specify “try the stdlib import, if that fails, fallback to this other import zstd thing” in their code. Instead of magically hoping their packaging configs and import path make that happen outside of view of the code.
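For illustration, that explicit fallback could look like the following (the compression.zstd location is what the PEP proposes; the third-party fallback name is just an example):

```python
# Prefer the stdlib module when present, otherwise fall back explicitly.
try:
    from compression import zstd  # stdlib location proposed by PEP 784
except ImportError:
    try:
        import zstandard as zstd  # example third-party fallback (PyPI "zstandard")
    except ImportError:
        zstd = None  # neither is available; degrade or raise as appropriate
```

The point is that the resolution order is visible in the code itself, rather than depending on whatever the packaging configuration and import path happen to produce.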

Another pain point of reusing an existing PyPI module name: which set of sorted imports does it belong in? (A common Pythonic norm is to sort stdlib imports first, then third-party, then the local application’s own modules, as enforced and automated by popular import-sorting tooling.) When something moves from one group to another but keeps the same name, you get a dance of disagreeing import sorters based upon their view of what is stdlib vs installed third-party packages.

4 Likes

I think I’m responsible for asking Emma to include a mention of lz4 in the PEP as I assumed if we did not, someone would bring it up with a “why not lz4?” question. :stuck_out_tongue:

I doubt we’d add that one given zstandard exists (and was created by the creator of LZ4…) but it serves as a good example because it still has one niche and sees use.

BTW, we should probably link to RFC 8478 - Zstandard Compression and the application/zstd Media Type from the PEP as well.

3 Likes

I’d also really like to avoid another enum34 situation: the enum module was backported as the enum34 PyPI distribution for Python < 3.4 (with the enum import name), tonnes of packages depended on it without adding an appropriate ; python_version < '3.4' marker, enum34 became obsolete and out of date, and then, years later, random parts of unrelated code (including the standard library) would break over a missing IntFlag because the stale backport was still in people’s environments/dependency trees and shadowing the up-to-date stdlib version.

I’d prefer a slightly suboptimal name to doing that again.

Good point; I have added this in my PR that updates the timeline text: PEP 784: Updates to timeline and RFC reference by emmatyping · Pull Request #4352 · python/peps · GitHub

And flat is better than nested (PEP 20).

Python’s stdlib has traditionally always tried to use a flat structure, not a nested one.

If you are concerned about possible collisions with PyPI names, the better option would be to consider placing the entire stdlib into a namespace, e.g. “stdlib” or just “py” (yes, there is a PyPI package with that name, but it’s been in maintenance mode for around 4 years now and I’m sure we could ask the maintainers about reusing their top-level package). To maintain backwards compatibility, all current stdlib modules would continue to be available under their current names at top level (in addition to being available under the package name), while all new ones would only be available under this new package name.

We could then also rename compromises such as “hashlib” and “tomllib” back to “py.hash” and “py.toml”.

But back on the topic:

Using a top-level name such as “zstdcomp” would certainly be possible today, without breaking applications (you are forgetting that “compression” may well be in use by applications out there, not just in packages on PyPI; the name is just too generic).

Later on, this could then be renamed to “py.zstd”.

9 Likes

Every new package created on PyPI has some risk of breaking someone’s private/internal code (whether as a deliberate supply-chain attack or by accident). And even an existing public package a that simply adds another (existing, public) package b as a new dependency breaks any code that relies on a alongside a private package named b.

Adding a new module to the standard library affects more users at once; but I still wouldn’t be surprised if the overall impact is dwarfed by the sheer number of new packages or new dependencies added day after day. (We just don’t hear about those, because every individual case only affects a small number of people.)
Thus, I don’t think fear of maybe breaking some internal code somewhere should stop us from adding a compression module.

Is there a public list of such reserved names anywhere? That might be useful to help people avoid such collisions in the future.

(Other options that come to mind: Encouraging people to use, e.g., company_name.* as a namespace for internal modules, to reduce the risk of accidental collisions? Or specifically reserving a namespace for internal usage, similar to the 192.168.0.0/16 or 10.0.0.0/8 IPv4 address ranges?)

1 Like

pip install pytest adds a py module.

I prefer std.

7 Likes

I want to underscore this point. Being able to do import json or import csv but not import toml has tripped me up once or twice in the past (and has a good chance of doing so again in the future, next time I need tomllib after not using it for a long time). Not a big deal for me, sure; but multiply that little paper cut by a large number of users, both now and in the years (decades?) to come and it is starting to become a pretty big deal.

Deliberately choosing a worse name may avoid a fixed amount of “transition pain” in the short term, but we’ll have to pay for that with an ever increasing amount of “paper cut pain” in the long term. To me, that is a strong argument against zstdcomp (or libzstd as in the Rejected Ideas section of the PEP, or any similar name).

(To be clear, I’m not saying we should blithely inflict any amount of transition pain. Certainly, 2.5M monthly downloads for the existing zstd module are a strong argument against reusing that name. But without any evidence that using compression would cause anything close to that level of transition pain, to me the long term advantage of picking an intuitive name like compression.zstd clearly wins out.)

10 Likes

What about org.python.std? After all, namespaces are one honking great idea. :stuck_out_tongue:

Seriously, if there were a major migration to a stdlib namespace, std seems better to me than py or stdlib or python. I would hope that the nesting could be kept to three levels as much as possible (e.g., std.compress.*). I also would suggest that sys, os and probably a few other really core modules would still make sense to keep at the top level.

3 Likes

If we are truly looking to solve the problem of stdlib module names being taken by PyPI, compression.* just seems like a band-aid that fixes the problem for a couple of modules over the course of several years while breaking millions of lines of code once the original modules are removed.

What happens when another new module, unrelated to compression, is being proposed for the standard library a few years down the line? Would we then create yet another namespace? Settle on a less-than-ideal name simply because it’s not taken on PyPI? Or would we find ourselves having this same discussion again?

A much better and long-term (minimum 10 years) solution would be claiming an std namespace for all of stdlib. An example of this would be Rust, which puts its entire stdlib behind the std::* namespace. Then, all new modules can be added solely to std, for example, from std import zstd, while older modules remain available in both namespaces for years. Linters can then promote the std namespace until the core team decides it’s finally time to ditch the older flat namespace.

This is a significant change that deserves its own PEP to allow for thorough discussion of its design and implications.

6 Likes

Obviously I agree with most of your post. But I don’t think download counts are actually a good metric for “pain” when it comes to naming. 2.45M of those downloads might be one change in one popular package.

Like you, I worry more about the state of the world in ten years time, when people are posting “gotchas” about our inconsistent standard library, and complaining about how Python “doesn’t make sense” and why do you use ZipFile.open but tarfile.open (or is it the other way around?) and json.loads is different from tomllib.loads. As it turns out, very few people will remember the historical context at that point (source: people in this thread forgetting the historical context behind current inconsistencies :wink: ).
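To make those paper cuts concrete, here is the json/tomllib asymmetry mentioned above in runnable form (guarded, since tomllib only exists on Python 3.11+):

```python
import io
import json

# json accepts text everywhere:
assert json.loads('{"a": 1}') == {"a": 1}
assert json.load(io.StringIO('{"a": 1}')) == {"a": 1}

try:
    import tomllib  # stdlib since Python 3.11
except ImportError:
    tomllib = None

if tomllib is not None:
    # loads() looks the same as json's...
    assert tomllib.loads('a = 1') == {"a": 1}
    # ...but load() insists on a *binary* file object, where
    # json.load() accepts a text one:
    assert tomllib.load(io.BytesIO(b"a = 1")) == {"a": 1}
```

Each inconsistency is small on its own; the argument above is that they compound across the library and across years of users.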

We have processes for managing transitional pain, and I’m more and more convinced that deprioritising the stdlib imports might be another improvement on that[1]. But we don’t have processes or systems for dealing with changing the stdlib over time to be more consistent, and in fact we explicitly don’t do this to avoid causing disruption.

So we really ought to commit ourselves to designing the stdlib we want in ten years, not the one that’s safest today.


  1. With a “magic” std module that provides only the stdlib modules - for those not aware, this is probably just a std.py file containing __path__ = [<stdlib directory>], or at most a new importer. ↩︎
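As a sketch of the footnote’s idea (an assumption-laden demo, not a proposal for the real mechanism), a std.py whose __path__ points at the stdlib directory does make import std.&lt;module&gt; resolve from the stdlib location; a module with a __path__ attribute behaves like a package:

```python
import os
import sys
import tempfile

# Build a throwaway std.py in a temp directory for demonstration purposes.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "std.py"), "w") as f:
    f.write("import sysconfig\n__path__ = [sysconfig.get_path('stdlib')]\n")

sys.path.insert(0, demo_dir)

# textwrap is now importable "through" std as a distinct module object.
import std.textwrap
import textwrap

assert std.textwrap is not textwrap
```

Note this naive version loads a second copy of each module (std.textwrap is not textwrap), which is one reason the footnote hedges with “or at most a new importer”.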

20 Likes

I created Add support for the free-threaded build and upload cp313t wheels · Issue #32 · Rogdham/pyzstd · GitHub. Let’s move further discussion about free-threaded support to the pyzstd issue tracker.

4 Likes

Feel free to propose your own PEP for that. It isn’t going to happen here.

I am well aware.

Regardless, I still don’t believe that a top-level compression package in the stdlib is a good idea. The name is too long and too generic, but most of all: this is a separate topic and should not block adoption of a new Zstandard module in the stdlib.

There are plenty of top-level names that could be used instead of the existing “zstd” package name, e.g. “zstd1” (based on the major version number), “zstdcomp”, “zstdc” (both adding an indication that this is a compression lib), “zst” (based on the file extension), “libzstd” (following the name of the .so file), or even the original package name “pyzstd”.

Or we just use “zstd” with the process suggested by Steve.

9 Likes

The PEP mentions a ZstdFile API. With the existing module names zipfile and tarfile, wouldn’t zstdfile be the obvious candidate for the new module?

1 Like

zipfile and tarfile are container formats; the compression modules are named zlib, gzip, bz2, and lzma, suggesting zstd, brotli, lz4, or lzip for new modules.

What about zstandard though? Clear, without conflict with pyzstd or zstd, and nicer than zstdlib.

[edit:] zstdmod

4 Likes

I can see lots of support for adding Zstandard to the standard library, but no consensus for the compression package.

There’s just under a month until Python 3.14’s feature freeze.

Here are two options.

One:

  • Keep this PEP more-or-less as is: add Zstandard and the compression package. I don’t expect this to be decided in time for 3.14, but there is plenty of time before 3.15’s feature freeze in May 2026.

Two:

  • Focus this PEP only on adding Zstandard to the stdlib. There’s not a lot of time, but there is still a chance to revise the PEP, submit to the SC for consideration, have a PR reviewed and merged before the freeze.

  • And defer the compression package to another PEP targeting 3.15. Zstandard can join the 10-year deprecation along with the other compression modules. Plenty of time until May 2026 to try to find consensus.

Some figures in this 2020 thread show Zstandard would be a great choice for packaging: it’s much faster and produces smaller files.

But to be useful in packaging, it needs widespread adoption, and an extra year’s head start is important. Therefore I recommend focussing this PEP only on adding zstd to the stdlib and aiming for 3.14.

11 Likes