After spending a number of weekends on the reference implementation, I have finally had time to write up a PEP about adding support for Zstandard compression to the standard library.
Abstract
Zstandard is a widely adopted, mature, and highly efficient compression standard. This PEP proposes adding a new module to the Python standard library containing a Python wrapper around Meta’s zstd library, the default implementation. Additionally, to avoid name collisions with packages on PyPI and to present a unified interface to Python users, compression modules in the standard library will be moved under a compression.* namespace package.
While this PEP currently targets implementation in 3.14, if the window between acceptance and the beta cutoff is too close, the PEP implementation can be pushed out to 3.15.
Thanks for the PEP, looking forward to zstd in the zstdlib!
I have one small question: the PEP mentions compression being a “namespace package” twice, but doesn’t explicitly say whether that means an actual PEP 420 namespace package, and if so, why. The comments about “compression” already being a reserved PyPI package name suggest that namespace collisions with existing packages are not a driving concern for this.
If the intent is to make it a PEP 420 namespace package, I think we need to be very careful and deliberate with uses of compression.zstd in the stdlib, and not move any existing modules into compression (providing shims in compression is fine, as long as nothing in the stdlib uses them for the existing compression modules). Any module named compression would shadow the stdlib namespace package, and any imports of compression in the stdlib would fail in confusing and unpredictable ways. The PEP could use some mention of the risks there.
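To illustrate why that shadowing is so surprising, here is a small sketch (with a hypothetical demo_compression name, not the real proposed module): under PEP 420 resolution, namespace portions are only recorded while the scan continues, and a regular package found anywhere on sys.path wins outright, even when it appears later on the path.

```python
import os
import sys
import tempfile

# Hypothetical setup: one sys.path entry provides a namespace portion,
# another (later!) entry provides a regular package with the same name.
root = tempfile.mkdtemp()
ns_dir = os.path.join(root, "ns_site")
reg_dir = os.path.join(root, "reg_site")
os.makedirs(os.path.join(ns_dir, "demo_compression"))   # no __init__.py: namespace portion
os.makedirs(os.path.join(reg_dir, "demo_compression"))  # regular package below
with open(os.path.join(reg_dir, "demo_compression", "__init__.py"), "w") as f:
    f.write("KIND = 'regular package'\n")

sys.path[:0] = [ns_dir, reg_dir]  # namespace portion is *earlier* on sys.path
import demo_compression

# The regular package wins anyway; any code expecting the namespace
# package's submodules now fails with confusing ImportErrors.
print(getattr(demo_compression, "KIND", "namespace package"))
```

So a stray compression package anywhere in site-packages would completely hide a stdlib compression namespace package, not merely add to it.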
If the intent is not to make compression a PEP 420 namespace package, yay! Let’s not! But maybe don’t use the exact wording “namespace package” (perhaps make it “top-level namespace” instead), and maybe mention making it an actual PEP 420 namespace package under rejected ideas.
I don’t think these two things should go into the same PEP.
Adding a zstd module will cause a lot fewer headaches than introducing a generic compression namespace, since the latter is most likely going to conflict with existing modules and packages used in applications out there (not just packages and modules on PyPI).
I agree! I think the experience of concurrent.futures tells us that the stdlib is better as a flat namespace.
futures are great, but concurrent did not work out. threading, asyncio, contextvars, etc happily live as top-level modules. I don’t think much would have been gained by moving or creating them under concurrent.
So I don’t think introducing compression (or a better, less generic name) would be good.
Thank you for pointing out that confusion. I think compression should not be a namespace package, since that would restrict the flexibility of the stdlib (which is part of the motivation for this PEP). I don’t know of any packages that distribute compression.foo, so I don’t think that is a consideration either.
I think the introduction of Zstandard motivates the creation of the compression package, so to me it seems reasonable to keep them in the same PEP. They could be two PEPs that reference each other, but I think that would make the compression package PEP hard to understand (and logistically more complicated). I think as long as compression.zstd stays the proposed interface, it makes sense to have them stay in the same PEP.
My concern with choosing a name that matches an existing package on PyPI is that it would make migration much more difficult for users of that package. If you are using advanced features of the zstd package, you lose access to those features and that package entirely once there is a module in the standard library taking precedence. They could do some importlib hacks I’m sure, but given zstd has over 2.5M monthly downloads, I don’t think it makes sense to break them without obvious recourse. If lz4 is going to be added to the stdlib in the future, is it acceptable to break the users who download it a total of 40 million times a month? For reference, that is twice as popular as Django.
When introducing a compression import name, I see a few cases where a top-level compression package or module could already exist in user code (to my understanding; if I have missed any cases or the below is wrong, someone please correct me):
1. There is a compression.py in the same directory as Python is run (implicitly at the front of sys.path). This will override the standard library module, but the user can rename this module if that causes issues with other packages.
2. A compression module is introduced onto sys.path some other, dynamic, way, e.g. sys.path.append. This is another case where the user has control over which module is imported, and can make any hard decisions.
3. The user ships a private package which has a top-level import compression. In this case, things break and they must rename.
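In all of these cases, one way for a user to check which file an import name would actually resolve to, before renaming anything, is importlib.util.find_spec (shown here against the existing lzma module, since compression does not exist yet):

```python
import importlib.util

# find_spec consults sys.path without importing the module, so it is a
# safe way to see which implementation of a name would win.
spec = importlib.util.find_spec("lzma")
print(spec.origin)  # path of the file that an `import lzma` would load
```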
Furthermore, by introducing a compression package, this change/break only happens once. If we decide not to namespace and instead introduce a top-level zstd, then lz4 in the future, and so on for any future compression library, each of those will break users.
I would rather choose to break some smaller subset of unknown users once (introducing compression) over breaking a large number of known users multiple times in the future (introducing zstd, lz4, etc.).
Only if the zstd maintainers are unwilling to migrate their package name, right? If the introduction of the built-in module coincided with a rename, then users could migrate[1], but most would be able to switch to the stdlib package with no changes at all.
The trade-off here depends on how many users would want to keep using the third-party package, versus how many would start using the stdlib version. If the former is large enough that you don’t want to disrupt them, that seems like an argument that this shouldn’t be added to stdlib in the first place (because it’s insufficient for most users).
No, because there are cross-version compatibility issues with doing that. Everything is just massively simpler for everyone if existing third party project names are left alone.
That doesn’t follow, as one rationale for adding things to the standard library is so that they can eventually become an assumed default feature (where different projects will assess when “eventually” has arrived based on the rate at which their userbase migrates to newer versions of Python).
I have no opinion on whether zstd should be under compression or kept at the top level, but I’m definitely concerned about renaming the existing compression libraries. That’s the sort of change, being done “to keep things tidy”, that is typically discouraged because it introduces real breakages for little or no practical benefit.
Many 3rd party projects will rely on the existing compression libraries. And they will often support a much wider range of Python versions than the core team does. So for those projects, there’s no point where a “clean” switch is possible - instead, they will need to carry compatibility code for at least some period of time. That code might be as simple as
```python
try:
    from compression import lzma
except ImportError:
    import lzma
```
but that’s still code that is only needed because we’re moving things around. And I think we should acknowledge that imposing that sort of cost on 3rd party projects needs a better justification than simply “it’s tidier”.
That’s a false dichotomy. It’s not the introduction of compression that breaks users. It’s moving existing libraries under it. And conversely, introducing compression without moving existing libraries still avoids the future breakage you’re concerned about[1] while not breaking users now.
I also feel that you overstate the future breakage, but that’s not relevant to my point ↩︎
I agree with Alyssa on this. Additionally, renaming a distributed package is very difficult to properly communicate and execute without breaking people or having users miss the change. I would expect a number of users to simply stay on the last release and then run into errors, unaware of why.
For the initial transition, I expect users to import both the stdlib and the third-party implementation, depending on Python version, to maintain compatibility. I expect the popularity of the zstd package to continue until most packages drop support for 3.13.
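That transitional dance might look something like the sketch below; compression.zstd is the name this PEP proposes for 3.14+, and using pyzstd as the pre-3.14 fallback is an assumption on my part (guarded here so the snippet degrades gracefully when neither is installed):

```python
import sys

try:
    if sys.version_info >= (3, 14):
        from compression import zstd  # stdlib module proposed by this PEP
    else:
        import pyzstd as zstd  # assumed third-party fallback for older Pythons
except ImportError:
    zstd = None  # neither implementation available in this environment
```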
Doesn’t this also apply to moving the compression packages to a namespace, though? Even with a very long deprecation period, it will break people – whereas it is always possible to find an unused name for a new stdlib package.
I also find the rationale in the PEP for hypothetical future compression.* packages uncompelling: there has been a recent trend to be hesitant in adding new packages to the stdlib, even in namespaces. Adding LZ4, or any other new compression library would face the same hurdles, of which naming is likely to be one of the least challenging.
It’s probably worthwhile asking the maintainers of the zstd PyPI module for their thoughts on this. Likewise, the maintainers of the pyzstd module should chime in with their thoughts on having their module included in the stdlib.
This is how we usually manage inclusion of PyPI modules or packages in the stdlib. The last time this happened (IIRC) was for the toml module. The compromise was to use the name tomllib instead of toml. I’m sure we can find a good compromise for the Zstandard implementation as well.
I have corresponded with Rogdham, the pyzstd maintainer, a number of times, and they reviewed the PEP draft before it was posted. Rogdham was enthusiastically supportive of the module’s integration and the PEP.
Reaching out to the zstd maintainers is a good idea, I will do that some time today.
My understanding here is that this worked well because the APIs were compatible. If zstd in the stdlib exported an API completely different from the zstd PyPI package, that surely will lead to compatibility headaches.
Previously we’ve renamed the package when it was brought into the stdlib. For example, PEP 680 imported the format-reading part of the third-party tomli as the stdlib’s tomlib.
If you only need to read TOML files and want maximum compatibility:
```python
import sys

if sys.version_info >= (3, 11):
    import tomllib
else:
    import tomli as tomllib
```
But if you want to write TOML (the advanced features), you can still use the third-party tomli directly.
Historically we’ve done this to use a “better” name, or a more consistent name. tomllib is an anomaly, where we chose a distinctly worse name (you misspelled it) with less consistency (with the other serialization format packages we have).
There’s no way to appeal to precedent on this. We just have to choose the name pragmatically, or come up with another clever idea (such as having site-packages shadow the stdlib but make stdlib.<module> always bypass it to the stdlib, or a compression not-quite-namespace package).
I disagree. We’re not likely to add zstd without it going in a proper submodule, because our top-level module namespace is already polluted and claimed by PyPI packages, and we’d end up with an unfriendly module name. Which is why I suggested doing the same thing I did for hashlib and making a similar concept for our compression batteries. Thus compression.*.
It seemed like a pretty obvious choice to me.
Namespaces (as in dots and submodules; I don’t care about PEP 420 “namespace packages” in this context) are a good thing.
The PEP gives a 10-year timeline for this. That is plenty of time to back out if future maintainers decide they’d rather keep our history of just squatting in the top-level package namespace for arbitrary things instead of grouping related functionality under a topical name, as we’ve successfully done with hashlib.
Sorry - I misread the relevant section of the PEP (“standard deprecation timeline” distracted me). Accepted, that does seem like a reasonable timeline for 3rd party code, even code that aims for extreme backward compatibility.
I’m no longer against the move, although I’d still describe myself as neutral rather than actively in favour.
For what it’s worth, including zstandard bindings in the standard library would also help out with free-threaded support.
That said, a cursory look indicates that pyzstd isn’t testing the free-threaded build. Might there be any thread safety issues due to assumptions around the GIL?
As far as I understand, it’s not safe to share zstd compression or decompression handles between threads.
Of course all of that can be fixed, but it would be good IMO if it were already handled before the code is added to CPython so it doesn’t need to happen later.
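For anyone unfamiliar with the pattern, the usual fix is one (de)compression context per thread rather than a shared one. A minimal sketch, using the stdlib lzma module as a stand-in since the zstd bindings aren’t in the stdlib yet:

```python
import lzma
import threading

# Incremental compressor objects carry mutable internal state, so each
# thread builds its own context instead of sharing one without a lock.
def worker(payload: bytes, out: list, idx: int) -> None:
    comp = lzma.LZMACompressor()  # context private to this thread
    out[idx] = comp.compress(payload) + comp.flush()

data = b"zstd thread-safety demo " * 100
results = [None] * 4
threads = [threading.Thread(target=worker, args=(data, results, i))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Each thread produced an independent, valid stream.
assert all(lzma.decompress(blob) == data for blob in results)
```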
Ah, sorry if that was confusing! I’ll update it to remove “standard deprecation timeline” and emphasize that it starts once the last Python version without compression leaves support, in 5 years, which means the modules will co-exist for a total of 10 years.