Adding CRC message digests to the Python standard library

Dear Pythonistas, here’s an idea I’ve been mulling over for a long time.

I’ve spent the last 10+ years working on embedded systems, dealing with firmware images, custom communication protocols, safety standards, and log analysis, all of which require handling several CRC values.
I’ve prototyped a bunch of in-house bootloader host programs, real-time communication nodes, and binary log analyzers. Of course, my main language for these activities was Python.

Despite loving it, I quickly found that I had to either roll my own CRC calculators (often non-standard) or use the very few native ones.
There are some third-party libraries on PyPI, all completely different, yet all implementing almost the same ubiquitous CRC algorithm in some form.
Most are written in pure Python, which is very (very!) slow, especially for real-time communication, or when analyzing gigabytes of binary streams coming from development boards.
The ones written in C still require a compiler to be available each time an executable has to be handed out; this can be a hassle in some companies and departments.
One option was to create a private Cython module, to get both the benefits of a fast implementation and packaging for deployment in our labs.

So, eventually I came to this idea: adding native CRC support to Python itself, as a companion to the existing message digests, i.e. as a hashlib extension module, just like MD5 or SHA-1.
The many third-party attempts at creating CRC libraries, the message-digest nature of CRC itself, and the scarcity of CRC support in current Python should ring a bell: we need native support for CRC!

In my spare time I’ve already drafted a PEP and a reference implementation (with tests), which you can find at my forks (still WIP):

Please provide some feedback, because I’m willing to submit the PEP as a PR soon.
I haven’t found any other topics about this feature.

1 Like

Why is this? Why can’t you distribute a compiled wheel?

1 Like

Can’t you provide a pure Python fallback implementation, like simplejson does? I see it would be about 250x slower when the C version isn’t available, though.

Yes, eventually we made an internal Cython package with additional in-house stuff, for ad-hoc services.

The only public library I found was crc-ct, which is rather fine per se, yet with fewer features than the proposed implementation (e.g. sub-byte CRCs, wordwise optimization).

Yes, of course: a fallback pure Python implementation is suggested in the Possible Future Enhancements chapter of the drafted PEP. It would be rather easy to implement.
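For illustration, the layering would mirror the json/simplejson pattern: try the C accelerator first, fall back to pure Python. The _crc module name below is just a placeholder, and the fallback body is a plain bit-by-bit, zlib-compatible CRC-32:

    try:
        from _crc import crc32  # hypothetical C accelerator module
    except ImportError:
        def crc32(data: bytes, value: int = 0) -> int:
            """Pure Python fallback: reflected CRC-32, zlib-compatible."""
            crc = value ^ 0xFFFFFFFF
            for byte in data:
                crc ^= byte
                for _ in range(8):
                    crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
            return crc ^ 0xFFFFFFFF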

I haven’t found similar implementations under hashlib: they’re all C implementations, from either Python or OpenSSL.
What should I call the module made in pure Python? Simply Lib/crc.py, in contrast to Modules/crcmodule.c?

What I meant was, there are tons of C extensions available on PyPI that don’t require the installer to have a compiler. It seems like this could be another one.

I meant that you could create a PyPI package with a C implementation falling back to pure Python when no compiler is installed, because your main issue with the ones written in C is that they require a compiler and thus can’t be used in all scenarios.

And of course, then you would be able to use this right now, instead of having to wait until the version with the feature becomes available.

I should have thought of it immediately, but crcmod is already exactly this: a C extension that implements CRC digests and is installable without a compiler. The documentation says you need a compiler to build the extension, but most users don’t need to do that; they can install the wheel.

That package is used by Google’s gsutil CLI tool so it’s quite battle-tested.
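For context, crcmod builds a specialized CRC function from the polynomial and parameters; this example is adapted from its documentation (with the compiled extension present, the returned function runs at C speed):

    import crcmod

    # Standard CRC-32: the polynomial includes the leading x^32 term,
    # and these parameters reproduce zlib.crc32().
    crc32_func = crcmod.mkCrcFun(0x104C11DB7, initCrc=0, rev=True,
                                 xorOut=0xFFFFFFFF)
    print(hex(crc32_func(b"123456789")))  # 0xcbf43926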

1 Like

Yes, I could create a package on PyPI. I get your point (which is fine by me), but my idea is to place CRC directly into the Python standard library, alongside the existing native message digests.

Take MD5 or SHA-1 for example: there’s no fallback Python implementation, just the native one (plus OpenSSL optionally; let’s ignore this for now).
There’s no public module for MD5 or SHA-1, just the native _md5 and _sha1.
They’re meant to be accessed indirectly via hashlib, which is the “hub” for all the available hashing and message digest algorithms of the standard library.
A _crc extension to hashlib should follow the same architecture.
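Concretely, the dispatch works like this today; a _crc module would plug into the same mechanism (the "crc-32" constructor name below is purely illustrative, not an existing or PEP-defined name):

    import hashlib

    h = hashlib.new("md5")        # resolved to the private _md5 C module
    h.update(b"firmware image")
    print(h.hexdigest())

    # Under this proposal, a CRC digest would be reached the same way:
    # h = hashlib.new("crc-32")   # hypothetical, provided by _crc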

Yes, I forgot to mention crcmod, thanks.
Implementation-wise (excluding C code generation) it’s almost like crc-ct: bytewise table optimization, yet limited to a few bit widths.
That’s good for the most common CRC configurations, but we may need non-8-bit-aligned ones (and there are many of them).
I may be missing something though.

I mean, all the points you both mentioned are fine by me, I get them, and I’ve already solved my own tasks.
Despite this, I still think Python should provide more comprehensive support for CRC algorithms to the community, with a rather small effort code-wise, and fitting perfectly as a hashlib message digest.

In the end, the core CRC algorithm boils down to very few lines of code, which could cover ALL the possible configurations: at least up to 64 bits with high performance, and arbitrary widths in pure Python.
Indeed, the sheer number of third-party libraries trying to provide these few lines of code (in C or Python), and the lack of proper CRC support in the current Python standard library, tell me that the community would benefit from these lines being provided natively.
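To make “very few lines” concrete, here is my own sketch (not the PEP’s reference implementation) of a fully parametrized bit-at-a-time CRC over the classic Rocksoft/Williams model; it handles any width, including sub-byte ones:

    def crc(data: bytes, *, width: int, poly: int, init: int,
            refin: bool, refout: bool, xorout: int) -> int:
        """Generic bit-at-a-time CRC (Rocksoft model); any width >= 1."""
        mask = (1 << width) - 1
        reg = init & mask
        for byte in data:
            # Feed the 8 bits of each byte, LSB-first if input is reflected.
            for i in (range(8) if refin else range(7, -1, -1)):
                bit = (byte >> i) & 1
                msb = (reg >> (width - 1)) & 1
                reg = ((reg << 1) & mask) ^ (poly & mask if msb ^ bit else 0)
        if refout:
            reg = int(f"{reg:0{width}b}"[::-1], 2)
        return reg ^ xorout

    # CRC-32 (as in zlib) and sub-byte CRC-5/USB, from the same lines:
    assert crc(b"123456789", width=32, poly=0x04C11DB7, init=0xFFFFFFFF,
               refin=True, refout=True, xorout=0xFFFFFFFF) == 0xCBF43926
    assert crc(b"123456789", width=5, poly=0x05, init=0x1F,
               refin=True, refout=True, xorout=0x1F) == 0x19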

2 Likes

Anything wrong with zlib.crc32() (beyond being in a slightly obscure place)?

2 Likes

zlib.crc32() is just one of the many standard configurations.

Another common one I’ve been using extensively is binascii.crc_hqx() (CRC-16/CCITT-FALSE).

In embedded systems and telecom there are so many configurations around (as you can see), all of which would in the end benefit from the very same few lines of code, instead of constraining users to the very few configurations available in the standard library.
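For instance, both standard library entry points are just fixed instances of the general algorithm, here shown on the conventional b"123456789" check string:

    import binascii
    import zlib

    data = b"123456789"
    print(hex(zlib.crc32(data)))                # 0xcbf43926, CRC-32 only
    print(hex(binascii.crc_hqx(data, 0xFFFF)))  # 0x29b1, CRC-16/CCITT-FALSE only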

Would it be an option to add extra arguments to zlib.crc32() to give you some customisation?

Edit: probably not ideal if there are multiple crc functions in different modules.

But on the other hand, it looks like your current PR is about 3000 lines of code[1]…?

I appreciate that you want to add this to the stdlib, but it also sounds like there are a lot of different variations and features that you want to support, and so it doesn’t appear to be all that simple of an addition. Adding it is a multi-year commitment of maintenance and support.

I’m curious why there are so many different libraries out there: is it because they all do slightly different things? If so, doesn’t that imply that the stdlib version needs to provide all of those things if it’s going to replace them? Again, that suggests this isn’t a simple addition.


  1. not including tests

3 Likes

OK, that’s news to me!

I see on that page that it’s still evolving. Some entries there even say “Created: 8 August 2024”. There’s quite a lot of latency in adding things to the standard library (one feature release per year, plus five more years before the versions without the feature reach end of life), so if you want up-to-date coverage, the stdlib is not the place for this.

2 Likes

Excluding non-core CRC features (e.g. code generation or abstract classes), algorithm-wise they all look like a subset of the drafted PEP, except for arbitrary precision (which would be provided by a pure Python variant).

Yes, I agree that my branch isn’t trivial, but that’s just a reference implementation I made, with all the bells and whistles (including cached pre-calculations), kept as simple as I could.
I don’t have first-hand knowledge of the maintenance effort for the Python code base itself; I trust whatever the core developers think about it.

The core CRC algorithm itself is indeed simple, but extremely slow in pure Python: a bit-by-bit implementation in C is over 20 times faster than a bytewise optimization in Python (which is itself 250 times faster in C)!
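For reference, this is the kind of bytewise table optimization being compared here, sketched for the standard CRC-32 (still pure Python, so still far from C speed):

    # 256-entry lookup table for the reflected CRC-32 polynomial.
    _TABLE = []
    for n in range(256):
        c = n
        for _ in range(8):
            c = (c >> 1) ^ 0xEDB88320 if c & 1 else c >> 1
        _TABLE.append(c)

    def crc32_bytewise(data: bytes, value: int = 0) -> int:
        """Table-driven CRC-32, one lookup per byte; matches zlib.crc32()."""
        crc = value ^ 0xFFFFFFFF
        for byte in data:
            crc = (crc >> 8) ^ _TABLE[(crc ^ byte) & 0xFF]
        return crc ^ 0xFFFFFFFF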

Perhaps a straight C function could do the full job by itself, with support for all the possible CRC configurations, even without pre-computed tables for an initial release.
I can think about that and re-work the proposal with this focus.

1 Like

I think that page has become the de-facto summary for CRC implementors in recent times.

By the way, that’s just a collection of coefficients; users would have full flexibility in applying them to their own CRC configurations manually.

It is quite common that someone suggests some basic improvement for the Python stdlib, and then there will be a suggestion to make a PyPI package and see if it gains traction. I usually dislike those suggestions, because many things don’t make sense as a PyPI package: no one wants to add a dependency for some small quality-of-life improvement, and in reality the PyPI-package-becomes-stdlib-module route does not really exist for small features.

In this particular case though I think it is very clear that this should just be a PyPI package. I see zero chance of this entering the stdlib without first being a PyPI package and I see very little chance of it entering the stdlib even if the PyPI package is popular and successful.

The reasons this doesn’t make sense for stdlib seem clear:

  • Few users need it.
  • There is no particular reason it can’t just be a PyPI package.
  • It is not sufficiently stable or well-defined to be baked into the stdlib.

Those core developers can pipe up if they disagree but I expect their position is:

  • They don’t want to maintain it.
  • They would want you to maintain it if it was in the stdlib.
  • They prefer that you do that outside of the stdlib.

7 Likes

Yeah, sadly that’s the impression I got from this feedback as well.

Perhaps a bare something.crc(...) function (without any optimizations) might make it into the stdlib, but at that point, as a user, I’d lose interest in such a sub-optimal thing, albeit one less sub-optimal than a pure Python variant.
I had already trimmed the feature set down to the minimum that makes sense for a decent overall API, and I wouldn’t reduce it further, because there are already alternatives covering the more common, smaller feature sets.
Too bad; the existing hashing/digest algorithms are much more necessary for the average Python user (even unknowingly) than CRCs, so I can understand the reasons for the pushback.

Eventually I think I’m going to create yet another third-party module somehow, with all the bells and whistles. Tinkering with PEP drafts, CPython, and discussions was great fun and educational, though :smiley:

3 Likes