Make a stdlib for Non-Cryptographic Hash Functions (NCHF)

pepoluan · February 8, 2024, 4:33am

The hashlib module currently provide cryptographically-secure hash functions.

However oftentimes we don’t need a hash to be cryptographically secure.

I propose creating a new stdlib to provide implementations of NCHF.

My recommendations will be to add the xxHash family and CityHash family, as these two are extremely fast and seems to be frequently used.

Rosuav · February 8, 2024, 5:08am

What are the advantages of these hashing functions, and how beneficial is it to have them in the standard library (as opposed to being able to pull them from PyPI when needed)?

pepoluan · February 8, 2024, 5:14am

It saves yet another dependency.

Once a correct NCHF implementation gets added, I don’t see it will ever change.

This potentially prevent supply-chain attack if the NCHF library is part of stdlib.

Rosuav · February 8, 2024, 5:17am

If that is a big enough problem for someone, wouldn’t it be easier to use hashlib? Is there a use-case where hashlib’s offerings are too slow AND you need to avoid PyPI?

pepoluan · February 8, 2024, 5:26am

By its very nature, a cryptographic hash function (CHF) is very slow when compared to NCHFs. Mostly due to the necessity of first building an initial inner state, and also how to make sure an avalanche effect happens to the inner state.

Supply-chain attacks have already happened many times. Most (in)famous ones are in JavaScript npm libraries, but there had been several attacks to PyPI as well.

Well-known and widely used libraries such as requests have lots of eyeballs keeping them safe. However less-widely used libraries don’t have such luxury, and there is still a chance that a threat actor managed to hijack those libraries.

It’s not a question of “avoiding PyPI”, it’s an issue of “reducing potential attack vectors”. If NCHFs got integrated into stdlib, that’s one less potential attack vector to be concerned with.

Rosuav · February 8, 2024, 5:45am

Yes, I’m aware of this. I’m also very aware that performance is relative, hence asking about a use-case (“it would be nice” isn’t a use-case).

On my system, SHA1 (not considered secure any more but still exhibiting the avalanche effect etc) clocks in at nearly 3GB/sec:

$ python3 -m timeit -s 'import hashlib; data = b"a" * 1048576' 'hashlib.sha1(data)'
1000 loops, best of 5: 351 usec per loop

And that’s on a single core; you can multithread this easily (the GIL is released during hashing).

(Okay, on my slow disk server, it clocks in at 4.42ms/loop, which is only about a quarter of a GB/s, but still, that’s a fair amount of throughput.)

So, what is the use-case where this sort of speed is insufficient, you’re willing to sacrifice true reliability, but you’re NOT willing to sacrifice entire-data hashing (see this thread about file hashing for some examples of how partial-data hashing can still be remarkably effective), AND you are unable/unwilling to either use PyPI or to copy in some open source code to use?

And if you can find one use-case, can you find enough to justify adding a specific group of algorithms to the standard library? What’s correct for YOUR use-case might not be correct for someone else’s, so either the stdlib module has to be enormous (and a nightmare to maintain), or there’s a good chance that the next person stil won’t find what they want there, and will have to look elsewhere.

This seems like a pretty narrow requirement to me.

tim.one · February 8, 2024, 7:23am

Just popping in to note that, while they’ve grown more ambitious over time, the original driver for the CityHash and xxHash families was to provide extremely fast, high-quality (randomized) short (like 32-bit) hashes even for short input strings. The primary use case was to speed ordinary hash tables using string keys.

Toward that end, cuirrent xxHash family member XXH3 uses SSE2 tricks to reach speeds claimed to be faster than the platform “do nothing” memcpy().

But I think PyPI is the right place for this. The natural audience is programmers using “close to the metal” languages where every cycle can matter when building their own higher-level primitives. Just building a Python int object from a raw bare-metal hash output probably costs more than the hash function itself.

BTW, a few years ago I took some of xxHash’s core tricks & used them to improve CPython’s tuple hashing. The end result wasn’t really any faster than what we had, but was far more robust against rare but extremely pathological real-life input cases. It had some great ideas for achieving high “avalanche” quickly.

tim.one · February 9, 2024, 6:10am

BTW, the most interesting thing CPython could do with XXH3 is experiment with adopting it as CPython’s built-in string hashing function. According to the XXhash developer, XXH3 produces 64-bit hashes at about 10x the rate of CPython’s current SipHash (31.5 GB/s on his box vs 3.0 GB/s - and 0.8 GB/s for SHA1).

Ya, I know, XXH3 isn’t concerned with thwarting adversarial inputs. But the code I’m running on my own boxes couldn’t care less about that either .

Rosuav · February 9, 2024, 6:32am

Those who run web applications may disagree though. Remember that big kerfuffle a while back when practically EVERY web framework in the world - in every language - was vulnerable to hash collision attacks? We’d want to at least not risk something that bad.

tim.one · February 9, 2024, 6:51am

Then let them continue using SipHash. Or anything else they want. As I said, experimenting with different ideas is “interesting”. Enforced “one size fits all, by decree” is the opposite of interesting .

Rosuav · February 9, 2024, 7:01am

That’s fair, but if CPython is to consider changing its algorithm, it WOULD be for everyone (or else there’d be a flag to control it, if that’s worth the complexity).

tim.one · February 9, 2024, 9:03pm

Sure. When choices are available, a mechanism to pick one must exist, and there’s usually a default.

There are many applications where string-keyed dicts are crucial to performance and adversarial users aren’t an issue. That’s why XXhash (etc) exist.

For an obvious example, CPython’s own implementation of module and class __dicts__ (if you’re running a web server that allows users to execute their own arbitrary Python modules and classes, you’ve got bigger worries than them running a module with thousands of global names carefujly constructed to collide ). “One size fits all” doesn’t necessarily make good sense even within a single application.

I won’t be pursuing this, but someone else might want to.

pepoluan · February 27, 2024, 10:27am

Well, we already have PYTHONHASHSEED, maybe can consider adding PYTHONHASHALGO.