Calling `hash()` at runtime and providing PYTHONHASHSEED or key material directly

In our work, we use stable hashing quite a lot for partitioning purposes. Because PYTHONHASHSEED, when not set, causes the underlying key material to be generated randomly on interpreter startup, we cannot use the built-in hash() function to generate our hashes - a shame, both because it is built in (requires no dependencies) and because it is a very fast algorithm.
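
To illustrate the problem (a minimal demo; the string being hashed is arbitrary):

import os
import subprocess
import sys

# Without PYTHONHASHSEED, each interpreter derives fresh key material at
# startup, so str/bytes hashes differ from run to run:
cmd = [sys.executable, "-c", "print(hash('partition-key'))"]
print(subprocess.run(cmd, capture_output=True, text=True).stdout)
print(subprocess.run(cmd, capture_output=True, text=True).stdout)  # almost surely different

# Pinning PYTHONHASHSEED restores determinism, but for the whole interpreter:
env = {**os.environ, "PYTHONHASHSEED": "123"}
print(subprocess.run(cmd, capture_output=True, text=True, env=env).stdout)
print(subprocess.run(cmd, capture_output=True, text=True, env=env).stdout)  # identical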

For hashing Python objects, I have been unable to find an existing implementation that is as fast in serial use (one call after another). I’ve tried several, including off-the-shelf cityhash and xxhash. My presumption is that this has something to do with CPython having optimized routines for recursively selecting the relevant parts of an object to pass to the underlying hash algorithm, but it seems to be a fact regardless. And even if it weren’t, the built-in availability weighs heavily for me as well.

I would like the ability to use the built-in hash algorithm on Python objects (including any strings or bytes objects they contain) while setting the seed/key material for that specific call (or rather, for the entire call stack, whether threaded, green/async threaded, etc.). I do not want to have to set PYTHONHASHSEED itself, which affects the entire interpreter, for several reasons. One of those reasons is that I sometimes have web services (theoretically vulnerable to the attacks that motivated the introduction of a random seed) that would nevertheless derive great advantage from being able to compute the same partition value in certain non-attack-vulnerable use cases. Spawning a separate process with a given hash seed and asking it to compute the hashes introduces an unacceptable amount of latency, as you would expect.

Ideally, this would be an argument to the call to hash() itself, e.g. hash(foobar, seed=123). I can easily imagine that changes to builtin function signatures may be very difficult or completely unacceptable, though I assume this would be backward-compatible in the simplest sense. Perhaps an alternative would be to have hash use a context-var-like seed, though in general I would think we wouldn’t want to introduce new tests/branches in something so performance-sensitive, so it would seem preferable to have the value available in locals or something very like it.
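
To make the shape of the proposal concrete (a hypothetical signature - nothing like this exists today):

# Hypothetical, for illustration only; hash() accepts no such argument today.
stable = hash(("user", 42), seed=123)    # same value in every interpreter
default = hash(("user", 42))             # still uses the per-startup random key
assert hash(("user", 42), seed=123) == stable   # deterministic when the seed is explicit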

I have looked at the pyhash.c implementation in CPython, and what I believe I understand is that the key material for SipHash and the other built-in hash implementations is set up very early during startup; there are then wrappers that call into the underlying algorithms (see Python/pyhash.c at commit 7595e6743ac78ac0dd19418176f66d251668fafc in python/cpython on GitHub) with the standard key material, which is derived from the seed if provided, or randomly generated if not. So in theory, the interface I have proposed would either have to derive the key material from the seed each time (affecting performance), or there would have to be a way to cache the key material based on the seed, or to provide the key material directly to the algorithm. An alternative would then be hash(foobar, keymaterial=function_which_generates_keymaterial_from_seed(123)) - I do not know what part of the standard library would most reasonably provide such an interface.
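
As a sketch of what the key-derivation helper floated above might look like (entirely hypothetical - this is not how CPython derives its key material):

import hashlib

# Hypothetical helper: deterministically expand an integer seed into the
# 16 bytes of key material that SipHash expects.
def function_which_generates_keymaterial_from_seed(seed: int) -> bytes:
    return hashlib.sha256(seed.to_bytes(8, "little", signed=True)).digest()[:16]

km = function_which_generates_keymaterial_from_seed(123)   # identical bytes in every process
# hash(foobar, keymaterial=km)   # the proposed (non-existent) call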

I have not looked into optimizations that might or might not be happening at higher levels of the interpreter, including whether or not the return value of __hash__ for Python objects is cached in a way that would cause a second call with a different seed to incorrectly return the first value. I am not proposing to change the __hash__ interface itself; my hope/assumption is that the change could be made inside the interpreter, passing the key material “through” the calls to __hash__. I have not in any way validated this assumption.

A completely different approach might be to expose the existing algorithm, with the same interface as hash, via hashlib. I fear, though, that this would violate the principles of hashlib itself, since all of its algorithms require your data to be converted to bytes prior to hashing, and one of the major advantages of hash is that no such conversion is required - the interpreter itself supplies the underlying memory buffers of the objects being recursively hashed.

TL;DR: it would be very useful for Python to expose the same built-in hash algorithm and object-hashing interface, but with an explicitly provided seed/key. This would help use cases where a stable hash value is desired across interpreters, while maintaining the existing built-in security advantages of default calls (which would continue to use the randomly generated key material). I don’t believe such a change would violate any principle of least surprise (referential transparency is maintained for hash as long as the seed is provided explicitly). My initial glance at the low-level hash code, in which I am very much not an expert, suggests that something along these lines should be reasonably possible if interfaces could be agreed upon, but I may very well be missing important considerations.

Thanks for reading!


In my humble opinion, overriding __hash__ would be much easier:

import uuid

class CustomObject:
    def __init__(self, data):
        self.data = data
        # self.uuid = int(uuid.uuid4())  # Generate a random UUID
        # print(self.uuid)
        self.uuid = 145435541424994643513817475646028941014
    
    def __hash__(self):
        return hash(self.uuid)
    
print(hash(CustomObject('data')))

Result:

958826736164597413

I would need to override __hash__ for every object that I wanted this to work with. That’s a significant implementation burden, especially if I’m trying to hash objects that are already natively hashable, but that are defined by third-party libraries, etc.

Your example seems to suggest that caching a uuid would be sufficient, but it’s evident that this would not work across multiple processes, which would not know how to generate the same UUID. My principal use case is partitioning for parallelism, which is why I need a solution that works across parallel processes deterministically.

The implementation of this suggestion requires either a change in how __hash__ methods are implemented so that context can be passed down (i.e. def __hash__(self, keymaterial): return hash(self.innerards, keymaterial)), or it requires hash to keep context-var state. I don’t see any possible way to avoid one of these two. The first is a breaking change, or at least would require the entire ecosystem to change their __hash__ implementations to be compatible; the second probably represents an unacceptably huge performance hit for such a fundamental operation in Python.
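
For what the context-var route might look like, here is a pure-Python sketch using a hypothetical stable_hash built on hashlib’s keyed BLAKE2 (the real builtin hash() could not consult a context variable without growing exactly the kind of branch described above):

import contextvars
import hashlib

_key = contextvars.ContextVar("hash_key", default=b"\x00" * 16)

def stable_hash(s: str) -> int:
    # keyed stdlib hash standing in for the builtin; blake2b takes an explicit key
    d = hashlib.blake2b(s.encode("utf-8"), key=_key.get(), digest_size=8).digest()
    return int.from_bytes(d, "little")

token = _key.set(b"sixteen byte key")   # scoped to this call stack / async task
try:
    print(stable_hash("alice"))         # deterministic for this key, in any process
finally:
    _key.reset(token)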

The only real option I see is reimplementing the hash walking independently of the builtin function, without relying on the existing __hash__ implementations - which is something you can already do today. AFAIK, the only builtin data types affected by PYTHONHASHSEED are str and bytes; specifically, the numeric data types are not affected, so you don’t need to reimplement those.
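
This is easy to check; on a 64-bit CPython build, numeric hashes are arithmetic rather than keyed (the assertions below reflect documented behaviour):

# int hashes are reduced modulo the Mersenne prime 2**61 - 1, no seed involved:
assert hash(42) == 42                        # small ints hash to themselves
assert hash(2**61 - 1) == 0                  # wraps around at the modulus
assert hash(1) == hash(1.0) == hash(True)    # equal numbers hash equal, by design
# str hashes, by contrast, change with the interpreter's seed.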

It doesn’t suggest that. It is commented out, merely explaining how self.uuid was generated. You can use any deterministic algorithm to generate unique IDs, such as a simple incrementer.

I’m interested in this. Can you point me to an implementation of this that exactly mirrors what Python itself does today? Preferably one that uses the visitor pattern, such that there’s no copy-paste involved, and I can reuse the existing code. I have in the past (ab)used pickle for this, though I have never tested it for performance or for exhaustive correctness.

Then I could, in theory, call a hash function on each leaf node - though, again, my preference would be to use the fast and strong hash algorithm built into Python rather than requiring a third-party module, which returns us to the original problem to a certain extent.
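
As a point of reference, the pickle (ab)use mentioned above can be sketched in a few lines (illustrative only - pickle output is not guaranteed stable across Python versions, and dicts serialize in insertion order):

import hashlib
import pickle

def pickle_hash(obj) -> int:
    # serialize, then hash the bytes with a keyed stdlib algorithm
    payload = pickle.dumps(obj, protocol=5)
    digest = hashlib.blake2b(payload, key=b"sixteen byte key", digest_size=8).digest()
    return int.from_bytes(digest, "little")

print(pickle_hash({"user": "alice", "scores": [1, 2, 3]}))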

Fair enough - I think it is just a red herring.

Wrapping every object that I want to test in a different object with its own, bespoke, implementation of recursive __hash__ is, I think most would agree, not a trivial endeavor.

No, because I don’t know of one. I’m saying that you can already implement this, which, yes, does involve reimplementing __hash__ for all relevant objects. If you have an implementation of this, it might be possible to consider adding it to the stdlib. But crucially, I can’t imagine a way forward that involves modifying or reusing the builtin hash.


The built-in hash function will cache the hash values of strings, on the assumption that it will always and only be using that specific hash seed. So you’re not going to be able to take this shortcut. Ultimately, even if the default hashing algorithm were to be exposed, you’d still run into the problem that hash(("a", "b")) is going to use the default tuple hashing, which will use the default string hashing, which will use the globally-set seed value.

My recommendation would be to implement your own partitioning hash function, recognizing only the data types that you know you’ll need to support (strings, obviously, and then whichever aggregate types you actually use). It can then use whichever algorithm you like (possibly a cryptographic one) for strings, and combine that as needed.

Yeah, for deterministic hashing purposes I agree you should just write your own. It’s simpler/easier than you think. You don’t need to support all types in the world as a key. Something like:

import functools
import hashlib
from collections.abc import Mapping, Sequence

# Leaf types we know how to hash directly (extend as needed).
BasicTypes = (str, bytes, int, float, bool, type(None))

def _primitive_hash(obj) -> str:
    # pick from hashlib or some other library; farmhash is a nice one supported across many languages
    return hashlib.sha256(repr(obj).encode("utf-8")).hexdigest()

def _combine_two_hashes(hash1: str, hash2: str) -> str:
    return hashlib.sha256((hash1 + hash2).encode("ascii")).hexdigest()

def fingerprint_hash(x: object) -> str:
    if isinstance(x, BasicTypes):  # leaves first: str and bytes are also Sequences
        return _primitive_hash(x)
    elif isinstance(x, Sequence):
        return functools.reduce(_combine_two_hashes, map(fingerprint_hash, x), _primitive_hash(""))
    elif isinstance(x, Mapping):
        # sort items so the digest is independent of insertion order
        # (assumes mutually orderable keys)
        return fingerprint_hash(tuple(sorted(x.items())))

    raise NotImplementedError("Unsupported type given, " + str(type(x)))

Something roughly like that is what I’ve used at work for years. Add a few more cases (maybe dataclasses via dataclasses.fields(), namedtuples are already sequences, etc.) and it covers enough.
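
For concreteness, a quick run of the sketch above (example data only):

record = {"user": "alice", "scores": [1, 2, 3], "active": True}
print(fingerprint_hash(record))   # same hex digest in every process and run
# key order does not matter, because Mapping items are sorted before hashing:
reordered = {"active": True, "scores": [1, 2, 3], "user": "alice"}
assert fingerprint_hash(record) == fingerprint_hash(reordered)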


I don’t quite understand. My suggestion involves caching the hash value of the object by subclassing it and overriding its __hash__ function. This approach aims to achieve stable hash values across processes without needing to modify every individual object.

Furthermore, implementing traversal algorithms based on your object’s data structure is straightforward, as demonstrated by @mdrissi

I appreciate the feedback in particular from @Rosuav, who has pointed out that there’s simply no way to make this happen without changing the signature of __hash__, since (presumably) the recursion itself is happening in pure Python, and therefore resorting to optimizations for where we ‘stash’ the key material won’t help. I do want this to ‘work’ on arbitrary Python objects, including those which implement their own __hash__ but defer to CPython for the hash of the all-important leaf strings and bytes.

My initial post did make clear that one of my primary goals here was to attain the same level of performance that we see in the built-in hash, and that I have not yet found an off-the-shelf hash library that provides the same performance that hash does. I again theorize (without evidence other than my own testing) that there are optimizations I cannot perform, given that all other hash algorithms of which I am aware require explicit conversion of leaf nodes to bytes, whereas the built-in algorithm operates directly on the leaf items (presumably because it already knows where the ‘essential’ bytes of those leaf types can be found inside the CPython interpreter). Plausibly someone could implement a CPython-optimized hash algorithm that operates directly on CPython objects with known byte layouts, but unless/until I find something like this, I believe I’m out of luck.

The outcome is disappointing, but was foreseeable, and I appreciate people taking the time to offer alternatives, etc.


You can use siphashc to create hashes, cache hash values, and access them twice as fast as the built-in hash() function.

import timeit
from siphashc import siphash


class Str(str):
    def __new__(cls, object=''):
        instance = super().__new__(cls, object)
        instance.siphash = siphash("sixteencharstrng", instance)
        return instance
    
    def __hash(self):
        return self.siphash


short = 'a' * 32
long = 'a' * 65536
shortStr = Str(short)
longStr = Str(long)

number = 1000
loop = 1000


# Benchmarking function for siphashc
def benchmark_siphash(string):
    for i in range(loop):
        h = string.siphash  # 2x faster
        # h = hash(string)
    
    return h


# Benchmarking function for built-in hash
def benchmark_builtin_hash(string):
    for i in range(loop):
        h = hash(string)
    
    return h


# Timeit setup for siphashc with a short string
siphash_short_time = timeit.timeit(
    "benchmark_siphash(shortStr)",
    setup="from __main__ import benchmark_siphash, shortStr",
    number=number
)

# Timeit setup for built-in hash with a short string
builtin_hash_short_time = timeit.timeit(
    "benchmark_builtin_hash(short)",
    setup="from __main__ import benchmark_builtin_hash, short",
    number=number
)

# Timeit setup for siphashc with a large string
siphash_large_time = timeit.timeit(
    "benchmark_siphash(longStr)",
    setup="from __main__ import benchmark_siphash, longStr",
    number=number
)

# Timeit setup for built-in hash with a large string
builtin_hash_large_time = timeit.timeit(
    "benchmark_builtin_hash(long)",
    setup="from __main__ import benchmark_builtin_hash, long",
    number=number
)

print(f"Siphashc with short string: {siphash_short_time:.6f} seconds")
print(f"hash     with short string: {builtin_hash_short_time:.6f} seconds")
print(f"Siphashc with large string: {siphash_large_time:.6f} seconds")
print(f"hash     with large string: {builtin_hash_large_time:.6f} seconds")

Result:

Siphashc with short string: 0.055731 seconds
hash     with short string: 0.095711 seconds
Siphashc with large string: 0.039609 seconds
hash     with large string: 0.063063 seconds

Isn’t that just measuring the cost of a function call plus C-level slot access vs an attribute access? TBF, I don’t know exactly what performance difference OP is talking about, but surely it isn’t this? Since this doesn’t calculate any hashes inside the loop.

Yes, it was a follow-up to my argument.

Here are the timings for hashing:

import timeit
from siphashc import siphash


class Str(str):
    def __new__(cls, object=''):
        instance = super().__new__(cls, object)
        instance.siphash = siphash("sixteencharstrng", instance)
        return instance
    
    def __hash(self):
        return self.siphash

number = 1000
loop = 1000


# Benchmarking function for siphashc
def benchmark_siphash():
    for i in range(2**64, 2**64 + loop):
        h = Str(i).siphash

    return h


# Benchmarking function for built-in hash
def benchmark_builtin_hash():
    for i in range(2**64, 2**64 + loop):
        h = hash(str(i))
    
    return h


# Timeit setup for siphashc
siphash_time = timeit.timeit(
    "benchmark_siphash()",
    setup="from __main__ import benchmark_siphash",
    number=number
)

# Timeit setup for built-in hash
builtin_hash_time = timeit.timeit(
    "benchmark_builtin_hash()",
    setup="from __main__ import benchmark_builtin_hash",
    number=number
)

print(f"Siphashc: {siphash_time:.6f} seconds")
print(f"hash    : {builtin_hash_time:.6f} seconds")

Result:

Siphashc: 1.093921 seconds
hash    : 0.355740 seconds

The difference is insignificant if the object is created once and checked multiple times.

Shouldn’t that be __hash__?

Yes, but I didn’t make use of it.