Constant hash for None

It’s nice that you doubt it, but I’ve benchmarked mine before bringing up this topic. Do you have an implementation of OrderedSet that’s competitive (within 10-20%) with set comprehensions / sets / frozensets?

Bad enough that it’s not worth doing something straight-forward that solves a problem, on the basis that it doesn’t solve other problems? Remember, this isn’t asking for special code in the set type; it’s simply asking for None to have a predictable hash value. Nothing more. This is a self-contained change. Is it fundamentally bad to not solve other problems, and thus we must not do something that solves this problem?

I honestly can’t see the value of this logic.

1 Like

I care about reproducibility, not ordering. I don’t care what order the items come out of the set when you iterate it. Just that it’s the same every time I repeat precisely the same calculation from scratch on the same input. It’s a meaningfully weaker requirement then to completely dictate what the order should be.

Okay, but I asked about dataclasses, not tuples. Dataclasses haven’t been around since 1991 and it’s not clear that they promise a deterministic hash. If it was documented, then the behaviour you are complaining about would obviously be a bug.

CPython is the reference implementation for the language, so there are behaviours which are not documented, like tuples of ints having predictable and consistent hashes, which is not actually documented anywhere but we can rely on it. I don’t think that dataclasses fall into that category. They’re relative new, and still in active development, unlike ints and tuples.

I appreciate that this is subjective, which is why this discussion has gone on so long. If there was a 100% objective right or wrong answer, the discussion would be over long ago: either dataclass hashes are buggy, and you should raise a bug report, or they are behaving as designed, and the discussion is over.

The status quo wins a stalement.

I don’t particularly care one way or another whether your tree has a deterministic hash, but there are lots of ways you can solve this that doesn’t involve changing the language.

There are, I think, three factors here:

  1. You, apparently, choose to use address randomization, which is why the hash of None changes; if this address randomization is causing you grief, maybe you should stop using it.

  2. You choose to use None in your trees, which is why the hash of the trees change. Again, if your design of the tree using None is causing you problems, maybe you should use a different design. (See below.)

  3. And then you put the trees in a set, which has a non-deterministic iteration order.

You might remember that my very first response to your post was ask if you were sure that the hash of None varies from run to run, because I couldn’t replicate that result. So that’s your first problem: you have address randomisation enabled. Maybe you should just turn that off?

Second issue: you don’t have to use None in your tree. Sure, it is more convenient if you do, but if using None breaks things, then don’t use None.

Or you could continue to use None, but swap it out for some other value when you compute your tree’s hash. You don’t have to use None when computing the hash! A basic sketch of the idea:

class Node:
    def __hash__(self):
        left = self.left
        if left is None: left = -1
        right = self.right
        if right is None: right = -1
        return hash((left, right))

Nothing else needs to change.

Or you could convince the core developers that hash(None) should not depend on the address of None. I understand that’s your preferred solution. Then address randomization won’t matter – until such time as you use a different object, like Ellipsis, or NotImplemented, or object().

Third issue: even if you fix the hashing of your tree, even if we agree that None should have a consistent hash even when address randomization is on, it is still unsafe to assume that set iteration order will be predictable. If you care about program correctness, you should not assume that set order will always be the same given the same set input order. If you care about set order, you should not rely on the native, implementation-dependent set order.

Depending on your needs, that might mean using an actual ordered set, or it might just mean sorting the set elements before iterating over them.

Or just live dangerously. Its your code. But don’t come complaining to us if set iteration order unexpectedly changes in Python 3.12.3 :slight_smile:

1 Like

shrug Does this count?

$ python3.10 -m timeit -s "myset = set(range(10000))" "9134 in myset"
5000000 loops, best of 5: 51.9 nsec per loop

$ python3.10 -m timeit -s "myset = dict.fromkeys(range(10000)).keys()" "9134 in myset"
5000000 loops, best of 5: 54.9 nsec per loop

Without knowing what you benchmarked, and where your code bottlenecks are, it is hard to tell what works for you.

a dict is not a MutableSet
implement a class that supports the suitable interface, then measure performing actual set operations, not just constructing a set, then come back and state your argument. I did all that, I measured slowdowns of around 1.6x-2x.

Okay, but I asked about dataclasses, not tuples. Dataclasses haven’t been around since 1991 and it’s not clear that they promise a deterministic hash. If it was documented, then the behaviour you are complaining about would obviously be a bug.

Stop deflecting already. There is nothing wrong with dataclass hashes. They generate a deterministic digest from the hashes of all their fields, just as tuples do. It’s just that None ruins the party for everyone just because.

If I used nested tuples for my trees, I wouldn’t have nice static type hints for my code (since last I checked recursive type definitions are not supported by any current type checker).

You can trivialize my problem all you want, it’s not convincing. Yes, if you know that you got screwed again by None footgun, it is “easy” to work around it. That doesn’t mean that it’s a good idea for the footgun to exist.

This post was flagged by the community and is temporarily hidden.

Too argumentative. Closing due to multiple flags.