Newbie questions about hash

I’m replying here from Using pointer to return result of long_hash so as not to go off-topic.

So now I’m confused. Previously Serhiy said that, if the “no hash” value was zero instead of -1, hashes could be unsigned:

Thinking about a hypothetical new programming language, maybe statically typed: could hashes be limited to always being a uint64 or a uint32? Couldn’t that lead to a higher collision probability?

(Consider that I have only a superficial knowledge of hashing)

1 Like

The difference between int64 and uint64 is just the interpretation–they both have the same number of possible values (2**64). A signed integer just uses one bit for the sign.
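The “same bits, different interpretation” point can be seen directly with the struct module (a small illustration):

```python
import struct

# The same 64 bits reinterpreted: -1 as a signed int64 becomes
# 2**64 - 1 when the identical bit pattern is read back as an
# unsigned uint64.
bits = struct.pack("<q", -1)              # signed 64-bit, little-endian
print(struct.unpack("<Q", bits)[0])       # 18446744073709551615 == 2**64 - 1
```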

I don’t know why an unsigned type for hash values would be useful but presumably Serhiy does.

3 Likes

The number of collisions depends only on the number of bits used; whether the type is signed or unsigned doesn’t matter at all. @storchaka’s preference for unsigned numbers is purely aesthetic AFAIK. (and I agree)

2^32 and especially 2^64 are big. Hash collisions are not likely if the hash functions used are good enough (and those that Python uses are).

However, note that e.g. dict’s implementation only uses some of those bits for reasons of practicality, and will therefore have hash collisions all the time. The language’s API limit of 2^32 or 2^64 only matters if the number of objects in your data structure approaches it - and that just isn’t something modern computers can do (for Python, at least).

2 Likes

…oh my :smiley: Thank you for reminding me why uint is used in the first place. Maybe I’m too tired from work :stuck_out_tongue:

2 Likes

Details can matter. If hashes are uniformly random, then the probability of a collision depends on the “load factor”, the percentage of available slots that are actually in use.
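One way to see the load-factor dependence is a tiny simulation - a rough sketch with uniformly random “hashes”, not CPython’s actual probing:

```python
import random

def first_probe_collision_rate(load, slots=1 << 16, seed=0):
    # Rough simulation with uniformly random "hashes" (not CPython's
    # probing): insert slots*load random values and count how often a
    # new item lands on an already-occupied slot.
    rng = random.Random(seed)
    n = int(slots * load)
    used = set()
    collisions = 0
    for _ in range(n):
        s = rng.randrange(slots)
        if s in used:
            collisions += 1
        used.add(s)
    return collisions / n

rates = [first_probe_collision_rate(x) for x in (0.1, 0.5, 0.9)]
print(rates)   # the rate climbs as the load factor grows
```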

However, Python very deliberately tried to do “better than” random in some cases, because dicts are so very crucial to its operation.

Chiefly, the number of slots available in a CPython hash table is always a power of 2: 2**k for some int k, indexed (internally) by ints in range(2**k). The first slot to look at is gotten by taking just the last k bits of a hash code. No division needed, just a fast masking (“bitwise &”) operation. “Nearly all” the bits in the hash code are irrelevant to this outcome.
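That last-k-bits selection can be sketched in a couple of lines (illustrative only, not CPython’s actual C code):

```python
# Illustrative sketch of the first-probe computation: with 2**k slots,
# the starting slot is just the last k bits of the hash -- a fast mask,
# no division. (Not CPython's actual C code.)
k = 3
mask = (1 << k) - 1          # 0b111

def first_slot(obj):
    return hash(obj) & mask

print(first_slot(10))        # 2, since hash(10) == 10 and 10 & 0b111 == 2
```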

That’s unusual, but also exploitable: it’s “why” hash(i) == i for “small enough” ints i. Then a dict of size 2**k suffers no collisions when indexed across all the integers in range(j, j + 2**k). The last k bits of the ints in that range (regardless of which “small enough” j is used) are unique.
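The no-collision claim can be checked directly (a quick sketch, not CPython internals):

```python
# The last k bits of 2**k consecutive integers are all distinct, and
# hash(i) == i for small ints in CPython, so the first probes never
# collide across range(j, j + 2**k).
k = 4
j = 12345                       # any "small enough" starting point works
first_probes = {hash(i) & ((1 << k) - 1) for i in range(j, j + 2**k)}
print(len(first_probes))        # 16: every slot hit exactly once
```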

We can’t actually construct an example of that, because the dict would grow larger than 2**k slots (it will not allow the load factor to get “too big”) along the way, but it remains the case that there are no collisions regardless.

That much has nothing to do with how wide hash codes are.

Where the width can matter is when lots of hash codes have lots of identical trailing bits. Then they’ll all collide on the first stab. Then details get involved. Higher-order bits of the hash code are folded in across stabs, and the wider hash codes are, the more stabs may be needed before finding bits that finally differ.
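The folding-in of higher-order bits can be sketched roughly like this (the real code lives in CPython’s Objects/dictobject.c and varies by version; this is only an illustration of the idea):

```python
def probe_sequence(h, k, nprobes=6):
    # Rough sketch of CPython-style open addressing: higher-order bits
    # of the hash are gradually folded in via `perturb`. Not the real
    # implementation -- see Objects/dictobject.c for that.
    mask = (1 << k) - 1
    perturb = h
    i = h & mask
    seq = [i]
    for _ in range(nprobes - 1):
        perturb >>= 5               # PERTURB_SHIFT is 5 in CPython
        i = (i * 5 + perturb + 1) & mask
        seq.append(i)
    return seq

# 103 (0b1100111) and 7 (0b111) share their last three bits, so they
# collide on the first stab, but diverge once higher bits fold in:
a, b = probe_sequence(103, 3), probe_sequence(7, 3)
print(a[0] == b[0], a == b)   # True False
```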

In contrast, a more common design is to use hash tables with a prime number P of slots, and use hash_code % P for the first stab; then every bit of the hash code affects the result. Division is much slower, though, and was especially so when dicts were first implemented, and the scheme we eventually settled on has stood up well over time.
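The contrast can be sketched quickly (illustrative numbers, not any real table implementation):

```python
# With a power-of-2 mask only the low bits matter; with a prime modulus
# every bit of the hash affects the slot. (Illustrative sizes only.)
mask = (1 << 3) - 1        # 8-slot power-of-2 table
P = 11                     # small prime table size

h1, h2 = 0x0005, 0xA005    # same low bits, different high bits
print(h1 & mask, h2 & mask)    # 5 5  -- collide under masking
print(h1 % P, h2 % P)          # 5 1  -- separated by the prime modulus
```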

3 Likes

What if an initial size of 7 is used? 7 is 0b111, and you said in the other thread:

even if I lack the mathematical competence here.

1 Like

That works fine, but division by a single digit goes pretty fast regardless. 64-bit CPython uses 30-bit “digits” under the covers, and multi-digit operations are much slower. For mod 7, this code works without division (& much the same works to compute mod 2**k-1 for any k > 1, prime or not):

def mod7(n):
    assert n >= 0
    result = 0
    while n:
        result += n & 0b111
        n >>= 3
        if result >= 7:
            result -= 7
    return result

from itertools import chain
start = 897987987987987987016874685654
for n in chain(range(100), range(start, start + 1_000_000)):
    assert n % 7 == mod7(n), n

but takes O(n.bit_length()) time. Just peeling off the last 3 bits (as CPython actually does) takes tiny constant time regardless of how large n is.

1 Like

I’m interested in this process. Do you have any resource that explains it mathematically?

1 Like

While it’s probably surprising at first, it’s actually easy and shallow. Note that 8 is congruent to 1 modulo 7. So

8*i + j == 1*i + j == i + j (modulo 7)

It follows that if you add adjacent 3-bit slices of the input together, the sum is congruent to the original mod 7.

More generally, 2**k is congruent to 1 modulo 2**k-1.
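That generalization can be sketched as a hypothetical helper in the same spirit as the mod7 function above:

```python
def mod_mersenne(n, k):
    # Hypothetical helper: compute n % (2**k - 1) without division, for
    # n >= 0 and k > 1, by folding k-bit slices together -- valid since
    # 2**k is congruent to 1 modulo 2**k - 1.
    m = (1 << k) - 1
    while n > m:
        n = (n & m) + (n >> k)
    return 0 if n == m else n

print(mod_mersenne(123456789, 3) == 123456789 % 7)   # True
```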

It’s the same idea behind “casting out 9’s” in decimal, because 10 % 9 = 1.

So, e.g.,

123456789 == 1+2+3+4+5+6+7+8+9 (modulo 9)
== 45 == 4+5 == 9 == 0 (modulo 9)
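The worked example can be verified in a couple of lines (a throwaway digit_sum helper, just for illustration):

```python
# Verifying the worked example: decimal digit sums preserve the value
# mod 9, since 10 is congruent to 1 modulo 9.
def digit_sum(n):
    return sum(int(d) for d in str(n))

n = 123456789
print(digit_sum(n), digit_sum(digit_sum(n)), n % 9)   # 45 9 0
```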

1 Like