Details can matter. If hashes are uniformly random, then the probability of a collision depends on the “load factor”, the percentage of available slots that are actually in use.
However, Python very deliberately tried to do “better than” random in some cases, because dicts are so very crucial to its operation.
Chiefly, the number of slots available in a CPython hash table is always a power of 2: `2**k` for some int `k`, indexed (internally) by ints in `range(2**k)`. The first slot to look at is gotten by taking just the last `k` bits of a hash code. No division needed, just a fast masking (bitwise `&`) operation. “Nearly all” the bits in the hash code are irrelevant to this outcome.
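That first-probe computation can be sketched in a few lines of Python (an illustration, not CPython's actual C code; `first_slot` is a made-up name):

```python
def first_slot(obj, k):
    """Index of the first slot probed in a table with 2**k slots."""
    mask = (1 << k) - 1        # the k low-order bits
    return hash(obj) & mask    # fast bitwise AND; no division

# With 2**3 == 8 slots, only the last 3 bits of the hash matter.
# For small ints, hash(10) == 10, and 10 & 0b111 == 2:
print(first_slot(10, 3))   # 2
```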
That’s unusual, but also exploitable: it’s “why” `hash(i) == i` for “small enough” ints `i`. Then a dict of size `2**k` suffers no collisions when indexed across all the integers in `range(j, j + 2**k)`. The last `k` bits of the ints in that range (regardless of which “small enough” `j` is used) are unique.
We can’t actually construct an example of that, because the dict would grow to more than `2**k` slots along the way (it will not allow the load factor to get “too big”), but it remains the case that there are no collisions regardless.
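The uniqueness claim is easy to check directly by looking at the low bits themselves (the particular `k` and `j` here are arbitrary choices for illustration):

```python
k, j = 3, 17   # any small k, and any starting int j, will do
mask = (1 << k) - 1

# The first-probe slots for 2**k consecutive ints:
slots = [i & mask for i in range(j, j + 2**k)]
print(sorted(slots))   # [0, 1, 2, 3, 4, 5, 6, 7] — every slot hit exactly once
```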
That much has nothing to do with how wide hash codes are.
Where the width can matter is when lots of hash codes have lots of identical trailing bits. Then they’ll all collide on the first stab, and details get involved: higher-order bits of the hash code are folded in across stabs, and the wider hash codes are, the more stabs may be needed before finding bits that finally differ.
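CPython’s probe loop (in `Objects/dictobject.c`) shifts the hash right by 5 bits per stab and folds the result into the recurrence. Here is a Python rendering of that loop so the effect can be watched; the constants match CPython’s, but the function name is mine:

```python
PERTURB_SHIFT = 5   # CPython's constant

def probes(h, k, n):
    """First n slot indices probed for hash code h in a table of 2**k slots."""
    mask = (1 << k) - 1
    perturb = h
    i = h & mask
    for _ in range(n):
        yield i
        perturb >>= PERTURB_SHIFT        # fold in higher-order bits
        i = (5 * i + 1 + perturb) & mask

# Two hash codes with identical trailing 3 bits collide on the first stab,
# but diverge as soon as higher-order bits are folded in:
print(list(probes(0b0001_000, 3, 2)))   # [0, 1]
print(list(probes(0b1001_000, 3, 2)))   # [0, 3]
```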
In contrast, a more common design is to use hash tables with a prime number `P` of slots, and `hash_code % P` for the first stab; then every bit of the hash code affects the result. But division is much slower than masking, and especially so when dicts were first implemented, and the scheme we eventually settled on has stood up well over time.