# What's the time complexity of set.add()?

I had a slight doubt: whenever we call `set.add(item)`, does it traverse the entire set, checking `item` for equality with the elements already inside the set?

Code:

```
class Foo:
    def __eq__(self, other):
        print("Called me.")
        return id(self) == id(other)

    def __hash__(self):
        return 1

    def __repr__(self):
        return "Dummy()"

s = {Foo(), Foo(), Foo(), Foo()}
print("==========")
s.add(Foo())
```

Output:

```
Called me.
Called me.
Called me.
Called me.
Called me.
Called me.
==========
Called me.
Called me.
Called me.
Called me.
```

If you look carefully at the output after `print("==========")`: at the point of the statement `s.add(Foo())`, the set `s` already has 4 elements in it. The fifth item we are trying to add, `Foo()`, appears to be checked for equality with each item already in the set (you can clearly see that `Called me.` occurs four times). Does this mean `set.add(item)` is `O(1)` or `O(n)`?

Yes, it's O(n), because you have broken `__hash__`. Hash table complexity relies on a good hashing function.
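To see the contrast, here is a small sketch (a hypothetical `GoodFoo`, not from the question) where `__hash__` is consistent with the identity-based `__eq__` instead of returning a constant; the equality scan disappears:

```
# Sketch: same idea as Foo above, but the hash matches the
# identity-based __eq__ instead of being a constant.
# (Simply not overriding __hash__ would behave similarly.)
class GoodFoo:
    def __eq__(self, other):
        print("Called me.")
        return id(self) == id(other)

    def __hash__(self):
        # Identity-based hash: distinct objects essentially never collide.
        return hash(id(self))

s = {GoodFoo(), GoodFoo(), GoodFoo(), GoodFoo()}
print("==========")
s.add(GoodFoo())
# Barring an (extremely unlikely) hash collision, "Called me." is never
# printed: no equality checks are needed, so the add is O(1) on average.
```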


To make this explicit: if you have a functional `__hash__` implementation (as all builtin immutable types do), then `set.add` will be `O(1)` (average case).
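One rough way to see that (an illustrative micro-benchmark, not part of the original answer): time a batch of adds as the set grows; with well-hashed keys such as ints, the per-add cost stays roughly flat instead of growing with `n`:

```
import time

# Illustrative micro-benchmark: per-add cost for well-hashed keys
# stays roughly constant as the set grows (amortized O(1)).
for n in (10_000, 100_000, 1_000_000):
    s = set(range(n))
    start = time.perf_counter()
    for i in range(n, n + 10_000):   # 10,000 fresh adds
        s.add(i)
    elapsed = time.perf_counter() - start
    print(f"n={n:>9}: {elapsed / 10_000 * 1e9:.0f} ns per add")
```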

A functional `__hash__` implementation needs to fulfill the following properties:

• `a == b` implies `hash(a) == hash(b)` (the reverse doesn't have to be true)
• `hash(a)` is constant as long as `a` exists.

If you break these two, `dict` and `set` straight up will not work correctly and you can get bogus results.
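A common way to satisfy both of these (just a sketch with a made-up `Point` class) is to derive `__hash__` from exactly the fields that `__eq__` compares, and to never mutate those fields after construction:

```
# Sketch: __hash__ built from the same fields __eq__ compares, so
# a == b implies hash(a) == hash(b); the fields are set once, so the
# hash stays constant for the object's lifetime.
class Point:
    def __init__(self, x, y):
        self._x = x
        self._y = y

    def __eq__(self, other):
        if not isinstance(other, Point):
            return NotImplemented
        return (self._x, self._y) == (other._x, other._y)

    def __hash__(self):
        return hash((self._x, self._y))

s = {Point(1, 2), Point(1, 2), Point(3, 4)}
print(len(s))  # 2 -- the two equal points collapse into one entry
```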

• `hash(a)` over all possible values of `a` is approximately evenly distributed across the entire range of 64-bit integers. In practice the range can be a lot smaller; it just needs to be larger than the size of the dictionaries you are going to be working with.

If you break this one (as you noticed), you won't get the performance behavior described above.
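As a sketch of that failure mode (hypothetical `Key` class): a hash that is perfectly legal but maps everything into a tiny range forces many `__eq__` comparisons, because far more keys share a hash value than the set can spread out:

```
# Sketch: a legal but badly distributed hash.  Only 8 distinct hash
# values exist, so building a set of 5,000 Key objects degenerates
# toward linear scans among keys that share a hash.
class Key:
    eq_calls = 0

    def __init__(self, n):
        self.n = n

    def __eq__(self, other):
        Key.eq_calls += 1
        return self.n == other.n

    def __hash__(self):
        return self.n % 8   # far smaller range than the set we build

s = {Key(i) for i in range(5_000)}
print(Key.eq_calls)  # huge: roughly n**2 / 16 comparisons for 5,000 inserts
```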

• `hash(a)` and `hash(b)` are different for all possible values of `a` and `b` with `a != b`

This is the ideal property, which of course can't be reached when your objects have more possible states than there are 64-bit integers, and it is also often computationally infeasible to achieve.

• `hash(a)` is fast. This can be achieved with caching, like for example the builtin `str` does.

Self-explanatory. The goal is to avoid expensive comparisons; if `hash` is slow, you lose that benefit.
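One simple caching pattern (a sketch, assuming the object is effectively immutable once constructed) is to compute the hash lazily and store it on the instance:

```
# Sketch: cache the hash of an effectively immutable object so the
# expensive computation happens at most once per object.
class BigRecord:
    def __init__(self, values):
        self._values = tuple(values)
        self._hash = None   # computed lazily on first use

    def __eq__(self, other):
        if not isinstance(other, BigRecord):
            return NotImplemented
        return self._values == other._values

    def __hash__(self):
        if self._hash is None:
            self._hash = hash(self._values)   # potentially expensive part
        return self._hash
```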
