What's the time complexity of set.add()?

RohitMugalya · March 27, 2024, 4:56pm

when I was reading the python docs at this part click here

I had a slight doubt that whenever we perform any set.add(item) does this happens to traverse the entire set for item equality with the elements already inside the set.

Code:

class Foo:
    def __eq__(self, other):
        print("Called me.")
        return id(self) == id(other)

    def __hash__(self):
        return 1

    def __repr__(self):
        return "Dummy()"


s = {Foo(), Foo(), Foo(), Foo()}
print("==========")
s.add(Foo())

Output:

Called me.
Called me.
Called me.
Called me.
Called me.
Called me.
==========
Called me.
Called me.
Called me.
Called me.

if you notice carefully after print("==========") the statement s.add(Foo()) at this point the set s has 4 elements in it, does the fifth item that we’re trying to add, the Foo() is it happens to being checked for equality with each items that are already in the set (which we can see clearly the called me. got to occur four times) does this mean set.add(item) is O(1) or O(n).

alicederyn · March 27, 2024, 4:58pm

Yes, it’s O(n), because you have broken __hash__. Hashtable complexity relies on a good hashing function.

MegaIng · March 27, 2024, 6:07pm

To make this explicit: If you have a functional __hash__ implementation (like all builtin immutable types have) then set.add will be O(1) (average case).

A functional __hash__ implementation needs to fullfill the following qualities:

a == b implies hash(a) == hash(b) (the reverse doesn’t have to be true)
hash(a) is constant as long as a exists.

If you break these two, dict and set straight up will not work correctly and you can get bogus results.

hash(a) over all possible values of a is approximately evenly distributed across the entire range of 64bit integers. In practice, the range can be a lot smaller, it should just be larger than the size of the dictionaries you are going to be working with.

If you break this one (as you noticed), you wont get the performance behaviors described above.

hash(a) and hash(b) are different for all possible values of a and b

This is the ideal property, that ofcourse can’t be reached when your objects have more possible states than there are 64bit numbers, and is also often computationally infeasible to achieve.

hash(a) is fast. This can be reached with caching, like for example the builtin tuple does.

Self explanatory. The goal is to avoid expensive comparisons. If hash is slow, you lose that benefit.