Does set membership use a different test than list membership

I have some Python code that has a user type called Person. I have tried making either a set of these objects or a list of these objects. The surrounding code generally creates new instances of this object. Person defines

    def __eq__(self, other):
        return isinstance(other, Person) and self.handle == other.handle

handle will be the same for different objects that actually refer to the same person, so this will check whether the different instances should actually be regarded as the same.

With a list, I say:

                    if child not in child_list:
                        child_list.append(child)

and this does not produce duplicates.

With a set I use update(), and this produces duplicates (i.e. multiple occurrences where the handle is the same).

    children = set(sdb.children(person))

    if additional_parents:
        for spouse in additional_parents:
            for family in sdb.parent_in(spouse):
                children.update(sdb.children(family))

This explanation says that set inclusion (and hence presumably update) tests the hash first (which is not defined for this user type, and hence is different for different instances of the user type).

HOWEVER, this Python documentation says that:

For container types such as list, tuple, set, frozenset, dict, or collections.deque, the expression x in y is equivalent to any(x is e or x == e for e in y) .

So I think that set append() should be using the same test as “not in” for lists (i.e. __eq__). Why does this not seem to be the case?

Part of the internals of sets (and dicts) is that they hash their elements (keys, respectively). Because of this, they will only work as expected if the __eq__ and __hash__ methods of whatever you put in them behave consistently, i.e. if x == y is only ever true if hash(x) == hash(y) is true. By default, Python will also throw an error if you try to hash an object whose class only defines __eq__ and not __hash__ since the default hash would otherwise most likely violate this requirement.

The easiest way to do this is to just add the following code to your Person class:

class Person:

    __hash__(self) -> int:
        return hash(self.handle)

Fwiw, another requirement of hashes is that the hash of an object does not change over its lifetime. Or at the very least that it does not change “while it’s being used”. If the hash (i.e. the Person’s handle when using the above implementation) changes while the object is in a set you will get very strange behaviour.

2 Likes

But that isn’t what the Python documentation says: It seems to say that x in y tests __eq__ for both sets and lists.

Is this documentation wrong?

The documentation isn’t exactly “wrong”, it just doesn’t cover the case of “you’re dealing with objects that have inconsistent (i.e., broken) __hash__ methods”.

The __hash__() comparison is just a quick check; if you have a collision, where two objects have the same __hash__(), then both sets and dicts fall back to the __eq__() check. (But the objects must at least have a __hash__() method in the first place, to be used as a set member or dict key.)

I think the documentation is wrong, because it says “the behavior is equivalent to this” and it is not actually equivalent to that. It should at least mention that hashes are involved, which it does not in that section.

2 Likes

I’ve tried doing a diagnostic print of hash(), and it shows that the objects have different hashes, and no exception is thrown, so presumably it is using the default definition as there is no definition for __hash__ for that object class.

Or to put it another way, perhaps the implementation of Python is wrong. Just shows you shouldn’t take shortcuts unless it is explicitly allowed in the definition of the language.

Should there be a bug report, if so, where?

The documentation is correct so long as your class is not broken. Python goes to some lengths to attempt to protect you from accidents, but if you define __eq__ and __hash__ and they aren’t matching, a lot of things will behave oddly.

Note that:

this isn’t the whole story. In order for your objects to be accepted in a set at all, you have to have defined __hash__. So the rule is quite simple: if you define __eq__, you have to also define __hash__ to match, or your objects won’t be accepted in sets.

1 Like

No, the implementation of the language is not wrong. It’s done what it can to protect you but it’s up to you to define __hash__ correctly if you define it at all.

1 Like

Are you sure that there is no definition of __hash__ in one of the parents of your class? Maybe in some library you’re using?

1 Like

There is. In Python 3 all classes are subclasses of object, and object.__hash__ is defined (in CPython, it happens to return the memory address of the object after rotating it right by 4 bits).

>>> class A:
>>>    pass

>>> a = A()
>>> hex(hash(a))
'0x1ce87a2266'
>>> hex(id(a))
'0x1ce87a22660'

I should have phrased things more explicitly, the question is whether there an implementation of __hash__ that isn’t object.__hash__, i.e. whether some parent that isn’t object is overriding it.

Of course, every class technically has an implementation of __hash__ in its MRO, but object.__hash__ will just raise TypeError if a clas only defines __eq__ and not __hash__. So effectively, classes that define __eq__ and not __hash__ don’t have an implementation of __hash__. (actually, I’m not sure whether that mechanism is part of object.__hash__ or some part of type’s class construction process, but I hope it’s clear what I’m trying to say)

1 Like

It almost doesn’t matter; if a specific class defines __eq__ and does not also define __hash__, it becomes unhashable.

Class construction process. It’s all documented here:

2 Likes

Then it should say that. There is no way for someone reading that section of the docs to know that their class is broken. hash is mentioned just above, but it’s in a list of bullet points that are introduced as applying to “the most important built-in types”, so even someone reading that wouldn’t know that they have to even know that hash exists.

You can create a post in the documentation topic, raise an issue on (I think) the cpython github page or create a pull request with your suggested improvement, core devs are happy to get external contributions but it is a volunteer maintained project.

Pretty much EVERYTHING assumes that your class is correctly behaved. The place to know whether your class is broken is the place where you defined the __hash__ method. If you do not explicitly define this method, your class will be correctly behaved. If you do define it, you need to do so in a way that is compatible with your __eq__ method.

Not everything has to be documented everywhere. At some point, you decided to create a __hash__ method. That is the point where you needed to be aware of the effect of that.

This is not behaviour specific to sets. It is also true of everything else that uses object hashes, because it is true of hashes, not sets.

True, but I would point out that this seems to be a relative newcomer to python experiencing a difficulty, the cause of which was unclear to them from the documentation that they’d read.

Be curious, not judgemental.

1 Like

I’m aware of that; but I’m pointing out WHERE the difficulty was actually encountered. It’s easy to say “this should have been documented in the one place that I would have looked for it”, but tell me, how many places should have a note added, saying “this doesn’t work if your object misbehaves”? Set and dict constructors, every method that adds to them, the in operator, any other object types that might use hashability… the list goes on and on. The correct place to learn about this is at __hash__, because that’s the one place that the problem actually happens.

If you put a note like that in every place that would misbehave if hash/eq are mismatched, will you also put a note in every place that misbehaves if a < b is not the same as b > a ? Because there are a LOT of places that assume that. Do we really need a Proposition 65 note against every method, or do we expect that people who write dunder methods can read the docs for the dunder method?

Okay, sure! I’ll be curious.

How many cautionary notes can be added to the docs before people stop reading them? I’m very curious what that limit is.

I’m not saying any proposed addition should be accepted and the same note added everywhere, but if there’s a specific bit of wording they find confusing then perhaps it can be improved. If several people raise the same problem, that’s a good indicator for improvement.