Does set membership use a different test than list membership

I’ve got a CS degree so the inner workings of sets and hashes are second nature to me, but you shouldn’t need one to use python. Maybe to use C++, but that’s more for your mental health.

1 Like

And you don’t! If you never define the __hash__ function, your object will behave correctly.

(1) I have done a search in my codebase for __hash__, and I don’t find any. (As a check I found many occurences of just the string hash, but only as part of other words). I have also looked through the Class inheritance hierarchy for __hash__, and don’t find any.

(2) Nevertheless, when I do a diagnostic print of hash(my_child), it prints a value, and that value is different for different instances that have the same underlying data, so compare equal because they have the same ‘handle’. Perhaps that is the wrong call, and I should have used my_child.__hash__()?

(3) From what is said, I think that the Class is NOT broken, because:

If a class defines mutable objects and implements an __eq__() method, it should not implement __hash__(), since the implementation of hashable collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket).

My class is mutable, so defining __eq__, but not __hash__ is correct for the class.

(4) However, the documentation says:

A class that overrides __eq__() and does not define __hash__() will have its __hash__() implicitly set to None . When the __hash__() method of a class is None , instances of the class will raise an appropriate TypeError when a program attempts to retrieve their hash value, and will also be correctly identified as unhashable when checking isinstance(obj, collections.abc.Hashable) .

and this does not seem to be what happens (has this changed from older versions of Python, because this code was tested in Python 2.7.8)

Python 2.7 User-defined classes have __cmp__() and __hash__() methods by default; with them, all objects compare unequal (except with themselves) and x.__hash__() returns a result derived from id(x) .

(5) But I think this whole business of whether and how __hash__ is defined is a side issue. The documentation for Membership test operations clearly says:

For container types such as list, tuple, set, frozenset, dict, or collections.deque, the expression x in y is equivalent to any(x is e or x == e for e in y) .

(same in Python 2.7) And I think I am entitled to rely on this behaviour.

(6) In fact, for lists, the membership in a list does not seem to use hash as a preliminary shortcut, it just seems to use the == (__eq__) method.

TLDR: I think the documentation for 6.10.2 Membership test operations is wrong, it should say that the Python implementation is entitled to use __hash__ as a shortcut when determining “x in y”, (is that only for sets, and not for lists?)

1 Like

Post your code so we can run it. Without that, it’s hard to be sure. Here’s a simple cut-down example:

>>> class X:
...     def __eq__(self, other): return isinstance(other, X)
...     
>>> {X(), X()}
Traceback (most recent call last):
  File "<python-input-5>", line 1, in <module>
    {X(), X()}
TypeError: unhashable type: 'X'
>>> class Y:
...     def __eq__(self, other): return isinstance(other, Y)
...     def __hash__(self): return 42
...     
>>> {Y(), Y()}
{<__main__.Y object at 0x7f8af0bb81a0>}
>>> class Z:
...     def __eq__(self, other): return isinstance(other, Z)
...     def __hash__(self): return id(self)
...     
>>> {Z(), Z()}
{<__main__.Z object at 0x7f8af0bb81a0>, <__main__.Z object at 0x7f8af0daf4d0>}
>>> 

X is not hashable and behaves sanely. Y is hashable and behaves sanely. Z is broken, as equal objects can have different hashes, so you can end up with misbehaving sets, dicts, and anything else that relies on hashability.

It doesn’t matter what’s in the hierarchy, only in the class that defines __eq__ and __hash__. Post your code.

I have no idea, and the specifics may have changed, but the broad effect will be the same.

Would you be satisfied if every instance of “is equivalent to” in the documentation were replaced with “is roughly equivalent to”? There are ALWAYS ways that you can disprove the equivalence, such as redefining the any function. Do you really want to claim that it’s “not equivalent” because you have broken fundamental assumptions? That’s not the point of the documentation.

Post your code.

I can’t post the code, because it is not a toy example, but some code in the middle of a large system.

I didn’t define the hash function, and the code doesn’t behave correctly.

It really shouldn’t be necessary for a discussion this long to discover that x in y is not equivalent to x == e for e in y.

I think there needs to be a note in this section that the implementation will use hash to determine membership, especially because this is not the case for lists.

[ After all this long discussion, I think I have now determined that

  • in Python 2.7.8, if __hash__ is not defined for a user defined type, then all objects compare unequal (except with themselves), but
  • in later versions of Python, " A class that overrides __eq__() and does not define __hash__() will have its __hash__() implicitly set to None . When the __hash__() method of a class is None , instances of the class will raise an appropriate TypeError when a program attempts to retrieve their hash value".

Consequently, for a correctly defined mutable user class:

  • in Python 2.7.8 objects in sets compare unequal (except with themselves) but
  • in later versions of Python, instances are unhashable so cannot be in sets.]

I don’t think I have broken fundamental assumptions. I think that the fact that lists and sets etc are said to behave the same, but they don’t makes it clear that the documentation is incorrect.

1 Like

What output do you get if you try this?

>>> print(Person.__hash__)
2 Likes

It shouldn’t be necessary for a discussion to be this long with people trying to guess what your code is. Post some complete runnable code that demonstrates what you are saying.

So make a minimal stripped down version that demonstrates the same behaviour:

4 Likes

So basically, you’re telling me that I have been completely lying to you about the behaviour of __eq__ and __hash__. You’re telling me that your code is perfect and Python’s just misbehaving because it feels like it. But you can’t post your code.

I think I’m done discussing this with you. Feel free to hate on me, hate on Python, whatever you like, but until you can post some actual code that shows the problem, I’m going to write this one off as “the code IS broken, but the author doesn’t realise it”.

I’ll let you dig around in the list object code, but I would be surprised if it called hash() when testing for x in y. If an object’s __hash__ method is present, it can do anything the programmer darn well wants. Lists are more than happy to include multiple objects which happen to hash to the same value. That doesn’t mean they are equal. Sets behave more like dicts, not lists, when testing for inclusion.

Just as a gedanken experiment, suppose I implemented two classes (A and B), each with a __hash__ method like this:

def __hash__(self):
    return 2

If I created a list:

>>> mylist = [A(), B()]

and asked if 2 was in mylist, it darn well better return False. Might 2 in mylist call hash()? I sorta doubt it, but if it did, the full equality check would have to be made, as the hash values would produce a collision. If I later instantiated A again and asked about its presence in the list:

>>> a = A()
>>> a in mylist

I would similarly expect the result to be False unless I’d rigged instances of A to only ever produce a singleton or something similarly unlikely.

I think you might be confusing __hash__ with an object’s identity (often its machine address). There are no constraints on the programmer who defines a __hash__ method. Again, I’ve not looked at the C source in a long while, but is and __hash__ are two quite different beasts. I could well be missing something, having followed along with the thread, but maybe not digested something properly.

On the more general topic of asking for help, you really need to be able to produce a small example others can inspect and execute. If you can’t or won’t, there’s no way for people trying to help you (like @Rosuav) to guess what might be wrong. I will grant you that maybe something needs to be corrected in the documentation, but without a small example, we can’t tell if you’re misreading the existing documentation or found some case it really needs to explain better.

2 Likes

I think nobody mentioned it before, so for completeness, there is a paragraph just before the “membership test operations” sections saying:

  • The hash() result should be consistent with equality. Objects that are equal should either have the same hash value, or be marked as unhashable.

I don’t think there’s a need for an extra sentence stating “if this isn’t satisfied, other code using your class might behave in unexpected way”; at least in the earlier requirements for comparisons being transitive and symmetric it’s very intuitive that weird things may happen if you don’t follow this.

Similarly, while it would not be incorrect to state “this operation assumes the objects inside to have correctly implemented comparisons/hash”, it feels like every second sentence in standard containers’ documentation would need this kind of warning, which would become a bit silly. (C++ standard and cppreference does this to some degree, but that makes the text much, much more complex.)

As for your specific issue, I agree with others that it’s impossible to guess what’s going wrong without seeing an example. All the code I know behaves the way others said, you shouldn’t have been able to put the object in the set if it was comparable but not hashable.

And in general, if you consider your objects to be mutable (or more specifically, the fields that determine their equality/uniqueness), then they’re just not supposed to be put in sets.

4 Likes

OK, here’s the example:

class P(object):
    def __init__(self, name):
        self.name = name
    def __eq__(self, other):
        return self.name == other.name

mother_child = P("John")
father_child = P("John")

# make a list
child_list = []
for child in mother_child, father_child:
    if child not in child_list:
        child_list.append(child)
print(child_list)

# make a set
child_set = set()
for child in mother_child, father_child:
    if child not in child_set:
        child_set.add(child)
print(child_set)

So the type defines __eq__, but does not define __hash__ . Under python 2.7.10, I get

$ python -V
Python 2.7.10
$ python test.txt 
[<__main__.P object at 0x10aa05ad0>]
set([<__main__.P object at 0x10aa05b10>, <__main__.P object at 0x10aa05ad0>])
$ 

So mother_child and father_child are two different objects, but they are equal. For a list they are recognised as equal, so only one appears in the list. For the set they are not recognised as equal, even though the Python documentation says that “in” uses “==” in both cases. So I get two objects in the set.

I am not saying my code is perfect and I am not saying Python is misbehaving. Now that I think I understand what is going on, I just think that the documentation is unclear. I think that Membership Test Operations (MTO) really ought to say that hash is used for membership tests for sets, but not for lists. If it had said that I would have known where to look to understand the behaviour.

(BTW, for Python 3.9.5 I get “TypeError: unhashable type: ‘P’” for the set. The very fact that the behaviour was changed to a run time error might indicate that the behaviour was unclear. But I still think the MTO documentation should have the same change, because membership uses hash for set membership, not only “==”, so the documentation is arguably incomplete.)

1 Like

Under python 2.7.10 I get

<slot wrapper ‘hash’ of ‘object’ objects>

Under Python 3.9.5 I get

None

Python 2.7 reached end of life 5 years ago, I suggest you update to a more recent version as everyone here will assume you’re using something supported

5 Likes

This page lists support status of various python versions (3.9 is near end of life too)

2 Likes

If you want a “fix”, add this to your test class:

    def __hash__(self):
        return hash(self.name)

In Python 3, a user-defined class ultimately inherits from object, so you get an __eq__ and a __hash__ from there. Once you override __eq__, since the idea of hash and equality must be compatible, you no longer get the inherited hash method - it becomes None as you’ve seen, and you have to add one.

Not sure what the docs should say here - make a proposal? dicts and sets are built on hash lookups, so you’re nowhere if you don’t have a hash method.

To add in, the datamodel documentation ( 3. Data model — Python 3.13.5 documentation ) seems to me to be fairly clear on this.

Here’s the revised example code for python3:

class P(object):
    def __init__(self, name):
        self.name = name
    def __eq__(self, other):
        print("__eq__ called")
        return self.name == other.name
    def __hash__(self):
        print("__hash__ called")
        return hash(self.name)

mother_child = P("John")
father_child = P("John")
other_child = P("Jim")

print(P.__hash__)
# make a list
child_list = []
for child in mother_child, father_child:
    if child not in child_list:
        child_list.append(child)
print(child_list)

# make a set
child_set = set()
for child in mother_child, father_child:
    if child not in child_set:
        child_set.add(child)
print(child_set)

print("child in list uses:")
print(father_child in child_list)

print("child in set uses:")
print(father_child in child_set)

print("child in set false uses:")
print(other_child in child_set)

print()

The relevant output is:

<function P._hash_ at 0x109e7ec10>
_eq_ called
[<_main_.P object at 0x109e8c9d0>]
_hash_ called
_hash_ called
_hash_ called
_eq_ called
{<_main_.P object at 0x109e8c9d0>}
child in list uses:
_eq_ called
True
child in set uses:
_hash_ called
_eq_ called
True

child in set false uses:
__hash__ called
False

The problem is that this Python documentation says that:

For container types such as list, tuple, set, frozenset, dict, or collections.deque, the expression x in y is equivalent to any(x is e or x == e for e in y) .

I think that this is not correct (or at least not clear), because “in” is not equivalent to “==” at least for a set, because clearly in this case “==” is not being used.

I suggest that the documentation should have added at the end of the above sentence something to the effect that “Note that the implementation may use ‘hash(x) == hash(e)’ as a preliminary test for ‘x in y’.” Ideally the note should indicate to which of list, tuple, set, frozenset, dict, or collections.deque the use of hash() applies.

You’re reading far FAR too much into the “equivalent”. If it were truly nothing more than that, sets would have no benefit over lists, and you would need to linearly scan when doing membership tests. This simply isn’t the case, and nobody would expect it to be.

It’s been said before but I will say it again: More notes in the documentation does not improve its ability to convince people of things.

Do you believe me this time? No? See, it turns out that saying things more than once in a discussion thread has the exact same problem. Repeated messages don’t actually add any weight.

In fact, the more places you have to say the same thing, the more likely that it just creates noise. If I have to say a third time “adding extra notes to the documentation isn’t a solution”, it isn’t going to be more convincing, it’s just going to be more noise in the thread. And it might make it LESS likely that important information is seen, because it’s drowned out by the repetition.

It’s like Proposition 65 notices. Since virtually everything in California causes cancer, people don’t take any notice of the warnings, and so they are useless. It would be exactly the same with more notes in the documentation - people would gloss over them. Remember, this isn’t the only thing that would require such notes; anything that refers to iterators would need to mandate that iter() returns self, for example.

If I still haven’t convinced you, it’s probably because adding extra words to the discussion doesn’t actually improve their ability to carry information.

Maybe the word “equivalent” is a bit too strong. It could perhaps say “functionally equivalent” instead.

The purpose of this statement in the docs is not really to explain the mechanics of how the containment check is implemented. Rather it is to clarify that x is e is checked before x == e which affects the output in the edge case of nan:

In [1]: nan = float('nan')

In [2]: nan2 = float('nan')

In [3]: nan == nan
Out[3]: False

In [4]: nan in [nan]
Out[4]: True

In [5]: nan in [nan2]
Out[5]: False

In [6]: nan is nan
Out[6]: True

In [7]: nan is nan2
Out[7]: False

Usually code does not have multiple nan objects like this but it is good to know that e.g. x in [x] holds for all x even if x == x does not.

Another edge case is when bool(x == e) would raise an error:

In [9]: import numpy as np

In [10]: a = np.array([1, 2])

In [11]: a in [a]
Out[11]: True

In [12]: any(x == e for e in [a])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[12], line 1
----> 1 any(x == e for e in [a])

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
1 Like

Economy of expression (from the doc style guide) might apply:

More documentation is not necessarily better documentation. Err on the side of being succinct.

It is an unfortunate fact that making documentation longer can be an impediment to understanding and can result in even more ways to misread or misinterpret the text. Long descriptions full of corner cases and caveats can create the impression that a function is more complex or harder to use than it actually is.

2 Likes