Hashing & equality of sequences containing nan

I’m wondering about how the presence of IEEE-754 special values affects behaviour of tuples containing them in sets and dicts.

The python3 REPL session below shows that nan in sequences doesn’t prevent them from working as keys.

I tried to track down the spec behaviour, and found this:

Sequences (instances of tuple, list, or range) can be compared only within each of their types, …
Sequences compare lexicographically using comparison of corresponding elements. The built-in containers typically assume identical objects are equal to themselves.

iiuc, that last sentence means that (math.nan,) == (math.nan,) because it’s essentially doing math.nan is math.nan || math.nan == math.nan when comparing corresponding elements.

But the “typically” there seems like it’s an optional optimization; a conforming Python runtime could elect not do that in which case (math.nan,) != (math.nan,) would be spec-compliant.

Am I misreading the documentation? What can one rely on when trying to write Python code that might use tuples containing nan in sets/dicts?

$ python3
Python 3.10.8 (main, Oct 13 2022, 10:17:43) [Clang 14.0.0 (clang-1400.0.29.102)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from math import nan
>>> nan == nan
False
>>> (nan,) is (nan,)
False
>>> (nan,) == (nan,)
True
>>> s = set()
>>> s.add((nan,))
>>> (nan,) in s
True
>>> s.add(0.0)
>>> 0.0 in s
True
>>> -0.0 in s
True
>>>

There are two related things going on. Firstly, the “typically” refers to normal, expected behaviour of most objects - that is, they are equal to themselves. But separately to that, many forms of comparison are actually defined as “is-or-equals”; for example, searching a list for an element uses this comparison logic. It’s not simply an optimization, and there is no way that something would be compliant without following this rule, but it’s almost never important because typically objects are equal to themselves. So you can mentally think about it as “find the first thing that’s equal to the thing you’re searching for”, with the caveat that it will find itself even if not equal.

1 Like

Thanks for explaining “typically”, and yes, that jibes with what I’ve been seeing.
Is the reference documentation the best place to look for these things, or is there a better resource?

So that means that nan is not problematic in key position, but that positive/negative zeroes will be consistently conflated.
I suppose that kind of 754-related conflation of notionally distinct values below is no different than cross-type conflation like set((1, True, 1.0)).

>>> sp = set((0.0, -0.0))
>>> '%r' % sp
'{0.0}'
>>> sn = set((-0.0, 0.0))
>>> '%r' % sn
'{-0.0}'
>>> sp == sn
True

With floating-point values other than NaN, this would indeed behave like an optimization. You can easily find two floats which are equal but not identical; in fact, most likely, this will happen simply by constructing them:

>>> x = 1234.5
>>> y = 1234.5
>>> x is y
False
>>> x == y
True

(Obviously this isn’t guaranteed, and specifics may vary. You should never rely on floats being distinct objects.)

So with these two objects, can we put them both in a set?

>>> s = {x, y}
>>> s
{1234.5}
>>> next(iter(s)) is x
True
>>> next(iter(s)) is y
False

As makes obvious sense, no you can’t. And the first one referenced is the one that’s kept. The “default” rule of using object equality to define set membership is the natural way to interpret this.

The identity check is really just a special case to handle peculiar situations with objects that aren’t equal to themselves, but it does give rise to some very useful guarantees. For example:

for value in s:
    print(value in s)

You can be confident that, for any set (or list or tuple or dict), this will print out nothing but Trues. Without the identity-or-equality comparison, this could occasionally give Falses, which would be extremely annoying. Consider this function and tell me whether it will behave sanely:

def show_dict(d):
    for key in d:
        print("%r ==> %r" % (key, d[key]))

Of course it will; you can subscript a dictionary with the keys you got from iterating over that dictionary. And equally obviously, key in d has to be True if d[key] has a value, and False if d[key] will raise. That’s just common sense! And the identity-or-equality check guarantees this.

So, yes, it IS an optimization (it’s way faster to check object identity than to ask if the objects are equal), but it is also a well-defined guarantee of the language.

1 Like

Thanks for explaining the value interning. Yes, it is a common source of under-specified behaviour and I should have remembered that.

Reading that made me realize I was making some assumptions about nan being singleton.
Now I see the discussion on Making NaN a singleton

And the below shows that nan is not singleton on CPython.

$ python3
Python 3.10.8 (main, Oct 13 2022, 10:17:43) [Clang 14.0.0 (clang-1400.0.29.102)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from math import nan, isnan, acos
>>> nan0 = nan
>>> try:
...   nan1 = acos(2.0)
... except:
...   nan1 = float('nan')
...
>>> nan0 is nan1
False
>>> isnan(nan0)
True
>>> isnan(nan1)
True
>>> s = set((nan0, nan1))
>>> len(s)
2
1 Like

I believe that the behaviour shown by CPython is considered an optimization, not a language feature. There is nothing stopping another implementation from failing to implement it, and that should be one of the things you consider when porting from CPython to another interpreter.

Sure enough, here’s IronPython 2.7:

>>> a = [1, float('nan'), 2]
>>> b = [1, a[1], 2]        
>>> a == b
False

Micropython 1.9.4 does the same. Whereas Jython 2.7 returns True, so it does implement the same optimization.

This optimization was not designed with NANs in mind. It assumes that equality obeys the reflexivity property, which says that x == x for all x. NANs are “wierd” and difficult because they violate reflexivity. It is best to avoid using NANs as dict keys or when comparing collections, as they may behave unexpectedly.

Not much.

In practical terms, CPython is unlikely to change its behaviour, but in theory it could. (But it probably won’t.) Other implementations may or may not follow CPython.

You should be able to rely on math.nan always returning the same object, but you cannot rely on float('nan') always returning the same object.

Nor even that all NANs have the same bit pattern. There are over 9000 trillion possible bit patterns for float NANs (that’s 0.05% of the full float space) and there is nothing promising that all the NANs you see will have the same bit pattern.

You cannot rely on being able to create signalling NANs in Python – some OS or CPUs prohibit signalling NANs and silently flip them into quiet NANs.

You should be able to rely on x != x and x != y for x,y any two NANs, but you cannot rely on anything beyond that.

But in practical terms, it probably doesn’t matter too much. Write the code that works, and don’t worry about looking for problems before they come up naturally :slight_smile:

2 Likes

It might have been mere optimization at one point, but it was cemented as intended behaviour quite some time ago. I can’t find a direct reference to when containers were first documented as using identity-or-equality, but here’s where the ABCs were modified to align with this behaviour:

Are there any Python 3 implementations that misbehave on this point? I tried PyPy3 v7.3.5 and uPy v1.6, and both of them give True here. (No idea what uPy 1.9.4 that you tested was, so I have no idea how to describe its version numbers.)

So, no, CPython will not be changing its behaviour, not now.

1 Like

This isn’t all that relevant to NANs, incidentally, since float("1") doesn’t return the same object every time either. It might happen to, but it isn’t required to. Any data type not specifically defined as having a limited number of instances (NoneType, bool, and a small handful of others) cannot be assumed to return the same object more than once.

A compliant Python implementation IS permitted to cache these (eg CPython caches small ints), but this is pure optimization.

1 Like

You should be able to rely on math.nan always returning the same object, but you cannot rely on float('nan') always returning the same object.

Nor even that all NANs have the same bit pattern. There are over 9000 trillion possible bit patterns for float NANs (that’s 0.05% of the full float space) and there is nothing promising that all the NANs you see will have the same bit pattern.

This sounds like a “let libm do what libm does” requirement.

incidentally, since float("1") doesn’t return the same object every time either

Caching all float boxes would require a global, thread-safe weak map which could be a big source of contention. Maybe the thread-safety is not an issue if you’ve got a GIL, but still a bottleneck.

Whereas Jython 2.7 returns True

I wonder if that’s related to the caveats in java.lang.Double.equals: around boxed-float equality vs primitive float equality:

Note that in most cases, for two instances of class Double, d1 and d2, the value of d1.equals(d2) is true if and only if

d1.doubleValue() == d2.doubleValue()

also has the value true. However, there are two exceptions:

  • If d1 and d2 both represent Double.NaN, then the equals method returns true, even though Double.NaN==Double.NaN has the value false.
  • If d1 represents +0.0 while d2 represents -0.0, or vice versa, the equal test has the value false, even though +0.0==-0.0 has the value true.

This definition allows hash tables to operate properly.