NaN and Inf as missing values

Having read the ‘Infinity’ constant in Python thread, I noticed a mention towards the end that if NaN and Inf were made singletons, users would write tests like the following

x is NaN

or

x is Inf

and that this would be a bad test.

I would like to explain why I think this isn’t so bad from the data science perspective. While NaN and Inf are floating-point values, there is a current trend in the data science community to use NaN in particular as an indicator for missing values. This is most notable in NumPy and Pandas.

The issue with using NaN as a missing value is that if a column of a DataFrame has an integer dtype, then inserting a NaN casts that column to float, as it must, since NaN is a float. However, many people do not actually want that behavior. So Pandas has taken to building arrays and dtypes to handle this: the ExtensionArray and ExtensionDtype APIs, from which they have created objects such as IntegerArray and BooleanArray that can hold values of the type in the name alongside missing values.
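
For concreteness, here is a small sketch (with made-up sample data) of the casting behavior described above, next to Pandas’ nullable Int64 extension dtype:

```python
import numpy as np
import pandas as pd

# Mixing NaN into integer data forces the whole column to float.
s = pd.Series([1, np.nan, 3])
print(s.dtype)            # float64

# Without the NaN, the same data stays integer.
print(pd.Series([1, 2, 3]).dtype)  # int64

# The nullable extension dtype keeps the integers while still
# tracking which values are missing.
t = pd.Series([1, None, 3], dtype="Int64")
print(t.dtype)            # Int64
print(t.isna().tolist())  # [False, True, False]
```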

Now, to my naïve understanding, the way this works is that they keep a boolean mask alongside the underlying NumPy array, indicating whether each value is missing or not.
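
As a rough illustration of that understanding (this is a toy sketch, not Pandas’ actual implementation), the masked approach might look something like:

```python
import numpy as np

# The values stay integers; a parallel boolean mask records
# which positions are missing (True == missing).
values = np.array([1, 0, 3], dtype=np.int64)   # 0 is just a placeholder
mask = np.array([False, True, False])

def isna(values, mask):
    """Return the mask: which entries are missing."""
    return mask

def to_float_with_nan(values, mask):
    """Convert to a plain float array, writing NaN where masked."""
    out = values.astype(np.float64)
    out[mask] = np.nan
    return out

print(isna(values, mask).tolist())      # [False, True, False]
print(to_float_with_nan(values, mask))  # [ 1. nan  3.]
```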

According to [1], NumPy is trying to get around to supporting this; R already has it.

So my point is that a large part of the community currently does

x is NaN

or

x is Inf

as a check (well, the former far more than the latter), in a sense.

It would be very nice to have a built-in that could be used universally as a missing-value indicator. Having said that, I realize None is supposed to act as this, of sorts. I also realize that the reason None is not used in Pandas or NumPy is that it would turn their dtypes into object dtype, which makes things considerably slower.
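
To illustrate that last point with a small sketch: placing None in a NumPy array forces the object dtype, while NaN keeps a native float dtype so operations stay vectorized.

```python
import numpy as np

# None in a NumPy array forces an object dtype, losing vectorized speed.
a = np.array([1, None, 3])
print(a.dtype)   # object

# NaN keeps a native float dtype.
b = np.array([1, np.nan, 3])
print(b.dtype)   # float64
```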

So perhaps what I am asking for could never happen anyway, and the entire burden of this falls completely on NumPy/Pandas. However, if it could happen, it would be quite amazing in my opinion. I feel like I may have wasted my time typing this up, but on the off chance it might happen, and core Python isn’t aware of this use case in data science, it will have been worth it.

Thank you for your time and patience.

[1] Stack Overflow: “NumPy or Pandas: Keeping array type as integer while having a NaN value”

PS:

I tried to add more links for ease of understanding, but I am limited to two right now. :(


I don’t want to rehash that long thread you linked to, but just to focus in on “… Numpy is trying to get around to it …”:

The situation with NumPy is not as simple as the answer in that Stack Overflow question might suggest. See the various NEPs about missing data: NEPs 12, 24, and 25, and NEP 26, which summarizes the others. None of these NEPs were accepted, which demonstrates that there is no consensus in the NumPy community around these ideas.

I will go so far as to say that I think use of x is NaN is very fragile, so I would be surprised to see robust code that uses it:

>>> n1 = float('nan')
>>> n2 = float('nan')
>>> n1 is n2
False

but maybe I misunderstood you.
Matti

No, you do not seem to be misunderstanding.

I am being naïve:

In [1]: import numpy as np

In [2]: a = np.nan

In [3]: b = np.nan

In [4]: a is b
Out[4]: True

In [5]: a == b
Out[5]: False

In [6]: aa = float('nan')

In [7]: bb = float('nan')

In [8]: aa is bb
Out[8]: False

In [9]: aa == bb
Out[9]: False

In [10]: aa is a
Out[10]: False

In [11]: aa == a
Out[11]: False

In [12]: import pandas as pd

In [13]: df = pd.DataFrame([1, np.nan, 2], index=[0,1,2], columns=["A"])

In [14]: df
Out[14]:
     A
0  1.0
1  NaN
2  2.0

In [15]: df.isna()
Out[15]:
       A
0  False
1   True
2  False

In [16]: df.applymap(lambda x: x is np.nan)
Out[16]:
       A
0  False
1  False
2  False

In [17]: np.nan()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-7af2d6d1e813> in <module>
----> 1 np.nan()

TypeError: 'float' object is not callable

In [18]: for x in np.array([1,np.nan,2]):
   ...:     print(x is np.nan)
   ...:
False
False
False

I would have bet good money that In [15] and In [16] would give the same result, due to the way In [4] worked — especially because In [4] shows that numpy.nan is a single object (and In [17] shows it is just a module-level float). I thought perhaps it was Pandas doing it, but that appears not to be so, as shown by In [18].
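
If I had to guess at an explanation for In [18]: indexing or iterating a NumPy array produces a fresh numpy.float64 scalar object rather than handing back the numpy.nan module constant, so the identity check can never succeed.

```python
import numpy as np

arr = np.array([1.0, np.nan, 2.0])
x = arr[1]

# Indexing yields a fresh numpy.float64 scalar, not the np.nan constant.
print(type(x))        # <class 'numpy.float64'>
print(x is np.nan)    # False: different objects
print(np.isnan(x))    # True: the value-based check works
```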

Have a happy Thanksgiving, or weekend otherwise. :slight_smile:

You raise an interesting point about using NaNs as “missing values”. I have softened my attitude towards this over the years, mostly on the basis of “practicality beats purity”.

Personally, I don’t think that there is any advantage to making float infinities singletons. It is just as easy to say

x == Inf

as to use the is operator, and it will work even if x is a subclass of float, or a Decimal:

>>> import decimal
>>> decimal.Decimal('infinity') == float('inf')
True

Using is for infinities in current Python is an anti-pattern; it simply is wrong. Anyone doing that is living in sin:

>>> a = float('inf')
>>> b = float('inf')
>>> a is b
False

NaNs are trickier, since they are equal to nothing, not even themselves.

Given “practicality beats purity”, what would you think about having a named constant “NA” in the statistics module for missing values?

NA == NA
NA != x for all other numeric x

This would be similar to R.

If there is interest, I will write a PEP in my copious spare time.

(This might not be of use to numpy people. I expect that if numpy wants their own missing value, they will prefer their own, rather than import it from the std lib. But users of the statistics module might find it helpful.)

In R, comparisons involving NA always return NA, never False or True:

> NA == NA
[1] NA
> NA == 1
[1] NA
> NA != NA
[1] NA
> NA != 1
[1] NA

While NaN and Inf are floating values there is a current trend in the data science community to use NaN in particular as a method for indicating a missing value. This is most notable in Numpy and Pandas.

I think the opposite is true. NaN was a practical compromise to keep vectorized operations fast. But Pandas 1.0 introduced (experimentally) pd.NA as a better alternative.
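
A quick illustration of pd.NA’s semantics (shown here with a nullable Int64 Series as sample data): comparisons with pd.NA propagate the NA rather than returning False, much like R’s NA.

```python
import pandas as pd

# Comparisons involving pd.NA return pd.NA, not True or False.
print(pd.NA == pd.NA)   # <NA>
print(pd.NA == 1)       # <NA>

# Nullable dtypes use pd.NA as their missing-value marker.
s = pd.Series([1, pd.NA, 3], dtype="Int64")
print(s.isna().tolist())  # [False, True, False]
```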