Nice haggling here, but you could be much more productive. What about adding a note to the Python documentation about calling sorted() on an iterable with unorderable objects, like a list of floats containing some NaNs? Right now the function returns a partially ordered list and pretends everything is fine. As an example, I had a list of ~10M floats, sorted it, and selected the N lowest numbers. Having 10 NaNs in the list resulted in <0.01% of the data being unsorted, randomly scattered through the semi-sorted result, so this was hard to debug! I can understand it breaks on NaN, but hell, how can the non-NaN floats not be sorted???
Hello Svobora,
A note in the documentation under sorted() and list.sort() is an excellent idea.
Python's sort function, like most sort algorithms, requires the values to define a total order. Unfortunately, floats do not, because of NANs. However, floats without NANs do define a total order. NANs are considered unordered:
>>> from math import nan
>>> nan < 1.0
False
>>> nan > 1.0
False
>>> nan == 1.0
False
This behaviour is mandated by the IEEE-754 standard, and there are good
reasons for it, but sometimes it can be inconvenient. Sorting is one of
those times.
Having sort check every item to see whether it is a NAN or not would slow sorting down too much. Python will never return a NAN from built-in float operations unless you already pass in a NAN (or sometimes an infinity), so generally speaking the only way to get NANs in your data is to either put them there yourself, or to read them in from external data.
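For example, here is one of the few ways to get a NAN out of built-in float operations, versus reading one in:

>>> from math import inf
>>> inf - inf
nan
>>> float("nan")  # the sort of thing that arrives from external data
nan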
Python's sort function, like most sort algorithms, requires the values to define a total order.
I don't think this is correct. Sort functions in Python are explicitly guaranteed to use __lt__. The values do not need to define equality, IOW they are not required to satisfy a total ordering.
Yes, sorting only invokes __lt__, but it nevertheless assumes that the results of calling that are consistent with a total ordering. Here's an example:
>>> import math
>>> xs = [4, math.nan, 3, math.nan, 5, math.nan]
>>> sorted(xs)
[4, nan, 3, nan, 5, nan]
Nothing changed. Under the covers, the sort ends after a single pass identifying the first natural run. It first asks "nan < 4?", which returns False. So it assumes 4 <= nan must be true: the first two items are already in order. Then it asks "3 < nan?". Again False, so it assumes nan <= 3 must be true, and so the first 3 items are already in order. And so on. At every step, "xs[i+1] < xs[i]?" returns False, so it concludes xs[i] <= xs[i+1] for every i in range: the entire list consists of a single non-decreasing run, so there's nothing for sorting to do.
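You can watch this happen by wrapping the values in a small class that logs each __lt__ call the sort makes (a quick sketch; the exact comparison sequence is a CPython implementation detail and could change):

import math

class Spy:
    """Wrap a value and log every __lt__ call the sort makes."""
    def __init__(self, v):
        self.v = v
    def __lt__(self, other):
        print(f"{self.v} < {other.v}? -> {self.v < other.v}")
        return self.v < other.v
    def __repr__(self):
        return repr(self.v)

xs = [Spy(x) for x in [4, math.nan, 3, math.nan, 5, math.nan]]
print(sorted(xs))

On current CPython this should show five comparisons, all returning False, with the list coming back in its original order.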
Whereas I dispute there are "good reasons" for it. There are good reasons for Python to follow a standard, even if what a standard demands is a poor idea. I can't think of any good consequence that came from making NaNs "unordered", except for one: the "x != x" language-level trick for spelling "is x a NaN?". Which, in the early days, didn't work at all across C implementations (many compilers optimized it to plain 0), and remains a poor idea even now in contrast to invoking an explicit isnan() or fpclassify() kind of function. Which few C environments supplied at first either, and so the dumbass "x != x" trick spread.
If I had it to do over again, Python's float comparison functions would impose a total order, and Python would have grown a math.ieee_compare(x, y, flags) function for the small segment of people who actually want "unordered" outcomes.
Ah, well I wasn't party to the committee's internal deliberations, and I can't afford to buy a copy of the standard to see if they give the rationale for the unordered behaviour. So perhaps I should have said "I assume there is a good reason for it" and left arguments about which reasons count as "good" for another day.
Out of curiosity, how would you order NANs if you were in charge?
Presumably signalling NANs would just signal on comparison. How would
you order quiet NANs, and would you distinguish between those with the
sign bit set and those without?
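(The sign bit of a NaN is at least observable from Python, e.g. via math.copysign, so in principle a total order could distinguish the two:

>>> import math
>>> math.copysign(1.0, math.nan)    # sign bit clear
1.0
>>> math.copysign(1.0, -math.nan)   # sign bit set
-1.0

At least that's the behaviour I'd expect on CPython with IEEE doubles.)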
The basic point is that it would be incorrect to give True for any comparison like nan < 1.0. There are some computational advantages, like automatically exiting a while loop on encountering a nan:

while g(x) < tol:
    x = f(x)
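A concrete version of that, with made-up stand-ins for f and g, just to show the loop really does stop once a NaN shows up:

from math import isnan, nan

def f(x):
    return x - 1.0

def g(x):
    # pretend g starts producing NaN once x goes negative
    return nan if x < 0 else x

x, tol = 3.0, 10.0
while g(x) < tol:   # NaN < tol is False, so a NaN ends the loop
    x = f(x)
print(x, isnan(g(x)))   # -1.0 True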
(This is partly speculative but...) they wanted to standardise the behaviour of all operations with floats across different hardware, and it seemed important to specify that in low-level terms, basically in operations that map directly to assembly instructions. That made it clear that hardware vendors had the primary responsibility to implement this at the base of the stack. I think they were thinking very much about the lowest common denominator of computing environments rather than high-level languages with richer features like exceptions. For those reasons they were specifying concrete behaviour for operations like <. As Tim says, getting consistent support from C environments was difficult, so inserting some well-defined behaviour at the hardware level forced it into everything even if the compiler didn't provide an isnan function. Otherwise there could have been different functions for comparing floats in different ways, and in an environment with exceptions a plain < comparison could just raise like 1 < 1j does in Python. In the absence of any language support, at least a while loop as I showed above is capable of handling nans, provided the hardware follows the standard for <.
They also carefully defined the binary layout of floats so that treating the same bits as representing integers would give an ordering that was consistent with floating point ordering. The difference is that in that comparison nans end up comparing greater than everything else:
>>> import struct
>>> as_int = lambda f: struct.unpack('>l', struct.pack('>f', f))
>>> import math
>>> xs = [4, math.nan, 3, math.nan, 5, math.nan]
>>> sorted(xs, key=as_int)
[3, 4, 5, nan, nan, nan]
(Special handling is needed to make this work for negative floats including -0.) I think this was partly designed to ease hardware implementation of comparisons but also intentionally so that there would be an easy way to sort a list of floats into a unique order.
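To see the problem with negatives, reuse the as_int key from above (this is the output I'd expect; the raw bit patterns of negative floats compare in reverse as signed ints):

>>> sorted([1.0, -1.0, 2.0, -2.0], key=as_int)
[-1.0, -2.0, 1.0, 2.0]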
Python chose to have < correspond to IEEE float comparison and also to ensure that the default behaviour of sort is determined by <. I think the IEEE standard intended to support sorting floats, but it was not supposed to be done with the float version of <.
It's also questionable whether Python needed to follow IEEE for <, given that the IEEE standards were mainly trying to enforce consistent behaviour at the hardware level.
For good or ill, standards generally don't give rationales for anything - just "the rules". Sometimes they go on to produce a related "rationale" document too, but the 754 committee didn't.
As mentioned in a BPO report referenced earlier in this thread, a later version of 754 did go on to define a totalOrder() function.
In this message from that report:
https://bugs.python.org/issue36095#msg336487
I gave a simple Python implementation of my best guess as to what that later function required. The example I gave then:
xs = [2.0, 1.0, 0.0, -0.0, -1.0, -2.0,
      math.inf, -math.inf, math.nan, -math.nan]
print(sorted(xs, key=totkey))
displays
[nan, -inf, -2.0, -1.0, -0.0, 0.0, 1.0, 2.0, inf, nan]
While it's not visible from the output, it's the -math.nan that appears first.
No rationale was given for that either. My best guess is that they defined the total order to be as consistent as possible with merely viewing "the usual" 64-bit layout of IEEE doubles as signed 64-bit integers, then using signed int comparisons that know nothing about doubles. Doing just that would leave negative finite floats in reversed order, but, as the linked message spells out, a simple preprocessing bit-fiddling trick on negative floats is enough to repair that:
import struct

def totkey(x, *, MASK=(1 << 63) - 1, pack=struct.pack):
    # View the IEEE double as a signed 64-bit big-endian integer.
    asint = int.from_bytes(pack(">d", x), "big", signed=True)
    if asint < 0:
        # Negative floats compare in reverse as ints; flipping the
        # non-sign bits repairs their order.
        asint ^= MASK
    return asint
Ya, but that's the same cherry-picked example everyone gives in hindsight. In general, you just can't guess:

repeat
    x = f(x);
until fabs(x - target) <= tol;

is now a nearly-guaranteed infinite loop if x is (or becomes) a NaN.
for x in xs:
    if abs(x) >= 1e6:
        # skip outliers - only want smallish finites here
        continue
    # lots of code that has no frickin' idea of what to do with a NaN
The standard applies to a "computing environment", and actually says basically nothing about either hardware or programming languages. The people on the committee overwhelmingly had HW backgrounds, though, so were more attuned to what was possible in HW than to what was reasonable for programming languages to support with dedicated syntax.
Consider, for example, that there are 32 possible IEEE binary comparison predicates: 2**4 = 16 for which subset of the 4 possible {less, equal, greater, unordered} outcomes you're interested in, and then each of those in 2 flavors depending on whether or not you want a signaling NaN comparand to signal. That's easy for HW to support, and for a library ieee_compare(x, y, bitmask_of_flags) function API to expose, but it would be plain nuts to expect a programming language to grow 32 different infix binary operator spellings for that.
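To make that concrete, here is a rough pure-Python sketch of what such an ieee_compare could look like. The flag names and numeric values are invented for illustration, and the signaling flavors aren't modeled since Python only has quiet NaNs:

import math

# Invented flag values: one bit per possible comparison outcome.
LESS, EQUAL, GREATER, UNORDERED = 1, 2, 4, 8

def ieee_compare(x, y, flags):
    """True if the relation between x and y is among those selected
    by the bitmask flags (any of the 16 non-signaling predicates)."""
    if math.isnan(x) or math.isnan(y):
        outcome = UNORDERED
    elif x < y:
        outcome = LESS
    elif x > y:
        outcome = GREATER
    else:
        outcome = EQUAL
    return bool(outcome & flags)

print(ieee_compare(math.nan, 1.0, LESS | UNORDERED))  # True
print(ieee_compare(2.0, 1.0, LESS | UNORDERED))       # False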
At the time, CPython didn't have any particular drive to "support 754", but instead to mimic what the platform C compiler did about it. It was only over time that deliberate attempts to "be more 754-like" spread in the CPython implementation, as it became ever clearer that 754 was here to stay, and platform C compilers became more predictable in what their generated code did (at the start, the result of "NaN != NaN" in C was a cross-platform crapshoot, and CPython intended to reproduce whatever the platform C did - Python tried to improve on C's story for integers, but not for floats).
There's a difference between having a "computing environment" that tries to automatically make your code correct even if you don't know what you're doing and having an environment that makes it possible to write portable correct code. At the time of IEEE 754 people were more concerned about the latter, because even if you knew what you were doing it was impossible to get it right cross-platform.
At the time, CPython didn't have any particular drive to "support 754", but instead to mimic what the platform C compiler did about it. It was only over time that deliberate attempts to "be more 754-like" spread in the CPython implementation, as it became ever clearer that 754 was here to stay, and platform C compilers became more predictable in what their generated code did (at the start, the result of "NaN != NaN" in C was a cross-platform crapshoot, and CPython intended to reproduce whatever the platform C did - Python tried to improve on C's story for integers, but not for floats).
This is the "forced it into everything" from the "base of the stack" effect I was referring to: Python inherited this behaviour from below even if it had no particular desire for it. I'm pretty sure that was the intended effect from the authors of 754.
captious
Reduced to name-calling.
Paddy, you're responding to a comment made over a year ago, in December 2019. I think we can safely ignore impolite things stated a year ago and move on.
Whoops. Ta.
I have to say that, rereading what I wrote a year ago, I too find my posts really annoying. Sorry for that.