Minmax function alongside min and max

Ah, excuse me, I never read the docs and I wrongly assumed that it was an accelerated C version of math :smiley:

Well, so I suggest math :stuck_out_tongue:

PS: just out of curiosity, why a separate package for complex numbers? Shouldn't math simply use function overloading and duck typing?

Well, I know that Raymond Hettinger is an excellent programmer… but I really don't see the usefulness of zip_longest here:

  1. it zips the same iterator with itself, so the lengths are equal and zip would be sufficient
  2. even so, you're zipping the same iterator… so val0 is val1, and it's impossible that val0 > val1
  3. since val0 is val1, if val0 < vmin it's impossible that val1 > vmax, so the second check should be an elif, to speed up the code
  4. if you change all of this, you simply get back to your previous version of the code :smiley:

Am I missing something?

PS: furthermore, my hope is that this function will eventually be translated into C in CPython, so the use of itertools could be problematic.

The most important difference is in the handling of inputs of type float (or int). The vast majority of Python code does not expect to work with complex numbers, and so for example for most users it makes sense for math.sqrt(-2.0) to raise ValueError. Users who work in a domain where complex numbers make sense can use cmath.sqrt instead, which will return an appropriate complex output for cmath.sqrt(-2.0).
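
For concreteness, this is the behaviour being described:

>>> import math, cmath
>>> math.sqrt(-2.0)
Traceback (most recent call last):
  ...
ValueError: math domain error
>>> cmath.sqrt(-2.0)
1.4142135623730951j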

(Yes, I know that the ** operator already returns complex outputs for non-complex inputs in some cases. IMO, that was a (minor) design mistake, but we’re getting off-topic for this thread.)

Yes!
The two values in each pair are not the same element: since both arguments are the same iterator, the first slot takes one element and the second slot takes the next one, so zip_longest yields consecutive pairs. Clever trick, but it's not magic…
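
To make it concrete, here is a rough sketch of that pairwise idea (not necessarily Hettinger's exact code, just the shape of it): each pair costs at most three comparisons, i.e. about 1.5 comparisons per element, instead of up to two comparisons per element.

from itertools import zip_longest

def minmax_pairwise(iterable):
    it = iter(iterable)
    try:
        vmin = vmax = next(it)
    except StopIteration:
        raise ValueError("minmax_pairwise() arg is an empty sequence") from None
    # zip_longest(it, it, ...) pulls two elements per tuple, so each (lo, hi)
    # pair holds consecutive values; the fill value is only used when the
    # number of remaining elements is odd.
    for lo, hi in zip_longest(it, it, fillvalue=vmax):
        if hi < lo:
            lo, hi = hi, lo
        if lo < vmin:
            vmin = lo
        if hi > vmax:
            vmax = hi
    return vmin, vmax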

If this is ever implemented in CPython in pure Python, then itertools is no problem at all, since it is part of the standard library. If it ends up being a C function, it will be completely different anyway.

Being in the Python stdlib does not mean that a CPython C API exists for itertools… I don't know whether one does. Anyway, I've benchmarked my solution against Hettinger's, and his solution, even if brilliant, is slower:

marco@buzz:~$ python3.9 -m pyperf timeit "minmax(range(100000))" --rigorous --affinity 0 -s 'MY_CODE'
Mean +- std dev: 5.15 ms +- 0.16 ms
marco@buzz:~$ 
marco@buzz:~$ python3.9 -m pyperf timeit "minmax(range(100000))" --rigorous --affinity 0 -s 'HETTINGER_CODE'
Mean +- std dev: 6.35 ms +- 0.23 ms

and I don't understand how it can be faster to zip two iterators and do a triple comparison instead of simply iterating one object and doing one, or at most two, comparisons…

A little modification: max() and min() suffer from a subtle bug: if the first elements are NaNs, NaN is returned.

Why? Because the first element is taken as the initial value for max and min. Then the functions check whether the next element is greater or smaller, and if so, they replace the max or min value with it, and so on.

The problem is that every comparison with NaN returns False. So the first element, a NaN, will never compare greater or smaller than the other numbers in the sequence, and will never be replaced.
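
You can see it directly; the values are the same, only the position of the NaN changes:

>>> max([float("nan"), 1, 2, 3])
nan
>>> max([1, float("nan"), 2, 3])
3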

This is the code that fixes this bug:

_sentinel = object()

def minmax(*args, key=None):
    args_len = len(args)
    
    if args_len == 0:
        fname = minmax.__name__
        raise ValueError(f"{fname}() expected 1 argument, got 0")
    elif args_len == 1:
        seq = args[0]
    else:
        seq = args
    
    it = iter(seq)

    vmax = vmin = next(it, _sentinel)
    
    if vmax is _sentinel:
        fname = minmax.__name__
        raise ValueError(f"{fname}() arg is an empty sequence")
    
    first_comparable = False
    
    if key is None:
        for val in it:
            if val > vmax:
                vmax = val
                first_comparable = True
            elif not first_comparable and not val <= vmax:
                # not comparable object, like NaN
                vmax = vmin = val
                continue
            elif val < vmin:
                vmin = val
                first_comparable = True
    else:
        fmax = fmin = key(vmax)
        
        for val in it:
            fval = key(val)
            
            if fval > fmax:
                fmax = fval
                vmax = val
                first_comparable = True
            elif not first_comparable and not fval <= fmax:
                fmax = fmin = fval
                vmax = vmin = val
                continue
            elif fval < fmin:
                fmin = fval
                vmin = val
                first_comparable = True
    
    return (vmin, vmax)

This slows down the function, but only a little (5.57 ms ± 0.24 ms)

I hope this simple fix will also be applied to min() and max().

I wouldn’t consider this to be a bug, or at least not one that should be fixed by altering the implementation of min or max. If the elements given to max or min aren’t totally ordered, then there isn’t a meaningful result to be given; consider this a case of undefined behaviour. sort and sorted have the same issue.


numpy does not suffer from this bug. And sorry, it is a bug and IMHO it should be fixed. And the fix, as you can read, is very simple.

That's not true. This happens only if the first elements have comparison operators that always return False. If there's a NaN in the middle of the iterator, there's no problem. And if this is not a bug… well, it's an old, well-known one.

In the general case of some arbitrary values without total ordering, I
would agree, but NANs are somewhat special.

The 2008 revision to IEEE-754 specifies that min and max should ignore
NANs:

min(NAN, x) = min(x, NAN) = x

and similarly for max. Only in the case that both arguments are NANs
should a NAN be returned.

One might also argue for min/max functions that propagate NANs,
returning NAN if any of the arguments are NAN.

Python’s status quo, where the result depends on which argument is a
NAN, matches neither of these behaviours, and is surely not desired by
anyone.

I think that it is easy to fix: just skip over any leading NANs, then
proceed as normal from that point on. Something like this:

import math

def minNum(values):
    values = list(values)
    # skip over any leading NANs
    while values and math.isnan(values[0]):
        del values[0]
    if not values:
        return math.nan
    return min(values)

Making it efficient and robust is left as an exercise for the
reader :slight_smile:

IEEE 754 isn’t super helpful here. In the current version (the 2019 text), it doesn’t define anything called either min or max. It defines minimum and maximum, which propagate NaNs, and minimumNumber and maximumNumber, which treat NaNs like missing data. There’s nothing in IEEE 754 which provides guidance on which of those operations Python’s min and max should map to.

Independent of that, I don’t think it’s worth complicating the currently completely generic min and max (or sorted, or list.sort, which suffer from exactly the same issues) implementations with NaN-awareness, and I’m not even sure it’s possible to reliably check for NaNs across the variety of types that might present them (NumPy floats, decimal.Decimal, Python floats, gmpy2 NaNs, etc.).

Possibly we need a documentation update pointing out that min, max, sorted and list.sort only make sense in the presence of a total ordering, and that results are undefined otherwise.

@mdickinson Can you please take a look at my code? The fix is really simple and works not only for NaNs, but for every object that always returns False in comparisons:

    first_comparable = False
    
    if key is None:
        for val in it:
            if val > vmax:
                vmax = val
                first_comparable = True
            elif not first_comparable and not val <= vmax:
                # not comparable object, like NaN
                vmax = vmin = val
                continue
            elif val < vmin:
                vmin = val
                first_comparable = True

@Marco_Sulla
Here I present a version that handles NaN as expected (as demonstrated in your code), but without the need to test first_comparable for each value.
Also, this version supports key and default:

_sentinel = object()
def minmax(*args, key=None, default=_sentinel):
    args_len = len(args)
    
    if args_len == 0:
        if default is _sentinel:
            raise ValueError("minmax expected 1 or more argument(s), got 0")
        return default        
    elif args_len == 1:
        seq = args[0]
    else:
        seq = args
    
    it = iter(seq)

    vmax = vmin = v0 = next(it, _sentinel)
    if vmax is _sentinel:
        if default is _sentinel:
            raise ValueError("minmax arg is an empty sequence")
        return default
            
    if key is None:            
        while not vmax <= vmax: # this is essentially a test for nan
            vmax = vmin = next(it, _sentinel)
            if vmax is _sentinel:
                return v0, v0
        for val in it:
            if val > vmax:
                vmax = val
            elif val < vmin:
                vmin = val
        return vmin, vmax
                
    fmax = fmin = key(vmax)
        
    while not fmax <= fmax: # this is essentially a test for nan
        vmax = vmin = next(it, _sentinel)
        if vmax is _sentinel:
            return v0, v0
        fmax = fmin = key(vmax)
    
    for val in it:
        fval = key(val)
        
        if fval > fmax:
            fmax = fval
            vmax = val
        elif fval < fmin:
            fmin = fval
            vmin = val

    return vmin, vmax
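
A couple of quick, illustrative checks (assuming the definition above), showing the leading-NaN case and the default parameter:

>>> minmax(float("nan"), 3.0, 1.0, 2.0)
(1.0, 3.0)
>>> minmax([], default=(None, None))
(None, None)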

I remember some discussion about adding one-pass statistics methods to the statistics module. Perhaps an API that allows computing combinations of min/max/mean etc. in a single pass would also work for the case here. It would make sense to define nan-handling there.
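
Just to sketch what such an API might look like (the class name, shape and NaN policy below are purely hypothetical, not an existing proposal):

import math
from dataclasses import dataclass

@dataclass
class RunningStats:
    # hypothetical single-pass aggregate; it skips NaNs, which is just one
    # possible, explicitly documented policy
    count: int = 0
    total: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    def add(self, x: float) -> None:
        if math.isnan(x):
            return
        self.count += 1
        self.total += x
        if x < self.minimum:
            self.minimum = x
        if x > self.maximum:
            self.maximum = x

    @property
    def mean(self) -> float:
        return self.total / self.count

stats = RunningStats()
for x in (2.0, float("nan"), 5.0, 1.0):
    stats.add(x)
# stats.minimum == 1.0, stats.maximum == 5.0, stats.mean == 8.0 / 3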

Marco, I think you are missing Mark’s point that it doesn’t matter how
“simple” the patch is (assuming it works: I see no test suite to
demonstrate that it does what you say it does). Mark’s argument is that
we should not “fix” this by changing the behaviour.

min and max currently make no promises about what they will do when
given NANs. Since they don’t guarantee what the result will be
when passed a NAN, the current result might be surprising but it’s not
buggy.

We have no consensus on what the correct behaviour should be. Should
they:

  • return NAN?
  • raise an exception?
  • treat it as missing data and ignore it?
  • something else?

If we make guarantees about NANs, do we have to also make guarantees
about other unordered objects?

How about objects which always return True for ordering comparisons?

Although I don’t completely agree with Mark, I am sympathetic to his
argument. As the author of the statistics module, I intentionally didn’t
make any promises about what result you would get if you passed a NAN as
a data point.

Changing the behaviour of min or max will probably need a PEP.

@steven.daprano: Ok, let me show you an example:

>>> max((float("nan"), 0))
nan
>>> max((0, float("nan")))
0

The values are the same; the only difference is the order of the items. This is a bug. Period.

@ruud Mh, I don't know if your change will always work. Essentially you're testing equality, since vmax can't be less than itself. But there are objects whose comparison operators don't return a plain bool, like numpy.ndarray.

IMHO the only good, reliable test is to check that > and <= both return False.

Not true. This is the precise definition of “undefined behaviour”.


It’s not poop… It’s “made by the dog” :smiley:

Seriously, you can't have undefined behavior just because the order of the elements in the iterable changes.

You can have undefined behavior only if all the objects in the iterable behave in an undefined way under the comparison operators.

Otherwise, IMHO, this is a… bug.

No, that’s not how it works. We can just say that the output of any operation with NaN is undefined and get away with it. CPython is the reference implementation and we make the rules. You could argue that it’s a stupid rule, but that’s it.

This is not the case, since if the NaN is not the first object, the operation is not undefined; it gives you the correct answer. If the operation gave undefined behavior with NaN, or any other object like it, at any position, I would agree with you. But this is not the case.

The fact is: what does the coder expect? As far as I know, Python emphasizes simplicity and readability, and that code should behave "like a human expects". This is not the case here.