Using an identity function lambda x: x is needlessly slow. At least
that’s what I found when I benchmarked it a few years ago, and I don’t
expect that the situation has changed enough for me to bother
benchmarking it again.
I have to step away from the keyboard now, but when I return I’ll post
the version of minmax I’ve been using.
I think we can just return default as-is; it should be a tuple of the default minimum and maximum.
_sentinel = object()

def minmax(*args, key=None, default=_sentinel):
    args_len = len(args)
    if args_len == 0:
        if default is _sentinel:
            raise ValueError("minmax expected 1 or more argument(s), got 0")
        return default
    elif args_len == 1:
        seq = args[0]
    else:
        seq = args
    it = iter(seq)
    vmax = vmin = next(it, _sentinel)
    if vmax is _sentinel:
        if default is _sentinel:
            raise ValueError("minmax arg is an empty sequence")
        return default
    if key is None:
        for val in it:
            if val > vmax:
                vmax = val
            elif val < vmin:
                vmin = val
        return vmin, vmax
    fmax = fmin = key(vmax)
    for val in it:
        fval = key(val)
        if fval > fmax:
            fmax = fval
            vmax = val
        elif fval < fmin:
            fmin = fval
            vmin = val
    return vmin, vmax
Then we can say:
minimum, maximum = minmax(seq, default=(math.inf, -math.inf))
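For comparison, the same default behaviour can be approximated with the builtins today (two passes, so only for re-iterable sequences; this is a sketch of the intended semantics, not the proposed function):

```python
import math

seq = []  # an empty sequence, to exercise the defaults
minimum = min(seq, default=math.inf)
maximum = max(seq, default=-math.inf)
print((minimum, maximum))  # (inf, -inf)
```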
Anyway, why do you need that strange _sentinel instead of a simple None? It’s good for the next(), but I don’t understand why it’s also used for default.
PS: max(default=5) raises an error rather than returning the default value.
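To spell out that PS: in the builtin max(), default only applies to an empty iterable, not to a missing argument list. A quick check:

```python
# default kicks in for an empty iterable...
print(max((), default=5))  # 5

# ...but passing no iterable at all is still an error
try:
    max(default=5)
except TypeError as exc:
    print("TypeError:", exc)
```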
“I would want to see several examples. I have not found any use of this
pattern in the stdlib.”
Do you accept that people sometimes want the minimum or the maximum of
a data set? If your answer is “Yes”, then it shouldn’t be hard to
believe that sometimes people might want the minimum AND the maximum.
Essentially any time you want to know both the min and max, using
separate function calls is wasteful.
(But not as wasteful as one solution I’ve seen, which is to sort the
data then return the first and last item.)
Calculating the min and max in a single pass, instead of two passes, may
be especially useful whenever:
- the cost of comparisons is high: a single pass can calculate both values in approximately 25% fewer comparisons;
- the cost of iteration is significant (making two passes is twice as expensive as making one pass);
- or the first pass exhausts the iterator, leaving nothing for the second pass.
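The last point is easy to demonstrate: a generator can only be consumed once, so calling min() and then max() on the same iterator fails on the second call:

```python
data = (x * x for x in range(5))

lo = min(data)  # first pass consumes the whole generator
try:
    hi = max(data)  # second pass finds nothing left
except ValueError:
    print("generator exhausted; lo =", lo)  # lo = 0
```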
This thread has already given several use-cases:
the range of data
the midrange
figuring out line lengths for wrapping
using a dict as a sparse array
dealing with time intervals
plus additional examples of working code found by Ava.
In addition, minmax is provided by the C++ Boost library.
Here is my implementation using @rhettinger 's algorithm, which uses on average 1.5 comparisons per item. This version supports the keyword parameters key and default.
import itertools

_sentinel = object()

def minmax(*args, key=None, default=_sentinel):
    args_len = len(args)
    if args_len == 0:
        if default is _sentinel:
            raise ValueError("minmax expected 1 or more argument(s), got 0")
        return default
    elif args_len == 1:
        seq = args[0]
    else:
        seq = args
    it = iter(seq)
    vmax = vmin = next(it, _sentinel)
    if vmax is _sentinel:
        if default is _sentinel:
            raise ValueError("minmax arg is an empty sequence")
        return default
    if key is None:
        for val0, val1 in itertools.zip_longest(it, it, fillvalue=vmin):
            if val0 > val1:
                val0, val1 = val1, val0
            if val0 < vmin:
                vmin = val0
            if val1 > vmax:
                vmax = val1
        return vmin, vmax
    fmax = fmin = key(vmax)
    for val0, val1 in itertools.zip_longest(it, it, fillvalue=vmin):
        fval0 = key(val0)
        fval1 = key(val1)
        if fval0 > fval1:
            val0, val1 = val1, val0
            fval0, fval1 = fval1, fval0
        if fval0 < fmin:
            fmin = fval0
            vmin = val0
        if fval1 > fmax:
            fmax = fval1
            vmax = val1
    return vmin, vmax
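The 1.5-comparisons-per-item claim can be checked empirically by counting comparisons with a small wrapper class (a sketch; Counted is a hypothetical helper, not part of the proposal):

```python
import itertools

class Counted:
    """Wraps a value and counts every ordering comparison made on it."""
    count = 0
    def __init__(self, v):
        self.v = v
    def __lt__(self, other):
        Counted.count += 1
        return self.v < other.v
    def __gt__(self, other):
        Counted.count += 1
        return self.v > other.v

data = [Counted(v) for v in range(1000)]
it = iter(data)
vmax = vmin = next(it)
# Pairwise scan: 3 comparisons per pair of items = 1.5 per item.
for a, b in itertools.zip_longest(it, it, fillvalue=vmin):
    if a > b:
        a, b = b, a
    if a < vmin:
        vmin = a
    if b > vmax:
        vmax = b

print(Counted.count / len(data))  # 1.5
print(vmin.v, vmax.v)  # 0 999
```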
Thank you Ava! It seems your search skills are better than mine. Indeed, there are a few use cases for minmax() in the stdlib and in third-party code. Not many, but they exist.
This one is not related because min() and max() are called with different arguments.
Some of these examples are not relevant either. But some are.
We do not add every function that might be used by some people. The question is how common such a problem is. Ava found 4 cases in the stdlib. Most of your arguments are not applicable to these cases, because in all of them we iterate over lists/tuples of strings/numbers. Although it might speed up two cases (commonpath() and commonprefix()) by a few percent. Are 4 cases enough to add a new function in itertools? I leave this to Raymond.
The most important difference is in the handling of inputs of type float (or int). The vast majority of Python code does not expect to work with complex numbers, and so for example for most users it makes sense for math.sqrt(-2.0) to raise ValueError. Users who work in a domain where complex numbers make sense can use cmath.sqrt instead, which will return an appropriate complex output for cmath.sqrt(-2.0).
(Yes, I know that the ** operator already returns complex outputs for non-complex inputs in some cases. IMO, that was a (minor) design mistake, but we’re getting off-topic for this thread.)
Yes!
The two zipped arguments are the same iterator, but the two slots are not at the same position: the pull for the first slot always advances the iterator one step ahead of the second. Therefore, zip_longest yields consecutive, non-overlapping pairs. Clever trick, but it’s not magic …
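Concretely, passing the same iterator twice means each zip step pulls two consecutive items from it:

```python
import itertools

it = iter([1, 2, 3, 4, 5])
# Both arguments are the same iterator object, so each step of the zip
# consumes two items: one for the first slot, one for the second.
pairs = list(itertools.zip_longest(it, it, fillvalue=0))
print(pairs)  # [(1, 2), (3, 4), (5, 0)]
```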
If this is ever to be implemented in CPython in Python, then itertools is no problem at all, since it is part of the standard library. If it ends up as a C function, it will be completely different anyway.
Being in the Python stdlib does not mean that a CPython C API exists for itertools… I don’t know about this. Anyway, I benchmarked my solution against Hettinger’s, and Hettinger’s solution, brilliant as it is, is slower:
marco@buzz:~$ python3.9 -m pyperf timeit "minmax(range(100000))" --rigorous --affinity 0 -s 'MY_CODE'
Mean +- std dev: 5.15 ms +- 0.16 ms
marco@buzz:~$
marco@buzz:~$ python3.9 -m pyperf timeit "minmax(range(100000))" --rigorous --affinity 0 -s 'HETTINGER_CODE'
Mean +- std dev: 6.35 ms +- 0.23 ms
and I don’t understand how zipping two iterators and doing a triple comparison can be faster than simply iterating over one object and doing one or at most two comparisons…
A little modification: max() and min() suffer from a subtle bug: if the first elements are NaNs, NaN is returned.
Why? Because the first element is taken as the initial value for max and min. The functions then check whether the next element is greater or less, and if so, they replace the max or min value with that element, and so on.
The problem is that every ordering comparison with NaN returns False. So a first element that is NaN will never compare greater or less than the other numbers in the sequence, and will never be replaced.
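The behavior is easy to reproduce with the builtin max():

```python
nan = float("nan")

# Every ordering comparison with NaN is False.
print(nan > 1.0, nan < 1.0, nan == nan)  # False False False

print(max([nan, 1.0, 2.0]))  # nan: a leading NaN is never replaced
print(max([1.0, nan, 2.0]))  # 2.0: a NaN in the middle is simply skipped
```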
This is the code that fixes this bug:
_sentinel = object()

def minmax(*args, key=None):
    args_len = len(args)
    if args_len == 0:
        fname = minmax.__name__
        raise ValueError(f"{fname}() expected 1 argument, got 0")
    elif args_len == 1:
        seq = args[0]
    else:
        seq = args
    it = iter(seq)
    vmax = vmin = next(it, _sentinel)
    if vmax is _sentinel:
        fname = minmax.__name__
        raise ValueError(f"{fname}() arg is an empty sequence")
    first_comparable = False
    if key is None:
        for val in it:
            if val > vmax:
                vmax = val
                first_comparable = True
            elif not first_comparable and not val <= vmax:
                # not comparable object, like NaN
                vmax = vmin = val
                continue
            elif val < vmin:
                vmin = val
                first_comparable = True
    else:
        fmax = fmin = key(vmax)
        for val in it:
            fval = key(val)
            if fval > fmax:
                fmax = fval
                vmax = val
                first_comparable = True
            elif not first_comparable and not fval <= fmax:
                fmax = fmin = fval
                vmax = vmin = val
                continue
            elif fval < fmin:
                fmin = fval
                vmin = val
                first_comparable = True
    return (vmin, vmax)
This slows down the function, but only a little (5.57 ms ± 0.24 ms).
I hope this simple fix will be applied also to min() and max().
I wouldn’t consider this to be a bug, or at least not one that should be fixed by altering the implementation of min or max. If the elements given to max or min aren’t totally ordered, then there isn’t a meaningful result to be given; consider this a case of undefined behaviour. sort and sorted have the same issue.
numpy does not suffer from this bug. And sorry, it is a bug, and IMHO it should be fixed. And the fix, as you can read, is very simple.
That’s not true. This happens only if the first elements have comparison operators that always return False. If there’s a NaN in the middle of the iterator, there’s no problem. If this is not a bug… well, it’s an old, known bug.