Using an identity function lambda x: x is needlessly slow. At least
that’s what I found when I benchmarked it a few years ago, and I don’t
expect that the situation has changed enough for me to bother
benchmarking it again.
I have to step away from the keyboard now, but when I return I’ll post
the version of minmax I’ve been using.
I think we can just return default as-is; it should be a tuple of the default minimum and maximum.
_sentinel = object()

def minmax(*args, key=None, default=_sentinel):
    args_len = len(args)
    if args_len == 0:
        if default is _sentinel:
            raise ValueError("minmax expected 1 or more argument(s), got 0")
        return default
    elif args_len == 1:
        seq = args[0]
    else:
        seq = args
    it = iter(seq)
    vmax = vmin = next(it, _sentinel)
    if vmax is _sentinel:
        if default is _sentinel:
            raise ValueError("minmax arg is an empty sequence")
        return default
    if key is None:
        for val in it:
            if val > vmax:
                vmax = val
            elif val < vmin:
                vmin = val
        return vmin, vmax
    fmax = fmin = key(vmax)
    for val in it:
        fval = key(val)
        if fval > fmax:
            fmax = fval
            vmax = val
        elif fval < fmin:
            fmin = fval
            vmin = val
    return vmin, vmax
Then we can say:
minimum, maximum = minmax(seq, default=(math.inf, -math.inf))
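For comparison, the same default behaviour can be approximated with the builtins today (two passes, so only for re-iterable sequences; this is a sketch of the intended semantics, not the proposed function):

```python
import math

seq = []  # an empty sequence, to exercise the defaults
minimum = min(seq, default=math.inf)
maximum = max(seq, default=-math.inf)
print((minimum, maximum))  # (inf, -inf)
```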
Anyway, why do you need that strange _sentinel instead of a simple None? It’s good for the next(), but I don’t understand why it’s also used for default.
PS: max(default=5) raises an error rather than returning the default value.
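To spell out that PS: in the builtin max(), default only applies to an empty iterable, not to a missing argument list. A quick check:

```python
# default kicks in for an empty iterable...
print(max((), default=5))  # 5

# ...but passing no iterable at all is still an error
try:
    max(default=5)
except TypeError as exc:
    print("TypeError:", exc)
```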
“I would want to see several examples. I have not found any use of this
pattern in the stdlib.”
Do you accept that people sometimes want the minimum or the maximum of
a data set? If your answer is “Yes”, then it shouldn’t be hard to
believe that sometimes people might want the minimum AND the maximum.
Essentially any time you want to know both the min and max, using
separate function calls is wasteful.
(But not as wasteful as one solution I’ve seen, which is to sort the
data then return the first and last item.)
Calculating the min and max in a single pass, instead of two passes, may
be especially useful whenever:
- the cost of comparisons is high: a single pass can calculate both values in approximately 25% fewer comparisons;
- the cost of iteration is significant (making two passes is twice as expensive as making one pass);
- or the first pass exhausts the iterator, leaving nothing for the second pass.
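The last point is easy to demonstrate: a generator can only be consumed once, so calling min() and then max() on the same iterator fails on the second call:

```python
data = (x * x for x in range(5))

lo = min(data)  # first pass consumes the whole generator
try:
    hi = max(data)  # second pass finds nothing left
except ValueError:
    print("generator exhausted; lo =", lo)  # lo = 0
```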
This thread has already given several use-cases:
the range of data
the midrange
figuring out line lengths for wrapping
using a dict as a sparse array
dealing with time intervals
plus additional examples of working code found by Ava.
In addition, minmax is provided by the C++ Boost library.
Here is my implementation using @rhettinger 's algorithm, which uses on average 1.5 comparisons per item. This version supports the keyword parameters key and default.
import itertools

_sentinel = object()

def minmax(*args, key=None, default=_sentinel):
    args_len = len(args)
    if args_len == 0:
        if default is _sentinel:
            raise ValueError("minmax expected 1 or more argument(s), got 0")
        return default
    elif args_len == 1:
        seq = args[0]
    else:
        seq = args
    it = iter(seq)
    vmax = vmin = next(it, _sentinel)
    if vmax is _sentinel:
        if default is _sentinel:
            raise ValueError("minmax arg is an empty sequence")
        return default
    if key is None:
        for val0, val1 in itertools.zip_longest(it, it, fillvalue=vmin):
            if val0 > val1:
                val0, val1 = val1, val0
            if val0 < vmin:
                vmin = val0
            if val1 > vmax:
                vmax = val1
        return vmin, vmax
    fmax = fmin = key(vmax)
    for val0, val1 in itertools.zip_longest(it, it, fillvalue=vmin):
        fval0 = key(val0)
        fval1 = key(val1)
        if fval0 > fval1:
            val0, val1 = val1, val0
            fval0, fval1 = fval1, fval0
        if fval0 < fmin:
            fmin = fval0
            vmin = val0
        if fval1 > fmax:
            fmax = fval1
            vmax = val1
    return vmin, vmax
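The 1.5-comparisons-per-item claim can be checked empirically by counting comparisons with a small wrapper class (a sketch; Counted is a hypothetical helper, not part of the proposal):

```python
import itertools

class Counted:
    """Wraps a value and counts every ordering comparison made on it."""
    count = 0
    def __init__(self, v):
        self.v = v
    def __lt__(self, other):
        Counted.count += 1
        return self.v < other.v
    def __gt__(self, other):
        Counted.count += 1
        return self.v > other.v

data = [Counted(v) for v in range(1000)]
it = iter(data)
vmax = vmin = next(it)
# Pairwise scan: 3 comparisons per pair of items = 1.5 per item.
for a, b in itertools.zip_longest(it, it, fillvalue=vmin):
    if a > b:
        a, b = b, a
    if a < vmin:
        vmin = a
    if b > vmax:
        vmax = b

print(Counted.count / len(data))  # 1.5
print(vmin.v, vmax.v)  # 0 999
```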
Thank you Ava! It seems your search skills are better than mine. Indeed, there are a few use cases for minmax() in the stdlib and in third-party code. Not many, but they exist.
This one is not related because min() and max() are called with different arguments.
Some of these examples are not relevant either. But some are.
We do not add every function that might be used by some people. The question is how common such a problem is. Ava found 4 cases in the stdlib. Most of your arguments are not applicable to these cases, because in all of them we iterate over lists/tuples of strings/numbers. Although it might speed up two cases (commonpath() and commonprefix()) by a few percent. Are 4 cases enough to add a new function in itertools? I leave this to Raymond.
The most important difference is in the handling of inputs of type float (or int). The vast majority of Python code does not expect to work with complex numbers, and so for example for most users it makes sense for math.sqrt(-2.0) to raise ValueError. Users who work in a domain where complex numbers make sense can use cmath.sqrt instead, which will return an appropriate complex output for cmath.sqrt(-2.0).
(Yes, I know that the ** operator already returns complex outputs for non-complex inputs in some cases. IMO, that was a (minor) design mistake, but we’re getting off-topic for this thread.)
Yes!
The two zipped arguments are the same iterator, but the two slots are not at the same position: the pull for the first slot always advances the iterator one step ahead of the second. Therefore, zip_longest yields consecutive, non-overlapping pairs. Clever trick, but it’s not magic …
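Concretely, passing the same iterator twice means each zip step pulls two consecutive items from it:

```python
import itertools

it = iter([1, 2, 3, 4, 5])
# Both arguments are the same iterator object, so each step of the zip
# consumes two items: one for the first slot, one for the second.
pairs = list(itertools.zip_longest(it, it, fillvalue=0))
print(pairs)  # [(1, 2), (3, 4), (5, 0)]
```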
If this is ever to be implemented in CPython in Python, then itertools is no problem at all, since it is part of the standard library. If it ends up as a C function, it will be completely different anyway.
Being in the Python stdlib does not mean that a CPython C API exists for itertools… I don’t know about this. Anyway, I benchmarked my solution against Hettinger’s, and Hettinger’s solution, brilliant as it is, is slower:
marco@buzz:~$ python3.9 -m pyperf timeit "minmax(range(100000))" --rigorous --affinity 0 -s 'MY_CODE'
Mean +- std dev: 5.15 ms +- 0.16 ms
marco@buzz:~$
marco@buzz:~$ python3.9 -m pyperf timeit "minmax(range(100000))" --rigorous --affinity 0 -s 'HETTINGER_CODE'
Mean +- std dev: 6.35 ms +- 0.23 ms
and I don’t understand how zipping two iterators and doing a triple comparison can be faster than simply iterating over one object and doing one or at most two comparisons…
A little modification: max() and min() suffer from a subtle bug: if the first elements are NaNs, NaN is returned.
Why? Because the first element is taken as the initial value for max and min. The functions then check whether the next element is greater or less, and if so, they replace the max or min value with that element, and so on.
The problem is that every ordering comparison with NaN returns False. So a first element that is NaN will never compare greater or less than the other numbers in the sequence, and will never be replaced.
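The behavior is easy to reproduce with the builtin max():

```python
nan = float("nan")

# Every ordering comparison with NaN is False.
print(nan > 1.0, nan < 1.0, nan == nan)  # False False False

print(max([nan, 1.0, 2.0]))  # nan: a leading NaN is never replaced
print(max([1.0, nan, 2.0]))  # 2.0: a NaN in the middle is simply skipped
```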
This is the code that fixes this bug:
_sentinel = object()

def minmax(*args, key=None):
    args_len = len(args)
    if args_len == 0:
        fname = minmax.__name__
        raise ValueError(f"{fname}() expected 1 argument, got 0")
    elif args_len == 1:
        seq = args[0]
    else:
        seq = args
    it = iter(seq)
    vmax = vmin = next(it, _sentinel)
    if vmax is _sentinel:
        fname = minmax.__name__
        raise ValueError(f"{fname}() arg is an empty sequence")
    first_comparable = False
    if key is None:
        for val in it:
            if val > vmax:
                vmax = val
                first_comparable = True
            elif not first_comparable and not val <= vmax:
                # not comparable object, like NaN
                vmax = vmin = val
                continue
            elif val < vmin:
                vmin = val
                first_comparable = True
    else:
        fmax = fmin = key(vmax)
        for val in it:
            fval = key(val)
            if fval > fmax:
                fmax = fval
                vmax = val
                first_comparable = True
            elif not first_comparable and not fval <= fmax:
                fmax = fmin = fval
                vmax = vmin = val
                continue
            elif fval < fmin:
                fmin = fval
                vmin = val
                first_comparable = True
    return (vmin, vmax)
This slows down the function, but only a little (5.57 ms ± 0.24 ms).
I hope this simple fix will be applied also to min() and max().
I wouldn’t consider this to be a bug, or at least not one that should be fixed by altering the implementation of min or max. If the elements given to max or min aren’t totally ordered, then there isn’t a meaningful result to be given; consider this a case of undefined behaviour. sort and sorted have the same issue.
numpy does not suffer from this bug. And sorry, it is a bug, and IMHO it should be fixed. And the fix, as you can read, is very simple.
That’s not true. This happens only if the first elements have comparison operators that always return False. If there’s a NaN in the middle of the iterator, there’s no problem. If this is not a bug… well, it’s an old, known bug.