More ilen implementations with benchmark (using your length of 100000).
Thanks, I like the innovativeness of the recipe. However, it suffers from one extra parameter, which I would never want to think about when using such a basic function. To add it to the current benchmarks:
a = range(100000)
import collections as coll
import more_itertools
from itertools import batched  # Python 3.12+
%timeit sum(1 for _ in iter(a))                   # 6.1 ms
%timeit sum(map(lambda i: 1, iter(a)))            # 5.5 ms
%timeit coll.deque(enumerate(iter(a)), maxlen=1)  # 4.4 ms
%timeit more_itertools.ilen(iter(a))              # 3.5 ms
%timeit sum(map(len, batched(iter(a), 5)))        # 2.9 ms
%timeit len(list(iter(a)))                        # 2.6 ms
%timeit sum(map(len, batched(iter(a), 10)))       # 2.5 ms
%timeit sum(map(len, batched(iter(a), 1000)))     # 2.2 ms
%timeit ilen(iter(a))                             # 1.4 ms
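For reference, the pure-Python recipe that more_itertools.ilen uses is roughly the following (a sketch from memory; check the more-itertools source for the exact current form):

from collections import deque
from itertools import count

def ilen(iterable):
    # zip() pulls one value from counter for every element of iterable;
    # the maxlen=0 deque discards the pairs as fast as possible.
    counter = count()
    deque(zip(iterable, counter), maxlen=0)
    return next(counter)

After n elements, the counter has yielded 0..n-1, so next(counter) returns n.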
I think the utility of consuming an iterator immediately to find its length, without regard to the contents, is greater than some might expect. After all, there's a Unix utility (wc) dedicated to exactly that.
Meh, 1000 is my go-to number for such things :-). I also just added two more solutions also using 1000, although that's just ints, not elements from the iterable (1000 of those could cause a memory problem, but at least it's not unlimited like the list solution).
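Spelled out as a function, the batched-based variant from the benchmark above would look roughly like this (ilen_batched is just an illustrative name; batched requires Python 3.12+):

from itertools import batched  # Python 3.12+

def ilen_batched(iterable, n=1000):
    # Each batch is a tuple of at most n elements, so peak memory is
    # bounded by n (unlike list()); summing the tuple lengths gives
    # the total count.
    return sum(map(len, batched(iterable, n)))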
The very fact that numerous ways to do this have already been invented, and that the 2 main iterator libraries implement it (one ran a process of selecting the best pure-Python recipe, the other implemented it in C)…
There is still a live benchmarking thread on Stack Overflow, as well as this thread.
To me it seems like a fairly good indicator that the "one obvious way to do it" is missing. Not sure if it's ilen, but it's my best guess ATM.
One more use case for it - micro-benchmarking iterators (times in ns):
from collections import deque
%timeit deque(map(bool, range(10)), maxlen=0)  # 771 ns  # high construction overhead (possible to address, of course, but ilen is nicer to work with)
%timeit list(map(bool, range(10)))             # 480 ns  # memory becomes a factor for large iterators
%timeit ilen(map(bool, range(10)))             # 460 ns  # ideal
As a side note, it could also be used as an iterator consumer.
ilen(func(i) for i in range(1000))  # func(0) ... func(999) will be executed, while the results are discarded
In the above example there is not much gain over the for statement, but for some iterators it can be better than for.
Could you please provide a benchmark to support your statement? Also, try:
ilen(map(func, range(1000)))
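For context, a self-contained comparison along these lines might look as follows (func is a stand-in for arbitrary per-element work, and ilen here is the more_itertools one; absolute numbers will vary):

from timeit import timeit
from more_itertools import ilen  # any ilen implementation works here

def func(i):  # stand-in for arbitrary per-element work
    return i * i

def consume_for():
    # plain for statement
    for i in range(1000):
        func(i)

print(timeit(consume_for, number=1000))
print(timeit(lambda: ilen(func(i) for i in range(1000)), number=1000))
print(timeit(lambda: ilen(map(func, range(1000))), number=1000))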
I have been thinking about this and I am strongly leaning to +1 for this.
Note, for my own use I have found a good solution. It is possible to make a cython function in "pure python" mode, which when compiled performs almost as well as an extension. So I can use it in dev mode and it is efficient in released libraries.
However, I think that this function should be part of the standard library:
- Development and maintenance cost is close to none.
- It is a fundamental function. It is not used as much as len of list, but nevertheless the operation it performs is one of the basics needed to cover the landscape.
- It can (and would) be used as an iterator consumer with low overhead (lower bound + integer counting). deque provides similar performance, but it is unnecessarily verbose and a somewhat unrelated import (see the sketch after this list).
- 2 of the most popular libraries implement this. One side of it is "just use them in this case", but the other side is "maybe it is useful enough". Also, one of them implements it in pure Python with more than 2x worse performance. The other, at least in my case, would be a big dependency with a lot of code and extensions purely for this 1 function. Also, it has the somewhat inconvenient function name count_items, where I would be strongly for ilen.
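To make the verbosity point concrete, here is a minimal side-by-side (process and data are stand-ins, and ilen stands in for the proposed function):

from collections import deque
from more_itertools import ilen  # today's stand-in for the proposed function

data = range(5)
def process(x):  # stand-in for any side-effecting call
    print(x)

deque((process(x) for x in data), maxlen=0)  # current idiom: extra import, opaque intent
ilen(process(x) for x in data)               # proposed: self-describing, also returns the count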
I have been using this for a while now and by now it is one of the most-used functions from my iterator library in my development process.
And I am not very happy when I need to do from collections import deque; deque(..., maxlen=0) when I want to share performance results with someone else.
As I said, I don't think this one should be judged on how many use cases it has, but more in light of it potentially being a building block which fills several small gaps in iterator space at minimal cost.
How much (when you minimize that cost)?
a = iter([])
%timeit ilen2(a) # 37.9 ns
from collections import deque
%timeit deque(a, maxlen=0) # 426 ns
But you'd create that deque only once. I meant the cost of using the deque for consuming an iterator.
Benchmarking your earlier example, it's faster than list (times in ns):
576 list(map(bool, range(10)))
461 consume(map(bool, range(10)))
576 list(map(bool, range(10)))
460 consume(map(bool, range(10)))
576 list(map(bool, range(10)))
464 consume(map(bool, range(10)))
And your new example with a = iter([])
:
149 list(a)
29 consume(a)
148 list(a)
29 consume(a)
148 list(a)
29 consume(a)
Script
from timeit import repeat
setup = '''
from collections import deque
consume = deque(maxlen=0).extend
a = iter([])
'''
for code in [
    'list(map(bool, range(10)))',
    'consume(map(bool, range(10)))',
] * 3:
    print(round(min(repeat(code, setup, number=10**3, repeat=100)) * 10**6), code)  # best time per call, in ns
for code in [
    'list(a)',
    'consume(a)',
] * 3:
    print(round(min(repeat(code, setup, number=10**3, repeat=100)) * 10**6), code)  # best time per call, in ns
Benchmarks of it are in the initial post.
The speed of actual consumption of deque and ilen (extension or cython) is almost the same (3% difference in favour of deque, either because of the integer counting or because my implementation is not the most optimal; but this is insignificant).
What I was referring to is the initialization cost for short iterators.
Yes, deque for iterator consumption is a very performant solution. The "issues" with it are:
a) verbosity
b) unrelated import
c) initialization cost for micro-benchmarks on short iterators
But you wouldn't create a new deque for each short iterator! You'd create one deque, get its extend method, and use that. Like I just did in my previous comment and like I and others have been doing for years.
Can you run the script of my previous comment but with your ilen included?
I love how you find elegant solutions to my problems. Yes, initialization cost is not an issue. Obviously, it can be pre-constructed!
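(Presumably dq below is the pre-built consumer in the spirit of the earlier script, i.e. something like:)

from collections import deque
dq = deque(maxlen=0).extend  # constructed once, reused for every iterator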
a = iter([])
%timeit ilen2(a) # 37.9 ns
%timeit dq(a) # 36.7 ns
Good point, so performance is not an issue here.
You sure many people do this? E.g. more_itertools doesn't seem to be doing it this way. more_itertools.recipes - more-itertools 10.2.0 documentation
I don't know about "many", but I've seen others do it as well. Likely that's where I got it from. Not sure why more-itertools doesn't, I'm guessing because it doesn't prioritize speed over simplicity.
set().isdisjoint is another fast consumer of iterables.
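A quick illustration of why that works: an empty set can never share an element with anything, so isdisjoint has to exhaust the whole iterable before returning True.

consume = set().isdisjoint  # bound method of an empty set, built once

it = iter(range(5))
consume(it)      # returns True, having pulled every element
print(list(it))  # -> [] (the iterator is exhausted)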