Itertools.ilen(iterable)

More ilen implementations, with a benchmark (using your length of 100000).

Thanks, I like the ingenuity of the recipe. However, it suffers from one extra parameter, which I would never want to think about when using such a basic function. To add it to the current benchmarks:

a = range(100000)
import more_itertools
import collections as coll       # for coll.deque below
from itertools import batched    # Python 3.12+
%timeit sum(1 for _ in iter(a))                     # 6.1 ms
%timeit sum(map(lambda i: 1, iter(a)))              # 5.5 ms
%timeit coll.deque(enumerate(iter(a)), maxlen=1)    # 4.4 ms
%timeit more_itertools.ilen(iter(a))                # 3.5 ms
%timeit sum(map(len, batched(iter(a), 5)))          # 2.9 ms
%timeit len(list(iter(a)))                          # 2.6 ms
%timeit sum(map(len, batched(iter(a), 10)))         # 2.5 ms
%timeit sum(map(len, batched(iter(a), 1000)))       # 2.2 ms
%timeit ilen(iter(a))                               # 1.4 ms
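For reference, here is a minimal pure-Python counting consumer in the spirit of the more_itertools recipe (a sketch, not necessarily the library's exact code):

```python
from collections import deque
from itertools import count

def ilen(iterable):
    # Pair each item with a running counter, discard the pairs via a
    # zero-length deque (a fast C-level consumer), then read how far
    # the counter advanced: the next value it yields is the item count.
    counter = count()
    deque(zip(iterable, counter), maxlen=0)
    return next(counter)
```

Because `zip` pulls from the iterable first, the counter advances exactly once per item, so `next(counter)` returns the length even for an empty iterable.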

I think the utility of consuming an iterator immediately to find its length, without regard to the contents, is greater than some might expect. After all, there’s a Unix utility (wc) dedicated to exactly that.
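As an illustration of that use case, a `wc -l` analogue that counts lines without ever holding the file's contents in memory (a hypothetical helper, not from any library):

```python
import os
import tempfile

def count_lines(path):
    # wc -l analogue: consume the file's line iterator, counting as we go,
    # without materializing the lines in memory
    with open(path) as f:
        return sum(1 for _ in f)
```

With a proposed `ilen`, the body would simply be `return ilen(f)`.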


Meh, 1000 is my go-to number for such things :-). I also just added two more solutions also using 1000, although that’s just ints, not elements from the iterable (1000 of those could cause a memory problem, but at least it’s not unlimited like the list solution).


The very fact that numerous ways to do this have already been invented, and that the 2 main iterator libraries implement it (one went through a process of selecting the best pure-Python recipe, the other implemented it in C)…

There is still a live benchmarking thread on Stack Overflow, plus this thread.

To me it seems like a fairly good indicator that the “one obvious way to do it” is missing. Not sure if it’s ilen, but it’s my best guess ATM. :slight_smile:


One more use case for it: micro-benchmarking iterators:

%timeit deque(map(bool, range(10)), maxlen=0)   # 771 ns  # high construction overhead (addressable, but ilen is nicer to work with)
%timeit list(map(bool, range(10)))              # 480 ns  # memory becomes a factor for large iterators
%timeit ilen(map(bool, range(10)))              # 460 ns  # ideal

As a side note, it could also be used as an iterator consumer.

ilen(func(i) for i in range(1000))  # func(0) ... func(999) are executed, while the results are discarded

In the above example there is not much gain over a for statement, but for some iterators it can be preferable.
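Spelled out, with a sum-based stand-in for the proposed ilen and a hypothetical side-effecting `func`:

```python
def ilen(iterable):
    # minimal stand-in for the proposed ilen
    return sum(1 for _ in iterable)

calls = []

def func(i):
    # hypothetical side-effecting function, for illustration only
    calls.append(i)

# drive the generator purely for its side effects; return values are discarded
ilen(func(i) for i in range(1000))
```

After the call, `calls` holds all 1000 arguments, confirming every `func(i)` ran.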

Could you please provide a benchmark to support your statement? Also, try:

ilen(map(func, range(1000)))

I have been thinking about this and I am strongly leaning to +1 for this.

Note, for my own use I have found a good solution. It is possible to write a Cython function in “pure Python” mode which, when compiled, performs almost as well as a C extension. So I can use it in development mode, and it is efficient in released libraries.
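A rough sketch of that approach (my assumption of what such a function looks like, not the author's actual code): a typed counting loop that runs as ordinary Python, but whose annotations let Cython's pure-Python mode compile the counter down to a C integer:

```python
def ilen(iterable) -> int:
    # Plain Python loop; under Cython's pure-Python mode the annotated
    # counter can be compiled to a C-level integer increment, while the
    # same file still runs unmodified on a stock interpreter.
    n: int = 0
    for _ in iterable:
        n += 1
    return n
```

The same source file works both interpreted (during development) and compiled (in a released library), which is the appeal of pure-Python mode.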

However, I think that this function should be part of standard library:

  1. Development and maintenance cost is close to none.
  2. It is a fundamental function. It is not used as often as len on a list, but the operation it performs is nevertheless one of the basics needed to cover the landscape.
  3. It can (and would) be used as an iterator consumer with low overhead (a lower bound, plus integer counting). deque provides similar performance, but it is unnecessarily verbose and a somewhat unrelated import.
  4. 2 of the most popular libraries implement this. One side of it is “just use them in this case”, but another side is “maybe it is useful enough”. Also, one of them implements it in pure python with more than 2x worse performance. Another, at least in my case, would be a big dependency with a lot of code and extensions purely for this 1 function. Also, it has somewhat inconvenient function name count_items, where I would be strongly for ilen.

I have been using this for a while now and by now it is one of the most used functions from my iterator library in my development process.

And I am not very happy having to write from collections import deque; deque(..., maxlen=0) when I want to share performance results with someone else.

As I said, I don’t think this one should be judged on how many use cases it has, but more in a light of it potentially being a building block which fills several small gaps in iterator space at minimal cost.


How much (when you minimize that cost)?

a = iter([])
%timeit ilen2(a)    # 37.9 ns

from collections import deque
%timeit deque(a, maxlen=0)   # 426 ns

But you’d create that deque only once. I meant the cost of using the deque for consuming an iterator.

Benchmarking your earlier example, it’s faster than list (times in ns):

576 list(map(bool, range(10)))
461 consume(map(bool, range(10)))
576 list(map(bool, range(10)))
460 consume(map(bool, range(10)))
576 list(map(bool, range(10)))
464 consume(map(bool, range(10)))

And your new example with a = iter([]):

149 list(a)
29 consume(a)
148 list(a)
29 consume(a)
148 list(a)
29 consume(a)
Script
from timeit import repeat 

setup = '''
from collections import deque
consume = deque(maxlen=0).extend
a = iter([])
'''

for code in [
    'list(map(bool, range(10)))',
    'consume(map(bool, range(10)))',
] * 3:
    print(round(min(repeat(code, setup, number=10**3, repeat=100)) * 10**6), code)

for code in [
    'list(a)',
    'consume(a)',
] * 3:
    print(round(min(repeat(code, setup, number=10**3, repeat=100)) * 10**6), code)


Benchmarks of it are in the initial post.

The speed of the actual consumption by deque and ilen (extension or Cython) is almost the same (a 3% difference in favour of deque, either because of the integer counting or because my implementation is not optimal; either way, it is insignificant).

What I was referring to is initialization cost for short iterators.

Yes, deque for iterator consumption is a very performant solution. Its “issues” are:
a) verbosity
b) an unrelated import
c) initialization cost for micro-benchmarks on short iterators

But you wouldn’t create a new deque for each short iterator! You’d create one deque, get its extend method, and use that. Like I just did in my previous comment and like I and others have been doing for years.

Can you run the script of my previous comment but with your ilen included?

I love how you find elegant solutions to my problems. Yes, initialization cost is not an issue. Obviously, it can be pre-constructed!

a = iter([])
%timeit ilen2(a)    # 37.9 ns
%timeit dq(a)       # 36.7 ns

Good point, so performance is not an issue here.

Are you sure many people do this? E.g. more_itertools doesn’t seem to do it this way. more_itertools.recipes — more-itertools 10.2.0 documentation

I don’t know about “many”, but I’ve seen others do it as well. Likely that’s where I got it from. Not sure why more-itertools doesn’t, I’m guessing because it doesn’t prioritize speed over simplicity.


set().isdisjoint is another fast consumer of iterables
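Indeed: nothing is ever “in” an empty set, so `isdisjoint` never finds a common element and walks the whole iterable at C speed. A quick demonstration:

```python
# An empty set's isdisjoint never matches, so it drains the iterable fully.
consume = set().isdisjoint

it = iter(range(1000))
consume(it)  # returns True; as a side effect, it is now exhausted
```

Like the pre-constructed `deque(maxlen=0).extend`, the bound method can be created once and reused; unlike `deque`, it needs no import at all.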
