Challenge: Quest for counting iterator length at C speed

Backstory

All of my efforts to implement it in CPython have failed:
a) ilen was rejected, as it would be the only function in itertools returning a scalar, so it doesn't fit.
b) count.consume was rejected too (and reasonably so). Although I was happy with the functionality (atomic counting at C speed), itertools is just not a good place for it. I would probably have retracted it myself even if it had been accepted, to be honest.

Challenge

But the question remains! How to count iterator items, while consuming them, at C speed?

There are two main paths (and I have failed on both so far):
a) A clever recipe using the existing Python toolkit.
b) An addition to CPython which would provide it (directly or indirectly) and would be sensible, i.e. would not be rejected.

Criteria: Memory and Performance

Memory optimal: should not hold more than one iterator element in memory at a time (a rough check is sketched below).
Performance: the faster the better.
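To make the memory criterion concrete, here is one rough way to compare peak allocations while counting (a sketch of mine using tracemalloc, not part of the original post; peak_kib is a hypothetical helper and exact numbers will vary):

import tracemalloc

def peak_kib(count_func, iterator):
    # measure peak memory allocated while the counting function runs
    tracemalloc.start()
    count_func(iterator)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak // 1024

a = [1] * 100_000
print(peak_kib(lambda it: len(list(it)), iter(a)))       # hundreds of KiB: the whole list is rebuilt
print(peak_kib(lambda it: sum(1 for _ in it), iter(a)))  # roughly 0: one element at a time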

Examples:

a = [1] * 100_000
%timeit more_itertools.ilen(iter(a))
# 2.8 ms (currently the fastest memory-optimal pure-Python recipe)
%timeit len(list(iter(a)))
# 0.27 ms (C speed, memory inefficient)
%timeit sum(map(len, itertools.batched(iter(a), 1000)))
# 0.25 ms (C speed, memory inefficient, although less so)

See `itertools.ilen` addition · Issue #120478 · python/cpython · GitHub for a full list of Python recipes and performance comparisons.
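For context, the memory-optimal pure-Python recipe referenced above is essentially the classic zip/count/deque trick, along these lines (a sketch; not necessarily more_itertools' exact current code):

from collections import deque
from itertools import count

def ilen_recipe(iterable):
    # pair each element with a counter, exhaust the pairs at C speed,
    # then read how far the counter got
    counter = count()
    deque(zip(iterable, counter), maxlen=0)
    return next(counter)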

Comparing Results

Results will be compared on Python 3.12 via "Attempt This Online", or, if any CPython extensions are involved, everything will be compared on 3.12 compiled locally on Unix.

import math, timeit
import statistics as sts


def time_to_count(ilen_func, weights=(0.7, 0.2, 0.1)):
    # Time ilen_func on long, medium and short inputs, normalize each result
    # to "time per 100K elements", then print a weighted overall score.
    lengths = [100_000, 100, 10]
    results = []
    for n in lengths:
        norm = 100_000 // n
        lst = [1] * n
        test_func = lambda: ilen_func(iter(lst))
        num = norm * 100
        res = timeit.repeat(test_func, number=num)
        res = sts.mean(res) / num * 1000    # total seconds -> ms per call
        res = res * norm                    # -> time per 100K elements
        results.append(res)
    print([round(r, 2) for r in results])
    print(math.sumprod(results, weights))


def ilen(a):
    return sum(1 for _ in a)

time_to_count(ilen)

Attempt This Online!
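A quick correctness check worth running before timing a candidate (check_count is a hypothetical helper of mine, not part of the harness above):

def check_count(ilen_func):
    # verify the candidate against known lengths, including the empty case
    for n in (0, 1, 10, 100, 100_000):
        assert ilen_func(iter([1] * n)) == n, n

check_count(ilen)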

1 Like

It is 11x slower than the two alternatives I proposed (which do exactly that and nothing more, in C), namely ilen and count.consume.

There is also c): an implementation in C, available as a library on PyPI for those who need this very specific functionality. I would argue that this is the most popular path.

2 Likes

Unfortunately, not in this quest.

By the way, it already exists in an external library if anyone needs it: there is count_items in iteration_utilities, implemented as a C extension.
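For example (assuming the iteration-utilities package is installed):

from iteration_utilities import count_items

a = [1] * 100_000
print(count_items(iter(a)))   # 100000, consuming the iterator in C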

1 Like

Some ideas:

from operator import countOf
from itertools import *

def ilen(a):
    # zip(a) wraps each element in a 1-tuple entirely in C; compress then
    # keeps one 1 per (truthy) tuple, and sum counts them
    return sum(compress(repeat(1), zip(a)))

def ilen(a):
    # every 1-tuple has type tuple, so countOf counts one match per element
    return countOf(map(type, zip(a)), tuple)

def ilen(a):
    # every 1-tuple is truthy, so bool() of it is True for each element
    return countOf(map(bool, zip(a)), True)
2 Likes

This is the new best: 1.6 ms vs 2.7 ms for more_itertools.

From 11x down to only 6x slower.

Two more:

def ilen(a):
    # each 1-tuple from zip(a) has length 1, so the lengths sum to the count
    return sum(map(len, zip(a)))

from itertools import *
from collections import deque

lows = repeat(tuple(range(1000)))    # endless supply of (0, 1, ..., 999) blocks
consume = deque(maxlen=0).extend     # exhausts an iterator at C speed

def ilen(a):
    high = count(1)                                  # advanced once per block started
    low = chain.from_iterable(compress(lows, high))  # yields 0..999, 0..999, ...
    consume(zip(a, low))                             # one `low` value is pulled per element of `a`
    return next(low) + (next(high)-2) * 1000         # offset in current block + 1000 per earlier block

This one is fast because it avoids creating a new int object for each element.
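A quick sanity check of the reconstruction formula (my own addition; it assumes the lows/consume/ilen definitions above are in scope):

for n in (0, 1, 999, 1000, 1001, 5000):
    assert ilen(iter([None] * n)) == n, n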

1 Like

One of mine:

import itertools as itl
import operator as opr

IG0 = opr.itemgetter(0)
REP_ONES = itl.repeat(1)

def ilen(a):
    # pair each element with the same int 1, take the 1 back out, and sum
    return sum(map(IG0, zip(REP_ONES, a)))

Pretty fast for all sizes:

def ilen(a):
    high = 0
    low = 255
    for _ in a:
        if low:
            low = low - 1
        else:
            high += 1
            low = 255
    return high * 256 + (255 - low)

(Note that I do low = low - 1 intentionally, don’t change it to low -= 1.)

Optimized for very short iterables:

def ilen(a):
    high = low = 256
    for _ in a:
        if low:
            low = low - 1
        else:
            high += 256
            low = 255
    return high - low

And I like it better. Less code, and you can more easily see that the value of high - low starts with 0 and increases by 1 for each element.
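A small check of the second variant around the block boundary (my addition; it assumes the ilen just above):

for n in (0, 1, 255, 256, 257, 1000):
    assert ilen(iter([None] * n)) == n, n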

from itertools import islice, repeat

_MARKER = object()

def ilen_d2(iterable):
    it = iter(iterable)
    size = 32
    nb = size - 1
    total = 0
    while 1:
        total += size
        it_rep = repeat(None, size)
        # advance zip(it, it_rep) by size-1 steps (the standard islice "consume" recipe)
        next(islice(zip(it, it_rep), nb, nb), None)
        if next(it, _MARKER) is _MARKER:
            # `it` ran out mid-batch: whatever is left in it_rep was overcounted
            return total - len(list(it_rep))
        if size < 2048:
            size *= 2
            nb = size - 1

Chart time. Divided by more_itertools.ilen.

[Figure 1: per-recipe timings, normalized to more_itertools.ilen]

I wonder if the green/brown boundary can be crossed. It is roughly 2x slower than sum(ones), and 3-4x slower than iteration_utilities.count_items.

from itertools import *
from operator import indexOf

def ilen_s5(iterable):
    # selector stream: True once every 100 elements
    milestones = cycle((False,) * 99 + (True,))
    total = 99
    for _ in compress(iterable, milestones):
        del _            # drop the element immediately (the memory criterion)
        total += 100
    # the remaining selectors up to the next True tell how much the final
    # partial block was overcounted
    return total - indexOf(milestones, True)
1 Like

This `del _` doesn’t seem necessary. Is it?

You’re the one who requested “Should not hold more than 1 iterator element in memory at a time” :slight_smile:

Ah ok.

Don’t they get dereferenced (as the new value is assigned) on each loop iteration anyway?

I’m not sure an explicit del does it any more efficiently.

As long as a container holding many items at the same time is not kept around, all is good :slight_smile:

Yes, but before the assignment, the old and the new element are both held in memory. And in that solution, they’re not even adjacent elements but up to 100 apart.
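A small illustration of that point (mine, not from the thread): Noisy and gen below are made-up names; the objects report when they are collected, and on CPython "freed 0" only appears after "yielding 1", i.e. the first element stays alive until the loop variable is rebound to the second.

class Noisy:
    def __init__(self, i):
        self.i = i
    def __del__(self):
        print(f"freed {self.i}")

def gen():
    for i in range(3):
        print(f"yielding {i}")
        yield Noisy(i)

for _ in gen():
    pass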

Ok then:

def ilen_s6(a):
    return sum(map(len, batched(a, many - 1)))

Just plug in your definition of “many” :smiley:

But in this case, doesn’t it also apply to a simple loop?

total = 0
for _ in iterable:
    total += 1

These two are just 1 apart instead of 100 apart, but the situation is exactly the same. Or am I missing something?

In CPython, yes, but the current binding of _ keeps its referent alive until _ is rebound by the next loop value. So, after it gets started, two objects from the iterable can be kept alive.

A subtlety is that this can actually have a major effect on speed, although rarely so. For a very few iterables, like itertools.combinations(), the C code reuses the “immutable” result tuple if its reference count is 1. This saves the expense of allocating a new tuple object, which, while cheap, has significant cost compared to the typically very little computation it takes to advance to the next combination.

For example, on my box, this:

for x in combinations(range(100), 95):
    pass

takes nearly 3 times longer than

for x in combinations(range(100), 95):
    del x

In the first loop, a new tuple with 95 elements has to be completely filled in on each iteration. In the second loop, the very same 95-element tuple object is mutated in-place, and typically only needs to replace the final element. The latter is “safe” because the implementation knows it owns the only reference. The client can’t see the difference, unless it’s looking at the tuple’s memory address (id()).
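That implementation detail can be peeked at directly (a small demo of mine; CPython-specific behaviour, not guaranteed):

from itertools import combinations

it = combinations(range(10), 3)
i1 = id(next(it))   # after this statement nothing else references the yielded tuple
i2 = id(next(it))
print(i1 == i2)     # typically True on CPython: the same tuple object was reused

a = next(it)
b = next(it)        # `a` still holds the previous tuple, so a fresh one is allocated
print(a is b)       # False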

Should you worry about that? Up to you. I would not :wink:.

3 Likes