A product function which supports large / infinite iterables

MarkCBell · November 15, 2020, 12:09am

The itertools.product function computes the Cartesian product of a collection of iterables. However the implementation works by completely consuming the input iterables and keeping pools of values in memory to generate the products. The documentation notes that means that “it only useful with finite inputs” but in practice these iterables must also be small enough to convert into tuples. This can result in a variety of issues including:

>>> from itertools import product
>>> next(product(range(1 << 30), repeat=2))
MemoryError
>>> next(product(range(1 << 31), repeat=2))
OverflowError: Python int too large to convert to C ssize_t
>>> from itertools import count
>>> next(product(count(), count())
# Hangs

I would like to suggest adding a variant of the product function to the itertools library which can handle such cases, however I suspect that this would most likely need to be a separate function. As a concrete example to consider I would propose the following implementation which takes an item from each iterable in turn and yields all the tuples that can be formed from it and all the previous items that have been seen before adding it to the store of items that have been seen.

from itertools import cycle, product, tee

def iproduct(*iterables, repeat=1):
    iterables = [item for row in zip(*(tee(iterable, repeat) for iterable in iterables)) for item in row]
    N = len(iterables)
    saved = [[] for _ in range(N)]  # All the items that we have seen of each iterable.
    exhausted = set()  # The set of indices of iterables that have been exhausted.
    for i in cycle(range(N)):
        if i in exhausted:  # Just to avoid repeatedly hitting that exception.
            continue
        
        try:
            item = next(iterables[i])
            yield from product(*saved[:i], [item], *saved[i+1:])  # Finite product.
            saved[i].append(item)
        except StopIteration:
            exhausted.add(i)
            if not saved[i] or len(exhausted) == N:  # Product is empty or all iterables exhausted.
                return
    yield ()  # There are no iterables.

This handles a wide array of different circumstances but importantly it produces the same pairs as itertools.product (up to ordering)

>>> from itertools import islice, product

>>> X = [i+1 for i in range(20)]
>>> Y = [2*i for i in range(5)]
>>> print(*islice(iproduct(X, Y), 10))
(1, 0) (2, 0) (1, 2) (2, 2) (3, 0) (3, 2) (1, 4) (2, 4) (3, 4) (4, 0)
>>> print(sorted(iproduct(X, Y)) == sorted(product(X, Y)))
True

But since it doesn’t start by building tuples of the inputs it can handle very large iterables.

>>> X = (i + 7 for i in range(2 << 100) if str(i).startswith('2'))
>>> Y = (i for i in range(10**10))
>>> print(*islice(iproduct(X, Y), 10))
(9, 0) (27, 0) (9, 1) (27, 1) (28, 0) (28, 1) (9, 2) (27, 2) (28, 2) (29, 0)

This also means that it also handles infinite iterables perfectly well.

>>> from itertools import count

>>> X = (i for i in range(20))
>>> Y = count()
>>> print(*islice(iproduct(X, Y), 10))
(0, 0) (1, 0) (0, 1) (1, 1) (2, 0) (2, 1) (0, 2) (1, 2) (2, 2) (3, 0)

>>> X = count()
>>> Y = count()
>>> print(*islice(iproduct(X, Y), 10))
(0, 0) (1, 0) (0, 1) (1, 1) (2, 0) (2, 1) (0, 2) (1, 2) (2, 2) (3, 0)

Empty iterables result in an empty product and no iterables result in the empty tuple.

>>> X = (i for i in range(0))
>>> Y = count()
>>> next(iproduct(X, Y))
StopIteration

>>> print(*iproduct())
()
>>> X = count()
>>> print(*iproduct(X, repeat=0))
()

And has repeat which works like itertools.product(..., repeat=...)

>>> X = count()
>>> print(*islice(iproduct(X, repeat=3), 10))
(0, 0, 0) (1, 0, 0) (0, 1, 0) (1, 1, 0) (0, 0, 1) (0, 1, 1) (1, 0, 1) (1, 1, 1) (2, 0, 0) (2, 0, 1)

>>> X = count()
>>> Y = 'ABC'
>>> print(*islice(iproduct(X, Y, repeat=2), 10))
(0, 'A', 0, 'A') (1, 'A', 0, 'A') (0, 'B', 0, 'A') (1, 'B', 0, 'A') (0, 'A', 1, 'A') (0, 'B', 1, 'A') (1, 'A', 1, 'A') (1, 'B', 1, 'A') (0, 'A', 0, 'B') (0, 'A', 1, 'B')

The only notable difference between this and itertools.product is that the tuples produced are in a different order. So while itertools.product creates its pairs so that if the input’s iterables are sorted then the product tuples are emitted in sorted order (lexicographic ordering), if the input iterables to iproduct are sorted then it produces its tuples sorted by the following:

>>> def grade(x):
>>>     m = max(x)
>>>     return (m, [i for i, y in enumerate(x) if y == m][-1], x)

>>> X = range(5)
>>> Y = range(10)
>>> print(list(iproduct(X, Y)) == sorted(product(X, Y), key=grade))
True

Due to this difference I don’t think that it would be appropriate for iproduct to replace itertools.product but this could be included as a separate function.

oscarbenjamin · November 15, 2020, 2:58am

I ran into this problem in sympy and added an implementation of product for infinite iterators that you can see here:

github.com

sympy/sympy/blob/bd327ec88c7fdec7791874da6c1529361daf7eb5/sympy/utilities/iterables.py#L234-L299


def _iproduct2(iterable1, iterable2):
    '''Cartesian product of two possibly infinite iterables'''
    it1 = iter(iterable1)
    it2 = iter(iterable2)
    elems1 = []
    elems2 = []
    sentinel = object()
    def append(it, elems):
        e = next(it, sentinel)
        if e is not sentinel:
            elems.append(e)
    n = 0
    append(it1, elems1)
    append(it2, elems2)
    while n <= len(elems1) + len(elems2):

This file has been truncated. show original

MarkCBell · November 15, 2020, 10:05am

Excellent, thank you for letting me know about sympy.utilities.iterables.iproduct.

I have to confess that personally I still prefer my implementation since, for example, it is not recursive and so can handle more than sys.getrecursionlimit() input iterables. However I think that this is a separate discussion / pull request.

In which case I think the important question is: is sympy is the right place for this function to live? If so then I think my proposal is simply to update the itertools.product documentation to include a reference to this since I searched and asked on multiple people for any existing implementation. For example:

Before product() runs, it completely consumes the input iterables, keeping pools of values in memory to generate the products. Accordingly, it only useful with small, finite inputs. If you need to handle large or infinite input iterables then consider sympy.utilities.itertools.iproduct.