Add batching function to itertools module

Note that you likely have more-itertools already: setuptools vendors it, and setuptools is bundled via the standard library’s ensurepip (setuptools/setuptools/_vendor/more_itertools at main · pypa/setuptools · GitHub).

Of course, it would be better to fix the approval process instead, so that it is not so difficult to get approvals, at least for popular packages that lots of other popular packages depend on (Packages that depend on more-itertools on PyPI - Libraries.io).

1 Like

Yes, I understand. Maybe a simpler way to word your proposal is to simply say that you want to promote chunked from More-Itertools to Itertools.

Are you sure you want that though? There are significant benefits to the algorithms living in More-Itertools, namely that they can iterate faster if bugs are found or new exciting features are requested. Imagine having to check the Python version before using grouper or whatever.

1 Like

I agree that moving something into the stdlib doesn’t always make sense, but I believe it does make sense here. The blame view of more-itertools on GitHub shows that chunked() has been touched only once (excluding a commit related solely to static typing) in the last five years, specifically to add the strict argument, and even that was two years ago. I’d say that pretty convincingly shows that it’s more than stable enough to safely live in the stdlib.

It just has to make sense to the core developer who is signing up to own it for the next decade or so. :wink:

The fact that it’s a recipe suggests at least Raymond Hettinger already views it as something not to include in itertools outright.

1 Like

Seems similar to groupby, so I’d expect it to likewise yield iterators instead of lists like your sample implementation.

I think this is a reasonable suggestion and will put it on my list to add to CPython 3.12. Prior to that, I can add a recipe to the docs, perhaps something like:

from itertools import islice

def batched(iterable, n):
    # Yield successive lists of up to n items; the final list may be shorter.
    it = iter(iterable)
    while (batch := list(islice(it, n))):
        yield batch
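
For example (just an illustrative run of the recipe, not part of the proposal):

>>> list(batched('ABCDEFG', 3))
[['A', 'B', 'C'], ['D', 'E', 'F'], ['G']]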

Will put some thought into the API:

  • Possible names are “chunked”, “batch”, and “batched”. The word “chunked” is familiar because of its use in HTTP headers and in more-itertools (though some of that module’s name choices are questionable). “Batch” is nice because it is defined as “a quantity or consignment of goods produced at one time”. The name “batched” is more consistent with “sorted” and “reversed”.

  • Possible chunk data types are tuple, list, or iterator. A tuple makes more sense when the input size is an exact multiple of the chunk size. Except for the case of *args, we mostly don’t use tuples for variable-length outputs. A list better communicates possible variable width, and it is convenient when the result list needs to be altered. An iterator is a better fit with the theme of the module, but as itertools.groupby() has shown, it is awkward to work with. Also, the data will already be in memory, so using an iterator wouldn’t save memory even if the chunk size is huge.

  • There may be a use case for a zip-like argument strict=True to enforce even sized outputs when that is what is expected.

  • The norm is for itertools to terminate when the input iterable is exhausted. However, a user may need to force the batched iterator instance to flush a partial batch. We may need to support batcher_instance.throw(Flush) so they have a way to get their data without waiting on a potentially infinite input iterator to terminate.

  • We need to limit n to ints or objects with __index__. The value must be greater than zero. Also, we should think about what happens when a user inevitably sets n to an unreasonably large size and posts a bug report saying, “it hangs”.

  • The argument order can be (iterable, n) or (n, iterable). The former matches the argument order for islice and for the combinatoric itertools. We have some experience with the latter in the case of heapq.nsmallest and it does have some advantages. For example, it works well with partial function evaluation: dozens = partial(batched, 12). Also, it reads better when the iterable argument is a large expression: batched((expr(var) for var in someiterable if somecondition(var)), 64). In that example, the 64 dangles too far away from the batched call that it modifies. We’ve seen this problem in the wild with random.sample for example. Making n a keyword-only argument might help.
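
As a small illustration of the interaction with partial (reusing the batched() recipe above; the (n, iterable) ordering shown in the comment is only hypothetical):

from functools import partial

# With the (iterable, n) order, n can still be pre-bound by keyword:
dozens = partial(batched, n=12)
print(next(dozens(range(100))))    # first batch: [0, 1, ..., 11]

# With a hypothetical (n, iterable) order, the positional form would read:
#     dozens = partial(batched, 12)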

My $0.02:

I prefer “batch” or “batched”, not partial to either. Whenever I need this, batch is the first word I think of. I could live with “chunked” though.

Prefer a list, as usually I want to access batch elements directly (consider a data file with multiple choice questions and answers where the lines are a cycle of question-choice-choice-choice-choice-correct answer-repeat) and/or mutate the list. Tuples are okay for the former but need cast to a list for the latter, and iterators need cast to a list either way.

Strong +1. In the multiple choice question file example, it’s an error to have an incomplete question at the end, so I want to get an exception. I’m sure there are many other cases where this would be needed or else every iteration needs to check the length of the batch.
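
A minimal sketch of what that check could look like, as a hypothetical wrapper over the batched() recipe above (batched_strict is an illustrative name, not a proposed API):

def batched_strict(iterable, n):
    # Raise instead of yielding an incomplete final batch.
    for batch in batched(iterable, n):
        if len(batch) != n:
            raise ValueError('incomplete final batch')
        yield batch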

+0. I have no use for this myself but it feels like it would be useful to others. Not quite sure how it would be used though.

“Don’t do that.” Principle of consenting adults applies. A cautionary note in the docs would not be unreasonable, but -1 on any actual code to defend against it.

(Come to think about this, perhaps this is a use case for the flush feature?)

Slight preference for n first, but it’s a weak opinion. I do like the interaction n first has with partial.

This is a bit of a wild idea that I simultaneously really like and am not sure about. It keeps consistent argument order within itertools and works with partial, but I wonder about whether forcing the keyword creates an obstacle to usability since people (especially itertools users) aren’t used to that.

What makes you say that? I don’t think it has to be in memory, so it could save memory.

Possible iterator version:

from itertools import chain, islice

def batched(iterable, n):
    it = iter(iterable)
    for first in it:
        # Lazily chain the first item with up to n-1 more from the same iterator.
        batch = chain((first,), islice(it, n-1))
        yield batch
        # Advance past anything the consumer left unconsumed,
        # so the next batch starts in the right place.
        next(islice(batch, n, n), None)
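
Consuming each batch before requesting the next (an illustrative run):

>>> for batch in batched(range(10), 4):
...     print(list(batch))
[0, 1, 2, 3]
[4, 5, 6, 7]
[8, 9]
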
1 Like

An extremely common case with cloud databases is a batch operation (usually batch put or get), which are usually limited to a certain number of operations. The final batch is the remaining operations to perform.

On this use-case though, it’s possible for certain operations in the batch to fail, so those operations need to be added back to the input to be retried. This means it would be convenient if batched can pick up changes to the input.
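
For what it’s worth, the list-based recipe above can already pick up such changes when the input is an iterator over a mutable container that failed items are appended back onto (the “failure” below is just pretend):

pending = [1, 2, 3, 4, 5]
for batch in batched(iter(pending), 2):   # batched() as in the list-based recipe above
    print(batch)
    if batch == [3, 4]:                   # pretend these two operations failed
        pending.extend([3, 4])            # re-queue them; later batches pick them up
# prints [1, 2], [3, 4], [5, 3], [4]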


Does n mean number of batches or batch size? Perhaps a more descriptive keyword could be used.

This is a good point and an argument against allowing it to be a positional arg. I am now +0.5 on making it a keyword-only parameter named “size”.

I think this is a reasonable suggestion and will put it on my list to
add to CPython 3.12. Prior to that, I can add a recipe to the docs,
perhaps something like:

def batched(iterable, n):
   it = iter(iterable)
   while (batch := list(islice(it, n))):
       yield batch

+1 Simple and broadly applicable.

Will put some thought into the API:

  • Possible names are “chunked”, “batch”, and “batched”. The word “chunked” is familiar because of its use in HTTP headers and in more-itertools (though some of that module’s name choices are questionable). “Batch” is nice because it is defined as “a quantity or consignment of goods produced at one time”. The name “batched” is more consistent with “sorted” and “reversed”.

I like “batched”. I’m -1 on “chunked”; I associate “chunked” with chunks
of bytes, probably from personal use, but the HTTP use is also
effectively chunks of bytes.

  • Possible chunk data types are tuple, list, or iterator. A tuple makes more sense when the input size is an exact multiple of the chunk size. Except for the case of *args, we mostly don’t use tuples for variable-length outputs. A list better communicates possible variable width, and it is convenient when the result list needs to be altered. An iterator is a better fit with the theme of the module, but as itertools.groupby() has shown, it is awkward to work with. Also, the data will already be in memory, so using an iterator wouldn’t save memory even if the chunk size is huge.

A list seems most flexible to me - the user can use them almost
anywhere, whereas a tuple needs converting to a list for many purposes.
Are there performance or other costs to a list?

  • There may be a use case for a zip-like argument strict=True to enforce even sized outputs when that is what is expected.

No opinion, but +1 if it is very cheap.

  • We need to limit n to ints or objects with __index__. The value
    must be greater than zero. Also, we should think about what happens
    when a user inevitably sets n to an unreasonably large size and posts
    a bug report saying, “it hangs”.

What does islice() do for “unreasonably large size”? Do you really
mean “hangs”, or do you mean “user misinterprets expensive and slow as
hangs”?

  • The argument order can be (iterable, n) or (n, iterable). The former matches the argument order for islice and for the combinatoric itertools. We have some experience with the latter in the case of heapq.nsmallest and it does have some advantages. For example, it works well with partial function evaluation: dozens = partial(batched, 12). Also, it reads better when the iterable argument is a large expression: batched((expr(var) for var in someiterable if somecondition(var)), 64). In that example, the 64 dangles too far away from the batched call that it modifies. We’ve seen this problem in the wild with random.sample for example. Making n a keyword-only argument might help.

I’d be inclined to match islice() for consistency.

My own recent exercise had the iterator first, but it also had a default
for the batch size, and batch_size was a keyword-only argument.

I think a keyword-only argument is overkill for something this small,
especially as there’s no default; it makes things pretty verbose. It
isn’t like the user can get the arguments wrong; an int isn’t
iterable, and iterables don’t look like ints. Instant TypeError or
correct behaviour.

Cheers,
Cameron Simpson cs@cskk.id.au

2 Likes

Testing @rhettinger’s version with lists and mine with iterators, on an iterator of 10000 values each 1 MB large, using batch size n=1000:

 18.82 seconds  batched_as_lists
  0.47 seconds  batched_as_iterators
2000.1 MB peak  batched_as_lists
   2.0 MB peak  batched_as_iterators
Code

Try it online!

import tracemalloc
from timeit import default_timer as time
from itertools import islice, chain

def big_objects():
    for _ in range(10**4):
        yield '.' * 10**6

def batched_as_lists(iterable, n):
    it = iter(iterable)
    while (batch := list(islice(it, n))):
        yield batch

def batched_as_iterators(iterable, n):
    it = iter(iterable)
    for first in it:
        batch = chain((first,), islice(it, n-1))
        yield batch
        next(islice(batch, n, n), None)

funcs = batched_as_lists, batched_as_iterators

# Small demo for correctness
for f in funcs:
    print(*map(list, f(range(10), 4)))

# Speed
for f in funcs:
    t = time()
    for _ in f(big_objects(), 1000):
        pass
    print(f'{time()-t:6.2f} seconds  {f.__name__}')

# Memory
for f in funcs:
    tracemalloc.start()
    for _ in f(big_objects(), 1000):
        pass
    print(f'{tracemalloc.get_traced_memory()[1] / 1e6:6.1f} MB peak  {f.__name__}')
    tracemalloc.stop()
    del _

Unfortunately it’s broken:

>>> batches = list(batched(range(25), 4))
>>> [list(b) for b in batches]
[[], [], [], [], [], [], []]

Given the sequential nature of Python’s iterators, it doesn’t seem possible to skip ahead without collecting all intermediate items first.

It’s not broken. You’re just not using it right. It works like groupby. With which you get the same result if you don’t use it right:

>>> groups = list(groupby(range(25), lambda i: i//4))
>>> [list(g[1]) for g in groups]
[[], [], [], [], [], [], []]
1 Like

This sort of suggests that it is easy to misuse. I accept that working with iterators always requires care about when they’re consumed (and, conversely, when they haven’t been consumed). But that means iterators often make the code using them a little fragile (to misuse of the iterator), particularly when you’ve got multiple iterators whose correctness depends on the order in which they’re used.

My (entirely personal) expectation of a “batch” is that it is a prefilled sequence, such as a list.

I can see that your iterator based approach hands out iterators quite quickly and defers the creation of the big_objects until the end user consumption. But is this really a win if the iterators are actually consumed? And doesn’t the first iterator need to be consumed before using the second?

Cheers,
Cameron

3 Likes

Well, it’s called itertools, not stortools, and in my opinion it’s for iteration, not for storage. If you iterate over something and don’t keep it, it’s naturally gone.

I think you might have misunderstood my code. Note this:

        next(islice(batch, n, n), None)

That does consume that previous batch and thus does lead to the creation of all the big_objects. My iterator version does all the work that the list version does, except storing the objects in lists.

You can add an inner loop to my benchmark if you want the consumption to be done by the outside consumer (which is the realistic thing to do) but it doesn’t noticeably affect the times (which is why I didn’t bother).

    for batch in f(big_objects(), 1000):
        for element in batch:
            pass

Ah… Right, that makes sense. I think I need to go back and reread the
misuse example then…

Thanks,
Cameron Simpson cs@cskk.id.au

I believe it’s very much open for debate whether:

  • your suggested implementation is “correct” and my usage is a misuse because someone shouldn’t be allowed to cherry pick items from various batches out of order,
  • or my usage is legitimate and the implementation is insufficient to support this usage.

I believe a very valid use case for batches is to distribute work across threads/processes, hence it might be of practical interest to allow the distinct batches to be iterable separately without invalidating one another.

Of course it is a choice whether to support this functionality or not. I just wanted to point out that the naive implementation can lead to unexpected surprises if someone is not careful enough.
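
For what it’s worth, out-of-order consumption is already safe once each batch is materialized, e.g. with the list-based variant from the benchmark above (this only illustrates the use case, it is not a proposed API):

batches = list(batched_as_lists(range(25), 4))
# Each batch is an independent list, so separate workers can take them
# in any order without invalidating one another.
print(batches[3], batches[0])   # [12, 13, 14, 15] [0, 1, 2, 3]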

Optional: comparison with Rust

For a comparison with Rust, see this demonstration of the possibility to process batches out of order. The example is based on Itertools::chunks, which explicitly says in its documentation:

Return an iterable that can chunk the iterator.
Yield subiterators (chunks) that each yield a fixed number elements, determined by size. […]
[…] it only buffers if several chunk iterators are alive at the same time.

It may be interesting to provide something similar for Python.

Yes, we’re still talking about what it should do. You just can’t call it “broken” because it doesn’t do what you would like it to do. It works fine if you use it as intended.

If your use case needs lists, simply call list() on them. No need to force lists or other “buffering” on everybody.

Not sure why you call it “naive”. It’s very intentionally like groupby, as their tasks are so similar.

3 Likes

Earlier, Raymond wrote:

“An iterator is a better fit with the theme of the module, but as itertools.groupby() has shown, it is awkward to work with. Also, the data will already be in memory, so using an iterator wouldn’t save memory even if the chunk size is huge.”

I think that groupby’s returning of subiterators is a mistake we should not repeat.

We might just as easily say that if your use-case requires iterators, simply call iter() on the batched items.