Add batching function to itertools module

Batching of lists, arrays, iterables, etc. is a pretty common operation, so when I recently went looking for it I was surprised to find that this functionality is not already part of the itertools module! Batching feels simple enough that it wouldn’t be too difficult to implement and could be used across a wide variety of tasks, yet complex enough that it would be nice to have a solid, known reference implementation in the standard library rather than having to roll your own every time it comes up. It should handle rare/edge cases: batching an empty iterable, keeping the memory footprint and allocation efficient when the batch size is very large, batch sizes that are not a perfect divisor of the iterable’s length, etc.

I have a somewhat decent implementation in Python as a generator function; maybe it can serve as a conversation starter, as I am sure there are improvements to be made:

from collections.abc import Iterable

def batch(iterable: Iterable, max_batch_size: int):
    """Batch an iterable into lists of given maximum size, yielding them one by one."""
    group = []  # a distinct name avoids shadowing the function itself
    for element in iterable:
        group.append(element)
        if len(group) >= max_batch_size:
            yield group
            group = []
    if group:  # yield any leftover partial batch
        yield group

I assume this would also need to be re-implemented in C to fit in with the rest of the module’s code. What do people think?

I was also looking for this function recently :sweat_smile:

1 Like

This is indeed a common thing I have needed quite a bit. The docs currently provide two ways to do this, one via a convoluted “idiom” in the zip() docs and one via a recipe in the itertools module docs (Ctrl+F for “def grouper”), which is basically a more robust and readable version of the “idiom” in the zip() docs.
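For reference, the zip() idiom in question is the one-liner zip(*[iter(s)] * n), which silently drops any leftover elements:

>>> s = 'ABCDEFG'
>>> list(zip(*[iter(s)] * 3))
[('A', 'B', 'C'), ('D', 'E', 'F')]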

But I think the need for this is probably common enough that it ought to be added to the itertools module proper.

Some questions to hash out:

  • Do you specify batch size or number of batches?
  • How do you handle indivisible batch size? Does the last batch become smaller so all the others can stay the same, or do batches regularly have one less element?
  • How do users who want a different answer to the previous two questions change the behaviour?

Normally itertools stays “minimal” and only provides these useful functions as recipes in the docs.

You can use more-itertools to get those recipes and lots more. https://more-itertools.readthedocs.io/
There are lots of ways to do batching in more-itertools; the chunked function seems closest to the batch function in this post.
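For example:

>>> from more_itertools import chunked
>>> list(chunked([1, 2, 3, 4, 5, 6, 7], 3))
[[1, 2, 3], [4, 5, 6], [7]]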

See also Why does itertools have recipes?

4 Likes

Always batch size in my usages. I don’t even know how you would split into a fixed number of batches without knowing the exact number of elements in the first place (and iterators can be infinite).

In my usage, either I can guarantee that the number of elements is always a multiple of the batch size, or I’m working with an infinite iterator, or in the case that it’s not divisible, I just have a smaller final batch. The latter most often arises when processing a dataset of unknown/variable size (such as a Django queryset) and I simply want to process the data in groups that aren’t too large but without the overhead of doing it one at a time. (Real-world example: I have code at work that creates a Django queryset, then sends the results of that query via messages on a RabbitMQ message bus. We batch the results in groups of 10 records per message, to cut down on the number of messages but keep individual messages at a manageable size.)
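Schematically, that pattern looks something like this, using the batch generator from the top of the thread (publish_message is a hypothetical stand-in for the real message-bus call):

# queryset.iterator() streams rows without loading the whole result set;
# each group of 10 records becomes one message.
for records in batch(queryset.iterator(), 10):
    publish_message(records)  # hypothetical message-bus call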

As for the first question: as noted, I don’t think one of the two options is even possible, and if it were, it seems like it should be a totally different function (it’s more of a partition than a batch) that probably wouldn’t cross the threshold for inclusion in itertools proper.

The second one could be easily adjustable via a keyword argument or by having two related functions. I would imagine at minimum having an option to either raise an exception or return a partial batch if the iterator is exhausted without ending on a full batch.
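For instance, a keyword argument bolted onto the generator from the first post might look like this (a sketch; partial_ok is a made-up name):

def batch(iterable, max_batch_size, *, partial_ok=True):
    group = []
    for element in iterable:
        group.append(element)
        if len(group) >= max_batch_size:
            yield group
            group = []
    if group:  # the iterator ended on a partial batch
        if not partial_ok:
            raise ValueError('iterable length is not a multiple of max_batch_size')
        yield group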

There’s a recipe for this in the itertools module’s documentation.


from itertools import zip_longest

def grouper(iterable, n, *, incomplete='fill', fillvalue=None):
    "Collect data into non-overlapping fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, fillvalue='x') --> ABC DEF Gxx
    # grouper('ABCDEFG', 3, incomplete='strict') --> ABC DEF ValueError
    # grouper('ABCDEFG', 3, incomplete='ignore') --> ABC DEF
    args = [iter(iterable)] * n
    if incomplete == 'fill':
        return zip_longest(*args, fillvalue=fillvalue)
    if incomplete == 'strict':
        return zip(*args, strict=True)
    if incomplete == 'ignore':
        return zip(*args)
    else:
        raise ValueError('Expected fill, strict, or ignore')
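A quick check of what 'ignore' does with a partial final group:

>>> list(grouper('ABCDEFG', 3, incomplete='ignore'))
[('A', 'B', 'C'), ('D', 'E', 'F')]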

Amusingly, none of those three modes offers the behaviour being discussed in this thread: yielding a shorter final batch.

So that’s four potential behaviours…

Is the behaviour that you’re looking for available in more-itertools?

Perhaps more_itertools.chunked?

2 Likes

Yes, chunked and grouper would satisfy my use cases. But the argument, at least the one I was making, is that batching like this is common enough that a function for it deserves promotion from a recipe to being offered in itertools itself. I have needed batching more often than all other itertools recipes and more-itertools offerings combined. Moreover, I have needed it at work, where getting approval to use something like more-itertools is more of a hassle than it’s worth for a single function.

By Jonathan Goble via Discussions on Python.org at 22Sep2022 21:08:

Yes, chunked and grouper would satisfy my use cases. But the
argument, at least the one I was making, is that batching like this is
common enough that a function for it deserves promotion from a recipe
to being offered in itertools itself.

I think the argument here and in the thread about recipes is that there
are usually enough reasonable variations on how batching might be
implemented that a single function won’t do (or would need many tuning
knobs, which would have a large test matrix), and having many functions
gets… complicated.

batching more often than all other itertools recipes and more-itertools
offerings combined. Moreover, I have needed it at work, where
getting approval to use something like more-itertools is more of a
hassle than it’s worth for a single function.

If you add tests for the uses of the single function you want,
shouldn’t approval be fairly easy? I cannot speak for what happens in
your environment, but surely code reuse and PyPI should be made at
least not very difficult. I can see that it can rapidly bring in a huge
amount of “foreign” or untested/untrusted code.

Cheers,
Cameron Simpson cs@cskk.id.au

Note that you likely have more-itertools already, bundled with the standard library ensurepip, because setuptools vendors it (setuptools/setuptools/_vendor/more_itertools at main · pypa/setuptools · GitHub).

Of course it would be better if you fix the approval process instead so it is not so difficult to get approvals, at least for popular packages that lots of other popular packages depend on (Packages that depend on more-itertools on PyPI - Libraries.io).

1 Like

Yes, I understand. Maybe a simpler way to word your proposal is to say that you want to promote chunked from More-Itertools to Itertools.

Are you sure you want that though? There are significant benefits to the algorithms living in More-Itertools, namely that they can be improved faster when bugs are found or exciting new features are requested. Imagine having to check the Python version before using grouper or whatever.

1 Like

I agree that moving something into the stdlib doesn’t always make sense, but I believe it does make sense here. The blame view of more-itertools on GitHub shows that chunked() has been touched only once (excluding a commit related solely to static typing) in the last five years, specifically to add the strict argument, and even that was two years ago. I’d say that pretty convincingly shows that it’s more than stable enough to safely live in the stdlib.

It just has to make sense to the core developer who is signing up to own that something for the next decade or so. :wink:

The fact that it’s a recipe suggests at least Raymond Hettinger already views it as something not to include in itertools outright.

1 Like

Seems similar to groupby, so I’d expect it to likewise yield iterators, rather than the lists of your sample implementation.

I think this is a reasonable suggestion and will put it on my list to add to CPython 3.12. Prior to that, I can add a recipe to the docs, perhaps something like:

from itertools import islice

def batched(iterable, n):
    it = iter(iterable)
    while (batch := list(islice(it, n))):
        yield batch
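For example:

>>> list(batched('ABCDEFG', 3))
[['A', 'B', 'C'], ['D', 'E', 'F'], ['G']]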

Will put some thought into the API:

  • Possible names are “chunked”, “batch”, and “batched”. The word “chunked” is familiar because of its use in HTTP headers and in more-itertools (though some of that module’s name choices are questionable). “Batch” is nice because it is defined as “a quantity or consignment of goods produced at one time”. The name “batched” is more consistent with “sorted” and “reversed”.

  • Possible chunk data types are tuple, list, or iterator. A tuple makes more sense when the input size is an exact multiple of the chunk size. Except for the case of *args, we mostly don’t use tuples for variable length outputs. A list better communicates possible variable width, and it is convenient when the result list needs to be altered. An iterator is a better fit with the theme of the module, but as itertools.groupby() has shown, it is awkward to work with. Also, the data will already be in memory, so using an iterator wouldn’t save memory even if the chunk size is huge.

  • There may be a use case for a zip-like argument strict=True to enforce even sized outputs when that is what is expected (see the sketch after this list).

  • The norm is for itertools to terminate when the input iterable is exhausted. However, a user may need to force the batched iterator instance to flush a partial batch. We may need to support batcher_instance.throw(Flush) so they have a way to get their data without waiting on a potentially infinite input iterator to terminate.

  • We need to limit n to ints or objects with __index__. The value must be greater than zero. Also, we should think about what happens when a user inevitably sets n to an unreasonably large size and posts a bug report saying, “it hangs”.

  • The argument order can be (iterable, n) or (n, iterable). The former matches the argument order for islice and for the combinatoric itertools. We have some experience with the latter in the case of heapq.nsmallest and it does have some advantages. For example, it works well with partial function evaluation: dozens = partial(batched, 12). Also, it reads better when the iterable argument is a large expression; in batched((expr(var) for var in someiterable if somecondition(var)), 64), the 64 dangles too far away from the batched call that it modifies. We’ve seen this problem in the wild with random.sample, for example. Making n a keyword-only argument might help.
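To make the strict option (and the n validation above) concrete, here is a minimal sketch along the lines of the recipe earlier in this post; the names and error messages are provisional:

from itertools import islice

def batched(iterable, n, *, strict=False):
    if n < 1:
        raise ValueError('n must be at least one')
    it = iter(iterable)
    while group := tuple(islice(it, n)):
        if strict and len(group) != n:
            raise ValueError('batched(): incomplete batch')
        yield group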

My $0.02:

I prefer “batch” or “batched”, not partial to either. Whenever I need this, batch is the first word I think of. I could live with “chunked” though.

Prefer a list, as usually I want to access batch elements directly (consider a data file with multiple choice questions and answers where the lines are a cycle of question-choice-choice-choice-choice-correct answer-repeat) and/or mutate the list. Tuples are okay for the former but need to be cast to a list for the latter, and iterators need to be cast to a list either way.
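For example, with list (or tuple) batches, each six-line record in that file unpacks directly:

for question, *choices, answer in batched(lines, 6):
    ...  # choices is a list of the four options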

Strong +1. In the multiple choice question file example, it’s an error to have an incomplete question at the end, so I want to get an exception. I’m sure there are many other cases where this would be needed or else every iteration needs to check the length of the batch.

+0. I have no use for this myself but it feels like it would be useful to others. Not quite sure how it would be used though.

“Don’t do that.” Principle of consenting adults applies. A cautionary note in the docs would not be unreasonable, but -1 on any actual code to defend against it.

(Come to think of it, perhaps this is a use case for the flush feature?)

Slight preference for n first, but it’s a weak opinion. I do like the interaction n first has with partial.

This is a bit of a wild idea that I simultaneously really like and am not sure about. It keeps consistent argument order within itertools and works with partial, but I wonder about whether forcing the keyword creates an obstacle to usability since people (especially itertools users) aren’t used to that.

What makes you say that? I don’t think it has to be in memory, so it could save memory.

Possible iterator version:

from itertools import chain, islice

def batched(iterable, n):
    it = iter(iterable)
    for first in it:
        batch = chain((first,), islice(it, n - 1))
        yield batch
        # Drain anything the consumer left unread so the next
        # batch starts at the correct element.
        next(islice(batch, n, n), None)
1 Like

An extremely common case with cloud databases is batch operations (usually batch put or get), which are usually limited to a certain number of operations per call. The final batch is whatever operations remain to be performed.

On this use case, though, it’s possible for certain operations in the batch to fail, so those operations need to be added back to the input to be retried. This means it would be convenient if batched could pick up changes to its input.
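One way to approximate that is to drive batched with a generator over a mutable queue, so that failed operations appended back onto the queue are picked up by later batches. A sketch, where operations, batch_put, and the limit of 25 are placeholders:

from collections import deque

pending = deque(operations)

def drain(queue):
    # A live view: keeps yielding while items remain, including
    # retries appended during iteration.
    while queue:
        yield queue.popleft()

for group in batched(drain(pending), 25):
    failed = batch_put(group)  # placeholder client call returning the failed ops
    pending.extend(failed)     # re-queued; picked up by a later batch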


Does n mean number of batches or batch size? Perhaps a more descriptive keyword could be used.