Add batching function to itertools module

Note that you likely have more-itertools already: setuptools vendors it, and setuptools is bundled via the standard library’s ensurepip (setuptools/setuptools/_vendor/more_itertools at main · pypa/setuptools · GitHub).

Of course, it would be better to fix the approval process instead, so that it is not so difficult to get approvals, at least for popular packages that lots of other popular packages depend on (Packages that depend on more-itertools on PyPI - Libraries.io).

1 Like

Yes, I understand. Maybe a simpler way to word your proposal is to simply say that you want to promote chunked from More-Itertools to Itertools.

Are you sure you want that though? There are significant benefits to the algorithms living in More-Itertools, namely that they can iterate faster if bugs are found or new exciting features are requested. Imagine having to check the Python version before using grouper or whatever.

1 Like

I agree that moving something into the stdlib doesn’t always make sense, but I believe it does make sense here. The blame view of more-itertools on GitHub shows that chunked() has been touched only once (excluding a commit related solely to static typing) in the last five years, specifically to add the strict argument, and even that was two years ago. I’d say that pretty convincingly shows that it’s more than stable enough to safely live in the stdlib.

It just has to make sense to the core developer who is signing up to own it for the next decade or so. :wink:

The fact that it’s a recipe suggests at least Raymond Hettinger already views it as something not to include in itertools outright.

1 Like

Seems similar to groupby, so I’d expect it to likewise yield iterators instead of lists like your sample implementation.

I think this is a reasonable suggestion and will put it on my list to add to CPython 3.12. Prior to that, I can add a recipe to the docs, perhaps something like:

from itertools import islice

def batched(iterable, n):
    # Yield successive lists of up to n items; the final list may be shorter.
    it = iter(iterable)
    while (batch := list(islice(it, n))):
        yield batch
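
For example (just an illustrative run of the recipe, not part of the proposal):

>>> list(batched('ABCDEFG', 3))
[['A', 'B', 'C'], ['D', 'E', 'F'], ['G']]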

Will put some thought into the API:

  • Possible names are “chunked”, “batch”, and “batched”. The word “chunked” is familiar because of its use in HTTP headers and in more-itertools (though some of that module’s name choices are questionable). “Batch” is nice because it is defined as “a quantity or consignment of goods produced at one time”. The name “batched” is more consistent with “sorted” and “reversed”.

  • Possible chunk data types are tuple, list, or iterator. A tuple makes more sense when the input size is an exact multiple of the chunk size. Except for the case of *args, we mostly don’t use tuples for variable-length outputs. A list better communicates possible variable width, and it is convenient when the result list needs to be altered. An iterator is a better fit with the theme of the module, but as itertools.groupby() has shown, it is awkward to work with. Also, the data will already be in memory, so using an iterator wouldn’t save memory even if the chunk size is huge.

  • There may be a use case for a zip-like argument strict=True to enforce even sized outputs when that is what is expected.

  • The norm is for itertools to terminate when the input iterable is exhausted. However, a user may need to force the batched iterator instance to flush a partial batch. We may need to support batcher_instance.throw(Flush) so they have a way to get their data without waiting on a potentially infinite input iterator to terminate.

  • We need to limit n to ints or objects with __index__. The value must be greater than zero. Also, we should think about what happens when a user inevitably sets n to an unreasonably large size and posts a bug report saying, “it hangs”.

  • The argument order can be (iterable, n) or (n, iterable). The former matches the argument order for islice and for the combinatoric itertools. We have some experience with the latter in the case of heapq.nsmallest and it does have some advantages. For example, it works well with partial function evaluation: dozens = partial(batched, 12). Also, it reads better when the iterable argument is a large expression: batched((expr(var) for var in someiterable if somecondition(var)), 64). In that example, the 64 dangles too far away from the batched call that it modifies. We’ve seen this problem in the wild with random.sample for example. Making n a keyword-only argument might help.
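
As a small illustration of the interaction with partial (reusing the batched() recipe above; the (n, iterable) ordering shown in the comment is only hypothetical):

from functools import partial

# With the (iterable, n) order, n can still be pre-bound by keyword:
dozens = partial(batched, n=12)
print(next(dozens(range(100))))    # first batch: [0, 1, ..., 11]

# With a hypothetical (n, iterable) order, the positional form would read:
#     dozens = partial(batched, 12)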

My $0.02:

I prefer “batch” or “batched”, not partial to either. Whenever I need this, batch is the first word I think of. I could live with “chunked” though.

Prefer a list, as usually I want to access batch elements directly (consider a data file with multiple choice questions and answers where the lines are a cycle of question-choice-choice-choice-choice-correct answer-repeat) and/or mutate the list. Tuples are okay for the former but need cast to a list for the latter, and iterators need cast to a list either way.

Strong +1. In the multiple choice question file example, it’s an error to have an incomplete question at the end, so I want to get an exception. I’m sure there are many other cases where this would be needed or else every iteration needs to check the length of the batch.
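
A minimal sketch of what that check could look like, as a hypothetical wrapper over the batched() recipe above (batched_strict is an illustrative name, not a proposed API):

def batched_strict(iterable, n):
    # Raise instead of yielding an incomplete final batch.
    for batch in batched(iterable, n):
        if len(batch) != n:
            raise ValueError('incomplete final batch')
        yield batch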

+0. I have no use for this myself but it feels like it would be useful to others. Not quite sure how it would be used though.

“Don’t do that.” Principle of consenting adults applies. A cautionary note in the docs would not be unreasonable, but -1 on any actual code to defend against it.

(Come to think about this, perhaps this is a use case for the flush feature?)

Slight preference for n first, but it’s a weak opinion. I do like the interaction n first has with partial.

This is a bit of a wild idea that I simultaneously really like and am not sure about. It keeps consistent argument order within itertools and works with partial, but I wonder about whether forcing the keyword creates an obstacle to usability since people (especially itertools users) aren’t used to that.

What makes you say that? I don’t think it has to be in memory, so it could save memory.

Possible iterator version:

from itertools import chain, islice

def batched(iterable, n):
    it = iter(iterable)
    for first in it:
        # Lazily chain the first item with up to n-1 more from the same iterator.
        batch = chain((first,), islice(it, n-1))
        yield batch
        # Advance past anything the consumer left unconsumed,
        # so the next batch starts in the right place.
        next(islice(batch, n, n), None)
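
Consuming each batch before requesting the next (an illustrative run):

>>> for batch in batched(range(10), 4):
...     print(list(batch))
[0, 1, 2, 3]
[4, 5, 6, 7]
[8, 9]
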
1 Like

An extremely common case with cloud databases is a batch operation (usually batch put or get), which are usually limited to a certain number of operations. The final batch is the remaining operations to perform.

On this use-case though, it’s possible for certain operations in the batch to fail, so those operations need to be added back to the input to be retried. This means it would be convenient if batched can pick up changes to the input.
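
For what it’s worth, the list-based recipe above can already pick up such changes when the input is an iterator over a mutable container that failed items are appended back onto (the “failure” below is just pretend):

pending = [1, 2, 3, 4, 5]
for batch in batched(iter(pending), 2):   # batched() as in the list-based recipe above
    print(batch)
    if batch == [3, 4]:                   # pretend these two operations failed
        pending.extend([3, 4])            # re-queue them; later batches pick them up
# prints [1, 2], [3, 4], [5, 3], [4]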


Does n mean number of batches or batch size? Perhaps a more descriptive keyword could be used.

This is a good point and an argument against allowing it to be a positional arg. I am now +0.5 on making it a keyword-only parameter named “size”.

I think this is a reasonable suggestion and will put it on my list to
add to CPython 3.12. Prior to that, I can add a recipe to the docs,
perhaps something like:

def batched(iterable, n):
   it = iter(iterable)
   while (batch := list(islice(it, n))):
       yield batch

+1 Simple and broadly applicable.

Will put some thought into the API:

  • Possible names are “chunked”, “batch”, and “batched”. The word “chunked” is familiar because of its use in HTTP headers and in more-itertools (though some of that module’s name choices are questionable). “Batch” is nice because it is defined as “a quantity or consignment of goods produced at one time”. The name “batched” is more consistent with “sorted” and “reversed”.

I like “batched”. I’m -1 on “chunked”; I associate “chunked” with chunks
of bytes, probably from personal use, but the HTTP use is also
effectively chunks of bytes.

  • Possible chunk data types are tuple, list, or iterator. A tuple makes more sense when the input size is an exact multiple of the chunk size. Except for the case of *args, we mostly don’t use tuples for variable-length outputs. A list better communicates possible variable width, and it is convenient when the result list needs to be altered. An iterator is a better fit with the theme of the module, but as itertools.groupby() has shown, it is awkward to work with. Also, the data will already be in memory, so using an iterator wouldn’t save memory even if the chunk size is huge.

A list seems most flexible to me - the user can use them almost
anywhere, whereas a tuple needs converting to a list for many purposes.
Are there performance or other costs to a list?

  • There may be a use case for a zip-like argument strict=True to enforce even sized outputs when that is what is expected.

No opinion, but +1 if it is very cheap.

  • We need to limit n to ints or objects with __index__. The value
    must be greater than zero. Also, we should think about what happens
    when a user inevitably sets n to an unreasonably large size and posts
    a bug report saying, “it hangs”.

What does islice() do for “unreasonably large size”? Do you really
mean “hangs”, or do you mean “user misinterprets expensive and slow as
hangs”?

  • The argument order can be (iterable, n) or (n, iterable). The former matches the argument order for islice and for the combinatoric itertools. We have some experience with the latter in the case of heapq.nsmallest and it does have some advantages. For example, it works well with partial function evaluation: dozens = partial(batched, 12). Also, it reads better when the iterable argument is a large expression: batched((expr(var) for var in someiterable if somecondition(var)), 64). In that example, the 64 dangles too far away from the batched call that it modifies. We’ve seen this problem in the wild with random.sample for example. Making n a keyword-only argument might help.

I’d be inclined to match islice() for consistency.

My own recent exercise had the iterator first, but it also had a default
for the batch size, and batch_size was a keyword-only argument.

I think a keyword-only argument is overkill for something this small,
especially as there’s no default; it makes things pretty verbose. It
isn’t like the user can get the arguments wrong; an int isn’t
iterable, and iterables don’t look like ints. Instant TypeError or
correct behaviour.

Cheers,
Cameron Simpson cs@cskk.id.au

2 Likes

Testing @rhettinger’s version with lists and mine with iterators, on an iterator of 10000 values each 1 MB large, using batch size n=1000:

 18.82 seconds  batched_as_lists
  0.47 seconds  batched_as_iterators
2000.1 MB peak  batched_as_lists
   2.0 MB peak  batched_as_iterators
Code

Try it online!

import tracemalloc
from timeit import default_timer as time
from itertools import islice, chain

def big_objects():
    for _ in range(10**4):
        yield '.' * 10**6

def batched_as_lists(iterable, n):
    it = iter(iterable)
    while (batch := list(islice(it, n))):
        yield batch

def batched_as_iterators(iterable, n):
    it = iter(iterable)
    for first in it:
        batch = chain((first,), islice(it, n-1))
        yield batch
        next(islice(batch, n, n), None)

funcs = batched_as_lists, batched_as_iterators

# Small demo for correctness
for f in funcs:
    print(*map(list, f(range(10), 4)))

# Speed
for f in funcs:
    t = time()
    for _ in f(big_objects(), 1000):
        pass
    print(f'{time()-t:6.2f} seconds  {f.__name__}')

# Memory
for f in funcs:
    tracemalloc.start()
    for _ in f(big_objects(), 1000):
        pass
    print(f'{tracemalloc.get_traced_memory()[1] / 1e6:6.1f} MB peak  {f.__name__}')
    tracemalloc.stop()
    del _

Unfortunately it’s broken:

>>> batches = list(batched(range(25), 4))
>>> [list(b) for b in batches]
[[], [], [], [], [], [], []]

Given the sequential nature of Python’s iterators, it doesn’t seem possible to skip ahead without collecting all intermediate items first.

It’s not broken. You’re just not using it right. It works like groupby. With which you get the same result if you don’t use it right:

>>> groups = list(groupby(range(25), lambda i: i//4))
>>> [list(g[1]) for g in groups]
[[], [], [], [], [], [], []]
1 Like

This sort of suggests that it is easy to misuse. I accept that working with iterators always requires care about when they’re consumed (and, conversely, when they haven’t been consumed). But that means iterators often make the code using them a little fragile (to misuse of the iterator), particularly when you’ve got multiple iterators whose correctness depends on the order in which they’re used.

My (entirely personal) expectation of a “batch” is that it is a prefilled sequence, such as a list.

I can see that your iterator based approach hands out iterators quite quickly and defers the creation of the big_objects until the end user consumption. But is this really a win if the iterators are actually consumed? And doesn’t the first iterator need to be consumed before using the second?

Cheers,
Cameron

3 Likes

Well, it’s called itertools, not stortools, and in my opinion it’s for iteration, not for storage. If you iterate over something and don’t keep it, it’s naturally gone.

I think you might have misunderstood my code. Note this:

        next(islice(batch, n, n), None)

That does consume that previous batch and thus does lead to the creation of all the big_objects. My iterator version does all the work that the list version does, except storing the objects in lists.

You can add an inner loop to my benchmark if you want the consumption to be done by the outside consumer (which is the realistic thing to do) but it doesn’t noticeably affect the times (which is why I didn’t bother).

    for batch in f(big_objects(), 1000):
        for element in batch:
            pass

Ah… Right, that makes sense. I think I need to go back and reread the
misuse example then…

Thanks,
Cameron Simpson cs@cskk.id.au

I believe it’s very much open for debate whether:

  • your suggested implementation is “correct” and my usage is a misuse because someone shouldn’t be allowed to cherry pick items from various batches out of order,
  • or my usage is legitimate and the implementation is insufficient to support this usage.

I believe a very valid use case for batches is to distribute work across threads/processes, hence it might be of practical interest to allow the distinct batches to be iterable separately without invalidating one another.

Of course it is a choice whether to support this functionality or not. I just wanted to point out that the naive implementation can lead to unexpected surprises if someone is not careful enough.
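
For what it’s worth, out-of-order consumption is already safe once each batch is materialized, e.g. with the list-based variant from the benchmark above (this only illustrates the use case, it is not a proposed API):

batches = list(batched_as_lists(range(25), 4))
# Each batch is an independent list, so separate workers can take them
# in any order without invalidating one another.
print(batches[3], batches[0])   # [12, 13, 14, 15] [0, 1, 2, 3]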

Optional: comparison with Rust

For a comparison with Rust, see this demonstration of the possibility to process batches out of order. The example is based on Itertools::chunks, which explicitly says in its documentation:

Return an iterable that can chunk the iterator.
Yield subiterators (chunks) that each yield a fixed number elements, determined by size. […]
[…] it only buffers if several chunk iterators are alive at the same time.

It may be interesting to provide something similar for Python.

Yes, we’re still talking about what it should do. You just can’t call it “broken” because it doesn’t do what you would like it to do. It works fine if you use it as intended.

If your use case needs lists, simply call list() on them. No need to force lists or other “buffering” on everybody.

Not sure why you call it “naive”. It’s very intentionally like groupby, as their tasks are so similar.

3 Likes

Earlier, Raymond wrote:

“An iterator is a better fit with the theme of the module, but as itertools.groupby() has shown, it is awkward to work with. Also, the data will already be in memory, so using an iterator wouldn’t save memory even if the chunk size is huge.”

I think that groupby’s returning of subiterators is a mistake we should not repeat.

We might just as easily say that if your use-case requires iterators, simply call iter() on the batched items.