Add batching function to itertools module

This is a good point and an argument against allowing it to be a positional arg. I am now +0.5 on making it a keyword-only parameter named “size”.

I think this is a reasonable suggestion and will put it on my list to
add to CPython 3.12. Prior to that, I can add a recipe to the docs,
perhaps something like:

from itertools import islice

def batched(iterable, n):
    it = iter(iterable)
    while (batch := list(islice(it, n))):
        yield batch
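
For example, a short final batch is kept rather than dropped:

>>> list(batched('ABCDEFG', 3))
[['A', 'B', 'C'], ['D', 'E', 'F'], ['G']]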

+1 Simple and broadly applicable.

Will put some thought into the API:

  • Possible names are “chunked”, “batch”, and “batched”. The word “chunked” is familiar because of its use in HTTP headers and in `more-itertools` (though some of that module’s name choices are questionable). “Batch” is nice because it is defined as “a quantity or consignment of goods produced at one time”. The name “batched” is more consistent with “sorted” and “reversed”.

I like “batched”. I’m -1 on “chunked”; I associate “chunked” with chunks
of bytes, probably from personal use, but the HTTP use is also
effectively chunks of bytes.

  • Possible chunk data types are tuple, list, or iterator. A tuple makes more sense when the input size is an exact multiple of the chunk size. Except for the case of *args, we mostly don’t use tuples for variable length outputs. A list better communicates possible variable width and it is convenient when the result list needs to be altered. An iterator is a better fit with the theme of the module, but as itertools.groupby() has shown, it is awkward to work with. Also, the data will already be in memory, so using an iterator wouldn’t save memory even if the chunk size is huge.

A list seems most flexible to me - the user can use them almost
anywhere, whereas a tuple needs converting to a list for many purposes.
Are there performance or other costs to a list?

  • There may be a use case for a zip-like argument strict=True to enforce even sized outputs when that is what is expected.

No opinion, but +1 if it is very cheap.
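
For concreteness, here is one way such a flag could behave, sketched on top of the recipe above (the name batched_strict and the error message are invented here, not taken from this thread):

from itertools import islice

def batched_strict(iterable, n, *, strict=False):
    it = iter(iterable)
    while (batch := list(islice(it, n))):
        if strict and len(batch) != n:
            # Mirror zip(strict=True): complain about an uneven tail.
            raise ValueError('batched(): incomplete batch')
        yield batch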

  • We need to limit n to ints or objects with __index__. The value
    must be greater than zero. Also, we should think about what happens
    when a user inevitably sets n to an unreasonably large size and posts
    a bug report saying, “it hangs”.
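
For the first part of that, a small validation sketch using operator.index, which accepts any object implementing __index__ (the helper name is made up):

import operator

def _check_batch_size(n):
    n = operator.index(n)        # TypeError for non-integral types
    if n < 1:
        raise ValueError('n must be at least one')
    return n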

What does islice() do for “unreasonably large size”? Do you really
mean “hangs”, or do you mean “user misinterprets expensive and slow as
hangs”?

  • The argument order can be (iterable, n) or (n, iterable). The former matches the argument order for islice and for the combinatoric itertools. We have some experience with the latter in the case of heapq.nsmallest and it does have some advantages. For example, it works well with partial function evaluation: dozens = partial(batched, 12). Also, it reads better when the iterable argument is a large expression: batched((expr(var) for var in someiterable if somecondition(var)), 64). In that example, the 64 dangles too far away from the batched call that it modifies. We’ve seen this problem in the wild with random.sample for example. Making n a keyword-only argument might help.

I’d be inclined to match islice() for consistency.

My own recent exercise had the iterator first, but it also had a default
for the batch size, and batch_size was a keyword only argument.

I think a keyword only argument is overkill for something this small,
especially as there’s no default; it makes things pretty verbose. It
isn’t like the user can get the arguments wrong; an int isn’t
iterable, and iterables don’t look like ints. Instant TypeError or
correct behaviour.

Cheers,
Cameron Simpson cs@cskk.id.au

2 Likes

Testing @rhettinger’s with lists and mine with iterators on an iterator of 10000 values each 1 MB large, using batch size n=1000:

 18.82 seconds  batched_as_lists
  0.47 seconds  batched_as_iterators
2000.1 MB peak  batched_as_lists
   2.0 MB peak  batched_as_iterators

Try it online!

import tracemalloc
from timeit import default_timer as time
from itertools import islice, chain

def big_objects():
    for _ in range(10**4):
        yield '.' * 10**6

def batched_as_lists(iterable, n):
    it = iter(iterable)
    while (batch := list(islice(it, n))):
        yield batch

def batched_as_iterators(iterable, n):
    it = iter(iterable)
    for first in it:
        batch = chain((first,), islice(it, n-1))
        yield batch
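        # Skip whatever the consumer left unread so the next batch
        # starts at the right position in the shared iterator.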
        next(islice(batch, n, n), None)

funcs = batched_as_lists, batched_as_iterators

# Small demo for correctness
for f in funcs:
    print(*map(list, f(range(10), 4)))

# Speed
for f in funcs:
    t = time()
    for _ in f(big_objects(), 1000):
        pass
    print(f'{time()-t:6.2f} seconds  {f.__name__}')

# Memory
for f in funcs:
    tracemalloc.start()
    for _ in f(big_objects(), 1000):
        pass
    print(f'{tracemalloc.get_traced_memory()[1] / 1e6:6.1f} MB peak  {f.__name__}')
    tracemalloc.stop()
    del _

Unfortunately it’s broken:

>>> batches = list(batched(range(25), 4))
>>> [list(b) for b in batches]
[[], [], [], [], [], [], []]

Given the sequential nature of Python’s iterators, it doesn’t seem possible to skip ahead without collecting all intermediate items first.

It’s not broken. You’re just not using it right. It works like groupby. With which you get the same result if you don’t use it right:

>>> groups = list(groupby(range(25), lambda i: i//4))
>>> [list(g[1]) for g in groups]
[[], [], [], [], [], [], []]
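
For contrast, consuming each group before moving on gives the expected result (same groupby from itertools):

>>> [list(g) for k, g in groupby(range(25), lambda i: i // 4)]
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23], [24]]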
1 Like

This sort of suggests that it is easy to misuse. I accept that working with iterators always requires care about when they’re consumed (and conversely, when they haven’t been consumed). But that means iterators often make the code using them a little fragile (open to misuse of the iterator), particularly when you’ve got multiple iterators whose correctness depends on the order in which they’re used.

My (entirely personal) expectation of a “batch” is that it is a prefilled sequence, such as a list.

I can see that your iterator based approach hands out iterators quite quickly and defers the creation of the big_objects until the end user consumption. But is this really a win if the iterators are actually consumed? And doesn’t the first iterator need to be consumed before using the second?

Cheers,
Cameron

3 Likes

Well, it’s called itertools, not stortools, and in my opinion it’s for iteration, not for storage. If you iterate over something and don’t keep it, it’s naturally gone.

I think you might have misunderstood my code. Note this:

        next(islice(batch, n, n), None)

That does consume that previous batch and thus does lead to the creation of all the big_objects. My iterator version does all the work that the list version does, except storing the objects in lists.

You can add an inner loop to my benchmark if you want the consumption to be done by the outside consumer (which is the realistic thing to do) but it doesn’t noticeably affect the times (which is why I didn’t bother).

    for batch in f(big_objects(), 1000):
        for element in batch:
            pass

Ah… Right, that makes sense. I think I need to go back and reread the
misuse example then…

Thanks,
Cameron Simpson cs@cskk.id.au

I believe it’s very much open for debate whether:

  • your suggested implementation is “correct” and my usage is a misuse because someone shouldn’t be allowed to cherry pick items from various batches out of order,
  • or my usage is legitimate and the implementation is insufficient to support this usage.

I believe a very valid use case for batches is to distribute work across threads/processes, hence it might be of practical interest to allow the distinct batches to be iterable separately without invalidating one another.
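
As a rough sketch of that kind of use (assuming list batches via the recipe from earlier, so each worker’s batch stays valid regardless of consumption order; the work function is just a placeholder):

from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def batched(iterable, n):
    it = iter(iterable)
    while (batch := list(islice(it, n))):
        yield batch

def work(batch):                      # placeholder per-batch task
    return sum(batch)

with ThreadPoolExecutor() as pool:
    # Each batch is an independent list, so workers may consume them
    # in any order without invalidating one another.
    results = list(pool.map(work, batched(range(100), 10)))
print(results)   # [45, 145, 245, ...]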

Of course it is a choice whether to support this functionality or not. I just wanted to point out that the naive implementation can lead to unexpected surprises if someone is not careful enough.

Optional: comparison with Rust

For a comparison with Rust, see this demonstration of the possibility to process batches out of order. The example is based on Itertools::chunks, which explicitly says in its documentation:

Return an iterable that can chunk the iterator.
Yield subiterators (chunks) that each yield a fixed number of elements, determined by size. […]
[…] it only buffers if several chunk iterators are alive at the same time.

It may be interesting to provide something similar for Python.
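
A rough Python sketch of that idea, buffering a batch only when the caller moves on before consuming it (the names buffered_batched and _Batch are invented here, and this is not claimed to match the Rust semantics exactly):

from itertools import chain, islice

class _Batch:
    """One-shot iterator over a single batch (illustrative only)."""
    def __init__(self, items):
        self._items = items                 # draws lazily from the shared source

    def _detach(self):
        # Copy anything not yet consumed into a private list, so the shared
        # source can move on without invalidating this batch.
        self._items = iter(list(self._items))

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._items)

def buffered_batched(iterable, n):
    it = iter(iterable)
    for first in it:
        batch = _Batch(chain((first,), islice(it, n - 1)))
        yield batch
        batch._detach()                     # buffers only what was left unread

print([list(b) for b in list(buffered_batched(range(25), 4))])
# [[0, 1, 2, 3], [4, 5, 6, 7], ..., [24]] -- every batch stays valid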

Yes, we’re still talking about what it should do. You just can’t call it “broken” because it doesn’t do what you would like it to do. It works fine if you use it as intended.

If your use case needs lists, simply call list() on them. No need to force lists or other “buffering” on everybody.

Not sure why you call it “naive”. It’s very intentionally like groupby, as their tasks are so similar.

3 Likes

Earlier, Raymond wrote:

“An iterator is a better fit with the theme of the module, but as itertools.groupby() has shown, it is awkward to work with. Also, the data will already be in memory, so using an iterator wouldn’t save memory even if the chunk size is huge.”

I think that groupby’s returning of subiterators is a mistake we should not repeat.

We might just as easily say that if your use-case requires iterators, simply call iter() on the batched items.

I wonder what he meant with “awkward to work with”. I never had a problem with it, and I’ve used it often. And a while ago he wrote (emphasis mine):

The groupby() tool is about 13 years old and has served its users well.

and

I have no interest in #2. Itertools are all about not storing data in memory unless that is an intrinsic requirement for the operation (tee() and cycle() for example).

(#2 was a way to buffer group data.)

Your quote also included his “using an iterator wouldn’t save memory”. Why? Does my code/benchmark not show that that’s not true? (Maybe I misunderstand what he meant, he hasn’t replied to my clarification request yet.)

Why? I like it…

That’s a false equivalence. The point is that creating lists can come at a very expensive price, both memory and time. See my benchmark above again. Creating lists would force that price on everyone, and calling iter() on them doesn’t make that price go away. With non-saving iterators, it’s your choice to pay it if you want.

4 Likes

I concede that I was imprecise/inappropriate in saying that the iterator implementation is “broken”: it simply adheres to a different contract with its user. Nonetheless, I would still argue against that implementation. What I should have said is that the iterator version lends itself to being very easily misused. The reason is that the exposed API, which is quite liberal, coupled with the simplistic implementation, can be employed in seemingly valid ways that turn out to be surprising instead:

batches = batched(iterable, n)
b = next(batches) # the first batch
next(batches) # CAUTION: this invalidates (empties) `b`
next(b) # `b` is exhausted

In practical terms this is not good, because then a noob arrives and misuses it, and as a consequence a train derails and a couple hundred people are killed.

As a side note about “[…] it doesn’t do what you would like it to do”, I’d point out that maybe that is not “what I like to do”, but possibly “what I need to do”. A simplistic implementation is capable of treating only a subset of the interesting use cases. In the other ones, if we don’t want to call it “broken”, let’s call it “useless” then (i.e. “unusable”).


+0.5 on that. My personal take is that groupby should indeed return subiterators, but ones that are not flushed automatically when you step the main iterator forward.

As Stefan said, no. The iterator version does indeed save memory, because items are produced one by one and can be discarded right away if not stored somewhere else.

Once you have produced an intermediate list to store the batch, there is no way to undo the memory allocation by calling iter on it. In this sense the iterator version is more flexible because it leaves the choice of whether to store the entire chunk in a list or not to the user.

It is more flexible, but not as safe, because it allows an inattentive user to mess up the computation by jumping around.


The implementation I referenced from Rust’s Itertools::chunks doesn’t force buffering on everybody. As I quoted from its documentation, it clearly says that:

it only buffers if several chunk iterators are alive at the same time.

This means that, in every circumstance where your iterator version is sufficient, this other implementation is equivalent in terms of memory usage (no buffering, no extra allocation).

On the other hand, if someone happens to use various batches at the same time out of order because they like/need to jump around, then this buffered implementation allows that with no nasty surprises.

I think it’s worth noting that the popular library itertools for Rust, a language generally associated with correctness and safety, has chosen this particular implementation to decrease the chances of misuses on the part of the user (“misuses” which cannot be ruled out by the API itself, only by a docstring).

This consideration is vaguely in the spirit of the motto Make impossible states impossible.


To conclude, I would ask the following.

  • Is the buffered version generally useful, memory efficient in the standard cases, safe and efficient in the corner cases? Yes.
  • Is its implementation trivial enough that we can expect the average user to be able to code it on his own in a few lines? No.

Hence I believe it’s the perfect example of a functionality to be included in a (std)library.

The Elixir language has chunk_every that does this.
I implemented that in Python and discuss several of the corner cases in my blog entry here: Go deh!: Elixirs' chunk_every in Python

It would be good to reuse an existing name if we are to provide very similar capabilities.

Hi Raymond. I posted an old implementation of mine and a use case based on Elixir’s functionality below (#34?)
I have extra arguments to deal with edge cases - small chunks, fill-values, overlapping chunks…

I don’t think catering to “noobs” and “inattentive users” is that important.

How many people got killed because of groupby having this “issue”?

And what if an inattentive noob exhausts the memory and causes trouble with that?

I don’t think it’s useless for those cases; it still offers the batching functionality, i.e., the structural iteration.

Right, I should’ve spoken only about the eager listification and similarly eager buffering there.

I don’t think that’s true. For example, what if I just want to know the number of batches and thus do len(list(batched(range(25), 4)))? That would needlessly buffer everything.

I saw this. Thank you for the nice blog post. ISTM that Elixir’s chunk_every tries to be all things to all people. I’m aiming for a much simpler API that addresses the common cases I’ve seen with GitHub’s code search.

We want to be able to unflatten records that have been serialized:

data = ['raymond', 'red', 'rachel', 'blue', 'matthew', 'yellow']
for name, color in batched(data, 2):
    ...

And as originally proposed in this thread, we also want to be able to group data up to a specified maximum size. Handling a final odd-sized lot is essential for that task:

for page_number, page in enumerate(batched(lines, 60), start=1):
    print('\n'.join(page))
    print()
    print(page_number)

I don’t see much use for a variation that discards the odd-lot. Presumably, the input data is there for a reason and it would usually be a mistake to ignore it. We could also support padding an odd lot with a fill value, but that isn’t a common case and it isn’t hard to do with existing itertools such as zip_longest().
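
For reference, the padding variant is already covered by something along the lines of the grouper recipe in the itertools docs:

from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    # grouper('ABCDEFG', 3, fillvalue='x') -> ('A','B','C') ('D','E','F') ('G','x','x')
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)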

The problem being solved is that when people want to process data in batches, it doesn’t occur to them to call islice() in a loop. We just need to make that bridge.

Do you have an opinion on the argument order, batched(iterable, n) versus batched(n, iterable)? There is a related recipe for take(n, iterable) that puts n first. However, islice(iterable, stop) puts the limit in the second parameter. I’m unclear about what makes the most sense for batched().

1 Like

I would prefer the iterable first, because you can’t do anything without the iterable whereas you might/could have a default value for n (say, n=2).

2 Likes

With n first, you can partial the function: dozens = partial(batched, 12).
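
A runnable sketch of that point, using a hypothetical n-first variant:

from functools import partial
from itertools import islice

def batched_n_first(n, iterable):     # hypothetical (n, iterable) ordering
    it = iter(iterable)
    while (batch := list(islice(it, n))):
        yield batch

dozens = partial(batched_n_first, 12)
print([len(b) for b in dozens(range(30))])   # [12, 12, 6]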

Hmm,

  1. dropwhile, filterfalse, starmap, takewhile, have one main sequence/iterable that comes after any other arguments.
  2. accumulate, groupby, islice, tee, permutations, combinations, combinations_with_replacement, as well as, arguably, zip_longest, and product have main sequences/iterables that come before other arguments.

itertools is not consistent.

I tend to use functions from that second camp more often which I think drives my prejudice :grinning:

P.S. and the new grouper would then more closely match groupby.