But `dozens = partial(batched, n=12)` is also fine, surely? And if not, `dozens = lambda it: batched(it, 12)` would be just as good.
I don’t really have a preference regarding argument order, myself.
Maybe there should be an option to raise an exception on detection of such an “odd lot”: as in your de-serialisation of fixed-length records example, it would indicate an error. `zip` has the `strict` argument for a similar purpose.
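As a rough sketch of what such an option could look like, here is a hypothetical `batched_strict` helper (not an existing API) that raises on an incomplete final batch, much like `zip(..., strict=True)` does for unequal-length inputs:

```python
from itertools import islice

def batched_strict(iterable, n):
    """Hypothetical helper: like batched(), but raise if the last
    batch is an incomplete "odd lot" (cf. zip(..., strict=True))."""
    it = iter(iterable)
    while batch := tuple(islice(it, n)):
        if len(batch) != n:
            raise ValueError(f"incomplete batch of length {len(batch)}")
        yield batch
```

With this, `list(batched_strict(range(6), 3))` yields two full batches, while an input of length 5 raises `ValueError` instead of silently producing a short tail.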
Analyzing a Python implementation that nobody has written yet is admittedly difficult; nonetheless I have a few objections.
First of all, I don’t see why one would use such a method to count the number of batches instead of this:

```python
def num_batches(tot_len, batch_len):
    return 1 + (tot_len - 1) // batch_len
```
The `tot_len` is either known in advance or can be computed with

```python
def count(iterable):
    """Return the number of items in an iterable."""
    n = 0
    for _ in iterable:
        n += 1
    return n

count(batched(iterable, n))
```
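To check that the closed-form arithmetic agrees with actually iterating, here is a quick sanity check (using a minimal stand-in for `itertools.batched`, since the built-in only exists in Python ≥ 3.12):

```python
from itertools import islice

def batched(iterable, n):
    # minimal stand-in for itertools.batched (Python >= 3.12)
    it = iter(iterable)
    while batch := tuple(islice(it, n)):
        yield batch

def num_batches(tot_len, batch_len):
    return 1 + (tot_len - 1) // batch_len

def count(iterable):
    """Return the number of items in an iterable."""
    n = 0
    for _ in iterable:
        n += 1
    return n

# the closed-form count matches the iterated count
assert num_batches(10, 3) == count(batched(range(10), 3)) == 4
assert num_batches(12, 4) == count(batched(range(12), 4)) == 3
```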
The problem in your example

```python
len(list(batched(iterable, n)))
#   ^^^^ this is the problem
```

is that you are forced to materialize all sub-iterators in a list, because iterators generally do not implement `__len__`. This issue is addressed in the previous point, or can be worked around by implementing `__len__` on the object returned by `batched`, so that `len(batched(iterable, n))` becomes valid.
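A minimal sketch of that workaround, assuming the underlying iterable is itself sized (hypothetical `Batched` class, not an existing API):

```python
from itertools import islice

class Batched:
    """Hypothetical sketch: batched() as a class, so that len() works
    whenever the underlying iterable supports len() itself."""

    def __init__(self, iterable, n):
        self._iterable = iterable
        self._n = n

    def __iter__(self):
        it = iter(self._iterable)
        while batch := tuple(islice(it, self._n)):
            yield batch

    def __len__(self):
        # valid only when the source is sized; an empty source gives 0
        return 1 + (len(self._iterable) - 1) // self._n

assert len(Batched(range(10), 3)) == 4  # no materialization needed
```

`len()` here is pure arithmetic on the source’s length, so nothing is consumed or buffered; it simply is not available when the source is an unsized iterator.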
Now, while it’s true that this example causes buffering, I don’t see it as `batched`’s fault, but rather `list`’s fault. After all, you are collecting things in a list just to count their number: even disregarding the discussion about `batched`, this is clearly sub-optimal.
I don’t think the Rust version would cause buffering in a similar situation, if I understand it correctly, unless one unnecessarily collects the sub-iterators in a `Vec`. The analogous code would be the following:

```rust
let batches = iterator.chunks(5);
batches.count()
```
Thanks to the deterministic nature of `Drop`, at any time during the iteration that occurs implicitly in the `count` method there is a single sub-iterator in scope; hence no buffering occurs because, again, it only buffers if several chunk iterators are alive at the same time.
Achieving the same behavior in Python might be harder because of the “non-determinism” of the garbage collector, which can defer the destruction of objects beyond their last point of use; however, I’m sure there are ways to achieve the same effect in Python.
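One such way, sketched below, is the `itertools.groupby` approach: hand out sub-iterators one at a time and drain the previous one before producing the next, so at most one batch is alive regardless of when the garbage collector runs (hypothetical `batched_iters` helper, not an existing API):

```python
from itertools import islice

_SENTINEL = object()

def _chain_one(first, rest):
    yield first
    yield from rest

def batched_iters(iterable, n):
    """Hypothetical sketch: yield lazy sub-iterators instead of tuples,
    draining each one (groupby-style) before yielding the next, so only
    one batch is ever alive at a time."""
    it = iter(iterable)
    while (first := next(it, _SENTINEL)) is not _SENTINEL:
        rest = islice(it, n - 1)
        yield _chain_one(first, rest)
        for _ in rest:  # exhaust whatever the consumer left unread
            pass

# counting consumes none of the sub-iterators, yet no buffering occurs
assert sum(1 for _ in batched_iters(range(10), 3)) == 4
```

The trade-off is the same as with `groupby`: a sub-iterator collected for later becomes empty once the next one is requested, which is exactly the "at most one alive" discipline the Rust version gets for free from `Drop`.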
Let me rephrase what I said earlier to be more precise, then: in every circumstance where your iterator version is sufficient, this other implementation is equivalent in terms of memory usage (no buffering, no extra allocation), unless one unnecessarily keeps more than one sub-iterator alive.