But `dozens = partial(batched, n=12)` is also fine, surely? And if not, `dozens = lambda it: batched(it, 12)` would be just as good.
I don’t really have a preference regarding argument order, myself.
Maybe there should be an option to raise an exception on detection of such an “odd lot”: as in your de-serialisation of fixed-length records example, it would indicate an error. `zip` has the `strict` argument for a similar purpose.
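As a rough sketch of what such an option could look like, here is a hypothetical `batched_strict` helper (not an existing API) that raises on an incomplete final batch, much like `zip(..., strict=True)` does for unequal-length inputs:

```python
from itertools import islice

def batched_strict(iterable, n):
    """Hypothetical helper: like batched(), but raise if the last
    batch is an incomplete "odd lot" (cf. zip(..., strict=True))."""
    it = iter(iterable)
    while batch := tuple(islice(it, n)):
        if len(batch) != n:
            raise ValueError(f"incomplete batch of length {len(batch)}")
        yield batch
```

With this, `list(batched_strict(range(6), 3))` yields two full batches, while an input of length 5 raises `ValueError` instead of silently producing a short tail.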
Analyzing a Python implementation that nobody has written yet is admittedly difficult; nonetheless I have a few objections.
First of all, I don’t see why one would use such a method to count the number of batches instead of this:

```python
def num_batches(tot_len, batch_len):
    return 1 + (tot_len - 1) // batch_len
```
The `tot_len` is either known in advance or can be computed with

```python
def count(iterable):
    """Return the number of items in an iterable."""
    n = 0
    for _ in iterable:
        n += 1
    return n

count(batched(iterable, n))
```
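To check that the closed-form arithmetic agrees with actually iterating, here is a quick sanity check (using a minimal stand-in for `itertools.batched`, since the built-in only exists in Python ≥ 3.12):

```python
from itertools import islice

def batched(iterable, n):
    # minimal stand-in for itertools.batched (Python >= 3.12)
    it = iter(iterable)
    while batch := tuple(islice(it, n)):
        yield batch

def num_batches(tot_len, batch_len):
    return 1 + (tot_len - 1) // batch_len

def count(iterable):
    """Return the number of items in an iterable."""
    n = 0
    for _ in iterable:
        n += 1
    return n

# the closed-form count matches the iterated count
assert num_batches(10, 3) == count(batched(range(10), 3)) == 4
assert num_batches(12, 4) == count(batched(range(12), 4)) == 3
```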
The problem in your example

```python
len(list(batched(iterable, n)))
#   ^^^^ this is the problem
```

is that you are forced to materialize all sub-iterators in a list, because iterators generally do not implement `__len__`. This issue is addressed in the previous point, or can be worked around by implementing `__len__` on the object returned by `batched`, so that `len(batched(iterable, n))` becomes valid.
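A minimal sketch of that workaround, assuming the underlying iterable is itself sized (hypothetical `Batched` class, not an existing API):

```python
from itertools import islice

class Batched:
    """Hypothetical sketch: batched() as a class, so that len() works
    whenever the underlying iterable supports len() itself."""

    def __init__(self, iterable, n):
        self._iterable = iterable
        self._n = n

    def __iter__(self):
        it = iter(self._iterable)
        while batch := tuple(islice(it, self._n)):
            yield batch

    def __len__(self):
        # valid only when the source is sized; an empty source gives 0
        return 1 + (len(self._iterable) - 1) // self._n

assert len(Batched(range(10), 3)) == 4  # no materialization needed
```

`len()` here is pure arithmetic on the source’s length, so nothing is consumed or buffered; it simply is not available when the source is an unsized iterator.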
Now, while it’s true that this example causes buffering, I don’t see it as `batched`’s fault, but rather `list`’s fault. After all, you are collecting things in a list just to count their number: even disregarding the discussion about `batched`, this is clearly sub-optimal.
I don’t think the Rust version would cause buffering in a similar situation, if I understand it correctly, unless one unnecessarily collects the sub-iterators in a `Vec`. The analogous code would be the following:

```rust
let batches = iterator.chunks(5);
batches.count()
```
Thanks to the deterministic nature of `Drop`, at any time during the iteration that occurs implicitly in the `count` method there is a single sub-iterator in scope; hence no buffering occurs because, again, it only buffers if several chunk iterators are alive at the same time.
Achieving the same behavior in Python might be harder because of the “non-determinism” of the garbage collector, which can defer the destruction of objects beyond their last point of use; however, I’m sure there are ways to achieve the same effect in Python.
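One such way, sketched below, is the `itertools.groupby` approach: hand out sub-iterators one at a time and drain the previous one before producing the next, so at most one batch is alive regardless of when the garbage collector runs (hypothetical `batched_iters` helper, not an existing API):

```python
from itertools import islice

_SENTINEL = object()

def _chain_one(first, rest):
    yield first
    yield from rest

def batched_iters(iterable, n):
    """Hypothetical sketch: yield lazy sub-iterators instead of tuples,
    draining each one (groupby-style) before yielding the next, so only
    one batch is ever alive at a time."""
    it = iter(iterable)
    while (first := next(it, _SENTINEL)) is not _SENTINEL:
        rest = islice(it, n - 1)
        yield _chain_one(first, rest)
        for _ in rest:  # exhaust whatever the consumer left unread
            pass

# counting consumes none of the sub-iterators, yet no buffering occurs
assert sum(1 for _ in batched_iters(range(10), 3)) == 4
```

The trade-off is the same as with `groupby`: a sub-iterator collected for later becomes empty once the next one is requested, which is exactly the "at most one alive" discipline the Rust version gets for free from `Drop`.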
Let me rephrase what I said earlier to be more precise, then: in every circumstance where your iterator version is sufficient, this other implementation is equivalent in terms of memory usage (no buffering, no extra allocation), unless one unnecessarily keeps more than one sub-iterator alive.