Nwise itertools

lsproule · April 23, 2024, 3:44pm

gh-118179: Added implementation for nwise to itertools

python:main ← lsproule:nwise

opened 09:02PM - 22 Apr 24 UTC

* Issue: gh-118179

I created this pulll request to add nwise to the itertools module. I saw that we had implemented pairwise, so I implemented nwise. I would have it marked as a draft. I would love some code review and some suggestions. However, I think this would be an excellent edition to the standard library.

ZeroIntensity · April 23, 2024, 4:36pm

Not qualified to do a code review, but I would put the function name and return type on seperate lines, per PEP 7.

bschubert · April 23, 2024, 4:47pm

I found at least one previous proposal for this: [Python-Dev] Proposal for a new itertools function: iwindow

Additions to itertools (and in particular converting recipes to module functions) are a fairly common type of suggestion. It would help your proposal if you could find and address some of the concerns that have been expressed in the past. See for example the discussion thread that lead to itertools.batched:

Also:

lsproule · April 23, 2024, 5:02pm

You’re absolutely right. I will make those changes to formatting.

lsproule · April 23, 2024, 5:06pm

I use it for time_series stuff when I don’t have enough data to warrant numpy. I think it is a very common operation as well, so much so, that we have already allowed pairwise. I think nwise is just the generic case, and would be useful. I personally think it would be a great addition to the standard library.

def nwise(lst, n):
    iters = tee(lst, n)
    for idx, itr in enumerate(iters):
        next(islice(itr, idx, idx), None)

    for group in zip(*iters):
        yield group


print(list(nwise([1, 2, 3, 4, 5], 3)))

This isn’t hard is just boilerplate when it would be nice to exist in the standard library.

Especially since we already have pairwise so why not just add the generic nwise

JamesParrott · April 23, 2024, 5:16pm

more_itertools has had version of this (with bells and whistles) under a different name for the last 4 years:

https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.windowed

and has provided the recipe that the docs state is currently under consideration:

https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.sliding_window

def sliding_window(iterable, n):
    "Collect data into overlapping fixed-length chunks or blocks."
    # sliding_window('ABCDEFG', 4) → ABCD BCDE CDEF DEFG
    it = iter(iterable)
    window = collections.deque(islice(it, n-1), maxlen=n)
    for x in it:
        window.append(x)
        yield tuple(window)

Stefan2 · April 23, 2024, 5:58pm

If you want less boilerplate, you could change your

    for group in zip(*iters):
        yield group

to

    yield from zip(*iters)

or

    return zip(*iters)

elis.byberi · April 23, 2024, 6:02pm

I couldn’t find a link to this discussion thread in the pull request or issue.

lsproule · April 23, 2024, 7:42pm

I mean I agree however. Lowering boilerplate vs

from itertools import nwise

and having to use generator incantations to achieve these things.
I just personally think that since I wrote this is C if we write a nice convience function that is general like nwise, since we have pairwise. I think it makes good sense.

MegaIng · April 23, 2024, 7:51pm

Generally, the answer for proposals to add something to itertools is something like “add it to more_itertools first, check if it’s used a lot and wait till it has a stable interface, then port it over if it isn’t a 3-line implementation or if the C implementation is significantly faster.” Seems like more_itertools.windowed is used quite a bit, I am not really sure where the cutoff is. But I would strongly prefer the more general interface of windowed instead of nwise.

lsproule · April 23, 2024, 8:24pm

Windowed exists now. We have batched, so I would say I would tend to agree that of the more_itertools pieces windowed would be used more. However, that’s functionally equivalent to batched. Which we just added. I think that sliding_window in more_itertools is most definitely used. I certainly think there is a use case for nwise as well, especially since we have pairwise, and I would personally love to have the general case. I also think that we would see speed ups from going from a python implementation in more_itertools to a C implementation in itertools. I think we should think about what sort of keyword arguments we’d like to see maybe take some inspiration from more_itertools. However, I am definitely for taking the more use functions like nwise → sliding_window and batched → windowed and implementing them in itertools. I love reaching for the itertools library whenever I am iterating the first thing I think is. Can I use something from itertools? Is there something in there that will make my life easier? However, if it comes to installing a dependency, I will usually forego that, and implement it myself.

before batched was released I don’t know how many times I wrote

def batched(it, n):
    itr = iter(it)
    while batch := tuple(islice(itr, n)):
          yield batch

I personally think that this is a 3 line implementation, but do I think that batched shouldn’t be added into itertools. I personally am very glad to see this make it’s way into itertools. I think nwise would be an excellent edition too. I would also like to benchmark it and see if we can get it to be as performant as possible.

Stefan2 · April 23, 2024, 9:18pm

The pairwise recipe was only three lines and often faster than the C implementation, though.

MegaIng · April 23, 2024, 9:28pm

That means the C implementation needs to be fixed. By definition of CPython, it is always possible to write a C implementation that is at least as fast as the python implementation (by copying the exact sequence of API calls that would happen anyway for example)
I guess the policy is less strict than I would have thought. @rhettinger added it with very little extra discussion from what I can tell.

lsproule · April 23, 2024, 9:34pm

I am very curious as to how the pure python implementation of pairwise can be faster than the C version…

static PyObject *
pairwise_next(pairwiseobject *po)
{
    PyObject *it = po->it;
    PyObject *old = po->old;
    PyObject *new, *result;

    if (it == NULL) {
        return NULL;
    }
    if (old == NULL) {
        old = (*Py_TYPE(it)->tp_iternext)(it);
        Py_XSETREF(po->old, old);
        if (old == NULL) {
            Py_CLEAR(po->it);
            return NULL;
        }
        it = po->it;
        if (it == NULL) {
            Py_CLEAR(po->old);
            return NULL;
        }
    }
    Py_INCREF(old);
    new = (*Py_TYPE(it)->tp_iternext)(it);
    if (new == NULL) {
        Py_CLEAR(po->it);
        Py_CLEAR(po->old);
        Py_DECREF(old);
        return NULL;
    }
    /* Future optimization: Reuse the result tuple as we do in enumerate() */
    result = PyTuple_Pack(2, old, new);
    Py_XSETREF(po->old, new);
    Py_DECREF(old);
    return result;
}

he is only loading two values and returning them… I would like to see some benchmarks because I don’t really believe python is faster than this implementation, and I still think it’s silly to say oh it only takes 3 lines to implement. that’s 3 lines of code duplicated all over the place. We write high level code with python to have these abstractions. I 100% am behind benchmarking and scrutinizing my implementation. In fact, I think it’s absolutely necessary. However, to say, it’s so easy to write why add a convience layer… is silly to me.

lsproule · April 23, 2024, 9:40pm

I personally think

batched → python3.12 great addition
pairwise → python3.10 great addition

I can write batched in like 2 lines of code. I would love to test the performance vs itertools because I doubt it’s faster, and if it is we should 100% be scrutinizing that because there should be no way that’s possible.

however these additions only make code simpler

from itertools import batched

I am 100% behind expanding itertools with sane iterating tools… ie… itertools

so… my opinion would be to discuss how to optimize nwise, so it’s the most efficient in C, so we can have a great sane iteration tool that is 100% faster than pure python…

MegaIng · April 23, 2024, 9:46pm

The reason why there is a general apprehension is decently exemplified by pairwise: It’s C implementation is apparently suboptimal, and therefore now needs to be fixed. This means extra work by multiple people, potentially a dozen man hours for a 3-line python function… Adding a function to the stdlib means practically indefinite support and effort, so it needs to justify it’s usefulness against that barrier. And C implementations also require extra implementation effort by all the other Python implementations which can’t just reuse the code.

Stefan2 · April 23, 2024, 9:50pm

Here’s one:

 23.1 ± 0.5 μs  consume(recipe(iterable))
 37.5 ± 0.5 μs  consume(pairwise(iterable))

Python: 3.12.0 (main, Oct  7 2023, 10:42:35) [GCC 13.2.1 20230801]

That’s with:

iterable = list(range(1000))
consume = deque(maxlen=0).extend

Benchmark script

from timeit import repeat

setup = '''
from itertools import pairwise, tee
from collections import deque

def recipe(iterable):
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

iterable = list(range(1000))
consume = deque(maxlen=0).extend
'''

codes = [
    'consume(pairwise(iterable))',
    'consume(recipe(iterable))',
]

from timeit import timeit
from statistics import mean, stdev
import sys
import random

times = {code: [] for code in codes}
def stats(code):
    ts = [t * 1e6 for t in sorted(times[code])[:5]]
    return f'{mean(ts):5.1f} ± {stdev(ts):3.1f} μs '
for _ in range(25):
    random.shuffle(codes)
    for code in codes:
        number = 10**3
        t = timeit(code, setup, number=number) / number
        times[code].append(t)
for code in sorted(codes, key=stats):
    print(stats(code), code)

print('\nPython:', sys.version)

Attempt This Online!

lsproule · April 23, 2024, 10:00pm

In this case, I don’t think that mean we should be hesitant about adding functionality. I do think that we need to revisit what is going on with pairwise, I will try seeing if I can do some benchmarks on the C code for pairwise, and see if we can implement it in terms of tee… That isn’t a bad idea actually. I mean since tee is written… I should be able to write them in terms of tee I want to see how this will affect performance. I 100% agree that this is unacceptable, however, at the end of the day… We can do better, and I think it’s worth it to do better.

Stefan2 · April 23, 2024, 10:00pm

Related topic: Why is pairwise slower than zip?

lsproule · April 23, 2024, 10:04pm

I mean tee is clearly faster, but we can use tee in C, and see how that affects performance… I will benchmark what I have, and see what I can do. I 100% agree with this take, if we are implementing in C there should be no excuse for it being slower.