Nwise itertools

I created this pulll request to add nwise to the itertools module. I saw that we had implemented pairwise, so I implemented nwise. I would have it marked as a draft. I would love some code review and some suggestions. However, I think this would be an excellent edition to the standard library.

1 Like

Not qualified to do a code review, but I would put the function name and return type on seperate lines, per PEP 7.

1 Like

I found at least one previous proposal for this: [Python-Dev] Proposal for a new itertools function: iwindow

Additions to itertools (and in particular converting recipes to module functions) are a fairly common type of suggestion. It would help your proposal if you could find and address some of the concerns that have been expressed in the past. See for example the discussion thread that lead to itertools.batched:

Also:

1 Like

You’re absolutely right. I will make those changes to formatting.

I use it for time_series stuff when I don’t have enough data to warrant numpy. I think it is a very common operation as well, so much so, that we have already allowed pairwise. I think nwise is just the generic case, and would be useful. I personally think it would be a great addition to the standard library.

def nwise(lst, n):
    iters = tee(lst, n)
    for idx, itr in enumerate(iters):
        next(islice(itr, idx, idx), None)

    for group in zip(*iters):
        yield group


print(list(nwise([1, 2, 3, 4, 5], 3)))

This isn’t hard is just boilerplate when it would be nice to exist in the standard library.

Especially since we already have pairwise so why not just add the generic nwise

more_itertools has had version of this (with bells and whistles) under a different name for the last 4 years:

https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.windowed

and has provided the recipe that the docs state is currently under consideration:

https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.sliding_window

def sliding_window(iterable, n):
    "Collect data into overlapping fixed-length chunks or blocks."
    # sliding_window('ABCDEFG', 4) → ABCD BCDE CDEF DEFG
    it = iter(iterable)
    window = collections.deque(islice(it, n-1), maxlen=n)
    for x in it:
        window.append(x)
        yield tuple(window)

If you want less boilerplate, you could change your

    for group in zip(*iters):
        yield group

to

    yield from zip(*iters)

or

    return zip(*iters)

I couldn’t find a link to this discussion thread in the pull request or issue.

I mean I agree however. Lowering boilerplate vs

from itertools import nwise

and having to use generator incantations to achieve these things.
I just personally think that since I wrote this is C if we write a nice convience function that is general like nwise, since we have pairwise. I think it makes good sense.

Generally, the answer for proposals to add something to itertools is something like “add it to more_itertools first, check if it’s used a lot and wait till it has a stable interface, then port it over if it isn’t a 3-line implementation or if the C implementation is significantly faster.” Seems like more_itertools.windowed is used quite a bit, I am not really sure where the cutoff is. But I would strongly prefer the more general interface of windowed instead of nwise.

6 Likes

Windowed exists now. We have batched, so I would say I would tend to agree that of the more_itertools pieces windowed would be used more. However, that’s functionally equivalent to batched. Which we just added. I think that sliding_window in more_itertools is most definitely used. I certainly think there is a use case for nwise as well, especially since we have pairwise, and I would personally love to have the general case. I also think that we would see speed ups from going from a python implementation in more_itertools to a C implementation in itertools. I think we should think about what sort of keyword arguments we’d like to see maybe take some inspiration from more_itertools. However, I am definitely for taking the more use functions like nwise → sliding_window and batched → windowed and implementing them in itertools. I love reaching for the itertools library whenever I am iterating the first thing I think is. Can I use something from itertools? Is there something in there that will make my life easier? However, if it comes to installing a dependency, I will usually forego that, and implement it myself.

before batched was released I don’t know how many times I wrote

def batched(it, n):
    itr = iter(it)
    while batch := tuple(islice(itr, n)):
          yield batch

I personally think that this is a 3 line implementation, but do I think that batched shouldn’t be added into itertools. I personally am very glad to see this make it’s way into itertools. I think nwise would be an excellent edition too. I would also like to benchmark it and see if we can get it to be as performant as possible.

The pairwise recipe was only three lines and often faster than the C implementation, though.

1 Like
  • That means the C implementation needs to be fixed. By definition of CPython, it is always possible to write a C implementation that is at least as fast as the python implementation (by copying the exact sequence of API calls that would happen anyway for example)
  • I guess the policy is less strict than I would have thought. @rhettinger added it with very little extra discussion from what I can tell.
2 Likes

I am very curious as to how the pure python implementation of pairwise can be faster than the C version…

static PyObject *
pairwise_next(pairwiseobject *po)
{
    PyObject *it = po->it;
    PyObject *old = po->old;
    PyObject *new, *result;

    if (it == NULL) {
        return NULL;
    }
    if (old == NULL) {
        old = (*Py_TYPE(it)->tp_iternext)(it);
        Py_XSETREF(po->old, old);
        if (old == NULL) {
            Py_CLEAR(po->it);
            return NULL;
        }
        it = po->it;
        if (it == NULL) {
            Py_CLEAR(po->old);
            return NULL;
        }
    }
    Py_INCREF(old);
    new = (*Py_TYPE(it)->tp_iternext)(it);
    if (new == NULL) {
        Py_CLEAR(po->it);
        Py_CLEAR(po->old);
        Py_DECREF(old);
        return NULL;
    }
    /* Future optimization: Reuse the result tuple as we do in enumerate() */
    result = PyTuple_Pack(2, old, new);
    Py_XSETREF(po->old, new);
    Py_DECREF(old);
    return result;
}

he is only loading two values and returning them… I would like to see some benchmarks because I don’t really believe python is faster than this implementation, and I still think it’s silly to say oh it only takes 3 lines to implement. that’s 3 lines of code duplicated all over the place. We write high level code with python to have these abstractions. I 100% am behind benchmarking and scrutinizing my implementation. In fact, I think it’s absolutely necessary. However, to say, it’s so easy to write why add a convience layer… is silly to me.

1 Like

I personally think

  • batched → python3.12 great addition
  • pairwise → python3.10 great addition

I can write batched in like 2 lines of code. I would love to test the performance vs itertools because I doubt it’s faster, and if it is we should 100% be scrutinizing that because there should be no way that’s possible.

however these additions only make code simpler

from itertools import batched

I am 100% behind expanding itertools with sane iterating tools… ie… itertools

so… my opinion would be to discuss how to optimize nwise, so it’s the most efficient in C, so we can have a great sane iteration tool that is 100% faster than pure python…

1 Like

The reason why there is a general apprehension is decently exemplified by pairwise: It’s C implementation is apparently suboptimal, and therefore now needs to be fixed. This means extra work by multiple people, potentially a dozen man hours for a 3-line python function… Adding a function to the stdlib means practically indefinite support and effort, so it needs to justify it’s usefulness against that barrier. And C implementations also require extra implementation effort by all the other Python implementations which can’t just reuse the code.

2 Likes

Here’s one:

 23.1 ± 0.5 μs  consume(recipe(iterable))
 37.5 ± 0.5 μs  consume(pairwise(iterable))

Python: 3.12.0 (main, Oct  7 2023, 10:42:35) [GCC 13.2.1 20230801]

That’s with:

iterable = list(range(1000))
consume = deque(maxlen=0).extend
Benchmark script
from timeit import repeat

setup = '''
from itertools import pairwise, tee
from collections import deque

def recipe(iterable):
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

iterable = list(range(1000))
consume = deque(maxlen=0).extend
'''

codes = [
    'consume(pairwise(iterable))',
    'consume(recipe(iterable))',
]

from timeit import timeit
from statistics import mean, stdev
import sys
import random

times = {code: [] for code in codes}
def stats(code):
    ts = [t * 1e6 for t in sorted(times[code])[:5]]
    return f'{mean(ts):5.1f} ± {stdev(ts):3.1f} μs '
for _ in range(25):
    random.shuffle(codes)
    for code in codes:
        number = 10**3
        t = timeit(code, setup, number=number) / number
        times[code].append(t)
for code in sorted(codes, key=stats):
    print(stats(code), code)

print('\nPython:', sys.version)

Attempt This Online!

In this case, I don’t think that mean we should be hesitant about adding functionality. I do think that we need to revisit what is going on with pairwise, I will try seeing if I can do some benchmarks on the C code for pairwise, and see if we can implement it in terms of tee… That isn’t a bad idea actually. I mean since tee is written… I should be able to write them in terms of tee I want to see how this will affect performance. I 100% agree that this is unacceptable, however, at the end of the day… We can do better, and I think it’s worth it to do better.

Related topic: Why is pairwise slower than zip?

I mean tee is clearly faster, but we can use tee in C, and see how that affects performance… I will benchmark what I have, and see what I can do. I 100% agree with this take, if we are implementing in C there should be no excuse for it being slower.