Why is `pairwise` slower than `zip`?

Times for each of them to create a million (None, None) pairs:

 23.17 ± 0.83 ms  zip_
 33.27 ± 0.63 ms  pairwise_

Python: 3.11.4 (main, Sep  9 2023, 15:09:21) [GCC 13.2.1 20230801]

The tested things:

def it():
    return repeat(None, 10**6)

def pairwise_():
    return pairwise(it())

def zip_():
    return zip(it(), it())

Why is zip so much faster? For each pair it has to get elements from two input iterators, whereas pairwise reuses one element and only fetches one new one. So pairwise should be faster.
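For reference, the itertools documentation gives a roughly equivalent pure-Python version of pairwise, which makes the element reuse explicit (this is a sketch of the semantics, not the actual C implementation):

```python
def pairwise_py(iterable):
    # Roughly equivalent to itertools.pairwise (per the itertools docs):
    # each step reuses the previous element instead of fetching it again.
    iterator = iter(iterable)
    a = next(iterator, None)
    for b in iterator:
        yield a, b
        a = b  # reuse: the second item of this pair is the first of the next
```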

Comparing zip_next and pairwise_next, the only potential reason I see is the latter’s PyTuple_Pack(2, old, new). Is that it? Is PyTuple_Pack harmfully slow? Should there maybe be a variant for exactly two elements, i.e., something like PyTuple_Pack(old, new) without the count argument?

Benchmark script

from timeit import timeit
from statistics import mean, stdev
from collections import deque
from itertools import pairwise, repeat
import sys


def it():
    return repeat(None, 10**6)

def pairwise_():
    return pairwise(it())

def zip_():
    return zip(it(), it())


funcs = pairwise_, zip_

consume = deque(maxlen=1).extend

times = {f: [] for f in funcs}
def stats(f):
    ts = [t * 1e3 for t in sorted(times[f])[:5]]
    return f'{mean(ts):6.2f} ± {stdev(ts):4.2f} ms '
for _ in range(25):
    for f in funcs:
        t = timeit(lambda: consume(f()), number=1)
        times[f].append(t)
for f in sorted(funcs, key=stats):
    print(stats(f), f.__name__)

print('\nPython:', sys.version)

I cannot duplicate with 3.10 on Windows 10.

15.93 ± 0.12 ms  pairwise_
16.14 ± 0.17 ms  zip_

Python: 3.10.8 | packaged by conda-forge | (main, Nov 24 2022, 14:07:00) [MSC v.1916 64 bit (AMD64)]

Note that I used consume = deque(maxlen=1).extend to consume them. Times for other consumers:

A for loop keeping a reference shows a similar picture:

def consume(iterable):
    for element in iterable:
        pass

 26.20 ± 0.67 ms  zip_
 37.37 ± 1.12 ms  pairwise_

maxlen=0 would allow zip to reuse its result tuple, an optimization which pairwise doesn’t have:

consume = deque(maxlen=0).extend

  8.79 ± 0.05 ms  zip_
 31.01 ± 0.09 ms  pairwise_
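That reuse is a CPython implementation detail: zip holds a reference to its last result tuple and overwrites it in place when nothing else still references it. It can be observed with a small sketch (the id match is typical on a standard CPython build but not guaranteed by the language):

```python
z = zip("ab", "cd")
t1 = next(z)      # ('a', 'c')
addr = id(t1)
del t1            # drop our reference; only zip's internal one remains
t2 = next(z)      # ('b', 'd')
# On CPython, zip typically reuses the tuple object, so the ids match:
print(t2, id(t2) == addr)
```

Keeping a reference to the previous tuple (as `deque(maxlen=1)` does) blocks this path, which is exactly why the benchmark uses maxlen=1.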

Note that these benchmarks were run independently, so times across benchmark runs aren’t totally comparable because the machine might’ve been differently busy. Only within each benchmark run are times comparable.

Thanks. I tried on replit.com’s 3.10.11 now and can reproduce it there:

 22.01 ± 0.35 ms  zip_
 28.33 ± 0.53 ms  pairwise_

Python: 3.10.11 (main, Apr  4 2023, 22:10:32) [GCC 12.2.0]

Perhaps a better question is “Why is pairwise slower than zip on Linux?”

*James Parrot has duplicated the slowdown on Windows for 3.10 and 3.11.

Hi Steven,
That would be a good question, but there can still be a difference on Windows machines too. Sorry, I meant to chip in earlier with my results from Python 3.11 on Windows 11, which confirm Stefan’s findings, but thought I’d leave it to you experts. I installed 3.10 just now out of curiosity to try to recreate your findings, and will report back shortly with the results for 3.9.

 36.50 ± 0.30 ms  zip_
 45.76 ± 0.70 ms  pairwise_

Python: 3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)]

 37.93 ± 0.20 ms  zip_
 44.37 ± 0.56 ms  pairwise_

Python: 3.10.11 (tags/v3.10.11:7d4cc5a, Apr  5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]

Thanks for double-checking the results on your machine.

No need to check as pairwise was added in 3.10.

Indeed it was. Python 3.10.8 still shows a difference on mine. I assume the speedup on yours is due to your hardware, not to conda-forge.

 34.55 ± 0.16 ms  zip_
 45.55 ± 0.28 ms  pairwise_

Python: 3.10.8 (tags/v3.10.8:aaaf517, Oct 11 2022, 16:50:30) [MSC v.1933 64 bit (AMD64)]

Apple M2:

 10.41 ± 0.04 ms  zip_
 11.45 ± 0.09 ms  pairwise_

Python: 3.11.5 | packaged by conda-forge | (main, Aug 27 2023, 03:33:12) [Clang 15.0.7 ]

If I modify

consume = deque(maxlen=1).extend

to

consume = deque(maxlen=3).extend  # or a value > 3

then I consistently get a reversed order:

 11.57 ± 0.01 ms  pairwise_
 12.05 ± 0.02 ms  zip_

There’s also a comment right above that, suggesting that it could reuse the tuple. That’s what zip_next is doing, although the comment references enumobject.c. Seems likely that’s the culprit?

No, that tuple reuse optimization isn’t it. I mentioned that in my second post here, where I used maxlen=0 to show the effect of that optimization. My real benchmark uses maxlen=1 to avoid that optimization (the deque then keeps a reference to the latest tuple, so zip can’t reuse it).

Interesting. For me, zip remains significantly faster, but less so. With maxlen=7, zip is still slightly faster, and with maxlen≥8, pairwise becomes slightly faster (but even with maxlen=50 it’s only slightly faster).

The benchmark’s input repeat(None, 10**6) is so fast that the difference between calling the iterator and reusing an element is not noticeable. With a slower input, the difference is more evident.

def it():
    for _ in range(10 ** 6):
        yield None

The new timings are:

 32.01 ± 0.01 ms  pairwise_
 54.73 ± 0.03 ms  zip_

The pairwise_ and zip_ code are not equivalent.

>>> def it():
...     return iter('abc')
... 
>>> list(pairwise_())
[('a', 'b'), ('b', 'c')]
>>> list(zip_())
[('a', 'a'), ('b', 'b'), ('c', 'c')]

@Wombat Not sure what your point is. I showed something surprising and asked for the reason. You showed something expected and … I don’t see why. And yes, I know they’re not equivalent. The zip one involves more work and was still faster.

It varies quite a bit across runs for me on Windows 10, but pairwise is consistently faster than zip under the current python.org distribution. Typical:

 17.28 ± 0.03 ms  pairwise_
 18.79 ± 0.04 ms  zip_

Python: 3.13.6 (tags/v3.13.6:4e66535, Aug  6 2025, 14:36:00) [MSC v.1944 64 bit (AMD64)]

Also under PyPy, which is quite a bit slower(!) than CPython on this task:

 33.85 ± 0.28 ms  pairwise_
 42.54 ± 0.27 ms  zip_

Python: 3.10.12 (af44d0b8114cb82c40a07bb9ee9c1ca8a1b3688c, Jun 15 2023, 15:42:22)
[PyPy 7.3.12 with MSC v.1929 64 bit (AMD64)]

There’s typically no answer to things like this short of staring at the generated machine code. The compiler and compiler flags in use can have huge effects on that.

“Surprise” means that a theoretical expectation was violated. Your stated theoretical expectation was that reusing an element would be cheaper than a second iterator call.

First, reusing an element is cheap but it isn’t free. It involves reading a structure element, replacing that element, and updating reference counts.

Second, iterator calls can be very cheap, as in the case of repeat with None. The call step in pairwise is just new = (*Py_TYPE(it)->tp_iternext)(it), and the response in repeat is Py_NewRef(ro->element), which happens in parallel with if (ro->cnt > 0) ro->cnt--;.

With that knowledge, an updated expectation for your example should be that the costs would be about the same, plus or minus noise due to different builds run on different processors. Running your original code today on the Python 3.14 release candidate shows the zip variant as a bit slower.

 11.05 ± 0.02 ms  pairwise_
 11.81 ± 0.03 ms  zip_

But with another example (the one I provided), the iterator calls are in fact slower than the value reuse logic.

You asked for the reason and it is this: you were only surprised because you chose an example that violated your premise that iterator calls are more expensive than value reuse. With an alternate example, you can see that the expectation was correct in general and that there is no surprise.

The whole thread boils down to, “I am surprised that repeat.__next__ is so cheap that it can rival the speed of accessing a structure member.”

When investigating your question, my surprise and delight is how fast both pieces of code run. After dividing by the 10 ** 6 iterations, the timings are in the 11 ns range, making that iterator chain one of the least expensive things you can do in Python. Think about it. The repeat iterator supplies values, the zip and pairwise iterators retrieve those values, manipulate them, and emit tuples to a deque that consumes them, all in 11 ns. For Python, that is impressive.

Of course.

I know. That’s why I used it.

But it involves similar actions to reusing an element, and more. Especially in zip, which additionally fetches both iterators via a loop and via its tuple of iterators. So I still expect reusing an element and using only one iterator (with more direct access) to be faster.

Why talk about the low costs for pairwise? That’s not going to explain why zip was faster.

The code has changed, so I don’t see the point of this. How would that help explain speeds of the old code?

Very doubtful. Where’s the proof?

In particular, where’s the proof that the iterator calls were not more expensive or even were much less expensive than value reuse, and where is the proof that none of the other differences were reasons?

No. (Again, I see no proof that it did rival it, and it would have to significantly beat it in order to potentially explain why zip was significantly faster.)

11 ns, not μs.

I can’t try it on the old system anymore, but I believe (based on my memory as well as measuring again now) that deque(repeat(None, 10**6), 0) took about 3 ns per element. My original benchmark had 23 ns per pair for zip and 33 ns per pair for pairwise. So if Zeke were right and the reason for that 10 ns difference were that reusing an element is slow, it would have to take 3+10=13 ns per element. Over four times slower than getting an element from the iterator. I think that’s unrealistic, and more likely the reason was something else.
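For reference, that per-element baseline can be re-measured with a sketch like this (absolute numbers depend entirely on the machine, so only the rough magnitude matters):

```python
from collections import deque
from itertools import repeat
from timeit import timeit

n = 10**6
# Drain a million Nones into a zero-length deque and report the
# per-element cost; min of a few runs to reduce noise.
t = min(timeit(lambda: deque(repeat(None, n), maxlen=0), number=1)
        for _ in range(5))
print(f'{t / n * 1e9:.1f} ns per element')
```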

Topic has devolved into arguing and nitpicking. It was also resurrected after 2 years. Don’t do any of those things.