Given that the docs say the recipe in heapq — Heap queue algorithm — Python 3.13.3 documentation is unstable, I would expect that this is also unstable?
Would be good to include this in the docs?
Reading the source code, it appears to explicitly contain code to make it stable.
The docs don't say so, but two plausible meanings for "stable" are satisfied by `merge()`:

- If `it1` is an input iterator yielding some duplicate values, `merge()` will produce them in the same order as `it1` (intra-iterator stability).
- If `it2` is an input iterator that appears after `it1` in the argument list, and produces some value also produced by `it1`, then all the duplicate values from `it1` will be produced before any from `it2` (inter-iterator stability).
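Both can be checked directly. A minimal demo (the single-element lists are just distinguishable objects that compare by value; the data is made up):

```python
import heapq

# Single-element lists compare by value but are distinct objects.
it1 = [[0], [1], [1], [2]]
it2 = [[1], [3]]
out = list(heapq.merge(it1, it2))

# Intra-iterator stability: it1's two [1]s appear in their original order.
assert out[1] is it1[1] and out[2] is it1[2]
# Inter-iterator stability: both of it1's [1]s appear before it2's [1].
assert out[3] is it2[0]
```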
Note a subtlety: these are true even if `reverse=True` is passed. Few people realize it, but the same is true of passing `reverse=True` to `list.sort()` or `sorted()`: the original order of equal elements is preserved regardless of whether `reverse` is in effect. `reverse` only affects the order of non-equal elements.
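A quick way to see that (the pairs and key here are just for illustration):

```python
pairs = [(1, 'a'), (2, 'b'), (1, 'c'), (2, 'd')]
key = lambda p: p[0]

# Ascending: equal-key items keep their original left-to-right order.
assert sorted(pairs, key=key) == [(1, 'a'), (1, 'c'), (2, 'b'), (2, 'd')]
# Descending: non-equal order flips, but equal-key order is still preserved.
assert sorted(pairs, key=key, reverse=True) == [(2, 'b'), (2, 'd'), (1, 'a'), (1, 'c')]
```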
Hmm, this is interesting, thx. Sounds like a good tricky interview question! But maybe this nuance should be properly stated in the documentation?
Have at it, if you like. Nobody ever complains about it that I know of, so I'm not going to mess with what appears to be working.
I think most people never really thought about it, and semi-consciously assume `xs.sort(reverse=True)` is the same as

```python
xs.sort()
xs.reverse()
```

But it's not. It is the same as:

```python
xs.reverse()
xs.sort()
xs.reverse()
```

which is how it's actually implemented. That's the only connection between the keyword argument `reverse=` and the method `.reverse()`. They have different purposes.
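A small check of that equivalence (the data is just for illustration; any input with duplicate keys shows the difference):

```python
xs = [(1, 'a'), (1, 'b'), (2, 'c')]
key = lambda t: t[0]

a = sorted(xs, key=key, reverse=True)

b = sorted(xs, key=key)   # sort, then reverse
b.reverse()

c = list(reversed(xs))    # reverse, sort, reverse
c.sort(key=key)
c.reverse()

assert a == c == [(2, 'c'), (1, 'a'), (1, 'b')]
assert b == [(2, 'c'), (1, 'b'), (1, 'a')]  # equal keys flipped
```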
The docs kind of say this, via "If [`reverse=`] set to True, then the list elements are sorted as if each comparison were reversed." I'd prefer plainer English, like "by default, elements are ordered from smallest to largest, but if `reverse=True` from largest to smallest; regardless, the original order of equal elements is preserved."
All that's very intentional for lists. I don't know whether it was intentional for `heapq.merge()`. That stores tuples in the heap, one per input iterator. Each tuple holds the next unresolved value from its iterator, and the iterator's bound `__next__` method.

But that alone could blow up if different iterators had the same next value. Then tuple comparison could go on to try to compare the bound methods, and those don't implement ordering comparisons.

So, between those two, the index of the iterator in the original argument list is also stored. That ensures that all the tuples in the heap at any given time are totally ordered by lexicographic tuple comparison, never going on to compare the 3rd components.

"Inter-iterator stability" may just be a happy consequence of what was done to prevent the implementation from blowing up; perhaps Raymond didn't want the implementation to be constrained by it, and so didn't mention it in the docs.
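For concreteness, here's a minimal sketch of that heap-entry scheme (modeled on CPython's `heapq.merge()` but heavily simplified: no `key=` or `reverse=` handling):

```python
import heapq

def merge_sketch(*iterables):
    heap = []
    for order, it in enumerate(map(iter, iterables)):
        try:
            nxt = it.__next__
            # [value, iterator index, bound __next__]: on equal values,
            # comparison falls back to the unique index, so it never
            # reaches the (unorderable) bound method in the last slot.
            heap.append([nxt(), order, nxt])
        except StopIteration:
            pass
    heapq.heapify(heap)
    while heap:
        entry = heap[0]
        yield entry[0]
        try:
            entry[0] = entry[2]()           # advance that iterator
            heapq.heapreplace(heap, entry)  # restore the heap invariant
        except StopIteration:
            heapq.heappop(heap)             # that iterator is exhausted
```

Since ties fall back to the iterator index, the entries are always totally ordered, and on equal values the earlier iterator wins; that is exactly the inter-iterator stability described above.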
OTOH, PyPy has a more efficient version of `heapq.merge()`, written by @sweeneyde, which doesn't use heaps, and doesn't need that trick. I recall at the time he was writing it that we agreed "inter-iterator stability" was nice to have, and, as I recall (but may be wrong), that it fell out "for free" from the rules used to promote values in his tournament tree ("in case of a tie, the left child wins").
I am just thinking whether this is possible with reasonable effort. 2 iterators is easy. Many of them - not so much. Let me know if you can remember how to find it. Looked at PyPy - pypy/lib-python/2.7/heapq.py at main · pypy/pypy · GitHub - seems to be using the same one.

Never mind, that is the GitHub copy of 2.7… Found it in my local installation.
How about a little competition for fastest merger of 2 iterators? I wrote a few a while ago…
> How about a little competition for fastest merger of 2 iterators?
> I wrote a few a while ago…
Rules?
Test cases?
Buffered ok?
In a tournament tree, when a value is delivered, it just has to go to the leaf holding the iterator it came from, get the next value, and move it up the tree, promoting the (possibly new) winner at each level: log2(n) operations total. And they're plain element compares, not costlier tuple compares.

Pushing onto a heap needs at most log2(n) operations, but popping requires at least log2(n).

Trying to represent a tournament tree in an array is a nightmare, though, especially in a context (like merging) where the number of levels can change over time (as iterators become exhausted). So the PyPy code uses classes, with tree nodes stitched together with parent and child pointers. PyPy does a good job of optimizing such code, but CPython not so much.
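This isn't the PyPy code, but one way to get a feel for the compare count: a balanced binary tree of plain 2-way merges also does roughly log2(k) single-element compares per value delivered when merging k sorted iterators. A sketch (`tree_merge` and `two_way` are illustrative names, not PyPy's):

```python
def two_way(a, b):
    # Stable 2-way merge of iterators: on ties, values from `a` win.
    try:
        x = next(a)
    except StopIteration:
        yield from b
        return
    for y in b:
        while not (y < x):   # x <= y: the left input's value goes first
            yield x
            try:
                x = next(a)
            except StopIteration:
                yield y
                yield from b
                return
        yield y
    yield x
    yield from a

def tree_merge(*iterables):
    # Pair up streams level by level; each delivered element climbs
    # about log2(k) two_way merges, one plain compare per level.
    its = [iter(it) for it in iterables]
    if not its:
        return iter(())
    while len(its) > 1:
        paired = [two_way(a, b) for a, b in zip(its[::2], its[1::2])]
        if len(its) % 2:
            paired.append(its[-1])  # odd one out is carried up as-is
        its = paired
    return its[0]
```

The `not (y < x)` test makes the left input win ties at every level, the same "in case of a tie, the left child wins" rule, so both stabilities carry through.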
In CPython both implementations run approximately at the same speed.
Compiling it with Cython made it run 2x slower. I think it would be possible to optimize it properly and put it in a Cython-friendly form.
If this ever becomes a bottleneck, I know where to find it.
I’d say two iterators of a million ints (or less if that takes too long for good benchmarking). But solutions are not allowed to assume/use the finite length. Buffered means what? Keeping not just one element in memory but like 100 or 1000? Then yes, because I did that, too.
> I'd say two iterators of a million ints
```python
import random
random.seed(?)  # which seed?
M = 10**6
a = iter(sorted(random.randint(0, M) for _ in range(M)))
b = iter(sorted(random.randint(0, M) for _ in range(M)))
```
> Buffered means what? Keeping not just one element in memory but like 100 or 1000? Then yes, because I did that too.
But where is the limit? Can just convert to lists and do an in-place merge.
> In CPython both implementations run approximately at the same speed …
That has scant meaning without knowing exactly what was tested. The more expensive comparisons are, the more it pays to do fewer compares. Comparing, e.g., floats is about the cheapest it gets in CPython. Invoking user-defined comparisons can be arbitrarily costlier.
It's not that PyPy classes are faster than CPython array indexing. It's that a tournament tree does fewer compares than a heap-based merge.

The advantage also grows more apparent the more iterators are being merged. If there are, e.g., only 2, the heap only has 2 levels. Popping does no compare then, and pushing its replacement only 1.
No, can’t buffer unlimited. A fixed, reasonably small limit. I just checked, I used 1000. (My go-to size for chunked solutions.)
The inputs look alright. Mine used `sorted(random.choices(range(n), k=n))` with `n = 100_000`.
I haven't spent much time on this, but figured it would be hard to beat just pasting the two lists together and letting `sorted()` have at it. Which doesn't respect the "buffer size" constraint, so is very likely an unreachable lower bound.
Anyway, the simplest non-brain-dead merge I could think of did like so, under CPython (python.org released 3.13, Windows 10):
```
sort 0.254
mine 0.531
heap 0.684
```
and under PyPy:
```
sort 0.035
mine 0.101
heap 0.191
```
No buffering of any kind in "mine". It's apparently hard not to beat `heapq.merge()`. And the slowest under PyPy is quicker than the fastest under CPython. I believe that has a lot to do with the fact that integers of the size @dg-pb picked are, in PyPy lists, "unboxed" (not objects - native machine bits).
```python
from time import perf_counter as now
import random
import heapq

random.seed(31415926535)
M = 10**6
a = sorted(random.randint(0, M) for _ in range(M))
b = sorted(random.randint(0, M) for _ in range(M))

class Inf:
    # Sentinel that compares greater than any real value.
    def __lt__(x, y):
        return False
    def __gt__(x, y):
        return True

INF = Inf()

def wrap(xs):
    # Yield the values of xs, then an endless stream of sentinels.
    yield from xs
    while True:
        yield INF

def merge(a, b):
    nexta = a.__next__
    nextb = b.__next__
    aval = nexta()
    bval = nextb()
    while True:
        if bval < aval:
            t = bval
            bval = nextb()
        else:
            t = aval
            aval = nexta()
        if t is INF:  # both inputs exhausted
            break
        yield t

def format_time():
    return f"{finish - start:.3f}"

start = now()
exp = sorted(a + b)
finish = now()
print("sort", format_time())

start = now()
res = list(merge(wrap(a), wrap(b)))
finish = now()
print("mine", format_time())
assert res == exp

start = now()
exp = list(heapq.merge(a, b))
finish = now()
print("heap", format_time())
assert res == exp
```
I forgot a rule: it should (must?) be stable, in both ways Tim mentioned.
Ok, here are my contenders.

`merge_nested_states(xs, ys)` is always in one of two states: either it holds the current x and yields smaller ys through, or it holds the current y and yields smaller (or equal) xs through. Whenever one y (or x) isn't smaller, go to the other state. Since Python doesn't have a "goto", I nested the two states, so I run into and drop out of the second state.

`merge_buffered(xs, ys)` has a buffer of the next 1000 values for each input iterator. Always check which of the two buffers ends with the smaller value, merge it with the appropriate prefix of the other buffer (using `list.sort`), yield that merged chunk, and refill the buffers.

Benchmarks later…
```python
def merge_nested_states(xs, ys):
    xs = iter(xs)
    for x in xs:
        break
    else:
        yield from ys
        return
    ys = iter(ys)
    # Holding x
    for y in ys:
        if y < x:
            yield y
            continue
        yield x
        # Holding y
        for x in xs:
            if y < x:
                yield y
                break
            yield x
        # Ran out of xs
        else:
            yield y
            yield from ys
            return
    # Ran out of ys
    yield x
    yield from xs


from itertools import islice, chain
from bisect import bisect_left, bisect_right

def merge_buffered(xs, ys):
    xit = iter(xs)
    yit = iter(ys)
    k = 1000
    def chunks():
        xs = list(islice(xit, k))
        ys = list(islice(yit, k))
        while xs and ys:
            if ys[-1] < xs[-1]:
                i = bisect_right(xs, ys[-1])
                ys[:0] = xs[:i]
                del xs[:i]
                ys.sort()
                yield ys
                xs += islice(xit, i)
                ys[:] = islice(yit, k)
            else:
                i = bisect_left(ys, xs[-1])
                xs += ys[:i]
                del ys[:i]
                xs.sort()
                yield xs
                xs[:] = islice(xit, k)
                ys += islice(yit, i)
        yield xs or ys
        yield xit
        yield yit
    return chain.from_iterable(chunks())


funcs = [
    merge_nested_states,
    merge_buffered,
]

import random
import operator

# Testing, including stabilities
for merge in funcs * 3:
    for k in 10**3, 10**4, 10**5:
        xs = [[x] for x in sorted(random.choices(range(10**4), k=k))]
        ys = [[y] for y in sorted(random.choices(range(10**4), k=k))]
        expect = sorted(xs + ys)
        result = list(merge(iter(xs), iter(ys)))
        if result != expect:
            print(merge.__name__, 'failed miserably', len(expect), len(result))
        elif not all(map(operator.is_, result, expect)):
            print(merge.__name__, "isn't stable")
print('done')
```
My 2 versions:
```python
from itertools import batched, islice, chain, repeat
from bisect import bisect_left, bisect_right

def merge2(it1, it2):
    it1 = iter(it1)
    try:
        v1 = next(it1)
    except StopIteration:
        yield from it2
        return
    it2 = iter(it2)
    try:
        v2 = next(it2)
    except StopIteration:
        yield v1
        yield from it1
        return
    while 1:
        if v1 <= v2:
            yield v1
            try:
                v1 = next(it1)
            except StopIteration:
                yield v2
                yield from it2
                return
        else:
            yield v2
            try:
                v2 = next(it2)
            except StopIteration:
                yield v1
                yield from it1
                return

def merge2batched(a, b, n=1000):
    b = iter(b)
    bb = list(islice(b, n))
    bbclear = bb.clear
    bbextend = bb.extend
    bbsort = bb.sort
    if not bb:
        yield from a
        return
    a = iter(a)
    b1 = bb[-1]
    aa = []
    aaclear = aa.clear
    aaextend = aa.extend
    aasort = aa.sort
    for _ in map(aaextend, chain(batched(a, n), repeat(()))):
        if not aa:
            yield from bb
            yield from b
            return
        a1 = aa[-1]  # [0, 1, 1, 1]
        if a1 <= b1:
            i = bisect_right(bb, a1)
            if not i:
                yield from aa
                aaclear()
            else:
                aaextend(bb[:i])
                aasort()
                del bb[:i]
                bbextend(islice(b, i))
                yield from aa
                if not bb:
                    yield from a
                    return
                aaclear()
                b1 = bb[-1]
        else:
            while aa:
                i = bisect_right(aa, b1)
                if not i:
                    yield from bb
                    bb[:] = islice(b, n)
                    if not bb:
                        yield from aa
                        yield from a
                        return
                    b1 = bb[-1]
                else:
                    yield from sorted(aa[:i] + bb)
                    del aa[:i]
                    bb[:] = islice(b, n)
                    if not bb:
                        yield from aa
                        yield from a
                        return
                    b1 = bb[-1]
                if b1 > a1:
                    break
```
And did a benchmark:

```
macOS-11.7.10-x86_64-i386-64bit | CPython: 3.12.4
5 repeats, 1 times | 2025-05-28T15:05:49
Units: ms

merge_lb                207
merge                   631
merge_tp                547
merge_nested_states     353
merge2                  355
merge_buffered          292
merge2batched           372
```
My `merge2` is just a straightforward one - nothing else came to mind and I had this one written.

The biggest improvement to my buffered version would be to write the `chunks` function the way @Stefan2 did, to avoid `yield from`; that would bring its speed to ~320.
Here are timings on the same data, on the same box, for all methods so far.
- `sort`: paste together and call `sorted`.
- `mine`: a small variation of the code I posted, to cut the number of times "`is INF`" is tested about in half.
- `heap`: `heapq.merge()`.
- `nest`: @Stefan2's `merge_nested_states()`.
- `buff`: @Stefan2's `merge_buffered()`.
- `dg1`: @dg-pb's `merge2()`.
- `dg2`: @dg-pb's `merge2batched()`, but I supplied my own implementation of `itertools.batched` under PyPy (whose itertools doesn't yet have it).

There are 3 timings per line, sorted fastest first. While gc was disabled for the duration of each method tested, there's still a lot of variance on this box. A few percent is just noise.
Bottom line: `buff` is in a class by itself under CPython, but `nest` and `dg1` rule under PyPy.
CPython:

```
sort 0.252 0.283 0.283
mine 0.501 0.533 0.536
heap 0.584 0.620 0.620
nest 0.349 0.379 0.379
buff 0.287 0.311 0.320
dg1  0.339 0.387 0.400
dg2  0.359 0.388 0.389
```

PyPy:

```
sort 0.025 0.025 0.025
mine 0.081 0.081 0.086
heap 0.156 0.163 0.166
nest 0.066 0.066 0.076
buff 0.131 0.134 0.146
dg1  0.067 0.068 0.075
dg2  0.158 0.158 0.172
```