Is `heapq.merge` stable?

Given that the docs say the recipe in heapq — Heap queue algorithm — Python 3.13.3 documentation is unstable, I would expect that heapq.merge is also unstable?

Would it be good to include this in the docs?

Reading the source code, it appears to explicitly contain code to make it stable.

The docs don’t say so, but two plausible meanings for “stable” are satisfied by merge():

  • If it1 is an input iterator yielding some duplicate values, merge() will produce them in the same order as it1 (intra-iterator stability).

  • If it2 is an input iterator that appears after it1 in the argument list, and produces some value also produced by it1, then all the duplicate values from it1 will be produced before any from it2 (inter-iterator stability).
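
Both senses can be checked empirically by merging tuples whose second field records their origin (this reflects observed CPython behavior, not a documented guarantee; the tag names are made up for the demo):

```python
import heapq

# Elements compare equal on the key but carry a tag recording their origin.
it1 = [(1, 'it1-a'), (1, 'it1-b'), (2, 'it1-c')]
it2 = [(1, 'it2-a'), (2, 'it2-b')]

merged = list(heapq.merge(it1, it2, key=lambda t: t[0]))
# Intra-iterator: 'it1-a' comes before 'it1-b'.
# Inter-iterator: all key-1 values from it1 come before the one from it2.
print(merged)
# [(1, 'it1-a'), (1, 'it1-b'), (1, 'it2-a'), (2, 'it1-c'), (2, 'it2-b')]
```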

Note a subtlety: these are true even if reverse=True is passed. Few people realize it, but the same is true of passing reverse=True to list.sort() or sorted(): the original order of equal elements is preserved regardless of whether reverse is in effect. reverse only affects the order of non-equal elements.

Hmm, this is interesting, thx. Sounds like a “good tricky interview question”! But maybe this nuance should be properly stated in the documentation?

Have at it, if you like :smile: Nobody ever complains about it that I know of, so I’m not going to mess with what appears to be working.

I think most people never really thought about it, and semi-consciously assume xs.sort(reverse=True) is the same as

xs.sort()
xs.reverse()

But it’s not. It is the same as:

xs.reverse()
xs.sort()
xs.reverse()

which is how it’s actually implemented. That’s the only connection between the keyword argument “reverse=” and the method “.reverse()”. They have different purposes.
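
A quick sanity check of this equivalence, using hypothetical tagged data that compares equal on the first field only:

```python
key = lambda t: t[0]
data = [(1, 'a'), (1, 'b')]  # equal under the key

# reverse=True preserves the original order of equal elements.
xs = sorted(data, key=key, reverse=True)   # [(1, 'a'), (1, 'b')]

# sort-then-reverse flips the order of the equal pair - not the same thing.
ys = sorted(data, key=key)
ys.reverse()                               # [(1, 'b'), (1, 'a')]

# reverse, sort, reverse matches reverse=True exactly.
zs = list(reversed(data))
zs.sort(key=key)
zs.reverse()                               # [(1, 'a'), (1, 'b')]

print(xs == zs, xs == ys)  # True False
```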

The docs kind of say this, via “If [reverse=] set to True, then the list elements are sorted as if each comparison were reversed.”

I’d prefer plainer English, like “by default, elements are ordered from smallest to largest, but if reverse=True from largest to smallest; regardless, the original order of equal elements is preserved.”

All that’s very intentional for lists. I don’t know whether it was intentional for heapq.merge(). That stores tuples in the heap, one per input iterator. Each tuple holds the next unresolved value from its iterator, and the iterator’s bound __next__ method.

But that alone could blow up if different iterators had the same next value. Then tuple comparison could go on to try to compare the bound methods, and those don’t implement ordering comparisons.

So, between those two, the index of the iterator in the original argument list is also stored. That ensures that all the tuples in the heap at any given time are totally ordered by lexicographic tuple comparison, never going on to compare the 3rd components.

“Inter-iterator stability” may just be a happy consequence of what was done to prevent the implementation from blowing up; perhaps Raymond didn’t want the implementation to be constrained by it, so didn’t mention it in the docs.
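
The tuple trick described above can be sketched like so (a simplified illustration of the idea, not CPython’s actual heapq.merge code):

```python
import heapq

def merge_sketch(*iterables):
    # Each heap entry is [value, order, next]; `order` is the iterator's
    # position in the argument list. It breaks value ties, so the bound
    # __next__ methods in the 3rd slot are never compared.
    heap = []
    for order, it in enumerate(iterables):
        it = iter(it)
        try:
            nxt = it.__next__
            heap.append([nxt(), order, nxt])
        except StopIteration:
            pass
    heapq.heapify(heap)
    while heap:
        entry = heap[0]
        yield entry[0]
        try:
            entry[0] = entry[2]()           # advance the winning iterator
            heapq.heapreplace(heap, entry)  # restore the heap invariant
        except StopIteration:
            heapq.heappop(heap)             # that iterator is exhausted

print(list(merge_sketch([1, 3, 5], [2, 3, 4])))  # [1, 2, 3, 3, 4, 5]
```

On a value tie, the smaller `order` wins, which is exactly where the inter-iterator stability falls out.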

OTOH, PyPy has a more efficient version of heapq.merge(), written by @sweeneyde, which doesn’t use heaps, and doesn’t need that trick. I recall at the time he was writing it that we agreed “inter-iterator stability” was nice to have, and, as I recall (though I may be wrong), it fell out “for free” from the rules used to promote values in his tournament tree (“in case of a tie, the left child wins”).

I am just thinking whether this is possible with reasonable effort. 2 iterators is easy. Many - not so much. Let me know if you can remember how to find it. Looked at PyPy - pypy/lib-python/2.7/heapq.py at main · pypy/pypy · GitHub - seems to be using the same one.

Never mind, that is the 2.7 version on GitHub… Found it in my local installation.

How about a little competition for fastest merger of 2 iterators? :slight_smile: I wrote a few a while ago…

Rules?
Test cases?
Buffered ok? :slight_smile:

In a tournament tree, when a value is delivered, it just has to go to the leaf holding the iterator it came from, get the next value, and move it up the tree, promoting the (possibly new) winner at each level. log2(n) operations total. And they’re plain element compares, not costlier tuple compares.

Pushing onto a heap needs at most log2(n) operations, but popping requires at least log2(n).

Trying to represent a tournament tree in an array is a nightmare, though, especially in a context (like merging) where the number of levels can change over time (as iterators become exhausted). So the PyPy code uses classes with tree nodes stitched together with parent and child pointers. PyPy does a good job at optimizing such code, but CPython not so much.
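
To make the compare-counting concrete, here is a hypothetical `Counted` wrapper (the class name and setup are made up for the demo) that tallies how many `<` calls a heap-based merge of 8 sorted runs actually makes per delivered element:

```python
import heapq

class Counted:
    # Hypothetical wrapper that counts order comparisons.
    compares = 0
    __slots__ = ('v',)
    def __init__(self, v):
        self.v = v
    def __lt__(self, other):
        Counted.compares += 1
        return self.v < other.v

# 8 sorted runs of 1000 elements each, with globally unique values.
runs = [[Counted(i) for i in range(r, 8000, 8)] for r in range(8)]
merged = list(heapq.merge(*runs))

assert [c.v for c in merged] == list(range(8000))
print(f"{Counted.compares / len(merged):.2f} compares per element")
```

With 8 iterators the heap has 3 levels, so the count per element lands between log2(8) and roughly twice that; a tournament tree gets by with about log2(8) plain compares.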

In CPython, both implementations run at approximately the same speed.

Compiled it with Cython and then it runs 2x slower.

I think it would be possible to optimize it properly and put it in a Cython-friendly form.
If this ever becomes a bottleneck, I know where to find it. :slight_smile:

I’d say two iterators of a million ints (or less if that takes too long for good benchmarking). But solutions are not allowed to assume/use the finite length. Buffered means what? Keeping not just one element in memory but like 100 or 1000? Then yes, because I did that, too.

import random
random.seed(?)
M = 10**6
a = iter(sorted(random.randint(0, M) for _ in range(M)))
b = iter(sorted(random.randint(0, M) for _ in range(M)))

But where is the limit? Can just convert to lists and do in-place merge.

That has scant meaning without knowing exactly what was tested. The more expensive comparisons are, the more it pays to do fewer compares. Comparing, e.g., floats is about the cheapest it gets in CPython. Invoking user-defined comparisons can be arbitrarily costlier.

It’s not that PyPy classes are faster than CPython array indexing :wink:. It’s that a tournament tree does fewer compares than a heap-based merge.

The advantage also grows more apparent the more iterators are being merged. If there are, e.g., only 2, the heap only has 2 levels. Popping does no compare then, and pushing its replacement only 1.

No, can’t buffer unlimited. A fixed, reasonably small limit. I just checked, I used 1000. (My go-to size for chunked solutions.)

The inputs look alright. Mine used sorted(random.choices(range(n), k=n)) with n = 100_000.

I haven’t spent much time on this, but figured it would be hard to beat just pasting the two lists together and letting sorted() have at it. Which doesn’t respect the “buffer size” constraint, so is very likely an unreachable lower bound.

Anyway, the simplest non-brain-dead merge I could think of did like so, under CPython (python.org released 3.13, Windows 10):

sort 0.254
mine 0.531
heap 0.684

and under PyPy:

sort 0.035
mine 0.101
heap 0.191

No buffering of any kind in “mine”. It’s apparently hard not to beat heapq.merge(). And the slowest under PyPy is quicker than the fastest under CPython. I believe that has a lot to do with the fact that integers of the size @dg-pb picked are, in PyPy lists, “unboxed” (not objects - native machine bits).

Code
from time import perf_counter as now
import random
import heapq

random.seed(31415926535)
M = 10**6
a = sorted(random.randint(0, M) for _ in range(M))
b = sorted(random.randint(0, M) for _ in range(M))

class Inf:
    def __lt__(x, y):
        return False
    def __gt__(x, y):
        return True

INF = Inf()

def wrap(xs):
    yield from xs
    while True:
        yield INF

def merge(a, b):
    nexta = a.__next__
    nextb = b.__next__

    aval = nexta()
    bval = nextb()
    while True:
        if bval < aval:
            t = bval
            bval = nextb()
        else:
            t = aval
            aval = nexta()
        if t is INF:
            break
        yield t

def format_time():
    return f"{finish - start:.3f}"

start = now()
exp = sorted(a + b)
finish = now()
print("sort", format_time())

start = now()
res = list(merge(wrap(a), wrap(b)))
finish = now()
print("mine", format_time())
assert res == exp

start = now()
exp = list(heapq.merge(a, b))
finish = now()
print("heap", format_time())
assert res == exp

I forgot a rule: should (must?) be stable, in both ways Tim mentioned.

Ok, here are my contenders.

merge_nested_states(xs, ys) is always in one of two states: Either it holds the current x and yields smaller ys through, or it holds the current y and yields smaller (or equal) xs through. Whenever one y (or x) isn’t smaller, go to the other state. Since Python doesn’t have a “goto”, I nested the two states so I run into and drop out of the second state.

merge_buffered(xs, ys) has a buffer of the next 1000 values for each input iterator. Always check which of the two buffers ends with the smaller value, merge it with the appropriate prefix of the other buffer (using list.sort), yield that merged chunk, and refill the buffers.

Benchmarks later…

Code
def merge_nested_states(xs, ys):
    xs = iter(xs)
    for x in xs:
        break
    else:
        yield from ys
        return
    ys = iter(ys)

    # Holding x
    for y in ys:
        if y < x:
            yield y
            continue
        yield x

        # Holding y
        for x in xs:
            if y < x:
                yield y
                break
            yield x

        # Ran out of xs
        else:
            yield y
            yield from ys
            return

    # Ran out of ys
    yield x
    yield from xs


from itertools import islice, chain
from bisect import bisect_left, bisect_right

def merge_buffered(xs, ys):
  xit = iter(xs)
  yit = iter(ys)
  k = 1000
  def chunks():
    xs = list(islice(xit, k))
    ys = list(islice(yit, k))
    while xs and ys:
      if ys[-1] < xs[-1]:
        i = bisect_right(xs, ys[-1])
        ys[:0] = xs[:i]
        del xs[:i]
        ys.sort()
        yield ys
        xs += islice(xit, i)
        ys[:] = islice(yit, k)
      else:
        i = bisect_left(ys, xs[-1])
        xs += ys[:i]
        del ys[:i]
        xs.sort()
        yield xs
        xs[:] = islice(xit, k)
        ys += islice(yit, i)
    yield xs or ys
    yield xit
    yield yit
  return chain.from_iterable(chunks())


funcs = [
    merge_nested_states,
    merge_buffered,
]

import random
import operator

# Testing, including stabilities
for merge in funcs * 3:
    for k in 10**3, 10**4, 10**5:
        xs = [[x] for x in sorted(random.choices(range(10**4), k=k))]
        ys = [[y] for y in sorted(random.choices(range(10**4), k=k))]
        expect = sorted(xs + ys)
        result = list(merge(iter(xs), iter(ys)))
        if result != expect:
            print(merge.__name__, 'failed miserably', len(expect), len(result))
        elif not all(map(operator.is_, result, expect)):
            print(merge.__name__, "isn't stable")

print('done')

Attempt This Online!

My 2 versions:

Code
from itertools import batched, islice, chain, repeat
from bisect import bisect_left, bisect_right

def merge2(it1, it2):
    it1 = iter(it1)
    try:
        v1 = next(it1)
    except StopIteration:
        yield from it2
        return
    it2 = iter(it2)
    try:
        v2 = next(it2)
    except StopIteration:
        yield v1
        yield from it1
        return
    while 1:
        if v1 <= v2:
            yield v1
            try:
                v1 = next(it1)
            except StopIteration:
                yield v2
                yield from it2
                return
        else:
            yield v2
            try:
                v2 = next(it2)
            except StopIteration:
                yield v1
                yield from it1
                return


def merge2batched(a, b, n=1000):
    b = iter(b)
    bb = list(islice(b, n))
    bbclear = bb.clear
    bbextend = bb.extend
    bbsort = bb.sort
    if not bb:
        yield from a
        return
    a = iter(a)
    b1 = bb[-1]
    aa = []
    aaclear = aa.clear
    aaextend = aa.extend
    aasort = aa.sort
    for _ in map(aaextend, chain(batched(a, n), repeat(()))):
        if not aa:
            yield from bb
            yield from b
            return
        a1 = aa[-1]     # [0, 1, 1, 1]
        if a1 <= b1:
            i = bisect_right(bb, a1)
            if not i:
                yield from aa
                aaclear()
            else:
                aaextend(bb[:i])
                aasort()
                del bb[:i]
                bbextend(islice(b, i))
                yield from aa
                if not bb:
                    yield from a
                    return
                aaclear()
                b1 = bb[-1]
        else:
            while aa:
                i = bisect_right(aa, b1)
                if not i:
                    yield from bb
                    bb[:] = islice(b, n)
                    if not bb:
                        yield from aa
                        yield from a
                        return
                    b1 = bb[-1]
                else:
                    yield from sorted(aa[:i] + bb)
                    del aa[:i]
                    bb[:] = islice(b, n)
                    if not bb:
                        yield from aa
                        yield from a
                        return
                    b1 = bb[-1]
                if b1 > a1:
                    break

And did benchmark:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃  macOS-11.7.10-x86_64-i386-64bit | CPython: 3.12.4   ┃
┃      5 repeats, 1 times | 2025-05-28T15:05:49        ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃                         Units: ms                  0 ┃
┃                                   ┏━━━━━━━━━━━━━━━━━━┫
┃                          merge_lb ┃              207 ┃
┃                             merge ┃              631 ┃
┃                          merge_tp ┃              547 ┃
┃               merge_nested_states ┃              353 ┃
┃                            merge2 ┃              355 ┃
┃                    merge_buffered ┃              292 ┃
┃                     merge2batched ┃              372 ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━┛

My merge2 is just a straightforward one - nothing else came to mind and I had this one written.

Biggest improvement to my buffered version would be to do the chunks function as @Stefan2 did to avoid yield from; this would bring its speed to ~320.

Here are timings on the same data, on the same box, for all methods so far.

  • sort: paste together and call sorted.
  • mine: a small variation of the code I posted, to cut the number of times “is INF” is tested about in half.
  • heap: heapq.merge()
  • nest: @Stefan2’s merge_nested_states()
  • buff: @Stefan2’s merge_buffered()
  • dg1: @dg-pb’s merge2()
  • dg2: @dg-pb’s merge2batched(), but supplied my own implementation of itertools.batched under PyPy (whose itertools doesn’t yet have it).

There are 3 timings per line, sorted fastest first. While gc was disabled for the duration of each method tested, there’s still a lot of variance on this box. A few percent is just noise.

Bottom line: buff is in a class by itself under CPython, but nest and dg1 rule under PyPy.

CPython:

sort 0.252 0.283 0.283
mine 0.501 0.533 0.536
heap 0.584 0.620 0.620
nest 0.349 0.379 0.379
buff 0.287 0.311 0.320
dg1  0.339 0.387 0.400
dg2  0.359 0.388 0.389

PyPy:

sort 0.025 0.025 0.025
mine 0.081 0.081 0.086
heap 0.156 0.163 0.166
nest 0.066 0.066 0.076
buff 0.131 0.134 0.146
dg1  0.067 0.068 0.075
dg2  0.158 0.158 0.172