Create new package similar to `heapq` but be able to pass custom comparator through a constructor

Ibrahim-Haroon · November 23, 2024, 7:56pm

The current heap container is okay in Python. It works as intended but you cannot pass a custom comparator and working with it feels “C like”, since you need to pass your list object each time. I’d be willing to make a new package so that we can have a similar heap/priority queue to java and c++.

I know you can override the less-than-operator, but that just doesn’t seem right and is not intuitive.

jamestwebber · November 23, 2024, 8:09pm

You are totally free to make such a package and put it on PyPI. I wouldn’t expect it to be added to the stdlib though.

Agent-Hellboy · November 24, 2024, 9:23am

I’ve solved many heap problems on LeetCode and adapted to Python’s heapq, but I can see where you’re coming from. The lack of a custom comparator can make heapq less intuitive and harder to use for certain use cases. Here’s an example to illustrate the issue:

Example: Custom Comparator Workaround in `heapq`

Let’s say you want to use a heap to process tasks based on their priority, where each task is represented as a tuple (priority, task_name). If you want a max-heap based on priority, you’d typically need to negate the priority:

import heapq

tasks = [(1, "task1"), (3, "task3"), (2, "task2")]
max_heap = []
for priority, task in tasks:
    heapq.heappush(max_heap, (-priority, task))

# Get the task with the highest priority
highest_priority_task = heapq.heappop(max_heap)
print(highest_priority_task)  # Outputs (-3, 'task3')

This approach works, but:

High cognitive load
Leaking information about the comparator, as the user has to manage it manually, like making a max-heap by negating values. This is also against encapsulation.
It’s very inflexible. Why?
If you need a different custom order (e.g., sort by task length or a combination of factors), you’re forced to restructure the data or override comparison operators, neither of which feels clean.

I don’t know why the first commenter simply discarded the design (I wouldn’t expect it to be added to the stdlib though.). At least counter it. The heapq library is against the design principle of high-level, expressive design.
I have solved a lot of Dijkstra algorithm questions using heapq, and I needed to adapt to the design. It is not at all intuitive.

ericvsmith · November 24, 2024, 11:50am

In my opinion, at this point in Python’s lifecycle it’s unlikely we’d add a data structure to the stdlib unless we had an internal use case for it. And even then we might not expose it (see HAMT).

Plus, there are plenty of heap implementations on PyPI.

Agent-Hellboy · November 24, 2024, 1:22pm

Thanks, that makes sense. However, what about the option to patch in an optional comparator within the design?

I understand there are plenty of libraries available online, but my concern lies in pair programming interviews where installing third-party libraries isn’t an option. When you try to adapt heapq to solve a Dijkstra’s algorithm problem, it often feels unintuitive to the interviewer. This is especially true when interviewers, who usually work in compiled languages, are accustomed to dedicated and intuitive priority queue implementations.

I respect your perspective and decisions, but this addition could be a significant help for students.

chepner · November 25, 2024, 2:16pm

Why does it not seem right? Using Prince Roshan’s example to demonstrate, the problem is trying to reuse existing data types instead of defining one that models what you want. Don’t try to use tuples that don’t compare like you want; do use a custom Task class that does compare properly.

from dataclasses import dataclass
import heapq

tasks = [(1, "task1"), (3, "task3"), (2, "task2")]

@dataclass
class Task:
    priority: int
    name: str

    def __lt__(self, other):
        return self.priority > other.priority

max_heap = []
for t in tasks:
    heapq.heappush(max_heap, Task(*t))

blhsing · November 26, 2024, 3:20am

But the OP isn’t proposing adding a new data structure to the stdlib. What’s proposed here is adding a custom comparator option, likely in form of a keyword argument key like that in sorted and heapq.merge, to heapq.heapify, heapq.heappush and heapq.heappop.

And we already have an internal use for it in the implementation of asyncio.queues.PriorityQueue, which currently has the limitation of data needing to be sortable due the lack of a key argument (so we can’t use a dict as data for example):

github.com

python/cpython/blob/4fd9eb2aca21489ea0841155cdb81cb30dda73d3/Lib/asyncio/queues.py#L291


      
          
          class PriorityQueue(Queue):
              """A subclass of Queue; retrieves entries in priority order (lowest first).
          
              Entries are typically tuples of the form: (priority number, data).
              """
          
              def _init(self, maxsize):
                  self._queue = []
          
              def _put(self, item, heappush=heapq.heappush):
                  heappush(self._queue, item)
          
              def _get(self, heappop=heapq.heappop):
                  return heappop(self._queue)
          
          
          class LifoQueue(Queue):
              """A subclass of Queue that retrieves most recently added entries first."""
          
              def _init(self, maxsize):

The documentation says the priority queue “retrieves entries in priority order” when in fact it retrieves entries in the order of both priority and data, which to me is an issue worth fixing.

dg-pb · November 26, 2024, 5:04am

It would be in line with API of bisect.

blhsing · November 26, 2024, 6:29am

Yes, heapq and bisect are very alike in that they are both collections of functions that operate on list and rely on the order of list items. So if bisect functions all get to have a key argument, it would bring the heapq API consistent with the bisect API for all heapq functions to also get a key argument.

Alex-Wasowicz · November 26, 2024, 12:02pm

It is wierd, that only heapq.marge has key and reverse arguments. Finding the reason why, I did find mail proposing this addition, but the discussion is only about the details. Why would someone marge heaps with std. lib, but heapify on their own?

ericvsmith · November 26, 2024, 1:09pm

Well, the subject is about adding a “new package”, so I assumed a new data structure.

Stefan2 · November 26, 2024, 1:14pm

Not just “new package” but “pass custom comparator through a constructor” and “heap container”. Sounds like a new data structure to me. Constructed like heap = Heap(key=...) (or I guess Heap(cmp=...)).

Stefan2 · November 26, 2024, 1:22pm

Those aren’t constructors. And adding that option to them wouldn’t be a new package.

Stefan2 · November 26, 2024, 1:37pm

It’s not for merging heaps. It’s for merging sorted iterables.

Alex-Wasowicz · November 26, 2024, 2:10pm

Ok, that makes more sense

blhsing · November 27, 2024, 6:16am

Ah, I had my blinders on and missed the very suggestion in the OP’s proposal about creating a new package.

Still, despite the XY problem, I think we can agree that the actual problem the OP is trying to solve is clear–that the existing heapq functions in the stdlib are missing an option to customize how the items are compared. A key argument for the heapq functions, rather than a new package/module/class, should be all it takes to solve the problem.

Agent-Hellboy · November 27, 2024, 8:18am

Hi @ericvsmith,

Suppose you’re solving a Dijkstra’s algorithm problem and need to use a priority queue. Currently, in Python, you might implement it like this:

import heapq

reference_point = (0, 0)

def squared_distance(point):
    x0, y0 = reference_point
    return (point[0] - x0) ** 2 + (point[1] - y0) ** 2


heap = []

points = [(1, 2), (3, 4), (-1, -1), (0, 0), (2, -2)]
for point in points:
    distance = squared_distance(point)
    heapq.heappush(heap, (distance, point))

Suppose, I’ve made changes to the Python implementation of heapq module to allow using a custom comparator function. Now, you can write:

import heapq

reference_point = (0, 0)

def point_comparator(p1, p2):
    x0, y0 = reference_point
    d1 = (p1[0] - x0) ** 2 + (p1[1] - y0) ** 2
    d2 = (p2[0] - x0) ** 2 + (p2[1] - y0) ** 2
    return (d1 > d2) - (d1 < d2)

heap = []
points = [(1, 2), (3, 4), (-1, -1), (0, 0), (2, -2)]
for point in points:
    heapq.heappush(heap, point, comparator=point_comparator)

The latter is very intuitive, and most language containers implementing priority_queue provide a way to pass a comparator function.

I understand that there are many open-source packages offering various implementations in different styles—functional, object-oriented, or other paradigms—but in coding competitions and interview settings, people are often not allowed to use these packages.

yes @Alex-Wasowicz ,I also didn’t find any request. I think we can agree that most competitive programmers use C++, and those who use Python have adapted their approaches accordingly, often starting from high school.

In response to @chepner , yes, the first thing that came to mind was to create an object and define the lt and other comparison methods. However, why should we use an object that takes extra memory when we can use a tuple?

Should I create a ticket for this? I was planning to post a new question about adding an optional comparator to heapq.

Agent-Hellboy · November 27, 2024, 8:24am

Exactly, I took the real problem he is trying to solve, ignored his suggestion of adding a new module, and proposed a solution. What do you think—should I create a ticket and work on it?
A lot of people have supported it.

I read somewhere that at least one core developer needs to support the idea, so I’m waiting for Eric V. Smith to confirm.

dg-pb · November 27, 2024, 8:32am

Unfortunately, Dijkstra’s algorithm will not benefit from this much, and it will even be a bit slower.

b = (1, 'a')
a = (1, 'b')
%timeit a < b    # 55.6 ns
# vs
IG0 = opr.itemgetter(0)
%timeit IG0(a) < IG0(b)    # 58.6 ns

Proposed functionality is useful for more complex situations nevertheless.

P.S. Standard argument name is key - “key function”.

dg-pb · November 27, 2024, 8:39am

On the other hand, if implemented correctly, it should be a bit faster as it would only call it once for new item, while use it many times.

Extra mile for maximum performance would be to hold parallel structure of extracted values so that key function is called only once for each item.

Would you be willing to show your implementation? I am quite curious.

Create new package similar to `heapq` but be able to pass custom comparator through a constructor

Example: Custom Comparator Workaround in heapq

Example: Custom Comparator Workaround in `heapq`