Create new package similar to `heapq` but be able to pass custom comparator through a constructor

Agent-Hellboy · November 27, 2024, 8:42am

Actually, I took it as an example, and I have no idea if latency (I guess it’s bytecode latency, maybe the extra bytecode it needs to deal with) matters here or not. Anyway, competitive programmers are going to use C++. I am more concerned about the interview setting. Also, in general, it should take an optional argument named key just to complement other Python APIs, like sorted and all.

One more thing—if this matters here, then some built-in APIs that accept keys must also be slow.

Sure thing.

Agent-Hellboy · November 27, 2024, 8:48am

sure thing , I will push it in my fork and let you review. But i made changes in .py

Sure, I’ll give it some thought. By the way, I checked the C implementation, and they using various caching approach. I’ll try to adopt their approach.

chepner · November 27, 2024, 11:58am

The memory used by a Heap object will be negligible, considering how few such objects you’ll create and relative to the memory used by the contents of the heap itself.

Agent-Hellboy · November 27, 2024, 12:07pm

Understood, but why should we force people to bear the extra cognitive load of maintaining an additional object with all those dunder methods for something like coordinates?
why shouldn’t we introduce an optional parameter?
I am not suggesting adding an extra data structure; I’m just proposing to patch the existing one with an optional comparator, which I believe all priority queue implementations should have.

chepner · November 27, 2024, 12:32pm

What cognitive load? Nobody would need to know or care about dunder methods.

class Heap:
    def __init__(comparison, data=[]):
        self.comparison = comparison
        # use data, if present

    def pop(self):
        ...
        # use self.comparison to reheapify as necessary

    def push(self, v):
        # use self.comparison to reheapify after insertion as necessary

h = Heap(some_comparator, [...])
v = h.pop()
h.push(h, v1)
h.push(h, v2)

An optional comparator that you are requiring the user to provide on each and every heap operation. A design where you can override what should be an invariant for all heap operations on a per-operation basis seems objectively bad.

h = [...]
heapq.heapify(h, some_comparator)
v = heapq.heappop(h, some_comparator)
heapq.heappush(h, v1, some_comparator)
heapq.heappush(h, v2)  # oops, bug, you forgot to override the default comparison

Unless you are suggesting that the heap be a tuple consisting of a list and a comparator, rather than just the list.

h = ([...], some_comparator)
heapq.heapify(h)
v = heapq.heappop(h)
heapq.heappush(h, v1)
heapq.heappush(h, v2)

This loses some of the documented benefits of using a plain list in the non-OOP interface:

These two make it possible to view the heap as a regular Python list without surprises: heap[0] is the smallest item, and heap.sort() maintains the heap invariant!

But one can at least debate the pros and cons of this and a class-based design.

mikeshardmind · November 27, 2024, 1:32pm

The first of your implementations actually reads more naturally, but using a key function here seems worse to me for reasons other than legibility too. Even if we make it so that the heap comparison function is only allowed in constructing the heap, and stored (avoiding issues with the heap invariant mentioned above) This would be recalculating and comparing your heuristic for multiple elements of the heap on each heap modification, rather than only comparing against the already calculated result as needed.

you can also write it as:

import heapq
import math
heap = []
origin = (0, 0)

points = [(1, 2), (3, 4), (-1, -1), (0, 0), (2, -2)]
for point in points:
    distance = math.hypot(origin, point)
    heapq.heappush(heap, (distance, point))

Agent-Hellboy · November 27, 2024, 1:34pm

Clint Hepner:

What cognitive load? Nobody would need to know or care about dunder methods.

class Heap:
    def __init__(comparison, data=[]):
        self.comparison = comparison
        # use data, if present

    def pop(self):
        ...
        # use self.comparison to reheapify as necessary

    def push(self, v):
        # use self.comparison to reheapify after insertion as necessary

h = Heap(some_comparator, [...])
v = h.pop()
h.push(h, v1)
h.push(h, v2)

ohh, I thought you meant to create an object as your earlier example was different where you were passing objects to heapq APIs

@dataclass
class Task:
    priority: int
    name: str

    def __lt__(self, other):
        return self.priority > other.priority

max_heap = []
for t in tasks:
    heapq.heappush(max_heap, Task(*t))

Agent-Hellboy · November 27, 2024, 1:36pm

Yes, But shouldn’t we leave it to the user as it’s an optional argument if the user is using it, they are aware of heap invariant.

Agent-Hellboy · November 27, 2024, 1:42pm

Thanks, this could be concerning. I will look into it more closely, but I disagree with your argument about not following the heap variant being a case to worry about. Since it’s an optional argument, people should use it with caution and understand what it will do to their code.

It’s not like a built-in language construct where you need to worry about invalidating the flow of code. It’s an optional feature. I am, however, very inexperienced to argue further here.

MegaIng · November 27, 2024, 2:15pm

In python 2, e.g. sorted used to take a compare function. This was changed as part of python 3 to instead take a key function for better performance and ease of use.

I think we should be consistent with this and only provide a key function if anything is added here. If you can really only express a problem as a compare function (which is not the case, almost via definition, for dijkstra) then you can create a small fake key object with an overwritten __lt__.

chepner · November 27, 2024, 2:59pm

Defining Task is just what I would do for the existing API. For a new API, it’s important that the comparator be tied to the heap itself, not individual heap operations.

No, we shouldn’t. The comparator to use is invariant for all operations on a given heap. There’s no reason to use different comparators for different operations, so it shouldn’t even be possible.

dg-pb · November 27, 2024, 9:48pm

I think much simpler solution would be to add utility class:

class KeyCompared:
    def __init__(self, key, value):
        ...
    def __le__(self, other):
        return self.key < other.key

Pure Python implementation of this is a bit slow, but other than that it covers all that is needed.

Why other solutions are less optimal:
a) Specialised class Heap is a fair amount of work with fairly limited benefit
b) key argument is very inefficient as there will be many more key.__call__ than theoretically needed.

Personally, I use tuples whenever possible and use Python implementation of the above when it is not. If above was implemented in C, I would just use it for all cases.

dg-pb · November 27, 2024, 10:30pm

Just a bit more on challenges implementing Heap / SortedList and similar objects properly.

Have a look at implementation of sortedcontainers.SortedList.

Some complexities would not apply to Heap equivalent (SortedList is built on bisect), however some would.

E.g. If there is implementation of such in standard library it should be properly performant and for that it would need to keep 2 lists - one with keys, another with values.

So my POV is that sensible order to improve things would be:

1-2. `KeyCompared` utility class
1-2. `key` argument for `heapq` module (for consistency with `bisect`)
3. `bisect.SortedList` class.

If there was a class for priority Q, I think it would make much more sense to have SortedList as it is much more functional.

There is a set of cases where heapq is enough and it will always be - e.g. Dijkstra’s algorithm.

However, there are many cases when there is always a risk that there will arise a need to know not only smallest item, but N smallest items or even be able to look up item in i’th place. In other words, to have continuously sorted list.

Although lst.sort() maintains heap invariance, but calling lst.sort() after each heapq.pushitem would neither be preferable from performance nor practical perspectives.

Although there is an argument to just use sortedcontainers.SortedList, but having such in standard library would be beneficial as it would both improve performance and also something implemented in standard library would bring it to the next level of simplicity, robustness, etc.

heapq.Heap class is just not worth it, IMO.