Provide optional key= and reverse= parameters on the heapq functions

crj · July 16, 2024, 10:24pm

The sorted() built-in and list.sort() both accept an optional key= parameter to specify the key for the sort and reverse= to reverse the result. But the functions in the heapq module do not. This feels like a curious omission.

Yes, I’ve seen this previous idea. But it feels to me like the scale of the problem isn’t properly appreciated.

As a case in point, look at the second example in the documentation. Even in this most trivial case, populating the heap with (key,value) tuples only works if either:

Keys are guaranteed to be distinct, or
Values are comparable

Otherwise, sooner or later you’re going to get a comparison error.

Currently, the only truly effective workaround I can see for this limitation is to incur the CPU and memory overhead of populating the heap with wrapper objects of a custom type which implements the required ordering. Well, aside from doing without the heapq module entirely, of course,and creating one’s own implementation.

Stefan2 · July 21, 2024, 10:56am

Are you saying the triple solution isn’t truly effective?

crj · July 26, 2024, 5:29pm

Sorry, which “triple” solution? I’ve not seen one articulated anywhere?

I’m inferring that you’re suggesting putting (key,nonce,item) triples in the heap, where the nonce exists solely to prevent tuples being compared by item? That remains problematic for a few reasons:

It’s cumbersome and ugly
Currently, to achieve a reversed order it is necessary that keys be negatable, though I see that exposed max-heap functionality is coming in 3.13 .
It relies on heap items never being compared with themselves. While in practice (and given the nature of the algorithm) I don’t think they ever can be, I don’t believe there’s a formal guarantee.
The choice of nonce is problematic. id(item) seems a natural option, except that this won’t work if the same item is inserted multiple times into the heap, which is sometimes a defensible action rather than a logic error.

Stefan2 · July 26, 2024, 8:41pm

The one discussed right under the example that you talked about.

crj · July 26, 2024, 10:20pm

Ooh, sorry - I see it now. It looks like I correctly intuited it, aside from using an indefinitely-incrementing counter as an alternative to id(item).

I suppose that being an officially recommended solution tacitly guarantees that nothing in heapq ever compares an element with itself. And I accept that a counter is an adequate solution to the choice of nonce.

The key objection though - that the solution is cumbersome and ugly - most definitely stands. It seems that something like this has to be done in almost every conceivable use case for heapq, which strongly suggests it should be supported natively?

Stefan2 · July 26, 2024, 11:01pm

Yeah I think so, too, and it even says “since no two entry counts are the same, the tuple comparison will never attempt to directly compare two tasks”.

About cumbersome/ugly: Would the key-parameter usage really be less cumbersome/ugly? Would be good to see a comparison of code for both ways.

crj · July 27, 2024, 3:32am

Hmm. I spiked some proof of concept code and rapidly realised something: what I really feel is missing is a factory for priority queue types.

Either or both of these spellings seem sane:

my_qtype = heapq.heaptype('my_qtype', key=attrgetter('start'))
class MyQType(heapq.HeapQ, key=attrgetter('start')):
    ... # Other stuff can go here

My ideal parameterisation for the type is:

key=None - Function from item to keying value
reverse=False - min-heap if False, max-heap if True
raw_factory=list - How the constructor should make the underlying container if none is specified

The resulting type would have approximately these methods:

    def __init__(self, raw=None, heapify=True):
        """Create a heap-queue.

        If raw is specified, it is used (by sharing, not by copy) as the
        underlying container. Otherwise, raw_factory() is called to
        create it. If heapify is True (the default) then any provided
        container is heapified.
        """
        ...
    @property
    def raw(self):
        "The underlying container"
        ...
    @property
    def head(self):
        return self.raw[0]
    def push(self, item):
        ...
    def pop(self):
        ...
    ...

Now, the interesting thing is that I can already implement this to my satisfaction; I probably will, tomorrow! However, as things stand, the implementation would have to wrap every item, which feels horribly inefficient. It still feels preferable for heapq to provide native support, either by providing an interface like the one above, or by providing the necessary underpinnings.

crj · July 27, 2024, 5:50am

Addendum: I’ve just noticed that, unlike the python implementation, the C implementation of heapq requires a heap to be an actual list, not merely “list-like”. So my dream of a raw_factory parameter should be disregarded.

Stefan2 · July 27, 2024, 10:21am

Yeah, I suspected you wouldn’t actually like to have to pass the key in every call of every heap function :-). A type factory seems overkill, SortedList for example just takes a key at the list construction.

A previous thread:

github.com/python/cpython

Adapt heapq push/pop/replace to allow passing a comparator.

opened 08:18AM - 13 Mar 14 UTC

closed 08:06AM - 14 Mar 14 UTC

10cb9bc7-4c70-4917-9526-b6bbd8cb3d4c

type-feature stdlib

BPO | [20905](https://bugs.python.org/issue20905) --- | :--- Nosy | @rhettinger,… @mdickinson, @gareth-rees Files | <li>[heapq_com.patch](https://bugs.python.org/file34388/heapq_com.patch "Uploaded as text/plain at 2014-03-13.08:18:15 by Prashant.Sharma")</li> <sup>*Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.*</sup> <details><summary>Show more details</summary><p> GitHub fields: ```python assignee = 'https://github.com/rhettinger' closed_at = <Date 2014-03-14.08:06:43.539> created_at = <Date 2014-03-13.08:18:17.533> labels = ['type-feature', 'library'] title = 'Adapt heapq push/pop/replace to allow passing a comparator.' updated_at = <Date 2019-12-23.14:47:09.425> user = 'https://bugs.python.org/PrashantSharma' ``` bugs.python.org fields: ```python activity = <Date 2019-12-23.14:47:09.425> actor = 'mark.dickinson' assignee = 'rhettinger' closed = True closed_date = <Date 2014-03-14.08:06:43.539> closer = 'rhettinger' components = ['Library (Lib)'] creation = <Date 2014-03-13.08:18:17.533> creator = 'Prashant.Sharma' dependencies = [] files = ['34388'] hgrepos = [] issue_num = 20905 keywords = ['patch'] message_count = 10.0 messages = ['213361', '213363', '213365', '213373', '213437', '213496', '213504', '213505', '358804', '358817'] nosy_count = 6.0 nosy_names = ['rhettinger', 'mark.dickinson', 'BreamoreBoy', 'gdr@garethrees.org', 'mrolle', 'Prashant.Sharma'] pr_nums = [] priority = 'low' resolution = 'rejected' stage = None status = 'closed' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue20905' versions = ['Python 3.5'] ``` </p></details>

crj · July 29, 2024, 7:34pm

Well, I’ve implemented my own heapq.py which exposes the undocumented _heappop_max and _heapify_max, implements heappush_max in terms of _siftdown_max and then implements MinQ, MaxQ, KeyedMinQ and KeyedMaxQ in terms of them.

It works. And it’s about as efficient as an implementation in Python is ever going to be.

But I’m very mindful that:

Having an instance that wraps a list rather than a C object with an associated C array
Having a list of KeyedItem objects, rather than storing key and item directly inside a C array

…both cause significant unnecessary space to be used, and all the indirection in the Python code takes time.

Maybe I’m the only person who cares, but this feels like a sufficiently fundamental structure that it’s worth implementing efficiently.

(Yes, I’m aware I’m talking myself into a job, here…)

crj · July 29, 2024, 8:21pm

(Sorry, I’m aware I’m dangerously close to thinking out loud here, but…)

Having now read _heapqmodule.c I think I may have identified a useful half-way house.

Firstly, take it as read that max-heap versions of the various functions are arriving soon.

Second, extend the existing functions and their max-list variants:

heapq.heappush(heap, item, keys=None, key=None)
heapq.heappop(heap, keys=None)
heapq.heappushpop(heap, item, keys=None, key=None)
heapq.heapify(x, keys=None)
heapq.heapreplace(heap, item, keys=None, key=None)

If keys is specified, it shall be a second list the same size as heap, containing the comparison keys. For the push functions, if keys is specified, key is the key for the new item.

It feels to me like that strikes a good balance between implementability, efficiency and elegance.

(If it was felt to be more intuitive, an alternative would be to add values and value parameters and always sift by comparing elements of heap.)