Searching and removing events from heapq (heap queue) based on marker/identifier

EwoutH · February 13, 2024, 3:16pm

Background

The Python heapq module provides an efficient implementation of the heap queue algorithm, also known as the priority queue algorithm. It allows for the addition of elements in a way that the smallest (or largest, with a minor modification) element can be accessed and removed efficiently. However, a common use case that arises in various domains involves not just prioritizing elements, but also efficiently searching for and removing elements based on specific identifiers or markers, without disturbing the heap’s properties.

Problem

Currently, to remove a specific element from a heap, one might need to iterate over the entire heap to find the element, remove it, and then re-heapify the list. This process is inefficient, particularly for large datasets, as it operates in O(n) time for searching and an additional O(n log n) time for re-heapifying.

Potential solution

A potential enhancement to the heapq module could involve native support for markers or keys associated with heap elements. This feature would allow users to efficiently search for and remove elements based on these markers, significantly reducing the complexity of such operations. One possible implementation approach could involve maintaining an additional internal mapping of markers to heap indices, enabling constant-time lookups and potentially more efficient removals.

Drawing an analogy to the dict type in Python, which offers constant-time performance for lookups, insertions, and deletions by key, introducing a similar capability in heapq could dramatically improve its utility and efficiency. By leveraging an internal mapping or a complementary structure that associates keys or markers with heap elements, users could enjoy the dual benefits of a priority queue and hash table-like performance for specific operations.

An example case

In agent-based modelling, agents often plan and perform tasks that take a certain amount of time. However, an agent can be interrupted by either an other agent or some system-wide event. In that case, we want to efficiently remove (an subset of) events from the heap queue, and potentially schedule new ones. This requires efficient search and removal though the heap queue. With a large amount of agents this scales dramatically, even if agents only have one next event scheduled, the complexity scales O(n^2), since N agents might have to search though a queue of N events.

rhettinger · February 13, 2024, 4:19pm

Unless removals are common relative to other operations, the overhead of index tracking would swamp any benefits. For existing applications where no removals occur, it would be all cost and no benefit.

Consider trying other data structures better suited to the task. Grant Jenk’s SortedContainers or one of the many binary tree implementations would be a good place to start.

chepner · February 13, 2024, 4:22pm

The documentation is somewhat misleading. Heaps and priority queues are two different abstract data structures. You can use a heap as part of the implementation of a priority queue, as discussed in the heapq documentation, but heapq itself really only implements what is needed for a heap.

Typically, a priority queue uses a heap and some sort of lookup (hash table, tree, etc) to act as index into the elements of the heap. Operations on the priority queue are responsible for updating both the heap and the index as necessary. As such, an efficient priority queue implementation would, in my opinion, belong in a separate module, not added to heapq. Whether that module belongs in the standard library or not is a matter for further discussion.

Stefan2 · February 13, 2024, 4:26pm

heapq.heapify is O(n). Are you talking about something else?

slaymaker1907 · February 13, 2024, 6:16pm

Maybe this is what you’re describing, but I think what you really want is something like a box in the heap which you can reach into and set its contents to None or otherwise invalidate. That way you can have a separate dict or something from a primary key to the thing in the priority queue. Then, once the queue looks at that data again, the heap could somehow see that it is invalid and automatically remove it.

Without any modifications to heapq, you can do most of this already, though it is probably less efficient than if heapq supported some sort of validator. I say probably because having an is_valid function might be more expensive than whatever heapq does currently.

class Box:
	def __init__(self, item=None):
		self.item = item
index = dict()
pqueue = []
def remove_from_pqueue(key):
	item_box = index.get(key)
	if item_box is None:
		return
	item_box.item = None
	del index[key]
def add_to_pqueue(key, item, priority):
	remove_from_pqueue(key)
	item_box = Box(item)
	index[key] = item_box
	heappush(pqueue, (priority, item_box))
def pop_pqueue():
	while True:
		item_box = heappop(pqueue)[1]
		if item_box.item is not None:
			return item_box.item

The worst-case complexity of this solution is O(n log n) assuming you have n insertions, though constant factors are a bit worse than they could be for deletions and it obviously doesn’t completely free the memory of removed elements until they are removed via heappop. It’s still much better though than the quadratic solution of reconstructing the heap.

Feel free to use this code, I think it is short enough to not be copyrightable.