Add `OrderedSet` to stdlib

There are two facts that led to me personally getting interested in this proposal. (1) I needed an object like a list, but one that raises an error when I try to append an element that is already in it. And I need it in a few different packages that I’m working on, so I don’t want to spin up a quick helper class every time. (2) I was working in an environment where I did not have access to PyPI, so it was very inconvenient to use any of the many (suitable) options on PyPI. So full disclosure: my motivation is pretty strongly about convenience (as opposed to, e.g., performance).

For now my use case is served perfectly (apart from the introduction of an external dependency) by several existing packages, for example:

See the source code in these examples for what might be desired in the stdlib.

To address comments/questions 1-5:

  1. What new problems can be solved that can’t currently be solved by just using dict.fromkeys? The latter is what I use in the rare cases where the need arises. It is not a fully featured set but gives fast membership testing and ordered iteration which is all most people would want.

oset = dict.fromkeys([key_0, key_1, key_2]) needs to know all keys at instantiation; if you want to add elements afterwards, you need the awkward oset[key_3] = None (or a wrapper that does this, which is pretty much exactly what I’d like in the stdlib). Also, you can’t access elements of a dictionary via an index; I might want oset[0] to return key_0.
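
For concreteness, a small sketch of the workaround and its rough edges (toy values stand in for key_0 etc.):

oset = dict.fromkeys(["a", "b", "c"])   # all keys must be known up front
oset["d"] = None                        # the awkward way to add afterwards
"d" in oset                             # fast membership test -> True
list(oset)                              # ordered iteration -> ['a', 'b', 'c', 'd']
# oset[0]                               # KeyError: no index-based access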

  2. What should be returned by {1, 2, 3, ..., 200_000} & {2, 1, 0}? If the answer is {1, 2} rather than {2, 1}, how would that be computed without being 100,000 times slower than what we have now?

I do not know the answer to this question. Possibilities are:

  • Binary operations that exist on sets simply aren’t defined on ordered sets.
  • Binary operations are considered to break ordering, so that, for example, the union of two ordered sets returns a regular unordered set.
  • The first (or second) set in the binary operation gets precedence for determining the ordering of the resulting set. That is, the new set is equivalent to adding, one by one, in order, all the elements of the first set, followed by adding, one by one, in order, all the elements of the second set. (A sketch of this option follows below.)

As mentioned above, my use case doesn’t (yet?) require good performance, so I’d be willing to pay a possibly large performance price for having the ordering I’m looking for. But also, what do you mean by “what we have now”? Do you mean merging/updating two dictionaries? If so, I’d think/hope an implementation based on existing dictionaries would have similar performance.
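
A minimal sketch of that third option, assuming a dict-backed representation (ordered_union is a hypothetical helper, not a proposed API):

def ordered_union(a, b):
    # The left operand determines the base ordering; elements only in the
    # right operand are appended afterwards, in their own order.
    keys = dict.fromkeys(a)
    keys.update(dict.fromkeys(b))
    return list(keys)

ordered_union([3, 1, 2], [2, 4])  # -> [3, 1, 2, 4]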

  3. Would OrderedSet get a custom literal notation such as o{1, 2, 3}, or would it have an inconvenient notation like OrderedSet([1, 2, 3])? The latter is what we have now for frozensets. No one really likes the latter.

No, I don’t think OrderedSet needs a custom literal notation. OrderedSet([1, 2, 3]) seems OK to me, at least; similar in status to the other collections in the collections module.

  4. Would the implementation be a regular set augmented by a doubly-linked list? This is how collections.OrderedDict was implemented, making it slower and more memory intensive than regular dicts. Or would the implementation be a wrapper around the compact and ordered regular dictionary? The latter gives memory savings but is algorithmically poor for set applications with frequent deletions and reinsertions.

The thrust of most of the responses in this thread is that the ordered set would just be a wrapper around the existing ordered regular dictionary. See the source code in the examples above for ideas about which operations would be included. (We could also hash out those details here if this is an appropriate place, but I think those questions are better discussed looking at code, not forum posts.)
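
For illustration, a minimal sketch of such a dict-backed wrapper (the name and method set are placeholders, not a worked-out proposal):

from collections.abc import MutableSet

class OrderedSet(MutableSet):
    def __init__(self, iterable=()):
        self._d = dict.fromkeys(iterable)

    def __contains__(self, x):
        return x in self._d

    def __iter__(self):
        return iter(self._d)   # iteration follows insertion order

    def __len__(self):
        return len(self._d)

    def add(self, x):
        self._d[x] = None

    def discard(self, x):
        self._d.pop(x, None)

    def __repr__(self):
        return f"OrderedSet({list(self._d)!r})"

Inheriting from MutableSet means the binary operators (&, |, -, ^) come for free, though their result ordering is then whatever the ABC’s generic implementations produce.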

  5. Why not just use one of the many choices on PyPI? I presume that if people really needed an ordered set on a regular basis, those packages would be in more popular use. I almost never see them used in my customers’ code repositories.

As I mentioned above, and as explained much better further up in this thread, occasionally that is, unfortunately, not possible, or is prohibitively costly. Even where it is technically possible, there are reasons to avoid pulling in additional dependencies. Perhaps these constraints are a bit contrived (though they do apply to real Python users out there…), but to this I respond/ask: why was it appropriate to add OrderedDict to the stdlib at the time, but not OrderedSet? My personal opinion is that there’s no question that OrderedSet would be useful to some users; if it could just “magically” appear in the stdlib and be maintained, it would be a no-brainer. So the only question is whether there’s someone willing to put it in and someone willing to maintain it.

I think you can implement this efficiently by building the ordered set using a modern Python dict (which preserves insertion order):

  • whose keys are the elements of the set, and
  • whose values are integers whose numeric order reflects the set’s ordering.

Set removal is just removal from the dict, and takes O(1) amortized time.

Set addition adds to the mapping a key-value pair consisting of the element being added and a new value that is one larger than the last element’s value (next(reversed(d.values())) + 1). It also takes O(1) amortized time. (If the next integer grows too big, you can renumber all the elements in linear time by iterating the mapping. This doesn’t change the amortized time.)
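
A minimal sketch of that representation (IntValuedOrderedSet is a made-up name; reversed() on dict views requires Python 3.8+):

class IntValuedOrderedSet:
    def __init__(self):
        self._d = {}   # element -> strictly increasing integer

    def add(self, x):
        if x not in self._d:
            # One larger than the last element's value, or 0 when empty.
            self._d[x] = next(reversed(self._d.values()), -1) + 1

    def discard(self, x):
        self._d.pop(x, None)   # O(1) amortized

    def __contains__(self, x):
        return x in self._d

    def __iter__(self):
        return iter(self._d)   # insertion order

    def __len__(self):
        return len(self._d)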

Finally, set intersection can be efficiently implemented:

  • For each element in the smaller set, filter using the larger set, adding these to a list.
  • Sort this list using the integers from the dict of the first set.
  • Finally, add these sorted elements to a new ordered set.

This is O(m log m), where m is the size of the smaller set.
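
Continuing the sketch above, the intersection step might look like this (ordering taken from the first operand; a sketch, not a tuned implementation):

def intersection(first, second):
    # Filter the smaller operand through the larger one: O(m) expected.
    small, large = sorted((first, second), key=len)
    common = [x for x in small if x in large]
    # Sort by the first set's integers: O(m log m).
    common.sort(key=first._d.__getitem__)
    result = IntValuedOrderedSet()
    for x in common:
        result.add(x)
    return result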

2 Likes

Alternatively, maintain a count of deletions (which can be reset every time you renumber), and then add that to the dict’s length to get the next integer.
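
A sketch of that variant, again with a made-up name:

class CountingOrderedSet:
    def __init__(self):
        self._d = {}        # element -> strictly increasing integer
        self._deletions = 0

    def add(self, x):
        if x not in self._d:
            # len + deletions is always one past the largest value used so far.
            self._d[x] = len(self._d) + self._deletions

    def discard(self, x):
        if x in self._d:
            del self._d[x]
            self._deletions += 1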

1 Like

I’m now encountering exactly this problem. Currently, we have 3 options:

self.agents = defaultdict(list)
self.agents[type(agent)].append(agent)

Performance for adding and removing 1000 agents: 2.06 ms ± 3.11 µs

self.agents = defaultdict(set)
self.agents[type(agent)].add(agent)

Performance for adding and removing 1000 agents: 439 µs ± 1.3 µs

self.agents = defaultdict(dict)
self.agents[type(agent)][agent] = None

Performance for adding and removing 1000 agents: 460 µs ± 2.11 µs

All options are imperfect:

  • Lists are slow (and allow duplicates)
  • Sets don’t keep insertion order (making runs non-deterministic)
  • dicts are ugly in usage (due to their redundant None values)

So I would really love to have a fast data structure that keeps insertion order, doesn’t allow duplicates, and has a fast and nice user API without hacks like the dict with None values. OrderedSet sounds perfect for that.

I found two fast implementations available on PyPI:

stableset benchmarks around 650 µs and ordered-set-37 around 630 µs. A lot faster than using a list, but still a bit slower than a set or dict. It probably depends on the exact use case, though.

However, depending on packages from PyPI carries a lot of risk and complexity that’s often not desired. I think an OrderedSet is a logical data structure to have, between set and OrderedDict. It has a clear, unique use case. So I would really love to have it in the Python standard library.

Appending to a list is faster than adding to a set (on my computer).

Benchmark
import timeit

loops = 10000


def benchmark_list_append():
    my_list = []
    for i in range(loops):
        my_list.append(i)


def benchmark_set_add():
    my_set = set()
    for i in range(loops):
        my_set.add(i)


# Measure the time taken for list append
time_list_append = timeit.timeit(benchmark_list_append, number=1000)

# Measure the time taken for set add
time_set_add = timeit.timeit(benchmark_set_add, number=1000)

print(f"Time taken for list append: {time_list_append} seconds")
print(f"Time taken for set add: {time_set_add} seconds")

We should avoid using the name “ordered set”, as it is already a mathematical concept: in mathematics, an ordered set is a set equipped with an ordering relation (a partially or totally ordered set), not a set that remembers insertion order.

My suggestion is to have sets maintain insertion order, similar to dictionaries. This change would be backwards compatible.

Appending to a list is faster than adding to a set (on my computer).

My benchmark was mainly about removing from a list/set/dict. In a set and dict that’s O(1); in a list it’s O(n).


In our OSS project we just had a long discussion about this problem. No completely satisfying solution came up; only an OrderedSet could have provided that. See Native support for multiple agent types by EwoutH · Pull Request #1894 · projectmesa/mesa · GitHub

I would be completely happy with either OrderedSet being added to collections, or set maintaining insertion order (as dict did in 3.7).

Not even using a 3rd-party solution? You already have plenty of dependencies. You’d be more willing to drop support for Python versions older than 3.12 than add a new dependency? That seems rather extreme. I accept that a stdlib solution would be more convenient, but let’s be careful not to overstate the impact.

3 Likes

Just a comment on the linked PR:

My biggest concern is that with the dict you can’t simply do

model.agents[SomeType]

anymore. Now you have to add .keys(), and that’s just ugly. I already hear a hundred students asking “What does the .keys() do?” and then I have to explain that it’s a stupid implementation detail.

Is list(model.agents[SomeType]) better than .keys()? Or for agent in model.agents[SomeType]: ...? It’s an iterable container with insertion order, which is the desired API. That it has other methods because it happens to be a dict is the implementation detail, and users do not need to make use of that.

I was suggesting making use of a list. The set.add() and set.remove() operations have an amortized worst-case time complexity of O(n). In other words, for large sets, they are no longer O(1).

Quoting from the GitHub issue:

I would suggest implementing a data structure like a binary search tree with balancing, and so on.

“A society grows great when old men plant trees in whose shade they shall never sit.”

Of course we will work around it now. But it would be nice to have somewhere in the future!

That’s a good point. However, because a dict is somewhat more specialized, it has underlying assumptions that prevent some functionality.
For example, I can imagine index-based access, subsetting, and slicing being implemented as native OrderedSet methods. Having access to an index just adds a lot of possibilities.

To give a very practical use case, we’re now considering adding a select_agents method to our Model class. That method currently has an n argument that allows selecting n agents from Model.agents. If it were an OrderedSet with slicing implemented, selecting n agents would be way easier (and potentially faster). It would also enable entirely new possibilities without cumbersome implementations, like selecting n agents in the middle of the range.
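
For illustration, with a dict-backed structure such selections can at least be emulated via itertools.islice (a sketch; this is O(n) iteration, not true O(1) slicing):

from itertools import islice

agents = dict.fromkeys(["a1", "a2", "a3", "a4", "a5"])
first_n = list(islice(agents, 2))     # ['a1', 'a2']
middle = list(islice(agents, 1, 4))   # ['a2', 'a3', 'a4']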

I didn’t know that, thanks! What kind of size are we talking about? And in that case, how much would the difference with a list be?

1 Like

I don’t think you can get O(1) indexing and O(1) removal from arbitrary locations. Once you remove something you need to re-index the rest somehow.

An OrderedSet could be implemented using a combination of a hash table and a linked list. That could offer efficient operations for both element removal and order maintenance.

The hash table component ensures uniqueness of elements and allows for O(1) lookup and deletion. When an element is added, it’s placed in the hash table, providing quick access for checking existence and removal.

The linked list maintains the order of elements. Each element in the hash table points to its corresponding node in the linked list. This enables the data structure to maintain the insertion order.

When an element is removed, the data structure uses the hash table to find the element in O(1) time and then removes it from both the hash table and the linked list. The linked list allows this removal without the need to shift elements (as would be required in an array-based structure like a list), thus maintaining O(1) removal complexity.

The disadvantage is that it would be a bit less memory efficient. I don’t know if that trade-off is worth it, but it could be an option.

I think that there are also other options like balanced trees or skip lists that could offer some O(log n) indexing and removal.
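
A minimal sketch of the hash-table-plus-doubly-linked-list design described above (the same general approach collections.OrderedDict originally used; the name and method set are illustrative):

class LinkedOrderedSet:
    class _Node:
        __slots__ = ("value", "prev", "next")

        def __init__(self, value):
            self.value = value
            self.prev = self.next = None

    def __init__(self, iterable=()):
        self._nodes = {}               # element -> its linked-list node
        self._head = self._tail = None
        for x in iterable:
            self.add(x)

    def add(self, x):
        if x in self._nodes:           # uniqueness via the hash table
            return
        node = self._Node(x)
        self._nodes[x] = node
        if self._tail is None:
            self._head = self._tail = node
        else:
            node.prev = self._tail
            self._tail.next = node
            self._tail = node

    def discard(self, x):
        node = self._nodes.pop(x, None)
        if node is None:
            return
        # Unlink in O(1); no element shifting as in an array-based list.
        if node.prev is None:
            self._head = node.next
        else:
            node.prev.next = node.next
        if node.next is None:
            self._tail = node.prev
        else:
            node.next.prev = node.prev

    def __contains__(self, x):
        return x in self._nodes        # O(1) lookup

    def __iter__(self):                # walk the list in insertion order
        node = self._head
        while node is not None:
            yield node.value
            node = node.next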

4 Likes

Sounds like setutils - IndexedSet type — boltons 23.0.0 documentation.

1 Like

This doesn’t give you O(1) indexing, though: indexing a linked list is O(n), and the hash table doesn’t know the index.

I think what they’re saying is store a reference to the linked list node in the hash table.

Though at that point, you could just store prev and next in the hash table along with the hashed object itself.

Getting the actual position of a node would still be O(n), but you could easily look up the neighbors as O(1).

I think!

You can get to the location of the node in the linked list by using the value. But you can’t get to it by index (“give me the nth item”), unless I’m misunderstanding the intent.

i.e. you have hash_table = {"a": Node("a"), "b": Node("b")} and linked_list = Node("a") -> Node("b") [1]. This gets you unique values and an ordering. But if you ask “what’s the 2nd value?” you will need to iterate over the linked list. If you store another mapping for the indexes, you’ll need to update that whenever you delete something.


  1. making up notation for this

It approaches being about 1.5 times faster than a set, and the difference would be even more significant against a set that maintains insertion order. A list also has a very low memory footprint.

Take a look at GitHub - rspeer/ordered-set: A mutable set that remembers the order of its entries. One of Python's missing data types. to get a sense of how slow it could become. The builtin dict is the fastest solution.

We are getting off-topic here and not adding significant content to the thread. It has already been explained above why changes in this regard are unlikely to happen in the near future.

That sounds wrong. Did you mean “worst-case” instead of “amortized worst-case”?

Reference: TimeComplexity - Python Wiki