Add `OrderedSet` to stdlib

I do not need that often, but one use case is when serializing/deserializing JSON which doesn’t have sets, only lists. I can use a library that will desarialize some JSON lists to sets (eg. hints in annotations) so I can work on sets in code. In one of my applications I use content addressing. The order of sets is important when serializing back.

4 Likes
  • when we finally came up with a good ordered dict implementation, that implementation unfortunately does not carry over to sets;

why cannot ordered dict implementation be used for set ?

a set can be implememented with a dict where you simply don’t use/care about the value given/associated to the keys, no ?

2 Likes

My memory may fail me, but at my previous job as an intern I had to design a CRUD application that would generate both aggregate and per-file statistics for a collection of CSV files, where I would have to filter out invalid rows based on a number of numerous predefined conditions. Deduplicating indeces that have failed multiple expectations at once with indeces = {} and indeces = indeces | dict.fromkeys(df.index.values)) - read, with an ordered hash set - inside a for loop has allowed me to generate a sequence of indeces that has failed at least one expectation while maintaining an intuitive order in which a sequence of invalid rows has appeared in that collection of files. I believe my code would be more readable if I had a stdlib option to use a hash set for my purposes, as opposed to a hash map as a proxy to an ordered hash set - I needed a set, but my particular data structure did not have all of that good stuff that comes from inheriting from collections.abc.Set.

Now, as a more experienced coder I see that I probably should have used pd.concat and pd.unique, but the point still stands. Deduplication of a sequence where you would like to 1) keep it a subtype of a sequence 2) while gaining collections.abc.Set methods is something that I would imagine comes handy in designing many CRUD applications.

2 Likes

Side note: msets actually are pretty useful. We’ve been benefiting from msets through collections.Counter , which practically is a multiset :slight_smile:

Is it? I see it more as a special case of defaultdict(int). Maybe I’m missing something fundamental about it.

The idea behind a multiset / multimap, as I understand it, is to be able to store multiple equivalent keys (objects that compare equal to one another), or multiple values bound to the same key, and then find all of them with a lookup. Not just count how many there are, but actually store and return them all.

A multimap[K, V] can be implemented using a mapping from K to Collection[V]. A multiset[K] could be considered equivalent to multimap[K, K].

The advantage of a dedicated multiset/multimap data structure is that it works without requiring this two-level structure. In fact, it usually isn’t much of a problem to take the hash table backing sets and maps and configuring it to allow inserting multiple “equal” keys.

But the need for such sets/mappings doesn’t come up that often, and the “Collection[V] as value” way usually does the job, as sub-optimal as it is.

2 Likes

Regarding OrderedSet I will say this; not sure if it’s been mentioned before, but it is an effective device to shield the rest of the program from non-deterministic behavior originating in builtin object.__hash__.

This is valuable in scientific computing context where one wishes to repeat runs on a given workload and expect identical behavior throughout the entire program.

In this case, it is not the exact order that we care about, just that it will exactly the same across repeated runs on the same data (even on a different machine, or build of the interpreter). At least, assuming no other sources of non-determinism are present.

4 Likes

That sounds a lot like Larry Hasting’s use case for osets in the python-dev discussion linked above. He was using it to create a partial ordering of installation dependencies such that the poset would be stable from run to run.

I have personally wanted osets less for the mathematical reasons and more for ease of reading in diagnostic output. For example, just the other day I was implementing a pair of functions that computed the closure of several string transformations and its inverse. (The goal was to explode a short list of names into many aliases based on rules like case-folding, and then be able to reconstruct an equivalent shortlist for repr by removing redundant aliases.)

In my application, the specific ordering didn’t matter, but it was important for my sanity to have a consistent and easily-scannable ordering. I ended up using sorted(aliases) on a regular set, but not before reaching for the non-existent collections.OrderedSet.

My overall take on this is that Python has a few decent ways to do the relevant oset operations, either with dict.fromkeys or sorted depending on whether you’re looking for an OrderedSet or a SortedSet, but every time I do that I feel like there’s a hole in the collections module. The hole isn’t big enough to deal with the hassles of finding and vetting a third-party module, so each time I end up using one of the existing stdlib compromises, but I would much rather just write OrderedSet (or SortedSet).

4 Likes

Maybe, for these situations, a dict is a better choice. Map your keys to irrelevant-but-true values (from experience, it’s convenient to have them all be true). PEP 584 handballed this off to a future proposal:

and it may well be that this is the best way to deal with it. Leave the native set() type as-is, but if you want to retain order, use a dictionary.

The set type was far more algorithmically appropriate here, because I had to use set difference extensively in the inverse closure, both to determine which items were redundant and to remove them from the main set. Doing that in a dict would have been a bigger hassle than applying sorted to the end result.

More generally, I find that I usually want to sort sets for presentation reasons most time that I use them. I work in infra (specifically developer experience), so I’m usually implementing stuff in the style of Unix commands where you either want to maintain the order of inputs or sort them lexicographically. While dict does do the preserving inputs, it’s not a suitable replacement for set when I’m doing set operations.

Or to put it another way, I don’t typically need the mathematical properties of osets specifically, but I do often want sets, and I usually want them ordered or sorted because that’s how Unix-style commands present their output.

[edit: and to be clear, I agree with the python-dev consensus that the core set type should prioritize performance, I just personally would always use OrderedSet or SortedSet when I need a set type, because of the kind of work that I do.]

dict.keys() are already set-like. You can use set operations on them. From an earlier post:

Given:

>>> words = ['foo', 'bar', 'bar', 'foo', 'baz', 'foo', 'quuz']
>>> d = dict.fromkeys(words)
>>> mask = set(["foo", "baz"])
# Ordered + Unique
>>> list(d)
['foo', 'bar', 'baz', 'quuz']

Code:

# Set operation on keyviews
>>> d.keys() - mask
{'bar', 'quuz'}

Huh, I wasn’t aware that dictionary view objects observed the collections.abc.Set interface. Thanks for pointing that out.

However, I don’t consider that an acceptable replacement for just using a set object, especially for my use case where the sorted function was sufficient for my ordering needs. It’s several lines of code more than just using a set and, more importantly, significantly more opaque. Also, I’m not even sure whether there’s a guarantee that the dictionary view operations will preserve the key order. There’s no guarantee of that in the docs, and the usual implementation of set intersection is not order preserving, for example.

Just confirmed in the Python 3.11 interpreter:

>>> words = ['foo', 'bar', 'quuz', 'bar', 'foo', 'baz', 'foo', 'quuz']
>>> d = dict.fromkeys(words)
>>> tuple(d.keys())
('foo', 'bar', 'quuz', 'baz')

So far, so good. But when you intersect that with a set:

>>> filter = set(("quuz", "baz"))
>>> d.keys() & filter
{'baz', 'quuz'}

It’s now lost the original ordering. Instead of the subset being {"quuz", "baz"} it has become {"baz", "quuz"}, which means that this is not a correct implementation for set operations if the goal is to maintain original insertion order. To perform things like set intersection or difference, you’ll need to reimplement those algorithms, which is significantly extra work and opportunity for error.

1 Like

Is that really the case? What if you intersect or union two sets that have contradictory insertion orders? I don’t think preserving insertion order for any operation other than insertion is necessarily even possible, in general.

That’s the sort of thing that a collections.OrderedSet spec would define. However, note that I can also produce the same error with set difference, which has no such conflict of ordering:

>>> filter = set(("foo", "bar"))
>>> d.keys() - filter
{'baz', 'quuz'}

The core problem is that the operator result is a set, not a dictionary view, and so it does not respect the original ordering. It happens that my implementation always sorts "baz" before "quuz" regardless of the dict ordering. That problem isn’t immediately obvious if you’re thinking that the dictionary rules will still apply.

I think this is a pretty good demonstration of why collections.OrderedSet is superior to rolling your own workarounds with dict. It’s easy to go down a garden path where things seem to work at first, and then you need a different set operation, and either it’s a lot more work, or it seems to work for a quick test and then breaks in production. Note how @pylang’s code worked for a simple example, but it broke as soon as I swapped two items in the original list. That’s something you could get wrong even with unit tests.

The dict workarounds are OK if you only need to filter one list or some similar one-liner. However, it’s much less transparent than using an actual set type, and it’s more prone to breakage as the example grows into more than a toy problem.

Luckily I didn’t need an OrderedSet in my code, but only a sorted set for presentation, and that is easy enough to accomplish with the sorted function. I think it’s much easier to work around the lack of a SortedSet than the lack of an OrderedSet. Even if you need sorted output for more than just presentation, it’s much easier to apply an adapter as necessary than it is to implement set operations correctly without breaking the insertion order.

I believe you are using sets improperly, and that’s why you are surprised with the results. dictviews are set-like, so you can use set operations on them. The result is a set, which doesn’t guarantee order. It’s all behaving as expected.

I think you are trying to articulate some lack of insertion order in set operations. That is an intriguing idea, and I think you should explore that further. While you’ve mentioned an unrelated use case where sorted() was sufficient, you have not actually demonstrated a use case, namely examples requiring that insertion order is preserved after set operations.

Dict difference is one of the operators mentioned in the deferred section that I linked to.

My specific use case was fine with sorted, but I don’t think it’s a stretch to consider the same case where the insertion order is important instead of the lexicographic order. I also don’t understand how I’m using sets improperly when the task is calculating a closure and its inverse. That seems like an ideal use case for sets, and my solution using sets worked well for the purpose.

I was not surprised by the results! I expected that example to fail because of unspecified behavior, and it was quite easy to demonstrate the bug and how it failed to meet my requirements. (If it wasn’t meant to meet my requirements, then what was the purpose of the example? I said that I already had working, idiomatic code.)

I am, however, surprised at the several recommendations that I replace working set code with workarounds that do less with more code. That’s not helpful, and I don’t think it’s a good idea to recommend roll-your-own dict solutions outside of the simplest cases.

[Sorry about the multiple replies. I’m trying to be terse! Chris’s response arrived just as I was sending this message originally. Also, I’m also not looking for advice here. I have the skills and knowledge to implement this stuff on my own. I’m saying that this feature would be a convenience that would save me a small amount of effort on a great deal of occasions, whereas the workarounds cost me effort with the same frequency, and with greater risk of error.]

1 Like

Using a dict where a set is appropriate is still more work, and error-prone work as the example above shows. As far as I’m aware, there is no significant opposition to collections.OrderedSet other than available labor. On the contrary, the python-dev thread seemed to reach the consensus that it was a good idea.

My addition here was to note that there’s demand for OrderedSet (and SortedSet) even in cases where it’s not algorithmically necessary, but simply a convenience for presentation. Furthermore, I think that is exactly the sort of use case where a third-party package would be more effort than it’s worth, but people would gladly use a stdlib implementation.

4 Likes

I expected that example to fail because of unspecified behavior, and it was quite easy to demonstrate the bug and how it failed to meet my requirements.

dictview - set → set. is not a bug.

I think you mean oset. Is it really much work?

>>> oset_ = dict.fromkeys(words).keys()
# compared to
>>> oset = some_module.OrderedSet(words)

As I mentioned,

I think you are trying to articulate some lack of insertion order in set operations. That is an intriguing idea, and I think you should explore that further.

Touching on @bryevdv’s reply, what should insertion order look like between 1) two osets and 2) an oset and set?

I said that I am not looking for advice, and neither am I looking to workshop the interface of OrderedSet. Please stop.