Add `OrderedSet` to stdlib

pf_moore · December 9, 2022, 11:17am

I find this comment very confusing. You say you want sets (your emphasis) but then say that you want them ordered. But neither Python sets nor mathematical sets are ordered, so I’m not clear how you can say that you want “sets” if ordering is important.

To put it another way, you say you don’t need the mathematical properties of osets but you need sets and you need ordering. What “mathematical properties of osets” are there, apart from set properties and ordering???

I have no problem with the idea that getting elements in a consistent order might be an important requirement for your code. That’s understandable, and common. But if that is a requirement, then you either have to impose ordering on top of an unordered data structure (by writing extra code to manually maintain the order) or use an ordered data structure (which may or may not have trade-offs to allow maintaining the ordering).

bradd · December 9, 2022, 1:15pm

Sure, I’m happy to clarify. The short version is that I’m making a distinction between data model and presentation. I typically only need the properties of a set for modeling, but the additional properties of an oset are helpful for presentation, they don’t interfere with modeling, and I don’t care about the potential performance gap between the two types.

The longer version:

I often find myself working with data that lends itself to set representation, like collections of tags or keywords. I mentioned a recent example where, given some strings s and transformations s->s', I needed both the closure s* and the minimal starting set s0, which are easy to calculate with set union and difference. That naturally lent itself to a set representation:

# excerpt from close(*args):
aliases = set(args)
closure: set[str] = set()
while aliases != closure:  # Stop when the closure is stable.
    aliases |= closure
    for alias in aliases:
        closure |= {alias, alias.casefold(), ...}

# excerpt from __repr__:
minset = set(aliases)
for alias in aliases:
    minset -= close(alias) - {alias}  # Remove the non-root aliases.

That algorithm doesn’t need an oset, just set. Because it doesn’t require ordering semantics, and the set notation is quite clear, I think it would be a mistake to reimplement this with a dict to get ordering semantics. It would obscure the code for no benefit.

However, when it comes to presentation, then I do care about set ordering, for consistency and ease of reading. The second excerpt there is from a __repr__ method, which “should look like a valid Python expression that could be used to recreate an object with the same value,” and the minset is the smallest set that generates the same closure. I also want a stable repr that is easy for a human to parse. Thus, the ordering matters here.

While writing that code, I hopped over to the Python 3.11 collections docs to see whether collections.OrderedSet or SortedSet exists yet, and finding that they don’t, I used sorted to order the output instead.

Because this kind of code doesn’t need oset or poset properties for correctness, I’m not willing to sacrifice notational clarity, testability, effort, etc. to get them. However, if I had a well-vetted oset class handy, such as the hypothetical collections.OrderedSet, then I would happily use it for the convenience of stable and readable diagnostic output. I expect that the oset class would sacrifice some performance for this convenience, but that’s a tradeoff I’m fine with.

I hope that clarifies why I would prefer to have an oset class available even for code that doesn’t require it algorithmically, and why I’m not happy with the suggestion to use workarounds more complex than throwing sorted into my output methods.

pf_moore · December 9, 2022, 2:14pm

Thanks, I see. I’m sorry, I’d missed that you were happy with adding sorted in the output methods, that’s exactly what I would have suggested (or done myself, if it were my code).

As to having an ordered set class available, I’m sure there are such classes on PyPI (people have probably mentioned some in this thread, but I haven’t checked). The question of whether something being on PyPI is sufficient, or whether “being in the stdlib” is a benefit, is a difficult question, which people have widely varying opinions on. And the debate isn’t helped by the fact that everyone naturally has the view that “libraries that I rely on are worth adding to the stdlib” If you’re in the position where “get an implementation from PyPI” isn’t a viable option, I sympathise, but don’t have a solution (that’s the point at which the various “simulate ordered sets with what is in the stdlib” suggestions come in, but as you can imagine, they are essentially workarounds and compromises for the awkward constraints imposed on what solutions can be considered).

pylang · December 9, 2022, 11:18pm

I appreciate the clarification (which is what I had been seeking in previous questions). This will conclude my attempts on that discussion.

In short - having a sorted display does not appear to be a strong reason to implement separate set constructs, e.g. OrderedSet, SortedSet. As you’ve shown, sorted() is good enough. PyPI has more to offer.

To clarify - the mentioned workarounds are intended to bring awareness to the convenient set-like features of dict keys and modern dicts, which again are workarounds and not substitutes to osets.

bradd · December 9, 2022, 11:55pm

Oh, I understand. I used to work in the realm of C & C++ standards.

I think the folks who need an oset algorithmically are likely to deal with the effort & risks of implementing a solution or trying their luck on PyPI. It’s a trickier situation with cases like mine, where there’s a tiny bit of benefit to a lot more people, and the “benefit” may be more like, just gratitude than anything concrete.

Given that the major obstacle seems like a lack of resources than any matter of principle, I’m tempted to just volunteer, but I don’t know that I have the bandwidth right now. I’ll see how I feel about it after I get through my current project.

fancidev · December 14, 2022, 6:27am

I’m surprised there is collections.OrderedDict (and the standard dict keeps insertion order since Python 3.6?) but there is no collections.OrderedSet.

Having OrderedSet in collections is great for completeness’ sake.

Is it absolutely necessary? No, but most part of std lib is not absolutely necessary. The question is really is it better to have or have not?

And all the discussion about element ordering is bizarre: this is about element insertion order, not element comparison order.

yonillasky · December 21, 2022, 8:17pm

This is due the bad choice of a name for these collections (OrderedSet and OrderedDict). In most languages that distinguish “ordered” from “unordered” associative containers, the term “ordered” refers to a pre-existing order relation on keys (and requires the ability to compare them); these containers are implemented by a completely different data structure, usually a balanced tree of some kind.

There isn’t a “standard” on how to name containers that use insertion order as iteration order, but a term unrelated to key ordering, like “StableSet” and “StableDict”, would’ve been better - at least it would be less likely to confuse people who are new to the language.

fancidev · January 1, 2023, 1:02pm

I agree that a name like “StableDict” is more appropriate than “OrderedDict”. Fortunately the standard dict keeps insertion order so we can forget about OrderedDict altogether.

Does it make sense to make the standard set keep insertion order?

NeilGirdhar · January 1, 2023, 5:36pm

Answered here by Serhiy.

The current dict implementation is called a “compact dict
implementation”, because it saves memory in common cases. It was the
initial reason of writing it. At the same time there was a need in
ordered dicts for kwarg and class namespace. We discovered that slightly
modified compact dict implementation preserves order, and that its
possible drawback (performance penalty) is too small if exists.

But ordered set implementation is not more compact that the current set
implementation (because for dicts the item is 3-word, but for sets it is
2-word). Also the current set implementation has some optimizations that
the dict implementation never had, which will be lost in the ordered set
implementation.

Take to account that sets are way less used than dicts, and that you can
use a dict if you need an ordered set-like object.

steven.daprano · January 2, 2023, 1:40am

Apologies for the lateness of the reply, I’m just catching up on this thread now.

In context, Bradd was referring to it as a bug when your requirements are to preserve some sort of sensible order. dictview - set does not preserve any sort of order and so using it would introduce a bug.

If your requirements are different, and you don’t care about preserving order, then dictview - set is fine.

[quote]
I think you mean oset. Is it really much work?

>>> oset_ = dict.fromkeys(words).keys()
# compared to
>>> oset = some_module.OrderedSet(words)

Now try oset.add(element).

This is not an insurmountable problem. We should follow the same rules as for dicts, as much as possible. For unions, that’s already definited:

>>> dict.fromkeys(range(5)) | dict.fromkeys(range(5, -1, -1))
{0: None, 1: None, 2: None, 3: None, 4: None, 5: None}

Much of this was already nutted out in PEP 584.

bradd · January 24, 2023, 8:46pm

Regarding insertion order more generally, I think OrderedSet and Counter should follow the same rules to minimize surprises, and Counter currently specifies this:

Changed in version 3.7: As a dict subclass, Counter inherited the capability to remember insertion order. Math operations on Counter objects also preserve order. Results are ordered according to when an element is first encountered in the left operand and then by the order encountered in the right operand.

fkxxyz · April 1, 2023, 1:17pm

I would like to see OrderedSet added to the standard library. Currently, I have a need to implement sensitive word filtering, where the set of sensitive words should not contain duplicates, but I also want to store it in an ordered way in a file. When the number of sensitive words is particularly large, I need to add and delete items without duplicates while also saving them at any time. Repeatedly converting between list and set will cause a performance drop.

In my opinion, there is no need to delve into the mathematical meaning of OrderedSet, as users of this data structure are not using it for mathematical calculations, but rather for real-world use cases.

Rosuav · April 1, 2023, 2:14pm

Can you use a dictionary? They retain order.

ntessore · April 1, 2023, 8:53pm

Forgive the naive idea, but wouldn’t it merely take a singly linked list, at the cost of one pointer per item, on top of the existing set() implementation to at least maintain insertion order for iteration and .pop()? IIRC that was the essence of a vintage OrderedSet recipe by @rhettinger.

jagerber · April 2, 2023, 1:06pm

I would say it has been well enough decided and agreed upon that there should be OrderedSet in stdlib and it should be based on the modern dict which is sorted by default. I don’t think there’s need for further discussion on whether OrderedSet should be included or how it should be implemented.

I think the next step is finding volunteers to write and maintain the code. Next steps probably involve a PEP or something.

rhettinger · April 2, 2023, 9:17pm

Be sure to address the following:

What new problems can be solved that can’t currently be solved by just using dict.fromkeys? The latter is what I use in the rare cases where the need arises. It is not a fully featured set but gives fast membership testing and ordered iteration which is all most people would want.
What should be returned by {1, 2, 3, ..., 200_000} & {2, 1, 0}? If the answer is {1, 2} rather than {2, 1}, how would that be computed without being 100,000 times slower than what we have now?
Would OrderedSet get a custom literal notation such as o{1, 2, 3}, or would it have an inconvenient notation like OrderedSet([1, 2, 3]). The latter is what we have now for frozensets? No one really likes the latter.
Would the implementation be a regular set augmented by a doubly-linked list? This is how collections.OrderedDict was implemented, making it slower and more memory intensive than regular dicts. Or would the implementation be a wrapper around the compact and ordered regular dictionary? The latter gives memory savings but is algorithmically poor for set applications with frequent deletions and reinsertions.
Why not just use one of the many choices on PyPI? I presume that if people really needed an ordered set on a regular basis, those packages would be in more popular use. I almost never see them used in my customer’s code repositories.

I don’t know what the current mood is regarding the standard library vs PyPI. At least for a while, the thinking was to make the standard library smaller, retire old packages, and let people download what they need from the thriving PyPI ecosystem. I don’t recall us ever wanting to move infrequently used data structures into the standard library, otherwise, we would already have blists, numpy style arrays, red-black trees, pairing heaps, sorted collections, b-trees, judy arrays, bloomfilters, etc. Over the years, I’ve written most of these but never thought they would be appropriate or welcome for the standard library.

storchaka · April 2, 2023, 9:24pm

You meant &, not |.

rhettinger · April 2, 2023, 9:29pm

Thanks. I just fixed it in an edit. Let me know if you see anything else.

vovavili · April 2, 2023, 10:12pm

I feel like if Python was like Node.JS, it would be used a lot. It’s just that Python development culture disincentivizes relying on smaller packages to get your task done, unless it’s something very well established like tqdm or tenacity.

h-vetinari · April 3, 2023, 6:21pm

Dependencies have a cost (sometimes a pretty high one), and pushing something into the standard library greatly reduces those costs for the user (but shifts some of those costs to CPython maintenance).^[1]

From a perfectly egoistic user^[2] POV, more stuff in the stdlib is good. That mean it’ll still be a good tradeoff when all constraints are considered.

In this particular case I’d like to think that it would actually make sense to have it in the stdlib, but I’m not nearly deep enough in the details to argue against the technical objections. Still, one comment to Raymond’s list:

Why would it be an issue if ordered sets have worse performance than unordered sets? Keeping more information means more work, and that’s OK (and would be opt-in by using that class; of course “ordered” shouldn’t become the default for set literals).

As an aside, C++ has the same problem but even worse. ↩︎
and disregarding things like embedded usecases where stdlib footprint matters ↩︎