Add `OrderedSet` to stdlib

I find this comment very confusing. You say you want sets (your emphasis) but then say that you want them ordered. But neither Python sets nor mathematical sets are ordered, so I’m not clear how you can say that you want “sets” if ordering is important.

To put it another way, you say you don’t need the mathematical properties of osets but you need sets and you need ordering. What “mathematical properties of osets” are there, apart from set properties and ordering???

I have no problem with the idea that getting elements in a consistent order might be an important requirement for your code. That’s understandable, and common. But if that is a requirement, then you either have to impose ordering on top of an unordered data structure (by writing extra code to manually maintain the order) or use an ordered data structure (which may or may not have trade-offs to allow maintaining the ordering).

2 Likes

Sure, I’m happy to clarify. The short version is that I’m making a distinction between data model and presentation. I typically only need the properties of a set for modeling, but the additional properties of an oset are helpful for presentation, they don’t interfere with modeling, and I don’t care about the potential performance gap between the two types.

The longer version:

I often find myself working with data that lends itself to set representation, like collections of tags or keywords. I mentioned a recent example where, given some strings s and transformations s->s', I needed both the closure s* and the minimal starting set s0, which are easy to calculate with set union and difference. That naturally lent itself to a set representation:

# excerpt from close(*args):
aliases = set(args)
closure: set[str] = set()
while aliases != closure:  # Stop when the closure is stable.
    aliases |= closure
    for alias in aliases:
        closure |= {alias, alias.casefold(), ...}

# excerpt from __repr__:
minset = set(aliases)
for alias in aliases:
    minset -= close(alias) - {alias}  # Remove the non-root aliases.

That algorithm doesn’t need an oset, just set. Because it doesn’t require ordering semantics, and the set notation is quite clear, I think it would be a mistake to reimplement this with a dict to get ordering semantics. It would obscure the code for no benefit.

However, when it comes to presentation, then I do care about set ordering, for consistency and ease of reading. The second excerpt there is from a __repr__ method, which “should look like a valid Python expression that could be used to recreate an object with the same value,” and the minset is the smallest set that generates the same closure. I also want a stable repr that is easy for a human to parse. Thus, the ordering matters here.

While writing that code, I hopped over to the Python 3.11 collections docs to see whether collections.OrderedSet or SortedSet exists yet, and finding that they don’t, I used sorted to order the output instead.

Because this kind of code doesn’t need oset or poset properties for correctness, I’m not willing to sacrifice notational clarity, testability, effort, etc. to get them. However, if I had a well-vetted oset class handy, such as the hypothetical collections.OrderedSet, then I would happily use it for the convenience of stable and readable diagnostic output. I expect that the oset class would sacrifice some performance for this convenience, but that’s a tradeoff I’m fine with.

I hope that clarifies why I would prefer to have an oset class available even for code that doesn’t require it algorithmically, and why I’m not happy with the suggestion to use workarounds more complex than throwing sorted into my output methods.

2 Likes

Thanks, I see. I’m sorry, I’d missed that you were happy with adding sorted in the output methods, that’s exactly what I would have suggested (or done myself, if it were my code).

As to having an ordered set class available, I’m sure there are such classes on PyPI (people have probably mentioned some in this thread, but I haven’t checked). The question of whether something being on PyPI is sufficient, or whether “being in the stdlib” is a benefit, is a difficult question, which people have widely varying opinions on. And the debate isn’t helped by the fact that everyone naturally has the view that “libraries that I rely on are worth adding to the stdlib” :wink: If you’re in the position where “get an implementation from PyPI” isn’t a viable option, I sympathise, but don’t have a solution (that’s the point at which the various “simulate ordered sets with what is in the stdlib” suggestions come in, but as you can imagine, they are essentially workarounds and compromises for the awkward constraints imposed on what solutions can be considered).

3 Likes

I appreciate the clarification (which is what I had been seeking in previous questions). This will conclude my attempts on that discussion.

In short - having a sorted display does not appear to be a strong reason to implement separate set constructs, e.g. OrderedSet, SortedSet. As you’ve shown, sorted() is good enough. PyPI has more to offer.

To clarify - the mentioned workarounds are intended to bring awareness to the convenient set-like features of dict keys and modern dicts, which again are workarounds and not substitutes to osets.

Oh, I understand. I used to work in the realm of C & C++ standards. :joy_cat:

I think the folks who need an oset algorithmically are likely to deal with the effort & risks of implementing a solution or trying their luck on PyPI. It’s a trickier situation with cases like mine, where there’s a tiny bit of benefit to a lot more people, and the “benefit” may be more like, just gratitude than anything concrete.

Given that the major obstacle seems like a lack of resources than any matter of principle, I’m tempted to just volunteer, but I don’t know that I have the bandwidth right now. I’ll see how I feel about it after I get through my current project.

1 Like

I’m surprised there is collections.OrderedDict (and the standard dict keeps insertion order since Python 3.6?) but there is no collections.OrderedSet.

Having OrderedSet in collections is great for completeness’ sake.

Is it absolutely necessary? No, but most part of std lib is not absolutely necessary. The question is really is it better to have or have not?

And all the discussion about element ordering is bizarre: this is about element insertion order, not element comparison order.

5 Likes

This is due the bad choice of a name for these collections (OrderedSet and OrderedDict). In most languages that distinguish “ordered” from “unordered” associative containers, the term “ordered” refers to a pre-existing order relation on keys (and requires the ability to compare them); these containers are implemented by a completely different data structure, usually a balanced tree of some kind.

There isn’t a “standard” on how to name containers that use insertion order as iteration order, but a term unrelated to key ordering, like “StableSet” and “StableDict”, would’ve been better - at least it would be less likely to confuse people who are new to the language.

1 Like

I agree that a name like “StableDict” is more appropriate than “OrderedDict”. Fortunately the standard dict keeps insertion order so we can forget about OrderedDict altogether.

Does it make sense to make the standard set keep insertion order?

1 Like

Answered here by Serhiy.

The current dict implementation is called a “compact dict
implementation”, because it saves memory in common cases. It was the
initial reason of writing it. At the same time there was a need in
ordered dicts for kwarg and class namespace. We discovered that slightly
modified compact dict implementation preserves order, and that its
possible drawback (performance penalty) is too small if exists.

But ordered set implementation is not more compact that the current set
implementation (because for dicts the item is 3-word, but for sets it is
2-word). Also the current set implementation has some optimizations that
the dict implementation never had, which will be lost in the ordered set
implementation.

Take to account that sets are way less used than dicts, and that you can
use a dict if you need an ordered set-like object.

2 Likes

Apologies for the lateness of the reply, I’m just catching up on this thread now.

In context, Bradd was referring to it as a bug when your requirements are to preserve some sort of sensible order. dictview - set does not preserve any sort of order and so using it would introduce a bug.

If your requirements are different, and you don’t care about preserving order, then dictview - set is fine.

[quote]
I think you mean oset. Is it really much work?

>>> oset_ = dict.fromkeys(words).keys()
# compared to
>>> oset = some_module.OrderedSet(words)

Now try oset.add(element).

This is not an insurmountable problem. We should follow the same rules as for dicts, as much as possible. For unions, that’s already definited:

>>> dict.fromkeys(range(5)) | dict.fromkeys(range(5, -1, -1))
{0: None, 1: None, 2: None, 3: None, 4: None, 5: None}

Much of this was already nutted out in PEP 584.

1 Like

Regarding insertion order more generally, I think OrderedSet and Counter should follow the same rules to minimize surprises, and Counter currently specifies this:

Changed in version 3.7: As a dict subclass, Counter inherited the capability to remember insertion order. Math operations on Counter objects also preserve order. Results are ordered according to when an element is first encountered in the left operand and then by the order encountered in the right operand.

1 Like