Add `OrderedSet` to stdlib

ajoino · October 23, 2022, 6:45am

It depends on what the meaning of “is” is

It shares the interface of set, but it also violates a fundamental property of sets, i.e. non-ordedness. So it is definitely set-like, but I think calling it a set is wrong. I also think that giving it a name that highlights the essential property that separates this data structure from other similar ones, i.e. uniqueness of the elements, makes its use-cases clearer. Now, if someone feels like pointing out “Like how OrderedSets are different from sets because they ordered ;)", read the first two sentences of this paragraph again

Also, doesnt the Mappinginterface implement the entiresetinterface? Wouldn't that then "require" them to be calledMappingSetssince they areset`s? Definitely not arguing for that, just wanted highlight that naming types after their interface can lead to weird results.

steven.daprano · October 23, 2022, 8:14am

Do you really care more about sets being unordered than you care about the ability to perform membership testing and to find intersections, unions, set differences etc? That seems odd.

Do you have a use-case for sets that requires them to be unordered, one that would break if sets became ordered like dicts are?

Without such a use-case, we can hardly say that lack of order is the most important property of a set.

Mathematically, sets are defined by their properties, of which membership is the most fundamental.

It is not correct to define sets as necessarily lacking order. Whatever order sets may or may not have is merely not significant, and set operations ignore it. So that the sets {1, 2} and {2, 1} are considered to be equal. When we say that sets are unordered, what we mean is that whatever order they have is arbitrary, not that they lack order.

Nor is uniqueness fundamental: we can merely ignore duplicates in a set, or not, as we prefer.

Most people want ordered sets to be like a set, not a stack. The name is a hint

The fundamental (minimal) stack API is two operations, push and pop. It is true that stacks are containers, and hence contain elements, but they do not have the one truly fundamental operation which defines a set, membership testing. The only way to see whether an object is contained by an abstract stack is to pop elements off, one at a time, until either the element is found or the stack is empty.

(Of course implementations of stacks are permitted to support additional operations, e.g. Python stacks are lists and support the entire list API.)

Having an OrderedSet which preserves insertion order, or a SortedSet which preserves lexicographic order of the elements, does not contradict their set-like behaviour any more than dicts cease to be associative arrays or key:value stores merely because they preserve insertion order.

ajoino · October 23, 2022, 8:56am

I don’t think you’re wrong in anything that you are, but I don’t necessarily agree with everything. The reason I like to think about arbritrary order as the most fundamental property of a set, in this case, is because if you just consider membership as the fundamental property, basically every container is a set. I don’t really disagree with that notion, IIUC that’s the most popular way to fundamentally describe all mathemathics (with ZFC), but I think that it’s very unhelpful to say that every object that supports __contains__ is a set. And that’s why we don’t do that.

So if membership is not a useful way to distinguish sets from other types, what is? I’d argue arbritrary order is the most natural way to distinguish a set from other types, other might argue uniqueness. However, these opinions are both equally wrong from the proper defintion of a set, which I stated in an earlier reply.

PS: I’m still in here because I find this kind of naming and categorizing a fascinating intersection of scienctific rigour and and the human fuzzy mind. I’ll gladly continue this conversation but if you think it is out-of-topic I’ll stop

orise · October 23, 2022, 9:22am

The fundamental property of sets in ZFC is the axiom of extensionality - two sets are equal if they have the same elements. This is certainly not true for other builtin datatypes in python, but as far as I understand should still hold for the proposed orderedsets.

vbrozik · October 23, 2022, 9:29am

I see it very differently:
Defined order is just an additional property. When you ignore this property you will see the set as a normal (unordered) set. As I understand it the proposal is that the basic operations like comparison will ignore the order. (like it is being done for dict already)

By the way sets in Python have certain order. You cannot say they are unordered. When you iterate a set you get its items in certain order. The problem is that this order is not very useful.

ajoino · October 23, 2022, 9:39am

I wasn’t precise in my wording, I meant arbritrary order. Sorry about the confusion.

ajoino · October 23, 2022, 9:53am

@orise This is a good point I didn’t consider. I think I would agree that if equality doesn’t care about order, then it’s definitely a set. Thanks for the insight, really helped clarify why I think arbritrary order is important!

I looked through the PyPI implementations above to see how they deal with equality, and they are both based on an implementation that ignores order when testing equality with a set, but doesn’t ignore order when testing it against sequences. If it was a set, shouldn’t always ignore order when testing equality?

jagerber · October 23, 2022, 4:03pm

(1) For the OrderedSet naming convention see https://en.wikipedia.org/wiki/Partially_ordered_set. In this nomenclature the object we’re discussing would be a totally ordered set. Such naming would be very conventional and expected with respect to mathematics conventions.

(2) I’m not convinced at all that arbitrariness of element ordering is “the” fundamental property of sets or even that it is an important property as all. In a math context I would agree with others here that set membership is the fundamental property of being a set. Orderedness of a set just happens to not be specified as one of its properties. So, indeed, any object which contains other objects could be considered to be a set in the mathematical sense. I don’t know if this last sentence is true in the CS sense though because I don’t know or not if I would argue that every object that implements __contains__ should implement the full set API. Certainly union and intersection would have to be defined differently for list (which I would say are multi-ordered-sets, see below) than for sets.

(3) It is common in math that when a type of object is defined as a “sub”-type of another object the naming is often {adjective}-{type}. For example a Group is a structure with a binary operation and it is not specified if a.b=b.a. A commutative or Abelian group is just a like a group but it is indeed specified that a.b=b.a, but no one is arguing that “arbitrariness of commutativity” is the fundamental defining feature of a group… For more color: I would consider MultiSets and OrderedSets to be specifications of general Sets. I would consider a MultiOrderedSet to be a type of Set that is similar to (or identical to) a Sequence or List. My guess is I would be in the majority with this opinion. I think the general point is that properties of objects are typically thought of in a positive sense. I.e. an object is a certain type because it HAS certain properties. It is not typically the LACKING of properties that determines type-hood. I’m sure there are many counter examples but I think it’s a fair general trend.

(4) The question of how to handle equality between OrderedSets and other collections is an interesting and important implementation detail. Should the ordered set {('a', 1), ('b', 2)} equal the ordered set {('a', 2), (b, 1)} or the unordered set {a, b}? How about other conventions? I’d be inclined to say no in both cases but my opinion isn’t strong. But I also probably wouldn’t make a decision on that and would follow whatever convention exists between OrderedDict and dict (or maybe whatever convention existed before dicts were ordered?) for consistency.

(5) Do you make the same argument for ordered dicts? That arbitrary ordering of keys in key-value pairs is the fundamental property of dictionaries and that naming the OrderedDict that way was a mistake and making dicts ordered was a mistake? If not why? What is different between dicts and sets for you?

(6) Multiple packages already use the name OrderedSet and people are used to it, so I think the ship has probably sailed on how such an object would be named. Debate could still be had as to whether that name was a mistake or not but it feels unlikely to change the outcome.

(7) I think the naming convention is a bit off-of-the-specific-topic of if something that we’re referring to as an OrderedSet should be added to stdlib. But, that is, if it is decided that OrderedSet should be added, and if enough people have issue with the name, it is a discussion that needs to happen at some point. I’d prefer to not have it bog down the “should-it-be-included-in-stdlib” discussion at the moment but here I am engaging with it.

ajoino · October 23, 2022, 4:37pm

Yeah, I do recognize that I’m in the minority, and the name won’t keep me from using something if I see the need. Was a fun conversation for me and I will try to relieve my boredom some other way next time

Moving on, I’m not at a computer right so I can’t confirm but I think that you have to explicitly convert dict.{keys,values,items}() (or dict itself) before comparing it to another container type. Maybe OrderedSets should compare like sets to any set type and require explicit conversion for other types? Is this even an important question?

pylang · October 24, 2022, 9:37am

Just to be clear, in the same link you provided, it shows the modern way to emulate an ordered set in Python. Modified from that post:

>>> words = ['foo', 'bar', 'bar', 'foo', 'baz', 'foo', 'quuz']
>>> d = dict.fromkeys(words)
>>> list(d)
['foo', 'bar', 'baz', 'quuz']

The result has both order and unique elements, which are properties of an O-set.

Well ok, but a list doesn’t have normal set properties/attributes. Ah yes, but dict keys are already set-like:

>>> mask = set(["foo", "baz"])
>>> d.keys() - mask
{'bar', 'quuz'}

It’s true, there is a trade-off of not having both order and set properties at the same time, but as you can see, we can work around that pretty easily. Can you give examples where we need both order and set properties at the same time? I think you would need practical applications to help the argument of adding an OrderedSet to the stdlib.

Pyprohly · October 24, 2022, 12:01pm

A couple people have asked now about the practical applications of an ordered set. Me and @jagerber have already shared our 3 personal, unique uses cases for OrderedSet. To summarise them:

It is a good choice as the backing collection for the observer pattern.
It can be used in seen.add(idn) style algorithms to limit the size of the seen set by pruning earlier items.
It is useful for various user input processing applications in order to give users a ‘consistent and predictable’ display.

I would also like to highlight another use case: how to remove duplicates from a list while preserving order.

Removing duplicates from a list: list(set(lst)).
Removing duplicates from a list while preserving order: list(dict.fromkeys(lst)).
Removing duplicates from a list while preserving order, using OrderedSet: list(OrderedSet(lst)).

So with OrderedSet we gain a more obvious way to solve this common problem. We no longer need to go from set to dict; we can simply think in sets now.

I would also like to reiterate the non-practical applications of adding OrderedSet:

It sits nicely next to OrderedDict.
The implementation is easy because it can piggy pack off of dict which is ordered.
Java has an ordered set (LinkedHashSet).

jagerber · October 24, 2022, 4:07pm

Another vote from the maintainer of python ordered-set package for including OrderedSet in stdlib. The owner also indicates the implementation in stdlib should probably be different than the implementation in that package and that it should derive from new default dict orderedness. She is open to discussing and using some parts of the ordered-set package implementation if there is anything of interest.

github.com/rspeer/ordered-set

Interest in including OrderedSet in python standard library?

opened 07:10PM - 22 Oct 22 UTC

jagerber48

Hello, I am interested to know if the maintainers of this package would have int…erest in including this functionality in the python standard library. It seems natural for `OrderedSet` to sit next to `OrderedDict` in the `collections` package. There is a discussion on the Python Discussion Forums. It is not clear-cut that it should be added but I wanted to gauge the interest of the folks maintaining the current Python ordered set implementations. https://discuss.python.org/t/add-orderedset-to-stdlib/12730/15

pf_moore · October 24, 2022, 4:55pm

So, so far, two 3rd party implementations who don’t want to see their code go into the stdlib, but who are in favour of someone else writing a stdlib implementation^[1]? That’s perfectly fair, but still leaves us with the question of who would write (and maintain) the stdlib implementation.

This feels to me like a borderline case at the moment. No-one is particularly against the idea, but no-one is offering to do the work. I suspect that this will go nowhere unless/until someone writes an implementation.

Maybe at this point the next step should be to create a PR adding OrderedSet to collections. That might get some core dev attention and maybe a sponsor for a PEP.

Which presumably would mean they could recommend people use the stdlib version rather than theirs in future… ↩︎

storchaka · October 24, 2022, 5:24pm

Before moving on, find the answer to a few questions for yourself:

Why was not OrderedSet added together with OrderedDict?
Why was not the builtin set made ordered at the same time when the builtin dict was made ordered?

Note that both the builtin set and collections.OrderedDict have the same author.

toddrjen · October 24, 2022, 11:31pm

I found this explaining why sets aren’t ordered: https://groups.google.com/g/dev-python/c/QYwuyo_qQRE

But I haven’t had any luck determining why collections.OrderedSet was not implemented, and there was a lot of support for it in the above thread. Do you have a link to that discussion? I can’t find it in python mailing lists nor the github issue tracker

pylang · October 25, 2022, 12:19am

Lacking Osets

I’ve summarized the latter discussion on insertion order in sets elsewhere on SO. That response has some overlap with the link above. The rest of the SO post has some discussion and other resources germane to this topic. I add those links here only to avoid retyping and help others gather refs.

Math + Python

Also on SO, there I posted a short comparison between mathematical data structures, related Python implementations and their guaranteed properties.

Container | Ordered | Unique | Implemented
----------|---------|--------|------------
set       |    n    |    y   |     y
oset      |    y    |    y   |     n
list      |    y    |    n   |     y
mset      |    n    |    n   |     n*

In short - the OrderedSet (oset) being discussed is not implemented in Python, but dict.from_keys() is related. Neither is the Multiset (mset) implemented, although the collections.Counter is a similar cousin (also dict-based).

Conclusions

If OrderedSet is added on the basis of mathematical symmetry or completeness, then ensure that same reasoning aligns with why MultiSet should or should not be added into Python’s Standard Library.

Again, I think the modern dict is an elegantly, sufficient implementation that thankfully serves as a common-ground between three additionally distinct data structures: OrderedDict, OrderedSet and Multiset.

toddrjen · October 25, 2022, 12:34am

Yes, that links back to the same thread I linked to. The thread provides good reasons why the default set implementation shouldn’t be ordered. They don’t seem to address why collections.OrderedSet isn’t a good idea, other than that some people don’t personally have a use for it or they think that mathematically it doesn’t make sense.

vbrozik · October 25, 2022, 7:55am

I did not notice that ordered sets were asked for because of mathematical completeness. I noticed that they were asked for because of practical use.

I do not see much practical use for multi sets. What special properties of multi sets would be practically useful? I mean properties which are not available with existing list and collections.Counter.

I do not understand the argument that ordered sets would not make sense mathematically. You can make the same operations with ordered sets as with unordered sets by just ignoring the additional property of items order. We could argue a similar way that int does not make sense mathematically because it has an additional property - you can get id of an int object. (I understand that there are also other differences between mathematical integer and python int.)

steven.daprano · October 25, 2022, 9:45am

Because the Python core developers are overworked, and we are risk-adverse when it comes to adding new features to the language. If we don’t add something, we can always add it later. But if we do add it, we’re likely stuck with it, if not forever, for a long time until we can go through a long deprecation period and removal.

Consequently, we tend to add new features only slowly, when it becomes really obvious that they are needed and not better suited to a third-party library.

Some history:

the first version of sets was a pure Python wrapper around dicts;
the next implementation of sets was built in;
it is difficult to have a good, reliable, fast dict implementation which preserves order, which is why dicts were unordered for a long, long time;
when we finally came up with a good ordered dict implementation, that implementation unfortunately does not carry over to sets;
so making builtin sets preserve order is still not practical.

But we could steal the pure-Python set implementation from Python 2.5 or 2.6 (I forget which), and repackage it as an OrderedSet.

I don’t think that would take much work. It was working and well tested, it just needs to be renamed and updated to Python 3, and new tests added.

pylang · October 25, 2022, 1:20pm

Yes, math has been discussed a few times here. My point is, if the discussion goes down that route, explain why OrderedSet has more relevance to be added than say a Multiset.

I’m not advocating adding either to the Standard Library. A “dict” is good enough.

Side note: msets actually are pretty useful. We’ve been benefiting from msets through collections.Counter, which practically is a multiset