Add `OrderedSet` to stdlib

Can we have an OrderedSet class in collections?

An OrderedSet is a set that maintains insertion ordering.

The current workaround is to use dict but I thought that using a dictionary as a set was too ugly to use in my library so I wrote an OrderedSet wrapper around it. It would be nice to have collections.OrderedSet be a thing though.

 
Relevant:

12 Likes

… using a dictionary as a set was too ugly to use in my library …

My guess is you might need a bit more rationale if really want to add this to the Standard Library. :slight_smile:

And Merry Christmas!

I’m in favor of having OrderedSet somewhere in the standard library. Probably collections.

Suppose we don’t want to add dependencies to a projects just to support the OrderedSet collection type object.

If we don’t want external dependencies the best way I know to implement an ordered set (without writing a new class which would be annoying to do in multiple packages) is to use the keys of a dictionary. Say my ordered set contains 'apples' and 'oranges'. I can do

ordered_set = dict()
ordered_set['apples'] = None
ordered_set['oranges'] = None

ordered_set will then act in some nice ways like an ordered set. For example we can do things like for element in ordered_set... and it will behave as expected. One major shortcoming is when we want to printout the ordered set: print(ordered_set) will reveal the ugly dictionary structure. This ordered_set variable also lacks typical set syntax like ordered_set.add('bananas'). I doubt set operations like unions or intersections are easily available either.

There are perfectly acceptable external packages for the OrderedSet like ordered-set · PyPI and orderedset · PyPI and others. But all of these require adding an external dependency to a package for such a small purpose. It seems OrderedDict has been a part of collections for a very long time and now even dict objects themselve are natively ordered. Why do I need an external dependency for ordered sets but not dicts?

I don’t know much about performance considerations but my understanding is that the ordered sets may be slower than regular sets. I read in a StackOverflow post (don’t have the link right now) that the cython implementation orderedset package is 5x faster than the python implementation and 5x slower than regular sets. If performance is a concern it seems like it would make more sense to, at least for now, keep a seperate OrderedSet object around in collections rather than making existing set objects ordered (like how regular dict objects are now ordered).

I haven’t contributed to Python before. @pylang mentioned “you might need a bit more rationale”. Are there other types of rationale we should be thinking about other than what I’ve mentioned above?

1 Like

I don’t understand what an ordered set (which just seems like a fancy list to me?) is used for. Could someone share some applications where orderedsets are the best datatype for the task?

Could you please elaborate why this is a bad thing? External dependencies have many benefits, but only one major draw-back that i can think of (increasing the likelihood of a dependency resolution conflict, if the version is restricted).

An ordered set would be a data structure which supports the set API, but additionally guarantees that elements keep some specified order (usually insertion order).

Sets are most certainly not merely “fancy lists”. The most important part of the set API is fast membership testing, preferably O(1) (constant time) but sometimes O(log N) is acceptible. This contrasts with lists, where it is O(N).

For example, if colours is a list of strings, then "red" in colours will take time proportional to the size of the list. If you double the size of the list, it will double the average time to search it. If you increase the size of the list to be 100 times bigger, then on average it will take 100 times longer to search.

But with sets, the time is (approximately) independent of the size of the set. "red" in colours will take the same amount of time regardless of whether the set has ten entries or ten million entries.

That fast membership testing allows sets to efficiency support operations like intersection and disjoint union, which are possible with lists but slow.

Python’s sets are unordered: they do not guarantee the order of elements:

>>> print({"red", "blue", "green", "yellow", "orange"})
{'blue', 'red', 'orange', 'yellow', 'green'}

An ordered set would guarantee the order of elements. This is useful for many reasons:

  • having a consistent and predictable display;
  • making it easier to write doctests involving sets;
  • guaranteeing iteration order;
  • predictability of set.pop().

But adding order to a set doesn’t come for free: it requires a significantly more complex implementation, which is more work, probably requires more memory, and may slow down set performance.

11 Likes

I don’t dispute that external dependencies have advantages (yay, code I don’t have to write, or maintain!) but the idea that it only has one disadvantage is very naive.

Many organisations and people have strong requirements – including legal and statutary requirements – regarding the software they use. They may need to audit the software for quality, copyright and trademarks. They may have strict legal or corporate requirements for what they can use, with a long, slow and expensive (in both time and manpower) processes for getting external software approved.

Typically, such organisations may have already approved “the Python standard library” as an approved dependency (if they haven’t, they won’t be using Python and we don’t care about them!) but any external third-party library may be simply too hard to get approved.

As a project manager, you don’t control external dependencies. You can’t control API changes or set the priority of upgrades and bug fixes. You have to work around their timetable.

External dependencies can be taken away by legal (copyright and trademark disputes), or even if the developer in charge has a fit of pique.

The Javascript/Node.js community has a culture of heavy reliance on external dependencies, with predictable results, not just leftpad but others. Protestware can easily become malware.

Some Node developers talk about needing to rewrite projects after just 18 months due to the bitrot of external dependencies.

Fortunately the Python community does not have the same extreme culture of using external dependencies for trivial functions like leftpad, but the risks still exist.

I’m not saying these problems are insurmountable, but they are problems that need to be managed:

  • External code repos are a target for attackers.
  • Sometimes the maintainer of the package is the attacker. The next protestware that intentionally deletes files may target your country, not Russia.
  • Every external dependency is another point of failure for your project.
  • External dependencies run to their timetable, not yours.
  • And then there are dependency conflicts.

If the costs of external dependencies are less than the benefits, for you and your projects, then that’s great! But not everyone has the same cost/benefit tradeoff.

There is another class of people who cannot easily use external dependencies. Even in 2022, there are many people in the world who do not have cheap, easy, fast and reliable internet access. Access to external dependencies can be restricted or blocked, either by your own country or by Github themselves. Many schools and businesses prohibit the installation of “unauthorized software” on their machines.

Whether due to economics or politics or some other reason, the ability to just run pip install ... is not a universal privilege.

6 Likes

I still wish for OrderedSet.

I like to use an ordered set as the backing collection type when implementing the observer pattern. A regular set might be adequate for this, but I have the impression that most users would want predictable firing orders for handlers. If it were included in the standard library, it would have been a good candidate as the backing collection type for Future.add_done_callback(), and Future.remove_done_callback() would be O(1). Admittedly, a lot of observer pattern implementations, such as C#’s event keyword, just use an array/list.

Another use case I had for this collection type is for monitoring changes in some external listing. Fast membership testing is beneficial here. But for this scenario, I need something like a BoundedSet which is an OrderedSet with a maximum size, similar to what collections.deque has.

I never said that sets are fancy lists, but ordered sets seem like fancy lists. It’s not very precise, but I think the rest of your answer confirm what I was thinking. I cannot remember ever needing this kind of data structure myself, but maybe I had some problem where it would have solved my issues :thinking:

I thought some more about what I meant by “fancy list”, and I think it boils down to the two essential (?) properties of a set, all elements are unique and unordered. To me, the most important of these properties is that the elements are unordered, which means that, to me, an ordered anything is not a set (insert something about ZFC and how everything reaaly is a set), and thus an “ordered set” is, to me, more of an “unique stack” or whatever other data structure you want it to be like. However, if you think that the more important property is uniqueness, you probably think my intuition is wrong. Doesn’t matter really since both properties are equally important in defining a set.

@steven.daprano Thank you so much for your great overview of the downsides of external dependencies. Indeed I just submitted a request with the security team at my organization to include external ordered set packages in our internal repo which will take some time to process. Getting the dependencies included on commercial projects will also require time and discussions. Another downside I encountered is, because of whatever quirk of my environment I ran into C related errors when I tried to pip install the cython orderedset package. Suffice it to say I am having to pay a time cost to include OrderedSet in my project. And the question remains, why must I pay a cost for OrderedSet but not OrderedDict?

@ajoino I don’t think the need for ordered sets needs to be justified. There are already packages available dedicated to providing ordered set functionality so clearly folks want them for something. Nonetheless, I’ll provide my use case for posterity and conversation fodder. I have an object which has tags (strings) that are dynamically added to the object. But the tags are unique so it is very convenient to have a tag.add('tag_0') api. To get uniqueness with a list I would have to do a manual check, and as has been mentioned already, there is a performance hit with a list compared to a set (though performance at this junction is not a concern for my specific present use case). Now, these tags are generated in the order of specific user input data and after being generated, the tags will be used to label data processing output (printouts, plots, etc.) and, for convenience, the user will expect the data printouts to be in the same order as the tags were generated. This could of course be accomplished without ordered sets (for example I could include new attributes which remember the order of tag inputs and then pull these in later) with slightly more overhead but ordered sets are a very natural and convenient solution.

So for my use case I would like a data structure whose elements are unique and whose elements are ordered. Lists and tuples are ordered but do not have unique elements. Sets have unique elements but are unordered. So my use case calls for a different collection type, the OrderedSet which many others have found uses for. I don’t see any reason why this container type shouldn’t be in collections next to OrderedDict. wrt the “fancy list” conversation: OrderedSets are fancy lists in that they have element uniqueness and OrderedSets are also fancy sets in that their elements are ordered. But the fact that ordered sets are “fancy something elses” doesn’t, to me, have any bearing on if they should be included in the standard library or not…

The need for ordered sets does not need to be justified, but the need for them to be in the standard library does. Given that many other packages, with significantly wider usage than any of the ordered set libraries, are not in the stdlib, what makes ordered sets a better candidate for stdlib inclusion than (say) urllib3, more-itertools, or packaging?

Also, which ordered set library are you proposing? An existing one? Does the maintainer even want to offer it for the stdlib? A new one? Who will write and maintain it, and why is it worth the effort of someone rewriting functionality that already exists?

3 Likes

Earlier in the thread @jagerber showed how to use a dict as an ordered set. Sets in Python where originally implemented using dicts. There’s no reason the same couldn’t be done for OrderedSet. In fact, you could just go back to the latest Python branch where sets were implemented in pure Python, 2.7 I think:

Since dictionaries are ordered (implicitly since 3.5 or thereabouts, explicitly since about 3.7), your sets will also be ordered.

4 Likes

Thanks for the comment. Yes, I was strictly saying “there are clearly use case for ordered sets” in response to the other poster. I agree that including them in standard library needs to be justified. That’s what we’re trying to work on in this thread.

I’ll start off by re-iterating that I haven’t contributed to Python before, so I don’t know that the cultural norm/bar is for including something in the standard library and what sorts of considerations are typically made. I think there’s evidence here that there would be some good benefits. I don’t see any downsides to including it other than the effort that has to be put in. If anyone has downsides other than the effort to include it I’d really want to hear those points.

My main argument for why OrderedSet is a good candidate to include compared to those other libraries is that OrderedSet feels very qualitatively similar in many ways to OrderedDict which has been included for a long time, and at least to someone who finds themselves having a need for an ordered set, and who has a cost associated with including external dependencies, the lack of OrderedSet next to OrderedDict feels like a hole/oversight in the collections library.

To address your specific questions: I haven’t done deep research into the different options yet. I’m relying mostly on this Stack Overflow question and answer: Does Python have an ordered set? - Stack Overflow. The main options I see:

  • “orderedset” cython pacage feels like a good option since it is probably most performant.
  • The pure python “ordered-set” package would probably be good.
  • I haven’t looked into the implementation in collections-extended.
  • Reverting to having OrderedSet derive from dict or OrderedDict as sets used to in older versions of python.

I can reach out to the maintainers of the first 3 to see if they have any interest in having their implementation included in the standard library.

I don’t know who will write the code. I’m open to helping with that effort but would have to learn a lot. I don’t know who typically writes code like this, I guess folks here.

The code would be maintained by the same person/people who are currently maintain the collections package. I could reach out to them as well (hints as to the best avenue to do this would be appreciated, I assumed this discussion forum would get me close…) to get their input.

Again, whatever the justification was for including OrderedDict in collections, or making regular dicts ordered, could probably be copy and pasted to justify the inclusion of OrderedSet.

The mere existence of the code in the stdlib is a burden. It needs to be maintained going forward.

Yes, needing to maintain code is also a burden. So the costs are (1) time/energy to implement code in standard library and (2) time/energy to maintain code in the standard library. These are the costs of any piece of code that would go in the standard library and clearly some code does get added.

What are the usual metrics used to determine whether something is included in the standard library against the unavoidable cost of actually adding and maintaining the code? Where does the discussion go from here? Like I said above I will contact the maintainers of the existing packages. Is there a way to get in touch with specific folks who maintain the collections package, or are the people posting here already those people?

Does this discussion forum act as go/no-go gate on ideas like this or is there somewhere else decisions like that are made?

I never asked for justification, was curiously asking what the use cases are because I had never needed anything like it. Also, please don’t get hung up on the wording “fancy list” (my bad too since I used in 3, now 4, posts), my point was that the data structure described in this thread is, to me, not a set at all. “Fancy list” was just a placeholder for “any ordered data structure with a uniqueness requirement”.

I propose that if this object gets added to the std lib, it should be named “UniqueStack”, “UniqueList”, or something less set-related. I take no stand on whether it should be added or not.

But it is a set – it would have all the set methods, isinstance(orderedset, set) should be True – the only difference would be preservation of insertion order. Why call it anything that doesn’t have set in the name?

5 Likes

Not entirely, but to some degree.

The normal process to add something like a new data structure is:

  • Gather as much community feedback as you can, at a minimum here on Discuss or the Python-Ideas mailing list. (You can also discuss it on places like Reddit, or any other forum you like.)
  • If the community seems to get behind the idea, or at least not be strongly opposed, the next step is to ask for a sponsor from the core developers.
  • If at least one core dev is willing to act as sponsor, you (or somebody) should write a PEP proposing the feature.
  • The PEP then gets sent back here for additional rounds of feedback.
  • If there is sufficient community interest in the PEP, the author then asks the Steering Council to accept the PEP.
  • If they accept it, then somebody (often the PEP author) will implement it and add it to the std lib.

Note that acceptance by the Steering Council does not necessary mean that there will be a volunteer willing and able to do the work needed to push the data structure into the std lib.

@steven.daprano Thanks for that information. I think that’s pretty clear. So really the question, if I want OrderedSet in python stdlib is what information needs to go into a good PEP for that feature I guess PEP 372 – Adding an ordered dictionary to collections is a good model to go after. I guess I have more thinking to do on this but welcome others’ thoughts.

By way of update, the author of the cython orderedset package responded (Interest in including OrderedSet in python standard library? · Issue #37 · simonpercivall/orderedset · GitHub) indicating that that package should not go into stdlib because it is cython based but did cast a vote for including an implementation of OrderedSet in the stdlib and that its implementation should be based on (now ordered) python dictionaries.

1 Like