Sorting ordered dicts

Yes, that’s actually what I’m advocating for.

Oh, so id(d) before and after sort would return different memory addresses ? Other variables (or attributes) holding the dict would still hold the unsorted version ?
I think not.

Still wrong… this is the dict copied and ordered by value. Nothing to do with the original use case : having a dict initially sorted (in-place is better), then edited, then iterating the updated values following the former ordering.

Slow compared to what ? It would still be faster than any alternative listed here, by omitting unnecessary operations - with the additional advantages of being in-place as far as the public API can see, which is a small feature in itself.

Even though id(d) is not changed, sort is executed temporal array, and whole linked list will be reconstructed. Generally speaking, “in-place sort algorithm” doesn’t need temporal list + full reconstruction.

And that’s why your “in-place sort” would be much slower than simple “out-place” dict sort.

I can not understand your use case. I stop discussing about it.

Compared to:

items = sorted(d.items(), key=itemgetter(0))
d.clear()
d.update(items)  # This is what you call "in-place"!

Try implementing it yourself. Then you would understand how many operations are needed to reconstruct linked list in OrderedDict.
Both of Python implementation and C implementation are good. They are different from each other, but both need significant overhead for just reorder the linked list.

2 Likes

So, I did some tests, which are too long to copypaste here, so here it is (a gist with 5 different implems I’ll detail here).

  • Option A is self.update(sorted(self.items(), key=itemgetter(0))), which some here said it would be the quickest. Its drawback is, it doesn’t support the key= parameter.
  • Option B is sorting the keys using sorted() (which all the following options will do as well), then mapping them with their values, then passing that through update().
  • Option C is @storchaka’s : calling move_to_end on each key in sorted order.
  • Option D is the same, but without calling move_to_end, and instead executing a copypasted version of its code without the unnecessary if (since we always move to the last end).
  • Option E is a simplification of the previous one, avoiding unnecessary writes and rewrites of pointers in the double-linked list.

Here is my timing code:

import timeit
import random
import collecs # my other file is collecs.py

def time(d=None, /):
    if d is None:
        d = collecs.OrderedDict((random.randrange(1000), object()) for _k in range(100))
    sd = d.copy()
    print(tuple(d))
    sd.sort_C()
    print(tuple(sd))

    rv = {}
    for let in "CDE":
        rv[let] = timeit.Timer(f"dc.sort_{let}()", setup="dc = d.copy()", globals=dict(d=d)).autorange()
        dc = d.copy()
        getattr(dc, f"sort_{let}")()
        if dc != sd:
            print(f"sort_{let} failed")
            print(tuple(dc))
    return rv

I know, it only tests one type of key and one size of dict, but it’s a very basic version.
Also, you have to edit your collections/__init__.py to make it use the pure-python version instead of overriding it with the C version, otherwise implems D and E don’t work. Anyway.
As I tested it on 3.9, it turns out options A and B don’t work : update() doesn’t update the order of the elements. I’ll try again on later versions of python when I find the time (pun intended). It turns out I forgot the .clear was required before .update. Anyway.
The results are as follows :
A:(5000, 0.2827916), B:(5000, 0.2619572), C:(10000, 0.2920622), D:(10000, 0.21275239), E:(20000, 0.3542883)]

So, contrary to your predictions, it seems that creating a new dict is the worst way, and that my idea of optimizing Serhiy’s idea is the best performance-wise. And I’m not sure how in-place you would consider it, @methane, but as far as the public API goes it certainly is.

Or just add a list, and that’s basically option A without the entire self.update call.

I posted this as a proof-of-concept PR, so the tests would run : none failed (but at the same time none call the method, so…) (and actually it seems there may not be any test on the OrderedDict class ??).

There are some objections from @rhettinger on Github, I’ll answer them here.

This is a permanent API change just to optimize the idiom OrderedDict(sorted(od.items())) or for k in sorted(od): od.move_to_end(k).

Yes, more like the second one in that it’s in-place.

AFAICT people rarely use those idioms and when they do it is never speed critical; otherwise, they would be using a regular dict which already has much better performance than collections.OrderedDict.

I think there’s a chicken-and-egg situation here : people don’t use it because there’s no optimized way to do it - and also, in my opinion, because the for-move_to_end idiom itself is not obvious. Having an easily linkable OrderedDict.sort could change that a lot.

Also, collections.OrderedDict is close to being a dead part of the collections module. Mostly It is only useful for being backwards compatible to versions before Python 3.6 and for creating LRU cache variants.

Lastly, it isn’t desirable to have API to diverge further from the regular dict.

I’ll merge those into one, because I don’t see why it wouldn’t be desirable to add things to OrderedDict per se, other than for keeping it “dead”.

Yes, the OrderedDict is mostly dead as it stands. Not as much as the UserX classes, since it does have the move_to_end method and the popitem(last=False) feature, but I agree it’s still quite comatose. I’m proposing we change that, and maybe not even in a big way. Why would changing its deadness be a bad thing at all ?
I understand the fear of adding a new feature which will have to be maintained, but not the link between that and the deadness of the class.

In a way I think this is indirectly asking why this should be done on OrderedDict rather than the builtin dict : primarily sorting is related to ordering, so it fits more the purpose of OrderedDict to have its ordering being changed, than the simple dict.
Furthermore, according to the thread I linked in the OP, there are performance and implementation issues with implementing a dict.sort method. On the opposite, this implementation is - as far as my benchmarks went - the quickest available, period, seemingly even better than creating a new dict by sorting the items. My intuition (for now) is that taking aventage of the doubly-linked list, while being a waste of memory, lets us gain time as compared to the builtin dict. The lock-during-sort feature isn’t a problem in any of the proposed implementations, as was the fear for a dict.sort method.

Please share your benchmark in complete, runnable form using gist or github repository.
I can not believe benchmark result no one can reproducible.

Wait, are you sorting OrderedDict??
My prediction is sorting dict is much faster than sorting OrderedDict.

Please make an effort to stay relevant to the topic, everybody here talks about OrderedDict, it’s even in the name of the thread. We were talking about doubly-linked lists, that was always the case of OrderedDict and never of dict, so why did you think we weren’t talking about sorting OrderedDict ??

My benchmark is in a gist, and the code to test it I copypasted in a message above. Just look up.
It won’t work until you edit your collections/__init__.py file to make it not use the C-boosted implementation though.

Do you mean you need to use OrderedDict even if sorting dict is much (say 10x) faster than sorting OrderedDict?
Why you can not use dict?

Because as I said in the OP, another thread is dedicated to sorting dict and it poses other issues - implementation issues apparently - so this is about taking the other approach, OrderedDict.
Also, I only have a python implementation for what I want to do, even though if it were accepted it would eventually get a C version. So benchmarking my implem against C-based dict is comparing apples and oranges : of course it will be faster to compare execution of a C code with a Python code, but it gives no insight whatsoever as to the performance of a C version of my implem, on OrderedDict, versus your (plural your) ideas using the native dict.

Even conceptually it’s not trivial : sorting a native dict requires entering n keys into the mapping after pulling them out, whereas sorting an OrderedDict only requires setting 2*n links. It’s not obvious to me that the one would be faster than the other.
That’s the intuition I shared earlier : I suspect the OrderedDict structure, while heavier in memory, allows us to gain time as compared to dict in this sorting use-case.

I think it is a bad motivation.

Since OrderedDict uses 2x memory and slow to create than dict, adding sort method only to OrderedDict makes sense to me only if:

a. Only OrderedDict have strong use cases, or
b. OrderedDict.sort() will be much efficient than dict.sort().

You didn’t explain why OrderedDict is important for your use case.

(b) is the only reason to add sort method only to OrderedDict to me.
That is why I compared dict.sort() and OrderedDict.sort() performance. It is very relevant to this topic.

OrderedDict.sort() requires n times lookup + setting 2*n links. Dict.sort() requires n insertion.
I don’t know which is faster, but I would not expect OrderedDict.sort() to be more than twice as fast as dict.sort(). Both are O(n * log n) anyway.

2 Likes

Not for that, I was sort-of expecting the reverse, that the new feature had the potential to attract people into using OrderedDict - not that it’s a goal in itself either. That, and sorting being related to ordering and ordering to OrderedDict which makes the feature “fit” more with OrderedDict as a concept. So, no for your (a).

As for performance benchmarks, I’ll say again : you can’t compare C and Python. If we want a relevant benchmark, my OrderedDict.sort needs to be ported to C language. But I don’t think I’m able to write that, and I’m certainly incapable of compiling it.

Of course. I just wanted you to realize that OrderedDict.sort() needs to do more than just update the double linked list.

What else does it need to do? Doesn’t move_to_end also only update the double linked list?

C implementation of move_to_end does:

  1. Lookup key. (dict lookup)
  2. Find list node from key index
  3. Move the node to the end.

Since list node position is tightly coupled with internal index of the key, OrderedDict.sort() need to do this n times.

1 Like