Defaultlist: an optimisation for defaultdict

Hi all, thank you for taking the time to read this, and for all the great work in the Python community.

I’ve come across situations while coding where I want the behaviour of a defaultdict, but where the keys are unordered unsigned integers (essentially indices). When you know ahead of time how many elements the collection will hold, you can use [None] * N, which then gives fast access.

However, when you don’t know the size of the list ahead of time and insertions are not guaranteed to be ordered (element 1 may be inserted before element 0), the above approach doesn’t work. This seems like a pretty niche corner case, but I’ve found myself in this situation multiple times. Historically I have resolved it by using a defaultdict, but this seems inefficient for memory and is also slow due to pointer indirection and hashing (although hashing is less of a problem here, because hash(1) == 1).
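
For a concrete (contrived) sketch of the pattern I mean, using the defaultdict workaround:

from collections import defaultdict

# indices arrive out of order and the total count is not known up front
pairs = [(3, "d"), (0, "a"), (2, "c"), (1, "b")]

# the defaultdict workaround: correct, but it pays for hashing and
# per-entry overhead on what is conceptually just a growable list
by_index = defaultdict(lambda: None)
for i, value in pairs:
    by_index[i] = value

print(by_index[2])  # "c"
print(by_index[9])  # None: missing indices yield a default value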

Instead, I would like to propose adding a defaultlist data structure to the collections module, very similar to defaultdict except subclassing list instead. A quick Python implementation:

from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def _none_callable() -> None:
    # default factory used when none is supplied: missing slots become None
    return None


class defaultlist(list[T]):
    __slots__ = ("default_factory",)

    def __init__(
        self,
        default_factory: Callable[[], T] = _none_callable,
        __iterable: Iterable[T] = (),
        /
    ):
        super().__init__(__iterable)
        self.default_factory = default_factory

    def __missing__(self, index: int):
        # grow the list so that `index` is a valid position, filling with defaults
        self.extend(self.default_factory() for _ in range(index + 1))

    def __getitem__(self, index: int) -> T:
        if len(self) <= index:
            self.__missing__(index)
        return super().__getitem__(index)

    def __setitem__(self, index: int, value: T):
        if len(self) <= index:
            self.__missing__(index)
        return super().__setitem__(index, value)
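
For example (assuming the definitions above), usage would look like:

dl = defaultlist(list)
dl[3].append("d")  # the list silently grows to cover index 3
dl[1].append("b")  # index 1 already exists at this point
print(dl)          # [[], ['b'], [], ['d']]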

I’m keen to hear the community’s thoughts on this, and happy to make the contribution myself if it’s deemed useful.

This seems like a niche case not well suited for the stdlib. What’s wrong with just using your implementation above in your own code?

It appears that the pure Python implementation is considerably slower (in a bad case) than defaultdict for insertion and retrieval. I could certainly implement it in C and release it on PyPI, but I thought it might be useful to more people. I do agree that it seems quite niche.

Edit: in fact, there is a pure Python implementation on PyPI, but its last release was in 2017.

“Seems”? That doesn’t sound like you measured a real problem you were having…

> That doesn’t sound like you measured a real problem you were having

Here’s a quick benchmark using pympler.asizeof:

from collections import defaultdict
from pympler import asizeof

for i in (1, 10, 100, 1000):
    default_list = defaultlist()
    default_dict = defaultdict()
    for j in range(i):
        default_list[j] = "something"
        default_dict[j] = "something"
    print(f"size of defaultlist for {i} entries: {asizeof.asizeof(default_list)}")
    print(f"size of defaultdict for {i} entries: {asizeof.asizeof(default_dict)}")

Gives the following results:

size of defaultlist for 1 entries: 96
size of defaultdict for 1 entries: 328
size of defaultlist for 10 entries: 192
size of defaultdict for 10 entries: 744
size of defaultlist for 100 entries: 1216
size of defaultdict for 100 entries: 7960
size of defaultlist for 1000 entries: 8608
size of defaultdict for 1000 entries: 69024

Even for Python, this is a significant waste.

Compared to what?

What are the numbers if you use a list() and a dict()?
Then you can see the cost of defaultlist/defaultdict over a list/dict.

@barry-scott I’m sorry, I don’t quite follow what you are asking here. Since the above implementation of defaultlist inherits from list (and defaultdict inherits from dict), I would expect defaultlist to use the same amount of memory as a similarly sized list.

The above compares the memory usage of defaultlist with defaultdict, which illustrates my main point: in a (niche) situation, a default list can be more efficient than a default dict.

I’m sorry if I’ve missed the point of your question.

While I find the idea of defaultlist interesting (perhaps a bit niche?), I also have a worry about its potentially misleading performance implications. The use case you listed is when keys are unordered unsigned ints and where a regular list won’t work, however it’s actually narrower than that. The keys must also be fairly “dense” (as in, no big outliers), otherwise the efficiency compared to defaultdict plummets. For example, compare the memory usage of defaultdict to defaultlist for the keys (5, 90000, 65, 187).

While this is fairly obvious if you know what is happening under the hood, I worry that it would add a trap for users who compare it to defaultdict and expect it to act like an efficient sparse list. But maybe that isn’t really a valid concern; I am not sure.
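
A quick sketch of that comparison, reusing the implementation and pympler from above:

from collections import defaultdict
from pympler import asizeof

dl = defaultlist(lambda: None)
dd = defaultdict(lambda: None)
for key in (5, 90000, 65, 187):
    dl[key] = "something"
    dd[key] = "something"

# the defaultlist materialises roughly 90,000 default-filled slots to hold
# four real values, while the defaultdict stores just the four entries
print(asizeof.asizeof(dl), asizeof.asizeof(dd))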

Btw, I believe there’s actually a bug here: shouldn’t it be for _ in range(index + 1 - len(self))? I think you currently generate far too many values on subsequent accesses.
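
That is, something like:

def __missing__(self, index: int):
    # only fill the gap between the current length and the requested index
    self.extend(self.default_factory() for _ in range(index + 1 - len(self)))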

Let me see if I can get my thoughts in order.

Your benchmark doesn’t match your use case, right?
Are you after a sparse array implementation?

You are measuring the difference between a list of N items and a dict of N items.
Also, the index/key is range(N), which is a dense range.

Changing defaultlist to list and defaultdict to dict, I see results like yours on my Mac with 3.12.
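
Presumably the modified benchmark looks something like this (a sketch; a plain list can’t assign past its end, so the dense in-order case appends instead):

from pympler import asizeof

for i in (1, 10, 100, 1000):
    plain_list = []
    plain_dict = {}
    for j in range(i):
        plain_list.append("something")  # dense and in-order, so append suffices
        plain_dict[j] = "something"
    print(f"size of list for {i} entries: {asizeof.asizeof(plain_list)}")
    print(f"size of dict for {i} entries: {asizeof.asizeof(plain_dict)}")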

% tool-python3 a.py
size of list for 1 entries: 144
size of dict for 1 entries: 312
size of list for 10 entries: 240
size of dict for 10 entries: 728
size of list for 100 entries: 976
size of dict for 100 entries: 7944
size of list for 1000 entries: 8912
size of dict for 1000 entries: 6900

In other words, you have measured the cost of the Python dictionary implementation.

Yes, thank you for pointing this out! I left it out of the original description, but it is true. The original use case I had for this sort of structure was mapping the indices of one list to the indices of another, so in that case it was dense.

Yes, absolutely! I hadn’t thoroughly tested the above code, so I was expecting some bugs. Thank you!

Thank you for clarifying; I think I see where the confusion has come from:

This was not the idea I was trying to propose. As pointed out by @Kacarott, the data in the list (for my use case) is quite dense.

I completely agree that for sparse entries a dictionary is far more efficient, as your tests show, but for dense entries (as I was trying to show) a list is far more compact.

The dense-data argument might be the final nail in the niche coffin, such that this doesn’t need to be in the standard library.

Maybe a better algorithm is to create a list of the estimated length, and grow it more aggressively (as hash maps or memory managers do) when a new index is out of bounds.
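
Something along these lines, as a sketch (the doubling factor is just an example):

def __missing__(self, index: int):
    # grow geometrically past the requested index so that repeated
    # out-of-bounds writes are amortised, as hash maps do when they resize
    target = max(index + 1, 2 * len(self))
    self.extend(self.default_factory() for _ in range(target - len(self)))

One side effect to note: len() of the list can then exceed the highest index actually written.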

If you have an estimated length, then you can provide it as the second argument to the above implementation (similar to passing a second argument to defaultdict).
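
For example (assuming the sketch above; estimated_length is a hypothetical name):

estimated_length = 1000
dl = defaultlist(int, [0] * estimated_length)  # pre-sized with default values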

I’m not sure what you mean by growing aggressively. If you mean something similar to a hash table’s load factor, then I don’t think the list would be any more efficient than a standard dictionary.

Sorry if that’s not what you meant by growing aggressively.