Random choice from a dict

matthiasgoergens · August 1, 2022, 5:16am

At the moment, random.sample from a dict gives a reasonable error message. When running random.choice on a dict, at best you get a confusing KeyError, and at worst you get silently wrong results. Examples:

>>> random.choice({0: 'a'})
'a'
>>> random.choice({1: 'a'})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.10/random.py", line 378, in choice
    return seq[self._randbelow(len(seq))]
KeyError: 0
>>> random.choice({'foo': 'bar', **{i: str(10*i) for i in range(10)}})
'90'
>>> random.choice({'foo': 'bar', **{i: str(10*i) for i in range(10)}})
'10'
>>> random.choice({'foo': 'bar', **{i: str(10*i) for i in range(10)}})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.10/random.py", line 378, in choice
    return seq[self._randbelow(len(seq))]
KeyError: 10
>>> random.choice({'foo': 'bar', **{i: str(10*i) for i in range(10)}})
'50'

As a first simple suggestion, I propose adding the error handling from random.sample to random.choice. As a reminder on my system random.sample’s error handling looks like this:

>>> random.sample({}, 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/random.py", line 466, in sample
    raise TypeError("Population must be a sequence.  For dicts or sets, use sorted(d).")
TypeError: Population must be a sequence.  For dicts or sets, use sorted(d).

I have a proof-of-concept of this change.

Second, converting your dictionary to a list is one way to get a random choice. But it costs O(n) time and space. I have another proof-of-concept to get O(1) time and space random choice for dicts.

My proof-of-concept works as-is with today’s CPython dicts, but, alas, exploits an undocumented feature in their stable C-API.

There have bne previous suggestions to make random.choice work on dicts, but they were either to do an implicit conversion to a list or to use reservoir sampling. Both implicit conversion to a list and (naive) reservoir sampling take O(n) time.

So I agree with the rejection of these issues. Both because of the reasons mentioned in their tickets, but also because you can easily implement your own versions of these functions that do either implicit conversion or reservoir sampling in a library. No change to the standard library or dict implementation is necessary.

My suggestion is to enable random choice from a dict in O(1) time. Without any conversions, neither implicit nor explicit.

(To be completely clear, my proof-of-concept needs some polish to deal properly with very sparse dicts, ie dicts that used to have lots and lots of entries but recently almost all of them got popped.)

methane · August 1, 2022, 5:40am

It is O(n), not O(1).

        if PyDict_Next(d, &pos, &key, &value) and x + 1 == pos:
            return (x, <object> key, <object> value)

matthiasgoergens · August 1, 2022, 5:43am

That’s O(1) amortised (if your dict is not too sparse, eg say at least 10% filled).

Btw, this is basically the same logic that random.randrange from the standard library uses. It makes use of this function:

def _randbelow_with_getrandbits(self, n):
    "Return a random int in the range [0,n).  Defined for n > 0."

    getrandbits = self.getrandbits
    k = n.bit_length()  # don't use (n-1) here because n can be 1
    r = getrandbits(k)  # 0 <= r < 2**k
    while r >= n:
        r = getrandbits(k)
    return r

Yes, random.randrange (and thus also random.choice) in the standard library takes infinite time in the worst case, but O(1) in the average case.

I already mentioned the assumption of dicts not being too sparse:

(To be completely clear, my proof-of-concept needs some polish to deal properly with very sparse dicts, ie dicts that used to have lots and lots of entries but recently almost all of them got popped.)

You can detect a dict that eg less than 10% filled in O(1) time before the call, and cause a resize.

That would mean that random.choice on a dict would be O(1) on average (and when amortised with any series of dict operations), but could take O(n) in the worst case. Very similar to inserting into a dictionary, or even lookups, which could also take O(n) in the worst case, because of collisions.

methane · August 1, 2022, 6:06am

You are right. My mistake.

Could you give us a link to previous discussions?

Was O(n) only reason for rejection? Did we have enough use cases for adoption?

matthiasgoergens · August 1, 2022, 10:42am

No worries about the the asymptotic complexity.

Steven D’Aprano kindly provided a few links to previous discussions. Unfortunately, the forum software only allows me to put at most two links in this post. They are issue39626, issue37708, issue37682 and issue35094 on the Python bug tracker.

Was O(n) only reason for rejection?

I saw two recurring themes in the discussions: the suggestions of implicit conversion were dismissed as being implicit rather than explicit. For random.sample, the suggestion of reservoir sampling for iterators took less memory than manually converting to lists, but was rejected because the (naive) version of reservoir sampling they tried was to slow: it still O(n) runtime, and had worse constant factors than what’s currently implemented.

(And my own opinion is that both of these can easily be provided by a third-party library, and don’t need support from Python itself nor its standard library.)

Did we have enough use cases for adoption?

I came across this when I was implementing some algorithms for a project. I naively assumed that random.choice would work on dicts, and then was greeted with occasional weird KeyError exceptions. (The test data I had tried at first happened to have consecutive ints as keys.)

There’s also this stackoverflow question that has a few upvotes, and quite a few other questions that got closed as duplicates of it.

I eventually managed to work around the problem in my algorithm, but it made things more annoying.

Thus my minimum suggestion is to copy the error handling from random.sample to random.choice, just so that people are aware of the problem. As an extended suggestion, we can offer amortised O(1) random.choice (and efficient random.sample) on dicts right now without any changes to any other dict operation nor any overhead for any other dict operation—along the lines of the proof-of-concept code I provided.

The main downsides are:

it’s more code that we need to maintain
it might constrain future changes to dict implementation in unforeseen ways

(I don’t foresee any new constraints that we are not already exposing by making dicts guarantee ordering.)

Instead of extending random.choice, you could also codify an API function that lets you query arbitrary slots in the underlying dict representation. You’d either get the entry, or get told that the slot is empty or invalid etc. That’s what PyDict_Next already does as an accident, but we could either guarantee that behaviour or add an extra function whose behaviour is guaranteed-enough and defined-enough.

That would have benefit of allowing you to build random.choice out of simpler and public mechanisms. And different people could build other useful things.

However the big downside is that this would expose even more of how dicts work internally. So I don’t think I’d recommend it. (But again, dicts guaranteeing ordering now already constraints their implementation quite a lot, in a way that almost forces any implementation to be able to support queryable slots for cheap or even free.)

An even stronger suggestion, that I have seen in some issues, is that you let users query the i-th element in your dict in insertion order. For dicts that had no deletions (since the last resize), that’s equivalent to making the slots queryable. But for dicts with deletions, that would require extra overhead to support.

I do not support this stronger suggestion for dicts. (However it might make sense in the future to support this behaviour for ordered dicts. They kind of lost their purpose with normal dicts being ordered, but they are the place to put API that has a bit of an overhead in return for getting better access to orderedness and perhaps indexing by order.)

oscarbenjamin · August 1, 2022, 11:26am

I don’t know how the internal structure of dicts works any more but I checked the PyDict_New docs:

ppos should not be altered during iteration. Its value represents offsets within the internal dictionary structure, and since the structure is sparse, the offsets are not consecutive.

To me that suggests that if you choose a random pos and then call PyDict_New it will skip over empty slots in a table until it reaches a non-empty slot. That implies that the probability of an item being chosen depends on the number of empty slots preceding it in the table. Since the number of empty slots presumably depends on __hash__ and insertion order in some complicated way I expect that this cannot be considered to be a uniform random selection of the items in the dict.

You can see one (rejected) suggestion to add reservoir sampling here:

github.com/python/cpython

Add a function to get a random sample from an iterable (reservoir sampling)

opened 12:10AM - 16 Jul 20 UTC

closed 06:06AM - 19 Jul 20 UTC

a35477f8-71c1-46d8-a00e-b2f89ce712bf

stdlib 3.10

BPO | [41311](https://bugs.python.org/issue41311) --- | :--- Nosy | @tim-one, @r…hettinger, @mdickinson <sup>*Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.*</sup> <details><summary>Show more details</summary><p> GitHub fields: ```python assignee = 'https://github.com/rhettinger' closed_at = <Date 2020-07-19.06:06:01.044> created_at = <Date 2020-07-16.00:10:31.020> labels = ['library', '3.10'] title = 'Add a function to get a random sample from an iterable (reservoir sampling)' updated_at = <Date 2021-07-02.17:34:09.940> user = 'https://bugs.python.org/oscarbenjamin' ``` bugs.python.org fields: ```python activity = <Date 2021-07-02.17:34:09.940> actor = 'oscarbenjamin' assignee = 'rhettinger' closed = True closed_date = <Date 2020-07-19.06:06:01.044> closer = 'rhettinger' components = ['Library (Lib)'] creation = <Date 2020-07-16.00:10:31.020> creator = 'oscarbenjamin' dependencies = [] files = [] hgrepos = [] issue_num = 41311 keywords = [] message_count = 17.0 messages = ['373733', '373742', '373771', '373781', '373794', '373828', '373853', '373857', '373861', '373864', '373865', '373870', '373894', '373928', '373930', '373943', '396870'] nosy_count = 4.0 nosy_names = ['tim.peters', 'rhettinger', 'mark.dickinson', 'oscarbenjamin'] pr_nums = [] priority = 'normal' resolution = 'rejected' stage = 'resolved' status = 'closed' superseder = None type = None url = 'https://bugs.python.org/issue41311' versions = ['Python 3.10'] ``` </p></details>

That uses O(n) time since it needs to iterate over the whole dict but it was timed to be slightly faster than building a list. Someone emailed me asking if I could “release” that code under a license so I put it here:

gist.github.com

https://gist.github.com/oscarbenjamin/4c1b977181f34414a425f68589e895d1

reservoir.py

# Copyright 2021 Oscar Benjamin
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in

This file has been truncated. show original

matthiasgoergens · August 1, 2022, 12:20pm

Thanks for having a look, Oscar!

I consulted the source code of the internal structure of dicts when writing that code. It is a uniform random selection.

Specifically:

Yes, if you select an arbitrary pos and then call PyDict_New it will skip over empty slots until it either reaches a non-empty slot or the end of the table.

Yes, the number of empty slots before an entry depends on insertion order and on deletions since the last resize. It has nothing to do with __hash__.

But that’s also totally irrelevant: my code checks whether any skipping occured, and retries until it hit an entry directly. That’s what the x + 1 == pos bit is for. All entries have the same probability of being hit directly. We discard ‘indirect’ hits.

Thanks for the link to the reservoir sampling implementation. I will have a look.

P.S. I had a look at the reservoir sampling implementation.

Nice! It looks like it should be using only about O(k * (1 + log (n / k)) random numbers or so. It still needs to consume the whole iterator in O(n) steps, of course.

In any case, that’s not what I am suggesting here. Especially since you can easily implement that as a third-party library with the only loss being convenience.

(My implementation does fiddle with the value of the ppos parameter, and thus violates what the docs tell you to do. That’s why it wouldn’t work reliably as a library. At least you wouldn’t get any guarantees between Python versions.)

oscarbenjamin · August 1, 2022, 3:19pm

I see now. Yes, that should be fine for uniform sampling.

Having consulted the code to see how dicts work these days I see that there is now a dense array of entries dk_entries that you could use directly so that you only need to call randrange once:

github.com

python/cpython/blob/d965d1995ec3efbbf666c2e7fab02217744d08e7/Objects/dictobject.c#L11-L32


      
          This implements the dictionary's hashtable.
          
          As of Python 3.6, this is compact and ordered. Basic idea is described here:
          * https://mail.python.org/pipermail/python-dev/2012-December/123028.html
          * https://morepypy.blogspot.com/2015/01/faster-more-memory-efficient-and-more.html
          
          layout:
          
          +---------------------+
          | dk_refcnt           |
          | dk_log2_size        |
          | dk_log2_index_bytes |
          | dk_kind             |
          | dk_usable           |
          | dk_nentries         |
          +---------------------+
          | dk_indices[]        |
          |                     |
          +---------------------+
          | dk_entries[]        |

This file has been truncated. show original

You can access that using the DK_ENTRIES macro.

Perhaps an alternative suggestion would be for Python to have a public API for getting the nth item from a dict since dicts are now ordered and the implementation can make it O(1) but that capability is not exposed publicly.

matthiasgoergens · August 2, 2022, 12:14am

That’s a really good idea, Oscar! I like it. In fact, I like it so much, that this is what I have already implemented:

The public API does not want to give you access to dk_entries, but you can (ab)use that PyDict_Next is both part of the public API and also just a thin wrapper around dk_entries.

Yes, dk_entries is mostly dense. However it handles deletions by replacing entries with a placeholder. The placeholders get compacted away only on a resize.

(That’s a pragmatic implementation of deletion, and works willl with the other design decisions for dicts that already only make amortised runtime guarantees.)

Perhaps an alternative suggestion would be for Python to have a public API for getting the nth item from a dict since dicts are now ordered and the implementation can make it O(1) but that capability is not exposed publicly.

Yes, that was basically one of my alternative suggestions in the original post. However, for the current dict implementation you can only really give access to dk_entries in O(1), and then let the user deal with placeholders and the facts that deletions will not change indices, but will make them ‘wrong’, and that indices will change on resize.

If you want to give users access to the n-th item in O(1) amortised time, and also want to have that still work after deletions, you need either an auxiliary data structure (or you need to be smarter than me, and see how you can provide that access quickly despite deletions).

I’m not even quite sure whether we can provide that access in O(1) time even with an auxiliary structure (and with small enough constant factors), perhaps we need O(log n) time.

If some auxillary data is needed, I would suggest to put that functionality into ordered dict. Ordered dict by itself is a bit pointless nowadays, but I would think it’s a good place to put some extra overhead to support more functionality around order and ordered access.

ChrisBarker-NOAA · August 3, 2022, 1:03am

There was a long discussion about this on the old python-ideas list.

Personally, I think it would be handy – and yes, it’s O(N), but should be a small constant. I still don’t get why folks don’t want to provide a useful functionality because it’s not as fast as it people might expect it to be – when it’s still as fast a it could be – it’s not like there’s any other way to get the nth element of dict that’s better performing – and there are many that are less performant.

The strongest (IMHO) argument against adding indexed access is that folks may use for repeated sampling, and in that case, it would be better to convert to a list first, and sample that – so better not give them the poorly performing option. I see that point, but I think there are any number of more obvious ways that folks can get sub-optimal performance – why make this one a show-stopper?

Anyway, IIRC, getting a random sample was the most common use case that folks came up with, so maybe simply addressing that directly is the way to go, as the OP suggests.

My first choice would be reservoir sampling of arbitrary iterables, as that’s more generally useful, but if we aren’t going to get that, then +1 on adding it directly to dict.

ChrisBarker-NOAA · August 3, 2022, 1:35am

Just for fun, I wrote up a few pure python implementations fo a random sample from a dict. All O(1) – from the simplest to the most-folks-won’t-think-of-it version using islice. (I think I learned about the performance benefits of islice from the thread about this on python-ideas – probably from @methane).

these are timed with a new, dense dict, which may affect the timing a bit, but I think the trend is clear.

Interestingly, making a full list first is about as fast as naive iteration. But iterating with islice is notably faster.

I think it would be worth it to put something like the islice one in the stdlib, or at least as a recipe in the docs.

[I whipped these out fast, off-by-one errors are likely :-)]

import random
from itertools import islice

DICT = {str(i): i for i in range(100000)}

def sample_list(d):
    return random.choice(list(d.items()))


def sample_iter(d):
    n = random.randint(1, len(d)-1)
    items = iter(d.items())
    for i in range(n):
        val = next(items)
    return val


def sample_iter2(d):
    n = random.randint(0, len(d)-1)
    for i, val in enumerate(d.items()):
        if i == n:
            break
    return val


def sample_islice(d):
    n = random.randint(0, len(d)-1)
    return next(islice(DICT.items(), n, n + 1))

In [45]: %timeit sample_list(DICT)
5.75 ms ± 72.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [46]: %timeit sample_iter(DICT)
3.65 ms ± 151 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [47]: %timeit sample_iter2(DICT)
3.1 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [48]: %timeit sample_islice(DICT)
507 µs ± 19.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

matthiasgoergens · August 3, 2022, 4:58am

Could you link me to the discussion on the old mailing list, please, Chris?

I have a question going on at StackOverflow about whether we can give the indexed access faster. There was an interesting comment on the question, that I don’t understand, yet; I need to think some more.

As my prototype shows, for sampling you can give access to dk_entries in O(1) time to get sampling done in expected O(1) time for dicts that haven’t seen much deletion (and in amortised, expected O(1) time for all dicts including sparse ones, if you also expose a ‘resize, if below x% density’ request in the API).

You don’t need access to the stronger, proper indexed access for sampling—just retry when you hit an invalid (because marked-as-deleted) entry.

Reservoir sampling for arbitrary iterables is nice. But a third party library can provide that just fine. Putting that in the innards of CPython would only add minor convenience for the user, and I’m not sure that’s enough to overcome the burden of inclusion.

If you wanted to include indexed access in the standard libraries, I think providing O(log n) indexed access could be provided in the ordered dict. Since about Python 3.7 ordered dicts have otherwise become mostly superfluous, so this might be a good way to breath life into them again?

I’ll have a look at your benchmarks. I’ll rerun them locally and also add my own proof-of-concept.

matthiasgoergens · August 3, 2022, 9:43am

Chris, I reran all your benchmarks on my machine with a variable size for the dictionary.

I can broadly confirm your numbers. I can also confirm that your solutions take O(n) runtime, where n is the size of the dictionary. (I just replaced the 100000 in your example with a variable size for the dict.)

I also ran your benchmark against my solution. Given how you chose the input data, my solution is O(1). For simplicity here’s a reproduction of your experiment with 100k items in the dict:

In [1]: from chris_barker_benchmark import *

In [2]: %timeit sample_list(DICT)
3.89 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: %timeit sample_iter(DICT)
2.12 ms ± 190 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit sample_iter2(DICT)
1.98 ms ± 76.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit sample_islice(DICT)
299 µs ± 9.39 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [6]: from index_access import sample

In [7]: %timeit sample(DICT)
418 ns ± 9.78 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Take note that timeit switches from milliseconds to microseconds to nanoseconds at the end.

Repeating with one million items in the dict:

In [8]: DICT = {str(i): i for i in range(1_000_000)}

In [9]: %timeit sample_islice(DICT)
4.34 ms ± 335 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [10]: %timeit sample(DICT)
473 ns ± 1.8 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [11]: %timeit sample_islice(DICT)
4.74 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [12]: %timeit sample(DICT)
554 ns ± 31.5 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

As mentioned above, I also have some overly-complicated code to run with variable sizes:

import index_access as ia
def benchmark():
    Path('output').mkdir(exist_ok=True)
    funs = [sample_list, sample_iter, sample_iter2, sample_islice, ia.sample]
    while True:
        size = random.randrange(1, 1_000_000)
        d = {str(i): i for i in range(size)}
        f = random.choice(funs)
        start = time.perf_counter()
        f(d)
        stop = time.perf_counter()
        Path('output', f.__name__).open(mode='at').write(f"{size} {stop-start}\n")

You can see the result of plotting that. sample is the function from my suggestion.

ChrisBarker-NOAA · August 3, 2022, 4:31pm

This is the thread from pyton-ideas:

https://mail.python.org/archives/list/python-ideas@python.org/thread/S7UMTWK65X6BJDYZ3SSU7I7HOIASDMMJ/#XQPFI3CR46XN2XZHV2KJ2J5OUB7EORKC

hopefully the whole thing – threads do gte broken / go off on tangents sometimes…

ChrisBarker-NOAA · August 3, 2022, 5:03pm

@matthiasgoergens:

I’m still confused about how your implementation is O(1) (even though I just re-read the discussion about that, but I’ll take your (and the others’ in the discussion work for it) – and your timing does indicate that. it’s fast

Thanks for posting the timing – I think it makes a good point. I think the islice implementation is as good as you can do with pure Python. And your implementation is a lot faster for large dicts.

Which is a pretty good argument for including it in Python.

Though only if getting a fast random sample from a dict is an important use case – which I’m not so sure about.

QUESTION: have you run your code on a “messy” dict? e.g. one that’s had lots of deletions?

FINAL NOTE: you could still make this an external package – though it would have to be written in C, and compiled specifically for particular Python versions. But with binary wheels and conda and all, that’s not too hard. (will, a lot easier than it used to be, anyway…)

One other question – IIUC, you mention that you could do better if the internal array is exposed – but that’s not part of the public API for dicts. If this is an external package is needs to use the Public API – but if it’s builtin, couldn’t it use internal details? What would be locked in is an API for getting a random item for a dict – not how it was implemented.

matthiasgoergens · August 8, 2022, 3:56am

Thanks for your feedback, Chris! You actually inspired me to write a longer piece explaining exactly what’s going on. That one will include a pure Python implementation (or as pure as we can make it.)

While working on that piece, I discovered a small problem with random.randrange. See Fix corner case in _randbelow_with_getrandbits by matthiasgoergens · Pull Request #95775 · python/cpython · GitHub for the fix.

(Or specifically Fix corner case in _randbelow_with_getrandbits by matthiasgoergens · Pull Request #95775 · python/cpython · GitHub I had to roll back my fix, because we need to preserve the slightly buggy behaviour in order to reproduce pseudo random numbers between releases. Alas.)

Marco_Sulla · August 25, 2022, 7:36pm

Well, don’t know if it is in topic, but frozendict implements key(), value() and item() methods. For example, fd.value(1) gives you the second value of the frozendict. Since it’s immutable, it’s O(1). This way you can get a random key, value or item from a frozendict.

matthiasgoergens · August 25, 2022, 10:57pm

That’s neat! I’ll have a look!