Best way to test if value is in a predefined set of values

Consider the arguably common case where you need to test whether the value of a certain variable is among a set of predefined values, and for some reason you don’t need that set of values to be defined elsewhere:

if a in (1, 3, 4):

I have always advised using a tuple in this case, instead of a list, because lists are frankly more complex objects than tuples, so tuples may take less room in memory and/or less time to build. I extended a similar reasoning to ruling out creating a set (since sets are mutable), and also frowned upon frozenset((1, 3, 4)) because it’s less simple to read and it adds an avoidable function call, which is slow in Python.

However, I recently read that __contains__ checks are quicker on sets and dicts than on tuples and lists, for reasons that made a lot of sense. So, could someone with more knowledge than me of the implementation of the basic types settle the two following questions:

  • Is creating a tuple truly more efficient than creating a list? (see the sketch below)
  • Is creating and testing membership in a set quicker than either of the above?
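
For reference, the kind of creation-cost measurement I have in mind might look like this (a rough sketch using the standard timeit module; absolute numbers will vary by machine and Python version):

import timeit

# How long does it take just to build each container from literals?
# (CPython folds the tuple into a constant, which is part of the point.)
print('tuple:', timeit.timeit('t = (1, 3, 4)'))
print('list: ', timeit.timeit('l = [1, 3, 4]'))
print('set:  ', timeit.timeit('s = {1, 3, 4}'))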

The Python compiler may, and in CPython does, optimise this case.

It will depend on the number of items in the set to be checked.
I’m sure, without benchmarking, that searching among 1000 items is a lot
faster in a set than in a tuple. But for just a few items I’d guess the tuple wins.

tuple search is O(n/2) on average
set search is O(1) on average
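
A rough sketch to see the crossover on your own machine (the absolute numbers will vary; the shape of the growth is what matters):

import timeit

# Worst case for the tuple: the value is absent, so every slot is compared.
for n in (3, 30, 3000):
    data = tuple(range(n))
    env = {'t': data, 's': set(data), 'miss': -1}
    t_cost = timeit.timeit('miss in t', globals=env, number=100_000)
    s_cost = timeit.timeit('miss in s', globals=env, number=100_000)
    print(f'n={n}: tuple {t_cost:.4f}s  set {s_cost:.4f}s')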

For a small set, anything is fast. For anything ‘big’, a frozenset(tuple) may be fastest. Use the dis module to get some idea of the bytecode, and then time it if you really care.

I don’t think I have the level for using dis ^^
The type of use case I’m considering involves a dozen elements at worst, and usually 2 or 3, so I’ll stick with tuples being best as a good practice. Thank you both!

Searching a tuple is O(n/2). Searching a set is O(1).

I’d imagine making them from a known-size source sequence should both be
O(n) (because you can presize each of them) but I haven’t tested this.
Though making a tuple will be faster (no buckets, no hashing). And as
mentioned, CPython at least makes a literal tuple directly and wires it
into the generated bytecode.
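
One way to see that wiring for yourself (a quick sketch; the exact opcodes differ between CPython versions):

import dis

# A tuple of constants is folded into a single constant in the code object:
dis.dis('(1, 2, 3)')   # expect one LOAD_CONST of the whole tuple

# A list display must still be built at runtime (the exact opcodes,
# e.g. BUILD_LIST or LIST_EXTEND, vary between CPython versions):
dis.dis('[1, 2, 3]')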

Cheers,
Cameron Simpson cs@cskk.id.au

TL;DR: use a tuple for two values, and a set for three or more, and you will not be far from optimum.

This is an interesting micro-optimization: a membership test like this often sits inside a tight loop, so it is worth considering.

The answer will depend on what values you are matching, and what version of which Python interpreter you are using.

If your values are not hashable, then you can’t use a set. In that case, using a tuple is surely going to be faster and more memory efficient regardless of which Python interpreter you use.
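
For example (lists are unhashable, so they can live in a tuple but not in a set):

>>> needle = [1, 2]
>>> needle in ([1, 2], [3, 4])
True
>>> needle in {[1, 2], [3, 4]}
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'list'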

(Some imaginary Python interpreter, let’s call it StoogePython, may be exceptionally bad, and have tuples bigger and slower than lists. But who would use it? Any reasonable interpreter will have tuples at least as efficient as lists.)

Recent versions of CPython will automatically convert a list-display into a tuple:

>>> import dis
>>> dis.dis('x in [1, 2, 3]')
  1           0 LOAD_NAME                0 (x)
              2 LOAD_CONST               0 ((1, 2, 3))
              4 CONTAINS_OP              0
              6 RETURN_VALUE

which is nice, but not all interpreters will do that so it is better to explicitly use a tuple.

Likewise, recent versions of CPython will convert a set into a frozenset:

>>> dis.dis('x in {1, 2, 3}')
  1           0 LOAD_NAME                0 (x)
              2 LOAD_CONST               0 (frozenset({1, 2, 3}))
              4 CONTAINS_OP              0
              6 RETURN_VALUE

but the same comment about other interpreters and versions applies.

Which is faster, a (frozen)set or a list/tuple?

  • Both successful and unsuccessful searches of a set take O(1) comparisons;
  • unless there are lots of hash collisions;
  • successful searches of a list take O(N/2) comparisons on average;
  • unsuccessful searches take O(N) comparisons;
  • but the overhead of each data structure may be different.

In my experiments, I have found that for two items in the set/tuple:

  • successful searches of a tuple are, on average, about the same as for a set
  • but if one item is very common, and the other is rare, and you put the common item first in the tuple, you can beat the set by about 20%
  • sets are always faster for unsuccessful searches.

For three or more items, sets are almost always faster except in the case where the first item in the tuple is very common.

Note that all of these tradeoffs may differ in interpreters which lack CPython’s peephole optimizer that turns set displays {1, 2, 3} into a constant frozenset. But even there, you won’t be far from the optimal solution: everything is fast for small enough N, and for large N, the overhead of constructing the set is probably going to be relatively small compared to the overhead of searching a large list.
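
A sketch of the kind of harness behind those numbers (illustrative only; your ratios will differ by machine and version):

import timeit

env = {'common': 1, 'miss': 99}

# Successful search, common value first in the tuple:
hit_tuple = timeit.timeit('common in (1, 2)', globals=env)
hit_set = timeit.timeit('common in {1, 2}', globals=env)

# Unsuccessful search: the tuple must be scanned to the end.
miss_tuple = timeit.timeit('miss in (1, 2)', globals=env)
miss_set = timeit.timeit('miss in {1, 2}', globals=env)

print(f'hit:  tuple {hit_tuple:.3f}s  set {hit_set:.3f}s')
print(f'miss: tuple {miss_tuple:.3f}s  set {miss_set:.3f}s')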


If the set has to be constructed just to be searched, I doubt you’ll get any benefit. Try this, which defeats the optimization:

rosuav@sikorsky:~$ python3 -m timeit -s 'x = -1' 'x in {1, 2, 3, 4, 5, len(globals())}'
1000000 loops, best of 5: 199 nsec per loop
rosuav@sikorsky:~$ python3 -m timeit -s 'x = -1' 'x in [1, 2, 3, 4, 5, len(globals())]'
2000000 loops, best of 5: 121 nsec per loop
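
Conversely, if the set can be hoisted into the setup and built once, the per-lookup cost is what you measure; something like this should show the set pulling ahead (timings omitted here, as they vary per machine):

python3 -m timeit -s 'x = -1; s = {1, 2, 3, 4, 5, len(globals())}' 'x in s'
python3 -m timeit -s 'x = -1; t = (1, 2, 3, 4, 5, len(globals()))' 'x in t'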

The set is O(1)? Really?

I would have thought that is not correct, as that would mean that none of the items in the set have hash collisions and that the hash table has been resized to match the size of the set.

Is that really what happens in practice?

As long as hash collisions are relatively rare, it’s still O(1). Looking up an item in an N-element set requires calculating the item’s hash (no relationship to N), looking in a specific bucket (no relationship to N), and then potentially handling 1-2 hash collisions (not N/2 hash collisions). So yes, in the normal case, set/dict lookup really is O(1).

There are, of course, many caveats. Constructing the set still requires O(N) operations (hashing all the elements and putting them in their appropriate buckets), so a single-use set is worse than a single-use tuple; if you deliberately destroy the hashing algorithm (def __hash__(self): return 42), it devolves to a linear search with the extra overhead of the unnecessary hashing first; and I’m not certain, but I think that enlarging a set takes a bit more effort than enlarging a list. But the payoff really is constant-time lookups.
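
To illustrate that degenerate case (a toy class, not from the thread):

class BadHash:
    """Every instance hashes to the same bucket, forcing collisions."""
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 42                      # deliberately broken: all collide
    def __eq__(self, other):
        return isinstance(other, BadHash) and self.n == other.n

s = {BadHash(i) for i in range(1000)}
# Membership still works, but every lookup now scans one giant bucket,
# so it is effectively a linear search plus the wasted hashing:
print(BadHash(999) in s)               # True, but O(N) comparisons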
