When you need to find out if something is in some collection of items
such as a list, how slow (expensive) that is depends on the nature of
the collection.
If you have a list which is in no particular order you would need to
examine each item in the list to see if it matched. So the expense, on
average, would be proportional to the length of the list. We call this
cost O(N), where N represents the length of the list. Big-O notation
means the order (scale) of the expense as a function of some measure of
the problem - here the measure is the length of the list.
See: Big O notation - Wikipedia for detail and further
references.
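For illustration, here's a minimal sketch of that linear scan in Python
(the function name linear_contains is made up for this example; the
built-in "item in somelist" test does effectively the same thing):

    def linear_contains(items, target):
        # Examine each item in turn: O(N) comparisons on average.
        for item in items:
            if item == target:
                return True
        return False

    print(linear_contains([3, 1, 4, 1, 5], 4))  # True
    print(linear_contains([3, 1, 4, 1, 5], 9))  # False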
If the list is known to be in order, for example ascending order, you
have a couple of choices to take advantage of this:
- search until you find a match or the items exceed your target value -
at that point you know there's no point in searching further; that is
still O(N), but better by a constant factor - about N/2 comparisons
on average (see the sketch after this list)
- do a bisect search
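Here's a sketch of the first choice, the early-exit scan
(sorted_contains_scan is a made-up name for this example; it assumes
the list is in ascending order):

    def sorted_contains_scan(sorted_items, target):
        # Still O(N), but stops early - about N/2 comparisons
        # on average for targets scattered through the range.
        for item in sorted_items:
            if item == target:
                return True
            if item > target:
                return False  # gone past where target would be
        return False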
A bisect search keeps two bounds, initially the start and end of the
list, and examines the entry half way between them. If your target is
less than that entry, the midpoint becomes the new upper bound; if it
is more, the midpoint becomes the new lower bound. Then compare against
the middle of the new, halved, range and repeat.
In this way the boundaries get closer together until you find a match
or the boundaries meet (no match). Because each iteration halves the
size of the range, you need about O(log2(N)) iterations. This is much
faster than the linear search for larger lists. For a tiny list like
your 3-element examples the mucking about exceeds the cost of just
scanning the list, but in the real world, with lists of any significant
size, the bisect approach is a win.
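Python's standard library bisect module does the halving for you; a
small sketch (bisect_contains is an illustrative name for this example):

    from bisect import bisect_left

    def bisect_contains(sorted_items, target):
        # bisect_left finds the leftmost position where target
        # could be inserted, using O(log2(N)) probes.
        i = bisect_left(sorted_items, target)
        return i < len(sorted_items) and sorted_items[i] == target

    print(bisect_contains([1, 3, 5, 7, 9], 7))  # True
    print(bisect_contains([1, 3, 5, 7, 9], 4))  # False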
A set is better still. Internally it uses a data structure called a
hash table: Hash table - Wikipedia
A hash table is essentially a list, each of whose entries is a small
list of items. You decide where an item goes in the list with a hash
function whose purpose is to (on average) distribute the items evenly
across the list, and the hash function always produces the same value
for a given item. You choose the size of the list based on the number of
values you're storing in it. The idea is that on average each entry
in the list has very few items stored in it, often 0 or 1.
Then, when you go to see if some test item is in the table, you compute
the test item’s hash value, and use that to pick which entry would
hold this item if it were in the list. Then you only need to see if the
item occurs in the very short sublist stored in the entry, and can
ignore all the other entries completely.
By sizing the table to have very few items in any entry, the cost of
looking things up becomes the cost of searching a 0 or 1 element list,
which is basically a constant small time. We call this O(1) in big-O
notation.
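To make that concrete, here is a deliberately simple toy hash table -
an illustration of the idea above, not how Python's set is actually
implemented:

    class ToyHashTable:
        def __init__(self, nbuckets=8):
            # The outer list; each entry is a small list ("bucket").
            self.buckets = [[] for _ in range(nbuckets)]

        def _bucket_for(self, item):
            # hash() always returns the same value for a given item;
            # the modulo spreads items across the buckets.
            return self.buckets[hash(item) % len(self.buckets)]

        def add(self, item):
            bucket = self._bucket_for(item)
            if item not in bucket:
                bucket.append(item)

        def __contains__(self, item):
            # Search only the one short bucket: O(1) on average.
            return item in self._bucket_for(item)

    t = ToyHashTable()
    t.add(3)
    print(3 in t)  # True
    print(9 in t)  # False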
This is amazingly effective. Of course it isn’t free: you need to do
work to create and populate the list using the hash function etc. But
once made, lookups are very fast.
A Python set uses a hash table to index its elements, and thus has
O(1) lookup time (the cost of a single membership test).
A Python dict, which maps keys to values, also uses a hash table for
its lookups.
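For example (the keys and values here are arbitrary):

    d = {"apple": 3, "banana": 5}
    print("apple" in d)  # key membership: an O(1) hash lookup
    print(d["banana"])   # fetching the value uses the same lookup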
When you make a set, eg:
s = set((1, 2, 3, 4, 2, 6, 7))
Python does all this work for you, and then you can test:
item in s
very cheaply.
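If you want to see the difference for yourself, the standard timeit
module can time the two membership tests; this sketch uses a
100000-element list and the equivalent set:

    from timeit import timeit

    items = list(range(100000))
    s = set(items)

    # Target at the far end: the list must scan all N items...
    print(timeit("99999 in items", globals=globals(), number=100))
    # ...while the set does a single O(1) hash lookup.
    print(timeit("99999 in s", globals=globals(), number=100))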