Python sublist check

What is a fast method to check if one list is a sublist of a bigger one? For example, in my case,

A = [1, 2, 2, 3] is a subset of B = [1, 2, 2, 3, 4], but A is not a subset of C = [1, 2, 3, 4] since one of the 2’s in A is missing in C.

edit: premature send

Can you make use of sets? set(A) <= set(B) is extremely efficient. Perhaps, since duplicates seem to matter, collections.Counter() can serve as a multiset.

Can’t use sets because the containers I’m working with may contain duplicate elements. But multisets can contain duplicates?

If all the numbers are non-negative ints, you can use numpy.bincount:

import numpy as np
A = [1, 2, 2, 3]
B = [1, 2, 2, 3, 4]
C = [1, 2, 3, 4]

cA = np.bincount(A)
cB = np.bincount(B)
cC = np.bincount(C)

def is_subset(ca, cb):
    return len(ca) <= len(cb) and (ca <= cb[:len(ca)]).all()

print(is_subset(cA, cB))  # True
print(is_subset(cA, cC))  # False

Are the lists always sorted, as in your examples?

For sorted lists A and B, iterate an index through list B, to find the first occurrence at index i, of A[0] in B, i.e. B[i] == A[0].

So:

def is_sorted_sublist_of_sorted(A, B):

    # TODO: reimplement using bisect.py
    i = B.index(A[0])

    for x, y in itertools.zip_longest(A, B[i:], fillvalue=float('nan')):
        if x != y:
            return False
    return True

Your title says “sublist”, but your text says “subset”. OTOH, your examples are sets not lists, so I’ll assume list.

If also “sublist” means consecutive and anywhere (so that [2,2,3]is a sublist of B but [1,3,4] is not a sublist of B, then we are just dealing with the same problem as string search (as in str.find), but with integers instead of characters.

For this, I like Boyer-Moore. There is a set-up cost proportional to the target size (squared?), but it is fast if the target sequence is a lot shorter than the searched sequence, or you use the same target on a lot of searched sequences.

By definition, yes. There isn’t a specific multi-set implementation provided, but collections.Counter can function that way. They don’t implement comparison operators with the meaning that sets do, but we can do it more explicitly:

from collections import Counter

def is_sublist(x, y):
    cx, cy = Counter(x), Counter(y)
    cy.subtract(cx)
    return min(cy.values()) >= 0
1 Like

Counters do support comparison operators in expected manner. So you can simplify the code a and just return Counter(y) >= Counter(x)

Ah, it looks like this was added in 3.10:

$ python3.10
Python 3.10.14 (main, May  2 2024, 15:07:09) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from collections import Counter
>>> Counter() >= Counter()
True
>>> 
$ python3.9
Python 3.9.19 (main, May  2 2024, 15:03:24) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from collections import Counter
>>> Counter() >= Counter()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '>=' not supported between instances of 'Counter' and 'Counter'

If the input lists are always sorted as they are in your example then you can perform the test more cheaply as a subsequence test with an iterator:

def is_subsequence(a, b):
    seq = iter(b)
    return all(i in seq for i in a)

print(is_subsequence([1, 2, 2, 3], [1, 2, 2, 3, 4])) # outputs True
print(is_subsequence([1, 2, 2, 3], [1, 2, 3, 4])) # outputs False
1 Like

I have blogged this, but it depends on what you treat as “found”. I am looking for the elements of the sub-list in their given order, in the main list.