Python sublist check

CyrusPolentschz · June 9, 2024, 4:00pm

What is a fast method to check if one list is a sublist of a bigger one? For example, in my case,

A = [1, 2, 2, 3] is a subset of B = [1, 2, 2, 3, 4], but A is not a subset of C = [1, 2, 3, 4] since one of the 2’s in A is missing in C.

Rosuav · June 9, 2024, 4:08pm

edit: premature send

Can you make use of sets? set(A) <= set(B) is extremely efficient. Perhaps, since duplicates seem to matter, collections.Counter() can serve as a multiset.

CyrusPolentschz · June 9, 2024, 4:28pm

Can’t use sets because the containers I’m working with may contain duplicate elements. But multisets can contain duplicates?

bverheg · June 9, 2024, 4:38pm

If all the numbers are non-negative ints, you can use numpy.bincount:

import numpy as np
A = [1, 2, 2, 3]
B = [1, 2, 2, 3, 4]
C = [1, 2, 3, 4]

cA = np.bincount(A)
cB = np.bincount(B)
cC = np.bincount(C)

def is_subset(ca, cb):
    return len(ca) <= len(cb) and (ca <= cb[:len(ca)]).all()

print(is_subset(cA, cB))  # True
print(is_subset(cA, cC))  # False

JamesParrott · June 9, 2024, 5:28pm

Are the lists always sorted, as in your examples?

For sorted lists A and B, iterate an index through list B, to find the first occurrence at index i, of A[0] in B, i.e. B[i] == A[0].

So:

def is_sorted_sublist_of_sorted(A, B):

    # TODO: reimplement using bisect.py
    i = B.index(A[0])

    for x, y in itertools.zip_longest(A, B[i:], fillvalue=float('nan')):
        if x != y:
            return False
    return True

jeff5 · June 9, 2024, 7:13pm

Your title says “sublist”, but your text says “subset”. OTOH, your examples are sets not lists, so I’ll assume list.

If also “sublist” means consecutive and anywhere (so that [2,2,3]is a sublist of B but [1,3,4] is not a sublist of B, then we are just dealing with the same problem as string search (as in str.find), but with integers instead of characters.

For this, I like Boyer-Moore. There is a set-up cost proportional to the target size (squared?), but it is fast if the target sequence is a lot shorter than the searched sequence, or you use the same target on a lot of searched sequences.

kknechtel · June 10, 2024, 7:04am

By definition, yes. There isn’t a specific multi-set implementation provided, but collections.Counter can function that way. They don’t implement comparison operators with the meaning that sets do, but we can do it more explicitly:

from collections import Counter

def is_sublist(x, y):
    cx, cy = Counter(x), Counter(y)
    cy.subtract(cx)
    return min(cy.values()) >= 0

mdrissi · June 10, 2024, 7:58am

Counters do support comparison operators in expected manner. So you can simplify the code a and just return Counter(y) >= Counter(x)

kknechtel · June 10, 2024, 11:41am

Ah, it looks like this was added in 3.10:

$ python3.10
Python 3.10.14 (main, May  2 2024, 15:07:09) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from collections import Counter
>>> Counter() >= Counter()
True
>>> 
$ python3.9
Python 3.9.19 (main, May  2 2024, 15:03:24) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from collections import Counter
>>> Counter() >= Counter()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '>=' not supported between instances of 'Counter' and 'Counter'

blhsing · June 11, 2024, 3:14am

If the input lists are always sorted as they are in your example then you can perform the test more cheaply as a subsequence test with an iterator:

def is_subsequence(a, b):
    seq = iter(b)
    return all(i in seq for i in a)

print(is_subsequence([1, 2, 2, 3], [1, 2, 2, 3, 4])) # outputs True
print(is_subsequence([1, 2, 2, 3], [1, 2, 3, 4])) # outputs False

Paddy3118 · June 11, 2024, 12:57pm

I have blogged this, but it depends on what you treat as “found”. I am looking for the elements of the sub-list in their given order, in the main list.