How to efficiently uniquify a very large record of List of Lists in python

aceLance · February 9, 2022, 3:40pm

Hi,
I am a relatively new python user. I have a code which does some complex and time consuming operations. So to reduce the data used by a next step in my code, I am having to implement a function that accepts a list of lists and returns a List of uniquified lists. See below for sample
Input : [ [‘a’, ‘b’, ‘c’], [‘d’,‘e’,‘f’], [‘a’,‘b’,‘c’] ]
Return : [ [‘a’, ‘b’, ‘c’], [‘d’,‘e’,‘f’] ]
Now the issue is the record of lists can be very large, in their hundreds of thousands.

The way i have it currently implemented is in a rudimentary way, where you iterate every list in the list, and append the current list into an aux list if it doesn’t exist in the aux list already. This operation seems to have incurred even additional overhead to the existing code.

So i was wondering if there is a more efficient way to do this?

sweeneyde · February 9, 2022, 4:04pm

The built-in set data type is useful for these uniquification kinds of operations. However, lists are not hashable (since you can mutate them), but you can use tuples instead which are hashable:

unique = set()
for my_list in lists:
    unique.add(tuple(my_list))

If you just need to iterate over these things, then you can directly use unique, the set of tuples. If you want to convert them back into a list of lists, you can do

result = []
for tup in unique:
    result.append(list(tup))

Both of these steps can be written more compactly using set/list comprehensions, if you wish.

unique = {tuple(my_list) for my_list in lists}
result = [list(tup) for tup in unique]

aceLance · February 9, 2022, 4:31pm

Thanks a bunch!

steven.daprano · February 10, 2022, 9:55am

Here’s a 20-year old answer!

https://code.activestate.com/recipes/52560-remove-duplicates-from-a-sequence/

And here are some more recent answers:

Possibly a better solution is to not put duplicate items in the list in
the first place. A sketch of a solution:

seen = set()
for sublist in input_lists:
    if t := tuple(sublist) not in seen:
        values.append(sublist)
        seen.add(t)