Add list.remove_duplicates() method

Rationale:
Currently there’s no straightforward way to remove duplicates from a list while preserving order. The common workarounds are either non-obvious (list(dict.fromkeys(lst))) or don’t preserve order (set(lst)).

Proposed API:

lst = [1, 2, 2, 3, 1]
lst.remove_duplicates()  # returns [1, 2, 3]

Isn’t that almost canonically list(dict.fromkeys(lst))? (As in, it’s extremely well known by now what this is meant to do.)
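For reference, that idiom works because dict.fromkeys keeps only the first occurrence of each key, and dicts preserve insertion order (guaranteed since Python 3.7):

```python
lst = [1, 2, 2, 3, 1]

# dict.fromkeys keeps the first occurrence of each key;
# converting back to a list yields the deduplicated sequence
# in first-seen order
deduped = list(dict.fromkeys(lst))
print(deduped)  # [1, 2, 3]
```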


You can use list(more_itertools.unique_everseen(lst))

https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.unique_everseen


It probably ought to be .deduplicated() (or something else ending in -ed) to make it clear that it returns a modified copy instead of operating in place.

There’s a question of whether it would use a set/dict(-like) operation under the hood (forcing the contents to be hashable), or a slower sorting-based or brute-force implementation to find the items it has already seen.
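Both options from that question can be sketched in one hypothetical helper (the name deduplicated is illustrative, not a real API): a hash-based O(n) fast path, with a fallback to an O(n²) equality scan when the items turn out to be unhashable:

```python
def deduplicated(items):
    """Return a new list with duplicates removed, keeping first occurrences.

    Hypothetical helper: tries a set-based O(n) pass first and falls back
    to an O(n^2) scan using == when an unhashable item is encountered.
    """
    try:
        seen = set()
        result = []
        for item in items:
            if item not in seen:  # raises TypeError if item is unhashable
                seen.add(item)
                result.append(item)
        return result
    except TypeError:
        # Brute-force fallback: compare against everything kept so far
        result = []
        for item in items:
            if not any(item == kept for kept in result):
                result.append(item)
        return result

print(deduplicated([1, 2, 2, 3, 1]))  # [1, 2, 3]
print(deduplicated([[1], [2], [1]]))  # [[1], [2]] (lists are unhashable)
```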


I think it’d be reasonable to require hashable objects but allow a key function to make other objects hashable for the purpose of detecting duplicates. If that isn’t possible, I don’t think “duplicate” is sufficiently well-defined.
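A sketch of that key-function idea, with illustrative names (deduplicated_by is not a real API): the items themselves may be unhashable, but the value returned by key must be hashable.

```python
def deduplicated_by(items, key=None):
    # Hypothetical helper: deduplicate by key(item) (or the item itself),
    # keeping the first occurrence of each key in order
    if key is None:
        key = lambda x: x
    seen = set()
    result = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            result.append(item)
    return result

# dicts are unhashable, but the "id" field we dedupe on is
rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "c"}]
print(deduplicated_by(rows, key=lambda r: r["id"]))
# [{'id': 1, 'v': 'a'}, {'id': 2, 'v': 'b'}]
```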

That said, there are already enough ways to do this.

If you want to preserve order, you should replace duplicates with a null sentinel, such as None. Removing an item will affect the order of the remaining items.

I’m not sure what you mean? Presumably they mean relative order, not the index of each item.

Lists do not maintain metadata about when or in what order each individual insertion occurred. The only order would be the index of the items.

It is not like a dict, which preserves the insertion order:

a = {}
a['a'] = 1
a['b'] = 2
print(a)  # {'a': 1, 'b': 2}

I am not sure what order would be preserved in the case of a list, other than the index of the items.

lst = []
lst.insert(1, 'a')
lst.insert(0, 'b')
print(lst)  # ['b', 'a']

In the list above, we don’t know if ‘b’ was inserted before ‘a’ or not.

This isn’t about insertion order. The list itself is in an order, it’s a sequence.


Intuitively, I’d have said that removing duplicates just means that there are no two elements left that compare equal. Is there a nice way of writing a hashing key that achieves that for unhashable types with custom __eq__ methods?

I don’t think there’s a generic way, hence the need for a key param. I’d say it depends on however the custom __eq__ is implemented: what about the objects must be equal?

Maybe I’m lacking in imagination, but it feels like if you have some definition of “equal” for two objects, you can write a function to turn those characteristics into a hashable value.

I’d say that for most realistic use cases it is possible, but potentially annoying. If we have dataclass-like objects then we can recursively hash them by hashing the tuple of their attributes. This doesn’t work in general since it breaks if the objects are mutated, but for this purpose we can assume that that won’t happen. Implementing a generic function for this is somewhat difficult since you only want to include those fields in the hash that are relevant to the equality comparison. So you’d probably end up writing a custom one when you use the deduplication method.
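For the dataclass case described above, the attribute tuple can serve as the hashable key. A minimal sketch, assuming equality is field-by-field (the dataclass default) and the objects aren’t mutated while deduplicating:

```python
from dataclasses import dataclass, astuple

@dataclass
class Point:
    x: int
    y: int

points = [Point(1, 2), Point(3, 4), Point(1, 2)]

# astuple() recursively converts the fields to a (hashable) tuple,
# which matches the generated __eq__ when equality is field-by-field
seen = set()
unique = []
for p in points:
    k = astuple(p)
    if k not in seen:
        seen.add(k)
        unique.append(p)

print(unique)  # [Point(x=1, y=2), Point(x=3, y=4)]
```

Note this only lines up with __eq__ when every field participates in the comparison; with fields marked compare=False, you’d build the key tuple by hand.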

In full generality, I don’t think using a key function for this is possible. An arbitrary __eq__ can behave strangely, so we’d need to do n² comparisons to make sure no equal pair exists. But if there is a key function you can just use the normal dict based method, which should run in (amortized) linear time. Not sure if those cases are common enough to justify being concerned about though. Most objects will probably either be hashable or can use a key function.

Either require hashability or be OK with O(n²) comparisons. Trying to automatically find a key is a bad idea.

The index of each item represents the order in the list. The item at index 0 comes before all other items. The index maintains the order even if an item is removed. However, the index is also crucial for a list because it differentiates a list from a mere ordered sequence. Removing an item from the middle of a list effectively creates a new list.


The OP’s example implicitly shows that the proposed method should keep the first occurrence of each element, scanning from index 0, and remove the later duplicates. Why?

I thought it was clear from the OP that this is what they are asking for: a new list, without duplicates, and unique elements appear in the order of first appearance.

You seem to be making some very specific distinction based on definitions that I’m not following. We can just drop it though.


Yes, it’s not worth delving into these minor technicalities.


The more-itertools library linked above goes further, defining several deduplication orders in detail: unique_everseen (first-seen order), unique_justseen (collapses consecutive duplicates), and unique (sorted order).
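Rough stdlib equivalents of those three behaviors, as a sketch of how their results differ on the same input:

```python
from itertools import groupby

data = [1, 2, 2, 3, 1]

# unique_everseen: drop anything seen before, keep first-seen order
everseen = list(dict.fromkeys(data))      # [1, 2, 3]

# unique_justseen: only collapse runs of consecutive duplicates
justseen = [k for k, _ in groupby(data)]  # [1, 2, 3, 1]

# unique: deduplicate and return in sorted order
unique_sorted = sorted(set(data))         # [1, 2, 3]
```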