Method to check if an element is in a set and add if not

Consider the case of tracking previously seen values. sets are an ideal data structure for this:

seen_items: set

seen_items.add(elem)

A very common pattern (for me, at least) is to do something special with a value on the first encounter, and then possibly something else every time I see it:

if elem not in seen_items:
    seen_items.add(elem)
    first_encounter(elem)

every_encounter(elem)

It feels to me that there should be a way to add and return an indicator of whether the value was already present. For example, a check_add method which returns True (or False, I’m not sure which is more intuitive) if the value is already present:

if not seen_items.check_add(elem):
    first_encounter(elem)

every_encounter(elem)

My motivation here is mostly aesthetic: we’ve already hashed the object when checking presence, why redo the work for the insertion? But there might also be interest in such a function for its atomicity, if it could be guaranteed that, among multiple threads attempting to check_add a value, only one gets a False return value.

I can simulate this behavior with:

from collections import defaultdict
from threading import Semaphore

seen_items = defaultdict(Semaphore)

...

if seen_items[elem].acquire(timeout=0):
    first_encounter(elem)

every_encounter(elem)

But it strikes me as something that naturally belongs in set.

Curious if others see value here, or if this is another “not every two line idiom needs to go in the stdlib” idea.

2 Likes

Since x in set is an O1 operation this would be syntactic sugar. I also do this often and, IMO, it’s not worth the effort to change the language to prevent me from writing an additional line of code.

1 Like

Might be marginally useful to make set.add and set.discard both return True/False based on if it had an effect on the set, but I feel like this is rare enough to care, and cheap enough to check that it’s not really worth adding and then waiting 5 years to be able to use it on all Non-EOL python versions.

5 Likes

What’s the real time cost of looking up an element in a set vs. getting the length of a set?

I assume this would be equivalent:

def add(aset: set, elem: Hashable):
    "Return True if elem was added to set, False if set already contained elem"
    length = len(aset)
    aset.add(elem)
    return len(aset) != length
1 Like

If the set is also only being used to avoid the body of a function being executed multiple times for the same hashable input, you can also do something like this:

@lru_cache(None)  # or use this functionally inside of a scope if you dont want the cache persisting
def only_on_first_occurance(elem):
    ... 

I’m thinking it might be the job of every_encounter, not the iterating code, to decide whether or not to also call first_encounter, especially if the set you are building is just for bookkeeping, not later use. A wrapper around a set might be more appropriate.

class ItemProcessor:
    def __init__(self, first_time, every_time):
        self.s = set()
        self.first_time = first_time
        self.every_time = every_time

    def process(self, item):
        if item not in self.s:
            self.s.add(item)
            self.first_time(item)
        self.every_time(item)

p = ItemProcessor(first_encounter, every_encounter)
for x in ...:
    p.process(x)
    

In a multithreaded context checking x in set first is not equivalent to an atomic check_add because of time of check to time of use race conditions. One way to do this atomically is with dict rather than set:

# Mutable, shared across threads:
things = {}

def add(key):
    mine = object()
    added = things.setdefault(key, mine) is mine
    if added:
        handle_new_exactly_once(key)

In Python most methods on mutable containers only mutate or read from the object but not both. For atomicity it is useful to have those things combined as in the case of setdefault or something like check_add.

2 Likes

At the same time, in a multithreaded context, if you’re working with hashable inputs, you can use the hash as part of a binning function so that no two threads ever operate on overlapping data, then merge at the end.

There’s a lot of additional concerns with threading, and it’s generally better to structure work to minimize needing any check that cares about another thread’s work in the first place.

2 Likes

I think this requires that the code has knowledge of the threads that are running. The perspective I come from here is that the add function is just a function in a library that may or may not be used in a multithreaded context in downstream code. The author of add does not know how many threads there are and expects that it will usually be used in single-threaded code but does not want it to break under threading.

2 Likes

Just for pedantry’s sake, I think that x in set is O(k), where k = sizeof(x) for some function sizeof, but I agree that it’s cheap enough, regardless.

It’s not about avoiding writing the second line of code so much as about avoiding the second dive into the call stack, which could get arbitrarily baroque with __hash__() implementations, and the chance for the set to mutate between the two calls. It feels unclean to me.

I did think that the C API might expose this information, as extension writers would probably be more concerned about absolute speed and atomicity guarantees, so I was kind of expecting to propose exposing that function to Python users, but it turns out it’s not there.

I agree that it’s marginally useful, but it feels like a nice quality of life improvement that I’d be willing to put some time in on now and then wait five years for. :slight_smile:


I do appreciate the other implementation suggestions (and may use some!), but the convolutions to achieve an atomic test and set feels like they’re highlighting the missed, admittedly low urgency, opportunity here.

2 Likes

It’s easy to make set.add() return True/False depending on whether the element was added to the set or was already in it. But add() is not only a method of set, it is also a part of the Set protocol. Other implementations currently return None, and it is not easy to update them all. In some implementations (for example SQLite based) this will add significant overhead. So, some implementations will return True/False, others will return None which will be silently interpreted as false in the boolean context.

1 Like

Previous issue suggesting set.add returning a bool:

4 Likes

@Stefan2 Thanks for finding that, and the many other conversations that it links to. I failed to find them through searching.

Happy to concede that set.add shouldn’t start returning bools. Also still would like some method to do this, but will end this thread here unless some core dev supports the idea.