Method to check if an element is in a set and add if not

Consider the case of tracking previously seen values. sets are an ideal data structure for this:

seen_items: set

seen_items.add(elem)

A very common pattern (for me, at least) is to do something special with a value on the first encounter, and then possibly something else every time I see it:

if elem not in seen_items:
    seen_items.add(elem)
    first_encounter(elem)

every_encounter(elem)

It feels to me that there should be a way to add and return an indicator of whether the value was already present. For example, a check_add method which returns True (or False, I’m not sure which is more intuitive) if the value is already present:

if not seen_items.check_add(elem):
    first_encounter(elem)

every_encounter(elem)

My motivation here is mostly aesthetic: we’ve already hashed the object when checking presence, why redo the work for the insertion? But there might also be interest in such a function for its atomicity, if it could be guaranteed that, among multiple threads attempting to check_add a value, only one gets a False return value.

I can simulate this behavior with:

from collections import defaultdict
from threading import Semaphore

seen_items = defaultdict(Semaphore)

...

if seen_items[elem].acquire(timeout=0):
    first_encounter(elem)

every_encounter(elem)

But it strikes me as something that naturally belongs in set.

Curious if others see value here, or if this is another “not every two line idiom needs to go in the stdlib” idea.

3 Likes

Since x in set is an O1 operation this would be syntactic sugar. I also do this often and, IMO, it’s not worth the effort to change the language to prevent me from writing an additional line of code.

1 Like

Might be marginally useful to make set.add and set.discard both return True/False based on if it had an effect on the set, but I feel like this is rare enough to care, and cheap enough to check that it’s not really worth adding and then waiting 5 years to be able to use it on all Non-EOL python versions.

6 Likes

What’s the real time cost of looking up an element in a set vs. getting the length of a set?

I assume this would be equivalent:

def add(aset: set, elem: Hashable):
    "Return True if elem was added to set, False if set already contained elem"
    length = len(aset)
    aset.add(elem)
    return len(aset) != length
1 Like

If the set is also only being used to avoid the body of a function being executed multiple times for the same hashable input, you can also do something like this:

@lru_cache(None)  # or use this functionally inside of a scope if you dont want the cache persisting
def only_on_first_occurance(elem):
    ... 

1 Like

I’m thinking it might be the job of every_encounter, not the iterating code, to decide whether or not to also call first_encounter, especially if the set you are building is just for bookkeeping, not later use. A wrapper around a set might be more appropriate.

class ItemProcessor:
    def __init__(self, first_time, every_time):
        self.s = set()
        self.first_time = first_time
        self.every_time = every_time

    def process(self, item):
        if item not in self.s:
            self.s.add(item)
            self.first_time(item)
        self.every_time(item)

p = ItemProcessor(first_encounter, every_encounter)
for x in ...:
    p.process(x)
    

In a multithreaded context checking x in set first is not equivalent to an atomic check_add because of time of check to time of use race conditions. One way to do this atomically is with dict rather than set:

# Mutable, shared across threads:
things = {}

def add(key):
    mine = object()
    added = things.setdefault(key, mine) is mine
    if added:
        handle_new_exactly_once(key)

In Python most methods on mutable containers only mutate or read from the object but not both. For atomicity it is useful to have those things combined as in the case of setdefault or something like check_add.

2 Likes

At the same time, in a multithreaded context, if you’re working with hashable inputs, you can use the hash as part of a binning function so that no two threads ever operate on overlapping data, then merge at the end.

There’s a lot of additional concerns with threading, and it’s generally better to structure work to minimize needing any check that cares about another thread’s work in the first place.

2 Likes

I think this requires that the code has knowledge of the threads that are running. The perspective I come from here is that the add function is just a function in a library that may or may not be used in a multithreaded context in downstream code. The author of add does not know how many threads there are and expects that it will usually be used in single-threaded code but does not want it to break under threading.

3 Likes

Just for pedantry’s sake, I think that x in set is O(k), where k = sizeof(x) for some function sizeof, but I agree that it’s cheap enough, regardless.

It’s not about avoiding writing the second line of code so much as about avoiding the second dive into the call stack, which could get arbitrarily baroque with __hash__() implementations, and the chance for the set to mutate between the two calls. It feels unclean to me.

I did think that the C API might expose this information, as extension writers would probably be more concerned about absolute speed and atomicity guarantees, so I was kind of expecting to propose exposing that function to Python users, but it turns out it’s not there.

I agree that it’s marginally useful, but it feels like a nice quality of life improvement that I’d be willing to put some time in on now and then wait five years for. :slight_smile:


I do appreciate the other implementation suggestions (and may use some!), but the convolutions to achieve an atomic test and set feels like they’re highlighting the missed, admittedly low urgency, opportunity here.

2 Likes

It’s easy to make set.add() return True/False depending on whether the element was added to the set or was already in it. But add() is not only a method of set, it is also a part of the Set protocol. Other implementations currently return None, and it is not easy to update them all. In some implementations (for example SQLite based) this will add significant overhead. So, some implementations will return True/False, others will return None which will be silently interpreted as false in the boolean context.

3 Likes

Previous issue suggesting set.add returning a bool:

6 Likes

@Stefan2 Thanks for finding that, and the many other conversations that it links to. I failed to find them through searching.

Happy to concede that set.add shouldn’t start returning bools. Also still would like some method to do this, but will end this thread here unless some core dev supports the idea.

1 Like

I don’t want to give you too much hope as this is not an area where I have appreciable influence, but I for one do see a use for this. Since adjusting add is out, perhaps a new method that returns what’s currently in the set that matches, or None. This would allow for flows like

if (existing_obj := a_set.new_method_name(obj)) is obj:
    print(obj, 'already in there')
elif existing_obj is None:
    print(obj, 'now included')
else:
    print(existing_obj, 'is already there but is not', obj)

However, I’m having a hard time coming up with a name for this method (hence calling it new_method_name in the example), which is not a good indication. It’s almost akin to dict.setdefault, but different enough that it really can’t reuse that name. I’m not a fan of check_add; it just feels clunky to me. I wonder if a return_existing: bool keyword-only parameter to add wouldn’t be the best way to do this, in spite of the boolean parameter anti-pattern.

2 Likes

Probably extrapolating, but given the increase of interest in multithreading, a collection of atomic utilities might become interesting in the future, this one would surely fit there, it is rarely used but should be a simple thing indeed. I would picture something akin to success = atomic.set_insert(a_set, obj).

2 Likes

These operations (append/add/…) on set/list/dict are already atomic in CPython. They just don’t return the information that they could return to make that more useful from the caller’s perspective and the atomicity is currently implementation defined rather than being a language guarantee. If an implementation of Python should be expected to provide an atomic module then why shouldn’t the implementation be expected to provide atomic methods direclty on set at al? Providing either is equally difficult.

1 Like

I like the idea of atomic utilities. Changing existing ABCs to have methods return a bool or otherwise behave atomically will be hard because it’d be impossible for the plethora of implementations of collections.abc.* to follow the changes as well (not every implementation of collections.abc.* can be made thread-safe or atomic). Whereas with atomic utiliites in the stdlib, we only need to support and guarantee the atomicity of the classes that we can control (i.e. they can raise TypeError / NotImplementedError if they’re called with non-stdlib classes), and there will be no pressure on collections.abc.* implementing-classes to try to emulate the impossible.

1 Like

The closest equivalent I could think of when writing this up was Test-and-set instructions.

I also looked up cereggii (which I’ve seen mentioned in atomic contexts) in case it had an atomic set. It didn’t, but it does use get_and_set() for a number of its data structures:

While these retrieve the stored value, as in your proposal, they also replace it, which doesn’t seem like a valuable feature to me.

I think returning the existing value does nicely solve the problem of whether True means the value was inserted or was already present, but it would return the same value in both cases for a_set.new_method_name(None). Maybe that’s okay.

Only new name that’s occurred to me while writing this is add_first() or first_add(), which returns True if the object is inserted (first add) or False if it is not (all subsequent adds).

If we’re considering keyword parameters for add(), I might consider: check or test (True/False), get or fetch (obj / None).

I would have liked that, but it’s a case where “practicality beats purity” didn’t win. The underlying “problem” is that Python’s design loathes returning a value from a mutating operation. set.add(elt) can mutate the set, so for purity it must return None. Hasn’t everyone been surprised at least once by that list.sort() returns None? Same thing in theory.

The principle is meant to prevent unintended logic errors from chaining method calls that mutate the base object. Trying to do so blows up at once when mutating methods return None.

It’s for the same reason that, e.g., ZODB’s persistent BTree containers return None when an element is added to them. Which is even more annoying :wink: in that context because lookup time isn’t O(1).

More familiar to more people is probably the nifty sortedcontainers library, whose ordered container types similarly return None from mutating methods.

I think it’s about 20 years too late to try to fight that now.

4 Likes

Something which might work adressing all concerns raised so far is a stdlib function which uses the internals of sets to do it efficiently when it’s a set and not some subclass or set-like provided by a library, while having a less-fast path (with a lock) for things that implement collections.abc.set.

This sidesteps the issue of the set collection abc compatability, of the way mutating methods generally return None, and probably more things I haven’t thought of.

3 Likes