Method to check if an element is in a set and add if not

effigies · November 14, 2025, 6:20pm

Consider the case of tracking previously seen values. sets are an ideal data structure for this:

seen_items: set

seen_items.add(elem)

A very common pattern (for me, at least) is to do something special with a value on the first encounter, and then possibly something else every time I see it:

if elem not in seen_items:
    seen_items.add(elem)
    first_encounter(elem)

every_encounter(elem)

It feels to me that there should be a way to add and return an indicator of whether the value was already present. For example, a check_add method which returns True (or False, I’m not sure which is more intuitive) if the value is already present:

if not seen_items.check_add(elem):
    first_encounter(elem)

every_encounter(elem)

My motivation here is mostly aesthetic: we’ve already hashed the object when checking presence, why redo the work for the insertion? But there might also be interest in such a function for its atomicity, if it could be guaranteed that, among multiple threads attempting to check_add a value, only one gets a False return value.

I can simulate this behavior with:

from collections import defaultdict
from threading import Semaphore

seen_items = defaultdict(Semaphore)

...

if seen_items[elem].acquire(timeout=0):
    first_encounter(elem)

every_encounter(elem)

But it strikes me as something that naturally belongs in set.

Curious if others see value here, or if this is another “not every two line idiom needs to go in the stdlib” idea.

brass75 · November 14, 2025, 6:32pm

Since x in set is an O1 operation this would be syntactic sugar. I also do this often and, IMO, it’s not worth the effort to change the language to prevent me from writing an additional line of code.

mikeshardmind · November 14, 2025, 6:41pm

Might be marginally useful to make set.add and set.discard both return True/False based on if it had an effect on the set, but I feel like this is rare enough to care, and cheap enough to check that it’s not really worth adding and then waiting 5 years to be able to use it on all Non-EOL python versions.

notatallshaw · November 14, 2025, 6:43pm

What’s the real time cost of looking up an element in a set vs. getting the length of a set?

I assume this would be equivalent:

def add(aset: set, elem: Hashable):
    "Return True if elem was added to set, False if set already contained elem"
    length = len(aset)
    aset.add(elem)
    return len(aset) != length

mikeshardmind · November 14, 2025, 6:47pm

If the set is also only being used to avoid the body of a function being executed multiple times for the same hashable input, you can also do something like this:

@lru_cache(None)  # or use this functionally inside of a scope if you dont want the cache persisting
def only_on_first_occurance(elem):
    ...

chepner · November 14, 2025, 6:51pm

I’m thinking it might be the job of every_encounter, not the iterating code, to decide whether or not to also call first_encounter, especially if the set you are building is just for bookkeeping, not later use. A wrapper around a set might be more appropriate.

class ItemProcessor:
    def __init__(self, first_time, every_time):
        self.s = set()
        self.first_time = first_time
        self.every_time = every_time

    def process(self, item):
        if item not in self.s:
            self.s.add(item)
            self.first_time(item)
        self.every_time(item)

p = ItemProcessor(first_encounter, every_encounter)
for x in ...:
    p.process(x)

oscarbenjamin · November 14, 2025, 6:53pm

In a multithreaded context checking x in set first is not equivalent to an atomic check_add because of time of check to time of use race conditions. One way to do this atomically is with dict rather than set:

# Mutable, shared across threads:
things = {}

def add(key):
    mine = object()
    added = things.setdefault(key, mine) is mine
    if added:
        handle_new_exactly_once(key)

In Python most methods on mutable containers only mutate or read from the object but not both. For atomicity it is useful to have those things combined as in the case of setdefault or something like check_add.

mikeshardmind · November 14, 2025, 7:03pm

At the same time, in a multithreaded context, if you’re working with hashable inputs, you can use the hash as part of a binning function so that no two threads ever operate on overlapping data, then merge at the end.

There’s a lot of additional concerns with threading, and it’s generally better to structure work to minimize needing any check that cares about another thread’s work in the first place.

oscarbenjamin · November 14, 2025, 7:21pm

I think this requires that the code has knowledge of the threads that are running. The perspective I come from here is that the add function is just a function in a library that may or may not be used in a multithreaded context in downstream code. The author of add does not know how many threads there are and expects that it will usually be used in single-threaded code but does not want it to break under threading.

effigies · November 14, 2025, 8:35pm

Just for pedantry’s sake, I think that x in set is O(k), where k = sizeof(x) for some function sizeof, but I agree that it’s cheap enough, regardless.

It’s not about avoiding writing the second line of code so much as about avoiding the second dive into the call stack, which could get arbitrarily baroque with __hash__() implementations, and the chance for the set to mutate between the two calls. It feels unclean to me.

I did think that the C API might expose this information, as extension writers would probably be more concerned about absolute speed and atomicity guarantees, so I was kind of expecting to propose exposing that function to Python users, but it turns out it’s not there.

I agree that it’s marginally useful, but it feels like a nice quality of life improvement that I’d be willing to put some time in on now and then wait five years for.

I do appreciate the other implementation suggestions (and may use some!), but the convolutions to achieve an atomic test and set feels like they’re highlighting the missed, admittedly low urgency, opportunity here.

storchaka · November 14, 2025, 10:08pm

It’s easy to make set.add() return True/False depending on whether the element was added to the set or was already in it. But add() is not only a method of set, it is also a part of the Set protocol. Other implementations currently return None, and it is not easy to update them all. In some implementations (for example SQLite based) this will add significant overhead. So, some implementations will return True/False, others will return None which will be silently interpreted as false in the boolean context.

Stefan2 · November 14, 2025, 10:46pm

Previous issue suggesting set.add returning a bool:

github.com/python/cpython

make `set.add` return a boolean to indicate if insertion took place

opened 12:27AM - 09 Oct 23 UTC

closed 05:39AM - 10 Oct 23 UTC

pobrn

type-feature

# Feature or enhancement ### Proposal: (I think this is a very simple idea, bo…th conceptually and implementation-wise, that is why I did not open a thread for it on Discourse, if you'd prefer I do that, please tell me.) (Also I have failed to find any discussions about this idea, if I had missed them, please direct me to them.) This proposal seeks to make `set.add` return a `bool` value that indicates whether insertion took place or not. A return value of `True` would mean insertion has taken place, while a return value of `False` would mean that insertion has *not* taken place. This can simplify snippets that use `set.add` in conjunction with `set.__contains__`. I may be wrong, but I believe this should a (mostly) backwards compatible change since `set.add` currently returns `None`, so the return value is likely not used anywhere. For example: ```python s = set() ... # currently if x in s: raise SomeError("duplicates not allowed") s.add(x) # with the proposal if not s.add(x): raise SomeError("duplicates not allowed") ``` Prototype implementation (not really tested at the moment): <details> ```diff diff --git a/Objects/setobject.c b/Objects/setobject.c index ae3f0b8d5e..f3f939c262 100644 --- a/Objects/setobject.c +++ b/Objects/setobject.c @@ -170,16 +170,16 @@ set_add_entry(PySetObject *so, PyObject *key, Py_hash_t hash) so->used++; freeslot->key = key; freeslot->hash = hash; - return 0; + return 1; found_unused: so->fill++; so->used++; entry->key = key; entry->hash = hash; - if ((size_t)so->fill*5 < mask*3) - return 0; - return set_table_resize(so, so->used>50000 ? so->used*2 : so->used*4); + if ((size_t)so->fill*5 >= mask*3 && set_table_resize(so, so->used>50000 ? so->used*2 : so->used*4)) + return -1; + return 1; found_active: Py_DECREF(key); @@ -623,7 +623,7 @@ set_merge(PySetObject *so, PyObject *otherset) other_entry = &other->table[i]; key = other_entry->key; if (key != NULL && key != dummy) { - if (set_add_entry(so, key, other_entry->hash)) + if (set_add_entry(so, key, other_entry->hash) < 0) return -1; } } @@ -904,7 +904,7 @@ set_update_internal(PySetObject *so, PyObject *other) return -1; } while (_PyDict_Next(other, &pos, &key, &value, &hash)) { - if (set_add_entry(so, key, hash)) + if (set_add_entry(so, key, hash) < 0) return -1; } return 0; @@ -915,7 +915,7 @@ set_update_internal(PySetObject *so, PyObject *other) return -1; while ((key = PyIter_Next(it)) != NULL) { - if (set_add_key(so, key)) { + if (set_add_key(so, key) < 0) { Py_DECREF(it); Py_DECREF(key); return -1; @@ -1212,7 +1212,7 @@ set_intersection(PySetObject *so, PyObject *other) return NULL; } if (rv) { - if (set_add_entry(result, key, hash)) { + if (set_add_entry(result, key, hash) < 0) { Py_DECREF(result); Py_DECREF(key); return NULL; @@ -1237,7 +1237,7 @@ set_intersection(PySetObject *so, PyObject *other) if (rv < 0) goto error; if (rv) { - if (set_add_entry(result, key, hash)) + if (set_add_entry(result, key, hash) < 0) goto error; if (PySet_GET_SIZE(result) >= PySet_GET_SIZE(so)) { Py_DECREF(key); @@ -1528,7 +1528,7 @@ set_difference(PySetObject *so, PyObject *other) return NULL; } if (!rv) { - if (set_add_entry((PySetObject *)result, key, hash)) { + if (set_add_entry((PySetObject *)result, key, hash) < 0) { Py_DECREF(result); Py_DECREF(key); return NULL; @@ -1551,7 +1551,7 @@ set_difference(PySetObject *so, PyObject *other) return NULL; } if (!rv) { - if (set_add_entry((PySetObject *)result, key, hash)) { + if (set_add_entry((PySetObject *)result, key, hash) < 0) { Py_DECREF(result); Py_DECREF(key); return NULL; @@ -1631,7 +1631,7 @@ set_symmetric_difference_update(PySetObject *so, PyObject *other) return NULL; } if (rv == DISCARD_NOTFOUND) { - if (set_add_entry(so, key, hash)) { + if (set_add_entry(so, key, hash) < 0) { Py_DECREF(key); return NULL; } @@ -1660,7 +1660,7 @@ set_symmetric_difference_update(PySetObject *so, PyObject *other) return NULL; } if (rv == DISCARD_NOTFOUND) { - if (set_add_entry(so, key, hash)) { + if (set_add_entry(so, key, hash) < 0) { Py_DECREF(otherset); Py_DECREF(key); return NULL; @@ -1834,9 +1834,11 @@ set_richcompare(PySetObject *v, PyObject *w, int op) static PyObject * set_add(PySetObject *so, PyObject *key) { - if (set_add_key(so, key)) + int res = set_add_key(so, key); + if (res < 0) return NULL; - Py_RETURN_NONE; + + return res ? Py_True : Py_False; } PyDoc_STRVAR(add_doc, @@ -2326,7 +2328,11 @@ PySet_Add(PyObject *anyset, PyObject *key) PyErr_BadInternalCall(); return -1; } - return set_add_key((PySetObject *)anyset, key); + + if (set_add_key((PySetObject *)anyset, key) < 0) + return -1; + + return 0; } int ``` </details> ### Has this already been discussed elsewhere? This is a minor feature, which does not need previous discussion elsewhere ### Links to previous discussion of this feature: _No response_

effigies · November 15, 2025, 3:00pm

@Stefan2 Thanks for finding that, and the many other conversations that it links to. I failed to find them through searching.

Happy to concede that set.add shouldn’t start returning bools. Also still would like some method to do this, but will end this thread here unless some core dev supports the idea.