# Make `set.difference` stop early like all other set methods?

`difference` is the only method that could stop early but doesn’t:

``````other = iter([1, 2, 3, 4, 5])
print({3, 4}.isdisjoint(other), *other)

other = iter([1, 2, 3, 4, 5])
print({1, 3}.issubset(other), *other)

other = iter([1, 2, 3, 4, 5])
print({1, 2}.issuperset(other), *other)

other = iter([1, 2, 3, 4, 5])
print({1, 3}.intersection(other), *other)

other = iter([1, 2, 3, 4, 5])
print({1, 3}.difference(other), *other)
``````

Output (Attempt This Online!):

``````False 4 5
True 4 5
False 4 5
{1, 3} 4 5
set()
``````

All the others stop early, leaving the iterator with 4 and 5 left. Only `difference` doesn’t, even though it could.

(`union` and `symmetric_difference` also don’t stop early, but they of course can’t, as any further element might need to be added.)

5 Likes

Is your goal here to rely on the end state of the iterator in a real algorithm, or is the idea that this would be more performant?

@rhettinger set proposal

The code for self.difference() is currently very tight and clean. I would prefer not to add an inner-loop memory access, test, and branch that would almost never payoff in real code (subtracting giant iterators from tiny sets isn’t a normal use case and not worth optimizing).

``````        while ((key = PyIter_Next(it)) != NULL) {
if (set_discard_key(so, key) < 0) {
Py_DECREF(it);
Py_DECREF(key);
return -1;
}
``````

FWIW, the code for `isdisjoint`, `issupersset`, and `issubset` are already making a test in the inner-loop. No extra work was added. Also, early-exits do arise naturally in the use cases for these methods.

5 Likes

Would you also argue against the shortcut in `set.intersection`, as added in set intersections should short-circuit · Issue #87630 · python/cpython · GitHub ? (which you appear to have been requested on, but never reviewed)

Mostly the iterator state thing.

For example, I want to count how often a set occurs sequentially in a sequence. I can do that like this (Attempt This Online!):

``````sequence = 'aaabbbcccba'
want = {'a', 'b', 'c'}

it = iter(sequence)
count = 0
while want.issubset(it):
count += 1

print(count)
``````

That uses that `issubset` doesn’t consume more than it needs. I also tried a solution with `intersection` and `difference_update`, but that didn’t work because the latter consumes the entire iterator.

I think at least for `issubset` you’re mistaken. That doesn’t do the work itself, it uses `set_intersection`, and extra work was added for that, an extra test in the loop comparing two set sizes.

[Regarding a 2022 edit]

an extra test in the loop comparing two set sizes.

I generally don’t support edits that make everyone pay for a problem that almost no one has.

Presumably there was an underlying belief that the case being optimized was important and that adding inner-loop memory accesses, tests, and branches was worth making all the other cases pay for it.

5 Likes

Which algorithm needed `difference` specifically, rather than one of the four alternatives you listed that do work already?

Not `difference` but `difference_update` (they end up using the same code and I think likely that would remain the case if this were changed). Same task as above, but I tried to optimize it. Using `issubset` repeatedly builds sets up and throws them away. So I tried alternating between building a set up and tearing it down. This should print 200 but only prints 2:

``````sequence = 'aaabbbcccba' * 100
want = {'a', 'b', 'c'}

it = iter(sequence)
count = 0
while (have := want.intersection(it)) == want:
count += 1
have.difference_update(it)
if not have:
count += 1

print(count)
``````

Attempt This Online!

Found a way to use `have.difference_update(...)` there: repeatedly call it with an `islice` of the next `len(have)` values. Because `have` won’t become empty earlier than that. (Attempt This Online!)

``````from itertools import islice

sequence = 'aaabbbcccba' * 100
want = {'a', 'b', 'c'}

it = iter(sequence)
count = 0
while (have := want.intersection(it)) == want:
count += 1
for x in it:
have.difference_update(islice(it, len(have)))
if not have:
count += 1
break

print(count)
``````

I imagine the C code loop could do something like that as well, only compare the sizes after a certain number of iterations (and then update that number of further iterations).

And that current optimization for `intersection` could maybe likewise compare the sizes only after the proper number of iterations instead of in every iteration.

Meanwhile, I noticed a wart. As shown earlier, `issubset` stops early, when all the set’s elements have been found in `other`. But… not if the set is empty. Then it consumes the whole `other` even though the result is trivially `True`:

``````it = iter('abc')
print(set('a').issubset(it), list(it))

it = iter('abc')
print(set('').issubset(it), list(it))
``````

Output (Attempt This Online!):

``````True ['b', 'c']
True []
``````

For the second case I’d expect:

``````True ['a', 'b', 'c']
``````

To spell it out explicitly: The current design philosophy for set methods doesn’t consider how it effects an iterator passed as the second argument as part of the interface, but as a side effect of performance optimizations. If you want to have reliable behavior in this regard that doesn’t change between python versions without warning (like the behavior of `set.intersection` did), you will need to implement such methods manually. Or you need to convince some python core devs that this should be taken into account when designing these functions, and I don’t think you will be successful in that.

2 Likes

I know. This topic is me trying to convince them of that, or rather to get opinions/reasoning about it (hence the title’s question mark).

1 Like

I would say that even relying on the early-exit behavior of the methods is that do have it is already unsupported, because the documentation doesn’t guarantee that behavior. The end-state of an iterator passed to those methods is undefined and not part of the public API of the methods.

1 Like