Make `set.difference` stop early like all other set methods?

difference is the only method that could stop early but doesn’t:

other = iter([1, 2, 3, 4, 5])
print({3, 4}.isdisjoint(other), *other)

other = iter([1, 2, 3, 4, 5])
print({1, 3}.issubset(other), *other)

other = iter([1, 2, 3, 4, 5])
print({1, 2}.issuperset(other), *other)

other = iter([1, 2, 3, 4, 5])
print({1, 3}.intersection(other), *other)

other = iter([1, 2, 3, 4, 5])
print({1, 3}.difference(other), *other)

Output (Attempt This Online!):

False 4 5
True 4 5
False 4 5
{1, 3} 4 5
set()

All the others stop early, leaving the iterator with 4 and 5 left. Only difference doesn’t, even though it could.

(union and symmetric_difference also don’t stop early, but they of course can’t, as any further element might need to be added.)

5 Likes

Is your goal here to rely on the end state of the iterator in a real algorithm, or is the idea that this would be more performant?

@rhettinger set proposal

The code for self.difference() is currently very tight and clean. I would prefer not to add an inner-loop memory access, test, and branch that would almost never payoff in real code (subtracting giant iterators from tiny sets isn’t a normal use case and not worth optimizing).

        while ((key = PyIter_Next(it)) != NULL) {
            if (set_discard_key(so, key) < 0) {
                Py_DECREF(it);
                Py_DECREF(key);
                return -1;
            }

FWIW, the code for isdisjoint, issupersset, and issubset are already making a test in the inner-loop. No extra work was added. Also, early-exits do arise naturally in the use cases for these methods.

5 Likes

Would you also argue against the shortcut in set.intersection, as added in set intersections should short-circuit · Issue #87630 · python/cpython · GitHub ? (which you appear to have been requested on, but never reviewed)

Mostly the iterator state thing.

For example, I want to count how often a set occurs sequentially in a sequence. I can do that like this (Attempt This Online!):

sequence = 'aaabbbcccba'
want = {'a', 'b', 'c'}

it = iter(sequence)
count = 0
while want.issubset(it):
    count += 1

print(count)

That uses that issubset doesn’t consume more than it needs. I also tried a solution with intersection and difference_update, but that didn’t work because the latter consumes the entire iterator.

I think at least for issubset you’re mistaken. That doesn’t do the work itself, it uses set_intersection, and extra work was added for that, an extra test in the loop comparing two set sizes.

[Regarding a 2022 edit]

an extra test in the loop comparing two set sizes.

I generally don’t support edits that make everyone pay for a problem that almost no one has.

Presumably there was an underlying belief that the case being optimized was important and that adding inner-loop memory accesses, tests, and branches was worth making all the other cases pay for it.

5 Likes

Which algorithm needed difference specifically, rather than one of the four alternatives you listed that do work already?

Not difference but difference_update (they end up using the same code and I think likely that would remain the case if this were changed). Same task as above, but I tried to optimize it. Using issubset repeatedly builds sets up and throws them away. So I tried alternating between building a set up and tearing it down. This should print 200 but only prints 2:

sequence = 'aaabbbcccba' * 100
want = {'a', 'b', 'c'}

it = iter(sequence)
count = 0
while (have := want.intersection(it)) == want:
    count += 1
    have.difference_update(it)
    if not have:
        count += 1

print(count)

Attempt This Online!

Found a way to use have.difference_update(...) there: repeatedly call it with an islice of the next len(have) values. Because have won’t become empty earlier than that. (Attempt This Online!)

from itertools import islice

sequence = 'aaabbbcccba' * 100
want = {'a', 'b', 'c'}

it = iter(sequence)
count = 0
while (have := want.intersection(it)) == want:
    count += 1
    for x in it:
        have.discard(x)
        have.difference_update(islice(it, len(have)))
        if not have:
            count += 1
            break

print(count)

I imagine the C code loop could do something like that as well, only compare the sizes after a certain number of iterations (and then update that number of further iterations).

And that current optimization for intersection could maybe likewise compare the sizes only after the proper number of iterations instead of in every iteration.

Meanwhile, I noticed a wart. As shown earlier, issubset stops early, when all the set’s elements have been found in other. But… not if the set is empty. Then it consumes the whole other even though the result is trivially True:

it = iter('abc')
print(set('a').issubset(it), list(it))

it = iter('abc')
print(set('').issubset(it), list(it))

Output (Attempt This Online!):

True ['b', 'c']
True []

For the second case I’d expect:

True ['a', 'b', 'c']

To spell it out explicitly: The current design philosophy for set methods doesn’t consider how it effects an iterator passed as the second argument as part of the interface, but as a side effect of performance optimizations. If you want to have reliable behavior in this regard that doesn’t change between python versions without warning (like the behavior of set.intersection did), you will need to implement such methods manually. Or you need to convince some python core devs that this should be taken into account when designing these functions, and I don’t think you will be successful in that.

2 Likes

I know. This topic is me trying to convince them of that, or rather to get opinions/reasoning about it (hence the title’s question mark).

1 Like

I would say that even relying on the early-exit behavior of the methods is that do have it is already unsupported, because the documentation doesn’t guarantee that behavior. The end-state of an iterator passed to those methods is undefined and not part of the public API of the methods.

1 Like