Iterator challenge: keyed unique

For those who like iterators.

Rewrite more_itertools.unique_everseen for keyed variant without using loops (or any other function that uses pure python loops). No python extensions.

https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.unique_everseen

def unique(iterator: Iterator, key: Callable) -> Iterator:
    ...

Best performance wins.

Test cases:

a = range(100_000)
a_key = lambda x: x % 2
b = 'AAAABBBCCDAABBBasdkf' * 5000
b_key = str.lower
random.seed(1)
c = [random.getrandbits(16) for _ in range(100_000)]
c_key = lambda x: x < 65536 / 2

assert list(unique(iter(a), a_key)) == list(more_itertools.unique_everseen(iter(a), a_key))
assert list(unique(iter(b), b_key)) == list(more_itertools.unique_everseen(iter(b), b_key))

Timing:

from collections import deque
%timeit -n 100 deque(unique(iter(a), a_key), maxlen=0)
%timeit -n 100 deque(unique(iter(b), b_key), maxlen=0)

Best performance for what case, and measured how?

1 Like

Updated. Thanks!

Here’s the reference implementation if you don’t want to install more_itertools:

def unique_everseen(iterable, key=None):
    seenset = set()
    seenset_add = seenset.add
    seenlist = []
    seenlist_add = seenlist.append
    use_key = key is not None
    for element in iterable:
        k = key(element) if use_key else element
        try:
            if k not in seenset:
                seenset_add(k)
                yield element
        except TypeError:
            if k not in seenlist:
                seenlist_add(k)
                yield element
1 Like

Darn, I just noticed the support for non-hashable values/keys. That complicates things, probably even rules out some of my ideas since I can’t handle such TypeError exceptions in itertools…

It’s not in the tests, so feel free to ignore it. :slight_smile:

No, no lets assume that keys are hashable. Non-hashable item sets is a separate problem altogether. And I guess if one had such object it could just be substituted instead of whatever was used.

Your test cases are invalid. Neither a nor b are iterators.

1 Like

Well, they are iterators, but the extra methods should be eliminated.

a = iter(range(100_000))
a_key = lambda x: x % 2
b = iter('AAAABBBCCDAABBBasdkf' * 5000)
b_key = str.lower

Wait, that doesn’t work… You need to call iter() before passing them to the function.

They were iterables, now they are iterators.

Now your timing code is bad, as the first of the 100 rounds will exhaust the iterators and the other 99 rounds have nothing to do.

1 Like
def unique_Stefan_1(iterator: Iterator, key: Callable) -> Iterator:
    seen = set()
    seen_add = seen.add
    def add(x):
        k = key(x)
        if k in seen:
            return False
        seen_add(k)
        return True
    return filter(add, iterator)

Does this count?

def unique(iterable, key):
    values = tuple(iterable)[::-1]
    indices = {key(value): i for i, value in enumerate(values)}.values()
    return map(values.__getitem__, sorted(indices, reverse=True))

No, I don’t want to benchmark that.

Does the order matter? Because it would simplify my implementation.

I now understand that I wasn’t clear enough so my fault.

Ideally, it needs to behave as an iterator, so that if you have a chain of operations on large objects, only 1 item needs to be in memory at a time (can be 2, 3, maybe more), but the point is to iterate over it and not to have it all in memory. So yes, in this case the order matters.

However, I am interested to see benchmarks of various functions.

Does this beat filter()?

def unique(iterable, key):
    seen = set()
    seen_add = seen.add
    def add(element):
        if element in seen:
            return False
        seen_add(element)
        return True
    return (item for item in iterable if add(key(item)))

Will check and compare all later.

By the way this post was inspired by recipe that @Stefan2 helped me with.

def iter_counter(iterable):
    c1 = itl.count(1)
    it = itl.compress(iterable, c1)
    ct = map(opr.sub, c1, itl.count(1))
    return it, ct

More and more I find that many of what is done in more_itertools can be implemented faster and/or more elegantly, which is good, because I have a lot of fun with these. :slight_smile:

def unique_dgrigonis_1(iterable, key):
    seen = set()
    pred = ftl.partial(opr.contains, seen)
    it, itk = itl.tee(iterable)
    itk = map(key, itk)
    itk, itk_bool = itl.tee(itk)
    itk_bool = map(pred, itk_bool)
    it_tri = zip(it, itk, itk_bool)
    it_triu = itl.filterfalse(opr.itemgetter(2), it_tri)
    it1, it2 = itl.tee(it_triu)
    it1 = map(opr.itemgetter(0), it1)
    it2 = map(opr.itemgetter(1), it2)
    return map(opr.itemgetter(0), zip(it1, map(seen.add, it2)))
from itertools import tee

def unique_Stefan_2(iterator: Iterator, key: Callable) -> Iterator:
    class D(dict):
        def __missing__(self, x):
            self[x] = False
            return True
    vals1, vals2 = tee(iterator)
    return compress(vals1, map(D().__getitem__, map(key, vals2)))

Wait, are generator expressions and comprehensions allowed? They have Python-level loops.