collections.Collector

I’m a big fan of collections.Counter.

In some sense, collections.Counter is the evolution of defaultdict(int). It takes that class and adds sugar, including appropriate constructor overloading, methods like .most_common(), and multiset operations.

Thinking about what I still use defaultdict for, the main cases are defaultdict(list) and defaultdict(set), which I use often (perhaps more than Counter). However, I miss the sugar of a Counter.

Here are some ideas:

Construction

Construct with a mapping or an iterable of pairs (like dict):

>>> from collections import ListCollector
>>> ListCollector([('a', 1), ('b', 2), ('a', 3)])
ListCollector({'a': [1, 3], 'b': [2]})

This might be the most common usage: an alternative to itertools.groupby() that also collects non-consecutive groups:

>>> ListCollector(range(5), key=lambda v: v % 2)
ListCollector({0: [0, 2, 4], 1: [1, 3]})
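To make the contrast with groupby() concrete, here is a small illustration using only the standard library. The parity helper and variable names are just for this example; the defaultdict loop is what I end up writing by hand today.

```python
from collections import defaultdict
from itertools import groupby


def parity(v):
    # Illustrative key function: 0 for even values, 1 for odd values.
    return v % 2


values = range(5)

# groupby() only merges *consecutive* items with the same key,
# so alternating parities produce five one-element groups.
consecutive = [(k, list(g)) for k, g in groupby(values, key=parity)]
print(consecutive)  # [(0, [0]), (1, [1]), (0, [2]), (1, [3]), (0, [4])]

# The hand-written equivalent of the proposed key= constructor.
collected = defaultdict(list)
for v in values:
    collected[parity(v)].append(v)
print(dict(collected))  # {0: [0, 2, 4], 1: [1, 3]}
```

Sorting before groupby() gives the same groups, but it needs orderable keys and costs a sort.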

Sometimes you already have pairs of a key and a sequence of values:

>>> ListCollector.from_iterables([('a', [1, 2]), ('a', [3, 4])])
ListCollector({'a': [1, 2, 3, 4]})
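For discussion, here is a rough pure-Python sketch of how the three construction forms above could work. None of this exists in collections; the dict subclass with __missing__ (the same approach Counter takes), the keyword-only key parameter, and the treatment of mapping values as lists are all my assumptions.

```python
from collections.abc import Mapping


class ListCollector(dict):
    """Hypothetical sketch, not a real collections class."""

    def __init__(self, iterable=(), *, key=None):
        super().__init__()
        if key is not None:
            # Group bare values under key(value).
            for value in iterable:
                self[key(value)].append(value)
        elif isinstance(iterable, Mapping):
            # Assumed: a mapping already holds key -> iterable of values.
            for k, values in iterable.items():
                self[k].extend(values)
        else:
            # Otherwise expect (key, value) pairs, like dict().
            for k, value in iterable:
                self[k].append(value)

    def __missing__(self, k):
        # defaultdict(list) behaviour: insert and return an empty list.
        value = self[k] = []
        return value

    @classmethod
    def from_iterables(cls, pairs):
        # Each pair is (key, iterable_of_values); values are flattened per key.
        result = cls()
        for k, values in pairs:
            result[k].extend(values)
        return result

    def __repr__(self):
        return f"{type(self).__name__}({dict(self)!r})"


print(ListCollector([('a', 1), ('b', 2), ('a', 3)]))
print(ListCollector(range(5), key=lambda v: v % 2))
print(ListCollector.from_iterables([('a', [1, 2]), ('a', [3, 4])]))
```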

Operators

Because these are effectively defaultdicts, item access inserts and returns an empty item:

>>> x = ListCollector()
>>> x['spam'].append('eggs')

Lists support + and += for concatenation; ListCollector would apply these item-wise:

>>> x = ListCollector({'a': [1, 2], 'b': [3]})
>>> y = ListCollector({'a': [4], 'b': [5], 'c': [6]})
>>> x += y
>>> x
ListCollector({'a': [1, 2, 4], 'b': [3, 5], 'c': [6]})
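A minimal sketch of the item-wise operators, assuming the same hypothetical dict subclass with __missing__. Whether + should copy the lists, as it does here, is a design choice I am guessing at.

```python
class ListCollector(dict):
    """Hypothetical sketch showing only the + and += operators."""

    def __missing__(self, k):
        value = self[k] = []
        return value

    def __iadd__(self, other):
        # Extend each of our lists in place with the other's values.
        for k, values in other.items():
            self[k].extend(values)
        return self

    def __add__(self, other):
        # Build fresh lists so neither operand is mutated.
        result = type(self)()
        for source in (self, other):
            for k, values in source.items():
                result[k].extend(values)
        return result


x = ListCollector({'a': [1, 2], 'b': [3]})
y = ListCollector({'a': [4], 'b': [5], 'c': [6]})
z = x + y   # new collector; x and y are unchanged
x += y      # x is updated in place
```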

Methods

As with Counter, dict methods would be identical to a normal dict, but additional methods are available to treat the collector as a collection of key-value pairs.

Counter has .elements(); these types would provide a method of the same name that yields each key-value pair. This is arguably more useful than Counter.elements().

>>> x = ListCollector({'a': [1, 2], 'b': [3]})
>>> list(x.elements())
[('a', 1), ('a', 2), ('b', 3)]
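The behaviour I have in mind for elements() is small enough to sketch as a free function over any mapping of iterables (the function form is only for illustration; on the proposed class it would be a method):

```python
def elements(collector):
    # Yield one (key, value) pair per collected value, in insertion order.
    for key, values in collector.items():
        for value in values:
            yield key, value


pairs = list(elements({'a': [1, 2], 'b': [3]}))
print(pairs)  # [('a', 1), ('a', 2), ('b', 3)]
```

A nice property is that these pairs are exactly the input form the pair-based constructor accepts, so a collector would round-trip through elements().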

We can implement an equivalent of Counter.most_common() for this type, by comparing the len() of the values.

>>> ListCollector({'x': [1, 2], 'y': [3], 'z': [4, 5, 6]}).most_common(2)
[('z', [4, 5, 6]), ('x', [1, 2])]
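One possible implementation, sketched as a free function and modelled on how Counter.most_common() uses sorted() and heapq.nlargest(). The only change is ranking by len() of each value; the function name and signature are my assumptions.

```python
import heapq


def most_common(collector, n=None):
    # Rank items by how many values were collected under each key.
    def size(item):
        return len(item[1])

    if n is None:
        return sorted(collector.items(), key=size, reverse=True)
    return heapq.nlargest(n, collector.items(), key=size)


top = most_common({'x': [1, 2], 'y': [3], 'z': [4, 5, 6]}, 2)
print(top)  # [('z', [4, 5, 6]), ('x', [1, 2])]
```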

Other implementations

If the values are hashable, then other possibilities are available. We can collect only unique items using sets, or use Counter as multisets/bags to avoid losing the frequency data.

>>> import unicodedata
>>> from collections import SetCollector, CounterCollector
>>> SetCollector('Apoiehjr-8141¬', key=unicodedata.category)
SetCollector({'Lu': {'A'}, 'Ll': {'j', 'o', 'p', 'h', 'r', 'e', 'i'}, 'Pd': {'-'}, 'Nd': {'1', '4', '8'}, 'Sm': {'¬'}})
>>> CounterCollector('AARRR!!!!', key=unicodedata.category)
CounterCollector({'Lu': Counter({'R': 3, 'A': 2}), 'Po': Counter({'!': 4})})

Unlike ListCollector, both of these could support set operations (considering them as sets/counters of key-value items).
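As a rough sketch of what those set operations could look like for the set-based variant, here are | and & applied per key. The class is hypothetical, and dropping keys whose intersection is empty is my assumption, chosen so the result matches the intersection of the key-value item sets.

```python
class SetCollector(dict):
    """Hypothetical sketch showing only union and intersection."""

    def __missing__(self, k):
        value = self[k] = set()
        return value

    def __or__(self, other):
        # Union of the (key, value) items: merge the sets key by key.
        result = type(self)()
        for source in (self, other):
            for k, values in source.items():
                result[k] |= values
        return result

    def __and__(self, other):
        # Intersection of the (key, value) items: keep shared values only.
        result = type(self)()
        for k in self.keys() & other.keys():
            common = self[k] & other[k]
            if common:
                result[k] = common
        return result


a = SetCollector({'x': {1, 2}, 'y': {3}})
b = SetCollector({'x': {2, 4}, 'z': {5}})
union = a | b
intersection = a & b
```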


Just FYI: if you don’t need the extra functionality, you can use defaultdict(int) instead of collections.Counter for performance gains, since defaultdict is implemented in C.
