Ability to specify default values on itemgetter and attrgetter

I’m a big fan of itemgetter and attrgetter, but when working with large lists of dicts or objects, a particular key or attribute is sometimes missing from one of the list members. I wish I could define default values to use in the event of a missing key or attribute, something like this (with a similar change for attrgetter):

class itemgetter:
    """
    Return a callable object that fetches the given item(s) from its operand.
    After f = itemgetter(2), the call f(r) returns r[2].
    After g = itemgetter(2, 5, 3), the call g(r) returns (r[2], r[5], r[3])

    Accepts optional keyword arguments:
    default: Default value if the single item is not found in the input mapping.
    defaults: Sequence of default values if corresponding item is not found in the input mapping.
    """
    __slots__ = ('_items', '_call', '_defaults')
    NOT_GIVEN = object()

    def __init__(self, item, /, *items, default=NOT_GIVEN, defaults=NOT_GIVEN):
        if not items:
            self._items = (item,)
            if defaults is not self.NOT_GIVEN:
                default = defaults[0]
            if default is self.NOT_GIVEN:
                self._defaults = self.NOT_GIVEN
                def func(obj):
                    return obj[item]
            else:
                self._defaults = (default,)
                def func(obj):
                    return obj.get(item, default)
            self._call = func
        else:
            self._items = items = (item,) + items
            if default is self.NOT_GIVEN and defaults is self.NOT_GIVEN:
                self._defaults = self.NOT_GIVEN
                def func(obj):
                    return tuple(obj[i] for i in items)
            else:
                self._defaults = defaults = defaults if defaults is not self.NOT_GIVEN else (default,)
                def func(obj):
                    return tuple(obj.get(i, d) for i, d in zip(items, defaults))
            self._call = func

    def __call__(self, obj, /):
        return self._call(obj)

    def __repr__(self):
        if self._defaults is self.NOT_GIVEN:
            return '%s.%s(%s)' % (self.__class__.__module__,
                                  self.__class__.__name__,
                                  ', '.join(map(repr, self._items)))
        else:
            return '%s.%s(%s, defaults=(%s,))' % (self.__class__.__module__,
                                                  self.__class__.__name__,
                                                  ', '.join(map(repr, self._items)),
                                                  ', '.join(map(repr, self._defaults)))

    def __reduce__(self):
        if self._defaults is self.NOT_GIVEN:
            return self.__class__, self._items
        # 'defaults' is keyword-only, so it can't be passed through a
        # plain (class, args) pair; rebuild via functools.partial.
        import functools
        return (functools.partial(self.__class__, *self._items,
                                  defaults=self._defaults), ())


if __name__ == '__main__':
    dd = {'a': 1, 'b': 2, 'c': 3}
    ig = itemgetter("a", "b", "c", defaults=(0, 0, 0))
    print(repr(ig))
    print(ig(dd))
    dd.pop("b")
    print(ig(dd))

    ig = itemgetter("a", default=0)
    print(repr(ig))
    print(ig(dd))
    dd.pop("a")
    print(ig(dd))

    ig = itemgetter("a", defaults=(0,))
    print(repr(ig))
    print(ig(dd))

    ig = itemgetter("a", "b", "c", default=0)
    print(repr(ig))
    print(ig(dd))

If I polished this up (and also implemented the code in the corresponding _operator C module), would this be supported for inclusion in the stdlib operator module?

– Paul

I’m not a huge fan of the two similar-but-different default/defaults arguments. They feel awkward (to explain and use), and there are some surprising edge cases: itemgetter("a", "b", "c", default=0) silently ignoring “b” and “c” doesn’t seem like the right behaviour.
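(The truncation comes from zip() stopping at the shorter sequence; a minimal illustration, independent of the sketch above:)

```python
# zip() stops at the shorter input, so pairing three items with a
# single broadcast default silently drops the trailing items.
items = ("a", "b", "c")
defaults = (0,)                      # what default=0 expands to
pairs = tuple(zip(items, defaults))
print(pairs)                         # (('a', 0),) -- "b" and "c" are gone
```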

Something like the following might be a better design:

>>> def get_with_defaults(defaults):
...     def impl(d):
...         return {k: d.get(k, v) for (k, v) in defaults.items()}
...     return impl
...
>>> get_with_defaults({"a": 0, "b": 0, "c": 0})({"b": 12})
{'a': 0, 'b': 12, 'c': 0}

The idea here is that the function is called with a “template” dictionary, with the defaults given. Pass an actual dictionary to the returned callable and you get back the template “filled in” with the argument’s values.

Sorry, I got confused about the expected result. But it’s simple enough to fix:

def get_with_defaults(defaults):
    def impl(d):
        return tuple(d.get(k, v) for (k, v) in defaults.items())
    return impl

Edit: Also, depending on your use case or preference, the signature get_with_defaults(**defaults) might be better…
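For illustration, that `**defaults` variant might look like this (keyword argument order has been preserved since Python 3.6 per PEP 468, so the result tuple is ordered as written):

```python
def get_with_defaults(**defaults):
    # Keys are the items to fetch; values are the fallbacks.
    def impl(d):
        return tuple(d.get(k, v) for k, v in defaults.items())
    return impl

print(get_with_defaults(a=0, b=0, c=0)({"b": 12}))  # (0, 12, 0)
```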

Thanks for the quick response! For the moment though, I’d like to first focus on the concept of changing itemgetter to accept default values in the case of missing keys in a given dict. If that isn’t deemed of interest, then there is no point in wasting time on refining the signature. [Actually, this discussion led to the simpler suggested form below.]

I also saw the odd corner case, and considered passing a dict for the defaults. With the alternative you suggest, a call to itemgetter for a single key with a default would be:

get_a_with_default = itemgetter("a", defaults={"a": 0})

Repeating the key(s) in the call certainly disambiguates which default goes with which key. For that matter then, we might consider adding an alternative signature for itemgetter that just takes a dict, with the keys being the keys to pull from the object dicts, and the values being the defaults. This simplifies the call to:

get_a_with_default = itemgetter({"a": 0})

Now no weird default/defaults arg - in fact, no additional named arg at all. Just some isinstance testing in itemgetter.__init__ to detect that a dict was given.

Something like this:

from collections.abc import Mapping


class itemgetter:
    """
    Return a callable object that fetches the given item(s) from its operand.
    After f = itemgetter(2), the call f(r) returns r[2].
    After g = itemgetter(2, 5, 3), the call g(r) returns (r[2], r[5], r[3])
    If called with a mapping object, the callable will use the keys of the 
    mapping to select the items from the operand, using the respective values
    as defaults if the item is not present in the operand.
    """
    __slots__ = ('_items', '_call')

    def __init__(self, item, /, *items):

        if not items and not isinstance(item, Mapping):
            self._items = (item,)
            def func(obj):
                return obj[item]
            self._call = func
        else:
            if isinstance(item, Mapping):
                self._items = item
                def func(obj):
                    return tuple(obj.get(i, d) for i, d in item.items())
            else:
                self._items = items = (item,) + items
                def func(obj):
                    return tuple(obj[i] for i in items)
            self._call = func

    def __call__(self, obj, /):
        return self._call(obj)

    def __repr__(self):
        if isinstance(self._items, Mapping):
            return '%s.%s(%r)' % (self.__class__.__module__,
                                  self.__class__.__name__,
                                  self._items)

        return '%s.%s(%s)' % (self.__class__.__module__,
                              self.__class__.__name__,
                              ', '.join(map(repr, self._items)))

    def __reduce__(self):
        # When a mapping was given, _items is the mapping itself, so it
        # must be wrapped in a 1-tuple to survive the pickle round trip.
        if isinstance(self._items, Mapping):
            return self.__class__, (self._items,)
        return self.__class__, self._items


if __name__ == '__main__':
    dd = {'a': 1, 'b': 2, 'c': 3}
    ig = itemgetter("a", "b", "c")
    print(repr(ig))
    print(ig(dd))

    ig = itemgetter({"a": 0, "b": 0, "c": 0})
    print(repr(ig))
    dd.pop("b")
    print(ig(dd))

    # if all keys have the same default value
    ig = itemgetter(dict.fromkeys("a b c".split(), 0))
    print(repr(ig))
    print(ig(dd))

    ig = itemgetter({"a": 0})
    print(repr(ig))
    print(ig(dd))
    dd.pop("a")
    print(ig(dd))

I also like the symmetry with __slots__ accepting a dict to define attributes (added in Python 3.8, where the dict values are used as docstrings).

Personally, I don’t feel that it’s particularly worth it. Having said that, in my own code I tend to simply use a lambda rather than reaching for itemgetter, and so writing my own helper (which is, after all, only a couple of lines of code) seems perfectly natural to me.

Presumably the benefit of itemgetter is performance[1]? So I think the argument for allowing defaults should probably also revolve around performance - can the defaults be implemented efficiently, and does the existence of the option to include defaults harm performance of the no-defaults case?

FWIW, itemgetter is over 4 times faster than a Python function, so yes, performance is relevant here. But even a Python function runs in nanoseconds, so I imagine it’s only important in pretty hot code…

❯ py -m timeit -s "from operator import itemgetter; d=dict(zip('abcdefghi', range(10)))" "x = itemgetter('a','c','e')(d)"
2000000 loops, best of 5: 147 nsec per loop
❯ py -m timeit -s "from operator import itemgetter; d=dict(zip('abcdefghi', range(10))); f = lambda *k: (lambda d: tuple(d[kk] for kk in k))" "x = f('a','c','e')(d)"
500000 loops, best of 5: 624 nsec per loop

  1. I’d have a hard time believing it’s the attractive and intuitive name :wink:

Moving the itemgetter call into the setup string (since itemgetter is usually called once and then used against many mappings) shows a 7x speedup:

> py -m timeit -s "from operator import itemgetter; d=dict(zip('abcdefghi', range(10))); fn=itemgetter('a','b','c')" "x = fn(d)"
5000000 loops, best of 5: 53.3 nsec per loop
> py -m timeit -s "from operator import itemgetter; d=dict(zip('abcdefghi', range(10))); f = lambda *k: (lambda d: tuple(d[kk] for kk in k)); fn=f('a','c','e')" "x = fn(d)"
1000000 loops, best of 5: 351 nsec per loop

(For raw time comparison with your numbers, the first timeit that you ran runs on my machine in 99.4 nsec per loop.)


The no-defaults cases (with a single arg or with multiple args) return the exact same function as the current code does, so there is no impact on the performance of running the generated callable for all existing usages. As for the runtime of the proposed with-defaults case, one improvement I can see would be to capture item.items() into a list or tuple before defining func, and then just iterating over that (instead of calling items() repeatedly).
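A hedged sketch of that snapshot idea (variable names illustrative): the (key, default) pairs are materialized once at construction time rather than on every call:

```python
item = {"a": 0, "b": 0, "c": 0}   # the Mapping passed to __init__
pairs = tuple(item.items())        # snapshot once, up front

def func(obj):
    # Iterate the precomputed pairs; no items() call per invocation,
    # and later mutation of the original mapping is ignored.
    return tuple(obj.get(i, d) for i, d in pairs)

print(func({"a": 1, "c": 3}))      # (1, 0, 3)
```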


Adding defaults to attrgetter may be a little trickier, since attrgetter supports access to attributes in object substructures, like fn = attrgetter('attr.subattr.subsubattr'); fn(obj) will return obj.attr.subattr.subsubattr - I suppose the default could be applied if any of the intervening attributes were missing? (Current behavior is to raise AttributeError.)
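One possible shape for that behaviour, as a plain-Python sketch (the name attrgetter_with_default and the single-path signature are mine, not a proposed API):

```python
from types import SimpleNamespace

def attrgetter_with_default(path, default):
    names = path.split('.')
    def getter(obj):
        try:
            for name in names:
                obj = getattr(obj, name)
        except AttributeError:
            return default      # any missing link in the chain yields the default
        return obj
    return getter

obj = SimpleNamespace(attr=SimpleNamespace(subattr=SimpleNamespace(subsubattr=42)))
fn = attrgetter_with_default('attr.subattr.subsubattr', default=None)
print(fn(obj))                  # 42
print(fn(SimpleNamespace()))    # None (missing intervening attribute)
```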

If you use a lambda tailored to the specific job, the performance difference is much smaller:

% ~/.pyenv/versions/3.12.2/bin/python -m timeit -s "from operator import itemgetter; d=dict(zip('abcdefghi', range(10))); fn = lambda d: (d['a'], d['b'], d['c'])" "x = fn(d)"
5000000 loops, best of 5: 68.9 nsec per loop
% ~/.pyenv/versions/3.12.2/bin/python -m timeit -s "from operator import itemgetter; d=dict(zip('abcdefghi', range(10))); fn=itemgetter('a','b','c')" "x = fn(d)"
5000000 loops, best of 5: 62.7 nsec per loop

That’s how I’d personally write this; I feel the operator functions generally don’t make the code easier to understand, and the performance difference is usually unlikely to matter in practice.

Thanks. I’d also tend to write a very specific lambda in real life, so it’s interesting to see that it’s essentially just as cheap as itemgetter. I didn’t mean to make this all about performance, though - if there are other reasons people prefer itemgetter then there might be arguments for this proposal. It’s just that I, personally, don’t have much of an opinion either way.


If I have more than 1 or 2 keys that I need to get from a list of mappings, itemgetter is way easier to work with than writing a bespoke lambda. And if this is code in a library (like an ORM-type library), you don’t necessarily know the keys in advance such that you can write that lambda (short of generating a string and eval’ing it).
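To illustrate the dynamic-keys point: itemgetter composes naturally with keys discovered at runtime, where a hand-written lambda would need code generation:

```python
from operator import itemgetter

# Keys known only at runtime, e.g. read from a schema or a user query.
keys = ["name", "id"]
getter = itemgetter(*keys)      # no eval() or string templating needed
row = {"name": "widget", "id": 3, "price": 9.99}
print(getter(row))              # ('widget', 3)
```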

Thanks both of you for your perspectives.

An unoptimized, simple approach:

from collections import ChainMap
from operator import itemgetter

# ChainMap accepts non-mapping arguments already, but doesn't handle the
# fact that sequences will raise `IndexError` rather than `KeyError`.
# This still allows slices and negative indices to have their usual
# semantics with lists.
class _list_mapping_wrapper:
    def __init__(self, original_list):
        self._original = original_list

    def __getitem__(self, key):
        try:
            return self._original[key]
        except IndexError:
            raise KeyError(key)

# We won't actually mutate `fallback` so a mutable default is fine
def my_itemgetter(*items, fallback={}):
    impl = itemgetter(*items)
    def result(container):
        return impl(ChainMap(_list_mapping_wrapper(container), fallback))
    return result

And now we can do:

>>> my_itemgetter('a', 'b', 'c', fallback={'c': 0})({'a':1, 'b':2})
(1, 2, 0)

Thanks Karl, and I’ll look at my code in littletable to see if that helps me where I need to do this. I might still pull a PEP together now that I’ve thought through some of the implementation details (thanks to @pf_moore for helping me clarify a better signature for how to define these defaults). I think the PEP process may get this idea in front of more folks who use the operator module, and can better weigh the merits of defaults on attrgetter and itemgetter. The worst that can happen is that I refresh my Python-C integration skills in doing a proposed implementation and the PEP doesn’t get accepted (wouldn’t be the first time! :smiley: ).
