`itemgetter` split into 2 objects

ncoghlan · October 26, 2024, 4:08am

Only allowing default values for hashable keys would be problematic. However, accepting a Mapping wouldn’t force that, since any use case where keys aren’t required to be hashable for the subscript lookup step would be able to use that same data structure to hold the default values.

The defaults could be a dual mapping-or-iterable arg, where if the value isn’t a mapping, it must be an iterable with the following rules:

defaults iterable shorter than subscripts iterable: keys processed after the defaults iterator is exhausted have no default values
defaults iterable longer than subscripts iterable: extra values are ignored (allows use of itertools.repeat)

dg-pb · October 26, 2024, 11:05am

Something like this?

import itertools as itl

_NOTGIVEN = object()

class itemtuplegetter:
    def __init__(self, items, defaults=_NOTGIVEN):
        items = list(items)
        self.items = items
        if defaults is _NOTGIVEN:
            self.defaults = _NOTGIVEN
        else:
            self.items = items
            if isinstance(defaults, dict):
                self.defaults = [
                    defaults.get(item, _NOTGIVEN)
                    for item in items
                ]
            else:
                n = len(items)
                self.defaults = list(itl.islice(defaults, n))
                assert len(self.defaults) <= n

    def __call__(self, obj):
        defaults = self.defaults
        result = list()
        append = result.append
        if defaults is _NOTGIVEN:
            for item in self.items:
                append(obj[item])
        else:
            for item, default in zip(self.items, defaults):
                try:
                    append(obj[item])
                except (IndexError, KeyError):
                    append(default)
        return result

ncoghlan · October 26, 2024, 2:18pm

Along those lines, yeah (that specific code imposes limitations on the inputs that aren’t technically necessary, but Discourse isn’t a good platform for working out the finer details - we have GitHub for that).

dg-pb · October 29, 2024, 6:25am

I would also like to suggest exposing a read-only items attribute.

sayandipdutta · October 31, 2024, 3:54pm

What are the semantics when len(self.defaults) < n and, say, the (n-1)th item is not available? I believe it should raise the error, right?

dg-pb · October 31, 2024, 4:03pm

I believe it is best to have it like this.

In this case, one could even provide partial defaults without using a dict, at the possible expense of restrictions on order. E.g.:

itemtuplegetter(['a', 'b', 'c'], defaults={'b': 1})
# is the same (except the order) as
itemtuplegetter(['b', 'a', 'c'], defaults=(1,))
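
To make the equivalence concrete, here is a minimal pure-Python sketch. It assumes that items beyond the end of a short defaults iterable simply have no default, so a failed lookup for them raises. The class name `PartialDefaultsGetter` and the `_MISSING` sentinel are hypothetical stand-ins for illustration, not the proposed API.

```python
from itertools import islice

_MISSING = object()  # hypothetical sentinel meaning "no default for this item"


class PartialDefaultsGetter:
    """Illustrative stand-in for the proposed itemtuplegetter."""

    def __init__(self, items, /, defaults=()):
        self.items = tuple(items)
        if isinstance(defaults, dict):
            # Mapping form: one lookup per item, independent of item order.
            self.defaults = tuple(defaults.get(i, _MISSING) for i in self.items)
        else:
            # Iterable form: defaults pair up with the leading items only.
            head = tuple(islice(defaults, len(self.items)))
            self.defaults = head + (_MISSING,) * (len(self.items) - len(head))

    def __call__(self, obj, /):
        result = []
        for item, default in zip(self.items, self.defaults):
            try:
                result.append(obj[item])
            except (IndexError, KeyError):
                if default is _MISSING:
                    raise  # no default supplied: propagate the lookup error
                result.append(default)
        return tuple(result)


data = {'a': 10, 'c': 30}  # 'b' is missing
by_dict = PartialDefaultsGetter(['a', 'b', 'c'], defaults={'b': 1})
by_tuple = PartialDefaultsGetter(['b', 'a', 'c'], defaults=(1,))
print(by_dict(data))   # (10, 1, 30)
print(by_tuple(data))  # (1, 10, 30)
```

Both getters return the same values; only the order differs, because the tuple form requires the items with defaults to come first.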

dg-pb · November 4, 2024, 9:54am

I am thinking maybe it would be better to leave dict out of the initial implementation?

Internal storage of defaults cannot be a dict due to the hashability restriction, thus it will inevitably have to be a tuple.

Of course, there is the option of branching the code depending on the defaults input, such that:

getobj = itemtuplegetter(..., defaults={'a': 1})
print(getobj.defaults)    # {'a': 1}
getobj = itemtuplegetter(..., defaults=(1,))
print(getobj.defaults)    # (1,)

But this, IMO, makes a bit of a mess: one has to choose between (hashability + performance) and (order independence). (Performance, because __getitem__ of a dict is expensive, while tuple access is pretty much free in C.)

So I am thinking it may be better not to bother with dict for a start,
and to give it some time to come up with a solution that does not force the user to make unnecessary choices.

So in short, I would suggest itemtuplegetter(items: Iterable, defaults: Iterable) to start with.
To me it seems to offer maximum functionality without needing to make premature choices and risking getting stuck with them.

It could already do quite a lot:

# Arguments with defaults (by reversing order)
sys_argv = ['<PATH>', 1, 2]
arg_getter = itemtuplegetter([3, 2, 1], ['c', 'b'])
c, b, a = arg_getter(sys_argv)
print((a, b, c))    # (1, 2, 'c')

sayandipdutta · November 22, 2024, 10:28pm

In the last example implementation, it seems the result of itemtuplegetter([2, 4, 6], defaults=())(range(10)) would be [], because zip stops as soon as the (empty) defaults are exhausted. I hope this is just an oversight on your part, because if it is intended, this doesn’t make much sense to me. Additionally, do we agree on returning a tuple at the end?

Does the following meet the criteria of itemtuplegetter?

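from itertools import islice

class itemtuplegetter:
    def __init__(self, items, /, *, defaults=()):
        self.items = list(items)
        # Keep at most one default per item; extra defaults are ignored.
        self.defaults = list(islice(defaults, len(self.items)))

    def __call__(self, obj, /):
        result = []
        append = result.append

        # Items that have a default fall back to it on a failed lookup.
        for item, default in zip(self.items, self.defaults):
            try:
                append(obj[item])
            except (IndexError, KeyError):
                append(default)

        # Remaining items have no default, so lookup errors propagate.
        # Slicing by len(self.defaults) also covers the empty-defaults case.
        result.extend(obj[item] for item in self.items[len(self.defaults):])
        return tuple(result)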

Is it okay to make items positional-only? Or should that be left unspecified?

dg-pb · November 23, 2024, 1:17am

It is.

I would say items should be positional-only, while defaults should be positional-or-keyword.
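
A small sketch of what that signature would look like in pure Python. The class name `GetterSignatureSketch` is hypothetical and only the constructor is shown; it is not the proposed implementation.

```python
class GetterSignatureSketch:
    """Hypothetical class showing only the suggested constructor signature."""

    def __init__(self, items, /, defaults=()):
        # `items` is positional-only (it precedes the `/` marker);
        # `defaults` may be passed positionally or by keyword.
        self.items = tuple(items)
        self.defaults = tuple(defaults)


positional = GetterSignatureSketch(['a', 'b'], (1,))
keyword = GetterSignatureSketch(['a', 'b'], defaults=(1,))

try:
    GetterSignatureSketch(items=['a', 'b'])
except TypeError:
    items_keyword_rejected = True  # positional-only arguments reject keywords
else:
    items_keyword_rejected = False
```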

dg-pb · November 25, 2024, 3:00am

I think there is one more thing to consider. Instead of itemtuplegetter, there is an option of itemitergetter. I.e.:

iig = itemitergetter([1, 2, 3])
res = iig([0, 1, 2])
type(res)    # Iterator

Advantages:

Faster in all cases where the desired output is not a tuple
C implementation would be more elegant in the sense that it would make use of the nice properties of dealing with iterators, i.e. not needing to pre-calculate the total size to initialize the tuple.
Simply a more generic return object
Naming would be less ambiguous, i.e. iter means that both input and output are iterables/iterators, while in the case of tuple it only refers to the output.
itemitergetter and getitemiter would be able to share a lot of code. I think itemitergetter could just call getitemiter in its __call__ without any modifications.

Disadvantages:

Slower when desired output is tuple
C implementation would be more complex in a sense that one additional Iterator object would be required
Would be the only iterator-return function in operator module. Personally, I don’t think it is a big problem, but maybe there is a desired consistency there?
So far it seems to me that getitemtuple and itemtuplegetter need 2 completely independent implementations.

Just thought it might be worthwhile to consider this before it is too late.

@ncoghlan, any thoughts?

ncoghlan · November 25, 2024, 8:59pm

The iterator result sounds like a plausible option.

Iterator inputs (where the length isn’t known) will require additional thought in that case, though (since “iterate to exhaustion to find out the length” will no longer be an available implementation technique).
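
A rough pure-Python illustration of the unknown-length point, assuming a generator-based stand-in for the prototype getitemiter discussed in this thread (the function name `getitemiter_sketch` is hypothetical, and the actual prototype is not shown here). A lazy getter can serve an unbounded stream of subscripts, whereas a tuple-returning getter has to consume its subscripts fully before it can return.

```python
from itertools import count, islice


def getitemiter_sketch(obj, items):
    """Lazily yield obj[item] for each item; hypothetical stand-in only."""
    for item in items:
        yield obj[item]


squares = [n * n for n in range(10)]

# count(0, 3) is an infinite iterator of subscripts: 0, 3, 6, 9, 12, ...
lazy = getitemiter_sketch(squares, count(0, 3))

# Only the subscripts actually requested are looked up.
first_three = list(islice(lazy, 3))
print(first_three)  # [0, 9, 36]
```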

dg-pb · December 13, 2024, 10:17pm

Had new doubts, so deleted the above. However, arrived at the same conclusion.

itemitergetter is not too bad for a large number of items, but the initialisation cost of the iterator makes it non-competitive for a small number of items (n ~ 3).

And most of the use cases that I have seen are of short-to-mid length.

My best bet is to stick with itemtuplegetter.
This keeps things simpler and ensures that it will retain performance and properties of current itemgetter that people seem to like.

@ncoghlan, what do you think?

getitemiter is pretty much what itemitergetter.__call__ would do.

Benchmark Here!

S="
from operator import itemgetter, getitemiter
from collections import deque
obj1 = list(range(3))
obj2 = dict.fromkeys(range(3))
obj3 = tuple(range(100_000))
obj4 = dict.fromkeys(range(100_000))
ig1 = itemgetter(*obj1)
ig2 = itemgetter(*obj1)
ig3 = itemgetter(*obj3)
ig4 = itemgetter(*obj3)
consume = deque(maxlen=0).extend
"
$PYEXE -m timeit -s $S 'ig1(obj1)'                          #  80 ns
$PYEXE -m timeit -s $S 'list(ig1(obj1))'                    # 180 ns
$PYEXE -m timeit -s $S 'consume(getitemiter(obj1, obj1))'   # 230 ns
$PYEXE -m timeit -s $S 'list(getitemiter(obj1, obj1))'      # 330 ns
$PYEXE -m timeit -s $S '[obj1[i] for i in obj1]'            # 140 ns

# ----
$PYEXE -m timeit -s $S 'ig2(obj2)'                          #  90 ns
$PYEXE -m timeit -s $S 'list(ig2(obj2))'                    # 180 ns
$PYEXE -m timeit -s $S 'consume(getitemiter(obj2, obj2))'   # 250 ns
$PYEXE -m timeit -s $S 'list(getitemiter(obj2, obj2))'      # 380 ns
$PYEXE -m timeit -s $S '[obj2[i] for i in obj1]'            # 180 ns

# ----
$PYEXE -m timeit -s $S 'ig3(obj3)'                          # 1.3 ms
$PYEXE -m timeit -s $S 'list(ig3(obj3))'                    # 1.7 ms
$PYEXE -m timeit -s $S 'consume(getitemiter(obj3, obj3))'   # 1.5 ms
$PYEXE -m timeit -s $S 'list(getitemiter(obj3, obj3))'      # 1.7 ms
$PYEXE -m timeit -s $S '[obj3[i] for i in obj3]'            # 1.8 ms

# ----
$PYEXE -m timeit -s $S 'ig4(obj4)'                          # 2.8 ms
$PYEXE -m timeit -s $S 'list(ig4(obj4))'                    # 3.1 ms
$PYEXE -m timeit -s $S 'consume(getitemiter(obj4, obj3))'   # 3.1 ms
$PYEXE -m timeit -s $S 'list(getitemiter(obj4, obj3))'      # 3.5 ms
$PYEXE -m timeit -s $S '[obj4[i] for i in obj3]'            # 3.6 ms

ncoghlan · December 17, 2024, 1:50pm

Sticking with the simplicity of itemtuplegetter makes sense to me.

There comes a point where the right answer is “Just write a dedicated custom attribute retrieval function already”, and a use case where building the attribute tuple to return is expensive enough to be problematic can reasonably be argued to be past that point.

The sweet spot for this is “I want to pull a handful of values out of this mapping”, not the kind of n-ary data lookup that people might otherwise be throwing into some flavour of data frame (pandas, polars, etc).

dg-pb · December 17, 2024, 2:15pm

Ok, then I think I am convinced tuple return is the way. (as opposed to iterator / list)

I have one more concern, naming.
Another extension (which I have implemented myself) is setitemtuple / itemtuplesetter.
I don’t think it is necessarily needed, but I think there is a non-zero chance that it will be implemented some day.

The issue is that there is no tuple in those.

Thus, if someone can come up with naming, which is as good and would be appropriate for setter equivalents, that would be great.

It is no big deal if the names differ, e.g. itemtuplegetter & multiitemsetter, but I think it would still be nicer to have premeditated alignment in case this ever happens.

dg-pb · December 17, 2024, 3:28pm

On the other hand, itemtuplesetter could work, suggesting that it “implicitly converts inputs into tuples” (even if it doesn’t…).

Sticking with itemtuplegetter for now. There is still time to change if something better comes up.

ncoghlan · December 18, 2024, 1:53am

For the setter case, there’s no return type ambiguity, so you can build an unambiguous API based on the way itemgetter works.

~~The issue with singleton tuples never comes up.~~

Edit: scratch that, the ambiguity does come up based on whether the callable signature is (*values) or (values). For that, itemsetter and itemitersetter would be reasonable names - as you say, there’s not necessarily a tuple involved in the latter case, just an arbitrary iterable of item values.
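
As a hedged sketch of the two signatures being contrasted: the function names below reuse the names floated in this thread with a `_sketch` suffix, and the assumption that the target object is the first argument of the returned callable is mine, not something settled here.

```python
def itemsetter_sketch(*items):
    """Hypothetical setter whose callable takes values as (*values)."""
    def setter(obj, *values):
        if len(values) != len(items):
            raise TypeError(f"expected {len(items)} values, got {len(values)}")
        for item, value in zip(items, values):
            obj[item] = value
    return setter


def itemitersetter_sketch(items):
    """Hypothetical setter whose callable takes one iterable (values)."""
    items = tuple(items)

    def setter(obj, values):
        values = tuple(values)  # accept any iterable, including iterators
        if len(values) != len(items):
            raise ValueError(f"expected {len(items)} values, got {len(values)}")
        for item, value in zip(items, values):
            obj[item] = value
    return setter


record = {}
itemsetter_sketch('a', 'b')(record, 1, 2)   # values passed as separate arguments
itemitersetter_sketch(['c'])(record, [3])   # a single item still takes an iterable
print(record)  # {'a': 1, 'b': 2, 'c': 3}
```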

dg-pb · December 18, 2024, 12:34pm

Hm, maybe getitemiter & itemitergetter are good names then.

tuple refers to implementation detail of return value (it might as well be a list, some other Sequence or even Iterator) and not to more general concept, which is to act on Iterables.

This way it would also apply to setitemiter, itemitersetter and possibly other similar variants of this type without limiting them on return value type. E.g.:

delitemiter
additer
…

ncoghlan · December 18, 2024, 2:25pm

Once there’s a return type involved, I unfortunately think it gets harder to justify interpreting iter as referring to Iterable rather than Iterator.

This is due to the way the iter builtin works: it accepts an iterable and produces an iterator.

For itemitersetter and itemiterdeleter, only the “accepts an iterable” aspect applies.

For itemitergetter, though, both aspects of the iter behaviour are relevant, so returning an iterable instead of an iterator would be surprising.
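
The iterable-in, iterator-out behaviour of the iter builtin can be checked directly with the abstract base classes from collections.abc:

```python
from collections.abc import Iterable, Iterator

values = (1, 2, 3)
it = iter(values)

# A tuple is an iterable but not an iterator.
tuple_is_iterable = isinstance(values, Iterable)
tuple_is_iterator = isinstance(values, Iterator)

# iter() hands back an iterator, and calling iter() on it returns itself.
result_is_iterator = isinstance(it, Iterator)
iter_is_idempotent = iter(it) is it

print(tuple_is_iterable, tuple_is_iterator, result_is_iterator, iter_is_idempotent)
# True False True True
```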

Hence the original conclusion that itemtuplegetter is a better name when that is what the callable is returning.

My inclination would be that all the iter variants should be proposed as additions to the more-itertools project, and only the narrower tuple-based return type disambiguation API for itemgetter be considered as a potential stdlib addition.

dg-pb · December 18, 2024, 3:07pm

True, or itertools philosophy in general.
I agree, best not to mix these up.

Back to itemtuplegetter. Although I still don’t like the fact that this naming would not apply to setters, I suppose it is the best so far.

One more: getmanyitems / manyitemsgetter…

ncoghlan · December 18, 2024, 9:54pm

The key benefit of the proposed API over the status quo is that it can emit singleton tuples, so having many (or anything along those lines) in the name seems inappropriate to me.

From a type signature point of view, we’re narrowing the potential return types from itemgetter’s Any | tuple[Any, ...] to just tuple[Any, ...], so including tuple in the name does make sense.
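
The ambiguity in the current operator.itemgetter is easy to reproduce. In the snippet below, `pick` and `pick_tuple` are hypothetical helper names; `pick_tuple` only mimics the always-a-tuple behaviour under discussion and is not the proposed implementation.

```python
from operator import itemgetter

row = ['x', 'y', 'z']


def pick(keys):
    """Status quo: the return type depends on how many keys are given."""
    return itemgetter(*keys)(row)


def pick_tuple(keys):
    """Always-a-tuple behaviour, written out by hand for comparison."""
    return tuple(row[k] for k in keys)


print(pick([0, 2]))      # ('x', 'z')
print(pick([1]))         # y  (a bare value, not a 1-tuple)
print(pick_tuple([1]))   # ('y',)
```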