Memoizing based on `id()` to avoid implementing and computing the `__hash__` method

Hi everyone,

As you know, functools.lru_cache requires that function arguments be hashable. However, in practice, I often find myself wanting to memoize functions or methods whose arguments are not hashable, but which I know to be immutable (or at least not mutated according to our coding conventions).

I was wondering if it would make sense to implement (and maybe propose for functools) a memoization utility that uses the id() of the arguments as cache keys.

This would have a few advantages:

  • it avoids the cost of hashing and equality checks;
  • it allows caching of non-hashable objects (e.g. np.array), as long as we’re confident they’re not modified.

Of course, the drawback is that two equal but distinct objects (same content but different memory locations) wouldn’t result in a cache hit.
And since id() values are reused after object destruction, the cache would need to remove keys when the objects are garbage-collected, for example using weakref.ref(..., callback) (see docs).
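
To sketch the mechanism (a simplified illustration only: positional arguments, all of them assumed weakref-able; `id_cache` is just a placeholder name):

import weakref
from functools import wraps

def id_cache(func):
    """Memoize by id(); evict an entry as soon as any of its arguments
    is garbage-collected, since that id() may then be reused."""
    cache = {}

    @wraps(func)
    def wrapper(*args):
        key = tuple(map(id, args))
        if key in cache:
            return cache[key][0]
        result = func(*args)
        # Keep the weakrefs alive inside the cache entry so that their
        # eviction callbacks can fire.
        refs = [weakref.ref(arg, lambda _, k=key: cache.pop(k, None))
                for arg in args]
        cache[key] = (result, refs)
        return result

    return wrapper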

I understand that such a tool would need to be used carefully:

  • Python doesn’t provide a built-in way to declare objects as truly immutable;
  • a developer might accidentally mutate an object used as a cache key, leading to hard-to-debug issues.

That said, in my personal use case, I often run expensive data-processing functions multiple times on the same instances, which include np.arrays. These arrays are technically mutable, but in practice, we never modify them (according to internal conventions).

I’ve drafted a rough implementation of such a decorator (a bit “dirty”), and I’d be happy to share it if there’s interest.

What do you think?
Is there a better approach? Has this kind of idea already been discussed or rejected for good reasons?

This is what already happens if you don’t implement __eq__ or __hash__ on the objects. I don’t think it’s worth implementing more support than this: the use cases are really rare (essentially boiling down to “I know better than the original authors how this object ought to behave”), and it’s easy enough to create e.g. a small wrapper object for the purpose of caching by identity.

Thanks for your reply!

I’m not sure I follow your point — as far as I know, lru_cache requires all arguments to be hashable, since it uses a dictionary under the hood (docs). If an argument isn’t hashable, it raises a TypeError, which is why I was considering caching based on id() instead.

I agree the use case might be rare — it would be useful in my situation, but I’m mainly asking to see if others in the community would also find it valuable.

By default, if neither __eq__ nor __hash__ is implemented, an object is hashable by id. If you want non-hashable objects to be hashable, create a wrapper around them that is hashable.
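
For example, a minimal sketch of such a wrapper (the name ByIdentity is just illustrative):

class ByIdentity:
    """Make an unhashable object usable as a dict or lru_cache key by
    hashing and comparing it by identity."""
    __slots__ = ('obj',)

    def __init__(self, obj):
        self.obj = obj

    def __hash__(self):
        return id(self.obj)

    def __eq__(self, other):
        return isinstance(other, ByIdentity) and self.obj is other.obj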

That’s all well and good, but I don’t understand how that helps create an lru_cache for a function that takes a list of numbers.

To avoid subclassing, how about if the functools.lru_cache decorator accepted a custom hash function?

I’ve had a need for this once too: I wanted to cache a function that took arrays as input. Being able to provide a custom hash function sounds like an elegant solution.

This can currently be implemented as a wrapper over functools.lru_cache that “hashifies” unhashable arguments for lru_cache, which in turn decorates an “unwrapper” that “unhashifies” the hashified arguments before calling the actual function:

from functools import lru_cache

def hashifying_lru_cache(
        func_or_None=None, maxsize=128, typed=False, hasher=id):
    def decorator(func):
        class Hashified:
            # Wraps an unhashable argument so lru_cache can key on it,
            # hashing and comparing via `hasher` (identity by default).
            def __init__(self, obj):
                self.obj = obj
            def __eq__(self, other):
                return hasher(self.obj) == hasher(other.obj)
            def __hash__(self):
                return hasher(self.obj)
        def unwrapper(*args, **kwargs):
            # Unwrap Hashified arguments before calling the real function.
            return func(
                *(arg.obj if isinstance(arg, Hashified) else arg
                    for arg in args),
                **{name: value.obj if isinstance(value, Hashified) else value
                    for name, value in kwargs.items()}
            )
        def wrapper(*args, **kwargs):
            # Wrap only the unhashable arguments (those whose __hash__ is None).
            return decorated(
                *(Hashified(arg) if arg.__hash__ is None else arg
                    for arg in args),
                **{name: Hashified(value) if value.__hash__ is None else value
                    for name, value in kwargs.items()}
            )
        decorated = lru_cache(maxsize=maxsize, typed=typed)(unwrapper)
        return wrapper
    return decorator if func_or_None is None else decorator(func_or_None)

so that:

@hashifying_lru_cache
def sum_list(const_list):
    print('got', const_list)
    return sum(const_list)

lst = [1, 2, 3]
print(sum_list(lst))
print(sum_list(lst))

outputs:

got [1, 2, 3]
6
6

Demo: jIbybZ - Online Python3 Interpreter & Debugging Tool - Ideone.com

The gain should hopefully be worth the wrapping/unwrapping overhead for actually expensive functions.

Thanks for all your responses!

Indeed, Ben Hsing’s solution is quite lightweight and practical — I think it actually covers most of my needs. The only downside compared to what I initially had in mind (which was more complex) is that the function arguments are kept alive in memory as long as they’re in the cache, even if no other references exist elsewhere. I was originally thinking of a mechanism where cache entries could be automatically removed when the argument is garbage collected, using weakref callbacks. But realistically, Ben’s solution is probably sufficient for the vast majority of use cases.

That’s all well and good, but I don’t understand how that helps create an lru_cache for a function that takes a list of numbers.

If we have a shared coding convention that ensures such lists aren’t mutated, then caching based on their id() would still work in principle. However, lists aren’t weakrefable, which makes things trickier for the solution I had in mind. I’m not deeply familiar with CPython internals, but I was wondering whether there’s a way to mimic the weakref callback logic for such cases. If that’s not feasible, maybe one could recursively generate a key from the elements themselves instead. A similar strategy might also be needed for other built-in types that are neither hashable nor weakrefable (e.g., dicts).
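
For example, a rough sketch of such a recursive key builder (make_key is a hypothetical name; anything that isn’t a list or dict falls back to id()):

def make_key(obj):
    # Decompose containers that are neither hashable nor weakrefable;
    # fall back to id() for leaves, assumed immutable by convention.
    if isinstance(obj, list):
        return ('list', tuple(make_key(item) for item in obj))
    if isinstance(obj, dict):
        return ('dict', tuple(
            (make_key(k), make_key(v)) for k, v in obj.items()))
    return id(obj)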

Being able to provide a custom hash function sounds like an elegant solution.

I agree — and there’s actually a related discussion here. The idea was rejected in part because of the issue that “the parameters have to be stated three times: once in the main function, again in the key function lambda parameters, and again in the lambda function body.”

Yeah, or you can always specify a maxsize to keep the memory usage in check while letting IDs of deleted objects fall out of the cache by themselves over time.
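
For instance, with the hashifying_lru_cache sketch above:

# Bounded cache: entries keyed on a reused id() eventually age out
# under the LRU policy instead of accumulating indefinitely.
@hashifying_lru_cache(maxsize=256)
def expensive(arr):
    ...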