Caching methods causes instances to live forever

I will use this class for the examples:

import time
from functools import cache 


class Person:
    def __init__(self, name: str, age: int) -> None:
        self.name = name
        self.age = age

    @cache
    def think_about_math(self, n: int) -> int:
        time.sleep(n)
        return n*2

Let's say I have a long-running script that creates and deletes Person instances:

def foo(count: int) -> None:
    for n in range(count):
        p = Person(name="bob", age=99)
        p.think_about_math(n)

foo(5)

One would expect that after the function returns, all the Person instances are garbage-collected and therefore the cache entries are freed as well, but this is not the case.
If I execute this code:

import gc

p = Person("bob", age=99)
p.think_about_math(2)
print(len(gc.get_referrers(p)))

2

There are 2 referrers to that object even though I created only one reference, p. So when I drop p, the instance will still be alive, because the cache holds another reference to it.
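To make the leak concrete, here is a self-contained sketch (using a weakref, with the sleep removed so it runs instantly) showing that the instance survives even after the only name referring to it is dropped:

```python
import gc
import weakref
from functools import cache


class Person:
    def __init__(self, name: str, age: int) -> None:
        self.name = name
        self.age = age

    @cache
    def think_about_math(self, n: int) -> int:
        return n * 2


p = Person("bob", 99)
ref = weakref.ref(p)
p.think_about_math(1)
del p
gc.collect()
# The cache key still references the instance, so it is never collected
alive = ref() is not None
print(alive)  # True
```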

why?

All this time I was under the impression that the cache decorator computes a hash of the passed parameters (including self) and stores it in a dict, where the value is the original function's result.

But it turns out that Python stores the whole set of function parameters in a tuple (which is hashable), and sets that tuple object itself as the key in the dict.
Here is the function that creates the tuple stored in the cache dict:
https://github.com/python/cpython/blob/main/Lib/functools.py#L464

At the end, the _make_key function returns:

return _HashedSeq(key)

_HashedSeq is just an object that inherits from list but implements __hash__. Our instance reference is in there, in that list; that's why the referrer count jumped to 2, and that's why the instance and the cache entry will never die.
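A simplified sketch of what _HashedSeq amounts to (the real implementation also handles keyword arguments and type markers; this is just the core idea):

```python
class HashedSeq(list):
    """Simplified stand-in for functools._HashedSeq: a list that is
    hashable because it precomputes and caches the hash of the tuple."""
    __slots__ = ("hashvalue",)

    def __init__(self, tup):
        self[:] = tup
        self.hashvalue = hash(tup)

    def __hash__(self):
        return self.hashvalue


class Person:
    pass


p = Person()
key = HashedSeq((p, 2))  # roughly what the cache stores as its dict key
print(p in key)          # True: the key holds a strong reference to p
```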

The same thing happens in the C module of functools:
https://github.com/python/cpython/blob/main/Modules/_functoolsmodule.c#L858

The return value is a Python object (in that case a tuple) that stores all of our references.

# snippet
885  |  key = PyTuple_New(key_size);

Why does it return a Python object instead of the hash itself? I don't know; maybe someone can explain. Should it be changed?

Would you prefer fake cache hits, where a person gets a previous person’s results just because they happen to have the same hash?

Why? _make_key should return a hash instead of a tuple. Why create a new reference to the instance in a tuple and set the tuple as the key in the cache dict? The dict converts the tuple key to a hash behind the scenes anyway.

Yes, but dict hashes aren’t unique - hash collisions are expected and catered for in the dict implementation.
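A contrived illustration of that point: two distinct, unequal objects can share a hash, so a cache keyed on the hash alone would hand one caller the other's result. (The class name and values here are made up for the demo.)

```python
class Weird:
    """Two distinct, unequal objects that deliberately share a hash."""
    def __hash__(self):
        return 42


a, b = Weird(), Weird()
print(hash(a) == hash(b))  # True
print(a == b)              # False

# A cache keyed only on hash(args) would treat calls with a and b as the
# same entry: a fake cache hit. Keying on the full argument tuple lets
# the dict fall back to == comparison and keep them separate.
fake_cache = {hash((a,)): "result for a"}
print(hash((b,)) in fake_cache)  # True: b would wrongly get a's result
```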

The example is incorrectly designed, because the method body does not depend on self. Use staticmethod instead:

@staticmethod
@cache
def think_about_math(n: int) -> int:
    time.sleep(n)
    return n*2
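With that change, the instance is no longer part of the cache key, and the cache is shared across instances. A quick check (sleep removed so it runs instantly):

```python
from functools import cache


class Person:
    def __init__(self, name: str, age: int) -> None:
        self.name = name
        self.age = age

    @staticmethod
    @cache
    def think_about_math(n: int) -> int:
        return n * 2


a = Person("bob", 99)
b = Person("alice", 42)
a.think_about_math(21)
b.think_about_math(21)  # cache hit: no instance is part of the key
print(Person.think_about_math.cache_info().hits)  # 1
```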

Another incorrect design is decorating with cache instead of lru_cache. Applying cache to any function or method causes its inputs to live until cache_clear is called. The cache decorator translates to lru_cache(maxsize=None).
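For comparison, a bounded lru_cache evicts its oldest entries (and drops its references to them) once maxsize is reached. A minimal demo with a made-up function:

```python
from functools import lru_cache


@lru_cache(maxsize=2)
def double(n: int) -> int:
    return n * 2


for i in range(5):
    double(i)

info = double.cache_info()
print(info.currsize, info.misses)  # 2 5
```

Only the two most recently used entries survive; everything older has been evicted and is eligible for garbage collection.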

Linking follow-up GitHub issue for future context:

As a side note, this scenario is used in the tutorial of memray (a memory profiler) as a hard-to-find memory “leak”:

https://bloomberg.github.io/memray/tutorials/3.html

It is easy to find, because cache is documented to accumulate entries without bound. This is easy to recognize with any callable, not just with methods: everywhere it occurs, the inputs and outputs live until the cache is cleared.

Almost all other memory leaks are harder to find, since almost anything can create a reference to an object. At least cache is clear about what it does, and it is easy to fix: replace cache with lru_cache and an explicit size limit.

I didn’t mean to get into the discussion of whether or not it’s particularly easy or hard - my bad for saying that.

I do still think this behavior is unexpected. When I showed it to beginners, they expected a new cache object to be created for each instance. I think that’s more intuitive.

This is memray’s solution to the problem:

class Person:
    def __init__(self, name: str, age: int) -> None:
        self.name = name
        self.age = age
        self.think_about_math = cache(self._think_about_math)

    def _think_about_math(self, n: int) -> int:
        time.sleep(n)
        return n*2
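With this per-instance cache, the only path from the cache back to the instance is a reference cycle (instance → cached wrapper → bound method → instance), which the cyclic garbage collector can break. A self-contained check (sleep removed so it runs instantly):

```python
import gc
import weakref
from functools import cache


class Person:
    def __init__(self, name: str, age: int) -> None:
        self.name = name
        self.age = age
        # Per-instance cache: it dies together with the instance
        self.think_about_math = cache(self._think_about_math)

    def _think_about_math(self, n: int) -> int:
        return n * 2


p = Person("bob", 99)
ref = weakref.ref(p)
p.think_about_math(2)
del p
gc.collect()  # collects the instance <-> cache reference cycle
gone = ref() is None
print(gone)  # True: the instance, and its cache, are gone
```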