Simplify lru_cache

Raymond, thanks a lot again for your help! It’s a pleasure to have someone with so much experience critiquing my suggestion.

Not updating a dictionary is cheaper than updating a dictionary. With the current implementation, we can have millions of consecutive cache hits and NEVER have to mutate the dictionary (no resizes, no reinsertions, no memory allocations, it is just cheap pointer updates).

I am familiar with that argument. But the benchmark disagrees even for the hits-only workload: it’s the same speed in both implementations, and my patch uses less memory.

I am a bit surprised at this result myself to be honest! The common case in my patch is simple enough, but every once in a while we have to pay for the compaction. I suspect the memory savings help the performance somehow. But I don’t want to make the mistake of trying to outguess the benchmark data with my half-formed ideas.

I’d be more than happy to run some other benchmarks than those I already did, if you have a suggestion.

The path we care about most is a cache hit because misses are dominated by the user function call.

Yes, that makes sense. I agree.

You probably want to look at the hits-only workload, and perhaps a workload that has 90% hits or so?

The ‘direct’ lru-cache has the same performance for the hits-only workload, and better performance for a workload with at least some misses.

Cache miss performance is less important than hits, but it is worth a look as well. The dict implementation is not optimized for finding the oldest item. Essentially, it runs d.pop(next(iter(d))), which gets quadratically more and more expensive until a resize occurs.

What you are describing was a real problem with @storchaka’s otherwise excellent, old ‘move-to-end’ patch.

But that’s not how my patch works. My patch finds and deletes the oldest entry in amortised O(1).

As a sanity check: this problem would mainly occur in a miss-heavy workload. And sure enough, all three of @storchaka’s old patches ‘cause significant slowdown (around 2.3x) in the case of many misses’ (compared to unpatched).

My code does better than main in the case of many misses.

Currently a dict can be tuned for normal workloads and not have to worry about lru_cache workloads which are extreme. In particular, the usable fraction needs to be set high for normal workloads, high for an lru cache miss workload, and low for an lru cache hit workload.

You are right in principle.

Let’s look at the cases of what we’d want:

Workload                                    Slack in indices   Slack in entries
Normal dict                                 high               high
‘main’ lru_cache, hits-only                 high               none
‘main’ lru_cache, at least some misses      high               high
‘direct’ lru_cache, hits-only               high               high
‘direct’ lru_cache, at least some misses    high               high

In practice, GROWTH_RATE determines the slack in indices, and a combination of GROWTH_RATE and USABLE_FRACTION determines the slack in entries.
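
For reference, here is roughly what those two knobs look like, rendered in Python (my paraphrase of the C macros in CPython’s Objects/dictobject.c; the exact definitions vary a little between versions, so treat this as a sketch):

def usable_fraction(index_table_size):
    # Roughly how many entries a dict with an index table of this size may
    # hold before it has to resize: about two thirds.
    return (index_table_size << 1) // 3

def growth_rate(used):
    # Rough sizing hint used when a resize is triggered in recent CPython
    # versions: about three times the number of live entries.
    return used * 3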

If you have the ability to predict perfectly whether your lru_cache workload will have any misses, you could carefully tune your usable fraction.

What this will do in the best case is: save you a bit of memory. And if you get your prediction wrong, your performance will be hosed with frequent resizes.

What you could do instead is: save that memory by ditching the linked list, and get the same performance as the linked list approach for a hit-only workload or better performance for the some-misses workload.

This seems much more robust to me. Having the ability to tune is good, not needing any special tuning is better. The same tuning would work equally well for all workloads.

Almost none of the implementation complexity is in the linked list updates. What makes this problem hard is thinking through reentrancy and multithreading concerns.

I agree.

A lot of human effort was expended in getting the current code correct. It would not serve users well to destabilize it. It took me several work days to go through Serhiy’s code with a fine-toothed comb and I’m unlikely to repeat that effort ever again.

Sorry, I’m missing some context here. Are you talking about his proposed patches, or the linked-list implementation?

My recommendation is to stop thinking about the lru_cache() implementation and instead focus on what the dictionary has to do to maintain order given a fixed size array of entries, a high usable fraction, and every cache hit causing an arbitrary key to be repositioned at the end, leaving a hole earlier in the array. Run some simulations and learn how the usable fraction affects the number of resizes which require a full table scan.

Great idea! I like it so much that it’s exactly what I already did to come up with my implementation in the first place. I’ve run the benchmarks, too.

Consider adopting an adversarial mindset with your PR. It will help you be objective, help identify the flaws, help understand the underlying algorithms, help perform more useful timings, and avoid the natural tendency to identify with the PR.

That’s good advice in general. I’ve done that already, but of course, there’s always more you can do. Do you have any more ideas for benchmarks to run?

I’d be especially interested in any flaws you can point out.

P.S. I also just ran a benchmark with 99% hits / 1% misses (one million repetitions in total, many different cache sizes up to 100,000, and with the caches warmed up before timing starts):

import functools as ft, math, random, time

def bench_mixed(cachesize, reps, hits):
    r = random.Random()
    r.seed(0)
    f = ft.lru_cache(maxsize=cachesize)(id)

    # fill cache at random:
    keys = list(range(cachesize))
    r.shuffle(keys)
    for k in keys:
        f(k)

    start = time.perf_counter()

    for c in range(reps):
        f(r.randrange(math.ceil(cachesize/hits)))
    end = time.perf_counter()
    return end - start

Basically the code above with hits=0.99.
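
Concretely, a single data point from that sweep would be gathered with something like:

elapsed = bench_mixed(cachesize=100_000, reps=1_000_000, hits=0.99)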

I had expected the performance of ‘main’ and ‘direct’ to be essentially identical to their hits-only case (where they both take essentially identical amounts of time). Interestingly enough, even this tiny proportion of misses is already enough for the ‘direct’ version to pull ahead:

Do you have any more ideas for benchmarks to run?

Focus on understanding how a dictionary degrades during lru cache operations.

For example, finding the oldest entry involves a linear search, so repeatedly deleting the oldest entry is quadratic:

% python3.11 -m timeit 'd=dict.fromkeys(range(1_000_000))' 'for i in range(100_000): d.pop(next(iter(d)))'
1 loop, best of 5: 1.59 sec per loop
% python3.11 -m timeit 'd=dict.fromkeys(range(1_000_000))' 'for i in range(200_000): d.pop(next(iter(d)))'
1 loop, best of 5: 6.34 sec per loop
% python3.11 -m timeit 'd=dict.fromkeys(range(1_000_000))' 'for i in range(300_000): d.pop(next(iter(d)))'
1 loop, best of 5: 14.2 sec per loop
% python3.11 -m timeit 'd=dict.fromkeys(range(1_000_000))' 'for i in range(400_000): d.pop(next(iter(d)))'
1 loop, best of 5: 25.9 sec per loop
% python3.11 -m timeit 'd=dict.fromkeys(range(1_000_000))' 'for i in range(500_000): d.pop(next(iter(d)))'
1 loop, best of 5: 42.1 sec per loop

To quote myself:

Yes, if you used dictiter_iternextkey for finding the oldest entry, the operation would indeed run in amortised linear time. (Or in quadratic time for a sequence of n delete-oldest operations.)

Which is exactly why my code does not do that. How else do you think I can beat what’s in ‘main’ for speed?

(If I could employ such a silly approach and still beat the code in ‘main’ for speed, something would be truly, truly weird. As you suggested earlier, the status quo is already pretty well optimized.)

Would you please just have a look at the code in question on github, instead of bringing up hypothetical flaws of a naive approach that the code doesn’t actually share? Thanks!

Here are some more benchmarks for you.

Cache hits:

$ python-from-main -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = [random.randrange(100_000) for _ in range(1_000_000)]' -- 'list(map(f, a))'
1 loop, best of 5: 267 msec per loop
$ python-with-direct-lru -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = [random.randrange(100_000) for _ in range(1_000_000)]' -- 'list(map(f, a))'
1 loop, best of 5: 244 msec per loop

Cache misses:

$ python-from-main -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = list(range(1_000_000))' -- 'list(map(f, a))'
2 loops, best of 5: 177 msec per loop
$ python-with-direct-lru -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = list(range(1_000_000))' -- 'list(map(f, a))'
2 loops, best of 5: 132 msec per loop

99% cache hits:

$ python-from-main -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = [random.randrange(101_000) for _ in range(1_000_000)]' -- 'list(map(f, a))'
1 loop, best of 5: 249 msec per loop
$ python-with-direct-lru -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = [random.randrange(101_000) for _ in range(1_000_000)]' -- 'list(map(f, a))'
1 loop, best of 5: 235 msec per loop

50% hits:

$ python-from-main -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = [random.randrange(200_000) for _ in range(1_000_000)]' -- 'list(map(f, a))'
1 loop, best of 5: 356 msec per loop
$ python-with-direct-lru -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = [random.randrange(200_000) for _ in range(1_000_000)]' -- 'list(map(f, a))'
1 loop, best of 5: 282 msec per loop

I also did some more involved benchmarks with quite a lot more data, and even some linear regressions. I already linked to them above. But I can explain them in more detail, if you are interested.

Pure Python

For those who are watching from the stands, here is a rough pure python equivalent:

class lru_cache:

    def __init__(self, func, maxsize=128):
        self.maxsize = maxsize
        self.func = func
        self.d = {}

    def __call__(self, *args):
        d = self.d
        v = d.pop(args) if args in d else self.func(*args)
        d[args] = v
        if len(self.d) > self.maxsize:
            oldest_key = next(iter(d))  # Sped-up with a search finger hack.
            d.pop(oldest_key) 
        return v            

@lru_cache            
def square(x):
    answer = x * x
    print(f'{x} --(square)--> {answer}')
    return answer
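
To make the behaviour concrete, here is what using that sketch looks like (my illustration, not part of the original snippet):

square(3)   # prints '3 --(square)--> 9', computes and caches the answer
square(3)   # cache hit: returns 9 without printing; the entry is moved to the end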

Deleting the oldest entry

The PR shows a willingness to cross lines that I won’t. To avoid a linear search for the oldest key, the code reinvents some of what the linked list was already doing by keeping a search finger, “already_deleted_entries”.

In doing so, it breaks encapsulation and directly accesses dict internals. Presumably, that is why no pure python equivalent was offered.

Maintaining order

The PR shifts the burden of maintaining order from the linked list to the dictionary. The linked list approach just updates a pair of links. The dict approach has to do a delete and an insert, and will trigger full-table-scan resizes, doing work that we didn’t have to do before.

Here’s a simulation showing the resizes as cache hits occur. To keep the example short, the effect is exaggerated with a high ratio of occupied entries (25) to maximum space in the entry list (30), but it does show that the frequency of resizes depends on how the dict is tuned (something that is not a concern in the current implementation) and that resizes are in fact triggered by cache hits (something which does not occur at all in the current implementation).

>>> keylist_simulation(steps=20, n=30, lru_maxsize=25, verbose=True)
*************************----- <- Start
*****-********************---- Move entry 6 to the end
*****-**-******************--- Move entry 9 to the end
*****-**-***************-***--
*****-**-******-********-****-
*-***-**-******-********-***** We're now full
*************************----- <- Compacted.
Steps = 5   Cumulative_work = 30
*******-******************----
*******-*****************-*---
*******-*******-*********-**--
*****-*-*******-*********-***-
*****-*-******--*********-****
*************************----- <- Compacted.
Steps = 10 Cumulative_work = 60
******-*******************----
*****--********************---
*****--*******-*************--
*****--*******-**************-
*****--*******-********-******
*************************----- <- Compacted.
Steps = 15 Cumulative_work = 90
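
For those who want to play along at home, here is a rough sketch of what such a simulation might look like (the keylist_simulation used above isn’t shown, so the names and bookkeeping details here are guesses):

import random

def render(slots):
    return ''.join('*' if s else '-' for s in slots)

def keylist_simulation(steps, n, lru_maxsize, verbose=False, seed=0):
    # Model the dict's append-only entry array: '*' is a live entry, '-' a hole
    # or unused slot.  A cache hit deletes an entry in place (leaving a hole)
    # and re-appends it at the end; once the array is exhausted, a compaction
    # rebuilds it, costing a full scan of the table.
    rng = random.Random(seed)
    slots = [True] * lru_maxsize + [False] * (n - lru_maxsize)
    end = lru_maxsize            # next free append position
    work = 0
    if verbose:
        print(render(slots), '<- Start')
    for step in range(1, steps + 1):
        victim = rng.choice([i for i in range(end) if slots[i]])
        slots[victim] = False    # knock a hole where the entry used to live
        if end == n:             # no room left to append: compact / resize
            live = sum(slots)
            slots = [True] * live + [False] * (n - live)
            end = live
            work += n            # full table scan
            if verbose:
                print(render(slots), '<- Compacted.')
        slots[end] = True        # re-append the entry at the end
        end += 1
        if verbose:
            print(render(slots), f'Move entry {victim} to the end')
        if verbose and step % 5 == 0:
            print(f'Steps = {step}   Cumulative_work = {work}')
    return work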

Reentrancy and concurrency

I haven’t reviewed this part yet. I know that even the pure python version had to worry about reentrancy. The PR is hand-wavy and says, “We don’t need to worry too much about reentrancy.” Hopefully that is true. It is a major headache.

Time available

Answering the OP has consumed all of my development time for a while. It is my impression (possibly false) that the OP will persist indefinitely and never accept that the PR makes the underlying dict do more work than it does now and that the dict implementation details and lru cache implementation details should remain decoupled.

It is also my impression (possibly false) that the current implementation is stable and highly performant. It has the wonderful property that cache hits have no side effects: nothing moves or rebuilds, a fixed number of links are exchanged and that is it, so the latency is constant and small.

As far as I can tell, there is no user problem being solved by the PR. The OP just prefers another implementation premised on the notion that link updates are expensive (they aren’t) and that repeated dict deletions and insertions are free (they aren’t).

Thank you for the very warm welcome, Raymond.

Thank you for your vote of confidence in my work ethic. I am not so optimistic. In fact I was getting a bit discouraged and ready to give up, but then I saw your code review at last and your encouraging remarks, and that gave me the confidence necessary to persevere. I won’t disappoint you!

User need

The least-recently used cache is a commonly used data structure in the Python standard library. Any caching scheme benefits from making efficient use of the available memory, so that users can either cache more entries in the same space or use less memory.

The proposal shaves 7 machine words (i.e. 56 bytes on a 64-bit machine) of memory overhead off each cache entry, all while improving runtime performance: slightly in the case of a perfect 100% hits-only workload, and noticeably once you get any misses at all.

(The benchmarks above show between 9% and 50% runtime overhead for ‘main’ compared to the proposal. I already posted more involved benchmarks and analysis further up.)

The figure of 7 machine words saved comes from counting the number of pointers saved in the source.
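
Roughly, the accounting goes like this (the field breakdown is my reading of the linked-list node in the C source, so treat it as an approximation):

WORD = 8  # bytes per machine word on a 64-bit build
node_words = (
    2    # PyObject_HEAD: reference count + type pointer
    + 2  # prev / next links of the doubly linked list
    + 1  # cached hash of the key
    + 2  # pointers to the key tuple and the cached result
)
print(node_words, "words =", node_words * WORD, "bytes per entry")  # 7 words = 56 bytes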

To confirm that theoretical figure, I also did some benchmarks of memory usage:

import gc, math, functools, random, tracemalloc
def bench1(cachesize, reps, hits=0.5):
    r = random.Random()
    r.seed(0)
    f = functools.lru_cache(maxsize=cachesize)(lambda x: None)

    keys = [r.randrange(math.ceil(cachesize/hits)) for _ in range(reps)]

    gc.collect()
    tracemalloc.start()
    first_size, first_peak = tracemalloc.get_traced_memory()

    for k in keys:
        f(k)
    
    second_size, second_peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()  # stop tracing so repeated runs start fresh
    return second_size - first_size

cachesize = random.randrange(100_000)
mem = bench1(cachesize=cachesize, reps=400_000)

The slopes of the linear regression show a growth of 93 bytes/entry for the ‘direct’ proposal and 144 bytes/entry for ‘main’. That’s about 50% more memory usage for ‘main’, or 51 extra bytes per entry. Given that the real graphs are step functions rather than strictly linear, that’s pretty decent agreement with our theoretical expectation of 56 bytes saved.

Pure Python

Thanks, Raymond, for producing an approximate pure Python equivalent. As a new contributor I wasn’t aware that this is good form. I will be better in the future.

In any case, here’s a more accurate rendering:

class lru_cache:
    def __init__(self, func, maxsize=128):
        self.maxsize = maxsize
        self.func = func
        self.d = {}
        self.finger = self.d.make_finger()  # new proposed function

    def __call__(self, *args):
        try:
            return self.d.get_and_move_to_end(args)  # new proposed function
        except KeyError:
            return self.d.setdefault(args, self.func(*args))
        finally:
            while len(self.d) > self.maxsize:
                self.d.delete_oldest(self.finger)  # new proposed function

Note how this proposal differs from Raymond’s portrayal of ‘pop-then-set’. This is important, because ‘pop-then-set’ is not only slower, it also calls out to user code more often (even if you use the ‘known-hash’ variants), and thus has more opportunities for reentrancy shenanigans.

Similarly, the proposal’s delete_oldest does not call out to user code.

Raymond, I took your feedback about breaking encapsulation to heart, and moved the data needed for the ‘finger’ into an opaque blob that dict provides, and that delete_oldest expects. That way lru_cache won’t have to know anything about dict internals. Thanks for that suggestion!

In C terms:

typedef struct {
    PyDictKeysObject *sentinel;
    Py_ssize_t skip_empty;
} PyDictFinger;

PyAPI_FUNC(PyDictFinger) _PyDict_NewFinger(void) {
    return (PyDictFinger) {.sentinel = NULL, .skip_empty = 0};
}

int
_PyDict_DelOldest(PyDictObject *mp, PyDictFinger *finger)
{
    PyObject *key, *value;
    Py_hash_t hash;

    if(finger->sentinel != mp->ma_keys) {
        finger->sentinel = mp->ma_keys;
        finger->skip_empty = 0;
    }

    if(!_PyDict_Next((PyObject *)mp, &finger->skip_empty, &key, &value, &hash)) {
        return -1;
    }
    return delitem_common(mp, hash, finger->skip_empty-1, value);
}

As a simpler alternative, we can also put something like Py_ssize_t skip_empty; directly into the underlying dict at a cost of one extra machine word per dict. Then you don’t need to pass the finger around.

But I am not sure about the trade-offs, and I don’t want to interfere with any of the regular dict operations in this narrow proposal.

Ideally, I would like to see operations like get_and_move_to_end and delete_oldest available in the public C and (especially!) Python API of dict. Not only would my approximate Python code above cease to be approximate and become working code, but, to be honest, investigating why these kinds of capabilities weren’t provided by regular dicts was my original motivation for looking into this topic at all.

I have some algorithmic work that would benefit from these. See also my post on, and implementation of, random choice from a dict in amortised O(1) time (the ‘Random choice from a dict’ thread).

Maintaining order

Yes, the proposal re-uses the order maintenance that regular dicts already provide. That is the whole point.

Yes, to make dicts (or dynamic arrays) work in amortised constant time, you need to leave a constant fraction of slack in your system. If you leave only a constant absolute amount, you get amortised linear runtime for operations like insert.

That applies to regular dicts and also applies to how we use dicts in the lru_cache, both in ‘main’ and in the ‘direct’ proposal.
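
Here is a tiny self-contained illustration of that slack argument (illustrative only, not taken from either implementation): growing a table by a constant factor keeps the total copying work linear in the number of inserts, while growing by a constant absolute amount makes it quadratic.

def total_copy_work(n_appends, grow):
    # Count how many entries get copied across all resizes while appending.
    cap = size = work = 0
    for _ in range(n_appends):
        if size == cap:
            cap = grow(cap)
            work += size        # existing entries are copied on a resize
        size += 1
    return work

print(total_copy_work(100_000, lambda cap: max(1, 2 * cap)))   # ~10**5: linear overall
print(total_copy_work(100_000, lambda cap: cap + 100))         # ~5 * 10**7: quadratic overall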

Yes, for the very special case of being able to predict with 100% accuracy that your workload is composed of 100% hits only, and if you are willing to use an extra 56 bytes per entry and suffer worse amortised runtime, then, yes, you can decrease the slack for the entries.

Do you think that saves much memory? You still have to pay for the 56 bytes extra per entry.

And if you guess even slightly wrong and you have a few misses every once in a while, both your worst case performance and amortised runtime performance per operation drops to O(n).

(Btw, if you can guess so perfectly, consider running the unlimited version of the cache, as that one gives you the performance and storage requirements of the raw dict, with no overhead for storing linked lists or fiddling with pointers. The cache size will stay bounded despite being notionally unlimited, because we are assuming a perfect 100% hits-only workload.)
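
With the existing API, that unbounded variant is simply:

from functools import lru_cache

@lru_cache(maxsize=None)  # unbounded: a plain dict lookup, no eviction bookkeeping
def expensive(x):
    return x * x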

The OP does not have any notions. The OP ran benchmarks.

Yes, the existing implementation is relatively fast. Doesn’t that make it even more amazing that the proposal manages to be even faster, even in the best case for the existing implementation (of 100% hits-only)? And saves memory on top of all that!

This is some impressive research. Awesome work.

Your discussion seems to have reached an impasse. May I suggest that you create an issue in the faster-cpython/ideas repository on GitHub to draw the attention of other core developers who work on faster Python? You also have the option to contact the steering council and ask them to mediate.

@tiran Awesome, thanks for the suggestion!

I’ll post to faster-cpython, because there are still some areas of my draft proposal that need feedback and work before it’s ready to go in front of any kind of council. I’m pretty much a rank novice when it comes to contributing to Python.

@matthiasgoergens It seems that in your latest version all you really need from python is to add a finger API to dictionaries. Then you can implement your alternative lru_cache in an external module. Am I right?

Perhaps we should discuss the pros and cons of such an API separately?

@iritkatriel Yes, ideally I would like to (carefully!) add to the dict API, so that an external module could do the same work. I don’t actually care if the standard library lru-cache is more bloated than necessary, as long as it’s possible to use a third-party library.

(I say ‘carefully’ because I don’t want to constrain any future evolution of dict any more than existing public API commitments already constrain it.)

See the ‘Dict API and leaner and faster lru-cache’ issue (faster-cpython/ideas #450 on GitHub) for some more thoughts on the issue.

Basically, dict would need to provide ‘get-and-move-back’ and ‘delete-oldest’. But you need to be really careful about exactly how you are providing this, if you want to get good performance.

@matthiasgoergens In that case, I wonder if it would make sense for you to make a PR with just the finger API addition (with whatever unit tests and doc changes it would need to get merged). It may help push the discussion along - it’s always easier to comment on an API change that you can see than on one you need to imagine.

The lru_cache idea can be a motivating use case for the new API.

(If this happens I imagine it would require a PEP as well, but I would suggest you get some more feedback on the idea to see what its chances are, before beginning to write one).

@iritkatriel Thanks for the suggestion!

Do you know whether it’s feasible to work on that PR as part of the broader ‘faster-cpython’ process or should I start a separate track of work? (I’d prefer the former, if that’s possible with the processes that are common for CPython.)

I’m not sure I understand what you mean by the faster-cpython process, but the faster-cpython issue tracker is mostly about discussions on ideas and experiments, as well as benchmark tooling.

When it comes time to commit code to cpython, you need to create an issue on the cpython repo to describe the change you’re making. So a PR would need to be associated with a cpython issue.

Sorry, I think I was a bit confused. Basically I thought I’d go through the discussion with people on ‘faster-cpython ideas’, and then once that’s progressed enough to be more concrete, I’ll raise the CPython github issue and open a PR?

You need a PR on the cpython repo in order to have the discussion. You can create a draft PR (there’s a button to convert an open PR to draft), and create the cpython issue only when you think the PR is ready to be reviewed for real.

It was Christian Heimes who suggested opening a ticket in the faster-cpython project to discuss this proposed change in another forum than this one. The faster-cpython team does have discussions on their tracker before opening cpython tickets and then PRs.

Actually, we decided that we don’t like the Discussions category any more. I recommend an issue in the cpython project. (It was proposed to start with a draft PR; I’m not sure where that suggested process comes from. IIUC the workflow typically goes from informal discussion in some mailing list or forum → issue to solidify the design → PR to solidify the implementation. It’s fine to have a draft PR ready if you think the design is more easily understood by looking at that. I believe it’s still frowned upon to start with a PR and then create an issue at the last minute to satisfy some bot’s requirement.)

What is the best place in the process to have an algorithmic discussion? Recency tracking has to be done somewhere, either through near-instant linked list updates (Serhiy’s code) or through repeated deletion/appending/resizing in the dict (the OP’s code). I built a simulation and posted an easy-to-understand diagram for the latter, and no one seems to have looked at it.

What is the best place in the process for a strategic discussion? To efficiently find the oldest entry, the dict implementation has to be modified in a way that amounts to a permanent API commitment, equivalent to OrderedDict’s popitem(False). In the past, you’ve shown a reluctance to move the OrderedDict API (and its algorithmic commitments) into regular dicts.

Lastly, as a reviewer, what am I supposed to do now? I’ve spent many hours going through this proposal and reading the OP’s code. Presumably I’m the reviewer best qualified to look at this, having originated the current dict design and having spent several days on a detailed review of Serhiy’s current lru_cache() implementation. My assessment is that the OP is going down the wrong path and will make us subtly worse off. Is there any place in our process for me to voice that conclusion and have it taken seriously? It feels as though my technical comments have been casually dismissed, that my review effort and expertise have been wasted, and that everyone is going to encourage the OP to continue down this path as if I hadn’t participated at all.

I don’t know. I know for sure I don’t have the time to learn the ins and outs of this algorithm (I’m only a pretty casual user of the LRU cache implementation and I know very little about modern dict internals). Most other core devs are presumably in the same position. Maybe the best experts here are Serhiy and Inada.

I don’t think this is a correct interpretation of what happened. I read and understood what you wrote.

I suggested that the OP try to focus on the API changes in dict, and get them evaluated on their own merit (are they worth having, are there any other benefits to having them, are there any problems with them)? I was clear that I am not at all sure that this proposal will be accepted.

I don’t think it’s our place to tell people when to stop pursuing their research ideas. You can tell the OP that you don’t think their idea has a chance (as you have done). It’s up to them what they want to do with that information. At the end of the day, if the idea doesn’t have legs then no PR will be merged.
