Simplify lru_cache

Raymond, thanks a lot again for your help! It’s a pleasure to have someone with so much experience critiquing my suggestion.

Not updating a dictionary is cheaper than updating a dictionary. With the current implementation, we can have millions of consecutive cache hits and NEVER have to mutate the dictionary (no resizes, no reinsertions, no memory allocations, it is just cheap pointer updates).

I am familiar with that argument. But the benchmark disagrees even for the hits-only workload: it’s the same speed in both implementations, and my patch uses less memory.

I am a bit surprised at this result myself to be honest! The common case in my patch is simple enough, but every once in a while we have to pay for the compaction. I suspect the memory savings help the performance somehow. But I don’t want to make the mistake of trying to outguess the benchmark data with my half-formed ideas.

I’d be more than happy to run some other benchmarks than those I already did, if you have a suggestion.

The path we care about most is a cache hit because misses are dominated by the user function call.

Yes, that makes sense. I agree.

You probably want to look at the hits-only workload, and perhaps a workload that has 90% hits or so?

The ‘direct’ lru-cache has the same performance for the hits-only workload, and better performance for a workload with at least some misses.

Cache miss performance is less important than hits, but it is worth a look as well. The dict implementation is not optimized for finding the oldest item. Essentially, it runs d.pop(next(iter(d))), which gets quadratically more and more expensive until a resize occurs.

What you are describing was a real problem with @storchaka’s otherwise excellent, old ‘move-to-end’ patch.

But that’s not how my patch works. My patch finds and deletes the oldest entry in amortised O(1).

As a sanity check: this problem would mainly occur in a miss-heavy workload. And sure enough, all three of @storchaka’s old patches ‘cause significant slowdown (around 2.3x) in the case of many misses’ (compared to unpatched).

My code does better than main in the case of many misses.

Currently a dict can be tuned for normal workloads and not have to worry about lru_cache workloads which are extreme. In particular, the usable fraction needs to be set high for normal workloads, high for an lru cache miss workload, and low for an lru cache hit workload.

You are right in principle.

Let’s look at the cases of what we’d want:

Workload                                    Slack in indices   Slack in entries
Normal dict                                 high               high
‘main’ lru_cache, hits-only                 high               none
‘main’ lru_cache, at least some misses      high               high
‘direct’ lru_cache, hits-only               high               high
‘direct’ lru_cache, at least some misses    high               high

In practice, GROWTH_RATE determines the slack in indices, and a combination of GROWTH_RATE and USABLE_FRACTION determines the slack in entries.
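
For reference, here is roughly what those two knobs look like, rendered in Python (my paraphrase of the C macros in CPython’s Objects/dictobject.c; the exact definitions vary a little between versions, so treat this as a sketch):

def usable_fraction(index_table_size):
    # Roughly how many entries a dict with an index table of this size may
    # hold before it has to resize: about two thirds.
    return (index_table_size << 1) // 3

def growth_rate(used):
    # Rough sizing hint used when a resize is triggered in recent CPython
    # versions: about three times the number of live entries.
    return used * 3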

If you have the ability to predict perfectly whether your lru_cache workload will have any misses, you could carefully tune your usable fraction.

What this will do in the best case is: save you a bit of memory. And if you get your prediction wrong, your performance will be hosed with frequent resizes.

What you could do instead is: save that memory by ditching the linked list, and get the same performance as the linked list approach for a hit-only workload or better performance for the some-misses workload.

This seems much more robust to me. Having the ability to tune is good, not needing any special tuning is better. The same tuning would work equally well for all workloads.

Almost none of the implementation complexity is in the linked list updates. What makes this problem hard is thinking through reentrancy and multithreading concerns.

I agree.

A lot of human effort was expended in getting the current code correct. It would not serve users well to destabilize it. It took me several work days to go through Serhiy’s code with a fine-toothed comb and I’m unlikely to repeat that effort ever again.

Sorry, I’m missing some context here. Are you talking about his proposed patches, or the linked-list implementation?

My recommendation is to stop thinking about the lru_cache() implementation and instead focus on what the dictionary has to do to maintain order given a fixed size array of entries, a high usable fraction, and every cache hit causing an arbitrary key to be repositioned at the end, leaving a hole earlier in the array. Run some simulations and learn how the usable fraction affects the number of resizes which require a full table scan.

Great idea! I like it so much that it’s exactly what I already did to come up with my implementation in the first place. I’ve run the benchmarks, too.

Consider adopting an adversarial mindset with your PR. It will help you be objective, help identify the flaws, help understand the underlying algorithms, help perform more useful timings, and avoid the natural tendency to identify with the PR.

That’s good advice in general. I’ve done that already, but of course, there’s always more you can do. Do you have any more ideas for benchmarks to run?

I’d be especially interested in any flaws you can point out.

P.S. I also just ran a benchmark with 99% hits / 1% misses (one million repetitions in total, many different cache sizes up to 100,000, and with the caches warmed up before timing starts):

import functools as ft, math, random, time

def bench_mixed(cachesize, reps, hits):
    r = random.Random()
    r.seed(0)
    f = ft.lru_cache(maxsize=cachesize)(id)

    # fill cache at random:
    keys = list(range(cachesize))
    r.shuffle(keys)
    for k in keys:
        f(k)

    start = time.perf_counter()

    for c in range(reps):
        f(r.randrange(math.ceil(cachesize/hits)))
    end = time.perf_counter()
    return end - start

Basically the code above with hits=0.99.
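
Concretely, a single data point from that sweep would be gathered with something like:

elapsed = bench_mixed(cachesize=100_000, reps=1_000_000, hits=0.99)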

I had expected the performance of ‘main’ and ‘direct’ to be essentially identical to their hits-only case (where they both take essentially identical amounts of time). Interestingly enough, even this tiny proportion of misses is already enough for the ‘direct’ version to pull ahead:

Do you have any more ideas for benchmarks to run?

Focus on understanding how a dictionary degrades during lru cache operations.

For example, finding the oldest entry involves a linear search, so repeatedly deleting the oldest entry is quadratic:

% python3.11 -m timeit 'd=dict.fromkeys(range(1_000_000))' 'for i in range(100_000): d.pop(next(iter(d)))'
1 loop, best of 5: 1.59 sec per loop
% python3.11 -m timeit 'd=dict.fromkeys(range(1_000_000))' 'for i in range(200_000): d.pop(next(iter(d)))'
1 loop, best of 5: 6.34 sec per loop
% python3.11 -m timeit 'd=dict.fromkeys(range(1_000_000))' 'for i in range(300_000): d.pop(next(iter(d)))'
1 loop, best of 5: 14.2 sec per loop
% python3.11 -m timeit 'd=dict.fromkeys(range(1_000_000))' 'for i in range(400_000): d.pop(next(iter(d)))'
1 loop, best of 5: 25.9 sec per loop
% python3.11 -m timeit 'd=dict.fromkeys(range(1_000_000))' 'for i in range(500_000): d.pop(next(iter(d)))'
1 loop, best of 5: 42.1 sec per loop

To quote myself:

Yes, if you used dictiter_iternextkey for finding the oldest entry, the operation would indeed run in amortised linear time. (Or in quadratic time for a sequence of n delete-oldest operations.)

Which is exactly why my code does not do that. How else do you think I can beat what’s in ‘main’ for speed?

(If I could employ such a silly approach and still beat the code in ‘main’ for speed, something would be truly, truly weird. As you suggested earlier, the status quo is already pretty well optimized.)

Would you please just have a look at the code in question on github, instead of bringing up hypothetical flaws of a naive approach that the code doesn’t actually share? Thanks!

Here are some more benchmarks for you.

Cache hits:

$ python-from-main -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = [random.randrange(100_000) for _ in range(1_000_000)]' -- 'list(map(f, a))'
1 loop, best of 5: 267 msec per loop
$ python-with-direct-lru -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = [random.randrange(100_000) for _ in range(1_000_000)]' -- 'list(map(f, a))'
1 loop, best of 5: 244 msec per loop

Cache misses:

$ python-from-main -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = list(range(1_000_000))' -- 'list(map(f, a))'
2 loops, best of 5: 177 msec per loop
$ python-with-direct-lru -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = list(range(1_000_000))' -- 'list(map(f, a))'
2 loops, best of 5: 132 msec per loop

99% cache hits:

$ python-from-main -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = [random.randrange(101_000) for _ in range(1_000_000)]' -- 'list(map(f, a))'
1 loop, best of 5: 249 msec per loop
$ python-with-direct-lru -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = [random.randrange(101_000) for _ in range(1_000_000)]' -- 'list(map(f, a))'
1 loop, best of 5: 235 msec per loop

50% hits:

$ python-from-main -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = [random.randrange(200_000) for _ in range(1_000_000)]' -- 'list(map(f, a))'
1 loop, best of 5: 356 msec per loop
$ python-with-direct-lru -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = [random.randrange(200_000) for _ in range(1_000_000)]' -- 'list(map(f, a))'
1 loop, best of 5: 282 msec per loop

I also did some more involved benchmarks with quite a lot more data, and even some linear regressions. I already linked to them above. But I can explain them in more detail, if you are interested.

Pure Python

For those who are watching from the stands, here is a rough pure python equivalent:

class lru_cache:

    def __init__(self, func, maxsize=128):
        self.maxsize = maxsize
        self.func = func
        self.d = {}

    def __call__(self, *args):
        d = self.d
        v = d.pop(args) if args in d else self.func(*args)
        d[args] = v
        if len(self.d) > self.maxsize:
            oldest_key = next(iter(d))  # Sped-up with a search finger hack.
            d.pop(oldest_key) 
        return v            

@lru_cache            
def square(x):
    answer = x * x
    print(f'{x} --(square)--> {answer}')
    return answer
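
To make the behaviour concrete, here is what using that sketch looks like (my illustration, not part of the original snippet):

square(3)   # prints '3 --(square)--> 9', computes and caches the answer
square(3)   # cache hit: returns 9 without printing; the entry is moved to the end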

Deleting the oldest entry

The PR shows a willingness to cross lines that I won’t. To avoid a linear search for the oldest key, the code reinvents some of what the linked list was already doing by keeping a search finger, “already_deleted_entries”.

In doing so, it breaks encapsulation and directly accesses dict internals. Presumably, that is why no pure python equivalent was offered.

Maintaining order

The PR shifts the burden of maintaining order from the linked list to the dictionary. The linked list approach just updates a pair of links. The dict approach has to do a delete and an insert, and will trigger full-table-scan resizes, doing work that we didn’t have to do before.

Here’s a simulation showing the resizes as cache hits occur. To keep the example short, the effect is exaggerated with a high ratio of occupied entries (25) to maximum space in the entry list (30), but it does show that the frequency of resizes depends on how the dict is tuned (something that is not a concern in the current implementation) and that resizes are in fact triggered by cache hits (something which does not occur at all in the current implementation).

>>> keylist_simulation(steps=20, n=30, lru_maxsize=25, verbose=True)
*************************----- <- Start
*****-********************---- Move entry 6 to the end
*****-**-******************--- Move entry 9 to the end
*****-**-***************-***--
*****-**-******-********-****-
*-***-**-******-********-***** We're now full
*************************----- <- Compacted.
Steps = 5   Cumulative_work = 30
*******-******************----
*******-*****************-*---
*******-*******-*********-**--
*****-*-*******-*********-***-
*****-*-******--*********-****
*************************----- <- Compacted.
Steps = 10 Cumulative_work = 60
******-*******************----
*****--********************---
*****--*******-*************--
*****--*******-**************-
*****--*******-********-******
*************************----- <- Compacted.
Steps = 15 Cumulative_work = 90
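
For those who want to play along at home, here is a rough sketch of what such a simulation might look like (the keylist_simulation used above isn’t shown, so the names and bookkeeping details here are guesses):

import random

def render(slots):
    return ''.join('*' if s else '-' for s in slots)

def keylist_simulation(steps, n, lru_maxsize, verbose=False, seed=0):
    # Model the dict's append-only entry array: '*' is a live entry, '-' a hole
    # or unused slot.  A cache hit deletes an entry in place (leaving a hole)
    # and re-appends it at the end; once the array is exhausted, a compaction
    # rebuilds it, costing a full scan of the table.
    rng = random.Random(seed)
    slots = [True] * lru_maxsize + [False] * (n - lru_maxsize)
    end = lru_maxsize            # next free append position
    work = 0
    if verbose:
        print(render(slots), '<- Start')
    for step in range(1, steps + 1):
        victim = rng.choice([i for i in range(end) if slots[i]])
        slots[victim] = False    # knock a hole where the entry used to live
        if end == n:             # no room left to append: compact / resize
            live = sum(slots)
            slots = [True] * live + [False] * (n - live)
            end = live
            work += n            # full table scan
            if verbose:
                print(render(slots), '<- Compacted.')
        slots[end] = True        # re-append the entry at the end
        end += 1
        if verbose:
            print(render(slots), f'Move entry {victim} to the end')
        if verbose and step % 5 == 0:
            print(f'Steps = {step}   Cumulative_work = {work}')
    return work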

Reentrancy and concurrency

I haven’t reviewed this part yet. I know that even the pure python version had to worry about reentrancy. The PR is hand-wavy and says, “We don’t need to worry too much about reentrancy.” Hopefully that is true. It is a major headache.

Time available

Answering the OP has consumed all of my development time for a while. It is my impression (possibly false) that the OP will persist indefinitely and never accept that the PR makes the underlying dict do more work than it does now and that the dict implementation details and lru cache implementation details should remain decoupled.

It is also my impression (possibly false) that the current implementation is stable and highly performant. It has the wonderful property that cache hits have no side effects: nothing moves or rebuilds, a fixed number of links are exchanged and that is it, so the latency is constant and small.

As far as I can tell, there is no user problem being solved by the PR. The OP just prefers another implementation premised on the notion that link updates are expensive (they aren’t) and that repeated dict deletions and insertions are free (they aren’t).

Thank you for the very warm welcome, Raymond.

Thank you for your vote of confidence in my work ethic. I am not so optimistic. In fact I was getting a bit discouraged and ready to give up, but then I saw your code review at last and your encouraging remarks, and that gave me the confidence necessary to persevere. I won’t disappoint you!

User need

The least-recently used cache is a commonly used data structure in the Python standard library. Any caching scheme benefits from making efficient use of the available memory, so that users can either cache more entries in the same space or use less memory.

The proposal shaves 7 machine words (i.e. 56 bytes on a 64-bit machine) of memory overhead off each cache entry, all while improving runtime performance: slightly in the case of a perfect 100% hits-only workload, and noticeably once you get any misses at all.

(The benchmarks above show between 9% and 50% runtime overhead for ‘main’ compared to the proposal. I already posted more involved benchmarks and analysis further up.)

The figure of 7 machine words saved comes from counting the number of pointers saved in the source.
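
Roughly, the accounting goes like this (the field breakdown is my reading of the linked-list node in the C source, so treat it as an approximation):

WORD = 8  # bytes per machine word on a 64-bit build
node_words = (
    2    # PyObject_HEAD: reference count + type pointer
    + 2  # prev / next links of the doubly linked list
    + 1  # cached hash of the key
    + 2  # pointers to the key tuple and the cached result
)
print(node_words, "words =", node_words * WORD, "bytes per entry")  # 7 words = 56 bytes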

To confirm that theoretical figure, I also did some benchmarks of memory usage:

import gc, math, functools, random, tracemalloc
def bench1(cachesize, reps, hits=0.5):
    r = random.Random()
    r.seed(0)
    f = functools.lru_cache(maxsize=cachesize)(lambda x: None)

    keys = [r.randrange(math.ceil(cachesize/hits)) for _ in range(reps)]

    gc.collect()
    tracemalloc.start()
    first_size, first_peak = tracemalloc.get_traced_memory()

    for k in keys:
        f(k)
    
    second_size, second_peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()  # stop tracing so repeated runs start fresh
    return second_size - first_size

cachesize = random.randrange(100_000)
mem = bench1(cachesize=cachesize, reps=400_000)

The slopes of the linear regression show a growth of 93 bytes/entry for the ‘direct’ proposal and 144 bytes/entry for ‘main’. That’s about 50% more memory usage for ‘main’, or 51 extra bytes per entry. Given that the real graphs are step functions rather than strictly linear, that’s pretty decent agreement with our theoretical expectation of 56 bytes saved.

Pure Python

Thanks, Raymond, for producing an approximate pure Python equivalent. As a new contributor I wasn’t aware that this is good form. I will be better in the future.

In any case, here’s a more accurate rendering:

class lru_cache:
    def __init__(self, func, maxsize=128):
        self.maxsize = maxsize
        self.func = func
        self.d = {}
        self.finger = self.d.make_finger()  # new proposed function

    def __call__(self, *args):
        try:
            return self.d.get_and_move_to_end(args)  # new proposed function
        except KeyError:
            return self.d.setdefault(args, self.func(*args))
        finally:
            while len(self.d) > self.maxsize:
                self.d.delete_oldest(self.finger)  # new proposed function

Note how this proposal differs from Raymond’s portrayal of ‘pop-then-set’. This is important, because ‘pop-then-set’ is not only slower, it also calls out to user code more often (even if you use the ‘known-hash’ variants), and thus has more opportunities for reentrancy shenanigans.

Similarly, the proposal’s delete_oldest does not call out to user code.

Raymond, I took your feedback about breaking encapsulation to heart, and moved the data needed for the ‘finger’ into an opaque blob that dict provides, and that delete_oldest expects. That way lru_cache won’t have to know anything about dict internals. Thanks for that suggestion!

In C terms:

typedef struct {
    PyDictKeysObject *sentinel;
    Py_ssize_t skip_empty;
} PyDictFinger;

PyAPI_FUNC(PyDictFinger) _PyDict_NewFinger(void) {
    return (PyDictFinger) {.sentinel = NULL, .skip_empty = 0};
}

int
_PyDict_DelOldest(PyDictObject *mp, PyDictFinger *finger)
{
    PyObject *key, *value;
    Py_hash_t hash;

    if(finger->sentinel != mp->ma_keys) {
        finger->sentinel = mp->ma_keys;
        finger->skip_empty = 0;
    }

    if(!_PyDict_Next((PyObject *)mp, &finger->skip_empty, &key, &value, &hash)) {
        return -1;
    }
    return delitem_common(mp, hash, finger->skip_empty-1, value);
}

As a simpler alternative, we can also put something like Py_ssize_t skip_empty; directly into the underlying dict at a cost of one extra machine word per dict. Then you don’t need to pass the finger around.

But I am not sure about the trade-offs, and I don’t want to interfere with any of the regular dict operations in this narrow proposal.

Ideally, I would like to see operations like get_and_move_to_end and delete_oldest available in the public C and (especially!) Python API of dict. Not only would my approximate Python code above cease to be approximate and become working code, but, to be honest, investigating why these kinds of capabilities weren’t provided by regular dicts was my original motivation for looking into this topic at all.

I have some algorithmic work that would benefit from these. See also my post on, and implementation of, random choice from a dict in amortised O(1) time (the ‘Random choice from a dict’ thread).

Maintaining order

Yes, the proposal re-uses the order maintenance that regular dicts already provide. That is the whole point.

Yes, to make dicts (or dynamic arrays) work in amortised constant time, you need to leave a constant fraction of slack in your system. If you leave only a constant absolute amount, you get amortised linear runtime for operations like insert.

That applies to regular dicts and also applies to how we use dicts in the lru_cache, both in ‘main’ and in the ‘direct’ proposal.
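
Here is a tiny self-contained illustration of that slack argument (illustrative only, not taken from either implementation): growing a table by a constant factor keeps the total copying work linear in the number of inserts, while growing by a constant absolute amount makes it quadratic.

def total_copy_work(n_appends, grow):
    # Count how many entries get copied across all resizes while appending.
    cap = size = work = 0
    for _ in range(n_appends):
        if size == cap:
            cap = grow(cap)
            work += size        # existing entries are copied on a resize
        size += 1
    return work

print(total_copy_work(100_000, lambda cap: max(1, 2 * cap)))   # ~10**5: linear overall
print(total_copy_work(100_000, lambda cap: cap + 100))         # ~5 * 10**7: quadratic overall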

Yes, for the very special case of being able to predict with 100% accuracy that your workload is composed of 100% hits only, and if you are willing to use an extra 56 bytes per entry and suffer worse amortised runtime, then, yes, you can decrease the slack for the entries.

Do you think that saves much memory? You still have to pay for the 56 bytes extra per entry.

And if you guess even slightly wrong and you have a few misses every once in a while, both your worst case performance and amortised runtime performance per operation drops to O(n).

(Btw, if you can guess so perfectly, consider running the unlimited version of the cache, as that one gives you the performance and storage requirements of the raw dict, with no overhead for storing linked lists or fiddling with pointers. The cache size will stay bounded despite being notionally unlimited, because we are assuming a perfect 100% hits-only workload.)
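
With the existing API, that unbounded variant is simply:

from functools import lru_cache

@lru_cache(maxsize=None)  # unbounded: a plain dict lookup, no eviction bookkeeping
def expensive(x):
    return x * x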

The OP does not have any notions. The OP ran benchmarks.

Yes, the existing implementation is relatively fast. Doesn’t that make it even more amazing that the proposal manages to be even faster, even in the best case for the existing implementation (of 100% hits-only)? And saves memory on top of all that!

This is some impressive research. Awesome work.

Your discussion seems to have reached an impasse. May I suggest that you create an issue in the faster-cpython/ideas repository on GitHub to draw the attention of other core developers who work on faster Python? You also have the option to contact the steering council and ask them to mediate.

@tiran Awesome, thanks for the suggestion!

I’ll post to faster-cpython, because there are still some areas of my draft proposal that need feedback and work before it’s ready to go in front of any kind of council. I’m pretty much a rank novice when it comes to contributing to Python.

@matthiasgoergens It seems that in your latest version all you really need from python is to add a finger API to dictionaries. Then you can implement your alternative lru_cache in an external module. Am I right?

Perhaps we should discuss the pros and cons of such an API separately?

@iritkatriel Yes, ideally I would like to (carefully!) add to the dict API, so that an external module could do the same work. I don’t actually care if the standard library lru-cache is more bloated than necessary, as long as it’s possible to use a third-party library.

(I say ‘carefully’ because I don’t want to constrain any future evolution of dict any more than existing public API commitments already constrain it.)

See the ‘Dict API and leaner and faster lru-cache’ issue (faster-cpython/ideas #450 on GitHub) for some more thoughts on the issue.

Basically, dict would need to provide ‘get-and-move-back’ and ‘delete-oldest’. But you need to be really careful about exactly how you are providing this, if you want to get good performance.

@matthiasgoergens In that case, I wonder if it would make sense for you to make a PR with just the finger API addition (with whatever unit tests and doc changes it would need to get merged). It may help push the discussion along - it’s always easier to comment on an API change that you can see than on one you need to imagine.

The lru_cache idea can be a motivating use case for the new API.

(If this happens I imagine it would require a PEP as well, but I would suggest you get some more feedback on the idea to see what its chances are, before beginning to write one).

@iritkatriel Thanks for the suggestion!

Do you know whether it’s feasible to work on that PR as part of the broader ‘faster-cpython’ process or should I start a separate track of work? (I’d prefer the former, if that’s possible with the processes that are common for CPython.)

I’m not sure I understand what you mean by the faster-cpython process, but the faster-cpython issue tracker is mostly about discussions on ideas and experiments, as well as benchmark tooling.

When it comes time to commit code to cpython, you need to create an issue on the cpython repo to describe the change you’re making. So a PR would need to be associated with a cpython issue.

Sorry, I think I was a bit confused. Basically I thought I’d go through the discussion with people on ‘faster-cpython ideas’, and then once that’s progressed enough to be more concrete, I’ll raise the CPython github issue and open a PR?

You need a PR on the cpython repo in order to have the discussion. You can create a draft PR (there’s a button to convert an open PR to draft), and create the cpython issue only when you think the PR is ready to be reviewed for real.

It was Christian Heimes who suggested opening a ticket in the faster-cpython project to discuss this proposed change in another forum than this one. The faster-cpython team does have discussions on their tracker before opening cpython tickets and then PRs.

Actually, we decided that we don’t like the Discussions category any more. I recommend an issue in the cpython project. (It was proposed to start with a draft PR; I’m not sure where that suggested process comes from. IIUC the workflow typically goes from informal discussion in some mailing list or forum → issue to solidify the design → PR to solidify the implementation. It’s fine to have a draft PR ready if you think the design is more easily understood by looking at that. I believe it’s still frowned upon to start with a PR and then create an issue at the last minute to satisfy some bot’s requirement.)

What is the best place in the process to have an algorithmic discussion? Recency tracking has to be done somewhere, either through near-instant linked list updates (Serhiy’s code) or through repeated deletion/appending/resizing in the dict (the OP’s code). I built a simulation and posted an easy-to-understand diagram for the latter, and no one seems to have looked at it.

What is the best place in the process for a strategic discussion? To efficiently find the oldest entry, the dict implementation has to be modified in a way that amounts to a permanent API commitment, equivalent to OrderedDict’s popitem(False). In the past, you’ve shown a reluctance to move the OrderedDict API (and its algorithmic commitments) into regular dicts.

Lastly, as a reviewer, what am I supposed to do now? I’ve spent many hours going through this proposal and reading the OP’s code. Presumably I’m the reviewer best qualified to look at this, having originated the current dict design and having spent several days on a detailed review of Serhiy’s current lru_cache() implementation. My assessment is that the OP is going down the wrong path and will make us subtly worse off. Is there any place in our process for me to voice that conclusion and have it taken seriously? It feels as though my technical comments have been casually dismissed, that my review effort and expertise have been wasted, and that everyone is going to encourage the OP to continue down this path as if I hadn’t participated at all.

I don’t know. I know for sure I don’t have the time to learn the ins and outs of this algorithm (I’m only a pretty casual user of the LRU cache implementation and I know very little about modern dict internals). Most other core devs are presumably in the same position. Maybe the best experts here are Serhiy and Inada.

I don’t think this is a correct interpretation of what happened. I read and understood what you wrote.

I suggested that the OP try to focus on the API changes in dict, and get them evaluated on their own merit (are they worth having, are there any other benefits to having them, are there any problems with them)? I was clear that I am not at all sure that this proposal will be accepted.

I don’t think it’s our place to tell people when to stop pursuing their research ideas. You can tell the OP that you don’t think their idea has a chance (as you have done). It’s up to them what they want to do with that information. At the end of the day, if the idea doesn’t have legs then no PR will be merged.
