Simplify lru_cache

matthiasgoergens · August 13, 2022, 2:42pm

The design of functools.lru_cache predates the switch to insert-order dicts.

Hence lru_cache uses some rather awkward doubly-linked lists and has to be really careful about multi-threadind and reentrancy.

I have a prototype of a re-implementation of lru_cache that removes all this complexity in favour of piggy backing on insertion order of dicts.

storchaka · August 13, 2022, 3:18pm

I wrote three different implementations of this idea, but it was rejected.

github.com/python/cpython

Implement functools.lru_cache() using ordered dict

opened 01:52PM - 21 Sep 16 UTC

closed 06:34AM - 22 Sep 16 UTC

serhiy-storchaka

type-feature extension-modules 3.7

BPO | [28239](https://bugs.python.org/issue28239) --- | :--- Nosy | @rhettinger,… @vstinner, @ezio-melotti, @methane, @ericsnowcurrently, @serhiy-storchaka Files | <li>[lru_cache-get-del-set.patch](https://bugs.python.org/file44777/lru_cache-get-del-set.patch "Uploaded as text/plain at 2016-09-21.13:52:10 by @serhiy-storchaka")</li><li>[lru_cache-pop-set.patch](https://bugs.python.org/file44778/lru_cache-pop-set.patch "Uploaded as text/plain at 2016-09-21.13:52:27 by @serhiy-storchaka")</li><li>[lru_cache-move_to_end.patch](https://bugs.python.org/file44779/lru_cache-move_to_end.patch "Uploaded as text/plain at 2016-09-21.13:52:42 by @serhiy-storchaka")</li><li>[lru_cache-get-del-set-2.patch](https://bugs.python.org/file47350/lru_cache-get-del-set-2.patch "Uploaded as text/plain at 2017-12-24.20:37:53 by @serhiy-storchaka")</li><li>[lru_cache-pop-set-2.patch](https://bugs.python.org/file47351/lru_cache-pop-set-2.patch "Uploaded as text/plain at 2017-12-24.20:38:17 by @serhiy-storchaka")</li><li>[lru_cache-move_to_end-2.patch](https://bugs.python.org/file47352/lru_cache-move_to_end-2.patch "Uploaded as text/plain at 2017-12-24.20:38:41 by @serhiy-storchaka")</li> <sup>*Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.*</sup> <details><summary>Show more details</summary><p> GitHub fields: ```python assignee = 'https://github.com/rhettinger' closed_at = <Date 2016-09-22.06:34:35.418> created_at = <Date 2016-09-21.13:52:11.914> labels = ['extension-modules', 'type-feature', '3.7'] title = 'Implement functools.lru_cache() using ordered dict' updated_at = <Date 2017-12-24.20:47:45.528> user = 'https://github.com/serhiy-storchaka' ``` bugs.python.org fields: ```python activity = <Date 2017-12-24.20:47:45.528> actor = 'serhiy.storchaka' assignee = 'rhettinger' closed = True closed_date = <Date 2016-09-22.06:34:35.418> closer = 'serhiy.storchaka' components = ['Extension Modules'] creation = <Date 2016-09-21.13:52:11.914> creator = 'serhiy.storchaka' dependencies = [] files = ['44777', '44778', '44779', '47350', '47351', '47352'] hgrepos = [] issue_num = 28239 keywords = ['patch'] message_count = 6.0 messages = ['277141', '277152', '277169', '277195', '277205', '309017'] nosy_count = 6.0 nosy_names = ['rhettinger', 'vstinner', 'ezio.melotti', 'methane', 'eric.snow', 'serhiy.storchaka'] pr_nums = [] priority = 'low' resolution = 'rejected' stage = 'resolved' status = 'closed' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue28239' versions = ['Python 3.7'] ``` </p></details>

matthiasgoergens · August 13, 2022, 3:29pm

Thanks! I’ll have a look. I guess I’ll learn something at least.

Given that yours was closed because compact, ordered dicts were new at the time, we might have more luck 6 years later.

P.S. I just ran your benchmarks, and I get basically the same time with my patched version vs the status quo of what’s in main. Both for hits and misses. I don’t see any of the slowdowns that all three of your versions saw.

(And both are quite a bit faster than Python 3.10.6 that comes with my Linux distribution.)

I’ll post some benchmarks later.

P.P.S. I had a look at your code, it looks like your logic to delete the oldest entry runs in O(number of already deleted entries since last resize). My code takes care to delete the oldest entry in O(1) amortised time. I wonder if that alone explains the slowdown you’ve seen.

storchaka · August 14, 2022, 7:43am

Yes, my second benchmark can be optimized if keep a track of the minimal used index, but it is because old items are removed in the same order as they were added, and they are packed compact from the first to the last. In more realistic examples there will be many moves to the end which will leave holes. In worst case it can look something like:

A.....BCDEF.

So you need to the worst case for your implementation and show how much it worse.

@rhettinger’s concern was that it does not guarantee O(1) time, only O(1) amortized time.

matthiasgoergens · August 14, 2022, 3:13pm

The holes in the middle exist, yes, but they are not a problem for amortised performance. Basically, keeping track of the minimal used index is there so we only ever advance past each index once, including each ‘hole’. Instead of over and over again. It’s not a clever optimization meant to avoid paying for the holes. It just makes sure we only pay once. The order of removal doesn’t matter. (I’ll post some benchmarks tomorrow. I’ve tried the scenario you described. It performs almost exactly the same as the existing implementation. Perhaps slightly faster, but the difference is so small that I’d need to do actual statistics instead of eyeballing the results.)

Keeping track of the minimal used index also doesn’t do anything for the worst case: the worst case still has to compact the whole dict. That was by design.

Btw, the code path we are talking about here only ever gets taken in case of cache misses. Cache misses already only have amortised runtime guarantees in the existing version. No difference there.

(The difference is when you have nothing but cache hits.)

I’m not sure whether Raymond categorically objected against making amortized runtime guarantees only, or whether that was just the icing on the cake in addition to being slower and ordering being fairly new at the time. Perhaps we can get him to weigh in himself?

As I wrote in the other comment, I could cook up a version of dicts that has guaranteed O(1) worst case time for the usual dict operations, but that version will probably have slightly worse constant factors in the amortised setting than the current implementation that just compacts things all at once every so often.

rhettinger · August 15, 2022, 8:55pm

Hence lru_cache uses some rather awkward doubly-linked lists

It isn’t awkward at all. It’s a rather straightforward idea based on the notion that doubly-linked lists are ideal for this task. Every cache hit extracts a link and reinserts it at the head.

If a rewrite seems simpler, it is only because the task of maintaining order has been shifted to another data structure. Underneath the hood, the regular dict maintains order by storing data a sequence of consecutive pointers. Deletions are handled by leaving holes in the sequence. New entries are appended. This works fine for day to day dictionary use, but is not well suited to an lru_cache() where sequence rearrangements dominate.

When the cache is large, full, and has a high hit rate, essentially all of the work is rearranging the sequence and while leaving the dict unchanged.

and has to be really careful about multi-threading and reentrancy.

No matter how lru_cache() is implemented it has to be “really careful about multi-threading and reentrancy”. Every dict lookup or store is potentially reentrant. It is a non-trivial task to keep the cache in a consistent state between each of those calls. See the pure python implementation of lru_cache() — it uses a slow reentrant lock and still has to take a great deal of care.

I have a prototype of a re-implementation of lru_cache

I really wish people would discuss these ideas at a conceptual level before embarking on a large rewrite. It puts us all in a difficult position when someone shows up saying, look at all this work I’ve done, especially when the idea has been considered and rejected previously.

At least in theory, the doubly-linked list approach should always beat a maintain a sequence of pointers approach. That is why production quality ordered dicts and LRU caches are almost always implemented with a dict pointing to links in a doubly linked list.

Another thought is that I don’t what problem you’re trying to solve. Serhiy’s amazing implementation of the lru_cache() has astonishingly good performance. With the current implementation, it is easy to see that there are no unfavorable workload patterns — updating order for a cache hit takes constant (very fast) time regardless of the users pattern of access.

AFAICT the only benefit of a rewrite is to shift some of the complexity back into the regular dict, but in doing so we lose the ability to choose the best underlying store and we constrain future development for regular dicts (in particular, we can’t shrink the resize point for the underlying sequence).

Raymond

P.S. Here’s quick primer on tuning the regular dict for lru_cache style workloads:

Unhappy case: 1000 entries are stored in an array and we compact/resize at 1005 entries. Every cache hit deletes an entry and appends a new one at the end. After five hits, a compaction is triggered and all 1000 entries need to be reinserted. This is a massive overhead.

Happy case: 5 entries are stored and we resize at 1005 entries. After 1000 hits, a compaction is triggered and the 1000 entries are reinserted. That gets us back to the single update per hit that we have now.

The regular dict is currently tuned for space-efficiency and to not have noticeable performance issues for typical dict workloads (where it is uncommon to intermix deletions and insertions). The regular dict is not tuned for an lru_cache() workload where every hit deletes an entry and appends a new one at the end. This is appropriate — almost the entire point of the compact and ordered dictionary was to save space.

I’ll post some benchmarks later.

Be sure to time the unhappy cases. It is very easy to fool yourself into thinking that a dict store, dict deletion, and periodic compaction is cheaper than a few pointer updates per cache hit.

Also note that any timings reflect the current tuning of the dict which is something would like to be able to change in the future without having to worry about the lru_cache().

Remember, when there is a cache hit, the current implementation makes no changes to the dict at all. Only a few DLL pointers are updated. This is VERY hard to beat.

matthiasgoergens · August 16, 2022, 8:16am

Thanks for weighing in, Raymond!

The most important thing first: I ran more benchmarks.

I used b2afe482f21b826d53886a69ea2c99d0d940c59a as the base for comparison, ie main at the time of writing. I will call the status quo ‘main’, and my patch ‘direct’, because it makes direct use of ordinary dict’s order. My code is on github..

I ran all benchmarks with ./configure --enable-optimizations on my laptop plugged into mains power.

All benchmarks first filled up the cache, then timed how long it took to use it 1,000,000 times. I varied cache sizes via random.randrange(100_000) and let the whole thing run for around a thousand times each.

Directly piggybacking on ordinary dict’s order for LRU is about 10-20% faster for a workload of cache misses-only than ‘main’. For cache hits-only performance is about the same between ‘direct’ and ‘main’.

The results for a mixed workload of hits and misses are essentially the same as for one of only misses: 10-20% speedup.

(Speed up percentages are for the whole benchmark, including overhead. If you subtracted the fixed overhead, they would be higher.)

I did not measure the improvements in memory usage.

If a rewrite seems simpler, it is only because the task of maintaining order has been shifted to another data structure.

The rewrite is simpler, because regular dicts already have all the code for maintaining order, whether the lru-cache takes advantage of that or not.

No matter how lru_cache() is implemented it has to be “really careful about multi-threading and reentrancy”.

Yes, but we can minimize the total care needed, by piggybacking on what the regular dict already needs to do regardless.

I really wish people would discuss these ideas at a conceptual level before embarking on a large rewrite. It puts us all in a difficult position when someone shows up saying, look at all this work I’ve done, especially when the idea has been considered and rejected previously.

No need to worry about my feelings. I implemented this as a learning project, and to verify my suspicion that we can be faster. Especially since the idea has been discussed before, we need new data to have a new discussion.

Nothing about the concepts changed, so another conceptual discussion would probably not be all that useful without more data.

Unhappy case: 1000 entries are stored in an array and we compact/resize at 1005 entries. Every cache hit deletes an entry and appends a new one at the end. After five hits, a compaction is triggered and all 1000 entries need to be reinserted. This is a massive overhead.

Happy case: 5 entries are stored and we resize at 1005 entries. After 1000 hits, a compaction is triggered and the 1000 entries are reinserted. That gets us back to the single update per hit that we have now.

I am intrigued by your ‘unhappy case’. How would you go about hitting that case?

As far as I can tell, if there was a way to hit your unhappy case, you would be able to hit with cache-misses only. But regular dicts are already carefully tuned to avoid this kind of slowdown. (You yourself were involved in some of that good work. See bpo-17563 and bpo-33205.)

Also as far as I can tell from looking at the logic, when a dict gets resized, it always has at least twice as many entries as needed.

Be sure to time the unhappy cases. It is very easy to fool yourself into thinking that a dict store, dict deletion, and periodic compaction is cheaper than a few pointer updates per cache hit.

That’s not how my patch works for cache hits. My cache does a few pointer updates and an amortised periodic compaction per cache hit. (It does not do a dict store and dict deletion.)

Also note that any timings reflect the current tuning of the dict which is something would like to be able to change in the future without having to worry about the lru_cache().

My patch does not introduce any more coupling than we already have:

As mentioned above, the growth rate for regular dicts already takes the needs of a miss-heavy lru_cache-type workload into account. (You were involved in that!) My patch merely re-uses this accommodation for hit-only workloads, too.

Benchmark code

Benchmark data plotted. The repository also has the raw data and the gnuplot file used to create the plot.

Summary: removing the linked list from lru-cache improves performance in any workload that has at least a few misses. It does not change amortized performance in a hit-only workload.

Both ‘main’ and ‘direct’ pay for occasional compactions when the workload has any misses. The ‘direct’ implementation also pays for occasional compactions in a hit-only workload.

The ‘direct’ implementation saves memory.

rhettinger · August 16, 2022, 11:16pm

Thoughts:

Not updating a dictionary is cheaper than updating a dictionary. With the current implementation, we can have millions of consecutive cache hits and NEVER have to mutate the dictionary (no resizes, no reinsertions, no memory allocations, it is just cheap pointer updates).
The path we care about most is a cache hit because misses are dominated by the user function call.
Cache miss performance is less important than hits, but it is worth a look at well. The dict implementation is not optimized for finding the oldest item. Essentially, it runs d.pop(next(iter(d))) which gets quadratically more and more expensive until a resize occurs.
Currently a dict can be tuned for normal workloads and not have to worry about lru_cache workloads which are extreme. In particular, the usable fraction needs to be set high for normal workloads, high for an lru cache miss workload, and low for an lru cache hit workload.
Almost none of the implementation complexity is in the linked list updates. What makes this problem hard is thinking through reentrancy and multithreading concerns.
A lot of human effort was expended in getting the current code correct. It would not serve users well to destabilize it. It took me a several work days to go through Serhiy’s code with a fine-toothed comb and I’m unlikely to repeat that effort ever again.
My recommendation is to stop thinking about the lru_cache() implementation and instead focus on what the dictionary has to do to maintain order given a fixed size array of entries, a high usable fraction, and every cache hit causing an arbitrary key to be repositioned at the end, leaving a hole earlier in the array. Run some simulations and learn how the usable fraction affects the number of resizes which require a full table scan.
Consider adopting an adversarial mindset with your PR. It will help you be objective, help identify the flaws, help understand the underlying algorithms, help perform more useful timings, and avoid the natural tendency to identify with the PR.

matthiasgoergens · August 17, 2022, 2:30am

Raymond, thanks a lot again for your help! It’s a pleasure to have someone with so much experience critiquing my suggestion.

Not updating a dictionary is cheaper than updating a dictionary. With the current implementation, we can have millions of consecutive cache hits and NEVER have to mutate the dictionary (no resizes, no reinsertions, no memory allocations, it is just cheap pointer updates).

I am familiar with that argument. But the benchmark disagrees even for the hits-only workload: it’s the same speed in both implementations, and my patch uses less memory.

I am a bit surprised at this result myself to be honest! The common case in my patch is simple enough, but every once in a while we have to pay for the compaction. I suspect the memory savings help the performance somehow. But I don’t want to make the mistake of trying to outguess the benchmark data with my half-formed ideas.

I’d be more than happy to run some other benchmarks than those I already did, if you have a suggestion.

The path we care about most is a cache hit because misses are dominated by the user function call.

Yes, that makes sense. I agree.

You probably want to look at the hits-only workload, and perhaps a workload that has 90% hits or so?

The ‘direct’ lru-cache has the same performance for the hits-only workload, and a better performance for a workload with at least some misses.

Cache miss performance is less important than hits, but it is worth a look at well. The dict implementation is not optimized for finding the oldest item. Essentially, it runs d.pop(next(iter(d))) which gets quadratically more and more expensive until a resize occurs.

What you are describing was a real problem with @storchaka’s otherwise excellent, old ‘move-to-end’ patch.

But that’s not how my patch works. My patch finds and deletes the oldest entry in amortised O(1).

As a sanity check: this problem would mainly occur in a miss-heavy workload. And sure enough, all three of @storchaka’s old patches ‘cause significant slowdown (around 2.3x) in the case of many misses’ (compared to unpatched).

My code does better than main in the case of many misses.

Currently a dict can be tuned for normal workloads and not have to worry about lru_cache workloads which are extreme. In particular, the usable fraction needs to be set high for normal workloads, high for an lru cache miss workload, and low for an lru cache hit workload.

You are right in principle.

Let’s look at the cases of what we’d want:

Workload	Slack in indices	Slack in entries
Normal dict	high	high
‘main’ lru_cache hit-only	high	none
‘main’ lru_cache at least some misses	high	high
‘direct’ lru_cache hit-only	high	high
‘direct’ lru_cache at least some misses	high	high

In practice, GROWTH_RATE determines the slack in indices, and a combination of GROWTH_RATE and USABLE_FRACTION determines the slack in entries.

If you have the ability to predict perfectly whether your lru_cache workload will have any misses, you could carefully tune your usable fraction.

What this will do in the best case is: save you a bit of memory. And if you get your prediction wrong, your performance will be hosed with frequent resizes.

What you could do instead is: save that memory by ditching the linked list, and get the same performance as the linked list approach for a hit-only workload or better performance for the some-misses workload.

This seems much more robust to me. Having the ability to tune is good, not needing any special tuning is better. The same tuning would work equally well for all workloads.

Almost none of the implementation complexity is in the linked list updates. What makes this problem hard is thinking through reentrancy and multithreading concerns.

I agree.

A lot of human effort was expended in getting the current code correct. It would not serve users well to destabilize it. It took me a several work days to go through Serhiy’s code with a fine-toothed comb and I’m unlikely to repeat that effort ever again.

Sorry, I’m missing some context here. Are you talking about his proposed patches, or the linked-list implementation?

My recommendation is to stop thinking about the lru_cache() implementation and instead focus on what the dictionary has to do to maintain order given a fixed size array of entries, a high usable fraction, and every cache hit causing an arbitrary key to be repositioned at the end, leaving a hole earlier in the array. Run some simulations and learn how the usable fraction affects the number of resizes which require a full table scan.

Great idea! I like it so much, that this is what I’ve already done to come up with my implementation in the first place. I’ve run the benchmarks, too.

Consider adopting an adversarial mindset with your PR. It will help you be objective, help identify the flaws, help understand the underlying algorithms, help perform more useful timings, and avoid the natural tendency to identify with the PR.

That’s good advice in general. I’ve done that already, but of course, there’s always more you can do. Do you have any more ideas for benchmarks to run?

I’d be especially interested in any flaws you can point out.

matthiasgoergens · August 17, 2022, 7:51am

P.S. I also just ran a benchmark with 99% hits / 1% misses (and one million repetitions and total and many different cache size up to 100,000 and with caches warmed up first before timing starts):

def bench_mixed(cachesize, reps, hits):
    r = random.Random()
    r.seed(0)
    f = ft.lru_cache(maxsize=cachesize)(id)

    # fill cache at random:
    keys = list(range(cachesize))
    r.shuffle(keys)
    for k in keys:
        f(k)

    start = time.perf_counter()

    for c in range(reps):
        f(r.randrange(math.ceil(cachesize/hits)))
    end = time.perf_counter()
    return end - start

Basically the code above with hits=0.99.

I had expected the performance of ‘main’ and ‘direct’ to be essentially identical to their hits-only case (where they both take essentially identical amounts of time). Interestingly enough, even this tiny proportion of misses is already enough for the ‘direct’ version to pull ahead:

rhettinger · August 17, 2022, 3:06pm

Do you have any more ideas for benchmarks to run?

Focus on understanding how a dictionary degrades during lru cache operations.

For example, finding the oldest entry involves a linear search, so repeatedly deleting the oldest entry is quadratic:

% python3.11 -m timeit 'd=dict.fromkeys(range(1_000_000))' 'for i in range(100_000): d.pop(next(iter(d)))'
1 loop, best of 5: 1.59 sec per loop
% python3.11 -m timeit 'd=dict.fromkeys(range(1_000_000))' 'for i in range(200_000): d.pop(next(iter(d)))'
1 loop, best of 5: 6.34 sec per loop
% python3.11 -m timeit 'd=dict.fromkeys(range(1_000_000))' 'for i in range(300_000): d.pop(next(iter(d)))'
1 loop, best of 5: 14.2 sec per loop
% python3.11 -m timeit 'd=dict.fromkeys(range(1_000_000))' 'for i in range(400_000): d.pop(next(iter(d)))'
1 loop, best of 5: 25.9 sec per loop
% python3.11 -m timeit 'd=dict.fromkeys(range(1_000_000))' 'for i in range(500_000): d.pop(next(iter(d)))'
1 loop, best of 5: 42.1 sec per loop

matthiasgoergens · August 17, 2022, 3:34pm

To quote myself:

Yes, if you used dictiter_iternextkey for finding the oldest entry, the operation would indeed run in amortised linear time. (Or in quadratic time for a sequence of n delete-oldest operations.)

Which is exactly why my code does not do that. How else do you think I can beat what’s in ‘main’ for speed?

(If I could employ such a silly approach and still beat the code in ‘main’ for speed, something would be truly, truly weird. As you suggested earlier, the status quo is already pretty well optimized.)

Would you please just have a look at the code in question on github, instead of bringing up hypothetical flaws of a naive approach that the code doesn’t actually share? Thanks!

Here are some more benchmarks for you.

Cache hits:

$ python-from-main -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = [random.randrange(100_000) for _ in range(1_000_000)]' -- 'list(map(f, a))'
1 loop, best of 5: 267 msec per loop
$ python-with-direct-lru -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = [random.randrange(100_000) for _ in range(1_000_000)]' -- 'list(map(f, a))'
1 loop, best of 5: 244 msec per loop

Cache misses:

$ python-from-main -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = list(range(1_000_000))' -- 'list(map(f, a))'
2 loops, best of 5: 177 msec per loop
$ python-with-direct-lru -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = list(range(1_000_000))' -- 'list(map(f, a))'
2 loops, best of 5: 132 msec per loop

99% cache hits:

$ python-from-main -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = [random.randrange(101_000) for _ in range(1_000_000)]' -- 'list(map(f, a))'
1 loop, best of 5: 249 msec per loop
$ python-with-direct-lru -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = [random.randrange(101_000) for _ in range(1_000_000)]' -- 'list(map(f, a))'
1 loop, best of 5: 235 msec per loop

50% hits:

$ python-from-main -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = [random.randrange(200_000) for _ in range(1_000_000)]' -- 'list(map(f, a))'
1 loop, best of 5: 356 msec per loop
$ python-with-direct-lru -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = [random.randrange(200_000) for _ in range(1_000_000)]' -- 'list(map(f, a))'
1 loop, best of 5: 282 msec per loop

I also did some more involved benchmarks with quite a lot more data, and even some linear regressions. I already linked to them above. But I can explain them in more detail, if you are interested.

rhettinger · August 17, 2022, 7:01pm

Pure Python

For those who are watching from the stands, here is a rough pure python equivalent:

class lru_cache:

    def __init__(self, func, maxsize=128):
        self.maxsize = maxsize
        self.func = func
        self.d = {}

    def __call__(self, *args):
        d = self.d
        v = d.pop(args) if args in d else self.func(*args)
        d[args] = v
        if len(self.d) > self.maxsize:
            oldest_key = next(iter(d))  # Sped-up with a search finger hack.
            d.pop(oldest_key) 
        return v            

@lru_cache            
def square(x):
    answer = x * x
    print(f'{x} --(square)--> {answer}')
    return answer

Deleting the oldest entry

The PR shows a willingness to cross lines that I won’t. To avoid a linear search for the oldest key, the code reinvents some of what the linked list was already doing by keeping a search finger, “already_deleted_entries”.

In doing so, it breaks encapsulation and directly accesses dict internals. Presumably, that is why no pure python equivalent was offered.

Maintaining order

The PR shifts the burden of maintaining order from the linked list to dictionary. The linked link approach just updates a pair of links. The dict approach has to do a delete and insert and will trigger full table scan resizes, doing work that we didn’t have to do before.

Here’s a simulation showing the resizes as cache hits occur. To keep the example short, the effect is exaggerated with a high ratio of occupied entries (25) to maximum space in the entrylist (30), but it does show that the frequency of resizes depends on how the dict is tuned (something that is not of concern in the current implementation) and that resizes are in fact triggered by cache hits (something which does not occur at all in current implementation).

>>> keylist_simulation(steps=20, n=30, lru_maxsize=25, verbose=True)
*************************----- <- Start
*****-********************---- Move entry 6 to the end
*****-**-******************--- Move entry 9 to the end
*****-**-***************-***--
*****-**-******-********-****-
*-***-**-******-********-***** We're now full
*************************----- <- Compacted.
Steps = 5   Cumulative_work = 30
*******-******************----
*******-*****************-*---
*******-*******-*********-**--
*****-*-*******-*********-***-
*****-*-******--*********-****
*************************----- <- Compacted.
Steps = 10 Cumulative_work = 60
******-*******************----
*****--********************---
*****--*******-*************--
*****--*******-**************-
*****--*******-********-******
**************************----- <- Compacted.
Steps = 15 Cumulative_work = 90

Reentrancy and concurrency

I haven’t reviewed this part yet. I know that even the pure python version had to worry about reentrancy. The PR is hand-wavy and says, “We don’t need to worry too much about reentrancy.” Hopefully that is true. It is a major headache.

Time available

Answering the OP has consumed all of my development time for a while. It is my impression (possibly false) that the OP will persist indefinitely and never accept that the PR makes the underlying dict do more work than it does now and that the dict implementation details and lru cache implementation details should remain decoupled.

It is also my impression (possibly false) that the current implementation is stable and highly performant. It has the wonderful property that cache hits have no side-effects, nothing moves or rebuilds — a fixed number links are exchanged and that is it — the latency is constant and small.

As far as I can tell, there is no user problem being solved by the PR. The OP just prefers another implementation premised on the notion that link updates are expensive (they aren’t) and that repeated dict deletions and insertions are free (they aren’t).

matthiasgoergens · August 19, 2022, 2:30am

Thanks you for the very warm welcome, Raymond.

Thank you for your vote of confidence in my work ethic. I am not so optimistic. In fact I was getting a bit discouraged and ready to give up, but then I saw your code review at last and your encouraging remarks, and that gave me the confidence necessary to persevere. I won’t disappoint you!

User need

The least-recently used cache is a commonly used data structure in the Python standard library. Any caching scheme benefits from making efficient use of the available memory, so that users can either cache more entries in the same space or use less memory.

The proposal shaves 7 machine words (ie 56 bytes on a 64 bit machine) of memory overhead for each cache entry. All the while improving runtime performance: slightly in the case of a 100% perfect hits-only workload, and noticeably once you get any misses at all.

Matthias Görgens:

Cache hits-only:

$ python-from-main -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = [random.randrange(100_000) for _ in range(1_000_000)]' -- 'list(map(f, a))'
1 loop, best of 5: 267 msec per loop
$ python-with-direct-lru -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = [random.randrange(100_000) for _ in range(1_000_000)]' -- 'list(map(f, a))'
1 loop, best of 5: 244 msec per loop

99% cache hits:

$ python-from-main -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = [random.randrange(101_000) for _ in range(1_000_000)]' -- 'list(map(f, a))'
1 loop, best of 5: 249 msec per loop
$ python-with-direct-lru -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = [random.randrange(101_000) for _ in range(1_000_000)]' -- 'list(map(f, a))'
1 loop, best of 5: 235 msec per loop

50% hits:

$ python-from-main -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = [random.randrange(200_000) for _ in range(1_000_000)]' -- 'list(map(f, a))'
1 loop, best of 5: 356 msec per loop
$ python-with-direct-lru -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = [random.randrange(200_000) for _ in range(1_000_000)]' -- 'list(map(f, a))'
1 loop, best of 5: 282 msec per loop

Cache misses-only:

$ python-from-main -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = list(range(1_000_000))' -- 'list(map(f, a))'
2 loops, best of 5: 177 msec per loop
$ python-with-direct-lru -m timeit -s 'from functools import lru_cache; import random; f = lru_cache(100_000)(lambda x: x); a = list(range(1_000_000))' -- 'list(map(f, a))'
2 loops, best of 5: 132 msec per loop

(The benchmark above shows between 9% to 50% runtime overhead of ‘main’ compared to the proposal. I already posted more involved benchmarks and analysis further up.)

The figure of saving 7 words machine comes from me counting the number of pointers saved in the source.

To confirm that theoretical figure, I also did some benchmarks of memory usage:

import gc, math, functools, random, tracemalloc
def bench1(cachesize, reps, hits=0.5):
    r = random.Random()
    r.seed(0)
    f = functools.lru_cache(maxsize=cachesize)(lambda x: None)

    keys = [r.randrange(math.ceil(cachesize/hits)) for _ in range(reps)]

    gc.collect()
    tracemalloc.start()
    first_size, first_peak = tracemalloc.get_traced_memory()

    for k in keys:
        f(k)
    
    second_size, second_peak = tracemalloc.get_traced_memory()
    return second_size - first_size

cachesize = random.randrange(100_000)
mem = bench1(cachesize=cachesize, reps = 400_000)

The slopes of the linear regression show a growth of 93 bytes / entry for the ‘direct’ proposal, and 144 bytes / entry for ‘main’. That’s about 50% more memory usage for ‘main’, or 51 bytes. Given that the real graphs are more of a step-functions than strictly linear, that’s a pretty decent agreement with our theoretical expectation of 56 bytes saved.

Pure Python

Thanks, Raymond, for producing an approximate pure Python equivalent. As a new contributor I wasn’t aware that this is good form. I will be better in the future.

I any case, here’s a more accurate rendering:

class lru_cache:
  def __init__(self, func, maxsize=128):
    self.maxsize = maxsize
    self.func = func
    self.d = {}
    self.finger = d.make_finger() # new proposed function.

  def __call__(self, *args):
      try:
        return self.d.get_and_move_to_end(args) # new proposed function
      except KeyError:
        return self.d.setdefault(args, self.func(*args))
      finally:
        while len(self.d) > self.maxsize:
          self.d.delete_oldest(self.finger) # new proposed function.

Note how this proposal differs from Raymond’s portrayal of ‘pop-then-set’. This is important, because ‘pop-then-set’ is not only slower, it also calls out to user code more often (even if you use the ‘known-hash’ variants), and thus has more opportunities for reentrancy shenanigans.

Similarly, the proposal’s delete_oldest does not call out to user code.

Raymond, I took your feedback about breaking encapsulation to heart, and moved the data needed for the ‘finger’ into an opaque blob that dict provides, and that delete_oldest expects. That way lru_cache won’t have to know anything about dict internals. Thanks for that suggestion!

In C terms:

typedef struct {
    PyDictKeysObject *sentinel;
    Py_ssize_t skip_empty;
} PyDictFinger;

PyAPI_FUNC(PyDictFinger) _PyDict_NewFinger() {
    return (PyDictFinger) {.sentinel = NULL, .skip_empty = 0};
}

int
_PyDict_DelOldest(PyDictObject *mp, PyDictFinger *finger)
{
    PyObject *key, *value;
    Py_hash_t hash;

    if(finger->sentinel != mp->ma_keys) {
        finger->sentinel = mp->ma_keys;
        finger->skip_empty = 0;
    }

    if(!_PyDict_Next((PyObject *)mp, &finger->skip_empty, &key, &value, &hash)) {
        return -1;
    }
    return delitem_common(mp, hash, finger->skip_empty-1, value);
}

As a simpler alternative, we can also put something like Py_ssize_t skip_empty; directly into the underlying dict at a cost of one extra machine word per dict. Then you don’t need to pass the finger around.

But I am not sure about the trade-offs, and I don’t want to interfere with any of the regular dict operations in this narrow proposal.

Ideally, I would like to see operations like get_and_move_to_end and delete_oldest available available in the public C and (especially!) Python API of dict. Not only would my approximate Python code above cease to be approximate and become working code, but to be honest, investigating why these kinds of capabilities weren’t provided by regular dicts was my original motivation for looking into this topic at all.

I have some algorithmic work that would benefit from these. See also my post on and implementation of (random choice from dict in amortised O(1) time)[Random choice from a dict).

Maintaining order

Yes, the proposal re-uses the order maintenance that regular dicts already provides. That is the whole point.

Yes, to make dicts (or dynamic arrays) work in amortised constant time, you need to leave constant fraction of slack in your system. If you leave only a constant absolute amount you get amortised linear runtime for operations like insert.

That applies to regular dicts and also applies to how we use dicts in the lru_cache, both in ‘main’ and in the ‘direct’ proposal.

Yes, for the very special case of being able to predict with 100% accuracy that your workload is composed of 100% hits only, and if you are willing to use an extra 56 bytes per entry and suffer worse amortised runtime, then, yes, you can decrease the slack for the entries.

Do you think that saves much memory? You still have to pay for the 56 bytes extra per entry.

And if you guess even slightly wrong and you have a few misses every once in a while, both your worst case performance and amortised runtime performance per operation drops to O(n).

(Btw, if you can guess so perfectly, consider running the unlimited version of the cache, as that one gives you performance and storage requirements of the raw dict, with no overhead for storing linked lists nor pointer fiddling. The cachesize will stay bounded despite being notionally unlimited, because we assume that you have a perfect 100% hits-only workload.)

The OP does not have any notions. The OP ran benchmarks.

Yes, the existing implementation is relatively fast. Doesn’t that make it even more amazing that the proposal manages to be even faster, even in the best case for the existing implementation (of 100% hits-only)? And saves memory on top of all that!

tiran · August 19, 2022, 6:02am

This is some impressive research. Awesome work.

Your discussion seems to have reach an impasse. May I suggest that you create an issue at GitHub - faster-cpython/ideas to draw the attention of other core developers who work on faster Python? You also have the option to contact the steering council and ask them to mediate.

matthiasgoergens · August 19, 2022, 6:15am

@tiran Awesome, thanks for the suggestion!

I’ll post to faster-cpython, because there’s still some areas of my draft proposal that need feedback and work, before it’s ready to go in front of any kind of council. I’m pretty much a rank novice when it comes to contributing to Python.

iritkatriel · August 19, 2022, 9:38am

@matthiasgoergens It seems that in your latest version all you really need from python is to add a finger API to dictionaries. Then you can implement your alternative lru_cache in an external module. Am I right?

Perhaps we should discuss the pros and cons of such an API separately?

matthiasgoergens · August 19, 2022, 10:17am

@iritkatriel Yes, ideally I would like to (carefully!) add to the dict api, so that an external module could do the same work. I don’t actually care, if the standard library lru-cache is more bloated than necessary, if it’s possible to use a third-party library.

(I say ‘carefully’, because I don’t want to constrain any future evolution of dict any more than existing public API commitments already constraint it.)

See Dict API and feaner and faster lru-cache · Issue #450 · faster-cpython/ideas · GitHub for some more thoughts on the issue.

Basically, dict would need to provide ‘get-and-move-back’ and ‘delete-oldest’. But you need to be really careful about exactly how you are providing this, if you want to get good performance.

iritkatriel · August 19, 2022, 10:38am

@matthiasgoergens In that case, I wonder if it would make sense for you to make a PR with just the finger API addition (with whatever unit tests and doc changes it would need to get merged). It may help push the discussion along - it’s always easier to comment on an API change that you can see than on one you need to imagine.

The lru_cache idea can be a motivating use case for the new API.

(If this happens I imagine it would require a PEP as well, but I would suggest you get some more feedback on the idea to see what its chances are, before beginning to write one).

matthiasgoergens · August 19, 2022, 2:27pm

@iritkatriel Thanks for the suggestion!

Do you know whether it’s feasible to work on that PR as part of the broader ‘faster-cpython’ process or should I start a separate track of work? (I’d prefer the former, if that’s possible with the processes that are common for CPython.)