Raymond, thanks a lot again for your help! It’s a pleasure to have someone with so much experience critiquing my suggestion.
> Not updating a dictionary is cheaper than updating a dictionary. With the current implementation, we can have millions of consecutive cache hits and NEVER have to mutate the dictionary (no resizes, no reinsertions, no memory allocations, it is just cheap pointer updates).
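(For readers without the context: the ‘cheap pointer updates’ refer to the doubly linked list the current implementation threads through the cache; a hit only unlinks a node and relinks it at the most-recently-used end. A rough pure-Python sketch of that hit path, with illustrative names, not the actual C code:)

```python
class Node:
    """A node in a circular doubly linked list; the root is a sentinel."""
    __slots__ = ("prev", "next", "key")

    def __init__(self, key=None):
        self.prev = self
        self.next = self
        self.key = key

def link_before(root, node):
    """Insert node at the most-recently-used position (just before root)."""
    last = root.prev
    last.next = node
    node.prev = last
    node.next = root
    root.prev = node

def move_to_mru(root, node):
    """A cache hit: unlink the node and relink it at the MRU end.
    Six pointer writes total; the hash table itself is never touched."""
    node.prev.next = node.next
    node.next.prev = node.prev
    link_before(root, node)
```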
I am familiar with that argument. But the benchmark disagrees even for the hits-only workload: it’s the same speed in both implementations, and my patch uses less memory.
I am a bit surprised at this result myself to be honest! The common case in my patch is simple enough, but every once in a while we have to pay for the compaction. I suspect the memory savings help the performance somehow. But I don’t want to make the mistake of trying to outguess the benchmark data with my half-formed ideas.
I’d be more than happy to run some other benchmarks than those I already did, if you have a suggestion.
> The path we care about most is a cache hit because misses are dominated by the user function call.
Yes, that makes sense. I agree.
You probably want to look at the hits-only workload, and perhaps a workload that has 90% hits or so?
The ‘direct’ lru_cache has the same performance for the hits-only workload, and better performance for a workload with at least some misses.
> Cache miss performance is less important than hits, but it is worth a look as well. The dict implementation is not optimized for finding the oldest item. Essentially, it runs `d.pop(next(iter(d)))` which gets quadratically more and more expensive until a resize occurs.
What you are describing was a real problem with @storchaka’s otherwise excellent, old ‘move-to-end’ patch.
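Concretely, that approach can be sketched in pure Python (illustrative only, not any of the actual patches): a hit pops and reinserts the key so that insertion order doubles as recency order, and a full cache evicts via `d.pop(next(iter(d)))`. In CPython, repeated pops from the front leave holes at the start of the dense entries array, so `next(iter(d))` has to skip more and more of them until a resize compacts the table.

```python
def cached_call(cache, maxsize, func, key):
    # Sketch of a dict-based 'move-to-end' LRU cache.
    if key in cache:
        value = cache.pop(key)  # hit: delete and reinsert to mark as newest
        cache[key] = value
        return value
    value = func(key)           # miss: call the (expensive) user function
    if len(cache) >= maxsize:
        # Evict the oldest entry. next(iter(cache)) must skip every hole
        # left by earlier evictions, which is where the quadratic
        # behaviour creeps in until a resize compacts the dict.
        cache.pop(next(iter(cache)))
    cache[key] = value
    return value
```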
But that’s not how my patch works. My patch finds and deletes the oldest entry in amortised O(1).
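I can't paste the C code inline, but the idea behind it can be sketched in pure Python (a simplified model, not the actual patch): entries live in a dense array in recency order, a hit leaves a hole and appends at the end, the oldest live entry is found by advancing a cursor past the holes, and a periodic compaction keeps the array bounded. Each hole is skipped at most once before compaction removes it, which is what makes eviction amortised O(1).

```python
class CompactingLRU:
    """Toy model of a 'direct' LRU: dense entries array plus a
    key->position index, with amortised O(1) eviction of the oldest."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self.pos = {}       # key -> position in self.entries
        self.entries = []   # (key, value) in recency order; None marks a hole
        self.head = 0       # cursor at or before the oldest live entry

    def _compact(self):
        # Squeeze out the holes and rebuild positions: a full scan,
        # but it runs only once every O(maxsize) operations.
        self.entries = [e for e in self.entries if e is not None]
        self.pos = {k: i for i, (k, _) in enumerate(self.entries)}
        self.head = 0

    def get(self, key, compute):
        i = self.pos.get(key)
        if i is not None:                      # hit
            _, value = self.entries[i]
            self.entries[i] = None             # leave a hole at the old slot
        else:                                  # miss
            value = compute(key)
            if len(self.pos) >= self.maxsize:
                while self.entries[self.head] is None:
                    self.head += 1             # skip holes: amortised O(1)
                old_key, _ = self.entries[self.head]
                self.entries[self.head] = None
                del self.pos[old_key]
        self.pos[key] = len(self.entries)
        self.entries.append((key, value))      # newest entry goes at the end
        if len(self.entries) > 2 * self.maxsize:
            self._compact()
        return value
```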
As a sanity check: this problem would mainly occur in a miss-heavy workload. And sure enough, all three of @storchaka’s old patches ‘cause significant slowdown (around 2.3x) in the case of many misses’ (compared to unpatched).
My code does better than main in the case of many misses.
> Currently a dict can be tuned for normal workloads and not have to worry about lru_cache workloads which are extreme. In particular, the usable fraction needs to be set high for normal workloads, high for an lru cache miss workload, and low for an lru cache hit workload.
You are right in principle.
Let’s look at the cases of what we’d want:
| Workload | Slack in indices | Slack in entries |
|---|---|---|
| ‘main’ lru_cache, hit-only | high | none |
| ‘main’ lru_cache, at least some misses | high | high |
| ‘direct’ lru_cache, hit-only | high | high |
| ‘direct’ lru_cache, at least some misses | high | high |
GROWTH_RATE determines the slack in indices, and USABLE_FRACTION determines the slack in entries.
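To put rough numbers on it (using CPython's long-standing 2/3 usable fraction as an assumption; the exact constants are implementation details that have varied across versions):

```python
def usable_entries(index_table_size, usable_fraction=2/3):
    # Assumed model: a dict with `index_table_size` index slots can hold
    # at most `usable_fraction` of that many entries before it must resize.
    return int(index_table_size * usable_fraction)

# A 1024-slot index table with a 2/3 usable fraction holds ~682 entries;
# the remaining ~342 slots are the slack in indices that keeps probing fast.
```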
If you have the ability to predict perfectly whether your lru_cache workload will have any misses, you could carefully tune your usable fraction.
What this will do in the best case is: save you a bit of memory. And if you get your prediction wrong, your performance will be hosed with frequent resizes.
What you could do instead is: save that memory by ditching the linked list, and get the same performance as the linked list approach for a hit-only workload or better performance for the some-misses workload.
This seems much more robust to me. Having the ability to tune is good, not needing any special tuning is better. The same tuning would work equally well for all workloads.
> Almost none of the implementation complexity is in the linked list updates. What makes this problem hard is thinking through reentrancy and multithreading concerns.
> A lot of human effort was expended in getting the current code correct. It would not serve users well to destabilize it. It took me several work days to go through Serhiy’s code with a fine-toothed comb and I’m unlikely to repeat that effort ever again.
Sorry, I’m missing some context here. Are you talking about his proposed patches, or the linked-list implementation?
> My recommendation is to stop thinking about the lru_cache() implementation and instead focus on what the dictionary has to do to maintain order given a fixed size array of entries, a high usable fraction, and every cache hit causing an arbitrary key to be repositioned at the end, leaving a hole earlier in the array. Run some simulations and learn how the usable fraction affects the number of resizes which require a full table scan.
Great idea! I like it so much that it is exactly what I did to come up with my implementation in the first place. I’ve run the benchmarks, too.
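For anyone who wants to reproduce that kind of experiment, a toy version of the simulation might look like this (the model is deliberately simplified and the constants are illustrative, not CPython’s):

```python
def count_full_scans(maxsize, usable_fraction, hits):
    # Toy model: a fixed-size entries array of maxsize / usable_fraction
    # slots. Every cache hit leaves a hole and appends the entry at the
    # end; when the array fills up, a compacting resize (full scan) runs.
    capacity = int(maxsize / usable_fraction)
    used = maxsize          # slots consumed: live entries plus holes
    scans = 0
    for _ in range(hits):
        used += 1           # one hole plus one append per hit
        if used >= capacity:
            used = maxsize  # compaction squeezes out all the holes
            scans += 1
    return scans

# With maxsize=128 and a 1/2 usable fraction there are 128 slots of slack,
# so a hit-only workload triggers one full scan roughly every 128 hits.
```

Varying `usable_fraction` in this model shows the trade-off directly: more slack means fewer full scans but more memory per entry.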
> Consider adopting an adversarial mindset with your PR. It will help you be objective, help identify the flaws, help understand the underlying algorithms, help perform more useful timings, and avoid the natural tendency to identify with the PR.
That’s good advice in general. I’ve done that already, but of course, there’s always more you can do. Do you have any more ideas for benchmarks to run?
I’d be especially interested in any flaws you can point out.