Simplify lru_cache

iritkatriel · August 19, 2022, 2:40pm

I’m not sure I understand what you mean by the faster-cpython process, but the faster-cpython issue tracker is mostly about discussions on ideas and experiments, as well as benchmark tooling.

When it comes time to commit code to cpython, you need to create an issue on the cpython repo to describe the change you’re making. So a PR would need to be associated with a cpython issue.

matthiasgoergens · August 19, 2022, 2:51pm

Sorry, I think I was a bit confused. Basically I thought I’d go through the discussion with people on ‘faster-cpython ideas’, and then once that’s progressed enough to be more concrete, I’ll raise the CPython github issue and open a PR?

iritkatriel · August 19, 2022, 2:58pm

You need a PR on the cpython repo in order to have the discussion. You can create a draft PR (there’s a button to convert an open PR to draft), and create the cpython issue only when you think the PR ready to be reviewed for real.

merwok · August 19, 2022, 3:10pm

It was Christian Heimes who suggested opening a ticket in the faster-cpython project to discuss this proposed change in another forum than this one. The faster-cpython team does have discussions on their tracker, before opening cpython tickets then PRs.

guido · August 19, 2022, 3:33pm

Actually we decided that we don’t like the Discussions category any more. I recommend an issue in the cpython project. (It was proposed to start with a draft PR, I’m not sure where that suggested process comes from – IIUC the workflow typically goes from informal discussion in some mailing list or forum → issue to solidify the design → PR to solidify the implementation. It’s fine to have a draft PR ready if you think the design is more easily understood by looking at that. I believe it’s still frowned upon to start with a PR and then create an issue at the last minute to satisfy some bot’s requirement.

rhettinger · August 19, 2022, 5:13pm

What is the best place in the process to have an algorithmic discussion? Recency tracking has to be done somewhere, either through near instant linked list updates (serhiy’s code) or through repeated deletion/appending/resizing in the dict (the OP’s code)? I built a simulation and posted an easy to understand diagram for the latter and no one seems to have looked at it.

What is the best place in the process for a strategic discussion? To efficiently find the oldest entry, the dict implementation has to be modified in a way that amounts to a permanent API commitment, equivalent to OrderedDict’s popitem(False). In the past, you’ve shown a reluctance to move the OrderedDict API (and its algorithmic commitments) into regular dicts.

Lastly, as a reviewer, what I am supposed to do now? I’ve spent many hours going through this proposal and reading the OP’s code. Presumably I’m the reviewer best qualified to look at this, having originated the current dict design and having spent several days on a detailed review of Serhiy’s lru_cache() current implementation. My assessment is that the OP is going down the wrong path and will make us subtly worse off. Is there any place in our process for me to voice that conclusion and have it taken seriously? It feels as though my technical comments have been casually dismissed, that my review effort and expertise have been wasted, and that everyone is going to encourage the OP to continue down this path as if I hadn’t participated at all.

guido · August 19, 2022, 5:36pm

I don’t know. I know for sure I don’t have the time to learn the ins and outs of this algorithm (I’m only a pretty casual user of the LRU cache implementationandI know very little about modern dict internals). Most other core devs are presumably in the same position. Maybe the best experts here are Serhiy and Inada.

iritkatriel · August 19, 2022, 6:07pm

I don’t think this is a correct interpretation of what happened. I read and understood what you wrote.

I suggested that the OP try to focus on the API changes in dict, and get them evaluated on their own merit (are they worth having, are there any other benefits to having them, are there any problems with them)? I was clear that I am not at all sure that this proposal will be accepted.

I don’t think it’s our place to tell people when to stop pursuing their research ideas. You can tell the OP that you don’t think their idea has a chance (as you have done). It’s up to them what they want to do with that information. At the end of the day, if the idea doesn’t have legs then no PR will be merged.

rhettinger · August 19, 2022, 6:19pm

Perhaps I’m wrong, but I would have put myself at the top of that list of experts

smontanaro · August 19, 2022, 6:36pm

I will echo this sentiment with an anecdote. I worked at GE’s R&D Center from the mid-80s to the mid-90s. At that time one of the researchers I knew, Harvey Cline, had something like 150 patents (including Marching Cubes). Harvey’s perspective, as I recall it, was the only way to have good ideas was to have a lot of ideas. His co-inventor, Bill Lorensen, had a similar approach. We were in the same group (I knew Bill much better than Harvey). We’d go to SIGGRAPH for a week each year. Bill would just be bubbling with ideas from the experience. You strike out with most ideas, but every now and then you hit one out of the ballpark.

So, if someone has an idea, let them run with it. Your input will be useful. They will figure out when to stop pursuing the idea.

guido · August 19, 2022, 6:37pm

I was under the impression you were looking for a second opinion or someone to have a discussion with. I didn’t want to suggest that you have a discussion with yourself.

iritkatriel · August 19, 2022, 6:46pm

@smontanaro I’d go further and say that even when you’re sure it’s a bad idea, letting someone run with it and hit a wall is better than shutting it down with “I know best and I’m telling you it won’t work”. At least if we’re trying to steward a community and encourage contributions.

rhettinger · August 19, 2022, 6:59pm

I 100% agree.

storchaka · August 19, 2022, 7:45pm

I think that Raymond and I are the largest experts of this piece of code. Raymond is de-facto the maintainer of the functools module and the author and the maintainer of the Python implementation of lru_cache(). I mentored two previous attempts of writing the C implementation and wrote the final implementation, rewriting the half of the code.

My thoughts on this are not so one-sided. I would like to simplify the code by getting rid of the linked list if it is possible without losing quality. There are two conditions. Without fulfilling both conditions, nothing should be changed. The current code is good enough and well tested.

Performance. The theoretical difference between O(1) and O(n) can be ignored if in practice the new code is always faster (because in practice the current code is O(n) if there are any misses). If your code is only hits, then perhaps a bounded lru_cache is not the right tool. This requires very careful testing.
Maintainability. The new code must not create a tight coupling between the implementations of lru_cache and dict.

My past attempts failed the first condition. I tried to make as small changes in the dict implementation as possible, but it was not enough, and I convinced myself that it is not worth to spent more time. The current proposal fails the second condition (and it may fail the first one with better testing). It virtually moves the code complexity from _functools.c to dictobject.c. And this code is so specific, that there is small chance to find a use of it in other applications. Can it be improved? Maybe. But I would not bet on it.

Having different initial position with Raymond (the O(1) vs O(n) argument for hits-only) I came to the same conclusion: it is better to keep the current code of lru_cache.

malemburg · August 19, 2022, 9:37pm

This has been an interesting discussion, but I am a bit puzzled about the high level of theory without actually looking at practical use cases, e.g. regular expressions being cached in the re module (*).

Most of the times, LRU or similar caches are used for cases, where you only ever cache a few hundred entries. Keeping more in the cache doesn’t make sense memory wise. The expense of calculating a new result is almost always much higher than doing an O(n) scan of the cache, so it pays off to be careful with what to delete, regardless of the chosen LRU implementation.

Taking such a practical perspective, performance of the cache is secondary. Thread safety and maintenance are of much higher priority.

For use cases where performance does prove to be an issue, I believe that a 3rd party extension would be a better approach.

There are a few other ways to improve practical performance, e.g. it may be beneficial to not regard all entries as equal, but instead add additional cost factors as metric which then allow taking the effort it takes to recreate an entry into account as well.

Another option would be to add the reuse distance as metric (LIRS algorithm) – certain access patterns have entries in the LRU bubble up regularly, which can be avoided using LIRS.

(*) Note: The re._compile() cache does use an (ordered) dict for LRU caching. It’s not fully thread-safe, but I doubt you can create an LRU cache in fewer lines of Python code

rhettinger · August 19, 2022, 10:26pm

There is more here than theory. For the idea to work, the regular dictionary has to be permanently altered support the OrderedDict operations od.popitem(last=False) and od.move_to_end(k, last=True). The lru_cache() part then reverts back to basically the same code we had when it first landed in 3.2.

Also, the performance of the cache would become tightly coupled to the whatever dict tuning parameters are there at the time. In contrast, the current implementation has loose coupling and a cache hit never mutates the dictionary — it is guaranteed to be forever cheap regardless of how the dict evolves. With the OP’s proposal, we lose that.

In other words, there is more to think about than just running a few benchmarks on two implementations which are both insanely fast.

matthiasgoergens · August 20, 2022, 8:10am

Sorry, I might have missed that one? I remember you mentioned something call keylist_simulation, is that the one, or something else?

If it’s something else: I’m extremely keen to have a look!

If it’s the keylist_simulation: as far as I can tell, your write-up just showed what everyone agrees with. Sorry, my part of the discussion would have just been: I agree, that if you tune your dict’s GROWTH_RATE and USABLE_FRACTION wrong, you are going to have a bad day. In the lru-cache workload with any misses at all, that means that you need to leave at least a constant fraction of slack, if you want to get O(1) amortised performance.

Did I misunderstand you?

I agree with the two conditions you brought up. And I will not continue working on changing the existing standard-library. Instead I will focus on the API changes.

I am hoping to show that we can flesh out the dict API both with generally useful methods, and without expanding any existing constraints from our API commitments. (And without impacting existing uses.) My thought here is that we are already committed to preserving insertion order, and that this already constraints us so much, that we can get the rest for ‘free’. But more on that in the other discussion.

Thanks for the excellent suggestion! I am curious and will have a look.

The proposed implementation uses less memory. (The faster amortised runtime performance is only an accident, to be honest. I was just exploring.)

Simplicity of implementation was my main concern, when I started the exploration.

Slightly off-topic, since you don’t want these complication in the standard library:

What you describe is close to what actually started me on this journey in the first place.

To give a very rough sketch, I wanted to use dict as a cache, and every so often do some random sampling from of the items in the dict to calculate some statistics, and then use those to decide what to evict. (Well, the first hurdle was that dicts don’t allow random sampling, and then my dive in the code started from there.)

malemburg · August 25, 2022, 10:56am

Right and I think I mentioned that in my posting.

Still, I may have not made my point clear enough, so here’s another try:

The lru_cache is a helper to cache function results. It is mostly used for functions where calculating the output is expensive, so the overall win of using lru_cache is reducing the number of times the function has to recalculate values for the same combination of arguments.

When evaluating methods to use for such caching, it is important to look at the overall performance gain of calling the caching function vs. calling the non-caching function. In most cases, the lru_cache overhead will be negligible compared to the cost of the function calculating a new value, so optimizing the cache itself pays of less than optimizing the used caching strategy, since you want to avoid having to recalculate function values as much as possible.

The situation is similar for the memory used by the cache itself. The results you are caching will typically require a lot more memory than the cache itself.

So overall, it makes more sense to focus on caching strategies than the performance of the cache itself. LRU is already a lot better than naive strategies, but depending on the use case, other methods may give even better performance (e.g. LIRS or weighted approaches).

I had another look at the implementation. Turns out the re._compile() cache is not using an LRU strategy. It’s missing the update part of LRU. Instead, it simply caches all values and removes the oldest entry when the cache gets full – which could well be the most often used one. There’s definitely room for improvement

guido · August 25, 2022, 4:08pm

IIRC the original re.compile cache was even stupider. When the cache was full it would just clear it completely. Worked fine, usually.

storchaka · August 25, 2022, 5:57pm

If you look at the history of that piece of the code, you would find that the caching algorithm was changed many times, perhaps 4 or 5 times, or more. Not including the tuning of the size of the cache.

One time it used lru_cache() (or maybe there were several unsuccessful attempts of using lru_cache()), but not all results should be cached, additional code added an overhead over lru_cache(). And lru_cache() implemented in Python was too slow for this. One time it used OrderedDict, but move_to_end() added an overhead, especially if implemented in Python. Nothing can beat a simple dict lookup. Yes, the most often used entry can be removed from the cache, but if it is often used, it will be added again very soon. There is still a benefit from using a cache.

Perhaps we should use the secondary cache for items removed from the primary cache, and that cache can use an LRU strategy.