Reduce the overhead of functools.lru_cache for functions with no parameters

functools.lru_cache() has two common uses. The first is as it was designed: an LRU cache for a function, with an optional bounded max size.

The other is as a replacement for this:

_obj = None

def get_obj():
    global _obj

    if _obj is None:
        _obj = create_some_object()
    return _obj

i.e. lazy initialization of an object of some kind, with no parameters. You can find a few examples in the Django source code: https://github.com/django/django/search?q="%40lru_cache"&unscoped_q="%40lru_cache"

I’ve highlighted some examples below:

There are lots of other examples in the ecosystem outside of Django - it’s a convenient way to lazily initialize objects like Boto3 clients, “default” instances or even ML models.
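As a sketch of the pattern (create_expensive_client here is a hypothetical stand-in for whatever expensive setup is being deferred):

import functools

@functools.lru_cache(maxsize=None)
def get_client():
    # The body runs once on the first call; every later call
    # returns the cached object.
    return create_expensive_client()  # hypothetical expensive constructor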

Looking at the source code in _functoolsmodule.c I cannot see any special casing for this kind of function. An lru_cache on a function with no parameters is a bit of an oxymoron - there can only ever be one output and it cannot be dependent on any inputs.

I’d like to suggest that we add a special case to lru_cache() for callables that have no parameters, to avoid the current overhead of a PyDict lookup on every call and make it more akin to the code sample above - a single PyObject variable that’s stored on the first call and returned on subsequent calls.

If there is a generally agreeable reception to this proposal, I’d really like to give this a shot as my first contribution to CPython.

Can you show there’s enough of a performance improvement on subsequent calls to the functions to warrant the maintenance cost of special-casing for this?

[Tom Forbes]

_obj = None
def get_obj():
    global _obj
    if _obj is None:
        _obj = create_some_object()
    return _obj

i.e. lazy initialization of an object of some kind, with no parameters.

Eiffel uses the keyword “ONCE” for this. I would prefer a similar
approach, using a specialised decorator, rather than to special-case
lru_cache:

@once
def get_obj():
    obj = Something()
    obj.extras = None
    return obj

Here’s an untested implementation:

import functools

def once(func):
    obj = None

    @functools.wraps(func)  # keep func's name and docstring on the wrapper
    def inner():
        nonlocal obj
        if obj is None:  # caveat: a func() that returns None would be called again
            obj = func()
        return obj
    return inner

Agreed, a special @once decorator sounds like a better approach here. Although its behaviour is straightforward enough that I usually write it myself if I need it (I don’t even have a library with it in, I just rewrite it each time).

And to be honest, I’ve never heard of the idea of using functools.lru_cache for this before. Maybe it’s a trick that’s only common in django, which makes me think that django could expose a custom version of this if the performance improvement warrants it?

I see this trick in some large, inter-connected projects to reduce startup time (vs. calculate those values on import), and Django is one of them. This is not the only solution to the problem; pip uses classes and cached properties, for example, which makes me wonder whether it would be a better solution to make descriptors work on modules so we can just use @cached_property for this… but that’s significantly expanding the scope of the solution.
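As a rough sketch of that class-based style (using functools.cached_property, available since Python 3.8; make_session is a hypothetical stand-in):

import functools

class Context:
    @functools.cached_property
    def session(self):
        # Computed on first attribute access, then stored on the instance.
        return make_session()  # hypothetical expensive setup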

I’ve seen a few projects use the new module __getattr__ to implement cached properties.
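For example, a module-level __getattr__ (PEP 562, Python 3.7+) can lazily build an attribute and cache it in the module’s globals, so the hook only ever fires once per name (a sketch; _build_default is hypothetical):

# mymodule.py
def __getattr__(name):
    # Only called when normal module attribute lookup fails.
    if name == "default":
        value = _build_default()  # hypothetical expensive setup
        globals()[name] = value   # cache it; later lookups bypass __getattr__
        return value
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")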

Can you show there’s enough of a performance improvement on subsequent calls to the functions to warrant the maintenance cost of special-casing for this?

I believe I can, I’ll try and create a PoC this weekend. Looking at the implementation for lru_cache I also think it’s pretty simple to do - I was going to use the PyObject member that’s currently always a PyDict as the place to store the returned object.

I think most of the savings will be memory from dropping the PyDict, but the per-call hashing and dict lookups will also be removed.

And to be honest, I’ve never heard of the idea of using functools.lru_cache for this before. Maybe it’s a trick that’s only common in django, which makes me think that django could expose a custom version of this if the performance improvement warrants it?

This pattern is pretty prevalent, far more than just Django. If you browse through the first few pages of GitHub code search you can find quite a few.

@once might be semantically better, but it doesn’t exist in the stdlib, and people (myself included) have clearly cottoned on to the idea of using lru_cache as a simple replacement that needs no copy-pasting. Plus it’s faster (using the once implementation above).
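For reference, the benchmark functions are presumably defined along these lines (assuming the pure-Python once implementation quoted above is in scope):

import functools

@once  # the pure-Python implementation from above
def return_1_once():
    return 1

@functools.lru_cache(maxsize=None)
def return_1_lru_cache():
    return 1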

In [14]: %timeit return_1_once()
98.8 ns ± 4.51 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [15]: %timeit return_1_lru_cache()
56.7 ns ± 1.79 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Have you demonstrated those 40 nanoseconds mattered in your application, or is this another case of microoptimization obsession?


It undoubtedly doesn’t matter at all; I’m just guessing why people use lru_cache in the way they currently do. It’s a use case that people currently have, it should be simple to implement, and it has the upside of being even faster and more memory efficient.

If a patch really turns out to be simple enough then it seems like everybody wins.

I guess the main contention here is whether we should encourage people to use the lru_cache decorator like that, even implicitly?

If you can confirm that the maintenance cost of the special case is justified by the performance improvement, this should probably just be submitted as a bpo item and PR (treating it as a special-case optimisation, not trying to promote or emphasise it as a “call once” design pattern, or anything like that).

If you can’t justify the cost, I suspect it should just be dropped.

IMO no. By all means pursue this as an optimisation (if it proves justifiable), but don’t try to make this the “sanctioned” way to do call-once functions. If it’s to be “official”, it should be more discoverable/obvious (i.e., a @once decorator), rather than a side-effect of adding a LRU cache.

Ok, thank you for your guidance. I’ll try and submit a PR with this optimization and see how it goes.

I’m keen to also add a functools.once decorator; I’ll draft up a proposal and submit it to the python-ideas mailing list (or does this forum supersede that? I’m not clear on that, sorry).

[Tom Forbes]

@once might be semantically better, but it doesn’t exist in the stdlib,

That’s easy to fix, if there is consensus that it is a good thing. I think that this may be a small enough addition that it may not even need a PEP.

Plus it’s faster (using the once implementation above)

In [14]: %timeit return_1_once()
98.8 ns ± 4.51 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [15]: %timeit return_1_lru_cache()
56.7 ns ± 1.79 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

o_O

I can’t dispute your numbers, I get similar numbers myself. But … how? lru_cache does so much work on every call, how is it so fast? Enquiring minds want to know!

The wrapper is written in C.
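(On CPython, when the C accelerator is available, you can see this from the wrapper’s type:)

>>> import functools
>>> @functools.lru_cache(maxsize=None)
... def f():
...     return 1
...
>>> type(f)
<class 'functools._lru_cache_wrapper'>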

(Which implies that to get a useful level of speed, the wrapper applied by any proposed once decorator would need to be in C as well…)

@orf are you aware that this is going to be a C code change?

Yep, and I’m eager to try. I’ll start with the optimization, then take a stab at making a patch for once(). I’ll then post to the python-ideas mailing list and see if I can get a consensus on the PR, and see if it needs a PEP or not.


[Paul Moore]

(Which implies that to get a useful level of speed, the wrapper applied by any proposed once decorator would need to be in C as well…)

Sorry, it doesn’t imply that at all. As Serhiy asked, paraphrasing, are you sure that 40ns difference is a bottleneck in your application? It may be that the Python version is “fast enough”.

It’s not that I prefer the slower solution because it is slower, but the primary aim here for me is clarity.

Also, there ought to be a Python version, even if it is replaced with an accelerated version:

def once(): ...

try:
    from _functools import once
except ImportError:
    pass

for the benefit of other implementations.

My point is that a pure Python version won’t[1] be faster than using a C-accelerated lru_cache, and if once can’t out-perform lru_cache there’s no point (beyond naming[2], which can be covered by once=lru_cache…)

I totally agree that this discussion is all about a micro-optimisation that hasn’t yet been demonstrated to be worth the cost.

Correct, that’s just standard policy though (PEP 399).

[1] Conceded, that’s an unproven assertion. I think it’s likely, though :wink:
[2] I assume when you say “clarity” you mean “of the user’s code”, rather than “of the implementation”.


try/except NameError may be faster than an “is None” check (especially after implementing a zero-overhead try). I don’t know whether it will be faster than lru_cache().
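i.e. something along these lines (a sketch, reusing create_some_object from the example above):

def get_obj():
    global _obj
    try:
        return _obj  # raises NameError until the first assignment below
    except NameError:
        _obj = create_some_object()
        return _obj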

I didn’t want to bump the thread unnecessarily, but it occurred to me that thread safety is also a bonus with lru_cache vs the pure-Python implementations above. These work in a single-threaded environment, but the moment concurrent threads might call the function you can end up with your “once” function being called two or more times.

So the “simple” implementations above are not equivalent, and the equivalent with locking is obviously not something nice to repeatedly write.
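A thread-safe variant would look something like this sketch, double-checking a flag so the lock only matters until the first call completes:

import functools
import threading

def once(func):
    lock = threading.Lock()
    done = False
    result = None

    @functools.wraps(func)
    def inner():
        nonlocal done, result
        if not done:            # unsynchronized fast path
            with lock:
                if not done:    # re-check under the lock
                    result = func()
                    done = True
        return result
    return inner

(A separate done flag also sidesteps the is None check, so a function that legitimately returns None is still only called once.)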


It is easier to implement the variant with locking in Python than in C. There is no special C API for reentrant locks.

lru_cache() does not guarantee that the function will be called only once with the same argument. It only guarantees that the internal structure will not be broken and there will be no leaks or crashes even if it is called multiple times.
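This is easy to observe; on a typical build this sketch usually prints a number greater than 1:

import functools
import threading
import time

calls = 0

@functools.lru_cache(maxsize=None)
def slow():
    global calls
    calls += 1
    time.sleep(0.1)  # widen the window in which threads can race
    return object()

threads = [threading.Thread(target=slow) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(calls)  # often > 1: the first call isn't serialized by lru_cache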


I worked on a simple implementation and found that the speed benefits are probably not worth the changes, especially given that using lru_cache like this isn’t semantically correct.

I’ve started a thread on the python-ideas mailing list about adding functools.once:

https://mail.python.org/archives/list/python-ideas@python.org/thread/5OR3LJO7LOL6SC4OOGKFIVNNH4KADBPG/

Thank you to everyone who chipped in here :slight_smile: