Allow defaultdict factory function to optionally accept input key for dynamic defaults

I propose that we support the following defaultdict behavior:

xs = defaultdict(lambda x: x)
xs[5] # => 5
xs[5] = 27
xs[5] # => 27

This would allow more dynamic defaults.

A concrete (but simple) use-case would be when building a tree for a Union-Find data structure, where nodes have parent pointers, and roots point to themselves.
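
For illustration, here is roughly what find looks like today with a plain dict (names are my own); the proposed factory would let the "a root points to itself" default come straight from the lookup:

parent = {}

def find(x):
    parent.setdefault(x, x)              # today: roots must be inserted by hand
    while parent[x] != x:
        parent[x] = parent[parent[x]]    # path compression
        x = parent[x]
    return x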

Admittedly, it isn’t difficult to first check whether the key exists and, if it doesn’t, insert the key as its own value. But this shorthand would be intuitive, consistent with other Python functionality (e.g. sorting with a key function), and offer benefits over defaultdict similar to those defaultdict offers over a plain dict.

Note that although the current defaultdict behavior could be emulated with a one-argument factory (e.g. lambda x: 0), the implementation should keep supporting zero-argument factories so as not to break existing code.

Some ambiguity arises with functions that take an optional argument, such as int(). Again, we should prefer not to break existing code, so we could attempt to call the factory without arguments first, and if that raises a TypeError complaining about a missing required argument, call it again with the key. Perhaps there is a better way to do this using introspection, though.
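
If introspection turns out to be preferable, a rough sketch of the idea might look like this (make_default_caller is a made-up helper name, purely illustrative, not a worked-out design):

import inspect

def make_default_caller(factory):
    # Decide once, up front, whether the factory requires a positional argument.
    try:
        params = inspect.signature(factory).parameters.values()
    except (TypeError, ValueError):
        # Some builtins don't expose a signature; keep the current behaviour.
        return lambda key: factory()
    wants_key = any(
        p.kind in (p.POSITIONAL_ONLY, p.POSITIONAL_OR_KEYWORD)
        and p.default is p.empty
        for p in params
    )
    return (lambda key: factory(key)) if wants_key else (lambda key: factory())

make_default_caller(lambda: 0)('spam')    # => 0, factory called with no arguments
make_default_caller(lambda k: k)('spam')  # => 'spam', key passed through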

I would be happy to implement this and submit it as a PR (with tests), but I wanted some guidance on:

  1. Whether this change would have a chance of being accepted.
  2. Whether I need to submit a PEP.
  3. Any other process I should be aware of.

Thank you.


I think the strategy of first trying the factory with zero arguments sounds fragile. In particular, what happens in a case like this?

def factory(*args):
    print("Factory called")
    return factory_helper(*args)

def factory_helper(arg):
    ...

xs = defaultdict(factory)
xs['foo']

This will print “Factory called” twice, because the TypeError doesn’t get raised until factory_helper is called. That sounds potentially surprising and error-prone.


(Obviously, defaultdict(dynamic_factory=...) would work and not pose this problem.)


Good example. I think using introspection could avoid having the outer function called twice, but it would still mean assuming that the coder more often wants zero arguments rather than one, with the only way to specify otherwise being to write a function that doesn’t use varargs, which is also surprising and unclean.

(Obviously, defaultdict(dynamic_factory=…) would work and not pose this problem.)

Yeah, I think that is the way to do it, although defaultdict passes keyword arguments on to the dict constructor, so this might break someone’s code if, for some reason, they wanted a dynamic_factory entry in their defaultdict.
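
For example, today the keyword just becomes an ordinary entry:

from collections import defaultdict

d = defaultdict(int, dynamic_factory=3)
d['dynamic_factory']   # => 3: the keyword argument was passed through to dict
d['missing']           # => 0: int is still the default_factory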

Maybe it’d be better to have a new class altogether… But that feels a bit heavy for what I hoped would be a small change.

Maybe it’s better to table this until the next major Python revision.

Are you aware that the dictionary itself can already do this?

class SelfDict(dict):
    def __missing__(self, key):
        return key

xs = SelfDict()

Behaviour will be precisely as you describe.
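
(If you also want the missing key stored on first access, the way defaultdict stores what its factory produces, __missing__ can assign before returning:)

class SelfInsertingDict(dict):
    def __missing__(self, key):
        self[key] = key    # insert the default, as defaultdict would
        return key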


This is basically a non-starter. As a general rule, if you think something breaks backward compatibility badly enough to delay it until the “next major Python version”, that means that it breaks compatibility so badly that it’s never going to be viable. A completely new class would be far FAR simpler, and could be done in any feature release.

(But it’s not necessary, since dict.__missing__ does the job.)


Oh wow, thank you, I wasn’t aware of that at all. Cool!


I just hit a point where defaultdict(dynamic_factory=...) would have been nice.

Subclassing from dict is not worth it for my application. Something like

if key in d:
    return d[key]
else:
    value = ...  # construct value from key in 3 lines
    d[key] = value
    return value

introduces less complexity.

I’ll live either way. But it would have been nice.

@petercordia Looks like a job for functools.cache.
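
Something along these lines (value_for and its body are just placeholders for whatever you’re building):

from functools import cache

@cache
def value_for(key):
    # stand-in for the real construction from the key
    value = [key]
    return value

value_for('spam')   # constructed the first time
value_for('spam')   # same object, served from the cache afterwards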

Thanks for the advice @Stefan2!
Is it ok though to use @cache for functions that return mutable objects that you intend to mutate? It feels like that would result in code that’s hard to read?
I’ve only used @cache to optimise for performance, but never in ways that actually change the output of a program.
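
For instance (a made-up example), the cached object is shared across calls, so every caller sees the mutation:

from functools import cache

@cache
def bucket(key):
    return []              # built only on the first call for this key

bucket('a').append(1)
print(bucket('a'))         # [1]: later lookups see the earlier mutation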