The __missing__() method (if exists) is called in dict.__getitem__() if the key was not found in the dict, and its result is returned. It is not defined in the base dict, but defaultdict defines defaultdict.__missing__() which call default_factory and also sets the result as the value for the key, so following subscribing operations will not call __missing__(), but will return the same value for the same key. It is commonly used for grouping values:
d = defaultdict(list)
...
d[key].append(value)
All is fine until we try to use d[key] concurently. Then it is possible that different d[key] will return different values for the same key.
We can fix the race condition by using setdefault() instead of __setitem__(). This will guarantee that d[key] will always return the same value for the same key. This was considered a nice bugfix, so this change was backported to 3.13 and 3.14.
Update the documentation to match the new behavior in 3.13+.
Revert the change in 3.13-3.14 (note that it was already released in bugfix releases), document the new behavior as new in 3.15, and maybe add some hints about concurrent unsafety in the 3.13-3.14 documentation.
Revert the change in all branches. Let the user use external means to ensure atomicity.
I vote for: “Revert the change in all branches. Let the user use external means to ensure atomicity.”
In the bug report, you note, “It seems that no one thought that __missing__() could be used directly.” However, there is real code that calls __missing__ directly. ChainMap is one example. And the person who filed with bug report has another example. Both rely on long standing, documented behavior that __missing__ can be called directly. Presumably there are also defaultdict` subclasses that do this well.
The defaultdict implementation should match the pure python versions of it found on the web:
class DefaultDict(dict):
def __init__(self, default_factory=None, *args, **kwargs):
super().__init__(*args, **kwargs)
# default_factory must be a callable or None
self.default_factory = default_factory
def __missing__(self, key):
if self.default_factory is None:
raise KeyError(key)
# Call the factory, store the result, and return it
value = self.default_factory()
self[key] = value
return value
def __repr__(self):
return f"DefaultDict({self.default_factory}, {super().__repr__()})"
Note that this pure python version that people expect is also what pypy does:
So, I think the edit needs to be reverted. The old code is what people expect. It Is what was documented. It is what pypy does. It wasn’t a bug. It, like many multistep operations, was just inconveniently non-atomic.
Since the change could silently break code that depends on the old behavior, it would be prudent to revert it from all branches. defaultdict can be made fully thread-safe without changing the behavior of __missing__, by explicitly acquiring a lock instead of calling setdefault.
I lean toward reverting also, although it’s unfortunate that the change has made it to bug fix releases. Definitely seek RM opinion/approval before making a change.
How do you see that documentation as saying that __missing__ can be called directly? To me it just says that the key lookup operation on dict calls it. It doesn’t say it can be called directly, nor that it cannot.
Despite that I agree that the change should be reverted in all branches, since I think in general a method that is documented should be callable. (Maybe that’s what was meant in the quote above by saying it’s documented?)
The documentation is silent on whether __missing__ is threadsafe. Given ambiguity on whether it can be called directly and ambiguity on whether it is threadsafe, I think the right move is to have a PEP to fully specify the desired behavior and only then implement any change (to docs or to behavior). Allowing “backdoor” changes like this, where we update documentation to reflect an unannounced behavior change, damages users’ confidence in Python’s stability. Explicit behavior changes are better than implicit ones.
Does anyone have any actual use that is broken by this change?
Atomicity of normal usage seems more useful to me than making any promises about what happens calling __missing__ explicitly. You don’t have to call __missing__ when you can just call the factory function directly if that is what you want.
I think it’s huge overkill to require a PEP for something this obscure. Do you genuinely imagine that any participants in a thread for such a PEP would actually have real-world experience of using this feature? It would be a clear case of “design by committee”, and wouldn’t (IMO) improve the outcome in any measurable way. Using the term “backdoor” for a change made by the normal PR process seems very extreme as well - it’s quite literally a core dev’s job to make decisions like this as part of implementing a PR.
In this case, the decision seems to have caused some problems. It’s a shame these didn’t get caught before release, and even more so that it got into bugfix releases, but mistakes happen.
I don’t have a strong opinion as to whether it’s worth changing the documented behaviour to allow atomicity. Depending on the decision about that I think one of “revert the change in 3.13-3.14” or “revert in all branches” should apply. Whichever happens, the choice should be made deliberately and documented properly, but I don’t think we should over-react and require anything more than a PR to do so.
While I said I don’t have a strong opinion, this argument does make sense to me, so count me as weakly in favour of this option.
I could be persuaded that we don’t need a PEP, but we at least need this kind of public discussion to see if we need a PEP. That is, we need the kind of discussion we’re having now, and I appreciate that we’re having it.
We won’t know unless we have the discussion somewhere more public than on the PR.
Maybe, but we won’t know unless we have the discussion somewhere more public than on the PR.
What I meant by “backdoor” is no so much the PR process as the sequence of events that goes:
seemingly small and/or beneficial change made
unexpected consequence discovered (change in behavior, divergence from documentation, etc,)
the change in step 1 is considered “status quo” because it has already been made
instead of undoing the change, we retcon the docs to specify the behavior that was accidentally implemented
In short, changes in behavior, and especially changes that tread close to lines of ambiguity in docs, should always be made with full awareness, not just left in because we didn’t notice the problem at the time the change was made. If something happens and we later realize it contradicts the docs, the default response should be to roll it back, and then maybe later we can think if we want to change the docs to clarify, and then maybe later reimplement the change deliberately in that context.
That said, I do think that some stuff happens in the normal PR process that should get more public discussion and in some cases a PEP. We saw this for instance with the ~bool deprecation change. In my view that change should absolutely have required a PEP.
Basically what I’m saying is that we shouldn’t let the fact that it got into a release cloud our thinking about what the right answer to the question is. If a decision was made by a few people on a PR and it turns out that decision had broader consequences and needs consideration of those consequences, the default response should be to roll back the change until we can make a fully informed decision about what behavior we actually want.
I get what you’re saying about “design by committee”, but I just feel that, because Python is so widely used, even seemingly tiny changes can wind up having unforeseen impacts. So I think the bar for some kind of discussion more public than a PR should be pretty low. In other words we can’t avoid having more design done by committee than maybe was the case in the past, because the stakes are too high to just go around changing stuff.
In general, a PEP is meant to summarize all the various points made in this sort of a discussion. So “we need a PEP” should be a response to “there’s a ton of discussion, it’s going all over the place, we need to gather the different thoughts into a coherent document”. When MOST people say “we need a PEP”, it should be interpreted instead as “we need a lengthy public discussion”, which is usually what is actually needed in those situations.
At least for bug fixes versions, no behavior changes should be made.
As for 3.15, I don’t think this would be necessary. This kind of data race is actually one writes and another reads/writes. Even if we guard this __missing__ to be concurrent safe, we still leave many other operations unsafe.
My 2p personally - if a user wants to call setdefault, they should do so explicitly. And if they do that, then beyond nicer syntax and ergonomics, they don’t really need a defaultdict at all.
Does anyone have any actual use that is broken by this change?
I have actually tried to write code overriding defaultdict.__missing__, and calling the original with super. I had to switch out .default_factory for my purpose (selecting a particular factory depending on the key) and rapidly gave up. Simply overriding __getitem__ on a regular dict instead was an order of magnitude less hassle.
That does probably mean very few people have ever done it. But I would say it also means the code written by those who have done it, is highly likely to be easily broken.
class MultiFactoryDefaultDict(defaultdict):
factories = {'a': list
,'b': int
}
def __missing__(self, key):
tmp = self.default_factory
self.default_factory = self.factories.get(key, tmp)
retval = super().__missing__(key)
self.default_factory = tmp
return retval
After taking another look, I withdraw this claim. It’s impossible to ensure that __missing__ only sets an item once while at the same time continuing to support it being (mis)used for the purpose of resetting an item to a default value. Consequently, I agree that the behavior should be changed in 3.15, so that defaultdict can become fully thread-safe.
Edit: By the way, the changelog entry about gh-142495 seems wrong. It mentions __setitem__() instead of setdefault().
I am very confused by this example. The factory function is supposed to look at the key and return a value. For some reason though you want to look at the key and install a different factory function. Why not just have the factory function do that?
def factory(key):
if key == 'a':
return []
elif ...
That makes sense to me. The defaultdict class is a subclas of dict that overrides the __missing__ method. You are subclassing it but then overriding the only method that it has that is different from dict. I don’t see why you would want to subclass defaultdict rather than dict in that case. It is not hard to make your own dict subclass that does what defaultdict does (or does it differently) and works for your situtation.
The 90% case for me in using defaultdict is this:
d = defaultdict(list)
for k, v in stuff:
d[k].append(v)
My understanding is that the PR with the bugfix that initiated this discussion made that usage thread-safe. What I mean is that you could share d across threads and have multiple threads do d[k].append(v) without any race conditions that would prevent all the values v from being stored in the correct lists. To me that seems like a useful change that makes the intended functionality of defaultdict usable when shared in multithreaded code.
Somehow some people commenting in this thread are suggesting to revert that change though. I can see the argument for reverting in maintenance branches but the change for future Python versions seems like a clear win to me.
I’m sure you’re right, and it’s not that clear. But that was kind of the point. This is an example of attempted code using defaultdict.__missing__, that I gave up on. I copied it from a docstring I wrote, to warn myself not to try this ever again, not from code I ever used.
Using dict.setdefault would’ve been a much better alternative too.
Maybe someone else utilises defaultdict.__missing__ with success, but I highly recommend against it.
It occurs to me today that there is an alternative solution: defaultdict could gain its own __getitem__ implementation and deprecate its __missing__ implementation. This should eliminate the risk of breaking existing code while still making defaultdict safe to use with free threading.
Something like this:
PyDoc_STRVAR(defdict_getitem_doc,
"__getitem__(key) # pseudo-code:\n\
if (value := self.get(key, NoValue)) is not NoValue: return value\n\
if self.default_factory: value = self.default_factory()\n\
elif self.__missing__: value = self.__missing__(key)\n\
else: raise KeyError(key)\n\
return self.setdefault(key, value)\n\
");
static PyObject *
defdict_getitem(PyObject *op, PyObject *key)
{
PyDictObject *mp = (PyDictObject *)op;
Py_ssize_t ix;
Py_hash_t hash;
PyObject *value;
hash = _PyObject_HashFast(key);
if (hash == -1) {
dict_unhashable_type(op, key);
return NULL;
}
Py_BEGIN_CRITICAL_SECTION(op);
ix = _Py_dict_lookup(mp, key, hash, value);
if (ix == DKIX_ERROR)
return NULL;
if (ix == DKIX_EMPTY || value == NULL) {
/* Try to call self.default_factory() */
defdictobject *dd = defdictobject_CAST(op);
PyObject *factory = dd->default_factory;
if (factory != NULL && factory != Py_None) {
value = _PyObject_CallNoArgs(factory);
} else {
/* Try to call self.__missing__(key) */
PyObject *missing;
missing = _PyObject_LookupSpecial(op, &_Py_ID(__missing__));
if (missing == NULL || missing == Py_None) {
_PyErr_SetKeyError(key);
return NULL;
}
value = PyObject_CallOneArg(missing, key);
Py_DECREF(missing);
}
if (value == NULL)
return NULL;
int ret = _PyDict_SetItem_KnownHash_LockHeld(mp, key, value, hash);
if (ret < 0)
return NULL;
}
Py_END_CRITICAL_SECTION();
return value;
}
Edit: For this to be even more backward compatible, __missing__ would have to take precedence over default_factory.
Edit 2: This solution has the advantage of calling default_factory or __missing__ only once when a key is missing.
Yeah, the description text talks how it gets called but also totally starts with what it does and together with “support the following methods” I don’t see how it’s ambiguous that you may call it too?
[As a user, if it were a private implementation detail, I’d expect to not even see the method explained, just main text of defaultdict explaining how it calls the factory and inserts the result. In this particular case, the interaction of methods is complex enough that such a black-box doc would be a bad idea.]
Re subclassing, I’ve attempted this too in the past and gave up. For me the mental wart was this:
The hallmark of defaultdict is that it inserts the new value into the dict. This is what makes chained mutation like d[k].append(v) viable! This works when you defaultdict(my_factory).
This property is lost if you subclass and override __missing__ (unless you insert yourself).
EDIT: Not entirely lost, I see the kludge Oscar disliked does retain it by calling super().__missing__(key). It is a natural direction if you come from “I want something like defaultdict just key-dependent” angle, you see factory doesn’t see the key, while __missing__ does, it’s just the interfaces are a bad fit…
On a careful reading, I now understand there are 2 contracts, deliberately separate:
dict allows subclasses to implement __missing__ method, result is ephemeral.
defaultdict implementation of (1) allows instances to provide a factory function, result saved.
=> If you override __missing__, better subclass dict directly.
=> Overriding __getitem__ may be mentally simpler than __missing__ (though the latter has better performance).
@Changaco’s idea to add thread safety to defaultdict.__getitem__ is creative but note it introduces a new variant to the __missing__ contract , and opens up future complexities of subclassing defaultdict vs. dict.
I also remember trying to subclass defaultdict before realizing that __missing__ is actually implemented in the dict base class. I think it’s a common pitfall. Making defaultdict compatible with free threading could be a good opportunity to fix this.
It seems to me that my solution could also be described as deprecating a variant of the __missing__ contract, leaving only the one in which the method is merely expected to return a default value for a specific key.
The alternative would be to drop defaultdict’s support for custom __missing__ methods. This could break existing code, but a proper exception could be raised (early if __missing__ is in the subclass definition, late if it’s added to the subclass after creation).
if a user wants to call setdefault, they should do so explicitly. And if they do that, then beyond nicer syntax and ergonomics, they don’t really need a defaultdict at all.
The flaw with setdefault is that it requires you to unconditionally construct the default every call. One of the main selling points of defaultdict, besides getting setdefault-like behavior from standard bracketed access, is that the default is constructed lazily (if access 10 unique keys 1000 times total, you only construct the default object 10 times; do it with setdefault and you construct it 1000 times).
I agree that the docs currently, explicitly, say:
If default_factory is not None, it is called without arguments to provide a default value for the given key, this value is inserted in the dictionary for the key, and returned.
which implies that a direct call to __missing__ resets to a fresh default, but I think the solution is tweaking the documentation and adding a “changed in 3.1x” note about it for the extremely rare cases where __missing__ is being called directly (we document what __missing__does, but a definition that makes it do anything when nothing is actually missing seems bizarre).
People who wanted this resetting behavior can implement it directly by doing something like: defdict[key] = val = defdict.default_factory(), then doing whatever they like with val (alias of the newly assigned value in defdict) afterwards. setdefault has long suffered from intermittent problems with atomicity that have always been considered bugs to be fixed, I’d treat equivalent atomicity problems in defaultdict.__missing__ the same.
I’d revert in bugfix releases, but change and document the new behavior in the new releases.
I agree with this approach. The new behavior is more useful and I’d argue in the spirit of what __missing__() is intended for. I don’t think you should be calling the method if the value is not missing.