Currently, when you call `asdict` or `astuple` on a dataclass, anything it contains that isn't another dataclass, a list, a dict, or a tuple/namedtuple gets handed to `deepcopy`. Whether that is desirable doesn't really matter here: changing it now would probably break things, and is not my goal.
There are a number of basic types for which `deepcopy(obj) is obj` is `True`. In particular, this covers most of the non-container types that the JSON encoder supports by default (though I don't think it includes enums).
Given how the `copy` module handles these types, this means, for example:
```python
from copy import deepcopy

x = 10**40  # Large enough to avoid interning
y = 10**40
print(x is y)            # False
print(deepcopy(x) is x)  # True

x = "a" * 5000  # Long enough to avoid interning
y = "a" * 5000
print(x is y)            # False
print(deepcopy(x) is x)  # True
```
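The same identity behaviour holds across the other atomic types; a quick check (sample values chosen arbitrarily here):

```python
from copy import deepcopy

# A few of the types that copy.deepcopy treats as atomic:
# it returns the original object rather than building a new one.
samples = [None, 10**40, 1.5, True, 1 + 2j, b"bytes", "a" * 5000]

for value in samples:
    assert deepcopy(value) is value

print("deepcopy returned the identical object for every sample")
```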
To avoid the `deepcopy` overhead, I propose that the function first check whether the object's type is one of these, and if so, return the original object (as `deepcopy` would anyway).
Adding:
```python
# Everything that uses _deepcopy_atomic directly.
# This list could be reduced to the most common types if necessary.
_ATOMIC_TYPES = {
    types.NoneType,
    types.EllipsisType,
    types.NotImplementedType,
    int,
    float,
    bool,
    complex,
    bytes,
    str,
    types.CodeType,
    type,
    range,
    types.BuiltinFunctionType,
    types.FunctionType,
    # weakref.ref,  # weakref is not currently imported by dataclasses directly
    property,
}
```
to the module, and

```python
    if type(obj) in _ATOMIC_TYPES:
        return obj
```

to the start of `_asdict_inner` and `_astuple_inner`.
Checking that the type of the object is exactly one of those listed, rather than a subclass, matches the behaviour of `deepcopy`, so this shouldn't change anything about the output of `asdict`.
Originally I was going to suggest putting the check at the end, but it turns out you get a much more noticeable benefit from checking these types before all of the other conditions (it halved the time taken in the best case).
On a 'nice' example where everything the dataclass contains is one of these types, this change makes `asdict` significantly faster than the current implementation. In a mixed example containing various lists/dicts/dataclasses/ints/strs it was also faster, but not by as much. As much as I'd love to say it is always faster, in the worst case, where everything has to be deepcopied and nothing is a basic type, it is very slightly slower due to the extra condition.
Results (Python 3.11.2):

| Case | current | new | new / current |
|---|---|---|---|
| Best case `asdict` | 6.34 s | 2.63 s | 41% |
| Best case `astuple` | 5.67 s | 2.13 s | 38% |
| Worst case `asdict` | 3.01 s | 3.05 s | 101% |
| Worst case `astuple` | 2.96 s | 3.01 s | 102% |
| Mixed case `asdict` | 2.74 s | 1.56 s | 57% |
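For reference, a micro-benchmark along these lines can reproduce the shape of the comparison (this is not the harness used for the numbers above; the class and workload here are illustrative):

```python
import dataclasses
import timeit

# Hypothetical best-case workload: every field is an "atomic" type,
# so the proposed fast path would apply to all of them.
@dataclasses.dataclass
class Record:
    a: int
    b: float
    c: str
    d: bool

records = [Record(i, i * 0.5, str(i), bool(i % 2)) for i in range(1000)]

def run():
    for r in records:
        dataclasses.asdict(r)

# Time the current implementation; the proposed change would be
# measured the same way against a patched dataclasses module.
elapsed = timeit.timeit(run, number=10)
print(f"asdict best case: {elapsed:.3f}s for 10 x 1000 records")
```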
I'd love some real-world benchmarks to use for comparison, but I don't know of any I could use. (The mixed case was based on the code `orjson` uses to claim it is 40-50x as fast for dataclasses.)
Given that JSON serialization appears to have been the intended use case of `asdict`, and that the set of types left unchanged includes most of the natively serializable ones, is this a reasonable performance trade? Do I need more benchmarks first?
Note: I know there are other things that could improve performance to a greater degree but those have other caveats or things to consider. I also know other changes have been discussed before.