The short, short version
- Dataclass creation is slow; this is a recurring problem in the stdlib
- They are slow because they exec source code templates to create every method
- It turns out a lot of methods are never actually used (some numbers are given later)
- Let’s not generate those methods!
- Here’s a fork with lazy dataclass methods implemented as a demo
- This makes constructing a dataclass ~85% faster if it is completely unused (common)
  - ~60% faster if only __init__ is used (most common)
  - But ~20% slower in the worst case where every method is used on unfrozen classes (I’m sure there’s one out there but I’m yet to find a real example of this)
- Some things that use dataclasses also indicate this is faster
  - The pyrepl test suite is faster by ~10%
  - The dataclasses test suite is faster by ~30%
- I would like more real world examples of applications where start time is noticeable that make heavy use of dataclasses
  - Pyperformance didn’t appear to have anything that heavily exercises this as I didn’t note any significant change when running through that suite
- There are other techniques we could use to cut down on code generation that will make up for this performance penalty, but these come with increased maintenance burden.
In Detail
Dataclasses are nice to use but slow to construct and can have a significant import performance penalty on modules that use them.
The performance hit has been noted multiple times in the past:
- configparser was going to use dataclasses until it was noted that this tripled the import time
- _colorize does use dataclasses and is noted for making any module that relies on it significantly slower
- Sympy removed dataclasses earlier this year due to import time performance
- This was also discussed in the committers category back in 2022
Part of this slowness was down to the other modules dataclasses imported, such as inspect, and that has been improved for 3.15 with lazy imports. There is still significant overhead in the class construction itself, though, which can be seen in modules like _colorize that use a lot of dataclasses (each additional colour theme adds a new frozen dataclass, which makes the overall module load time slower).
So in the same spirit as the new lazy imports, let’s make them faster by making them do less.
Why are dataclasses slow?
While some time is spent analysing annotations to decide how to construct the class, most of the construction time of dataclasses is spent in exec, running a generated template to create special methods.
For a class like this:
@dataclass(order=True, frozen=True, kw_only=True)
class Example:
    a: int = 42
    b: str = "Dent"
This is the source code generated (yes, it really does use 1-space indents):
Generated source code
def __create_fn__(__dataclass_HAS_DEFAULT_FACTORY__,__dataclass_builtins_object__,__dataclass_dflt_a__,__dataclass_dflt_b__,__dataclasses_recursive_repr,__class__,FrozenInstanceError):
 def __init__(self,*,a=__dataclass_dflt_a__,b=__dataclass_dflt_b__):
  __dataclass_builtins_object__.__setattr__(self,'a',a)
  __dataclass_builtins_object__.__setattr__(self,'b',b)
 @__dataclasses_recursive_repr()
 def __repr__(self):
  return f"{self.__class__.__qualname__}(a={self.a!r}, b={self.b!r})"
 def __eq__(self,other):
  if self is other:
   return True
  if other.__class__ is self.__class__:
   return self.a==other.a and self.b==other.b
  return NotImplemented
 def __lt__(self,other):
  if other.__class__ is self.__class__:
   return (self.a,self.b,)<(other.a,other.b,)
  return NotImplemented
 def __le__(self,other):
  if other.__class__ is self.__class__:
   return (self.a,self.b,)<=(other.a,other.b,)
  return NotImplemented
 def __gt__(self,other):
  if other.__class__ is self.__class__:
   return (self.a,self.b,)>(other.a,other.b,)
  return NotImplemented
 def __ge__(self,other):
  if other.__class__ is self.__class__:
   return (self.a,self.b,)>=(other.a,other.b,)
  return NotImplemented
 def __setattr__(self,name,value):
  if type(self) is __class__ or name in {'a', 'b'}:
   raise FrozenInstanceError(f"cannot assign to field {name!r}")
  super(__class__, self).__setattr__(name, value)
 def __delattr__(self,name):
  if type(self) is __class__ or name in {'a', 'b'}:
   raise FrozenInstanceError(f"cannot delete field {name!r}")
  super(__class__, self).__delattr__(name)
 def __hash__(self):
  return hash((self.a,self.b,))
 return (__init__,__repr__,__eq__,__lt__,__le__,__gt__,__ge__,__setattr__,__delattr__,__hash__,)
While this is all executed in 1 exec call, each additional method makes the class construction slower, so an ordered or frozen dataclass takes longer to create than a basic one.
Most of these methods will also be unused at runtime:
Evidence of unused methods
I added a hook to dataclasses that counts how many dataclasses are created, how many of those are frozen and/or ordered, and how many of each method are constructed. This is printed when Python exits. Class types that are not created and methods that are never generated are not listed.
This is in its own branch for lazy dataclasses if you are curious
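For anyone who wants a rough version of the class counts on a stock interpreter, here is a minimal sketch. It is not the hook from the branch: it wraps the private dataclasses._process_class function and reads the undocumented __dataclass_params__ attribute, both of which are implementation details that may change, and the names counts and report are made up for this example. It can only count classes, since counting which methods are actually generated needs the lazy methods from the fork.

```python
import dataclasses
from collections import Counter

counts = Counter()  # hypothetical name, not part of dataclasses

# Assumption: dataclasses.dataclass() looks up the private _process_class
# helper as a module global, so replacing it intercepts class creation.
_original_process_class = dataclasses._process_class


def _counting_process_class(cls, *args, **kwargs):
    result = _original_process_class(cls, *args, **kwargs)
    # __dataclass_params__ is an undocumented attribute holding the
    # decorator options for the class.
    params = result.__dataclass_params__
    counts["Classes Created"] += 1
    if params.frozen:
        counts["Frozen"] += 1
    if params.order:
        counts["Ordered"] += 1
    return result


dataclasses._process_class = _counting_process_class


def report():
    for key, value in counts.items():
        print(f"{key}: {value}")


@dataclasses.dataclass(frozen=True, order=True)
class Demo:
    a: int = 42


@dataclasses.dataclass
class Plain:
    b: str = "Dent"


report()
```

Registering report with atexit would mimic printing the totals when the interpreter exits. The patch has to be installed before the modules being measured are imported.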
./python -m test test_dataclasses
This has to test all of the methods, but even so there’s a pretty clear indication of which one is more important than all of the others.
Classes Created: 608
Frozen: 100
Ordered: 11
__init__: 377
__repr__: 20
__eq__: 33
__setattr__: 14
__delattr__: 8
__hash__: 22
__lt__: 4
__le__: 5
__gt__: 4
__ge__: 4
Opening and exiting the REPL
Classes Created: 43
Frozen: 33
__init__: 37
__eq__: 3
__hash__: 1
./python -m test test_pyrepl
Classes Created: 222
Frozen: 40
__init__: 207
__eq__: 4
__hash__: 1
python -m pip list
Classes Created: 29
Frozen: 15
__init__: 3
python -m pip install -e . --group dev (on my Reannotate library)
Classes Created: 45
Frozen: 25
__init__: 20
__eq__: 1
__hash__: 1
python -m pytest (on Reannotate, which does not use dataclasses itself)
Classes Created: 75
Frozen: 34
Ordered: 1
__init__: 28
black --check Lib/dataclasses.py
Classes Created: 23
Frozen: 8
__init__: 12
pylint Lib/dataclasses.py (An actual use of frozen __setattr__ via isort!)
Classes Created: 4
Frozen: 2
__init__: 1
__setattr__: 1
poetry sync (on textual)
Classes Created: 54
Frozen: 17
Ordered: 4
__init__: 20
__eq__: 3
__hash__: 3
__lt__: 3
__gt__: 2
__ge__: 1
python -m textual (Textual demo)
Classes Created: 72
Frozen: 6
__init__: 36
__eq__: 1
python runtests.py (Django)
This required editing the logger to write to a file instead of stdout and also runs in parallel so gave multiple logs. This is truncated to only show the last process that exited (in all of the logs, only __init__ was used).
Classes Created: 18
Frozen: 18
__init__: 15
Sum excluding the dataclasses.py tests:
Classes Created: 585
Frozen: 198
Ordered: 5
__init__: 379
__repr__: 0
__eq__: 12
__setattr__: 1
__delattr__: 0
__hash__: 6
__lt__: 3
__le__: 0
__gt__: 2
__ge__: 1
How could we make this faster?
Avoiding exec based codegen entirely where possible
It turns out __setattr__ and __delattr__ for frozen dataclasses are essentially the same function for all non-empty[1] dataclasses, only differing by two values fixed at class creation. We can just create those directly without needing codegen.
Try for yourself
from dataclasses import dataclass
@dataclass(frozen=True)
class A:
    a: int
    b: int
@dataclass(frozen=True, slots=True)
class B:
    c: str
    d: str
    e: str
    f: str
assert A.__setattr__.__code__.co_code == B.__setattr__.__code__.co_code
assert A.__delattr__.__code__.co_code == B.__delattr__.__code__.co_code
If you want to get technical, the new function generation replaces a LOAD_CONST with a LOAD_DEREF in the resulting method. I’ve been unable to measure any performance difference resulting from this change. Creating the functions this way is significantly faster as will be seen in the frozen class construction comparisons later.
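A minimal sketch of what building these two methods without exec could look like, assuming a closure over the class and its field names. The helper name make_frozen_methods and the Point class are invented for this example, and this is not necessarily how the fork does it.

```python
from dataclasses import FrozenInstanceError


def make_frozen_methods(cls, field_names):
    # The only per-class values are the class itself and its field names,
    # so both can live in a closure instead of a generated source template.
    fields = frozenset(field_names)

    def __setattr__(self, name, value):
        if type(self) is cls or name in fields:
            raise FrozenInstanceError(f"cannot assign to field {name!r}")
        super(cls, self).__setattr__(name, value)

    def __delattr__(self, name):
        if type(self) is cls or name in fields:
            raise FrozenInstanceError(f"cannot delete field {name!r}")
        super(cls, self).__delattr__(name)

    for method in (__setattr__, __delattr__):
        method.__qualname__ = f"{cls.__qualname__}.{method.__name__}"
    return __setattr__, __delattr__


class Point:
    def __init__(self, x, y):
        # Bypass the frozen __setattr__ in the same way the generated
        # dataclass __init__ does.
        object.__setattr__(self, "x", x)
        object.__setattr__(self, "y", y)


Point.__setattr__, Point.__delattr__ = make_frozen_methods(Point, ("x", "y"))
```

Every class gets functions that share a single code object, so nothing has to be compiled per class.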
Lazy Generation
Much like how lazy imports save time by not doing work that isn’t needed, we can achieve the same thing here by lazily generating the methods on first use.
To implement this, each method is created by a separate function that takes the method name and the class being prepared as arguments and returns the corresponding function. Most of the actual source template creation logic in dataclasses is unchanged.
These functions are then wrapped by a non-data descriptor which generates the method, replaces itself on the class with the method and returns the newly generated method.
The descriptor looks like this:
class _AutoMethod:
    # A non-data descriptor to autogenerate class methods on demand.
    # method_generator should be a callable that takes the method name
    # and the class for which the method should be generated and returns
    # the appropriate method.
    #
    # There should only be one _AutoMethod instance *per method* not per
    # class.
    __slots__ = ("name", "generator")
    def __init__(self, name, generator):
        self.name = name
        self.generator = generator
    def __repr__(self):
        return f"<{type(self).__name__} Method Generator for {self.name!r}>"
    def __get__(self, obj, objtype=None):
        if objtype is None:
            objtype = type(obj)
        if objtype.__dict__.get(self.name) is self:
            gen_cls = objtype
        else:
            # This may be accessed from a subclass or through super() in
            # which case objtype may not be the class this descriptor is
            # assigned to. Search the MRO to find the correct class.
            gen_cls = None
            for c in objtype.__mro__[1:]:
                if c.__dict__.get(self.name) is self:
                    gen_cls = c
                    break
            else:
                # Couldn't find the attribute, but perhaps this is being
                # called by inspect.signature which calls __get__ with
                # objtype, type(objtype) for some reason.
                if mro := getattr(obj, "__mro__", None):
                    for c in mro:
                        if c.__dict__.get(self.name) is self:
                            gen_cls = c
                            break
            # __get__ has been manually called with bad arguments
            if gen_cls is None:
                raise AttributeError(
                    f"Could not find {self!r} in class {objtype.__name__!r} MRO."
                )
        method = self.generator(self.name, gen_cls)
        setattr(gen_cls, self.name, method)
        return method.__get__(obj, objtype)
I’ve been using a version of this for a while with my own classbuilder based on an idea David Beazley demonstrated with his cluegen library.
There is one trade-off to this which is that if all methods are used it is slower as the methods are now generated individually rather than being bundled together in one exec call.
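To see the mechanism in isolation, here is a stripped-down sketch that runs on a stock interpreter. LazyMethod, make_repr, the generated list and the _fields attribute are all names invented for this example. It skips the inspect.signature handling in the real descriptor and only generates a simple __repr__.

```python
class LazyMethod:
    """Non-data descriptor that builds a method the first time it is looked up."""

    def __init__(self, name, generator):
        self.name = name
        self.generator = generator

    def __get__(self, obj, objtype=None):
        if objtype is None:
            objtype = type(obj)
        # Find the class that actually holds this placeholder (may be a base).
        for owner in objtype.__mro__:
            if owner.__dict__.get(self.name) is self:
                break
        else:
            raise AttributeError(f"{self.name!r} placeholder not found")
        method = self.generator(self.name, owner)
        setattr(owner, self.name, method)  # replace the placeholder
        return method.__get__(obj, objtype)


generated = []  # records every time source is actually exec'd


def make_repr(name, cls):
    generated.append(cls.__name__)
    parts = ", ".join(f"{f}={{self.{f}!r}}" for f in cls._fields)
    source = f"def {name}(self):\n return f'{{type(self).__name__}}({parts})'"
    namespace = {}
    exec(source, namespace)
    return namespace[name]


class Point:
    _fields = ("x", "y")
    __repr__ = LazyMethod("__repr__", make_repr)

    def __init__(self, x, y):
        self.x = x
        self.y = y


print(generated)          # nothing has been generated at class creation
print(repr(Point(1, 2)))  # first use triggers the exec
print(repr(Point(3, 4)))  # the class now holds a plain function
print(generated)
```

Because the descriptor defines no __set__, the setattr call simply overwrites it in the class namespace, so the lookup cost is only paid once per class.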
Lazy dataclasses branch and benchmarks
I have a fork with both of these changes implemented.
I have a few microbenchmarks and test suite comparisons that demonstrate the difference in performance.
Performance Comparisons
Unfortunately my machine is pretty noisy, but thankfully the differences in performance are large enough that they mostly fall outside of the noise range.
Standard Library Test Suites
_pyrepl, _colorize and obviously dataclasses all make reasonably heavy use of dataclasses so I decided to see if the impact of lazy method generation can be seen in their test suites.
Results
Benchmark 1: dataclasses tests on main
Time (mean ± σ): 192.5 ms ± 17.5 ms [User: 178.7 ms, System: 12.2 ms]
Range (min … max): 175.0 ms … 219.0 ms 10 runs
Benchmark 2: dataclasses tests on lazy-dataclasses
Time (mean ± σ): 140.9 ms ± 5.0 ms [User: 127.1 ms, System: 12.7 ms]
Range (min … max): 130.6 ms … 146.9 ms 10 runs
Summary
dataclasses tests on lazy-dataclasses ran
1.37 ± 0.13 times faster than dataclasses tests on main
Benchmark 1: _colorize tests on main
Time (mean ± σ): 86.2 ms ± 4.0 ms [User: 73.9 ms, System: 11.5 ms]
Range (min … max): 80.3 ms … 92.4 ms 10 runs
Benchmark 2: _colorize tests on lazy-dataclasses
Time (mean ± σ): 79.6 ms ± 6.1 ms [User: 67.4 ms, System: 11.1 ms]
Range (min … max): 70.9 ms … 89.2 ms 10 runs
Summary
_colorize tests on lazy-dataclasses ran
1.08 ± 0.10 times faster than _colorize tests on main
Benchmark 1: pyrepl tests on main
Time (mean ± σ): 3.029 s ± 0.047 s [User: 2.712 s, System: 0.298 s]
Range (min … max): 2.970 s … 3.126 s 10 runs
Benchmark 2: pyrepl tests on lazy-dataclasses
Time (mean ± σ): 2.763 s ± 0.043 s [User: 2.441 s, System: 0.298 s]
Range (min … max): 2.698 s … 2.858 s 10 runs
Summary
pyrepl tests on lazy-dataclasses ran
1.10 ± 0.02 times faster than pyrepl tests on main
Direct Microbenchmarks
It’s worth testing the class construction directly to demonstrate that it’s not faster in the case that all methods are needed, but from the examples earlier this case would appear to be rare.
These tests were done by timing the generation of 10k classes with 5 fields and with different settings. The timings with methods are those where all methods are generated. This is done by accessing the method on the class (for example: Example.__init__).
The "and sort" ordered variants also sort a collection of 10 instances to force the generation of any methods necessary to perform sorting.
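A rough sketch of this kind of measurement, for anyone who wants to try something similar. This is not the benchmark script used for the results. It assumes make_dataclass is an acceptable stand-in for writing out class statements (it adds some overhead of its own), and time_class_creation and its touch parameter are names invented here.

```python
import time
from dataclasses import make_dataclass

# Five fields per class, as in the description above.
FIELDS = [(f"field_{i}", int) for i in range(5)]


def time_class_creation(n=1000, touch=(), **options):
    """Return the seconds taken to create n dataclasses.

    touch lists method names to look up on each new class. On the lazy
    branch that lookup is what forces generation. On a stock interpreter
    the methods already exist, so it only adds an attribute access.
    """
    start = time.perf_counter()
    for i in range(n):
        cls = make_dataclass(f"Example{i}", FIELDS, **options)
        for name in touch:
            getattr(cls, name)
    return time.perf_counter() - start


configurations = {
    "Basic": {},
    "Frozen": {"frozen": True},
    "Ordered": {"order": True},
    "Frozen ordered": {"frozen": True, "order": True},
}

for label, options in configurations.items():
    elapsed = time_class_creation(n=200, **options)
    print(f"{label}: {elapsed * 1000:.1f} ms for 200 classes")
```

Running the same script on both builds and comparing the totals gives a per-configuration difference, although a tool that repeats the runs will deal with noise better than a single pass.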
Results
Basic: 86.3% faster
Basic init: 60.0% faster
Basic init and eq: 24.9% faster
Basic methods: 13.1% slower
Frozen: 95.5% faster
Frozen init: 76.6% faster
Frozen init and eq: 57.6% faster
Frozen methods: 24.2% faster
Ordered: 95.0% faster
Ordered and sort: 62.5% faster
Ordered methods: 20.8% slower
Frozen ordered: 96.6% faster
Frozen ordered and sort: 70.4% faster
Frozen ordered methods: 6.3% faster
Slotted basic: 85.7% faster
Slotted basic init: 58.8% faster
Slotted basic init and eq: 25.3% faster
Slotted basic methods: 9.0% slower
Slotted frozen: 92.2% faster
Slotted frozen init: 73.5% faster
Slotted frozen init and eq: 53.6% faster
Slotted frozen methods: 23.2% faster
Slotted ordered: 91.7% faster
Slotted ordered and sort: 60.1% faster
Slotted ordered methods: 14.6% slower
Slotted frozen ordered: 94.2% faster
Slotted frozen ordered and sort: 67.3% faster
Slotted frozen ordered methods: 7.8% faster
Extra required changes
The fork with lazy-dataclasses did require one minor change to pprint as it currently relies on undocumented, fragile implementation details of how dataclasses constructs its __repr__ function in order to detect if the method was constructed by dataclasses.
The check in question
This was broken due to some changes around when methods get renamed when I implemented lazy methods (renaming is now done before decorators are applied).
As such the fork currently also adds a new get_methods function and internal .__dataclass_methods__ attribute as one way to indicate if a method has been generated by dataclasses. Alternatively we could attach some attribute to the methods themselves.
Final Notes
The implementation will need additional tests for laziness, which do not yet exist in the fork. The initial goal was to implement laziness without breaking or modifying any existing tests, which this does.
There are other techniques that could be used to improve the construction time of methods of dataclasses, either exploiting features of the structure of the functions[2] or by moving some of the method generation to a C accelerator.
I think that these may significantly increase the maintenance burden on dataclasses and the place that would see the most benefit, generating the __init__ method, is the most complicated one to optimise. Anything that adds significant complexity to only optimise generation of the other methods is probably not worthwhile.
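As an illustration of the kind of structural trick meant here, the four ordering methods can be produced by one factory that only varies the comparison operator. This is a sketch under my own assumptions and not something the fork does: make_order_method and add_ordering are hypothetical names, and looking the fields up with getattr at call time is likely slower per call than the generated source.

```python
import operator
from dataclasses import dataclass


def make_order_method(name, op, field_names):
    # One function body serves all four methods. Only op differs.
    def method(self, other):
        if other.__class__ is self.__class__:
            return op(
                tuple(getattr(self, f) for f in field_names),
                tuple(getattr(other, f) for f in field_names),
            )
        return NotImplemented

    method.__name__ = name
    return method


def add_ordering(cls, field_names):
    for name, op in (
        ("__lt__", operator.lt),
        ("__le__", operator.le),
        ("__gt__", operator.gt),
        ("__ge__", operator.ge),
    ):
        method = make_order_method(name, op, field_names)
        method.__qualname__ = f"{cls.__qualname__}.{name}"
        setattr(cls, name, method)
    return cls


@dataclass
class Version:
    major: int
    minor: int


add_ordering(Version, ("major", "minor"))
print(sorted([Version(1, 10), Version(1, 2), Version(0, 9)]))
```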
[1] Empty frozen dataclasses currently have a special case that removes checking against an empty set. As this method is rarely used and an empty, frozen dataclass doesn’t seem like the most useful construct, I’ve removed the special case in my current fork.
[2] For example, all of the ordering methods are essentially the same with only one different instruction in bytecode to use their specific comparison operator. David Beazley also demonstrated caching methods with his dataklasses library for simple cases where they only differ by their field count, which is true for all dataclass methods other than __init__.