Faster (lazier) dataclasses

The short, short version

  • Dataclass creation is slow; this is a recurring problem in the stdlib
  • They are slow because they exec source code templates to create every method
  • It turns out a lot of methods are never actually used (some numbers are given later)
  • Let’s not generate those methods!
  • Here’s a fork with lazy dataclass methods implemented as a demo
  • This makes constructing a dataclass ~85% faster if it is completely unused (common)
    • ~60% faster if only __init__ is used (most common)
    • But ~20% slower in the worst case where every method is used on unfrozen classes (I’m sure there’s one out there but I’m yet to find a real example of this)
  • Some things that use dataclasses also indicate this is faster
    • The pyrepl test suite is faster by ~10%
    • The dataclasses test suite is faster by ~30%
    • I would like more real world examples of applications where start time is noticeable that make heavy use of dataclasses
      • Pyperformance didn’t appear to have anything that heavily exercises this as I didn’t note any significant change when running through that suite
  • There are other techniques we could use to cut down on code generation that will make up for this performance penalty, but these come with increased maintenance burden.

In Detail

Dataclasses are nice to use but slow to construct and can have a significant import performance penalty on modules that use them.

The performance hit has been noted multiple times in the past:

Part of this slowness was down to the other modules dataclasses imported, such as inspect, which has been improved for 3.15 with lazy imports. There is still significant overhead in the class construction itself, which can be seen in modules like _colorize that use a lot of dataclasses (each additional colour theme adds a new frozen dataclass that makes the overall module load time slower).

So in the same spirit as the new lazy imports, let’s make them faster by making them do less.

Why are dataclasses slow?

While some time is spent analysing annotations to decide how to construct the class, most of the construction time of dataclasses is spent in exec, running a generated template to create special methods.

For a class like this:

@dataclass(order=True, frozen=True, kw_only=True)
class Example:
    a: int = 42
    b: str = "Dent"

This is the source code generated (Yes, it really does use 1 space indents)

Generated source code
def __create_fn__(__dataclass_HAS_DEFAULT_FACTORY__,__dataclass_builtins_object__,__dataclass_dflt_a__,__dataclass_dflt_b__,__dataclasses_recursive_repr,__class__,FrozenInstanceError):
 def __init__(self,*,a=__dataclass_dflt_a__,b=__dataclass_dflt_b__):
  __dataclass_builtins_object__.__setattr__(self,'a',a)
  __dataclass_builtins_object__.__setattr__(self,'b',b)
 @__dataclasses_recursive_repr()
 def __repr__(self):
  return f"{self.__class__.__qualname__}(a={self.a!r}, b={self.b!r})"
 def __eq__(self,other):
  if self is other:
   return True
  if other.__class__ is self.__class__:
   return self.a==other.a and self.b==other.b
  return NotImplemented
 def __lt__(self,other):
  if other.__class__ is self.__class__:
   return (self.a,self.b,)<(other.a,other.b,)
  return NotImplemented
 def __le__(self,other):
  if other.__class__ is self.__class__:
   return (self.a,self.b,)<=(other.a,other.b,)
  return NotImplemented
 def __gt__(self,other):
  if other.__class__ is self.__class__:
   return (self.a,self.b,)>(other.a,other.b,)
  return NotImplemented
 def __ge__(self,other):
  if other.__class__ is self.__class__:
   return (self.a,self.b,)>=(other.a,other.b,)
  return NotImplemented
 def __setattr__(self,name,value):
  if type(self) is __class__ or name in {'a', 'b'}:
   raise FrozenInstanceError(f"cannot assign to field {name!r}")
  super(__class__, self).__setattr__(name, value)
 def __delattr__(self,name):
  if type(self) is __class__ or name in {'a', 'b'}:
   raise FrozenInstanceError(f"cannot delete field {name!r}")
  super(__class__, self).__delattr__(name)
 def __hash__(self):
  return hash((self.a,self.b,))
 return (__init__,__repr__,__eq__,__lt__,__le__,__gt__,__ge__,__setattr__,__delattr__,__hash__,)

While this is all executed in 1 exec call, each additional method makes the class construction slower, so an ordered or frozen dataclass takes longer to create than a basic one.
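A rough way to see this on your own machine is to time class construction with and without the extra options. This is only a sketch: the field list, the repeat count and the helper names are arbitrary choices of mine, and the numbers will vary between machines and Python versions.

```python
import timeit
from dataclasses import make_dataclass

FIELDS = [("a", int), ("b", str), ("c", float)]

def build_basic():
    # Default options: __init__, __repr__ and __eq__.
    return make_dataclass("Basic", FIELDS)

def build_frozen_ordered():
    # Adds the four ordering methods, __setattr__, __delattr__ and __hash__.
    return make_dataclass("FrozenOrdered", FIELDS, order=True, frozen=True)

def seconds_per_class(builder, repeats=200):
    # Average wall-clock time to construct one class.
    return timeit.timeit(builder, number=repeats) / repeats

basic = seconds_per_class(build_basic)
frozen_ordered = seconds_per_class(build_frozen_ordered)
print(f"basic:          {basic * 1e6:8.1f} us per class")
print(f"frozen+ordered: {frozen_ordered * 1e6:8.1f} us per class")
```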

Most of these methods will also be unused at runtime:

Evidence of unused methods

I added a hook to dataclasses that counts how many dataclasses are created, how many are frozen and/or ordered along with how many of each method are constructed. This is printed when python exits. Class types that are not created or methods that are never generated are not listed.
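For anyone who wants similar numbers without building the branch, a much cruder approximation is to wrap the decorator itself. This sketch is not the hook from the branch: `counting_dataclass`, `stats` and `report` are hypothetical names, and it can only count classes and their options, not which methods are later generated or used.

```python
import atexit
import dataclasses
from collections import Counter

stats = Counter()
_original_dataclass = dataclasses.dataclass

def counting_dataclass(cls=None, /, **kwargs):
    # Hypothetical stand-in for the hook: record the options, then defer to
    # the real decorator.
    def wrap(c):
        stats["Classes Created"] += 1
        if kwargs.get("frozen"):
            stats["Frozen"] += 1
        if kwargs.get("order"):
            stats["Ordered"] += 1
        return _original_dataclass(c, **kwargs)
    return wrap if cls is None else wrap(cls)

def report():
    # Only categories that were actually seen are listed.
    for name, count in stats.items():
        print(f"{name}: {count}")

atexit.register(report)  # print the totals when Python exits

@counting_dataclass(frozen=True, order=True)
class Example:
    a: int = 42

@counting_dataclass
class Plain:
    b: str = "Dent"
```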

This is in its own branch for lazy dataclasses if you are curious

./python -m test test_dataclasses

This has to test all of the methods, but even so there’s a pretty clear indication of which one is more important than all of the others.

Classes Created: 608
Frozen: 100
Ordered: 11
__init__: 377
__repr__: 20
__eq__: 33
__setattr__: 14
__delattr__: 8
__hash__: 22
__lt__: 4
__le__: 5
__gt__: 4
__ge__: 4

Opening and exiting the REPL

Classes Created: 43
Frozen: 33
__init__: 37
__eq__: 3
__hash__: 1

./python -m test test_pyrepl

Classes Created: 222
Frozen: 40
__init__: 207
__eq__: 4
__hash__: 1

python -m pip list

Classes Created: 29
Frozen: 15
__init__: 3

python -m pip install -e . --group dev (on my Reannotate library)

Classes Created: 45
Frozen: 25
__init__: 20
__eq__: 1
__hash__: 1

python -m pytest (on Reannotate, which does not use dataclasses itself)

Classes Created: 75
Frozen: 34
Ordered: 1
__init__: 28

black --check Lib/dataclasses.py

Classes Created: 23
Frozen: 8
__init__: 12

pylint Lib/dataclasses.py (An actual use of frozen __setattr__ via isort!)

Classes Created: 4
Frozen: 2
__init__: 1
__setattr__: 1

poetry sync (on textual)

Classes Created: 54
Frozen: 17
Ordered: 4
__init__: 20
__eq__: 3
__hash__: 3
__lt__: 3
__gt__: 2
__ge__: 1

python -m textual (Textual demo)

Classes Created: 72
Frozen: 6
__init__: 36
__eq__: 1

python runtests.py (Django)

This required editing the logger to write to a file instead of stdout and also runs in parallel so gave multiple logs. This is truncated to only show the last process that exited (in all of the logs, only __init__ was used).

Classes Created: 18
Frozen: 18
__init__: 15

Sum excluding the dataclasses.py tests:

Classes Created: 585
Frozen: 198
Ordered: 5
__init__: 379
__repr__: 0
__eq__: 12
__setattr__: 1
__delattr__: 0
__hash__: 6
__lt__: 3
__le__: 0
__gt__: 2
__ge__: 1

How could we make this faster?

Avoiding exec based codegen entirely where possible

It turns out __setattr__ and __delattr__ for frozen dataclasses are essentially the same function for all non-empty[1] dataclasses, only differing by two values fixed at class creation. We can just create those directly without needing codegen.

Try for yourself
from dataclasses import dataclass

@dataclass(frozen=True)
class A:
    a: int
    b: int

@dataclass(frozen=True, slots=True)
class B:
    c: str
    d: str
    e: str
    f: str

assert A.__setattr__.__code__.co_code == B.__setattr__.__code__.co_code
assert A.__delattr__.__code__.co_code == B.__delattr__.__code__.co_code

If you want to get technical, the new function generation replaces a LOAD_CONST with a LOAD_DEREF in the resulting method. I’ve been unable to measure any performance difference resulting from this change. Creating the functions this way is significantly faster as will be seen in the frozen class construction comparisons later.
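As an illustration of the idea (not the code from the fork), the two methods can be built as ordinary closures. I am assuming the two fixed values are the class and the set of field names, going by the generated source shown earlier; `make_frozen_setattr` and `make_frozen_delattr` are hypothetical helper names.

```python
from dataclasses import FrozenInstanceError

def make_frozen_setattr(cls, field_names):
    # cls and field_names are captured by the closure instead of being
    # written into a source template and passed through exec.
    field_names = frozenset(field_names)

    def __setattr__(self, name, value):
        if type(self) is cls or name in field_names:
            raise FrozenInstanceError(f"cannot assign to field {name!r}")
        super(cls, self).__setattr__(name, value)

    return __setattr__

def make_frozen_delattr(cls, field_names):
    field_names = frozenset(field_names)

    def __delattr__(self, name):
        if type(self) is cls or name in field_names:
            raise FrozenInstanceError(f"cannot delete field {name!r}")
        super(cls, self).__delattr__(name)

    return __delattr__

class Point:
    def __init__(self, x, y):
        # Bypass the frozen __setattr__, as the generated __init__ does.
        object.__setattr__(self, "x", x)
        object.__setattr__(self, "y", y)

Point.__setattr__ = make_frozen_setattr(Point, ("x", "y"))
Point.__delattr__ = make_frozen_delattr(Point, ("x", "y"))

p = Point(1, 2)
try:
    p.x = 10
except FrozenInstanceError as exc:
    print(exc)  # cannot assign to field 'x'
```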

Lazy Generation

Much like how lazy imports save time by not doing work that isn’t needed, we can achieve the same thing here by lazily generating the methods on first usage.

To implement this, each method is created by a separate function that takes the method name and the class being prepared as arguments and returns the corresponding function. Most of the actual source template creation logic in dataclasses is unchanged.

These functions are then wrapped by a non-data descriptor which will generate the method, replace itself with the method and return the newly generated method.

The descriptor looks like this
class _AutoMethod:
    # A non-data descriptor to autogenerate class methods on demand.
    # method_generator should be a callable that takes the method name
    # and the class for which the method should be generated and returns
    # the appropriate method.
    #
    # There should only be one _AutoMethod instance *per method* not per
    # class.
    __slots__ = ("name", "generator")

    def __init__(self, name, generator):
        self.name = name
        self.generator = generator

    def __repr__(self):
        return f"<{type(self).__name__} Method Generator for {self.name!r}>"

    def __get__(self, obj, objtype=None):
        if objtype is None:
            objtype = type(obj)

        if objtype.__dict__.get(self.name) is self:
            gen_cls = objtype
        else:
            # This may be accessed from a subclass or through super() in
            # which case objtype may not be the class this descriptor is
            # assigned to. Search the MRO to find the correct class.
            gen_cls = None
            for c in objtype.__mro__[1:]:
                if c.__dict__.get(self.name) is self:
                    gen_cls = c
                    break
            else:
                # Couldn't find the attribute, but perhaps this is being
                # called by inspect.signature which calls __get__ with
                # objtype, type(objtype) for some reason.
                if mro := getattr(obj, "__mro__", None):
                    for c in mro:
                        if c.__dict__.get(self.name) is self:
                            gen_cls = c
                            break

                # __get__ has been manually called with bad arguments
                if gen_cls is None:
                    raise AttributeError(
                        f"Could not find {self!r} in class {objtype.__name__!r} MRO."
                    )

        method = self.generator(self.name, gen_cls)
        setattr(gen_cls, self.name, method)
        return method.__get__(obj, objtype)

I’ve been using a version of this for a while with my own classbuilder based on an idea David Beazley demonstrated with his cluegen library.

There is one trade-off to this which is that if all methods are used it is slower as the methods are now generated individually rather than being bundled together in one exec call.

Lazy dataclasses branch and benchmarks

I have a fork with both of these changes implemented.

I have a few microbenchmarks and test suite comparisons that demonstrate the difference in performance.

Messy benchmark scripts

Performance Comparisons

Unfortunately my machine is pretty noisy, but thankfully the difference in performance is large enough that they mostly fall outside of the noise range.

Standard Library Test Suites

_pyrepl, _colorize and obviously dataclasses all make reasonably heavy use of dataclasses so I decided to see if the impact of lazy method generation can be seen in their test suites.

Results
Benchmark 1: dataclasses tests on main
  Time (mean ± σ):     192.5 ms ±  17.5 ms    [User: 178.7 ms, System: 12.2 ms]
  Range (min … max):   175.0 ms … 219.0 ms    10 runs
 
Benchmark 2: dataclasses tests on lazy-dataclasses
  Time (mean ± σ):     140.9 ms ±   5.0 ms    [User: 127.1 ms, System: 12.7 ms]
  Range (min … max):   130.6 ms … 146.9 ms    10 runs
 
Summary
  dataclasses tests on lazy-dataclasses ran
    1.37 ± 0.13 times faster than dataclasses tests on main

Benchmark 1: _colorize tests on main
  Time (mean ± σ):      86.2 ms ±   4.0 ms    [User: 73.9 ms, System: 11.5 ms]
  Range (min … max):    80.3 ms …  92.4 ms    10 runs
 
Benchmark 2: _colorize tests on lazy-dataclasses
  Time (mean ± σ):      79.6 ms ±   6.1 ms    [User: 67.4 ms, System: 11.1 ms]
  Range (min … max):    70.9 ms …  89.2 ms    10 runs
 
Summary
  _colorize tests on lazy-dataclasses ran
    1.08 ± 0.10 times faster than _colorize tests on main

Benchmark 1: pyrepl tests on main
  Time (mean ± σ):      3.029 s ±  0.047 s    [User: 2.712 s, System: 0.298 s]
  Range (min … max):    2.970 s …  3.126 s    10 runs
 
Benchmark 2: pyrepl tests on lazy-dataclasses
  Time (mean ± σ):      2.763 s ±  0.043 s    [User: 2.441 s, System: 0.298 s]
  Range (min … max):    2.698 s …  2.858 s    10 runs
 
Summary
  pyrepl tests on lazy-dataclasses ran
    1.10 ± 0.02 times faster than pyrepl tests on main

Direct Microbenchmarks

It’s worth testing the class construction directly to demonstrate that it’s not faster in the case that all methods are needed, but from the examples earlier this case would appear to be rare.

These tests were done by timing the generation of 10k classes with 5 fields and with different settings. The timings with methods are those where all methods are generated. This is done by accessing the method on the class (for example: Example.__init__).

The “and sort” ordered benchmarks also sort a collection of 10 instances to force the generation of any methods necessary to perform sorting.

Results
Basic: 86.3% faster
Basic init: 60.0% faster
Basic init and eq: 24.9% faster
Basic methods: 13.1% slower
Frozen: 95.5% faster
Frozen init: 76.6% faster
Frozen init and eq: 57.6% faster
Frozen methods: 24.2% faster
Ordered: 95.0% faster
Ordered and sort: 62.5% faster
Ordered methods: 20.8% slower
Frozen ordered: 96.6% faster
Frozen ordered and sort: 70.4% faster
Frozen ordered methods: 6.3% faster
Slotted basic: 85.7% faster
Slotted basic init: 58.8% faster
Slotted basic init and eq: 25.3% faster
Slotted basic methods: 9.0% slower
Slotted frozen: 92.2% faster
Slotted frozen init: 73.5% faster
Slotted frozen init and eq: 53.6% faster
Slotted frozen methods: 23.2% faster
Slotted ordered: 91.7% faster
Slotted ordered and sort: 60.1% faster
Slotted ordered methods: 14.6% slower
Slotted frozen ordered: 94.2% faster
Slotted frozen ordered and sort: 67.3% faster
Slotted frozen ordered methods: 7.8% faster

Extra required changes

The fork with lazy-dataclasses did require one minor change to pprint as it currently relies on undocumented, fragile implementation details of how dataclasses constructs its __repr__ function in order to detect if the method was constructed by dataclasses.

The check in question

This was broken due to some changes around when methods get renamed when I implemented lazy methods (renaming is now done before decorators are applied).

As such the fork currently also adds a new get_methods function and internal .__dataclass_methods__ attribute as one way to indicate if a method has been generated by dataclasses. Alternatively we could attach some attribute to the methods themselves.

Final Notes

The implementation will need additional tests for laziness, which do not yet exist in the fork. The initial goal was to implement laziness without breaking or modifying any existing tests, which this does.

There are other techniques that could be used to improve the construction time of methods of dataclasses, either exploiting features of the structure of the functions[2] or by moving some of the method generation to a C accelerator.

I think that these may significantly increase the maintenance burden on dataclasses and the place that would see the most benefit, generating the __init__ method, is the most complicated one to optimise. Anything that adds significant complexity to only optimise generation of the other methods is probably not worthwhile.


  1. Empty frozen dataclasses currently have a special case that removes checking against an empty set. As this method is rarely used and an empty, frozen dataclass doesn’t seem like the most useful construct I’ve removed the special case in my current fork. ↩︎

  2. For example, all of the ordering methods are essentially the same with only one different instruction in bytecode to use their specific comparison operator. David Beazley also demonstrated caching methods with his dataklasses library for simple cases where they only differ by their field count, which is true for all dataclass methods other than __init__. ↩︎

23 Likes

It is definitely worth experimenting with different approaches.

Some years ago, I tried several different ways to create namedtuples in pure Python and got significant speed-ups, some with different trade-offs (I remember one in which I did no length verification that was blazing fast).

The results of those still live in extradict/extradict/extratuple.py at main · jsbueno/extradict · GitHub (`pip install extradict` will bring them over). There is some crude profiling at extradict/performance/extratuple.py at main · jsbueno/extradict · GitHub, which could likely be improved with an LLM these days.

Most important: I refrained from using exec in all those versions. Maybe an exec-less version of dataclasses could give a significant speed-up as well. One thing that could be tried is templating the AST itself, for example, though maybe not for all dataclass edge cases (frozen, slots).

And last but not least, if you have an idea for a “lazy namedtuple” to go into extradict, you are more than welcome to contribute there.

2 Likes

A large part of this is not wanting to add complication for methods that are likely to go unused. Given the numbers I see for uses of each method I think laziness makes sense even if construction is later optimised for __init__.

One advantage of laziness is that if someone does come up with a faster method to generate __init__ they can wrap dataclasses.dataclass and replace the __init__ attribute after constructing the dataclass and they haven’t paid the cost of any exec calls (currently even if you remove all potential methods it still calls exec on an empty __create_fn__ function template).
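The wrap-and-replace shape could look roughly like the sketch below. Everything here is hypothetical: `dataclass_with_custom_init` and `make_simple_init` are names I made up, and the replacement `__init__` is a deliberately naive placeholder (plain defaults only, unfrozen classes only, and not claimed to be fast). On current CPython the original `__init__` is still generated eagerly before being replaced; the point above is that with lazy methods that cost would never be paid.

```python
from dataclasses import MISSING, dataclass, fields

def make_simple_init(cls):
    # Hypothetical exec-free __init__: no default_factory, InitVar, kw_only,
    # frozen or __post_init__ support.
    names = [f.name for f in fields(cls)]
    defaults = {f.name: f.default for f in fields(cls) if f.default is not MISSING}

    def __init__(self, *args, **kwargs):
        if len(args) > len(names):
            raise TypeError(f"expected at most {len(names)} positional arguments")
        values = dict(zip(names, args))
        for key, value in kwargs.items():
            if key not in names or key in values:
                raise TypeError(f"unexpected or duplicate argument {key!r}")
            values[key] = value
        for name in names:
            if name not in values:
                if name not in defaults:
                    raise TypeError(f"missing argument {name!r}")
                values[name] = defaults[name]
            setattr(self, name, values[name])

    return __init__

def dataclass_with_custom_init(cls=None, /, **kwargs):
    # Wrap dataclasses.dataclass, then swap in the replacement __init__.
    def wrap(c):
        c = dataclass(c, **kwargs)
        c.__init__ = make_simple_init(c)
        return c
    return wrap if cls is None else wrap(cls)

@dataclass_with_custom_init
class Point:
    x: int
    y: int = 0

p = Point(1, y=2)
print(p.x, p.y)  # 1 2
```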


If you’re thinking about namedtuples, the dataklasses optimisation may apply more easily there. The eval-ed __new__ functions are all the same for the same size tuple so you can probably generate one and patch the rest.

1 Like

I’m guessing someone has already tried generating an __as_tuple() method then replacing all those other generated methods that use tuples with static def __eq__(self, other): return self.__as_tuple() == other.__as_tuple()?

1 Like

The only reason there is a noticeable cost to calling exec is because the module code itself is cached, whereas the generated code is not cached. Ideally this is the discrepancy that is fixed.

IIRC there were previous discussions about adding custom data to .pyc files somehow. Proper macros would also solve this.

I think the idea in this thread is still worth doing, but it should be considered a workaround, not a solution.

1 Like

The __eq__ method actually no longer compares things as tuples, only the ordering methods do. If you’re feeling sneaky you can generate one method and then patch the argument to the COMPARE_OP bytecode to create the other 3.
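Patching bytecode is fragile across Python versions, so here is a plainer way to show that the four ordering methods share one body: a closure parameterised by the comparison function. This is a swapped-in technique, not the bytecode trick and not what dataclasses does; `make_order_method` and `add_ordering` are hypothetical names, and calls will likely be slower than with generated source.

```python
import operator
from dataclasses import dataclass, fields

def make_order_method(name, op, field_names):
    # One shared body; only the comparison operator differs per method.
    def method(self, other):
        if other.__class__ is self.__class__:
            mine = tuple(getattr(self, n) for n in field_names)
            theirs = tuple(getattr(other, n) for n in field_names)
            return op(mine, theirs)
        return NotImplemented
    method.__name__ = name
    return method

def add_ordering(cls):
    # Build all four methods from the same factory, without exec.
    names = tuple(f.name for f in fields(cls) if f.compare)
    for name, op in [("__lt__", operator.lt), ("__le__", operator.le),
                     ("__gt__", operator.gt), ("__ge__", operator.ge)]:
        setattr(cls, name, make_order_method(name, op, names))
    return cls

@add_ordering
@dataclass
class Version:
    major: int
    minor: int

versions = sorted([Version(1, 3), Version(0, 9), Version(1, 2)])
print([(v.major, v.minor) for v in versions])
```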

I’m not exactly sure how you would propose to fix this other than some special casing around dataclasses in the compiler or new syntax for dataclasses, neither of which have been forthcoming since that thread linked.

I did once have a module that would install an import hook and would rewrite dataclass-like objects in the AST so that they would be realised in the .pyc files but it was painful to maintain and use so I gave up on it eventually.

I am remembering the thread “Thought experiment about adding extra loadable stuff to a .pyc file”.

There is stuff that could be done there by adding semi generic tools to the language that not just dataclasses, but other similar tools could use as well.

I mentioned this to you on Discord but I wanted to explicitly ask here as well: what are your plans to ensure this is all thread-safe?

Fundamentally you’ll be setting things up so dataclass types are mutated at arbitrary times at runtime. How does that impact a dataclass that might be shared between arbitrary concurrently running Python code? Also how are you planning to test this sort of multithreaded sharing?

In my experience working on free-threading, multithreaded sharing of Python objects is usually poorly tested. I’m trying to get people to think about this more in the design stages of projects like this.

4 Likes

This is funny. I helped (with the initial poc) with the whole “let’s exec all this stuff together instead of separately” - and now maybe it’ll go the other way. 🙂

I always sorta wonder like why is exec so slow anyway? Or why doesn’t core just bless a faster “dataclass” esque library, like msgspec? The dataclass as-is is all about compatibility, ease of subclassing and all that hurts the performance

2 Likes

Well, the initial version of _AutoMethod (the one in the topic post) wasn’t, as I realised upon the faintest breath of the phrase “thread-safe” reaching my ears. I think the logical race that would trigger an AttributeError is gone[1], but I don’t have the free-threading experience to test a before/after.

My naive expectation is that it probably needs a lock around the generation and class mutation to prevent it from generating and assigning multiple times, but would appreciate any help/pointers on testing and getting it right from people with more free-threading experience.


It’s fair. It was a good idea; it’s just unfortunate that it’s not really possible to do both! If you are going to generate all of the methods at once it makes sense to do so in one call. I’ll agree that it is surprising just how much time is spent in exec for frozen classes.

I’ll note that my interest is largely in both allowing the stdlib to actually use dataclasses where it makes sense and improving the startup performance of other libraries that do make heavy use of dataclasses. I don’t want the general view of dataclasses to be “oh we can’t use them, they’re too slow” or it defeats the point of adding them to the stdlib in the first place.

As an external C-extension msgspec doesn’t really help for that case as fast as it is.


  1. I haven’t edited the initial post, so that specific condition is still there in that example. ↩︎

2 Likes

I just had this idea. I apologize if it sounds like a “competing idea” for now, but unless I write it down now I might not remember it at all (and maybe it could ring a bell for someone):

What if stdlib dataclasses would simply “bake” the generated classes and save them as bytecode in the appropriate `__pycache__` dir? The same heuristics for deciding if a new .pyc should be generated could be used, and changes to the existing `dataclasses.py` would be minimal: just save a properly named .pyc there once a dataclass is created the first time, and check if it exists there when a new one is created.
We could at least profile this and compare with the exec performance.

(but also, I like the “lazy” way you are proposing. Maybe there are trade offs between compatibility/amount of changes to check)

I’d consider it important that the lock is only used after a “fast path” check to see if the class attr is already there. You don’t want lock contention to be so severe that it outweighs the benefit.

Testing it will be tricky, but I don’t think it’s beyond the complexity level of the feature itself. The pattern I’ve found most understandable is that the test controls two “worker threads”, whose functions are defined locally and use a mocked lock to let the main thread inspect and control them. Barriers are handy to make sure all three (main, worker 1, worker 2) are synchronized in the right places.
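To make both points concrete, here is a sketch of a locked variant of the per-class descriptor with a fast-path check, followed by a small two-worker test that uses a barrier. It is an assumption about how this might be done, not code from the fork: `LockedAutoMethod` and `generate_greet` are hypothetical, a single class-level lock is used for simplicity, and the test uses a real lock rather than the mocked one described above.

```python
import threading

class LockedAutoMethod:
    # Hypothetical lazy-method descriptor with double-checked locking.
    _lock = threading.Lock()

    def __init__(self, name, generator, cls):
        self.name = name
        self.generator = generator
        self.cls = cls

    def __get__(self, obj, objtype=None):
        # Fast path: another thread may already have replaced this descriptor.
        method = self.cls.__dict__.get(self.name)
        if method is self:
            with self._lock:
                # Re-check under the lock so the generator runs only once.
                method = self.cls.__dict__.get(self.name)
                if method is self:
                    method = self.generator(self.name, self.cls)
                    setattr(self.cls, self.name, method)
        return method.__get__(obj, objtype)

calls = []

def generate_greet(name, cls):
    calls.append(name)  # record every generation

    def greet(self):
        return "hello"

    greet.__name__ = name
    return greet

class Example:
    pass

Example.greet = LockedAutoMethod("greet", generate_greet, Example)

barrier = threading.Barrier(3)  # main thread plus two workers
results = []

def worker():
    instance = Example()
    barrier.wait()  # line all threads up before touching the attribute
    results.append(instance.greet())

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
barrier.wait()  # release both workers at once
for t in threads:
    t.join()

print(len(calls), results)
```

A test like this only shows that the generator ran once on this run; it does not prove the absence of races, especially on free-threaded builds.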


Just as a bit of lateral thinking exercise, are there other ways we can make this faster?

Here’s one which occurs to me:
Could we disable a lot of magic method generation explicitly, by controlling a list of exactly which methods we need?

@dataclass(methods=("__init__", "__repr__"))
class Point:
    x: int
    y: int
  • the default list would be all of the methods
  • you’d need annoying but simple compat shims for older pythons

I don’t think compiling to a .pyc works for @dataclasses given how dynamic they can be (I have done something similar before, but it was more restricted in features) but I’d be interested to see a concrete implementation to prove me wrong.


Technically we already do this with init=False, repr=False, eq=False, one of the subtler issues is that some of these things are either nice to have for development (__repr__) or necessary as guards (__setattr__ and __delattr__) but don’t actually get used at runtime under normal conditions.
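For reference, this is what the existing opt-out looks like. The class name is just an example.

```python
from dataclasses import dataclass

@dataclass(repr=False, eq=False)
class InitOnly:
    x: int
    y: int = 0

# Only methods that dataclass actually added appear in the class __dict__.
generated = [n for n in ("__init__", "__repr__", "__eq__") if n in InitOnly.__dict__]
print(generated)  # ['__init__']
```

The downside is exactly the one described above: the class loses its helpful `__repr__` for debugging, whereas laziness keeps it available without paying for it up front.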

3 Likes

but I’d be interested to see a concrete implementation to prove me wrong.

Yeah, I posted, then sat down to think about how to generate the .pyc, and it is far from being as simple as I had in mind while writing. My mind is sleepy today, kinda “foggy”, sorry.

Still, the idea of a “code generation step”, one step that would run at development time, and generate artifacts that are not hand-coded, may not be versioned, but will be already created at runtime, is something I’ve been playing around with for some time.

While the idea of caching it sounds good, the IO overhead for read/write might make the situation way worse if we don’t use the cache very often (e.g. only write to it once and read once, where ‘normal’ dataclasses might be faster).

1 Like

So I went back and re-checked an assumption I had from years ago[1] and it seems that making a descriptor instance per-method per-class isn’t actually significantly slower than only having one per-method. This drastically simplifies the descriptor.

Simplified `_AutoMethod`

Still no locking yet though.

class _AutoMethod:
    __slots__ = ("name", "generator", "cls")

    def __init__(self, name, generator, cls):
        self.name = name
        self.generator = generator
        self.cls = cls

    def __repr__(self):
        return f"<{type(self).__name__} Method Generator for {self.name!r} on {self.cls.__qualname__!r}>"

    def __get__(self, obj, objtype=None):
        method = self.generator(self.name, self.cls)
        setattr(self.cls, self.name, method)
        return method.__get__(obj, objtype)
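A small usage sketch, outside of dataclasses, shows the descriptor replacing itself on first access. The class is repeated (without `__repr__`) so the snippet runs on its own; `make_repr`, the `_fields` tuple and `Point` are hypothetical stand-ins for the real generator functions and field metadata.

```python
class _AutoMethod:
    __slots__ = ("name", "generator", "cls")

    def __init__(self, name, generator, cls):
        self.name = name
        self.generator = generator
        self.cls = cls

    def __get__(self, obj, objtype=None):
        method = self.generator(self.name, self.cls)
        setattr(self.cls, self.name, method)
        return method.__get__(obj, objtype)

generated = []

def make_repr(name, cls):
    # Hypothetical generator: builds __repr__ from a _fields tuple on the class.
    generated.append(name)

    def __repr__(self):
        body = ", ".join(f"{f}={getattr(self, f)!r}" for f in cls._fields)
        return f"{cls.__name__}({body})"

    return __repr__

class Point:
    _fields = ("x", "y")

    def __init__(self, x, y):
        self.x, self.y = x, y

Point.__repr__ = _AutoMethod("__repr__", make_repr, Point)

before = type(Point.__dict__["__repr__"]).__name__
text = repr(Point(1, 2))  # first use triggers generation
after = type(Point.__dict__["__repr__"]).__name__
repr(Point(3, 4))         # second use goes straight to the function
print(before, after, text, generated)
```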

  1. Much of this comes from things I’ve previously used elsewhere. ↩︎