Clarification of `dataclass_transform` behavior for metaclasses

A recent pyright bug report uncovered an ambiguity in the specification of dataclass_transform.

The ambiguity relates to libraries that provide dataclass-like behaviors through a metaclass (like pydantic) rather than through a decorator (like attrs or the stdlib dataclasses module). The question is whether __init__ synthesis should be skipped when some intervening class in the MRO provides a custom __init__ method.

For decorator-based libraries, the answer is clear: it should follow the behavior of the stdlib dataclasses module. For libraries that use metaclasses, the behavior is less clear and is currently unspecified. Not surprisingly, this leads to a divergence in behavior: pydantic’s runtime behavior differs from that currently assumed by mypy and pyright.
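
For reference, here is the stdlib behavior the decorator case follows; an __init__ defined on the decorated class suppresses synthesis (a minimal illustration):

from dataclasses import dataclass
import inspect

@dataclass
class Plain:
    x: int

    # The user-defined __init__ is kept; no __init__ is synthesized.
    def __init__(self) -> None: ...

print(inspect.signature(Plain))  # () -> None

By contrast, here is the metaclass case where the behavior is ambiguous: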

from pydantic import BaseModel

class A(BaseModel):
    x: int

class B(A):
    def __init__(self) -> None: ...

class C(B):
    y: int

# Which of the following is correct?
C()  # OK at runtime, error according to mypy and pyright
C(1)  # Runtime error, error according to mypy and pyright
C(1, 1)  # Runtime error, OK according to mypy and pyright

This also affects multiple inheritance use cases:

from pydantic import BaseModel

class A(BaseModel):
    x: int

class B:
    def __init__(self) -> None: ...

class C(B, A):
    y: int

C()  # OK at runtime, error according to mypy and pyright
C(1)  # Runtime error, error according to mypy and pyright
C(1, 1)  # Runtime error, OK according to mypy and pyright

# Swapping the base classes changes the behavior
class D(A, B):
    y: int

D()  # Runtime error, error according to mypy and pyright
D(x=1)  # Runtime error, error according to mypy and pyright
D(x=1, y=1)  # OK at runtime, OK according to mypy and pyright
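
The divergence tracks where B’s custom __init__ falls in the MRO relative to BaseModel, the class constructed from the dataclass_transform metaclass. Roughly:

print(C.__mro__)  # (C, B, A, BaseModel, object): B.__init__ precedes BaseModel
print(D.__mro__)  # (D, A, BaseModel, B, object): B.__init__ follows BaseModel

Under pydantic’s runtime rule, C’s __init__ synthesis is skipped in favor of B’s custom __init__, while D still gets a synthesized __init__ that accepts x and y.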

Since the correct behavior for dataclass_transform is unspecified in this case, perhaps we should clarify the typing spec so that pydantic’s behavior and type checker assumptions are aligned. This would involve adding a new bullet to the Dataclass Semantics section that says:

  • When dataclass_transform is applied to a decorator function, synthesis of an __init__ method is skipped if the class decorated with that function defines its own __init__ method. If dataclass_transform is applied to a metaclass, synthesis of an __init__ method is skipped if any class in the MRO prior to the base class constructed from that metaclass defines its own __init__ method. When dataclass_transform is applied directly to a base class, synthesis of an __init__ method is skipped if any class in the MRO prior to that base class provides its own __init__ method.
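
Concretely, the bullet covers the three ways dataclass_transform can be applied. A sketch using typing.dataclass_transform (typing_extensions before Python 3.11); the names create_model, ModelMeta, and ModelBase are illustrative:

from typing import dataclass_transform

# Applied to a decorator function: only an __init__ defined directly on
# the decorated class suppresses synthesis.
@dataclass_transform()
def create_model(cls): ...

# Applied to a metaclass: an __init__ on any class in the MRO prior to the
# base class constructed from this metaclass suppresses synthesis.
@dataclass_transform()
class ModelMeta(type): ...

# Applied directly to a base class: an __init__ on any class in the MRO
# prior to this base class suppresses synthesis.
@dataclass_transform()
class ModelBase: ...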

The only downside I see to specifying this behavior is that we might find that other libraries (besides pydantic) that use metaclasses or base classes to introduce dataclass-like behaviors differ from this specified behavior. But given the importance of pydantic in the Python ecosystem, there’s a good argument that we should standardize on its behavior in this case.

Thoughts?

I have my own implementation of the dataclass concept that optionally uses a base class backed by a metaclass [1]. Currently it matches the expectations of mypy and pyright, so I’d appreciate it if there is at least a way to keep the current behaviour.

The current mypy/pyright behaviour also matches what you would expect if you wrap the dataclass function for use in a metaclass or base class.

Example of an extremely basic wrapper:

from dataclasses import dataclass
import inspect


class DataClass:
    def __init_subclass__(cls, **kwargs):
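        # Apply the stdlib dataclass decorator to every subclass as it is
        # created, forwarding any class keyword arguments.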
        dataclass(cls, **kwargs)

class A(DataClass):
    x: int

class B(A):
    def __init__(self) -> None:
        pass

class C(B):
    y: int


print(inspect.signature(C))  # (x: int, y: int) -> None
del A, B, C


class A(DataClass):
    x: int

class B:
    def __init__(self) -> None: ...

class C(B, A):
    y: int

print(inspect.signature(C))  # (x: int, y: int) -> None


class D(A, B):
    y: int

print(inspect.signature(D))  # (x: int, y: int) -> None

I think it’s worth providing a way to reflect the Pydantic behaviour, but it should be via an argument to dataclass_transform rather than being the only ‘correct’ behaviour.
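
Something along these lines, where the parameter name is invented here purely to sketch the idea (at runtime dataclass_transform already accepts arbitrary keyword arguments to allow for future extensions):

from typing import dataclass_transform

# Hypothetical opt-in flag, not an actual dataclass_transform parameter:
# when true, a type checker would apply the pydantic-style MRO-based rule.
@dataclass_transform(custom_init_in_mro=True)
class ModelMeta(type): ...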


  1. Partly in order to handle slots without needing to recreate the class and deal with all the related bugs dataclasses has to deal with.

Pydantic is happy with the proposed changes. I’ve tried looking for other libraries that use @dataclass_transform with a metaclass, but the only one I know of is SQLAlchemy (see the docs about MappedAsDataclass; I’m not sure it supports a custom __init__ anyway). I couldn’t find any other library applying the decorator to a metaclass, so it might be worth looking a bit further.

I’m relatively negative on this change.

I prefer what msgspec does here (it also uses a metaclass, though finding this might be harder for those not used to reading C extension code):

>>> import msgspec
>>> class A(msgspec.Struct):
...     x: int
...
>>> class B(A):
...     def __init__(self): pass
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Struct types cannot define __init__

It just unambiguously rejects this, and I agree with it doing so. I don’t find it particularly compelling to have an __init__ that doesn’t match dataclass semantics mixed with dataclass_transform use; this seems more likely to be a source of user confusion or unintended use, and it will further complicate dataclasses.

I believe the right way to customize object creation here should be something else; various other options exist for library authors that won’t have this issue, especially if they are using a metaclass.
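
One such option: instead of overriding __init__, creation can be customized through an alternate constructor, which leaves the synthesized __init__ and dataclass semantics intact. A minimal sketch using the stdlib decorator:

from dataclasses import dataclass

@dataclass
class Point:
    x: int
    y: int

    # An alternate constructor customizes creation without replacing the
    # synthesized __init__.
    @classmethod
    def origin(cls) -> "Point":
        return cls(0, 0)

print(Point.origin())  # Point(x=0, y=0)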