Add `dataclass_factory` argument to `dataclasses.make_dataclass` for custom dataclass transformation support

XuehaiPan · May 13, 2024, 7:17am

Forwarded from GitHub issue: python/cpython#118974

Feature or enhancement

Proposal:

typing.dataclass_transform (PEP 681 – Data Class Transforms) allows users to define their own dataclass decorator that type checkers can recognize.

Here is a real-world example use case:

flax.struct.dataclass
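For readers unfamiliar with PEP 681, here is a minimal sketch of a custom decorator marked with dataclass_transform. The decorator name frozen_model and the class Point are made up for illustration and are not taken from flax or any other library.

```python
import dataclasses

try:
    from typing import dataclass_transform  # available in typing since Python 3.11
except ImportError:
    from typing_extensions import dataclass_transform  # backport for older versions


@dataclass_transform()
def frozen_model(cls):
    """Hypothetical third-party decorator: a dataclass that is always frozen."""
    return dataclasses.dataclass(cls, frozen=True)


@frozen_model
class Point:
    x: int
    y: int


# Type checkers that implement PEP 681 treat Point as having a synthesized __init__.
p = Point(1, 2)
```

At runtime dataclass_transform does not change behavior; it only tells static type checkers that the decorated function behaves like dataclasses.dataclass.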

Also, dataclasses.asdict and dataclasses.astuple allow users to pass an extra argument that sets the factory used to build the returned instance.

python/cpython/blob/0fb18b02c8ad56299d6a2910be0bab8ad601ef24/Lib/dataclasses.py#L1299-L1317


      
          def asdict(obj, *, dict_factory=dict):
              """Return the fields of a dataclass instance as a new dictionary mapping
              field names to field values.
          
              Example usage::
          
                @dataclass
                class C:
                    x: int
                    y: int
          
                c = C(1, 2)
                assert asdict(c) == {'x': 1, 'y': 2}
          
              If given, 'dict_factory' will be used instead of built-in dict.
              The function applies recursively to field values that are
              dataclass instances. This will also look into built-in containers:
              tuples, lists, and dicts. Other objects are copied with 'copy.deepcopy()'.
              """

python/cpython/blob/0fb18b02c8ad56299d6a2910be0bab8ad601ef24/Lib/dataclasses.py#L1380-L1397


      
          def astuple(obj, *, tuple_factory=tuple):
              """Return the fields of a dataclass instance as a new tuple of field values.
          
              Example usage::
          
                @dataclass
                class C:
                    x: int
                    y: int
          
                c = C(1, 2)
                assert astuple(c) == (1, 2)
          
              If given, 'tuple_factory' will be used instead of built-in tuple.
              The function applies recursively to field values that are
              dataclass instances. This will also look into built-in containers:
              tuples, lists, and dicts. Other objects are copied with 'copy.deepcopy()'.
              """

However, the make_dataclass function does not support a third-party dataclass factory (e.g., flax.struct.dataclass):

python/cpython/blob/0fb18b02c8ad56299d6a2910be0bab8ad601ef24/Lib/dataclasses.py#L1441-L1528


      
          def make_dataclass(cls_name, fields, *, bases=(), namespace=None, init=True,
                             repr=True, eq=True, order=False, unsafe_hash=False,
                             frozen=False, match_args=True, kw_only=False, slots=False,
                             weakref_slot=False, module=None):
              """Return a new dynamically created dataclass.
          
              The dataclass name will be 'cls_name'.  'fields' is an iterable
              of either (name), (name, type) or (name, type, Field) objects. If type is
              omitted, use the string 'typing.Any'.  Field objects are created by
              the equivalent of calling 'field(name, type [, Field-info])'.::
          
                C = make_dataclass('C', ['x', ('y', int), ('z', int, field(init=False))], bases=(Base,))
          
              is equivalent to::
          
                @dataclass
                class C(Base):
                    x: 'typing.Any'
                    y: int
                    z: int = field(init=False)

(Excerpt truncated.)

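It can only apply dataclasses.dataclass: the function ends with a return statement that calls dataclass on the newly created class (that statement is cut off in the truncated excerpt above).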

This feature request proposes adding a new dataclass_factory argument to dataclasses.make_dataclass to support third-party dataclass transformations, similar to dict_factory for dataclasses.asdict.

# dataclasses.py

def make_dataclass(cls_name, fields, *, bases=(), namespace=None, init=True,
                   repr=True, eq=True, order=False, unsafe_hash=False,
                   frozen=False, match_args=True, kw_only=False, slots=False,
                   weakref_slot=False, module=None,
                   dataclass_factory=dataclass):
    ...

    # Apply the normal decorator.
    return dataclass_factory(cls, init=init, repr=repr, eq=eq, order=order,
                             unsafe_hash=unsafe_hash, frozen=frozen,
                             match_args=match_args, kw_only=kw_only, slots=slots,
                             weakref_slot=weakref_slot)

sobolevn · May 13, 2024, 9:58am

Can you please show an example? How would you want to use this new param?

XuehaiPan · May 13, 2024, 11:06am

I want to re-export the dataclasses functionality in my own package. Here is a snippet to illustrate my use case:

# mypkg/
# ├── __init__.py
# └── dataclasses.py

import dataclasses

from typing_extensions import dataclass_transform  # in typing since Python 3.11

from mypkg import xxx, yyy, zzz

__all__ = ['dataclass', 'field', 'make_dataclass']

# field() must be defined before it is referenced in field_specifiers below.
def field(**kwargs):
    zzz(kwargs)  # do something
    return dataclasses.field(**kwargs)

@dataclass_transform(field_specifiers=(field,))
def dataclass(cls=None, /, **kwargs):
    xxx(kwargs)             # do something

    def wrapper(cls):
        klass = dataclasses.dataclass(cls, **kwargs)
        yyy(klass, kwargs)  # do something else
        return klass

    # Support both @dataclass and @dataclass(...)
    return wrapper if cls is None else wrapper(cls)

def make_dataclass(cls_name, fields, **kwargs):
    return dataclasses.make_dataclass(
        cls_name,
        fields,
        dataclass_factory=dataclass,  # proposed argument; my own dataclass() above
        **kwargs,
    )

The users can do:

import mypkg.dataclasses


@mypkg.dataclasses.dataclass
class Foo:
    x: int
    y: int


Bar = mypkg.dataclasses.make_dataclass('Bar', [('a', float), ('b', int)])

NeilGirdhar · May 13, 2024, 5:00pm

Do you really find Bar as nice as Foo? Seems significantly worse.

Can you not implement make_dataclass in your package by creating a custom type, adding in the annotations you want, and finally applying your dataclass function?
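The approach suggested above can be sketched roughly as follows. This is a simplified illustration, not the stdlib implementation: the helper name make_dataclass_with and the stand-in decorator my_dataclass are hypothetical, and the helper skips the field-name validation and module handling that dataclasses.make_dataclass performs.

```python
import dataclasses
import types

REGISTRY = []  # hypothetical side effect of the custom decorator


def my_dataclass(cls, **kwargs):
    """Stand-in for a third-party decorator built on dataclasses.dataclass."""
    klass = dataclasses.dataclass(cls, **kwargs)
    REGISTRY.append(klass)
    return klass


def make_dataclass_with(cls_name, fields, *, bases=(), namespace=None,
                        dataclass_factory=dataclasses.dataclass, **kwargs):
    """Build a class from field specs, then apply a pluggable decorator."""
    namespace = dict(namespace or {})
    annotations = {}
    for item in fields:
        if isinstance(item, str):
            name, tp = item, 'typing.Any'
        elif len(item) == 2:
            name, tp = item
        else:
            name, tp, spec = item
            namespace[name] = spec  # a dataclasses.field(...) object
        annotations[name] = tp
    namespace['__annotations__'] = annotations
    cls = types.new_class(cls_name, bases,
                          exec_body=lambda ns: ns.update(namespace))
    # The decorator is a parameter instead of a fixed dataclass(...) call.
    return dataclass_factory(cls, **kwargs)


Bar = make_dataclass_with('Bar', [('a', float), ('b', int)],
                          dataclass_factory=my_dataclass, frozen=True)
bar = Bar(1.0, 2)
```

The cost of this route is that every package has to carry its own copy of the class-building logic, which is what a dataclass_factory argument on the stdlib function would avoid.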

XuehaiPan · May 13, 2024, 5:30pm

Yes, the normal use case of the @dataclass decorator is more elegant and readable. But sometimes there are use cases for dynamic class creation, just like subclassing typing.NamedTuple vs. calling collections.namedtuple.

NUM_LAYERS = 32

MyNetwork = dataclasses.make_dataclass('MyNetwork', [(f'layer{i}', Layer) for i in range(NUM_LAYERS)])

NeilGirdhar · May 13, 2024, 5:51pm

I understand, but can you not write a make_dataclass function of your own, without delegating to dataclasses.make_dataclass, using the instructions in my last comment?

Oh, I see, but you want the annotations to be right. Got it.

XuehaiPan · May 13, 2024, 6:04pm

I can do this, but I don’t think ordinary users would understand it, and it is not easy to use. I want to re-export the dataclasses functionality in my package and then ship it to PyPI.

NUM_LAYERS = 32

MyNetwork1 = dataclasses.make_dataclass('MyNetwork1', [(f'layer{i}', Layer) for i in range(NUM_LAYERS)])

MyNetwork2 = type('MyNetwork2', (object,), {'__annotations__': {f'layer{i}': Layer for i in range(NUM_LAYERS)}})
MyNetwork2 = dataclasses.dataclass(MyNetwork2)

Also, I do not want to copy-paste the code of dataclasses.make_dataclass into my package. I want it to always stay in sync with the stdlib.