Implicit `default` for a `dataclass_transform` field specifier

The dataclass_transform field specifiers explicitly allow an implicit init argument via clever (ab)use of function overloads. The same is not allowed for default and default_factory though.

I am sorely missing a way to implicitly specify that a field is an optional argument to the synthesized __init__ method. The implicit init allows me to choose whether the argument appears in __init__ at all, but the only way to make it optional is to explicitly specify a default (and/or factory).

Motivating example: automatic default factories.

@mydataclass
class ArrayClass:
    array_field: list[int] = array()

a = ArrayClass(array_field=[1, 2, 3])
b = ArrayClass()
b.array_field.extend([4, 5, 6])
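
For context, a runtime implementation of array could look something like this sketch (the marker class and the @mydataclass transformer behavior are hypothetical):

from typing import Any

class _ArrayField:
    # Hypothetical marker; the @mydataclass transformer would look for it
    # and treat the field as optional, calling `list` to produce a fresh
    # default value for each instance.
    default_factory = list

def array() -> Any:
    return _ArrayField()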

Is there a specific rationale for this omission?

(If not, might I humbly suggest extending the init behavior to the default behavior? I’m imagining something like the following to allow specifying an “unspecified but implicit default”:

@overload
def array(*, default: Literal[...] = ...):
    ...

)


This is related to this issue in the pyright issue tracker.

The “(ab)use” that you’re referring to has never formally been added to the spec, so arguably you’re already on thin ice relying on this behavior for init, and pyright is on thin ice for supporting it. While pyright implements this behavior, mypy and other type checkers do not.

from typing import Any, dataclass_transform

def custom_field(*, init=False) -> Any: ...

@dataclass_transform(field_specifiers=(custom_field,))
def my_dataclass(cls): ...

@my_dataclass
class MyClass:
    a: int
    b: int = custom_field()

MyClass(1, 2)  # pyright generates an error here but mypy does not

If you think that this mechanism is useful and want it to be supported across type checkers, you’re welcome to propose an update to the typing spec. Here is an outline of the process.

Supporting this same mechanism for default and default_factory would involve significantly hackier code in type checkers because these parameters accept arbitrary values, and the types of these values need to be verified for compatibility with the annotated type of the field. I wouldn’t want to implement it in pyright unless it goes through formal review and becomes part of the typing spec.


The typing spec (and PEP 681 before it) has the following language:

Field specifier functions can use overloads that implicitly specify the value of init using a literal bool value type (Literal[False] or Literal[True]).

Plus an example doing exactly that.
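
From memory, the spec’s example is roughly along these lines (paraphrased, so check the spec for the exact code):

from collections.abc import Callable
from typing import Any, Literal, overload

@overload
def model_field(
    *,
    default: Any | None = ...,
    resolver: Callable[[], Any],
    init: Literal[False] = False,
) -> Any: ...
@overload
def model_field(
    *,
    default: Any | None = ...,
    resolver: None = None,
    init: bool = True,
) -> Any: ...
def model_field(
    *,
    default: Any | None = None,
    resolver: Callable[[], Any] | None = None,
    init: bool = True,
) -> Any: ...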

Of course, what I’m really interested in is what your example is doing: just specify a default value for the argument in the field specifier, and it gets magically picked up.

The spec doesn’t explicitly say that this should happen, and I am also curious as to why not. From the user’s point of view, the song-and-dance about overloads with literals only applies to highly specific situations, while “just pick up the values from the function signature” seems generally useful. Plus, it would make the typechecker more closely match run-time behavior: if, say, my field() returns an object with an init attribute, to be picked up by the dataclass transformer, and the signature provides a default value for it, that default is what the transformer will end up seeing, right?
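
(A minimal runtime sketch of what I mean, with a made-up FieldInfo carrier:)

from typing import Any

class FieldInfo:
    # Made-up carrier object; the transformer reads `.init` off the
    # returned value when synthesizing __init__.
    def __init__(self, init: bool = False) -> None:
        self.init = init

def field(*, init: bool = False) -> Any:
    # With no argument at the call site, the transformer sees the
    # signature's default (False) at runtime; that is the same value
    # I'd want the typechecker to pick up.
    return FieldInfo(init=init)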


I don’t see how type-checking the default value of the default argument is any different from type-checking an explicitly provided value.

But I am just a user of type-checkers, not implementer, which is why I’m coming to a forum first to learn more about some practical problems with this.

I might try to dive into pyright and see what implementing this would look like, but I’d appreciate any insight you can share, given that at this moment I have zero idea of how it works :woman_shrugging:


Speaking of, as a user, I would be perfectly happy with a new argument to the field specifier, say, init_optional: bool defaulting to False, without ever providing the actual default that I’d want to use, leaving the behavior completely up to the transformer code.
The default argument is just a convenient way to trigger the desired behavior, that is, allowing callers to either provide or leave out that argument to __init__.


Thank you, I’m now at step one of that process :slight_smile:

I had forgotten that this behavior for init was documented in the spec. Thanks for pointing that out. I was confusing it with kw_only, which is not included in the spec as having this same behavior.

It looks like mypy does implement this behavior for init, but only if you provide an explicit Literal[False] type annotation in the field specifier’s signature, as shown in the spec’s example code.

That means we’re on firmer ground than I had previously thought.

Here’s a more complete example of what I think you are proposing:

from typing import Any, dataclass_transform

def custom_field(*, default: list[Any] = []) -> Any: ...

@dataclass_transform(field_specifiers=(custom_field,))
def my_dataclass(cls): ...

@my_dataclass
class ArrayClass:
    array_field: list[int] = custom_field()

a = ArrayClass()

Here’s what I mean by needing to apply additional checks for the default type.

# I've changed the default to `list[str]` in the line below
def custom_field(*, default: list[str] = []) -> Any: ...

...

@my_dataclass
class ArrayClass:
    # The following line should now generate a type error because
    # the default `list[str]` is not assignable to the `list[int]` field
    array_field: list[int] = custom_field() # Type error

The following code typechecks today, in both pyright and mypy:

import typing as t
import typing_extensions as tx

def custom_field(*, default: int = 0) -> t.Any:
    ...

@tx.dataclass_transform(field_specifiers=(custom_field,))
def my_dataclass(cls):
    ...

@my_dataclass
class Bar:
    a: str = custom_field(default=0)

That is arguably an incompleteness of the typechecker, though I don’t think the spec prohibits it?

Nonetheless, it seems to me that there is very little difference between the typechecker evaluating (a) the type of the default argument plus its explicitly provided value at the call site, versus (b) the type of the default argument plus its implicit value from the signature.
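
Concretely (a self-contained sketch; both fields exercise the same assignability check, with the value coming from different places):

import typing as t
import typing_extensions as tx

def custom_field(*, default: int = 0) -> t.Any: ...

@tx.dataclass_transform(field_specifiers=(custom_field,))
def my_dataclass(cls): ...

@my_dataclass
class Bar:
    a: str = custom_field(default=0)  # (a) explicit value at the call site
    b: str = custom_field()           # (b) implicit value from the signature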

Answering my own “why?”, I think I figured out the problem with just lifting the default arguments:

from typing import Any

_NO_DEFAULT_VALUE = object()

def custom_field(*, default: Any = _NO_DEFAULT_VALUE) -> Any:
    ...

@my_dataclass
class Bar:
    a: str = custom_field()

b = Bar()  # This should be an error because `a` is not specified.

If the spec said “field specifiers must respect default arguments”, it would make it needlessly complex to specify a “default has not been provided” situation – for every combination of the other arguments, I would need to write an overload both with and without the default argument.
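
For illustration, here is what that doubling looks like with just one other parameter (kw_only); each additional parameter doubles the overload count again:

from typing import Any, overload

_NO_DEFAULT = object()

@overload
def custom_field(*, kw_only: bool = ...) -> Any: ...                # "no default" case
@overload
def custom_field(*, default: Any, kw_only: bool = ...) -> Any: ...  # "has default" case
def custom_field(*, default: Any = _NO_DEFAULT, kw_only: bool = False) -> Any: ...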

(This is unlike the init situation, where the semantic value of init must be either True or False and there is no “user didn’t specify” state, which makes it safe to lift the default value from the field specifier signature.)

Even the spelling I came up with in the OP would not fully solve this:

def custom_field(*, default: Literal[...] = ...):

because as far as the typechecker knows, maybe I just want the call site to specify custom_field(default=...).


I’ll think some more about this, but right now I don’t have any more ideas about how to piggyback the feature onto the default parameter.

So the only viable idea would be to introduce a new parameter.

Spec draft

auto_default is an optional bool parameter that indicates whether this field can automatically provide a default value. If unspecified, it defaults to False.
If set to True, the dataclass will generate a default value for the field in case it is neither provided as an argument to __init__ nor specified via one of default, default_factory, or factory.
Field specifier functions can use overloads that implicitly specify the value of auto_default using a literal bool value type (Literal[False] or Literal[True]).
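
A sketch of a field specifier opting in under this draft (names are illustrative):

from typing import Any, Literal, dataclass_transform

# Implicitly sets auto_default=True for every call, via the same
# overload/literal mechanism the spec already allows for `init`.
def custom_field(*, auto_default: Literal[True] = True) -> Any: ...

@dataclass_transform(field_specifiers=(custom_field,))
def my_dataclass(cls): ...

@my_dataclass
class ArrayClass:
    array_field: list[int] = custom_field()

ArrayClass()                       # OK: the transform supplies a default
ArrayClass(array_field=[1, 2, 3])  # OK: caller provides a value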

Motivation

There is currently no way to specify that a certain field of a dataclass will be filled in automatically if the user does not provide a value. The only option is to explicitly provide a default value, or a default factory, when specifying the field. That solution can be needlessly repetitive for certain kinds of DSLs.

Consider a protobuf DSL. The message structure looks like this:

message Foo {
    required string name = 1;
    optional uint32 value = 2 [default=5];
    optional uint32 amount = 3;
    repeated uint32 array = 4;
}

The protobuf specification implies the following behavior:

class Foo(proto.Message):
    name: str = proto.required(1)
    value: int = proto.optional(2, default=5)
    amount: int | None = proto.optional(3, default=None)
    array: list[int] = proto.repeated(4, default_factory=list)

From the user’s point of view, the default=None and default_factory=list are completely superfluous: they are implied by the fact that the field is optional or repeated, respectively.

We would like the corresponding class to look like this:

class Foo(proto.Message):
    name: str = proto.required(1)
    value: int = proto.optional(2, default=5)
    amount: int | None = proto.optional(3)
    array: list[int] = proto.repeated(4)
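
The specifier signatures enabling this might look something like the following under the draft (the proto module and these signatures are hypothetical):

from typing import Any, Literal, overload

def required(tag: int) -> Any: ...

@overload
def optional(tag: int, *, auto_default: Literal[True] = True) -> Any: ...
@overload
def optional(tag: int, *, default: Any) -> Any: ...
def optional(tag: int, *, default: Any = None, auto_default: bool = True) -> Any: ...

def repeated(tag: int, *, auto_default: Literal[True] = True) -> Any: ...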

Other work

I am one of at least two people who want this (given that this issue exists).
On the other hand, a feature like this is not discussed in the dataclass_transform PEP, which suggests that this idea isn’t especially popular in the wider community of dataclass_transform users.

Backwards compatibility

Existing implementations might already be using the name auto_default. A survey of the ecosystem would need to be done before selecting a name.

Alternatives

None come to mind right now.


The following code typechecks today, in both pyright and mypy … That is arguably an incompleteness of the typechecker, though I don’t think the spec prohibits it?

The reason this type checks is the way your custom_field function is defined. If you look at the way dataclasses.field is defined, you’ll see that it includes several overloads. If you incorporate similar overloads into your custom_field definition, both pyright and mypy will detect type errors in this code sample.

More generally, can you get the behavior you’re looking for by replicating the overloads for dataclasses.field?

from typing import Any, dataclass_transform, overload

@overload
def custom_field(*, default: None = None) -> list[int]: ...
@overload
def custom_field[T](*, default: T) -> T: ...
def custom_field(*, default: Any = None) -> Any: ...

@dataclass_transform(field_specifiers=(custom_field,))
def my_dataclass(cls): ...

@my_dataclass
class ArrayClass:
    a: list[int] = custom_field()
    b: int = custom_field(default=2)
    c: int = custom_field(default="")  # Type error

Out of curiosity, which dataclass-like library are you using here? Or are you developing a new one?

I have not been able to trick the typechecker into it yet (no big surprise, as the spec does not really seem to support this behavior).

To be clear, I would like neither of the following to fail:

@my_dataclass
class ArrayClass:
     a: list[int] = custom_field()

a = ArrayClass()  # fails: missing parameter for "a"
b = ArrayClass(a=[1, 2, 3])

It’s a new thing. Multiple new things, to be precise: essentially DSLs for things that can be declaratively described as a dataclass.
Namely:

  • protobuf messages
  • fields of a Bitcoin PSBT structure
  • C-like structs (looking for a more concise syntax for construct_typing)

Actually! There is a really stupid thing I can do to cover my use case.
I can use the overloads that you mention, to make sure that the default value is typechecked right…
…and then make custom_field not a field specifier.

This way the typechecker assumes that the result of custom_field() is itself a default value, which triggers the behavior that I want.
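
Roughly like this (a sketch under those assumptions):

from typing import Any, dataclass_transform, overload

@overload
def custom_field() -> list[int]: ...
@overload
def custom_field[T](*, default: T) -> T: ...
def custom_field(*, default: Any = None) -> Any: ...

# Crucially, custom_field is NOT listed in field_specifiers, so the
# typechecker treats `custom_field()` as an ordinary default value.
@dataclass_transform()
def my_dataclass(cls): ...

@my_dataclass
class ArrayClass:
    a: list[int] = custom_field()

ArrayClass()             # OK: `a` appears to have a default
ArrayClass(a=[1, 2, 3])  # OK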

The only problem with that approach would be if I wanted the same field specifier to trigger this behavior, and (with a different overload perhaps) use one of init, kw_only, alias.

My use cases want none of those things, so I’m fine.
It still seems that someone could want those things. But then again, overall interest in this kind of feature seems to be low.

I would also like to have a good solution to this case. I made a custom dataclass in which field specifiers always assume that the default is None. For some time I didn’t even know that you can pass field_specifiers to @dataclass_transform, and as @matejcik describes, not assigning field_specifiers helps with type checking in this case. One small downside is that when hovering over a class, the list of parameters is cluttered with the field specifiers (and their arguments) for each field, because they are seen as the fields’ default values, like:

class MyCustomDataclass(
  field1: str | None = custom_field(somecustomarg=True, anothercustomarg=False),
  field2: int | None = custom_field(somecustomarg=True, anothercustomarg=True),
  field3: list[str] | None = custom_field(somecustomarg=False, anothercustomarg=False),
)

And as @matejcik says, while using @dataclass_transform without field_specifiers, I can’t benefit from using other field specifier arguments.

I am one of at least two people who want this (given that this issue exists).

Well, now there’s three of us. Or maybe you can also count the people I work with who are using my custom dataclass (which I may have designed in too non-standard a way, so that’s on me).