Dataclasses - make use of Annotated

dataclasses are great and I make extensive use of them, but there is something I have always found inconsistent. Example:

from dataclasses import dataclass, field

@dataclass
class Dummy:
    a: int = field(init=False, default=5)

Having declared a to be of type int, the value assigned to it is the result of the callable field, which (as an end user) I cannot be sure returns an int.
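
For illustration (a small check of my own, not from the library docs beyond what they state): field() does not return an int at all, but a dataclasses.Field instance, which is exactly the mismatch with the annotation:

```python
from dataclasses import field

# field() returns a Field descriptor object, not a value of the annotated type
f = field(init=False, default=5)
print(type(f).__name__)   # Field
print(isinstance(f, int)) # False
```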

Given that Annotated can convey metainformation, would it not be really appropriate to use it for dataclasses, as in:

from dataclasses import dataclass, field
from typing import Annotated

@dataclass
class Dummy:
    a: Annotated[int, field(init=False)] = 5

This should be clearer for everyone:

  • the type of a is int
  • the information that it must not be part of __init__ lives in the metainformation inside Annotated, via field.
  • and the default value is assigned to a as expected.
7 Likes

And here is a small proof of concept

#!/usr/bin/env python
# -*- coding: utf-8; py-indent-offset:4 -*-
###############################################################################
from __future__ import annotations
from collections.abc import Callable
from dataclasses import dataclass, Field, MISSING
import inspect
from typing import Annotated, get_args, get_origin, overload


@overload
def ann_dataclass(cls: None = None, **kwargs) -> Callable[[type], type]:
    ...


@overload
def ann_dataclass(cls: type, **kwargs) -> type:
    ...


def ann_dataclass(cls: type | None = None, **kwargs) -> type | Callable[[type], type]:

    # actual decorator for when cls is not None
    def _annotify(cls: type) -> type:
        # Fetch the annotations using current best practices;
        # from __future__ import annotations may become the default in the future
        ann = inspect.get_annotations(cls, eval_str=True)

        for name, thint in ann.items():
            if get_origin(thint) is not Annotated:
                continue

            # It is an Annotated type hint, see if there is any Field metainfo
            _type, *metainfos = get_args(thint)
            for metainfo in metainfos:
                if not isinstance(metainfo, Field):
                    continue  # not the use case, let it go

                try:
                    default = getattr(cls, name)  # check if default value exists
                except AttributeError:
                    pass
                else:
                    # standard dc check for both default and default_factory
                    if (
                        default is not MISSING
                        and metainfo.default_factory is not MISSING
                    ):
                        raise ValueError(
                            "cannot specify both default and default_factory"
                        )

                    metainfo.default = default  # can be safely assigned

                # record the actual type defined in Annotated
                metainfo.type = _type
                # put the "Field" as default value for dataclass decorator
                setattr(cls, name, metainfo)
                break  # only 1 Field to be processed ... break out

        return dataclass(cls, **kwargs)  # class ready for further processing

    if cls is None:
        return _annotify  # -> Callable[[type], type]

    return _annotify(cls)  # -> type


# Small test
if __name__ == '__main__':
    from dataclasses import field, fields
    from typing import ClassVar

    @ann_dataclass
    class A:
        cv: ClassVar[str] = 'classvar'

        a: int = 5
        b: Annotated[str, field(init=False)] = 'annotated field'

    a = A()
    print(f"{a.__annotations__ = }")
    print(f"{a.cv = }")
    print(f"{a.a = }")
    print(f"{a.b = }")
    print(f"{fields(a) = }")

The output

a.__annotations__ = {'cv': 'ClassVar[str]', 'a': 'int', 'b': 'Annotated[str, field(init=False)]'}
a.cv = 'classvar'
a.a = 5
a.b = 'annotated field'
fields(a) = (Field(name='a',type='int',default=5,default_factory=<dataclasses._MISSING_TYPE object at 0x000001E104BAA6E0>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD), Field(name='b',type='Annotated[str, field(init=False)]',default='annotated field',default_factory=<dataclasses._MISSING_TYPE object at 0x000001E104BAA6E0>,init=False,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD))

Edit: recording the actual type defined in Annotated in the resulting Field

It’s worth noting that the same approach is being used in pydantic:

class Cat(BaseModel):
    name: Annotated[str, Field(title="Name")] = "unknown"

And also in FastAPI:

@app.get('/{id}')
def get_cat(id: Annotated[str, Path()] = "Garfield"):
    pass

More to the point, I think. I knew FastAPI was using Annotated in function signatures, but I didn’t know Pydantic did too (I don’t use it and haven’t explored it).

Quite honestly, I think it’s simply that dataclasses was added in 3.7 and typing.Annotated was added in 3.9.

That’s probably why.

Nothing says it couldn’t also support new/more styles, but you’d have to argue that the benefits of something new outweigh bifurcating what a dataclass looks like.

2 Likes

It’s clear why Annotated was not supported :slight_smile:

But adopting new language/library facilities does not introduce backwards-compatibility issues, because the existing (afterwards “old”) syntax would still be supported and would not be deprecated.

Introducing annotations also changed how classes looked back then, without breaking existing code.

Existing stdlib modules are not changed just because a new thing is added to the language, but only when it solves a problem or really improves something.

From the first post:

I am not a heavy user of typing, but I do not see a problem.
We declare a to be an int variable on instances of the class.
We also assign a to whatever field returns so that the class works.
These two things make sense to me, and type checkers also think it’s fine. No problem here!

2 Likes

Question: could it be that type checkers have simply been “trained” to ignore a Field instance (the return value of field), rather than actually doing real type checking?

In fairness, type checkers think it’s fine because they have a special case for dataclasses. One benefit of the proposal is that it removes the special case, which is a benefit to readers of code.
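
As a small illustration of that special case (a sketch of my own, not from the thread): the @dataclass decorator consumes the Field sentinel, while without the decorator the Field object simply becomes the class attribute:

```python
from dataclasses import dataclass, field

@dataclass
class WithDecorator:
    a: int = field(default=5)

class WithoutDecorator:
    a: int = field(default=5)

# the decorator replaces the Field sentinel with the real default...
print(WithDecorator().a)  # 5
# ...but without it, the Field object leaks through as the attribute value
print(type(WithoutDecorator.a).__name__)  # Field
```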

On the other hand, does this proposal work with the stringifying that currently happens when someone does from __future__ import annotations? Would it be better to wait for PEP 649 to be added to the language?

Yeah, but it’s odd to assign a value of type Field to a variable of type int. There aren’t many places where type checkers think it’s okay for you to do anything like that.

At least mypy and Pyright do type check fields: they verify that the field’s default or default_factory has the correct type.
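
For example (a hypothetical snippet of mine), both checkers flag a default_factory whose return type does not match the annotation, even though Python itself happily runs it:

```python
from dataclasses import dataclass, field

@dataclass
class C:
    # type checkers report an incompatible default_factory here
    # (dict() does not produce a list[int]); the runtime does not care
    xs: list[int] = field(default_factory=dict)

print(C().xs)  # {}
```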

1 Like

To me this would give two ways to do the same thing that look very similar. Generally, one obvious way is better than two. We already have the current way, so yeah.

1 Like

The decorator crafted above as a POC works because inspect.get_annotations is invoked with eval_str=True and the actual annotation is not touched. The Field produced by the evaluation is customized with the end user’s default value and then put in place as the default value, letting the dataclass decorator do its thing.

Well, it doesn’t give two ways to do the same thing.

It updates a way of doing something which is “broken”, because Field instances are assigned to attributes that are annotated with a different type.

If you think it does, we may as well remove make_dataclass, which also gives us a second way to create a dataclass (and, by the way, fails to pass type checking).

1 Like

Expanding on the answer about whether string annotations work. The dataclasses module has provisions to identify ClassVar and InitVar annotations while avoiding (reading the source and the comments therein):

  • Importing typing if the user has not imported it
  • Applying eval to each and every annotation

Recognizing the notation:

  • The same identification technique for ClassVar/InitVar could be applied to Annotated.
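
A minimal sketch of that identification technique (a hypothetical helper of mine, not the actual stdlib code): match the annotation string textually instead of evaluating it:

```python
import re

# hypothetical sketch: detect ClassVar-like annotations by their source text,
# the way dataclasses avoids calling eval for ClassVar/InitVar
def looks_like_classvar(annotation: str) -> bool:
    return re.match(r'(typing\.)?ClassVar(\[|$)', annotation) is not None

print(looks_like_classvar('ClassVar[int]'))         # True
print(looks_like_classvar('typing.ClassVar[str]'))  # True
print(looks_like_classvar('int'))                   # False
```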

Using eval and performance

If the annotation is a string, the arguments of Annotated have to be evaluated with eval, which is what the dataclasses authors tried to avoid (they don’t catch all cases by working with strings, but they catch “enough”).

The trick here is:

  • A Field is created for each attribute in __annotations__
  • The creation of this Field can be delayed by first checking whether Annotated is present and whether a Field shows up when the arguments are evaluated.
  • This means that still one and only one Field will be created.
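
A sketch of that delayed-evaluation idea (a hypothetical helper, not the actual patch): only pay the eval cost when the annotation string textually looks like Annotated[...], and reuse the Field found inside it:

```python
from dataclasses import Field, field
from typing import Annotated, get_args

def maybe_field_from_annotation(ann: str, ns: dict):
    # only pay the eval cost when the string looks like Annotated[...]
    if not ann.startswith(('Annotated[', 'typing.Annotated[')):
        return None
    hint = eval(ann, ns)
    for meta in get_args(hint)[1:]:
        if isinstance(meta, Field):
            return meta  # reuse this Field instead of creating a new one
    return None

ns = {'Annotated': Annotated, 'field': field, 'str': str}
f = maybe_field_from_annotation('Annotated[str, field(init=False)]', ns)
print(f.init)  # False
```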

It seems really doable without going against the spirit of the original dataclasses code.

It certainly is doable. And I suspect that if Annotated had existed when dataclasses was being proposed, it might’ve been used instead of what we have today.

However, just because there’s a new way of doing things doesn’t, I think, justify making it so there are multiple ways of representing the same information.

What’s the value add, that outweighs the cost of additional maintenance, additional type-checker code, documentation, etc…?

Again, I think what you propose is ideal in a vacuum. But it has to be worthwhile to break “there should be one, and preferably only one, obvious way to do it”.

Edit: but this is just my opinion. Please do see what others think. But be prepared with strong arguments for why it needs to exist.

2 Likes

Maybe of interest: beartype, a new runtime type checker, also had to code a special case for dataclasses because the type hints do not match the default values. See [Feature request] Automagical Import Hook · Issue #43 · beartype/beartype · GitHub

I would expect that every new type checker on the market has to make exceptions for dataclasses…

1 Like

I think it’s fairly well known in the typing community that dataclasses need to be special-cased. Until now I hadn’t known why. IMO it should be documented somewhere, ideally in both a PEP and, longer-term, in the formal documentation. I don’t know if it is, though; does anyone have a pointer?

From a typing perspective, how dataclasses fit in is reasonably well explained by PEP 681. The introduction of the PEP highlights a lot of core aspects of dataclasses and how they are exceptional typing-wise. It serves as the main way to help other, similar libraries also be understood by type checkers, mostly by saying: if you are similar enough to dataclasses, you can declare that your library implements a dataclass transform.

I’d say the two biggest special aspects of dataclasses from a typing view are that the decorator derives an __init__ based on the fields, and the specialness of how dataclasses.field behaves. Another view is that dataclasses is a decorator that extends a class with additional methods. There’s currently no good way in the type system to describe that.

For example, let’s say you have a decorator that introduces one new method to your class, like:

def support_metrics(cls):
    def _log_metrics(self, metric: str, value: float) -> None:
        ...

    cls.log_metrics = _log_metrics
    return cls

@support_metrics
class Foo:
    ...

a = Foo()
a.log_metrics("user_event", 1.0)  # Safe and fine at runtime, but no type checker will understand this.

You can view dataclasses as a more complex/much nicer decorator than that one. But how do you say in the type system that you take a cls as input and return that same type extended with new methods? There’s currently no way to do that, which is a big reason why dataclasses is so special. There are various proposals/discussions about ways to support this kind of thing, but I don’t think any of them are at the PEP stage.

5 Likes

One way would be to simply add type intersections. Then you could return the intersection of the type with the type of a protocol:

from typing import Any, Protocol, TypeVar, cast

class HasLogMetrics(Protocol):
    def log_metrics(self, metric: str, value: float) -> None:
        ...

T = TypeVar('T', bound=type[Any])

def support_metrics(cls: T) -> T & type[HasLogMetrics]:
    def _log_metrics(self, metric: str, value: float) -> None:
        ...

    retval = cast(T & type[HasLogMetrics], cls)
    retval.log_metrics = _log_metrics
    return retval


@support_metrics
class Foo:
    ...

a = Foo()  # `a` has type Foo & HasLogMetrics
a.log_metrics('user_event', 1.0)  # Passes.

You can test this today by removing T & and seeing that the code passes (but the original type is lost).

Back in business, the initial set of changes is rather small.

Issue: Dataclasses - Support use of Annotated including `field` as metainfo · Issue #107051 · python/cpython · GitHub
Pull-Request (including documentation update): gh-107051: Dataclasses - Support use of Annotated including field as metainfo by mementum · Pull Request #107052 · python/cpython · GitHub

Best regards

I suspect this’ll require a PEP to be accepted before the behavior can be added.

2 Likes