Draft PEP - Adding "converter" dataclasses field specifier parameter

(Pre-PEP looking for feedback and a sponsor.)

PEP: 9999
Title: Adding “converter” dataclasses field specifier parameter
Author: Joshua Cannon <joshdcannon@gmail.com>
Sponsor: TBD
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 01-Jan-2023

Abstract

:pep:`557` added dataclasses to the Python stdlib. :pep:`681` added
dataclass_transform to help type checkers understand several common
dataclass-like libraries, such as attrs, pydantic, and object
relational mapper (ORM) packages such as SQLAlchemy and Django.

A common feature these libraries provide over the standard library
implementation is the ability for the library to convert arguments given at
initialization time into the types expected for each field using a
user-provided conversion function.

Motivation

There is no existing, standard way for dataclasses or third-party
dataclass-like libraries to support argument conversion in a type-checkable
way. To work around this limitation, library authors/users are forced to
choose to:

  • Opt in to a custom Mypy plugin. These plugins help Mypy understand the
    conversion semantics, but not other tools.
  • Shuck conversion responsibility onto the caller of the dataclass
    constructor. This can make constructing certain dataclasses unnecessarily
    verbose and repetitive.
  • Provide a custom __init__ which declares “wider” parameter types and
    converts them when setting the appropriate attribute (see the sketch
    below). This not only duplicates the typing annotations between the
    converter and __init__, but also opts the user out of many of the
    features dataclass provides.
  • Not rely on, or ignore type-checking.

None of these choices are ideal.
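
To illustrate the custom-__init__ workaround, here is a minimal sketch (the
Point class and its field are hypothetical, purely for illustration):

  import dataclasses

  @dataclasses.dataclass(init=False)
  class Point:
      x: int

      # The "wider" parameter type must be written and kept in sync by hand,
      # and the synthesized __init__ (default handling, field ordering, etc.)
      # is given up entirely.
      def __init__(self, x: int | str) -> None:
          self.x = int(x)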

Rationale

Adding argument conversion semantics is useful and beneficial enough that most
dataclass-like libraries provide support for it. Adding this feature to the
standard library means more users are able to opt in to these benefits without
requiring third-party libraries. Additionally, third-party libraries are able
to clue type-checkers into their own conversion semantics through added
support in dataclass_transform, meaning users of those libraries benefit as
well.

Specification

New converter parameter

This specification introduces a new parameter named converter to the
dataclasses.field function. When an __init__ method is synthesized by
dataclass-like semantics, if an argument is provided for the field, the
dataclass object’s attribute will be assigned the result of calling the
converter with a single argument: the provided argument. If no argument is
given, the normal dataclass semantics for defaulting the attribute value
are used, and conversion is not applied to the default value.

Adding this parameter also implies the following changes:

  • A converter attribute will be added to dataclasses.Field.
  • converter will be added to the parameters supported on field specifiers
    passed to typing.dataclass_transform’s field_specifiers parameter.

Example


  @dataclasses.dataclass
  class InventoryItem:
      # `converter` as a type
      id: int = dataclasses.field(converter=int)
      skus: tuple[int, ...] = dataclasses.field(converter=tuple[int, ...])
      # `converter` as a callable
      names: tuple[str, ...] = dataclasses.field(
        converter=lambda names: tuple(map(str.lower, names))
      )

      # Since the value is not converted, type checkers should flag the default
      # as having the wrong type.
      # There is no error at runtime however, and `quantity_on_hand` will be
      # `"0"` if no value is provided.
      quantity_on_hand: int = dataclasses.field(converter=int, default="0")

  item1 = InventoryItem("1", [234, 765], ["PYTHON PLUSHIE", "FLUFFY SNAKE"])
  # `item1` would have the following values:
  #   id=1
  #   skus=(234, 765)
  #   names=('python plushie', 'fluffy snake')
  #   quantity_on_hand='0'
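
For clarity, the following hand-written class approximates the behavior the
draft describes for the quantity_on_hand field above (the _DEFAULT sentinel
is illustrative, not part of the proposal):

  _DEFAULT = object()  # illustrative sentinel standing in for "no argument"

  class Equivalent:
      def __init__(self, quantity_on_hand=_DEFAULT):
          if quantity_on_hand is _DEFAULT:
              # No argument was supplied: the default is assigned unconverted.
              self.quantity_on_hand = "0"
          else:
              # An argument was supplied: it is passed through the converter.
              self.quantity_on_hand = int(quantity_on_hand)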

Impact on typing

converter arguments are expected to be callable objects which accept a
single (unary) argument and return a value compatible with the field’s
annotated type. The type of the callable’s unary argument is used as the type
of the corresponding parameter in the synthesized __init__ method.
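
For example (to_int is an illustrative converter, written assuming the
proposed semantics):

  def to_int(value: str | int) -> int:
      return int(value)

  @dataclasses.dataclass
  class Example:
      x: int = dataclasses.field(converter=to_int)

  # Type checkers would synthesize:
  #   def __init__(self, x: str | int) -> None: ...
  Example("42")  # OK; `x` is assigned to_int("42") == 42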

Type-narrowing the argument type

For the purpose of deducing the type of the argument in the synthesized
__init__ method, the converter argument’s type can be “narrowed” using
the following rules:

  • If the converter is of type Any, it is assumed to be callable with a
    unary Any typed-argument.
  • All keyword-only parameters can be ignored.
  • **kwargs can be ignored.
  • *args can be ignored if any parameters precede it. Otherwise if *args
    is the only non-ignored parameter, the type it accepts for each positional
    argument is the type of the unary argument. E.g. given params
    (x: str, *args: str), *args can be ignored. However, given params
    (*args: str), the callable type can be narrowed to (__x: str, /).
  • Parameters with default values that aren’t the first parameter can be
    ignored. E.g. given params (x: str = "0", y: int = 1), parameter y can
    be ignored and the type can be assumed to be (x: str).

Type-checking the return type

The return type of the callable must be a type that’s compatible with the
field’s declared type. This includes the field’s type exactly, but can also be
a type that’s more specialized (such as a converter returning a list[int]
for a field annotated as list, or a converter returning an int for a
field annotated as int | str).

Overloads

The above rules should be applied to each @overload of an overloaded
function. If, after these rules are applied, an overload is invalid (either
because it would not accept a unary argument, or because it does not return
an acceptable type), it should be ignored.
If multiple overloads are valid after these rules are applied, the
type-checker can assume the converter’s unary argument type is the union of
each valid overload’s unary argument type. If no overloads are valid, it is a
type error.

Example


  # The following are valid converter types, with a comment containing the
  # synthesized __init__ argument's type.
  converter: Any  # Any
  def converter(x: int): ...  # int
  def converter(x: int | str): ...  # int | str
  def converter(x: int, y: str = "a"): ...  # int
  def converter(x: int, *args: str): ...  # int
  def converter(*args: str): ...  # str
  def converter(*args: str, x: int = 0): ...  # str

  @overload
  def converter(x: int): ...  # <- valid
  @overload
  def converter(x: int, y: str): ...  # <- ignored
  @overload
  def converter(x: list): ... # <- valid
  def converter(x, y=...): ...  # int | list

  # The following are valid converter types for a field annotated as type `list`.
  def converter(x) -> list: ...
  def converter(x) -> Any: ...
  def converter(x) -> list[int]: ...

  @overload
  def converter(x: int) -> tuple: ... # <- ignored
  @overload
  def converter(x: str) -> list: ... # <- valid
  @overload
  def converter(x: bytes) -> list: ... # <- valid
  def converter(x): ... # __init__ would use argument type `str | bytes`.

  # The following are invalid converter types.
  def converter(): ...
  def converter(**kwargs): ...
  def converter(x, y): ...
  def converter(*, x): ...
  def converter(*args, x): ...

  @overload
  def converter(): ...
  @overload
  def converter(x: int, y: str): ...
  def converter(x=..., y=...): ...

  # The following are invalid converter types for a field annotated as type `list`.
  def converter(x) -> tuple: ...
  def converter(x) -> Sequence: ...

  @overload
  def converter(x) -> tuple: ...
  @overload
  def converter(x: int, y: str) -> list: ...
  def converter(x=..., y=...): ...

Reference Implementation

The `attrs <#attrs-converters>`_ library already includes a converter
parameter matching these semantics.

The reference implementation

Rejected Ideas

Just adding “converter” to dataclass_transform’s field_specifiers

The idea of isolating this addition to dataclass_transform was briefly
discussed in `Typing-sig <#only-dataclass-transform>`_, where it was suggested
that the proposal be opened up to dataclasses itself.

Additionally, adding this to dataclasses ensures anyone can reap the
benefits without requiring additional libraries.

Automatic conversion using the field’s type

One idea could be to allow the type of the field specified (e.g. str or
int) to be used as a converter for each argument provided.
`Pydantic's data conversion <#pydantic-data-conversion>`_ has semantics which
appear to be similar to this approach.

This works well for fairly simple types, but leads to ambiguity in expected
behavior for complex types such as generics. E.g. for tuple[int] it is
ambiguous whether the converter is supposed to simply convert an iterable to a
tuple, or whether it is additionally supposed to convert each element to int.
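
To make the ambiguity concrete (plain Python, no dataclasses involved):

  value = ["1", "2"]

  # Interpretation A: convert only the container; elements pass through.
  tuple(value)                   # ("1", "2"), elements still strings
  # Interpretation B: additionally convert each element to int.
  tuple(int(x) for x in value)   # (1, 2)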

Converting the default values

Having the synthesized __init__ also convert the default values (such as
default or the return value of default_factory) would make the expected type
of these parameters complex for type-checkers, and does not add significant
value.

References

.. _#typeshed: GitHub - python/typeshed: Collection of library stubs for Python, with static types
.. _#attrs-converters: attrs by Example - attrs 21.2.0 documentation
.. _#only-dataclass-transform: Mailman 3 - PEP for dataclass_transform support for converter field descriptor parameter - Typing-sig - python.org
.. _#pydantic-data-conversion: Models - pydantic

Copyright

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.


I’m going to start hacking on a reference implementation in pyright as well, with the hope that I can add it to the PEP under Reference Implementation.

@ericvsmith Would you be willing to Sponsor this PEP?

I’ll sponsor it, even if I advocate against it. Not that I’ve decided: I’m out of town on work, and I haven’t had time to read it. If you don’t hear from me by the end of next week, please ping me here.

Thanks for writing the PEP!


Thank you very much. I appreciate it a lot.

I’ll wait until then to get the PR in the PEPs repo to give this more time to bake for you and others, and maybe start hacking on a pyright implementation.

Ping :smiley:

I guess technically that also covers alternate constructors, which I assume would be the current way to handle this?

from dataclasses import dataclass
from pathlib import Path

@dataclass
class MyPath:
    pth: Path = Path("/usr/bin/python")

    @classmethod
    def create(cls, pth: str | Path = "/usr/bin/python"):
        pth = Path(pth)
        return cls(pth) 

This definitely does add some repetition and requires remembering to call .create instead of directly using the class, but it should cover type narrowing correctly. I can see this being awkward with lots of parameters, although maybe it should be explicitly mentioned in the PEP?


This is under rejected ideas, but is it not simpler for the actual __init__ function to convert the default values than not to convert them? Otherwise won’t it have to check specifically whether the input value is the default value to know not to convert it? Unless I’m misunderstanding something, it seems this is what attrs does already.

from attrs import define, field
from pathlib import Path

@define
class ConverterPath:
    pth: Path = field(default="/usr/bin/python", converter=Path)

p = ConverterPath()  # ConverterPath(pth=PosixPath('/usr/bin/python'))

Finally got to opening the PR: PEP 9999: Adding "converter" dataclasses field specifier parameter by thejcannon · Pull Request #3095 · python/peps · GitHub

It’s actually not any easier/harder to implement it either way. The more important thing though isn’t ease of implementation, but correctness.

To me pth: Path = field(default="/usr/bin/python", converter=Path) says (in English) pth is a field of type Path whose default is the string "/usr/bin/python", which also has a converter that XYZ. It’s semantically incorrect to declare that a Path field has a string default value.

That doesn’t mean we may never allow conversion of the default; however, let’s do the easy and correct thing first, and if someone is inclined to advocate for the harder thing, they can.

To be fair, I did say ‘simpler’, not ‘easier’, and I’m referring to the implementation of converters in dataclasses, not the type-checking implementation for dataclass_transform. Not converting default values requires an extra check in code generation for the case of a default with a converter, and an extra check within the __init__ function for whether the input is the default, in order to skip conversion. It is simpler to just convert everything.

It may be arguably incorrect to use such a default, but the purpose was to demonstrate that this is the existing behaviour of attrs converters. This would be intentionally making the dataclasses behaviour differ. I assumed the desire for converter support in dataclass_transform comes from this implementation, so I would have expected it to reflect this existing behaviour.

attrs currently generates this for __init__ in this example:

def __init__(self, pth=attr_dict['pth'].default):
    _setattr = _cached_setattr_get(self)
    _setattr('pth', __attr_converter_pth(pth))

so __attr_converter_pth is called on any input, including the defaults. (Values from default factories are also converted.)


Surprisingly, my motivation comes from a project not using attrs or pydantic or any of the other “dataclass”-like libraries. (We require dataclasses be immutable, so converters help us accept Iterable but convert to tuple, etc…).

Regardless, I’m on the fence. On one hand, there’s “correctness” where the default really ought to be the type of the field. On the other hand, there’s caller-simplicity. (I don’t really consider implementation simplicity a factor here. It’s a line or two we’re talking about).

In the end, this is Python so caller-simplicity should probably win. Sorry type-checkers :sweat_smile:

I’ll edit.

Don’t forget None is a common default, as well as other custom sentinel values.

Would you mind elaborating on what you’re advocating for? Are you saying that we should be careful in unconditionally applying the converter to the default value, because code authors might forget this common pitfall?

Is that a request for me? My post isn’t advocating for anything; it’s simply a reminder not to forget about common practices when designing your API.

If you want a suggestion from me: either don’t convert the default value, or explicitly tell users (in documentation) that None is not supported as a default for the majority of cases where a converter is provided (e.g. the converter would have to explicitly handle None).
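
For example, a converter written to handle a None default explicitly (the name to_optional_path is just illustrative):

from pathlib import Path

def to_optional_path(value: str | Path | None) -> Path | None:
    # Explicitly pass a None default through unconverted.
    if value is None:
        return None
    return Path(value)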

Edit: I misunderstood. I was considering the default value specified at class definition, not provided as the argument for instantiation.

I think in general you would expect default values to go through the same process as any other input.

As explained before, if you’re looking for common or current practice, you can look at attrs, which already implements this and converts everything. Changing this for dataclasses might be surprising for anyone switching from attrs to dataclasses, or who has to work on projects with both.

The more I look at it, the more I think there’s the potential for surprising behaviour if you don’t convert the default. Below are examples of unexpected behaviour with what I think are the two ‘obvious’ implementations of this.

Example 1 - dataclasses keeps handling defaults the same way (provided in the function signature), and the generated code checks whether the argument matches the default to decide whether to convert.

@dataclass
class MakeHex:
    hex_field: str = field(default='0x0', converter=hex)

MakeHex()  # Works and gives the default value
MakeHex(0)  # Works with the converter as intended
MakeHex("0x0")  # Works because it *is* the default
MakeHex("0x1")  # Fails

This comes up with any converter that doesn’t accept its own output as input.
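
(hex is exactly such a converter: it produces strings but only accepts integers.)

>>> hex(0)
'0x0'
>>> hex('0x0')
Traceback (most recent call last):
  ...
TypeError: 'str' object cannot be interpreted as an integer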

Example 2 - dataclasses uses sentinels for default values, similar to how default factories are handled, and checks for these specifically.

@dataclass
class SpecialCasedStr:
    val: None | str = field(default=None, converter=str)

SpecialCasedStr()  # Works - SpecialCasedStr(val=None)
SpecialCasedStr(val=None)  # "works" but not as expected 
                           # SpecialCasedStr(val="None")

This sort of case could easily occur if the class ends up being called through another function that passes in the arguments explicitly.


I think in general that a default value should be a valid value for the field, not something that needs to be converted/processed.

Your example 1 is unexpected, I think, simply because it’s weird. If you have a field called hex_field that is defined as a string, you should be able to assign a string to it. The fact that MakeHex("0x1") fails is unexpected because "0x1" is a valid value for the field, not because the default isn’t passed to the converter.

Your example 2 is unexpected, because the converter doesn’t pass valid values through unchanged.

IMO, the problem here is that if I see val: None | str, I expect that to be the signature for the constructor. A converter, to me, allows additional types to be passed and handled in a “do what I mean” fashion, but if I pass a value that’s already valid according to the type signature, it should be accepted unchanged.

Maybe that’s why I don’t particularly like or use converters - they violate my expectations. But it’s not because of the handling of defaults, it’s because of the handling of the type.

My intuition for a converter is that its return type must be a subset of the declared type of the field, and its argument type should be distinct from the declared type of the field. The converter gets called when a value is passed which is not the declared type, and is required to convert that value to the declared type. The default value is expected to be a value of the declared type of the field, and will not be passed to the converter (just like any other already-valid type).

The short version of this is that I think it should be the job of the converter to decide how to handle defaults/default types and not the job of dataclasses to decide what gets sent to the converter. The person writing the converter knows what they want to handle while dataclasses has to assume things.

My other goal here is to avoid unnecessary friction between using attrs and dataclasses; if dataclass_transform is going to support converters for type checkers, I hope that it at least provides a way to support them as they currently exist in attrs.


The point Laurie made seemed to be that converters might not be able to handle None or sentinels, so skipping the converter would help, as you wouldn’t need to define a function to handle them specifically. Both of my examples were just made quickly to illustrate that detecting the ‘default’ is not necessarily free of unexpected behaviour.

In order to support those valid fields, the converter would need to recognise str inputs and validate them. At which point '0x0' is probably a valid input to the converter anyway.

Yes, but if None is a valid value the converter would have to be able to handle it.

In both of these cases the purpose of skipping the converter seems to be lost as the converter has to be able to handle the default value anyway.


This may make intuitive sense if you’re thinking about converters purely as changing the type, but converters could also change a value without changing its type.

Toy example:

@dataclass
class Loud:
    shout: str = field(default="HELLO", converter=str.upper)

I don’t like them particularly either, to the point where I removed them from my own dataclasses-like implementation in favour of a more flexible __post_init__ that accepted fields as arguments.

Essentially converting this:

@my_dataclass
class X:
    x: int = 0

    def __post_init__(self, x: int | str):
        self.x = int(x) if type(x) is str else x

into this:

class X:
    x: int

    def __init__(self, x: int | str = 0):
        self.__post_init__(x=x)

    def __post_init__(self, x: int | str):
        self.x = int(x) if type(x) is str else x

I think this has its own issues, so I wouldn’t propose it here.


I agree that whether or not a converter should be called shouldn’t depend on inspecting the type of the value. And it should probably match attrs.


Just to drive the point home on why roll-your-own conversion is a non-starter for me.

Even in the “toy” example of converting to int, what is the type signature of int? And how many times do you want to repeat the input type (if even possible to express today).

For the record, the dataclasses code generation in my branch used a sentinel value for the default, and if the value matched, it would then assign the attribute the real default value. So it wasn’t checking type or value equality of the user-provided value. This is how default_factory works (mostly).

Not advocating for it, but simply wanting the discussion to continue to reflect reality.