PEP 712: Adding a "converter" parameter to dataclasses.field

Hi, I’m a developer of the Meson build system, and we were really looking forward to being able to use converters; they would simplify our codebase considerably by allowing us to use dataclasses in more places and remove a lot of __post_init__ uses. Because we exist very low in the stack we absolutely cannot rely on any third-party modules of any kind; everything we use must either be in the standard library or in Meson itself. Doing otherwise makes bootstrapping very painful for OS vendors. For us this makes attrs (which I use otherwise and quite like) unacceptable as a replacement for dataclasses.

Apart from that, there is one case I’ve run into that is impossible to handle correctly today with dataclasses and that would be solved by converters: classes with an initializer argument and an attribute that share a name but have different types. Take this (minimal) example:

from collections.abc import Iterable

class Klass:
    def __init__(self, attr: Iterable[str | int]):
        self.attr = [str(a) for a in attr]

This cannot be written as:

from collections.abc import Iterable
from dataclasses import InitVar, dataclass

@dataclass
class Klass:
    attr: InitVar[Iterable[str | int]]

    def __post_init__(self, attr: Iterable[str | int]) -> None:
        self.attr = [str(a) for a in attr]

Apart from being very verbose, mypy and pyright both reject this, and consider it correct behavior to do so. This creates a situation where backwards compatibility cannot be preserved by migrating to a dataclass, since the signature of the initializer has changed. Converters would solve this issue.

Another issue we run into a lot in Meson is that Iterable[str | T] is somewhat dangerous to use, since str itself matches that protocol but is almost never what you mean, and there is no other good way to spell a covariant list. A converter could handle that with something like:

from collections.abc import Iterable
from dataclasses import dataclass, field

def fix_str(val: Iterable[str | int]) -> list[str | int]:
    if isinstance(val, str):
        return [val]
    return list(val)

@dataclass
class Klass:
    attr: list[str | int] = field(converter=fix_str)

Count me among the disappointed. I can argue more later (after the SC elections).


Perhaps off-topic, but can’t Meson vendor necessary (convenient?) dependencies? Just curious if you considered that approach and why you chose not to.

The two use cases mentioned since the decision are for build systems, which seem to have more experienced and knowledgeable developers behind them. This means they would be more comfortable with custom hacks in dataclasses’ machinery to simplify their codebase.

I see the proposed mechanism as helping less experienced developers interface with third-party libraries, at which point the standard-library argument stops making sense.

If the only beneficiaries are build systems, I don’t think it makes sense to add implicit behaviour that all users must accept.

No. Most Linux distros have policies against vendoring, for the obvious reasons of security and package size. On top of that, because of our place in the stack we have a policy of supporting all non-EOL Python versions, and, when we can do so easily, EOL’d versions as well (LTS Linux like RHEL and Ubuntu LTS), which means we have to consider whether a vendored dependency would have the same support lifetime as Meson itself. Some distros will even patch out the vendored version and insist that the version they provide is used. I can probably dig up a GitHub issue where the finer points of why we won’t vendor third-party software are argued ad nauseam if you’re really interested 🙂


I am also disappointed by the SC recommendation.

One of the selling points of pydantic is that it parses data into the right type for me. A common and nontrivial example is datetime.datetime, where I can pass a timestamp string (usually copied from some JSON or external source) and have the resulting object be what I need.

from pydantic import BaseModel
from datetime import datetime

class MyModel(BaseModel):
    created: datetime

MyModel(created="2023-11-17T12:22:55Z")  # type checking error

Type checkers consider this an error, and understandably so.

But with this PEP:

  • dataclasses could do this for me; all I have to do is specify the right (str) -> datetime converter in the field (see the sketch after this list)
  • this would happily type check
  • (assuming pydantic/etc were to support this) other libraries would get the same type checking benefit via dataclass transform.
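
For concreteness, here is a rough sketch of what the first bullet could look like. This is hypothetical: the converter parameter is only the PEP’s proposed API and does not exist in today’s dataclasses, and MyModel is just the example class from above rewritten as a stdlib dataclass.

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class MyModel:
    # hypothetical: "converter" is the parameter proposed by PEP 712
    created: datetime = field(converter=datetime.fromisoformat)

# __init__ would accept what the converter accepts (a str here),
# while the attribute itself is always a datetime
MyModel(created="2023-11-17T12:22:55+00:00")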

That last bullet is the sticking point for me. The rejection recommendation doesn’t just mean that dataclasses won’t support this, but also that other libraries that convert on initialization are out of luck when it comes to type checking.

(As an afterthought, I find myself wanting this behavior in dataclasses all the time. I will almost always reach for stdlib dataclasses before pydantic—as a heavy pydantic user and big fan—because of how much simpler it is to not have to worry about dependencies or venvs beyond “just use Python 3.XX.”)


I would also like to see this feature in the dataclasses module.

However, rejecting it from dataclasses doesn’t have to mean we can’t add the feature to the @dataclass_transform mechanism that libraries like pydantic use. It already supports some features (e.g., alias) that dataclass doesn’t.


I’d like to add that, as an individual who primarily works on air-gapped systems, I too am disappointed in the response. Out of the half dozen systems I work on, all but one essentially ban third-party software outside of the approved base install. Having more batteries included in the stdlib can help a lot. It makes maintenance easier: I’m not too keen on writing large amounts of the custom hacks Laurie O points to. That’s a huge maintenance burden, and given the nature of the systems I would have to duplicate the work regularly, because you cannot transfer code from one system to another.


I’ve been meaning to reply to this as well. I’m also disappointed about this, and I wanted to share a few places where this would make an impact.

Here’s some code I just looked at the other day. This class is written as an ordinary class rather than a dataclass, probably because converters are not available. The constructor takes sequences and converts them to lists.

If it had been written as a dataclass, all of the properties could disappear and be replaced by making the dataclass frozen. The replace function could probably disappear in favour of dataclasses.replace (or now copy.replace). And the slots could disappear in favour of the slots keyword. This would massively simplify the code.
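
To illustrate the pattern (a hypothetical sketch, not the class referenced above; it assumes the PEP’s converter parameter and the slots support added in Python 3.10):

from dataclasses import dataclass, field

@dataclass(frozen=True, slots=True)
class Example:
    # hypothetical field names; "converter" is the parameter proposed by PEP 712.
    # frozen=True replaces the read-only properties, slots=True replaces a
    # hand-written __slots__, and the converter replaces copying sequences
    # into lists in __init__.
    sources: list[str] = field(converter=list)
    headers: list[str] = field(converter=list)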

Here’s an example from my code that I’m running into. When using Jax, both Jax arrays and floats can be passed to functions. But there are significant benefits to only having Jax arrays: they can be kept on the GPU, they simplify type annotations (Array instead of float | Array), etc. I wish I could make all of these dataclasses have fields with the converter jax.numpy.asarray.
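
A sketch of what that could look like if the PEP were accepted (the class and field names here are made up; jax.numpy.asarray is real, but the converter parameter is only the proposed API):

from dataclasses import dataclass, field

import jax.numpy as jnp
from jax import Array

@dataclass(frozen=True)
class LayerParams:
    # hypothetical: "converter" is the parameter proposed by PEP 712
    weights: Array = field(converter=jnp.asarray)
    bias: Array = field(converter=jnp.asarray)

# Plain Python lists/floats would be converted to Jax arrays on assignment,
# so the annotations can stay Array rather than float | Array.
params = LayerParams(weights=[[1.0, 2.0]], bias=0.0)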


I think the fact that @dataclass_transform exists and yet all type checkers still have to special-case attrs, and that this change would make that a lot better, is a pretty good argument for the PEP.


Seeing so many voice their disappointment, I’d like to comment in support of the SC decision:

  1. Field conversion creates a ton of questions with non-obvious solutions: should setting attributes call the converter? Should default values be converted? Can we derive converters from type annotations? Can converters depend on other fields? Can conversion be bypassed? How to deal with already-converted data? To get a picture of the potential complexity here, have a look at the huge pydantic codebase, reams of documentation on configuration options, GitHub issues, and impact of breaking changes. If we go down this path of added complexity, dataclasses would likely perpetually lag behind in features that users want. The number of use cases is simply too diverse.
  2. Implicit (type) conversion is — to put it mildly — not a universally acknowledged good idea. You need only look at the weak typing of PHP, JavaScript, Scala implicits, and str/unicode in Python 2. Python currently avoids a lot of these implicit type conversion ‘gotchas’. Do we want to encourage this? To my knowledge, no other mainstream language supports similarly implicit conversion on their dataclass/struct equivalent.
  3. As with weak typing, implicit type conversions require strict discipline to get right. By making it so alluringly easy, it’s all too likely for beginners to get stuck in a quagmire of implicit behavior, or performance problems. Dealing with ‘dirty’ data requires thoughtful design. Field converters promise an easy solution where there is none.

Thanks for participating! Just some clarifications (not attempting to be argumentative).

should setting attributes call the converter? Should default values be converted? Can we derive converters from type annotations? Can converters depend on other fields? Can conversion be bypassed? How to deal with already-converted data?

Almost all of these are answered in the PEP, and would also be answered in the documentation. Specifically, conversion happens on attribute assignment unconditionally. So, in order, yes, yes, see the “rejected ideas”, no, no, I’m not sure I understand the question (like how not to double-convert?).
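
In other words, the proposed behaviour would be roughly this (a sketch; Box is a made-up example and converter is the PEP’s proposed parameter, not an existing one):

from dataclasses import dataclass, field

@dataclass
class Box:
    # hypothetical: "converter" is the parameter proposed by PEP 712
    values: list[int] = field(converter=list)

b = Box(values=range(3))  # conversion happens in __init__: b.values == [0, 1, 2]
b.values = (4, 5)         # and again on any later assignment: b.values == [4, 5]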


In the case of non-frozen dataclasses I don’t think that these converters add much because you can just use __init__. With frozen dataclasses that is awkward though. It took me a few tries to come up with this:

from dataclasses import dataclass

@dataclass(frozen=True)
class Point:
    x: float
    y: float

    def __init__(self, x, y):
        # object.__setattr__ bypasses the frozen dataclass's __setattr__,
        # which would otherwise raise FrozenInstanceError
        object.__setattr__(self, 'x', float(x))
        object.__setattr__(self, 'y', float(y))

p = Point(1, 2)

I’m not sure if anything else would break as a result of using __setattr__ like this.

This example also shows a case where implicit type conversions are common in Python: an int can usually be passed in a place where a float is expected. There can be very good reasons to prefer having an actual float though, regardless of whether the “user” of a class should be allowed to pass an int when creating an instance.

Although the converter approach still has the advantage of avoiding all that boilerplate code.


One of the primary motivators of this PEP (and how the PEP started) is missing from your example. How should x and y be annotated?

Isn’t that simply the normal question of how type annotations deal with duck typing? You could annotate x and y as float | int if that’s your intention. Or as ConvertibleToFloat if you define a suitable protocol.

Yes, it’s extra boilerplate, but isn’t that the normal cost you have to pay for duck-typed inputs?
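
For example, the protocol version could be spelled roughly like this (ConvertibleToFloat is just the name used above; typing.SupportsFloat is the closest stdlib equivalent):

from dataclasses import dataclass
from typing import Protocol

class ConvertibleToFloat(Protocol):
    def __float__(self) -> float: ...

@dataclass(frozen=True)
class Point:
    x: float
    y: float

    def __init__(self, x: ConvertibleToFloat, y: ConvertibleToFloat):
        # same trick as above: bypass the frozen __setattr__
        object.__setattr__(self, 'x', float(x))
        object.__setattr__(self, 'y', float(y))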

Almost all of these are answered in the PEP, and would also be answered in the documentation

My bad — having read the posts in this discussion, I missed the actual PEP content 😅

Regarding the specific topics:

  • “How to deal with already-converted data” → If I read the PEP correctly, a converter would not be skipped even if the value already has the field’s type, correct?

    @dataclass
    class A:
        # this would only allow str input, not an already parsed datetime?
        at: datetime = field(converter=datetime.fromisoformat)
    
  • can conversions be bypassed (No) → I would expect there to be demand for this. A typical situation being where a ‘conversion’ or ‘validation’ has already been performed, and one does not want to pay the conversion performance cost again. Pydantic has had .model_construct() for this purpose since forever.
    My personal preference (not likely shared by most people) would be for converters to leave the constructor alone, and only trigger on calling a .convert() alternative constructor.

I am beginning to see that this is quite the can of worms, and that putting a robust, usable solution in the stdlib is harder than it looks. Maybe a few years from now, when things are more settled, we could try again.


Seeing as how this feature is (in a sense) lifted from attrs, and the SC decision might be “just use attrs”, would you say that attrs’s support of converter= isn’t robust or usable? Or just that it isn’t as robust or usable as it would need to be to warrant being in the stdlib?


That’s right. For a converter of type Callable[[X], Y], the field’s declared type must be a supertype of Y, and the user must pass a subtype of X.

Interesting idea. There are some arguments against it though:

  • Data-classes seem to go through a lot of trouble not to pollute the class interface with anything (e.g., replace is not a method).
    • This is consistent with the general OO design principle that anything that can be done using the public interface should be a bare function rather than a method.
    • Adding a replace method breaks any data-classes with a replace member.
  • And, your convert method would violate LSP in the case of inheritance.

I think these optimizations are best left to future Python optimizers, and you should just write converters that know when to skip conversion. Your convert idea only saves some isinstance checks at the cost of significant complexity (for the reader, for the writer, and for Python learners). In the cases where the savings are significant, I think you should write the convert_x factory yourself, which is the status quo anyway.
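
For example, a converter that knows when to skip conversion for the datetime case discussed earlier might look like this (a sketch; the helper name to_datetime is made up):

from datetime import datetime

def to_datetime(value: str | datetime) -> datetime:
    # skip the conversion when the value has already been parsed
    if isinstance(value, datetime):
        return value
    return datetime.fromisoformat(value)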