PEP 712: Adding a "converter" parameter to dataclasses.field

I have no opinion on attrs, since I have never used it (so far dataclasses have always scratched my – pretty mundane – itches).

Lifting something from attrs isn’t always obvious given that it (likely) depends on other features of attrs that may or may not already exist in dataclasses. Even if it were obvious, is the attrs version also stable? When did converters in attrs last grow a subtle new behavior?

Also, there seems to be a lot of discussion in this thread (post SC decision) where people describe desirable behaviors that aren’t in the PEP. Are those in attrs? I think it’s up to the proponents of getting the stdlib to parity with converters in attrs to research all this.

Agree that it’s not a good idea to add a method to the namespace. The use case for bypassing validation still stands though. Yes, in some cases it’s a simple isinstance, but once you hit any collection it becomes an issue. Take the example in the PEP:

@dataclass
class InventoryItem:
    names: tuple[str, ...] = field(converter=lambda names: tuple(map(str.lower, names)))

An isinstance(x, tuple) check is not enough: you’ll need an O(n) check at least. Just this month I encountered the impact of this in a codebase naively using pydantic: constructing a collection-heavy class took a whopping 5 seconds looping and validating (already-valid) objects!

[…] and you should just write converters that know when to skip conversion.

As the example shows, writing good converters is tricky and it’s way too easy to overlook fundamental issues — until you discover these edge cases at runtime.
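
To make “know when to skip conversion” concrete, here is a minimal sketch (the name lowered_names is mine, not from the PEP) of what a skip-aware converter for the names field above would have to look like. Even its “skip” path is an O(n) scan:

from collections.abc import Iterable

def lowered_names(names: Iterable[str]) -> tuple[str, ...]:
    # Cheap check: is it already the right container type?
    if isinstance(names, tuple):
        # That alone proves nothing about the elements, so to safely skip
        # conversion we still have to walk all of them: an O(n) scan on
        # every construction (and, per the PEP, every assignment).
        if all(isinstance(n, str) and n == n.lower() for n in names):
            return names
    return tuple(str(n).lower() for n in names)

You could trust any tuple and drop the element check, but then the converter no longer guarantees the very invariant it exists for.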


An additional use case for the ‘bypass validation’ situation: what if a class wants to internally modify an attribute? It makes sense to bypass conversion privately, right? See this example:

@dataclass
class InventoryItem:
    names: tuple[str, ...] = field(converter=lambda names: tuple(map(str.lower, names)))
    
    def remove_evil_names(self):
        # setting attribute needlessly triggers O(n) converter
        self.names = tuple(n for n in self.names if "evil" not in n)

The ‘convert on setting attributes’ behavior invites new questions as well: when exactly does it trigger? Take a slight variation on the PEP example (a list instead of a tuple):

@dataclass
class InventoryItem:
    names: list[str] = field(converter=lambda names: list(map(str.lower, names)))
    
    def add_name(self, n: str):
        # converter *not* triggered.
        self.names.append(n)
        
        # converter *is* triggered
        self.names = self.names + [n]
        
        # converter *is* triggered
        self.names += [n]
        # ...but not after this innocent refactoring...
        ns = self.names
        ns += [n]

This is of course obvious to experienced Python developers, but totally confusing for beginners.

:grimacing: Sorry if I sound too argumentative — having worked with many pydantic-heavy codebases has turned me strongly against the whole ‘struct with builtin validation/conversion’ idea :wink: .

3 Likes

Don’t worry about being argumentative :smile:.

I considered the collection possibility, which is what I had in mind when I wrote “In the cases where the savings are significant…”

Ultimately, each of dataclass’s parameters eliminates another kind of common boilerplate code. They don’t work for every case. If you use kw_only, the whole dataclass becomes kw_only. So if you only want some parameters to be kw_only, you have to do it a different way.
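
(For concreteness, the “different way” in that case is the per-field machinery; a quick sketch using the KW_ONLY sentinel and field(kw_only=True), both of which exist today:)

from dataclasses import KW_ONLY, dataclass, field

@dataclass
class Point:
    x: float
    _: KW_ONLY                                      # fields after this pseudo-field are keyword-only
    y: float = 0.0
    z: float = field(default=0.0, kw_only=True)     # or mark fields individually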

It’s the same with the converter idea. It works great when what you want is unconditional conversion. If you don’t want that, then you have to do it the old, long way.

Trying to squeeze in the more esoteric cases you’re talking about imposes a significant cost on the common cases. The way it’s laid out in the PEP, conversion and construction can be done as SomeClass(x, y, z). With your idea, you’d have to enlist some other function convert(SomeClass, x, y, z). And the justification for that idea is that you may want to avoid conversion in some rare case.

I’d rather you just created your special conversion method for the cases you mention, and let the common case be simple. But that’s my preference based on the examples I’ve seen (I listed three in my comments). Maybe your experience is that these cases are much more common than I think, and that’s why we have different intuitions.

You’ve already convinced me that a separate convert() is not a good idea — I’m interested in the broader topic of “how to bypass conversion”. If I understand correctly, your take is “if you need to bypass it, don’t use converters at all”?

If so, I would again agree, but note:

  • The need to bypass conversion is really common, as evidenced by pydantic’s .construct() being added early on.
  • It’s to be expected that devs will get ‘lured into’ using converters without realizing the consequences. I see this with pydantic: the naive approach of ‘just add validation everywhere’ easily creates a slow/unmaintainable codebase.

Can we educate users about the drawbacks? I’m doubtful. Think of mutable default arguments. Experience has shown that no matter how many warnings or how much documentation you offer, if you make it too easy, people will just do it wrong—with disastrous consequences. The only solution is to keep footguns out of easy reach, and I fear a converter= parameter does the opposite.
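
(For anyone who hasn’t been bitten by it, the mutable-default footgun I mean looks like this:)

def register(name, seen=[]):    # one list object is shared by every call
    seen.append(name)
    return seen

register("a")   # ['a']
register("b")   # ['a', 'b']  <- the default was evaluated once and reused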

While I’m not against conversion per se, I’m uneasy about making such a “tricky to get right” feature so easy to (ab)use.

1 Like

I completely agree with everything you’re saying (and in the last comment too). Thinking about it more, I don’t actually want converters on __setattr__. I think I just want converters for construction. That’s also the only thing the three examples I showed need.

And yes, if you want to sidestep constructor converters, I think you should provide an alternative class factory that converts and not use the converter feature.
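
Roughly this shape, i.e. keep the class converter-free and put the conversion in an alternative constructor (a sketch; from_raw is just a placeholder name):

from collections.abc import Iterable
from dataclasses import dataclass

@dataclass
class InventoryItem:
    names: tuple[str, ...]                  # plain field, no converter

    @classmethod
    def from_raw(cls, names: Iterable[str]) -> "InventoryItem":
        # conversion happens only when callers opt in via the factory
        return cls(tuple(str(n).lower() for n in names))

item = InventoryItem.from_raw(["Widget", "GADGET"])   # converts
fast = InventoryItem(("widget", "gadget"))            # bypasses conversion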

How do you feel about constructor converters?

1 Like

In which case what is wrong with using an __init__ method for that?

Taking the example from the PEP:

def str_or_none(x: Any) -> str | None:
  return str(x) if x is not None else None

@dataclasses.dataclass
class InventoryItem:

    id: int = dataclasses.field(converter=int)
    skus: tuple[int, ...] = dataclasses.field(converter=tuple[int, ...])

    vendor: str | None = dataclasses.field(converter=str_or_none)
    names: tuple[str, ...] = dataclasses.field(
      converter=lambda names: tuple(map(str.lower, names))
    )

    stock_image_path: pathlib.PurePosixPath = dataclasses.field(
      converter=pathlib.PurePosixPath, default="assets/unknown.png"
    )

    shelves: tuple = dataclasses.field(
      converter=tuple, default_factory=list
    )

With __init__ that is:

@dataclasses.dataclass
class InventoryItem:

    id: int
    skus: tuple[int, ...]
    vendor: str | None
    names: tuple[str, ...]
    stock_image_path: pathlib.PurePosixPath
    shelves: tuple

    def __init__(self,
        id: int | str,
        skus: Iterable[int | str],
        vendor: Vendor | None,
        names: Iterable[str],
        stock_image_path: str | pathlib.PurePosixPath = "assets/unknown.png",
        shelves: Iterable = (),
        ):
            self.id = int(id)
            self.skus = tuple(map(int, skus))
            self.vendor = str(vendor) if vendor is not None else None
            self.names = tuple(map(str.lower, names))
            self.stock_image_path = pathlib.PurePosixPath(stock_image_path)
            self.shelves = tuple(shelves)

Some might consider this boilerplate but I don’t, because nothing here is really redundant. The types for the fields are not redundant. The signature of __init__ with types and defaults for parameters is not redundant. The code in the body of the __init__ method is not redundant. The field names are repeated a few times, but no line of code here is redundant.

If there were no converters then there would be redundancy, because the types in the signature of __init__ would be the same as the types of the fields and each line in the body of __init__ would just be self.x = x. Without converters the __init__ method looks like redundant boilerplate, but as soon as you want to have actual code in __init__ it is not boilerplate any more.

The example with __init__ has a few more lines of code, but that comes from the inclusion of types in the signature of __init__. It might seem like the types of the parameters for __init__ are redundant, but they are not. For example, the parameter for str_or_none might be typed as Any, but that does not necessarily mean that you would want to accept Any as an input for the vendor parameter in the InventoryItem constructor. I have guessed here that the type should be Vendor | None but in the original code it is unclear what it is supposed to be.

I don’t think that trying to make something that should usually be code in an __init__ method look declarative makes anything easier to understand or makes it any easier to write the code. It is better to put the code in an __init__ method all in one place rather than writing auxiliary functions like str_or_none and noun-ifying simple code into “converters” and “default factories”. It is definitely easier to understand what the signature of __init__ is if you can see the __init__ method rather than scanning through default factories and converter functions. It is also easier to understand what is actually executing in the constructor if you can see the body of the __init__ method. The fact that behind the scenes the dataclass decorator will go and textually build the code for this __init__ method is a clear sign that maybe what you should be doing is just writing an __init__ method.

What does not quite work with __init__ is frozen dataclasses. It does not seem to be possible to use either __init__ or __new__ with a frozen dataclass without using object.__setattr__ which is awkward. You can add an alternate classmethod constructor like InventoryItem.new(...) but then that cannot be used with the ordinary InventoryItem(...) syntax. Maybe there is a way to improve defining conversions or validation for frozen dataclasses.
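
For illustration, the awkward workaround I mean: squeezing the conversion into __post_init__ with object.__setattr__ (a sketch, not a recommendation):

import dataclasses

@dataclasses.dataclass(frozen=True)
class InventoryItem:
    names: tuple[str, ...]

    def __post_init__(self):
        # frozen=True makes normal assignment raise FrozenInstanceError,
        # so the conversion has to sneak past the generated __setattr__
        object.__setattr__(self, "names", tuple(str(n).lower() for n in self.names))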

3 Likes

Sorry, I can’t agree with that. The types of the fields are given, yes. But,

  • the signature of __init__ is redundant since the types can be deduced from the converter,
  • the code body of __init__ is redundant:
    • for the parameter names (as you say),
    • but also for the application of the converter, which could otherwise be rolled into a custom field function.

Then you can create a custom converter function that accepts the types that you do want, and use that as your converter.

Good point (in favor of converters).

Consider this class. With a converter, I would not add any extra code to the class. The converter would live entirely in the distribution_parameter field specifier. Whereas, to get the same conversion without the feature, I do need to write a lot of boilerplate:

  • the __init__ header,
  • the assignment line (and it’s a frozen dataclass, so I have to use object.__setattr__), and
  • the application of the converter.

Piping up as someone that was really looking forward to this feature and is disappointed with the SC’s decision.

For background, I’m on a team that uses dataclasses extensively, mostly uses frozen dataclasses, and is also starting to default to slots=True for those dataclasses. Unfrozen dataclasses are typically used only when we can’t construct a frozen dataclass all in one go.

Given that we freeze everything, options involving __init__ or __post_init__ are off the table. They’re possible without slots=True, but involve reaching into __dict__ in a deeply gross way. With slots=True they’re (to my knowledge) impossible.

Alternate constructors are basically our only option, and they’re not a great one for a few reasons, e.g.:

from dataclasses import dataclass
from typing import Self

@dataclass(frozen=True)
class Foo:
    a: int
    b: int  # we'd like to use a converter here to allow a string
    c: str  # we'd like to force this to lower-case

    @classmethod
    def construct_with_b_as_int_or_str_and_guaranteed_lower_c(cls, a: int, b: int | str, c: str) -> Self:
        return cls(a, int(b), c.lower())

This is unpleasant and potentially dangerous for a few reasons:

  • there’s a bunch of duplication
  • there’s an opportunity for drift between the signatures of the alternate constructor and __init__
  • we can only guarantee that Foo.c is lower-cased if the alternate constructor is used (we could raise an exception in __init__/__post_init__ if Foo.c isn’t lower-cased, as sketched below, but we don’t want that, we want to convert it)
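
To spell out that last point: the most the class itself can do is validate and reject, e.g. in __post_init__ (a sketch of the approach we don’t want):

from dataclasses import dataclass

@dataclass(frozen=True)
class Foo:
    a: int
    b: int
    c: str

    def __post_init__(self) -> None:
        # this can reject bad input, but frozen=True means it can't
        # simply reassign self.c to repair it
        if self.c != self.c.lower():
            raise ValueError("c must be lower-case")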

Q: Is there a compelling reason for dataclass field converters to be in the standard library that we’re just missing?

For us — dataclasses are already in the standard library, we’re using them extensively, and we’d like to continue using them rather than migrate to a third-party alternative like attrs. (There can also be organizational resistance to bringing in third-party packages — they typically need a license and security review, etc.)

1 Like

You have to use object.__setattr__ with the way frozen dataclasses are implemented. If you look at the code for dataclasses, you’ll see that this is what dataclasses itself generates for frozen dataclasses[1].


  1. __dataclass_builtins_object__ is object, via the dictionary provided to the __create_fn__ call. ↩︎

This is also documented here: dataclasses — Data Classes — Python 3.12.0 documentation

That said, your type checker or friendly neighborhood linter won’t understand what you’re doing :smiling_face_with_tear:

1 Like

One more thing I didn’t find mentioned in the PEP text linked above: how does this affect replace()? Does it run the converter, or not? As with attribute setting, attrs and pydantic each make a different decision in their equivalents.

A solution that just came to mind:

Since type annotations are pretty much mandatory for dataclasses, why not run the converter only if the provided types don’t match the dataclass type signature?

The converter will be run on a type mismatch in every situation: during __init__, on assignment to a field, and on dataclasses.replace().

If the given types in the constructor call / assignment etc. match the field type, the converter is never run.

To simplify the implementation as much as possible, we can do things like mandate that the converter function take an object as input and that its output type be the same as the annotation on the dataclass field.

@tusharsadhwani interesting idea, but I don’t think this would work in the general case. Imagine this case:

@dataclass
class A:
    foo: int = field(converter=lambda x: abs(int(x)))

In this case, -3 would satisfy the type, but would give a different result than the converter.

1 Like

I personally don’t think converters that return the same type are that common. Perhaps we can have a simpler implementation at the cost of not allowing those?

This is perhaps too convoluted a solution, but field could get an additional keyword option to choose the behavior, e.g. “always run”, “only for type conversion”, or “only on __init__”.

This is going to be almost impossible in practice. Firstly, you’d need to encode a ton of logic typically found in type checkers into this check. Secondly, it’d just be super slow.

Imagine a class like:

from dataclasses import dataclass

@dataclass
class A:
    a: list[SomeClass[int] | SomeOtherClass[int, SomeProtocol]]

And if this list has a thousand elements, you’d need to iterate through every element and do this check to determine whether to convert?

3 Likes

This is also in the “Rejected Ideas” section of the PEP.

2 Likes

The 2024 Python Steering Council has decided to reject “PEP 712 – Adding a “converter” parameter to dataclasses.field”. Our apologies for not sending official notice sooner. It was a difficult decision, much discussed over two Steering Council terms.

Our reasons for the rejection include:

  • We did not find evidence of a strong consensus that this feature was needed in the standard library, despite some proponents arguing in favor in order to reduce their dependence on third party packages. For those who need such functionality, we think those existing third party libraries such as attrs and Pydantic (which the PEP references) are acceptable alternatives.
  • This feature seems to us like an accumulation of what could be considered more cruft in the standard library, leading us ever farther away from the “simple” use cases that dataclasses are ideal for.
  • Reading the “How to Teach This” section of the PEP gives us pause that the pitfalls and gotchas are significant, with a heightened confusion and complexity outweighing any potential benefits.
  • The PEP seems more focused toward helping type checkers than people using the library.

We know that our decision will disappoint proponents of this feature, but we don’t find compelling enough arguments in favor of its acceptance. The SC thanks the PEP author and sponsor for the well-written PEP, and everyone who contributed to the Discourse thread for the thorough discussions.

Barry on behalf of the Python Steering Council.

7 Likes

[98 different posts in the thread 98 different posts. Steering Council shoots down, announce it around, 99 different posts in the thread]
[99 different posts in the thread 99 different posts. PEP author voiced sound, announce it around, 100 different posts in the thread]

I’m just happy to finally have a pronouncement. The PEP was originally born out of wanting this in dataclass_transform, so I guess I owe the world (and myself) the-PEP-I-wanted-to-write :smiley:

Seeya next time :wink: (and thank you :heart: )

(continuing here)

5 Likes