PEP 712: Adding a "converter" parameter to dataclasses.field

hlovatt · August 30, 2023, 10:33pm

I would find this useful. I have an application that takes JSON data and converts them to data classes and I do type conversions from JSON representations to more specific Python representations in __post_init__, e.g. ISO date string to datetime and literal 0 which comes through as an int to float.
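
For readers who have not seen this pattern, here is a minimal sketch of the `__post_init__` approach described above, using only the standard library. The `Reading` class and its field names are hypothetical and chosen purely for illustration.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Reading:
    # Hypothetical record; annotations describe the types after conversion.
    taken_at: datetime
    value: float

    def __post_init__(self):
        # JSON has no datetime type, so timestamps arrive as ISO 8601 strings.
        if isinstance(self.taken_at, str):
            self.taken_at = datetime.fromisoformat(self.taken_at)
        # A literal 0 in JSON is decoded as int; normalize it to float.
        self.value = float(self.value)


raw = json.loads('{"taken_at": "2023-08-30T22:33:00", "value": 0}')
reading = Reading(**raw)
```

A per-field converter would move each of these conversions next to the field it applies to instead of collecting them in one method.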

ericvsmith · October 9, 2023, 11:57am

Maybe I’m misunderstanding things, but I don’t think attrs has the behavior of “converts on attribute setting”:

>>> import attr
>>> @attr.s
... class C:
...  x: int = attr.ib(converter=int)
...
>>> c = C("10")
>>> c
C(x=10)
>>> c.x = "20"
>>> c
C(x='20')

I couldn’t find this referenced in the attrs documentation, but maybe I wasn’t looking in the right place.

Tinche · October 9, 2023, 12:12pm

It does for the new-style APIs (define, frozen). I remember this being pointed out in the docs but they must’ve deteriorated. We strongly encourage new-style APIs.

ericvsmith · October 9, 2023, 12:15pm

Ah, thanks @Tinche. Here’s an example:

>>> from attrs import define, field
>>> @define
... class C:
...     x: int = field(converter=int)
...
>>> c = C("10")
>>> c
C(x=10)
>>> c.x = "20"
>>> c
C(x=20)

I think the PEP should be updated to note this, and save someone who’s not as familiar with attrs (like me!) the time to research it.

thejcannon · October 9, 2023, 1:01pm

Nothing like a miniature panic attack early in the morning. I knew I tested it!

I’m happy to edit the PEP (assuming that’s kosher)

ericvsmith · October 11, 2023, 11:15am

Another thing that occurs to me is interactions with pattern matching. What happens here?

# Assumes the converter parameter proposed by PEP 712.
from dataclasses import dataclass, field

@dataclass
class Point:
    x: int = field(converter=int)
    y: int


match Point(x="0", y=0):
    case Point(x="0", y=0):
        print("Origin")
    case Point():
        print("Somewhere else")
    case _:
        print("Not a point")

Naively, I’d expect it to print “Origin”, not “Somewhere else”. I realize the Point in the case statement isn’t creating an object, but clearly it’s meant to parallel object creation, at least visually.

This should be mentioned in the PEP: the match statement ignores any converter. But it does, for example, respect init=False.

fonini · October 18, 2023, 2:12am

I think class patterns have a pretty clear rule: they match attributes of the subject. Dataclasses, with or without converters, are not special here – any more than custom user classes in which __init__ parameters don’t correlate with instance attributes [1].

As a user, whenever I see a class pattern in a match statement, I mentally translate it to a sequence of checks like “isinstance + hasattr + the attr value matches the subpattern”. In your example, I believe that case Point(x="0"): ... in general should be read as a shorthand for:

elif (
    isinstance(subject, Point)
    and hasattr(subject, 'x')
    and subject.x == "0"
):
    ...

This means that the behavior for dataclasses with converters should be clear enough: the result of accessing the attribute x on the instance Point(x="0") will be matched against the string "0". Since accessing the attribute returns an integer and 0 != '0', the pattern will not match.

[1] As far as I understand, dataclasses are only special in that they automatically generate a __match_args__, which I don’t think is relevant for this discussion.

pf_moore · October 18, 2023, 8:11am

I disagree. The fact that the match statement uses syntax that looks like the initialiser is the important thing for me (and I believe it was a deliberate design choice as well). The discrepancy here wouldn’t be a complete disaster, but I’m pretty sure it would be a source of confusion and possibly bugs, so I think @ericvsmith is right and this should be explicitly discussed in the PEP.

jamestwebber · October 18, 2023, 5:50pm

Although this trip-up can happen with any class/object that converts inputs during initialization. e.g. a contrived example:

match int('0'):
    case int(real='0'):
        print("I'm zero")
    case int(real=0):
        print("No, I'M zero")
    case _:
        print("another int")

It isn’t ideal, but it’s also not really new behavior. Still, this PEP might make such bugs more common by making this type of code more tempting to write.

fonini · October 18, 2023, 6:05pm

I believe this is a less contrived example that still exhibits the same unwanted behavior:

match int('0'):
    case int('0'):  # will not match
        print("I'm zero")
    case _:
        print("Another int")  # will be printed

Specifically, the case int('0') does not raise an exception: it’s just a pattern that won’t match, because what it means is the check isinstance(subject, int) and subject == '0'. I believe that this behavior of the class pattern is a ship that has already sailed.

I think what’s at stake here is: should Python start avoiding __init__ arguments that don’t correlate with instance attributes just because match is now part of the language? I don’t think so, but that’s definitely just my opinion.

pf_moore · October 18, 2023, 6:46pm

All anyone is asking for is for it to be discussed in the PEP. It’s a question for the “how do we teach this” section, at a minimum, as that is precisely where non-intuitive behaviour should be called out explicitly.

NeilGirdhar · October 18, 2023, 8:20pm

In the PEP discussion, it might be worth pointing out that most type checkers should be able to catch this kind of error. For example,

match int('0'):
    case int('0'):  # PyRight says: pattern will never be matched for subject type "int"
        print("I'm zero")
    case _:
        pass

thejcannon · October 18, 2023, 8:59pm

Updated: PEP 712: Update with suggestions/clarifications from discussion by thejcannon · Pull Request #3496 · python/peps · GitHub

patrick-kidger · November 8, 2023, 9:01pm

I’ve just come across this PEP.
Overall I like it! I have two very nitty comments.

First of all, for reference: we use frozen dataclasses basically everywhere in Equinox, and we already have an extension to field that adds a converter argument. I think we’re basically doing the same thing as this PEP in all cases.*

The interaction with __post_init__ isn’t specified. From experience we’ve found that conversion before __post_init__ is most useful.
Whether to have it run inside cls.__init__ or type(cls).__call__ is not discussed. From experience having it run inside cls.__init__ is most useful, as this makes it substantially easier for runtime type checking libraries – we’ve developed jaxtyping – to perform their checks.

(* Actually, ever-so-technically there is one discrepancy: in custom __init__ methods on frozen dataclasses, we allow the self.foo = bar syntax, and in this case, unlike this PEP, we do perform conversion. We’ve found this important for usability, and it is our sole divergence from standard dataclasses, which normally mandate the use of object.__setattr__. I don’t think this discrepancy really counts here, as this is already somewhere we’ve made a conscious choice to deviate from standard dataclasses.)
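
For context on the object.__setattr__ requirement mentioned above, here is a small sketch of converting a field on a frozen standard dataclass. The `Shape` class is hypothetical, and the sketch shows plain standard-library behavior, not Equinox's implementation.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Shape:
    dims: tuple

    def __post_init__(self):
        # "self.dims = ..." would raise FrozenInstanceError on a frozen
        # dataclass, so the converted value goes through object.__setattr__.
        object.__setattr__(self, "dims", tuple(self.dims))


shape = Shape([2, 3])
```

After construction, `shape.dims` is a tuple, and any later `shape.dims = ...` assignment still raises `FrozenInstanceError`.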

thejcannon · November 13, 2023, 12:10pm

So, specifically the PEP says it runs during attribute assignment (or in __init__ for frozen data classes).

I think it’s safe to assume attribute assignment happens in __init__ for non-frozen dataclasses. If it isn’t, I don’t think the PEP regarding value conversion would be the right place to specifically call out when it happens, since that’s a more generic behavior.
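
One way to check that assumption on a current Python: the sketch below hooks `__setattr__` and shows that the generated `__init__` of a non-frozen dataclass goes through it. The `ConvertOnSet` mixin and the "converter" metadata key are hypothetical names for this illustration only; this is not how the PEP proposes to implement conversion.

```python
from dataclasses import dataclass, field


class ConvertOnSet:
    """Hypothetical mixin: applies a converter stored in field metadata."""

    def __setattr__(self, name, value):
        # __dataclass_fields__ maps field names to Field objects.
        spec = getattr(type(self), "__dataclass_fields__", {}).get(name)
        if spec is not None and "converter" in spec.metadata:
            value = spec.metadata["converter"](value)
        super().__setattr__(name, value)


@dataclass
class C(ConvertOnSet):
    x: int = field(metadata={"converter": int})


c = C("10")   # the generated __init__ runs "self.x = x", which hits __setattr__
c.x = "20"    # later assignments are converted the same way
```

Because the generated `__init__` assigns with `self.x = x`, the same hook covers both construction and later assignment for non-frozen classes.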

Glad you like the PEP though, and thanks for the suggestions!

thejcannon · November 13, 2023, 12:13pm

In Pantsbuild, we used to have something similar (still using standard dataclasses), but I switched us to using it the way the docs suggest.

I miss the ergonomics of normal attribute assignment. Now you got me pondering a PEP for frozen_after_init. I’ll probably run it in a new thread once this PEP is done.

gpshead · November 14, 2023, 2:38am

Python Steering Council hat: Thanks for the well written PEP and thorough discussion here. We have reviewed and discussed this PEP and we are unfortunately not finding ourselves leaning towards accepting it today. (The question the SC would like to see answered to change our future selves’ minds is at the end.)

Reasoning:

A dataclasses.field converter adds complexity (additional spooky action at a distance and a concept going further than the “just a struct” mental model).
There are already multiple ways to do this even if they involve more lines (__init__, alternate constructors, etc).
For users who really want converters rather than the additional lines of code: They can already use third party dataclass-like libraries (presumably attrs) providing the feature today instead of waiting for CPython 3.13+.
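
As an illustration of the "alternate constructors" option in the list above, here is a minimal sketch with a classmethod that performs the conversion. The `Release` class, its fields, and the input values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Release:
    version: tuple
    released: date

    @classmethod
    def from_strings(cls, version: str, released: str) -> "Release":
        # Conversion happens here, so __init__ only ever sees final types.
        return cls(
            version=tuple(int(part) for part in version.split(".")),
            released=date.fromisoformat(released),
        )


release = Release.from_strings("1.2.3", "2023-11-14")
```

This keeps the dataclass a plain struct, at the cost of callers needing to know which constructor accepts which input types.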

One of our guiding general themes is that there is less reason for every feature to be done in the standard library now than in years past. Virtually all Python applications are built upon many third party packages from our ecosystem today.

Q: Is there a compelling reason for dataclass field converters to be in the standard library that we’re just missing?

It is good to see people piping up on this thread who do want the feature. It would be interesting to know how important that is and if you already do, or why you don’t, use something like attrs today just to have it.

We’ll keep the steering-council pep-712 issue open for a while as a reminder to observe any further discussion (and let the next elected SC make the final decision).

-gps for the 2023 Python Steering Council

thejcannon · November 14, 2023, 4:04pm

(Where’s the broken heart emoji reaction?)

Thanks for the response, and especially the explanation. Although it’s a bummer, I think the answer is very fair and understandable.

I hope others do chime in on their specific use-case, especially since the ones that said they’d benefit already have chosen standard library dataclasses over attrs despite not having this feature (even though that means more hurdles and pain). I’ll chime in on ours (I don’t remember if I have already) in a separate comment.

They can already use third party dataclass-like libraries (presumably attrs) providing the feature today instead of waiting for CPython 3.13+.

I’ll say that since this PEP augments dataclass_transform, I don’t think the choice is binary. On the project that motivated this PEP we’d likely define a dataclass transform that simply handles conversion semantics and then forwards to dataclass, with the expectation that once we upgrade to Python 3.13 we’ll replace usage of the decorator with dataclass. The ability to do so lies in the fact that type-checkers support typing_extensions backports. This was intentional on my part (maybe I should’ve made it explicit in the PEP?)
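
A rough sketch of what such a forwarding decorator could look like at runtime. The names `converting_field` and `converting_dataclass`, and the choice to convert right after the generated `__init__` runs, are assumptions for illustration; this is not the PEP's reference implementation, and it only handles non-frozen classes.

```python
import dataclasses
import functools

try:
    from typing import dataclass_transform  # Python 3.11+
except ImportError:  # older interpreters need the backport
    from typing_extensions import dataclass_transform


def converting_field(*, converter, **kwargs):
    """Hypothetical field specifier: records the converter in field metadata."""
    metadata = dict(kwargs.pop("metadata", None) or {})
    metadata["converter"] = converter
    return dataclasses.field(metadata=metadata, **kwargs)


@dataclass_transform(field_specifiers=(converting_field,))
def converting_dataclass(cls):
    """Hypothetical decorator: forwards to dataclass, then converts fields."""
    cls = dataclasses.dataclass(cls)
    converters = {
        f.name: f.metadata["converter"]
        for f in dataclasses.fields(cls)
        if "converter" in f.metadata
    }
    original_init = cls.__init__

    @functools.wraps(original_init)
    def __init__(self, *args, **kwargs):
        original_init(self, *args, **kwargs)
        # Non-frozen classes only: plain setattr would fail on frozen ones.
        for name, convert in converters.items():
            setattr(self, name, convert(getattr(self, name)))

    cls.__init__ = __init__
    return cls


@converting_dataclass
class Config:
    port: int = converting_field(converter=int)
    name: str = "svc"


cfg = Config("8080")
```

A type checker reading this would still expect an `int` for `port` in `__init__`, because `dataclass_transform` currently has no way to express the converter's input type.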

And speaking of type-checking…

One gap this decision leaves open and unsolved is the type semantics of any dataclass transform providing conversion semantics. This escapade started off as a question of augmenting dataclass_transform to support this feature. Since type-checkers, without special-casing, don’t support attrs’s conversion semantics when defined as a dataclass_transform, it seems to me that if the answer to this PEP is “you should probably just use attrs”, we ought to plug the gap. Otherwise, we didn’t move the needle on any of the problems outlined in the “Motivation” section and told the community to use a sub-optimal solution.

So, can I ask for a gut-feel strawman from the (current) Steering Council on, if this PEP is likely to be rejected, a similar PEP scoped just to dataclass_transform?

(Regardless, saying “no” isn’t fun, but can certainly be necessary. I’m just glad the community is the way it is, from the strangers on these threads all the way up to the big bad Steering Council. Thanks for considering the PEP.)

thejcannon · November 14, 2023, 4:12pm

The use-case that motivated this PEP, pantsbuild, uses frozen dataclasses everywhere (current count is ~1200 instances) [1].

Although our application already has third-party dependencies, I’ve been arguing hard for whittling them away (ideally towards 0) for two reasons:

We’re a build tool, so security is objectively more important for us than for other libraries/applications. Infect a build tool, and you can infect everything it builds.
Every time our users want a new version, they are forced to download and install every one of our dependencies. That’s wasteful.

So the standard library is our friend.

[1] As a side note, we use frozen dataclasses because we cache these objects to be re-used from multiple threads. PEP 703 is very very exciting to us!

thejcannon · November 14, 2023, 4:20pm

Last thing I’ll say (for now, I promise). Any rich type like this in the standard library quickly becomes a vocabulary type. pathlib.Path is my favorite example. Instead of APIs declaring they take paths as strings, and having to document this string represents a path, we now have a type that expresses that intent.

dataclasses are such a powerful vocabulary type in themselves, and they allow you to further define very straightforward vocabulary types. Win-win. attrs has the same semantics, but it being third-party means it isn’t in everyone’s vocabulary. As an API author, if I’m going to define an API, I want the most users to be able to understand and use it easily, and it doesn’t get better than standard library types.