Dataclasses: subclassing a dataclass without its fields inherited as init-fields

I was wondering if it would be possible to allow subclassing a dataclass without automatically including its fields in Subclass.__init__ (in some sense, hiding the inherited fields).

When subclassing the dataclass AB below to create CD, the fields of AB become fields of CD, automatically included in __init__, __repr__, and other methods of CD.

@dataclass
class AB:
    a: str
    b: str

@dataclass
class CD(AB):
    c: str
    d: str

    # CD is instantiated as `CD(a, b, c, d)`.

In some of my use cases, the values of a and b are determined by the values of c and d in instances of CD (in some sense, instances of CD are “fully determined” by c and d). Typically, I set those values in __post_init__. However, since a and b are automatically included in some methods, I hide them manually:

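from dataclasses import MISSING, dataclass, field

@dataclass
class AB:
    a: str
    b: str

def hidden_field(default=MISSING):
    # Exclude the field from __init__, __repr__, and comparisons.
    return field(default=default, init=False, repr=False, compare=False)

@dataclass
class CD(AB):
    c: str
    d: str

    # The annotations are required: a bare `a = field(...)` raises TypeError.
    a: str = hidden_field("Hello World")
    b: str = hidden_field()

    def __post_init__(self):
        self.b = self.c + self.d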

    # CD is instantiated as `CD(c, d)`,
    # but behavior from `AB` using `a` and `b` is still available.

I was wondering if, in the dataclass decorator for CD, an argument could be used to achieve this behavior, something like the following. I imagine this feature would be useful when a class like CD needs behavior from AB, but fields like a and b have predetermined values or values determined by fields of CD. One potential issue is that comparison (I believe) is broken between CD and AB instances.

@dataclass
class AB:
    a: str
    b: str

@dataclass(hide_inherited_fields=True)
class CD(AB):
    c: str
    d: str

    def __post_init__(self):
        self.a = "Hello World"
        self.b = self.c + self.d

    # CD is instantiated as `CD(c, d)`.

Perhaps a specific set of fields could be ignored by the decorator argument, and the library could even include a function like hidden_field.
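
On the comparison point: as far as I can tell, the generated __eq__ only compares two instances of the identical class, so an AB instance and an instance of any subclass already compare unequal today, even when every field value matches. A minimal sketch (ABChild is a hypothetical subclass that adds no fields, so the field values can be identical):

```python
from dataclasses import dataclass

@dataclass
class AB:
    a: str
    b: str

@dataclass
class ABChild(AB):
    """Hypothetical subclass with no extra fields."""

base = AB("x", "y")
derived = ABChild("x", "y")

# Both generated __eq__ methods return NotImplemented for mixed classes,
# so Python falls back to identity comparison.
print(base == derived)               # False
print(derived == ABChild("x", "y"))  # True
```

So hiding the inherited fields would not change the result of comparing CD with AB; it would only change which fields are compared between two CD instances.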

This is IMO just a bad use case for dataclasses. Generally, when you start using __post_init__ for more work than just validating the inputs, you should probably reconsider whether your design is actually well suited to dataclasses.

Specifically, if b is fully determined by c and d, then surely this is an invariant that should hold at all times? But dataclasses are the wrong tool for such a situation, since they assume the fields are independent, meaning they can be set independently as well.

I don’t know the larger context of your use cases, but I am against adding functions to the stdlib to encourage such usage of dataclasses.
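
To illustrate the invariant problem, here is a minimal sketch. The class names StoredB and ComputedB are made up for this example: the first stores b in __post_init__, the second computes it on access with a property.

```python
from dataclasses import dataclass, field

@dataclass
class StoredB:
    c: str
    d: str
    b: str = field(init=False)

    def __post_init__(self):
        # b is computed once, at construction time only.
        self.b = self.c + self.d

@dataclass
class ComputedB:
    c: str
    d: str

    @property
    def b(self) -> str:
        # Not a dataclass field; recomputed on every access.
        return self.c + self.d

stored = StoredB("x", "y")
stored.c = "z"
print(stored.b)    # xy (stale: the invariant b == c + d no longer holds)

computed = ComputedB("x", "y")
computed.c = "z"
print(computed.b)  # zy
```

Nothing stops a caller from assigning to c after construction, so the stored version silently goes out of sync unless the class is frozen or b is derived on demand.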

If these are your only two classes in your hierarchy, then write your own CD.__init__ implementation. If this is an example of a general pattern that you want to solve, you probably can’t easily solve it without altering the dataclass decorator or not using dataclasses.
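
A minimal sketch of the first option, using the AB and CD classes from the original post. Because the class body defines __init__ itself, @dataclass leaves it alone and only generates the remaining methods:

```python
from dataclasses import dataclass

@dataclass
class AB:
    a: str
    b: str

@dataclass
class CD(AB):
    c: str
    d: str

    # An __init__ defined in the class body is not overwritten by @dataclass.
    def __init__(self, c: str, d: str):
        super().__init__(a="Hello World", b=c + d)
        self.c = c
        self.d = d

cd = CD("x", "y")
print(cd)  # CD(a='Hello World', b='xy', c='x', d='y')
```

Note that a and b are still fields of CD here, so they still appear in the generated __repr__ and __eq__.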

That’s a good point; maybe I could try explaining my use-case, and you might be right that it is better suited for non-data classes.

I am working with collections of data, and I have a class like the pseudocode below that loads data from a set of files but also has quite a few other fields and methods (for example, for processing the data and caching the processed data).

@dataclass
class FilesDataset:
    name: str
    files: list[Path]

    def load_data(self):
        # Concatenate the per-file lists; sum() needs a list as its start value.
        return sum((load_to_list(file) for file in self.files), [])

I need this class independently, but I also have datasets for specific patients, whose files are determined from a specific folder structure, something like (with my proposed addition):

@dataclass(hide_inherited_fields=True)
class PatientDataset(FilesDataset):
    patient_name: str

    def __post_init__(self):
        self.name = "Patient " + self.patient_name
        self.files = load_files_for_patient(self.patient_name)

For the second class, I do not need name and files to remain fields (i.e. stored in __dataclass_fields__ with the corresponding behavior). (The argument could be something like override_inherited_fields, although it would be nice for the type annotations to remain.) However, I need to set name and files as attributes to use the behavior of FilesDataset for the patient dataset.

I could achieve this without making the second class a dataclass:

class PatientDataset(FilesDataset):
    def __init__(self, patient_name):
        self.patient_name = patient_name
        self.name = "Patient " + self.patient_name
        self.files = load_files_for_patient(self.patient_name)

    def __repr__(self): ...
    # Comparison and hash methods might be useful in other cases.

However, it’s much cleaner with the hiding/overriding idea above.

I could also achieve this with a function like make_patient_dataset that outputs an instance of FilesDataset, but I need additional behavior for the patient datasets that the file datasets do not need.

When I first used dataclasses some time ago, I actually expected the hiding/overriding behavior to be the default, and this use-case is the first where I have run into an issue.

Thank you for your input!

You know you can still make it a dataclass with your custom __init__, as far as I know?

That will provide the __repr__ that you want, etc.

I think it’s a better pattern to use

from functools import cached_property

@dataclass(frozen=True)
class PatientDataset:
    patient_name: str

    @cached_property
    def f(self):
        return FilesDataset(
            name="Patient " + self.patient_name,
            files=load_files_for_patient(self.patient_name),
        )

or if you don’t want it frozen for some reason:

@dataclass
class PatientDataset:
    patient_name: str

    def __post_init__(self):
        self.f = FilesDataset(
            name="Patient " + self.patient_name,
            files=load_files_for_patient(self.patient_name),
        )

I.e., assign the first dataclass as an attribute of the second dataclass.
It’s cleaner because there is no inheritance.

That’s composition, which I agree should be preferred to inheritance, but we don’t know whether he needs inheritance because of his overall design.

You know you can still make it a dataclass with your custom __init__, as far as I know?

That will provide the __repr__ that you want, etc.

That’s a good point, but for my patient datasets, I would like for them to have a __repr__ that includes only patient_name, rather than the many fields (seven) that my version of FilesDataset has.

Similarly, I could imagine other use-cases where the comparison or hashing methods would only need to depend on the attributes of the subclass.
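
For what it is worth, the manual hiding from my first post already gives this behavior for the dataset classes. A minimal sketch, where load_files_for_patient is a hypothetical stand-in for my real folder lookup and FilesDataset is cut down to two fields:

```python
from dataclasses import dataclass, field
from pathlib import Path

def load_files_for_patient(patient_name: str) -> list[Path]:
    # Hypothetical stub; the real version walks a folder structure.
    return [Path(patient_name) / "data.csv"]

@dataclass
class FilesDataset:
    name: str
    files: list[Path]

@dataclass
class PatientDataset(FilesDataset):
    # Re-declare the inherited fields so they drop out of
    # __init__, __repr__, and __eq__.
    name: str = field(init=False, repr=False, compare=False)
    files: list[Path] = field(init=False, repr=False, compare=False)
    patient_name: str

    def __post_init__(self):
        self.name = "Patient " + self.patient_name
        self.files = load_files_for_patient(self.patient_name)

dataset = PatientDataset("alice")
print(dataset)       # PatientDataset(patient_name='alice')
print(dataset.name)  # Patient alice
```

The cost is that every inherited field has to be listed by hand, which is what the proposed decorator argument would avoid.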

I think it’s a better pattern to use

from functools import cached_property

@dataclass(frozen=True)
class PatientDataset:
    patient_name: str

    @cached_property
    def f(self):
        return FilesDataset(
            name="Patient " + self.patient_name,
            files=load_files_for_patient(self.patient_name),
        )

This is certainly a clean solution and one that I could use in future cases, but the package I have as of now has quite a few other functions for behavior like plotting and “gating” the data. In my case, I think calls like plot(patient_dataset) and patient_dataset.apply(gates) would be cleaner and more intuitive than plot(patient_dataset.f) and patient_dataset.f.apply(gates).
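
One possible middle ground, sketched here purely as an assumption about what could work, is to keep the composition but forward unknown attribute lookups to the wrapped FilesDataset with __getattr__. The describe method and the load_files_for_patient stub are hypothetical placeholders for my real methods:

```python
from dataclasses import dataclass
from functools import cached_property
from pathlib import Path

def load_files_for_patient(patient_name: str) -> list[Path]:
    # Hypothetical stub for the real folder lookup.
    return [Path(patient_name) / "data.csv"]

@dataclass
class FilesDataset:
    name: str
    files: list[Path]

    def describe(self) -> str:
        # Placeholder for behavior such as apply(gates).
        return f"{self.name}: {len(self.files)} file(s)"

@dataclass(frozen=True)
class PatientDataset:
    patient_name: str

    @cached_property
    def f(self) -> FilesDataset:
        return FilesDataset(
            name="Patient " + self.patient_name,
            files=load_files_for_patient(self.patient_name),
        )

    def __getattr__(self, attr):
        # Only called when normal lookup fails; guard against recursion on f.
        if attr == "f":
            raise AttributeError(attr)
        return getattr(self.f, attr)

patient = PatientDataset("alice")
print(patient.describe())  # Patient alice: 1 file(s)
print(patient)             # PatientDataset(patient_name='alice')
```

This keeps calls like patient_dataset.apply(gates) short, but a PatientDataset is still not an instance of FilesDataset, so functions that check isinstance would need the .f form.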

I think a better example might be this from the documentation, used to explain the usage of __post_init__.

class Rectangle:
    def __init__(self, height, width):
        self.height = height
        self.width = width

@dataclass
class Square(Rectangle):
    side: float

    def __post_init__(self):
        super().__init__(self.side, self.side)

You could imagine these two classes in a large library of shapes, which includes functions or methods for behavior like drawing the shapes. Rectangle could be a dataclass too, with all of the nice behavior that follows:

@dataclass
class Rectangle:
    height: float
    width: float

However, Square would then inherit height and width as fields, when squares can be fully determined by side. You could imagine that the init, repr, equality, and comparison methods of Square need only depend on side.

However, the Square class might still want behavior from Rectangle, like a draw method that uses height and width, without having to rewrite those methods just to replace every use of height and width with side.

With the addition I proposed, this would be easy to implement with something like:

@dataclass
class Rectangle:
    height: float
    width: float

@dataclass(independent_fields=True)
class Square(Rectangle):
    side: float

    def __post_init__(self):
        self.height = self.side
        self.width = self.side

I like the name independent_fields because it indicates that the sub-dataclass has a set of fields completely independent of its super-dataclasses. An alternative would be an argument like inherit_fields. By default, independent_fields=False or inherit_fields=True.

The subclass can inherit all of the useful behavior of the superclasses, but the developer has the responsibility of setting the superclass attributes needed for that behavior (as is essentially the case with inheritance of normal classes, where one should use super().__init__(...)).

Thank you very much for all your feedback on both my own design and my proposal. If something like this could be implemented, I think it could be added by prefixing these two lines in the standard library with if not independent_fields or if inherit_fields.
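
To make the intended behavior concrete, here is a rough user-level sketch. The decorator name independent_dataclass is hypothetical and this is not how the standard library change would look. It relies on the __dataclass_fields__ attribute that the dataclasses module stores on each dataclass, and it temporarily removes that attribute from the base classes while @dataclass runs, so it is not thread-safe and should be read as an illustration only.

```python
from dataclasses import dataclass, fields

_FIELDS = "__dataclass_fields__"

def independent_dataclass(cls=None, /, **kwargs):
    """Hypothetical decorator: like @dataclass, but ignores inherited fields."""
    def wrap(cls):
        stashed = {}
        # Temporarily hide the field tables of all dataclass bases.
        for base in cls.__mro__[1:]:
            if _FIELDS in vars(base):
                stashed[base] = vars(base)[_FIELDS]
                delattr(base, _FIELDS)
        try:
            return dataclass(cls, **kwargs)
        finally:
            # Restore the bases exactly as they were.
            for base, base_fields in stashed.items():
                setattr(base, _FIELDS, base_fields)
    return wrap if cls is None else wrap(cls)

@dataclass
class Rectangle:
    height: float
    width: float

    def area(self) -> float:
        return self.height * self.width

@independent_dataclass
class Square(Rectangle):
    side: float

    def __post_init__(self):
        self.height = self.side
        self.width = self.side

square = Square(2.0)
print(square)                            # Square(side=2.0)
print(square.area())                     # 4.0
print([f.name for f in fields(Square)])  # ['side']
```

Rectangle itself is unchanged afterwards: fields(Rectangle) still reports height and width, and Rectangle(1.0, 2.0) works as before.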