Add type-aware dict-to-dataclass validation library

I would like to propose adding a library that allows converting unstructured
objects (such as a dict read from a JSON-formatted file) to structured objects
such as dataclasses, and makes use of the types and type annotations at runtime
to validate and cast the input as appropriate. The idea is that one starts with
JSON-like input (i.e. containing lists, dicts and scalars) and gets a Python
object conforming to the schema implied by the type annotations (or a validation
error).

A web search turns up plenty of “dict to dataclass” projects with various levels
of added functionality (I’d link them but Discourse doesn’t allow me), the most
famous of which is Pydantic, which powers the FastAPI framework. I also have my
own library that does this, and I think that having something like it in the
standard library might be useful.

Proposed interface

There would be a main interface function:

def parse_input[T](value: object, spec: type[T]) -> T:

where value is some JSON-like input and spec is a target type to convert
the value into. For example:


>>> import dataclasses
>>> @dataclasses.dataclass
... class Record:
...     name: str
...     age: int
... 
>>> parse_input(
...     [{'name': 'Alvaro', 'age': 0}, {'name': 'Luca', 'age': 0}],
...     tuple[Record, Record]
... )
(Record(name='Alvaro', age=0), Record(name='Luca', age=0))

or fail if something (including the type annotation) is wrong:


>>> parse_input({'name': 'Alvaro', 'age': "0"}, Record)
Traceback (most recent call last):
...
WrongTypeError: Expecting value of type 'int', not str.

The above exception was the direct cause of the following exception:
Traceback (most recent call last):
...
WrongFieldError: Cannot process field 'age' of value into the corresponding field of 'Record'

The function would follow some reasonably unambiguous rules (e.g. lists can be
converted to tuples, dicts can be converted to dataclasses). For the rest, there
would be an extension
mechanism (see docs at Defining custom parsers — Validobj 1.2 documentation) which would
allow producing types annotated with their validation logic (implemented
using typing.Annotated).

import decimal


def to_decimal(inp: str | float) -> decimal.Decimal:
    try:
        # Go through str so that a float input such as 0.5 yields
        # Decimal('0.5') rather than exposing the float's binary
        # representation error.
        return decimal.Decimal(str(inp))
    except decimal.InvalidOperation as e:
        raise ValidationError("Invalid decimal") from e


Decimal = Parser(to_decimal)

>>> parse_input(0.5, Decimal)
Decimal('0.5')

This would allow the default conversions to remain conservative while still
being useful beyond them.
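
For illustration, here is one plausible way Parser could be spelled on top of
typing.Annotated (a hypothetical sketch; the proposal only requires that some
such mechanism exist):

import typing


def Parser(func):
    # Take the target type from the callable's return annotation and wrap
    # it in typing.Annotated, attaching the callable as metadata.
    # parse_input would then retrieve the callable via typing.get_args(spec).
    target = typing.get_type_hints(func).get('return', object)
    return typing.Annotated[target, func]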

Finally, there would be an exception
hierarchy (see validobj.readthedocs.io/en/latest/errors.html) of validation
errors (with a base ValidationError). I believe this should be comprehensive
enough to make it possible to programmatically pinpoint the source of a
validation error (e.g. to then attach line-number information to it; see the
example at validobj.readthedocs.io/en/latest/examples.html#yaml-line-numbers).
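
As a rough sketch of what programmatic handling could look like (assuming only
the base ValidationError and the exception chaining shown in the traceback
above):

>>> try:
...     parse_input({'name': 'Alvaro', 'age': '0'}, Record)
... except ValidationError as e:
...     # Walk the __cause__ chain to locate the innermost failure.
...     while e is not None:
...         print(type(e).__name__)
...         e = e.__cause__
... 
WrongFieldError
WrongTypeError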

Usages

My main motivation is to read configuration files in languages like YAML, TOML
or similar into higher-level objects, validate their fields and emit useful
error messages whenever needed.

This both avoids having to deal with many layers of nested dictionaries and
having to check the types at runtime at many disconnected places.
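
For example (a hypothetical usage of the parse_input function above; the
ServerConfig class and server.toml file are made up for illustration):

import dataclasses
import tomllib


@dataclasses.dataclass
class ServerConfig:
    host: str
    port: int
    tls: bool = False


# tomllib produces plain dicts, lists and scalars; parse_input then
# validates them against the dataclass schema in one step.
with open('server.toml', 'rb') as f:
    config = parse_input(tomllib.load(f), ServerConfig)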

Precedents

An AttrDict object was about to be
added to Python 3.12 (see cpython/issues/96145). Compared
to that, this proposal solves the same problem of allowing attribute access to
the fields, but it also allows validating the schema. It does not introduce a
new object with different semantics, but makes use of existing ones. Arguably it
addresses a common complaint regarding dataclasses (the type annotations finally
do something without external tools!), thereby making them more intuitive.

Libraries implementing variations of this behaviour have been quite successful
(e.g. Pydantic) and are used internally by popular projects.

Should the whole typing specification be supported?

I don’t believe it is worth worrying about static-only features such as generics
in a runtime checker, and I don’t think these are needed to make this proposal
compelling. The same goes for annotations mainly used for classes or return
types rather than data.

Not proposed

  • Any sort of configuration in the conversion (e.g. what is the right decimal
    context for an input, or whether a namedtuple should accept a list or a dict
    input). Use a custom validator for that (see the sketch after this list).
  • Any Model class with enhanced functionality and state: use Pydantic for that.
  • A way for a type to declare its favourite processing without using a custom
    validator. This might be done later.
  • Deserializing from raw bytes rather than Python objects for performance: it
    opens a can of worms and could be done later.
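
For instance, a caller who needs a specific decimal context could express it as
a custom validator (a sketch reusing the Parser and ValidationError names from
above):

import decimal

# A context with four significant digits, applied at parse time.
CTX = decimal.Context(prec=4)


def to_decimal4(inp: str) -> decimal.Decimal:
    try:
        return CTX.create_decimal(inp)
    except decimal.InvalidOperation as e:
        raise ValidationError("Invalid decimal") from e


Decimal4 = Parser(to_decimal4)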

Why should this be in the standard library?

  • The functionality has often been requested, and attempts have been made to
    address it partially.
  • Variations of the idea have been (re)invented many times. There are
    successful libraries with similar ideas that are part of popular projects.
  • Long term it may be useful to have a standard for deserialization.
  • People who need to process deeply nested JSON records may also be subject
    to policies preventing the use of external libraries.
  • The typing specification changes frequently, which makes it difficult for a
    single library on PyPI to support many versions of Python.
  • The runtime usage of annotations is an important use case for big projects
    like FastAPI, whether intended or not, and having that functionality
    exercised in the standard library would make it easier to spot problems
    (like those of PEP 563 – Postponed Evaluation of Annotations).

Some GitHub projects (not allowed to link):

  • EvgeniyBurdin/validated_dc
  • matchawine/python-enforce-typing
  • tamuhey/dataclass_utils
  • Fatal1ty/mashumaro
  • konradhalas/dacite

Why not use JSON schemas?

Such a feature (in other words, re-implementing Pydantic/attrs/msgpack/marshmallow or any kind of library with runtime validation) is extremely challenging (some of them were created several years ago and are still in development). There are also a lot of opinionated choices to be made, and edge cases to be taken into account. I’ll give an example that comes up quite often in Pydantic:

How to handle unions? Should each type be tried from left to right?

If you do perform some validation/type coercion you’ll probably face some performance issues, giving an advantage to 3rd party libraries with a Rust/C core.


Considering that the already existing third-party libraries took years to develop, how much time would it take to make this available in the stdlib? Even with several developers working on it, such a library needs user feedback, which usually accumulates over a long period of time.

Finally, the biggest issue in my opinion would be the inability to upgrade to a newer version of this library. If I’m on 3.13 and discover a bug or need a new feature, I’ll have to wait for a new version to be released (which could be 3.14 if not backported).

(Yes, this applies to every stdlib module, but not a single one of them reaches the amount of complexity this would require.)


JSON schemas require a different specification language and different, external tools to process them. Here, instead, schemas are defined in terms of the same Python objects that are going to be used in the code anyway. This is an advantage for ergonomics, editor support, and overall tooling footprint.

Tools like FastAPI go the other way and generate a schema based on the content of dataclasses. That is certainly possible, but it is not proposed here.

Such a feature (in other words, re-implementing Pydantic/attrs/msgpack/marshmallow or any kind of library with runtime validation) is extremely challenging (some of them were created several years ago and are still in development). There are also a lot of opinionated choices to be made, and edge cases to be taken into account. I’ll give an example that comes up quite often in Pydantic:

The proposal here is not to reimplement Pydantic, or similar. Pydantic comes with state, configuration flags, its own model classes, a default non-strict mode, many opinionated choices on how to coerce types, and many other characteristics that are not being proposed here.

I believe the 90% use case can be well served by the interfaces proposed above: namely, a small and sensible set of default coercion rules plus the ability to extend them in arbitrary ways using annotated types.

How to handle unions? Should each type be tried from left to right?

I believe left to right is the obvious choice. The reason it is not the default in Pydantic seems to be mainly the looser coercion rules it implements.

Another option I have seen is to throw an error if there is more than one possible match, but that goes against the semantics of unions.
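
A minimal sketch of what left-to-right resolution means (parse_input and
ValidationError as proposed above; parse_union is a hypothetical internal
helper):

def parse_union(value, members):
    # Try each union member in declaration order and return the first
    # successful validation.
    for member in members:
        try:
            return parse_input(value, member)
        except ValidationError:
            continue
    raise ValidationError(f"{value!r} matches no member of the union")

So parse_input('x', int | str) would try int, fail, and then succeed with str.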

If you do perform some validation/type coercion you’ll probably face some performance issues, giving an advantage to 3rd party libraries with a Rust/C core.

Surely the potential for a feature to be optimized is not a reason for it not to be included in the standard library (the one set of C-compiled modules everyone has access to with no problems). Having a standard interface also makes it easier for a third party to provide an accelerated module (see the much more complicated case of asyncio).

Finally, the biggest issue in my opinion would be the inability to upgrade to a newer version of this library. If I’m on 3.13 and discover a bug or need a new feature, I’ll have to wait for a new version to be released (which could be 3.14 if not backported).
(Yes, this applies to every stdlib module, but not a single one of them reaches the amount of complexity this would require.)

I disagree that this requires a large amount of complexity, on the grounds that I have written a library (see above) that does all I want in a few hundred lines of code (including tests and documentation). Of those, most of the difficulty comes from trying to correctly interpret type annotations at runtime across several versions of Python, and from dealing with other changes across Python versions (notably in the enum module). All of these seem to me to be arguments that this would fit well in the stdlib.

The logic itself is essentially a big recursive if chain. The complexity of this feature is probably an order of magnitude smaller than that of the enum module, and a bit greater than that of the proposed AttrDict class (arguably for a much bigger benefit).
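
To make that concrete, here is a deliberately simplified sketch of such a
dispatch (ignoring string annotations, unions, error chaining and many other
cases a real implementation would need):

import dataclasses
import typing


def parse_input(value, spec):
    origin = typing.get_origin(spec)
    if origin is list:
        # e.g. list[int]: validate each element against the item type.
        (item_type,) = typing.get_args(spec)
        return [parse_input(item, item_type) for item in value]
    if dataclasses.is_dataclass(spec):
        # Map each key of the input dict onto the corresponding field.
        types = {f.name: f.type for f in dataclasses.fields(spec)}
        return spec(**{k: parse_input(v, types[k]) for k, v in value.items()})
    if isinstance(spec, type) and isinstance(value, spec):
        return value
    raise TypeError(f"Cannot parse {value!r} as {spec}")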

You’re right. I have never used these tools, but they seem very simple to use:

https://marshmallow.readthedocs.io/en/stable/examples.html

It seems to me that your library, @Zah, is not much simpler.