I would like to propose adding a library that allows converting unstructured
objects (such as a dict read from a JSON formatted file) to structured objects
such as dataclasses, and makes use of the types and type annotations at runtime
to validate and cast the input as appropriate. The idea is that one starts with
a JSON-like input (i.e. one containing lists, dicts and scalars) and gets back a Python
object conforming to the schema implied by the type annotations (or a validation
error).
A web search turns up plenty of “dict to dataclass” projects with various levels
of added functionality (I’d link them but Discourse doesn’t allow me). The most
famous of these is Pydantic, which powers the FastAPI
framework. I also have my own
library that does this, and I think that
having something like it in the standard library might be useful.
Proposed interface
There would be a main interface function:
def parse_input[T](value: object, spec: type[T]) -> T:
Where value is some JSON-like input and spec is some target type to convert the
value into. For example:
>>> import dataclasses
>>> @dataclasses.dataclass
... class Record:
...     name: str
...     age: int
...
>>> parse_input(
...     [{'name': 'Alvaro', 'age': 0}, {'name': 'Luca', 'age': 0}],
...     tuple[Record, Record]
... )
(Record(name='Alvaro', age=0), Record(name='Luca', age=0))
or fail if something (including the type annotation) is wrong:
>>> parse_input({'name': 'Alvaro', 'age': "0"}, Record)
Traceback (most recent call last):
...
WrongTypeError: Expecting value of type 'int', not str.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
...
WrongFieldError: Cannot process field 'age' of value into the corresponding field of 'Record'
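To make the intended behaviour concrete, here is a rough sketch (my own
illustration, not the proposed implementation) of how such a function could
recurse on the target type; it only covers the tuple, dataclass and scalar
cases shown above:

import dataclasses
import typing

# Minimal stand-ins for the proposed exceptions (see the hierarchy sketch
# further down).
class ValidationError(Exception): ...
class WrongTypeError(ValidationError): ...
class WrongFieldError(ValidationError): ...

def parse_input(value, spec):
    if typing.get_origin(spec) is tuple:
        # Lists can be converted to fixed-length tuples.
        args = typing.get_args(spec)
        if not isinstance(value, list) or len(value) != len(args):
            raise WrongTypeError(f"Expecting a list of {len(args)} items")
        return tuple(parse_input(item, arg) for item, arg in zip(value, args))
    if dataclasses.is_dataclass(spec):
        # Dicts can be converted to dataclasses, field by field (defaults,
        # missing keys, Optional and so on are glossed over here).
        if not isinstance(value, dict):
            raise WrongTypeError(f"Expecting a dict for {spec.__name__!r}")
        kwargs = {}
        for field in dataclasses.fields(spec):
            try:
                kwargs[field.name] = parse_input(value[field.name], field.type)
            except ValidationError as e:
                raise WrongFieldError(
                    f"Cannot process field {field.name!r} of value into the "
                    f"corresponding field of {spec.__name__!r}"
                ) from e
        return spec(**kwargs)
    if isinstance(spec, type):
        # Scalars must already have the right type; no implicit casting.
        if not isinstance(value, spec):
            raise WrongTypeError(
                f"Expecting value of type {spec.__name__!r}, "
                f"not {type(value).__name__}"
            )
        return value
    raise NotImplementedError(f"Cannot process spec {spec!r}")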
The function would follow some reasonably unambiguous rules (e.g. lists can be
converted to tuples, dicts can be converted to dataclasses). For the rest, there
would be an extension mechanism (see docs at Defining custom parsers — Validobj
1.2 documentation) which would allow producing types annotated with their
validation logic (implemented using typing.Annotated).
import decimal

def to_decimal(inp: str | float) -> decimal.Decimal:
    # ValidationError and Parser come with the proposed library.
    try:
        # Go through str so that e.g. 0.5 parses as Decimal('0.5') rather
        # than as the full binary expansion of the float.
        return decimal.Decimal(str(inp))
    except decimal.InvalidOperation as e:
        raise ValidationError("Invalid decimal") from e

Decimal = Parser(to_decimal)

>>> parse_input(0.5, Decimal)
Decimal('0.5')
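For illustration, Parser itself could be little more than a thin wrapper over
typing.Annotated (this is an assumption about how it might be implemented, not
part of the proposal): the nominal type comes from the callable’s return
annotation, and the callable travels along as metadata for parse_input to pick
up.

import typing

def Parser(func):
    # Hypothetical sketch: attach the validation callable as Annotated
    # metadata; parse_input would detect it via typing.get_origin /
    # typing.get_args and call it on the raw input.
    return_type = typing.get_type_hints(func)["return"]
    return typing.Annotated[return_type, func]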
This would allow the built-in conversions to remain conservative, while keeping
the function useful beyond them.
Finally, there would be an exception hierarchy (see
validobj.readthedocs.io/en/latest/errors.html) of validation errors (with a base
ValidationError). I believe this should be comprehensive enough to allow
programmatically pinpointing the source of a validation error (e.g. to then
attach line-number information to it; see the example at
validobj.readthedocs.io/en/latest/examples.html#yaml-line-numbers).
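As an illustration, the base of such a hierarchy could look like the sketch
below; only ValidationError, WrongTypeError and WrongFieldError appear in the
examples above, and the extra field attribute is an assumption about how
pinpointing could work.

class ValidationError(Exception):
    """Base class of all validation errors."""

class WrongTypeError(ValidationError):
    """The input value cannot be converted to the target type."""

class WrongFieldError(ValidationError):
    """A particular field of the input could not be processed."""

    def __init__(self, message: str, field: str | None = None):
        super().__init__(message)
        # Keeping the failing field name around lets callers walk the
        # __cause__ chain and pinpoint exactly where validation failed.
        self.field = field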
Usages
My main motivation is to read configuration files in languages like YAML, TOML
or similar into higher-level objects, validate their fields and emit useful
error messages whenever needed. This both avoids having to deal with many layers
of nested dictionaries and having to check types at runtime in many disconnected
places.
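For example, combined with tomllib from the standard library, a whole
configuration file could be validated in one call (server.toml and ServerConfig
are made up for this example):

import dataclasses
import tomllib

@dataclasses.dataclass
class ServerConfig:
    host: str
    port: int
    debug: bool = False

with open("server.toml", "rb") as f:
    raw = tomllib.load(f)  # a plain dict of dicts, lists and scalars

# One call validates the whole document and yields attribute access,
# instead of isinstance checks scattered over the code base.
config = parse_input(raw, ServerConfig)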
Precedents
An AttrDict object was about to be
added (see cpython/issues/96145) to Python 3.12. Compared
to that, this solves the problem of allowing attribute access to the fields, but
it also allows validating the schema. It does not introduce a new object with
different semantics, but makes use of existing objects. Arguably it addresses a
common complaint regarding dataclasses (the type annotations finally do
something without external tools!), thereby making them more intuitive.
Libraries implementing variations of this behaviour have been quite successful
(e.g. Pydantic) and are widely used.
Should the whole typing specification be supported?
I don’t believe it is worth worrying about things like compile-time generics for
a runtime checker, and I don’t think these are needed to make this proposal
compelling. The same goes for annotations mainly used for classes or return
types rather than for data.
Not proposed
- Any sort of configuration in the conversion (e.g. what is the right decimal
context for an input, or whether a namedtuple should accept a list or a dict
input): use a custom validator for that.
- Any Model class with enhanced functionality and state: use Pydantic for that.
- A way for a type to declare its favourite processing without using a custom
validator. Might be done later.
- Deserializing from raw bytes rather than Python objects for performance: it
opens a can of worms, and could be done later.
Why should this be in the standard library?
- The functionality has often been requested, and attempts have been made to
address it partially.
- Variations of the idea have been (re)invented many times. There are successful
libraries with similar ideas that are part of popular projects.
- Long term, it may be useful to have a standard for deserialization.
- People who need to process deeply nested JSON records may also be subject to
policies preventing the use of external libraries.
- The typing specification changes frequently, which makes it difficult for one
library on PyPI to support many versions of Python.
- The runtime usage of annotations is an important use case for big projects
like FastAPI, whether intended or not, and having that functionality exercised
in the standard library would make it easier to spot problems (like those of
PEP 563 – Postponed Evaluation of Annotations).
Some GitHub projects (not allowed to link):
EvgeniyBurdin/validated_dc
matchawine/python-enforce-typing
tamuhey/dataclass_utils
Fatal1ty/mashumaro
konradhalas/dacite