Introduce a __json__ magic method

If you need to serialize arbitrary objects, you’ll probably need to preserve distinctions such as list vs. tuple or int vs. float, or serialize functions by name. It’s possible to encode all of that in JSON, but the resulting JSON would not be very readable, and you’re probably better off using pickle.
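For instance, a round trip through the stdlib json module silently collapses the tuple/list distinction:

```python
import json

# A tuple on the way in...
data = {"point": (1, 2), "scale": 2.0}

# ...comes back as a list, with no way to tell it was ever a tuple.
restored = json.loads(json.dumps(data))
print(restored)  # {'point': [1, 2], 'scale': 2.0}
print(type(restored["point"]))  # <class 'list'>
```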

Once you can represent list/tuple, functions, and similar built-in types, you can use pickle-related special methods to represent all other pickleable objects. In particular, __setstate__ is very similar to the proposed __decompose__.
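As a rough illustration of that parallel (the Config class here is made up for the example), pickle’s __getstate__/__setstate__ already let an object decompose itself into plain state and be rebuilt from it:

```python
import pickle


class Config:
    def __init__(self, path):
        self.path = path
        self.cache = {}  # derived state we don't want to persist

    def __getstate__(self):
        # Decompose the object into a plain dict of persistable state.
        return {"path": self.path}

    def __setstate__(self, state):
        # Rebuild the object from that decomposed state.
        self.path = state["path"]
        self.cache = {}


restored = pickle.loads(pickle.dumps(Config("/etc/app.toml")))
print(restored.path)  # /etc/app.toml
```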

FWIW, that’s a limitation of TOML 0.5 (from 2018).
Current TOML (1.0, released in 2021 and implemented by e.g. tomlkit or tomli-w) is much more reasonable in this regard.

>>> print(tomli_w.dumps({"foo": [1, {"2": 3}]}))
foo = [
    1,
    { 2 = 3 },
]

I’d rather see fewer things treating JSON as special over time. It’s not really a good machine format. Even if you need something that can handle arbitrary mappings and arrays of primitive data types without a pre-agreed schema, and the data might come from an untrusted source (making pickle inappropriate), I’d rather see people reach for better options like msgpack.

There are plenty of good libraries for structuring and restructuring data, and many are partially agnostic to the serialization format. These tend to work by providing [de]serialization functions rather than methods, or by supporting only specific formats but with optimized handling of those for the library’s types; each approach has its own extensibility benefits.


Well, this ended up much longer than I initially thought it would! And personally, I don’t have a huge need for this kind of thing, but it’s been a really interesting thought process to explore.




Something very similar to this proposal but more generic is Rust’s serde: “Serde is a framework for serializing and deserializing Rust data structures efficiently and generically.”

Serde defines two traits (think of them as dunder methods) called Serialize and Deserialize. They are a way of mapping every Rust data structure into one of 29 possible types, and Serde implements these for most of the standard library’s types. Third-party libraries are free to implement these traits on their structures (classes). This approach has been so popular that practically every Rust library implements them despite Serde not being part of Rust’s stdlib.

Rust also has feature gates, which means you can conditionally implement Serde support. The convention in the Rust community is to gate Serde behind a feature (think of it as an extra in Python packaging), so consumers of a library who don’t care about Serde won’t have to pull in a dependency for a feature they won’t use.

Now, serde itself cannot actually (de)serialize from one concrete format to another; it just acts as a unified API for those that want to. Crucially, serde doesn’t need to concern itself with the specific types that each format supports, as that is the responsibility of the client libraries.

For example, if you want to convert a struct to JSON, you’ll have to install serde_json:

use serde::{Serialize, Deserialize};

#[derive(Serialize, Deserialize, Debug)]
struct Point {
    x: i32,
    y: i32,
}

fn main() {
    let point = Point { x: 1, y: 2 };

    // Convert the Point to a JSON string.
    let serialized = serde_json::to_string(&point).unwrap();

    // Prints serialized = {"x":1,"y":2}
    println!("serialized = {}", serialized);

    // Convert the JSON string back to a Point.
    let deserialized: Point = serde_json::from_str(&serialized).unwrap();

    // Prints deserialized = Point { x: 1, y: 2 }
    println!("deserialized = {:?}", deserialized);
}

This approach means “client” libraries that are tailored to specific formats (TOML, YAML, BSON, etc.) work with serde to do the (de)serializing.

Unlike pydantic or msgspec, serde’s scope is strictly limited to serialization and deserialization; it doesn’t automatically generate other methods like __init__, __repr__, or validation logic. This means that if a Rust library chooses not to enable the serde feature, the definition of its structs remains completely unchanged for users who don’t need serialization or deserialization. In contrast, the absence of inheriting from msgspec.Struct or pydantic.BaseModel in Python would drastically alter the class definition and its capabilities. Trying to achieve a similar optionality in Python by writing two separate class definitions, one inheriting from pydantic.BaseModel (or msgspec.Struct) and another without, is highly impractical due to code duplication and the complexity of maintaining two distinct versions of the same data structure.

To achieve something similar to serde in Python without these limitations, we could propose two new dunder methods:

  • def __serialize__(self) -> Any
  • def __deserialize__(cls, data: Any) -> Self (a classmethod)

The __serialize__ method in Python would act similarly to a Rust type implementing the Serialize trait. Instead of mapping to one of serde’s 29 possible types, it would aim to map the Python object’s data into a restricted set of standard Python types like dictionaries, lists, strings, numbers, booleans, etc. This set would form an intermediate, format-agnostic representation of the object.

Then, similar to how serde_json or serde_yaml work in Rust, external Python libraries would be responsible for taking this intermediate representation and converting it into a specific format (like JSON, YAML, etc.). These libraries would understand the structure of the standard Python types and how to encode them according to the rules of their target format.

Conversely, the __deserialize__ class method would mirror the Deserialize trait. A format-specific library (e.g., one handling JSON) would first parse the incoming data from its format into a Python object. It would then call the __deserialize__ method of the target class, passing this object as the data argument. The __deserialize__ method would then be responsible for creating a new instance of the class and populating its attributes based on the data.

This separation of concerns, where the core class only needs to know how to represent itself in a standard Python way and the format-specific libraries handle the actual encoding and decoding, is a key aspect of serde’s design that contributes to its flexibility and wide adoption. It allows new formats to be supported simply by creating a new “client” library that understands the intermediate representation, without requiring changes to the core classes themselves. This approach also ensures that the class doesn’t become tightly coupled to any particular serialization format.

Here’s a crude implementation:


# --- User code ---

import base64
import json
from collections.abc import Sequence
from typing import Any, Self


class Point:
    def __init__(self, x: float, y: float):
        self.x = x
        self.y = y

    def __serialize__(self) -> dict[str, float]:
        return {"x": self.x, "y": self.y}

    @classmethod
    def __deserialize__(cls, data: Any) -> Self:
        if isinstance(data, dict):
            return cls(float(data["x"]), float(data["y"]))
        elif isinstance(data, Sequence) and not isinstance(data, (str, bytes)):
            return cls(float(data[0]), float(data[1]))
        else:
            raise TypeError(
                f"Unsupported data type for deserializing Point: {type(data).__name__}. "
                "Expected a dictionary or a sequence."
            )


# --- Hypothetical third-party JSON (de)serializing library ---


def default(obj: Any) -> Any:
    """
    Example handler for the limited set of types returned by __serialize__.
    """
    if isinstance(obj, bytes):
        return base64.b64encode(obj).decode("utf-8")
    if isinstance(obj, ...):  # Some other type
        return ...
    if isinstance(obj, ...):  # Some other type
        return ...
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable.")


def to_json_string(obj: Any) -> str:
    if hasattr(obj, "__serialize__"):
        return json.dumps(obj.__serialize__(), default=default)
    raise TypeError(f"Object of type {type(obj).__name__} is not serializable.")


from typing import TypeVar

T = TypeVar("T")


def from_json_string(json_str: str, cls: type[T]) -> T:
    try:
        data = json.loads(json_str)
        if hasattr(cls, "__deserialize__"):
            return cls.__deserialize__(data)
        raise TypeError(f"Class {cls.__name__} is not deserializable.")
    except json.JSONDecodeError as e:
        raise ValueError(f"Invalid JSON string: {e}") from e
    except (TypeError, ValueError) as e:
        raise ValueError(f"Deserialization error for {cls.__name__}: {e}") from e


# --- Usage Example ---

point = Point(5, 10)

# Serialization
try:
    json_output = to_json_string(point)
    print(f"Serialized to JSON: {json_output}")
except TypeError as e:
    print(f"Serialization Error: {e}")

# Deserialization
try:
    deserialized_point = from_json_string(json_output, Point)
    print(f"Deserialized Point: x={deserialized_point.x}, y={deserialized_point.y}")
except (TypeError, ValueError) as e:
    print(f"Deserialization Error: {e}")

For more convenience, the Python standard library could potentially include @serializable and @deserializable decorators. When applied to a class, these decorators could automatically generate the __serialize__ and __deserialize__ methods. This functionality would be applicable if all attributes of the class are types from the standard library (such as integers, strings, lists, datetime, uuid, dictionaries, etc). The @serializable decorator would generate a __serialize__ method returning a dictionary of the object’s attributes, and the @deserializable decorator would generate a __deserialize__ class method to reconstruct the object from a dictionary. This could simplify the process for making data classes serializable and deserializable in many common cases.
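A minimal sketch of what such decorators might generate; the decorator names and the attribute-dict convention are assumptions of this proposal, not an existing API:

```python
def serializable(cls):
    # Generate a __serialize__ method returning a plain dict of the
    # instance's attributes.
    def __serialize__(self):
        return dict(vars(self))

    cls.__serialize__ = __serialize__
    return cls


def deserializable(cls):
    # Generate a __deserialize__ classmethod rebuilding an instance
    # from such a dict.
    @classmethod
    def __deserialize__(cls, data):
        obj = cls.__new__(cls)
        obj.__dict__.update(data)
        return obj

    cls.__deserialize__ = __deserialize__
    return cls


@serializable
@deserializable
class Point:
    def __init__(self, x: float, y: float):
        self.x = x
        self.y = y


p = Point(1.0, 2.0)
data = p.__serialize__()  # {'x': 1.0, 'y': 2.0}
q = Point.__deserialize__(data)
print(q.x, q.y)  # 1.0 2.0
```

A real implementation would presumably also validate the incoming dict and handle nested standard-library types, which this sketch skips.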


At least in the case of JSON, we already have protocols to convert data types to and from standard types:

| JSON type | Required methods for encoding | Called with when decoding |
|-----------|-------------------------------|---------------------------|
| "array"   | `__len__()` and `__iter__()` | `list` |
| "bool"    | `__bool__()`, `__len__()`, or absent for true | `bool` |
| "float"   | `__str__()` or `__repr__()` | `str` |
| "int"     | `__str__()` or `__repr__()` | `str` |
| "object"  | `__len__()` and `items()` | `list[tuple[Any, Any]]` |
| "str"     | `__str__()` or `__repr__()` | `str` |

BUT, you still need to register the types and hooks manually:

>>> import jsonyx as json # just as an example
>>> import numpy as np
>>> obj = np.array([
...     np.bool_(), np.int8(), np.uint8(), np.int16(), np.uint16(), np.int32(),
...     np.uint32(), np.intp(), np.uintp(), np.int64(), np.uint64(), np.float16(),
...     np.float32(), np.float64()
... ], dtype="O")
>>> types = {
...     "array": np.ndarray,
...     "bool": np.bool_,
...     "float": np.floating,
...     "int": np.integer
... }
>>> json.dump(obj, types=types)
[false, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0, 0.0, 0.0, 0.0]
>>> import jsonyx as json
>>> from functools import partial
>>> import numpy as np
>>> hooks = {
...     "array": partial(np.array, dtype="O"),
...     "bool": np.bool_,
...     "float": np.float64,
...     "int": np.int64
... }
>>> json.loads("[false, 0.0, 0]", hooks=hooks)
array([np.False_, np.float64(0.0), np.int64(0)], dtype=object)

I like the idea of __json__ or __serialize__(self, type="json"), but I don’t think it should be the standard implementation of JSON in the stdlib, because we can’t be sure that all classes implementing __json__ will return valid JSON.

Could the approach simply be to start implementing __json__ in different libraries and use it manually by default (perhaps with a new library, or a new ‘insecure’ method in the stdlib json module)?

If this is the plan, would it be reasonable to create a PEP suggesting the implementation of a __json__ method while prohibiting implementations that return invalid JSON?
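That manual opt-in already works today via the `default` hook of json.dumps; a sketch, assuming a class that chooses to define __json__ (the Color class and helper name are made up):

```python
import json


class Color:
    def __init__(self, r, g, b):
        self.r, self.g, self.b = r, g, b

    def __json__(self):
        return {"r": self.r, "g": self.g, "b": self.b}


def via_json_method(obj):
    # Fall back to the object's own __json__ hook if it has one.
    if hasattr(obj, "__json__"):
        return obj.__json__()
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")


print(json.dumps(Color(255, 0, 0), default=via_json_method))
# {"r": 255, "g": 0, "b": 0}
```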

I think @ericvsmith’s suggestion actually sidesteps a lot of the thorns in the existing suggestions while remaining fairly elegant.

In case you aren’t sure what he’s saying:

import json

@json.JSONEncoder.encode.register
def encode_my_foo(foo: Foo):
    ...
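A registration-based encoder can already be approximated today with functools.singledispatch plus the `default` hook; this is my sketch of the idea, not an existing json-module API:

```python
import json
from functools import singledispatch


@singledispatch
def encode(obj):
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


@encode.register
def _(obj: complex):
    # Third parties would register handlers for their own types the same way.
    return {"real": obj.real, "imag": obj.imag}


print(json.dumps({"z": 1 + 2j}, default=encode))
# {"z": {"real": 1.0, "imag": 2.0}}
```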

I am not sure if json should serialise more exotic objects (e.g. ndarray) by default.

I appreciate the convenience if that were so, but given the very simple spec of JSON, this functionality is (and IMO should be) extra. And it is natural that “extra” needs a bit of extra effort.

None of the JSON libraries come with a set of registrations by default. E.g. orjson does not serialise numpy arrays unless you pass option=orjson.OPT_SERIALIZE_NUMPY.

Even if a registry mechanism were implemented, I don’t think libraries should register their objects by default. Even if they do supply a to_json method, it would most likely be up to the user whether to bind it to json via the registry or not.

I think an option could be to have a standard method name (__json__ looks fine), but it would be up to the user whether to enable it. E.g.:

json.dumps(a, default=json.defaultmethods('__json__'))

where some convenience function in json can be implemented for this:

class defaultmethods:
    def __init__(self, *names):
        self.names = names

    def __call__(self, obj):
        for name in self.names:
            if hasattr(obj, name):
                return getattr(obj, name)()
        raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

This way, if a user needs a JSON serialiser with this behaviour, they can make it available conveniently:

json_full_dumps = partial(json.dumps, default=json.defaultmethods('__json__'))
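Putting that together, a self-contained runnable version of the pattern (the Vector class is made up for illustration):

```python
import json
from functools import partial


class defaultmethods:
    """default= hook that tries a list of method names on the object."""

    def __init__(self, *names):
        self.names = names

    def __call__(self, obj):
        for name in self.names:
            if hasattr(obj, name):
                return getattr(obj, name)()
        raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


class Vector:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __json__(self):
        return [self.x, self.y]


json_full_dumps = partial(json.dumps, default=defaultmethods("__json__"))
print(json_full_dumps({"v": Vector(3, 4)}))
# {"v": [3, 4]}
```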

That’s difficult: np.ndarray doesn’t implement the full Sequence protocol (it lacks count() and index()), so it’s impossible to determine what it is from method presence alone.

Also, isinstance() with an ABC is slow.

I haven’t considered additional implications, but it indeed looks like an elegant way to extend JSONEncoder.