Introduce a __json__ magic method

I’ve often felt that the current CustomEncoder approach to adding custom JSON serialization is a bit of an anti-pattern. The separation of class and serialization logic means that most libraries that wish to output json must also expose a parameter for overriding the default encoder, which can be complicated if the library already uses a custom JSONEncoder internally. In cases where your application needs to serialize more than two or three custom classes within the same context, managing your custom encoder can become a non-trivial task. In the worst case, I’ve seen developers resort to monkey patching library functionality in order to pass a custom encoder to one of the library’s dependencies.

To solve this problem, I’d like to propose a __json__ dunder method. This would allow classes to describe how they should be serialized in much the same way that __repr__ allows them to describe how they should be represented as text. It would then become trivial to pass custom classes to other libraries. The change would be quite simple: rather than JSONEncoder.default() immediately raising a TypeError, it would first check whether the object has a callable __json__ attribute and, if so, return its result, like so:

def default(self, o):
    # Fall back to the object's own __json__ hook before giving up.
    if callable(getattr(o, "__json__", False)):
        return o.__json__()
    raise TypeError(f'Object of type {o.__class__.__name__} '
                    f'is not JSON serializable')

This ensures backward compatibility for libraries that are already using a custom encoder, as the custom encoder logic will have been executed before we get to this point. Additionally, the docstring of default already instructs programmers to call JSONEncoder.default(self, o) when their custom logic cannot handle the provided input, so any custom encoder implemented according to the official guidelines would automatically make use of this new functionality. Lastly, this change should have minimal performance impact, as it would only affect cases where the program would otherwise have raised a serialization error. In my experience, such errors are not usually recovered from, so it seems unlikely that high-performance applications are out in the wild churning through such cases often enough for the additional callable(getattr(o, "__json__", False)) check to noticeably impact performance.
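
To illustrate (the class and encoder names here are made up, and this of course only works with the proposed default() above): a class could opt in via __json__, and an existing custom encoder that follows the docstring guideline would still fall through to it:

import json

class Money:
    def __init__(self, amount, currency):
        self.amount, self.currency = amount, currency

    def __json__(self):
        return {"amount": self.amount, "currency": self.currency}

class LegacyEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, set):
            return sorted(o)
        # Per the documented guideline, delegate everything else to the base
        # class, which under this proposal would try o.__json__() first.
        return json.JSONEncoder.default(self, o)

# Under the proposal this would produce '{"price": {"amount": 9.99, "currency": "EUR"}}'
json.dumps({"price": Money(9.99, "EUR")}, cls=LegacyEncoder)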

I should note that, when searching to see whether this had already been suggested, I found this topic:

In one of the comments, introducing a __json__ protocol was mentioned off-hand, but I thought it still merited a separate topic. I agree with many of the comments that a global registry should be avoided, and I’m also not as concerned with de-serialization (after all, json is not pickle). This approach doesn’t cause weird side effects for libraries, and it still allows for serialization-time customization (e.g. it’s not uncommon to have two different custom encoders for datetime objects, depending on the serialization context).

5 Likes

It’s worth noting that you can already do this: have a single encoder class that checks for the appropriate method (though if it’s done within the application, I’d recommend calling it json() rather than __json__()), and then the management is pretty much the same.
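
A minimal sketch of that approach, which works with the stdlib today (the json() convention and the class names are just illustrative):

import json

class AppEncoder(json.JSONEncoder):
    def default(self, o):
        # Any object in the application can opt in by defining json().
        if callable(getattr(o, "json", None)):
            return o.json()
        return super().default(o)

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def json(self):
        return {"x": self.x, "y": self.y}

json.dumps({"origin": Point(0, 0)}, cls=AppEncoder)  # '{"origin": {"x": 0, "y": 0}}'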

The true benefit of bringing this into the stdlib would be that objects from different creators would all use the same protocol. Question: Is that actually a good thing, or not? Since JSON doesn’t allow true round-tripping of any data type that isn’t defined natively, any decoding you do will end up being somewhat ad-hoc. Perhaps it’s better to keep it within the application?

4 Likes

Thanks for the response! I do think it’s still worth moving to the stdlib for the following reasons:

  1. json doesn’t allow true round-tripping for built-in types either: loads(dumps((1, 2, 3))) -> [1, 2, 3]. Should json stop serializing tuples? No, json never promises round-tripping-- it’s not pickle.
  2. There’s currently no way for a class (even those in the stdlib) to define a sane default (e.g. Decimal clearly maps to a json number, but since decimal isn’t a built-in type, json can’t bake behavior into the default encoder without introducing an unnecessary dependency). A lot of common-sense cases like that currently fall through the cracks, because json can’t be expected to be responsible for every type in the standard library. Adding __json__ would allow the stakeholders for each type to take responsibility for such cases (see the sketch after this list). Other examples include UUID and iterables such as map, range, and filter.
  3. If you’re using some 3rd-party SDK and it doesn’t expose an interface for passing a custom encoder, then currently your only options are monkey-patching or pre-serializing your output. That is another benefit of bringing this into the standard library: it offers guaranteed access to the stdlib’s serialization mechanism.
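
As a purely illustrative sketch of point 2 (using a subclass, since the stdlib type would have to opt in itself, and relying on the proposed default() fallback):

import json
import uuid

class JSONUUID(uuid.UUID):
    def __json__(self):
        # A UUID maps naturally onto a JSON string.
        return str(self)

# With the proposed fallback, no custom encoder would be needed:
json.dumps({"id": JSONUUID("12345678-1234-5678-1234-567812345678")})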
4 Likes

True, but that is by design of JSON.

Should it even be a goal to be able to round-trip all Python types?
The big win for JSON is that it interoperates with apps written in other languages,
in which case those other apps cannot decode encoded Python objects anyway.

Let’s not get sidetracked by lack of round-tripping:

I believe we just did something similar with the new stdlib tomllib, except in reverse – it can read, but not write, toml files.

Why limit to JSON? There are other common and useful serialisation protocols such as YAML, TOML, HDF5, ProtoBuf and pickle (which also has its own dedicated support in special methods).

I would suggest generalising by introducing a special method __serialise_prepare__, which takes one argument (beyond self): the name of the format being serialised to (similar to __format__). It must return only types which the serialisation library recognises.

This is because TOML and YAML support datetimes, whereas JSON doesn’t, and other formats have better support for binary data.

A consumer of this API would implement this method on their types as, for example:

import datetime

class MyDateTime(datetime.datetime):
    def __serialise_prepare__(self, fmt):
        # YAML and TOML have native datetime types; everything else gets a string.
        if fmt in ("yaml", "toml"):
            return self
        return self.isoformat(sep="T")
9 Likes

I’m not opposed to providing similar support for other formats supported by the stdlib, but I don’t think it’s a good idea to open this up to every conceivable format. When Python incorporated json into the stdlib, they moved it all the way upstream and took ownership of how Python would interface with the JSON format.

That’s not to say that other formats couldn’t choose to mimic this functionality. For example, pyyaml would be free to expose __yaml__ (or perhaps __yaml_format__ to avoid potential name collision in the future), but that’s a discussion for the pyyaml maintainers to have-- it’s not appropriate for python core.

Additionally, obj.__serialize_prepare__("my_favorite_format") encourages the use of magic strings and makes it more complicated to figure out whether and how a given format is supported. Imagine the documentation for a class that makes heavy use of this functionality: it would probably end up documenting each of the expected input values of __serialize_prepare__, because in effect each input is a different method, except that we’ve moved the name lookup into an ad-hoc string dispatch instead of using the built-in method lookup system.

2 Likes

Hell no. You said it yourself: JSON doesn’t require the ability to round-trip all types, so why should a __json__ dunder method? I think this whole issue of round-tripping is a red herring.

So long as the types that JSON does require to round trip do round trip, we’re good.

I’m not sure that a single method is sufficient. Surely we would need an encoder and a decoder? Or have I missed something?

In theory yes, but in practice, how would it know what object to call the decoder on? Consider:

import json

class X:
    def __json__(self): return ["X", "my state here"]

item = {
    "id": 42,
    "x": X(),
    "spam": "ham",
}
text = json.dumps(item)          # works only with the proposed __json__ support
remade_item = json.loads(text)   # "x" comes back as a plain list, not an X

There’s no way for loads to figure out that an X() should be reconstructed. That’s the inherent non-round-trippability coming in; by definition, it’s going to load as the actual native type. This ONLY works when you have a custom Python object that you want to send to something else in a particular format.

(It sounds like a narrow use-case, but it’s one that covers a remarkable number of situations.)

To be explicit, I think round-tripping Python objects in JSON is a bad idea.
I wanted to see if anyone was going to argue for it.

3 Likes

As I wrote in the past thread, __json__ solves only serialization; it cannot solve deserialization of custom objects. So I prefer a two-stage approach like pydantic’s.

If we really need to add the method, I think it should be simplejson-compatible.

https://simplejson.readthedocs.io/en/latest/#simplejson.dumps

If for_json is true (not the default), objects with a for_json() method will use the return value of that method for encoding as JSON instead of the object.
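
With simplejson that looks roughly like this (Point is just an illustrative class):

import simplejson

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def for_json(self):
        # simplejson calls this when dumps(..., for_json=True) is used.
        return {"x": self.x, "y": self.y}

simplejson.dumps(Point(1, 2), for_json=True)  # '{"x": 1, "y": 2}'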

3 Likes

It may be worth noting that adding this to JSONEncoder in the manner suggested would have the unexpected behaviour that, if a separate default is passed to dumps, the __json__ method would then be completely ignored, because providing a default argument to dumps replaces the default method on the encoder instance. (Yes, this behaviour did surprise me when I ran into it.)
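
A small demonstration of that shadowing with the current stdlib (MyEncoder is illustrative):

import json

class MyEncoder(json.JSONEncoder):
    def default(self, o):
        return "from the class"

json.dumps(object(), cls=MyEncoder)  # '"from the class"'
# Passing default= sets self.default on the encoder instance, shadowing the class method:
json.dumps(object(), cls=MyEncoder, default=lambda o: "from the argument")  # '"from the argument"'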


This may seem tangential, but as this has come up a couple of times recently, I wonder if it would make sense for the JSON documentation to recommend providing a function as the default argument of dumps instead of extending JSONEncoder?

If all you are doing when subclassing is replacing the default method then passing a default to dumps seems to be a more straightforward approach.

It looks like the simplejson documentation was changed to suggest using default. Is there a reason the python documentation still recommends subclassing instead?

Equivalent Python documentation for comparison

How I would do it is with a pair of magic methods, __serialise__() and __serialise_object__(). __serialise__() would take one argument, the encoder to serialise with, while __serialise_object__() would take the object to serialise. When a user wants to serialise an object, they would call the builtin function serialise() with the object to serialise and how to serialise it. For example, serialise(my_object, json.JSONEncoder) would serialise an object to json.
The serialise() function would be defined as:

def serialise(data: object, encoder):
    try:
        # Let the encoder handle the types it supports natively.
        return encoder.__serialise_object__(data)
    except NotImplementedError:
        try:
            # Otherwise ask the object how to prepare itself for this encoder.
            return data.__serialise__(encoder)
        except NotImplementedError:
            raise NotImplementedError(f"Cannot serialise {data!s} using {encoder!s}")

When writing the __serialise__() method, the programmer wouldn’t need to know which types the encoder can handle natively, because that’s handled in serialise().

This way the writer of the encoder only needs to deal with the types that can be natively serialised, and the programmer writing __serialise__() can effectively ignore what the data is being serialised to.

2 Likes

I’m not sure why no one thought of this, but essentially there is an underlying issue with Python: all libraries assume all types are serializable, and this assumption fails for a user-defined class. To be fair, Python almost solves the serialization problem, since many other frameworks have assumed no class is serializable (without special libraries or custom methods). Serializers like pickle grab every attribute, but not every attribute should be saved (some attributes may be specific to runtime, such as counters or a game engine’s custom OpenGL vector class; other attributes may be objects that need recursive processing, such as an attribute holding a custom vector class instance that should be converted to a tuple and stored).

My solution:

  • Make a __decompose__ method that returns a dict, other iterable, or other built-in type (Typically, a custom class is best represented by a dict, but a custom vector class is best represented by a tuple of float/int, so any built-in type should be allowed). In the case of an iterable, your __decompose__ implementation should guarantee built-in types for each value recursively, including by checking if each value has its own __decompose__ method that can be called.
  • Then a @classmethod __compose__(cls, instance) can be implemented that accepts the built-in-type instance and returns an instance of cls (see the sketch after this list).
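
A minimal sketch of this hypothetical __decompose__/__compose__ pair on a vector class:

class Vector:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __decompose__(self):
        # A vector is best represented as a tuple of built-in numbers.
        return (self.x, self.y)

    @classmethod
    def __compose__(cls, instance):
        # Rebuild the custom type from its built-in representation.
        return cls(*instance)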

IMO it is Pythonic, since it is akin to duck typing if you can treat a custom object like a built-in one (more precisely, it really is one after __decompose__), and akin to how objects use __dict__ as a backing store (though that tacit promise of serializability isn’t really kept without something like __decompose__, since the OP’s issue with json wasn’t prevented by __dict__).

This solution applies not only to json but to any serializer/deserializer. The JSON code itself has no standard way of storing the custom type (JSON only contains JSON types), but there are two solutions for that:

  1. The json (or other library) load/loads methods could grow a return_type keyword argument that could be set to your user-defined class; the class would need a __compose__ method, the parser would return whatever __compose__ returns, and the result would be an instance of your custom type (this assumes the entire block of JSON, as opposed to part of it, is one object of that type; otherwise see #2 below).
  2. (or) Call your __compose__ method yourself after parsing: if the JSON represents a list/dict of objects of your type (or a more deeply nested structure), iterate through the values after load/loads and call your class’s __compose__ classmethod on each one, as in the sketch below.
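
Continuing the Vector sketch above, option 2 might look roughly like:

import json

# Parse to built-in types first, then rebuild each custom object.
raw = json.loads('[[1, 2], [3, 4]]')
vectors = [Vector.__compose__(values) for values in raw]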

I think this is a great idea. Where I work we move a lot of data from custom classes around as JSON. We have code to convert them to/from a dict to enable this. Being able to just call json.dump() would simplify a lot of the things that we do.

If Python adds __json__, why not also __toml__, __xml__ and so on?

See my answer above for why: just make a method that makes it serializable in general, rather than serializing it multiple ways.

Regarding generalising via either the __toml__()/__xml__() suggestions or some single __serialise__() method, it’s important to remember what structures these formats can hold.

  • JSON supports arbitrary compositions of lists, dictionaries and the basic scalar types.
  • TOML looks the same as JSON except that it requires the top level to be a dictionary and has all sorts of funky rules about mixed types in nested structures (try running toml.dumps({"foo": [1, {"2": 3}]})).
  • XML is some weird nested object structure where each object has a class name, has children like a list, and has named attributes like a dictionary, except that the dictionary’s values are limited to strings. It doesn’t really translate to or from Python’s builtin types.

With that in mind:

  • If __serialise__() were to exist, it would almost certainly be unusable for XML.
  • If __serialise__() or __toml__() existed, you’d have a good chance of tripping over structures that TOML can’t handle.

I think (de)serializing custom types is better handled by a wrapper function:

import json

def to_json(obj):
    # Recursively convert unsupported types into JSON-friendly ones.
    if isinstance(obj, list):
        return [to_json(value) for value in obj]
    if isinstance(obj, dict):
        return {key: to_json(value) for key, value in obj.items()}
    if isinstance(obj, complex):
        return {"__complex__": True, "real": obj.real, "imag": obj.imag}
    return obj

json.dumps(to_json(1 + 2j))

That way you can implement any conversion strategy you like, without restrictions. The JSON library should focus on more flexible encoding and decoding, as that can’t be implemented using a wrapper. If you take a look at how many options simplejson has, you’ll know what I mean…

def dump(obj, fp, skipkeys=False, ensure_ascii=True, check_circular=True,
         allow_nan=False, cls=None, indent=None, separators=None,
         encoding='utf-8', default=None, use_decimal=True,
         namedtuple_as_object=True, tuple_as_array=True, bigint_as_string=False,
         sort_keys=False, item_sort_key=None, for_json=None, ignore_nan=False,
         int_as_string_bitcount=None, iterable_as_array=False, **kw): ...

Or use functools.singledispatch, which would make it extensible.
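
For example, a singledispatch-based variant of the wrapper above (sketch only; third-party code could register additional types):

from functools import singledispatch

@singledispatch
def to_json(obj):
    # Default: assume the object is already JSON-serializable.
    return obj

@to_json.register
def _(obj: complex):
    return {"__complex__": True, "real": obj.real, "imag": obj.imag}

@to_json.register
def _(obj: list):
    return [to_json(value) for value in obj]

@to_json.register
def _(obj: dict):
    return {key: to_json(value) for key, value in obj.items()}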

1 Like