Introduce a __json__ magic method

I’ve often felt that the current CustomEncoder approach to adding custom JSON serialization is a bit of an anti-pattern. The separation of class and serialization logic means that most libraries that wish to output json must also expose a parameter for overriding the default encoder, which can be complicated if the library already uses a custom JSONEncoder internally. In cases where your application needs to serialize more than two or three custom classes within the same context, managing your custom encoder can become a non-trivial task. In the worst case, I’ve seen developers resort to monkey patching library functionality in order to pass a custom encoder to one of the library’s dependencies.
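For concreteness, the status quo looks something like this (Decimal and datetime are just illustrative types here):

import json
from datetime import datetime
from decimal import Decimal

class AppEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, datetime):
            return o.isoformat()
        if isinstance(o, Decimal):
            return float(o)  # illustration only; float() can lose precision
        return super().default(o)

# Every call site, and every library that serializes on your behalf,
# somehow has to be told about AppEncoder:
print(json.dumps({"when": datetime(2023, 1, 1), "price": Decimal("9.99")}, cls=AppEncoder))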

To solve this problem, I’d like to propose a __json__ dunder method. This would allow classes to describe how they should be serialized, in much the same way that __repr__ allows them to describe how they should be represented as text. It would then become trivial to pass custom classes to other libraries. The change would be quite simple: rather than JSONEncoder.default() immediately raising a TypeError, it would first check whether the object has a callable __json__ attribute. If so, it returns the result, like so:

def default(self, o):
    # Proposed json.JSONEncoder.default: consult the object's __json__
    # hook before giving up.
    if callable(getattr(o, "__json__", False)):
        return o.__json__()
    raise TypeError(f'Object of type {o.__class__.__name__} '
                    f'is not JSON serializable')

This ensures backwards compatibility for libraries that are already using a custom encoder, as the custom encoder logic will have been executed before we get to this point. Additionally, the docstring of default already instructs programmers to call JSONEncoder.default(self, o) in the event that their custom logic cannot handle the provided input, so any custom encoder implemented according to the official guidelines would automatically make use of this new functionality. Lastly, this change should have minimal performance impact, as it only affects cases where the program would otherwise have raised a serialization error. In my experience, such errors are not usually recovered from, so it seems unlikely that high-performance applications are out in the wild churning through such cases often enough that the additional callable(getattr(o, "__json__", False)) check would noticeably impact performance.
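As a sketch, here is a pre-existing custom encoder written to that documented guideline; it would pick up the new hook for free (assuming the default() above is in place):

import json

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __json__(self):
        return {"x": self.x, "y": self.y}

class LegacyEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, set):
            return sorted(o)
        # Documented guidance: defer to the base class for anything else.
        # Under this proposal, this call is what would find __json__.
        return json.JSONEncoder.default(self, o)

# json.dumps(Point(1, 2), cls=LegacyEncoder) would then yield '{"x": 1, "y": 2}'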

I should note that when searching to see if this had already been suggested I found this topic:

In one of the comments, introducing a __json__ protocol was mentioned off-hand, but I thought it still merited a separate topic. I agree with many of the comments that a global registry should be avoided, and I’m also not as concerned with de-serialization (after all, json is not pickle). This approach doesn’t cause weird side effects for libraries, and it still allows for serialization-time customization (e.g. it’s not uncommon to have two different custom encoders for datetime objects depending on the serialization context).

2 Likes

It’s worth noting that you can already do this. Have a single encoder class that checks for the appropriate method (though if it’s done within the application, I’d recommend calling it json() rather than __json__()), and then the management is pretty much the same.
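A minimal sketch of what I mean:

import json

class ProjectEncoder(json.JSONEncoder):
    def default(self, o):
        # Project-wide convention: anything with a json() method knows how
        # to reduce itself to JSON-compatible data.
        method = getattr(o, "json", None)
        if callable(method):
            return method()
        return super().default(o)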

The true benefit of bringing this into the stdlib would be that objects from different creators would all use the same protocol. Question: Is that actually a good thing, or not? Since JSON doesn’t allow true round-tripping of any data type that isn’t defined natively, any decoding you do will end up being somewhat ad-hoc. Perhaps it’s better to keep it within the application?

3 Likes

Thanks for the response! I do think it’s still worth moving to the stdlib for the following reasons:

  1. json doesn’t allow true round-tripping for built-in types either: loads(dumps((1, 2, 3))) -> [1, 2, 3]. Should json stop serializing tuples? No; json never promises round-tripping. It’s not pickle.
  2. There’s currently no way for a class (even one in the stdlib) to define a sane default. For example, Decimal clearly maps to a JSON number, but since decimal isn’t a built-in type, json can’t bake that behavior into the default encoder without introducing an unnecessary dependency. A lot of common-sense cases like that fall through the cracks, because json can’t be expected to be responsible for every type in the standard library. Adding __json__ would allow stakeholders to take responsibility for such cases. Other examples include UUID and the lazy sequence objects (e.g. map, range, and filter). (See the sketch after this list.)
  3. If you’re using some 3rd-party SDK and it doesn’t expose an interface for passing a custom encoder, then currently your only options are monkey-patching or pre-serializing your output. That points to another benefit of bringing this into the stdlib: it offers guaranteed access to the stdlib’s serialization mechanism.
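To make point 2 concrete, here’s the wiring every application currently has to do for itself (a sketch; the exact Decimal and UUID mappings are just illustrations):

import json
import uuid
from decimal import Decimal

# json.dumps(Decimal("1.10")) and json.dumps(uuid.uuid4()) both raise
# TypeError today, so applications end up repeating mappings like these:
def app_default(o):
    if isinstance(o, Decimal):
        return float(o)  # illustration only; float() can lose precision
    if isinstance(o, uuid.UUID):
        return str(o)
    raise TypeError(f'Object of type {o.__class__.__name__} '
                    f'is not JSON serializable')

print(json.dumps({"id": uuid.uuid4(), "price": Decimal("1.10")}, default=app_default))

With __json__, decimal and uuid could each define their mapping once, and json would pick it up without importing either module.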
3 Likes

True, but that is by design of JSON.

Should it be a goal to be able to round-trip all Python types at all?
The big win for JSON is that it interoperates with apps written in other languages,
in which case those apps cannot decode encoded Python objects anyway.

Let’s not get sidetracked by lack of round-tripping:

I believe we just did something similar with the new stdlib tomllib, except in reverse – it can read, but not write, TOML files.

Why limit to JSON? There are other common and useful serialisation protocols such as YAML, TOML, HDF5, ProtoBuf and pickle (which also has its own dedicated support in special methods).

I would suggest generalising by introducing a special method __serialise_prepare__, which takes one argument (beyond self): the name of the format being serialised to (similar to __format__). It must return only types that the serialisation library recognises.

This is because TOML and YAML support datetimes natively, whereas JSON doesn’t, and other formats have better support for binary data.

A consumer of this API would implement this method on their types as, for example:

import datetime

class MyDateTime(datetime.datetime):
    def __serialise_prepare__(self, fmt):
        # Formats with native datetime support get the object unchanged;
        # everything else gets an ISO 8601 string.
        if fmt in ("yaml", "toml"):
            return self
        return self.isoformat(sep="T")

9 Likes

I’m not opposed to providing similar support for other formats in the stdlib, but I don’t think it’s a good idea to open this up to every conceivable format. When Python incorporated json into the stdlib, it moved the decision all the way upstream and took ownership of how Python would interface with the JSON format.

That’s not to say that other formats couldn’t choose to mimic this functionality. For example, pyyaml would be free to expose __yaml__ (or perhaps __yaml_format__ to avoid potential name collisions in the future), but that’s a discussion for the pyyaml maintainers to have; it’s not appropriate for Python core.

Additionally, obj.__serialize_prepare__("my_favorite_format") encourages the use of magic strings and makes it harder to figure out whether, and how, a given format is supported. Imagine the documentation for a class that makes heavy use of this functionality: it would probably end up documenting each of the expected input values of __serialize_prepare__, because in effect each input is a different method; we’ve just moved the name lookup into a string dispatch instead of using the built-in method lookup system.

1 Like

Hell no. You said it yourself: JSON doesn’t require the ability to round-trip all types, so why should a __json__ dunder method? This whole issue of round-tripping is a red herring.

So long as the types that JSON does require to round trip do round trip, we’re good.

I’m not sure that a single method is sufficient. Surely we would need an encoder and a decoder? Or have I missed something?

In theory yes, but in practice, how would it know what object to call the decoder on? Consider:

import json

class X:
    def __json__(self): return ["X", "my state here"]

item = {
    "id": 42,
    "x": X(),
    "spam": "ham",
}
text = json.dumps(item)          # assumes the proposed __json__ support
remade_item = json.loads(text)   # remade_item["x"] is now a plain list

There’s no way for loads to figure out that the list ["X", "my state here"] should be turned back into an X(). That’s the inherent non-round-trippability coming in; by definition, it’s going to load as the actual native type. This ONLY works when you have a custom Python object that you want to send to something else in a particular format.

(It sounds like a narrow use-case, but it’s one that covers a remarkable number of situations.)

To be explicit: I think round-tripping Python objects in JSON is a bad idea.
I wanted to see if anyone was going to argue for it.

3 Likes

As I wrote in the past thread, __json__ solves only serialization. It cannot solve deserialization of custom objects. So I prefer a two-stage approach like pydantic’s.

If we really need to add the method, I think it should be simplejson-compatible.

https://simplejson.readthedocs.io/en/latest/#simplejson.dumps

If for_json is true (not the default), objects with a for_json() method will use the return value of that method for encoding as JSON instead of the object.
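For reference, the opt-in looks like this with the third-party simplejson package:

import simplejson

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def for_json(self):
        # Only consulted when for_json=True is passed to dumps().
        return {"x": self.x, "y": self.y}

print(simplejson.dumps(Point(1, 2), for_json=True))  # {"x": 1, "y": 2}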

3 Likes

It may be worth noting that adding this to JSONEncoder in the manner suggested would have the unexpected behaviour that, if a separate default is passed to dumps, the __json__ method would be completely ignored, as providing a default argument to dumps replaces the default method on the encoder instance. (Yes, this behaviour did surprise me when I ran into it.)
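A minimal demonstration of the surprise:

import json

class Encoder(json.JSONEncoder):
    def default(self, o):
        return "from the subclass"

print(json.dumps(object(), cls=Encoder))
# -> '"from the subclass"'

# Passing default= rebinds default on the encoder *instance*, so the
# subclass method (and any __json__ check inside it) is never called:
print(json.dumps(object(), cls=Encoder, default=lambda o: "from the argument"))
# -> '"from the argument"'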


This may seem tangential, but as this has come up a couple of times recently, I wonder whether it would make sense for the JSON documentation to recommend providing a function via the default argument of dumps instead of extending JSONEncoder?

If all you are doing when subclassing is replacing the default method, then passing a default to dumps seems to be the more straightforward approach.
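For example (a sketch):

import json
from datetime import datetime

def encode_extras(o):
    if isinstance(o, datetime):
        return o.isoformat()
    raise TypeError(f'Object of type {o.__class__.__name__} '
                    f'is not JSON serializable')

print(json.dumps({"when": datetime(2023, 1, 1)}, default=encode_extras))
# No subclass, and no cls=... needed at the call site.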

It looks like the simplejson documentation was changed to suggest using default. Is there a reason the Python documentation still recommends subclassing instead?

Equivalent Python documentation for comparison

How I would do it is with a pair of magic methods, __serialise__() and __serialise_object__(). __serialise__() would take one argument, the encoder to serialise with, while __serialise_object__() would take the object to serialise. When a user wants to serialise an object, they would call a builtin function serialise() with the object to serialise and how to serialise it. For example, serialise(my_object, json.JSONEncoder) would serialise an object to JSON.
The serialise() function would be defined as:

def serialise(data: object, encoder):
    try:
        # Give the encoder a chance to handle the type natively first.
        return encoder.__serialise_object__(data)
    except NotImplementedError:
        try:
            # Fall back to asking the object to reduce itself.
            return data.__serialise__(encoder)
        except (AttributeError, NotImplementedError):
            raise NotImplementedError(
                f"Cannot serialise {data!s} using {encoder!s}") from None

When writing the __serialise__() method, the programmer wouldn’t need to know which types the encoder can handle natively because that’s handled in serialise().

This way the writer of the encoder only needs to deal with the types that can be natively serialised, and the programmer writing __serialise__() can effectively ignore what the data is being serialised to.
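To illustrate how the pieces might fit together (a sketch that reuses the serialise() above; the JSONSerialiser encoder here is hypothetical):

import json

class JSONSerialiser:
    @staticmethod
    def __serialise_object__(data):
        # The encoder deals only with what the target format supports natively.
        if isinstance(data, (dict, list, str, int, float, bool, type(None))):
            return json.dumps(data)
        raise NotImplementedError

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __serialise__(self, encoder):
        # Reduce to natively serialisable data, then hand back to serialise().
        return serialise({"x": self.x, "y": self.y}, encoder)

print(serialise(Point(1, 2), JSONSerialiser))  # {"x": 1, "y": 2}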