Allowing override of JSON dumps serialization for standard types

I come from here: Subclassed json.JSONEncoder does not respect default method for supported types (python/cpython issue #74528 on GitHub).

As mentioned there, I need to serialize a class (a 128-bit value that extends the built-in int) into an RFC 4122-compliant string, and the fact that I can’t override int serialization (which, of course, cuts off my ints, because many JSON consumers treat numbers as floats…) seems bonkers.

The original author of the PR had done good work, but it was rejected on the grounds of “needing more discussion”, which I understand: their change was potentially breaking, and if you want to make that kind of change, it’s better to do it well and durably.

However, all this new feature really needs in order to be non-breaking is an “override()” method on the JSONEncoder class. Its base implementation in JSONEncoder would simply “pass”, but you could extend it just as you currently extend “default()”. Unlike “default()”, this “override()” method would be consulted before the standard conversions.
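For concreteness, here is what using such a hook might look like. Nothing below exists in the standard library; “override()” is the proposed method, and MyUUID is an invented stand-in for my int subclass:

import json

class MyUUID(int):
    """Hypothetical 128-bit value that should serialize as an RFC 4122 string."""

class MyEncoder(json.JSONEncoder):
    def override(self, o):
        # Proposed hook: consulted *before* the standard type conversions,
        # unlike default(), which is only a fallback for unknown types.
        if isinstance(o, MyUUID):
            return str(o)  # in practice, the RFC 4122 text form
        return o  # the base implementation would just pass the value through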

In the meantime, I’ll be writing my own “json.dumps preprocessor” for my code, but this seems like a lot of trouble for a feature which should absolutely be standard.

If there is any better way to override the JSON serialization for standard types, or classes that extend standard types, I’m all ears.


It would help if you could point to the docs of JSON serialization – I vaguely recall that you can override for specific classes, so if you subclass int, can’t you add the magic method to your subclass?

(Aside: using terms like “bonkers” seems unnecessary. Let’s just focus on ways we could solve the problem.)

While the behavior of JSONEncoder.default / dumps(default=) is often surprising to new users, I don’t think it’s unreasonable. For speed, classes that directly map to JSON types can’t be overridden. If you want to change the output you’re going to get, you do a preprocessing step. At that point, I doubt you’re getting any significant performance difference from using default vs your own preprocessor.

This often comes up in Flask, where users are confused because the default json_response function doesn’t serialize SQLAlchemy query results. The answer there is to use a serialization library like Marshmallow or cattrs that is designed to get data to and from a format that’s appropriate for JSON.


I agree default could have been named fallback, but too late now (ie not worth the disruption).

I don’t think the standard-library JSON module should learn this functionality, as I think it should be a strict JSON encode/decode library for systems without the ability to install third-party packages. I think it already has too much functionality.

I’m sure you’re aware of the JSON encoders on PyPI which support the functionality you’re requesting, eg: simplejson, rapidjson, orjson (but not fast-json, ijson, json5, or nujson)


I think the two-pass (preprocessor/postprocessor) approach is best.

  • Users can choose a preprocessor that matches their use case.
  • The same preprocessor can be used for toml, msgpack, etc.
  • A postprocessor can use context information (see the example below).
In [1]: from pydantic import BaseModel

In [2]: from datetime import date

In [3]: class Child(BaseModel):
   ...:     name: str
   ...:     birthday: date
   ...:

In [4]: c = Child(name="miro", birthday=date(2012, 3, 24))

In [5]: s = c.json()

In [6]: s
Out[6]: '{"name": "miro", "birthday": "2012-03-24"}'

In [7]: Child.parse_raw(s)
Out[7]: Child(name='miro', birthday=datetime.date(2012, 3, 24))

In this example, pydantic can convert “2012-03-24” to a date because it knows the “birthday: date” annotation. That is what I mean by “context”.

I am wary of making JSONEncoder complicated. Many people may have many different customization requests.

Instead of making JSONEncoder more customizable, I would like to provide a toolkit for writing your own encoder. Since JSON is simple, providing a str encoder and a float encoder would be enough for people to write a custom encoder themselves.

We expose encode_basestring and encode_basestring_ascii in json.encoder, but they are not documented. And we don’t expose a float encoder yet.

So my counter proposal is:

  • Rename encode_basestring and encode_basestring_ascii to encode_str and encode_str_ascii. (Keep the undocumented encode_basestring for backward compatibility.)
  • Expose the encoder_encode_float() C function as a json.encoder.encode_float() Python function.
  • Be conservative about making JSONEncoder more customizable; recommend the two-pass approach, or writing a custom encoder (see the sketch below), as the first answer to such feature requests.
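To illustrate, here is a minimal sketch of such a hand-rolled encoder. Only json.encoder.encode_basestring is a real (though undocumented) helper; encode_int and encode_float are hypothetical stand-ins for the building blocks this proposal would expose:

import math
from json.encoder import encode_basestring

def encode_int(i):
    # Hypothetical: full control over int rendering,
    # e.g. 128-bit values as strings.
    return str(int(i))

def encode_float(f):
    # Hypothetical stand-in for the proposed json.encoder.encode_float().
    return repr(f) if math.isfinite(f) else "null"

def encode(obj):
    # The core walk of a JSON encoder, written by hand.
    if obj is None:
        return "null"
    if obj is True:
        return "true"
    if obj is False:
        return "false"
    if isinstance(obj, str):
        return encode_basestring(obj)
    if isinstance(obj, int):  # int subclasses land here too
        return encode_int(obj)
    if isinstance(obj, float):
        return encode_float(obj)
    if isinstance(obj, (list, tuple)):
        return "[" + ", ".join(encode(v) for v in obj) + "]"
    if isinstance(obj, dict):
        return "{" + ", ".join(
            encode_basestring(str(k)) + ": " + encode(v)
            for k, v in obj.items()
        ) + "}"
    raise TypeError(f"cannot encode {type(obj).__name__}")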

Guido van Rossum

Here is the doc; the only method that allows overriding is default(), but that method is only ever called for non-standard types (so it is effectively an “afterthought” of normal serialization, in a sense), so standard types like bool or int (and, importantly, subclasses that inherit from them) cannot have their serialization overridden.
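A quick demonstration of the limitation (BigId is an invented example):

import json

class BigId(int):
    """Hypothetical 128-bit identifier."""

def as_string(obj):
    if isinstance(obj, BigId):
        return str(obj)  # never reached: int is already a supported type
    raise TypeError

print(json.dumps(BigId(2**100), default=as_string))
# 1267650600228229401496703205376  <- emitted as a bare number;
# default() was never consulted.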

(Aside: I see “bonkers” as a silly term, but rereading the sentence, I do understand how it could be construed as some form of attack: I apologize. My sentiment was more along the lines of “I tried this for so long and with so little success that I’m starting to think I am insane, or at least very, very stupid”. And, while I might try to defend myself from accusations of the former, I will make no such claims for the latter. :sweat_smile: )

David Lord

The idea would precisely be to bake the design of a simple (possibly limited) preprocessing method into the JSONEncoder class, so that it can be leveraged like default() is, for a “postprocessor that just handles unhandled types” (a caricatured description, but bear with me). I think expecting new users to write their own (fast) preprocessor (for JSON-like structures that might be heavily nested) is not the best idea. The advantage of a design like default() is that it is pretty easy to get a handle on (i.e., just write a series of “if isinstance(obj, X): return Y” checks, then raise TypeError for the unhandled case). I think something similar should be done for a preprocessor.

Laurie O.

I did not know about many of these, actually. Where I work, we tend to limit our reliance on third-party dependencies as much as possible, but I’ll definitely take a look at these, thanks!

Once again, my thinking is just to provide “syntactic sugar” for simple preprocessors in the standard json lib; but if you consider that the current json lib already has too much functionality, I can see why you’d disagree with my suggestion.

Inada Naoki

Instead of making JSONEncoder more customizable, I would like to provide a toolkit for writing your own encoder. Since JSON is simple, providing a str encoder and a float encoder would be enough for people to write a custom encoder themselves.

This seems like a fine solution to me as well. I’m not sure about the implementation specifics you describe later, though.

Any function a user writes and passes as default will be essentially as fast as a function they write that preprocesses their data. Preprocessing is likely to be faster, since it’s only called once for the entire data. There is essentially no difference between writing a function to pass as default versus writing a function to call.

By “not the best idea”, I was referring to making code more complex than it needs to be. Of course, performance-wise, a preprocessor is a fine idea. Perhaps we’re both victims of our respective sampling biases, but I’m pretty confident my newbies wouldn’t be able to write a good preprocessor (clean, legible, and efficient) without some oversight, while I see no problem with them extending the JSONEncoder class from a basic, 10-line example. Of those that Laurie O. recommended to me, I think only simplejson’s “for_json” is within their range at this point in time. :sweat_smile:
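For reference, here is roughly what simplejson’s hook looks like in use (a sketch assuming simplejson is installed; Point is an invented class):

import simplejson

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def for_json(self):
        # simplejson calls this, when dumps(..., for_json=True) is used,
        # for objects it cannot otherwise serialize.
        return {"x": self.x, "y": self.y}

print(simplejson.dumps(Point(1, 2), for_json=True))
# {"x": 1, "y": 2}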

Unless there’s a way of easily writing a JSON preprocessor, compatible with the JSONEncoder class, that I’m missing? :thinking:

Writing a general preprocessor does seem pretty hard to me (you have to walk the data structure, looking for instances of your type, and replace them - that walk is precisely the core loop of a JSON encoder, so having to implement it again is clearly not trivial).
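For what it’s worth, here is a rough sketch of that walk for a single known subclass (MyInt is invented); the edge cases (tuples, dict keys, sets, circular references) are what make a truly general version hard:

import json

class MyInt(int):
    pass

def preprocess(obj):
    # Walk the structure, replacing MyInt instances before encoding --
    # the same loop a JSON encoder already performs internally.
    if isinstance(obj, MyInt):
        return str(obj)
    if isinstance(obj, dict):
        return {k: preprocess(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [preprocess(v) for v in obj]
    return obj

data = {"id": MyInt(2**100), "tags": [MyInt(1), "a"]}
print(json.dumps(preprocess(data)))
# {"id": "1267650600228229401496703205376", "tags": ["1", "a"]}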

However, that’s if you want a general mechanism for encoding a custom int subclass. In most application code that I’ve written using JSON, there’s usually a specific data structure involved. In that case, there’s no need to write a generic preprocessor, you can just do something like data["record"]["field"] = convert_my_field_to_str(data["record"]["field"]).

Obviously that’s over-simplified, but my point is that you’re discussing this in the abstract, and abstract preprocessors are hard. But for a specific use case, an application-specific fix may well be very easy. And if it’s still messy, maybe what you want is a library like cattrs, which takes your application data structure, with custom data types, and provides translations to and from a JSON-compatible dictionary format.


AFAIK all of these (including pydantic) still use json.dumps() internally and suffer from exactly the same problem that’s discussed here.

Again, preprocessing a pydantic BaseModel instance looks like rewriting 50% of pydantic to me, doesn’t it?

Howdy,

I’m the author of cattrs. David is right in that this sounds like an easy job for cattrs; a couple of lines of code to apply the override.

Yes, cattrs will ultimately call json.dumps for you, but before that happens it will apply your hook, transforming your int subclass into a string. json.dumps just sees the string. This is how other types like dates are routinely handled.
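Something along these lines (a sketch based on the cattrs documentation; MyUUID is an invented int subclass):

from cattrs.preconf.json import make_converter

class MyUUID(int):
    pass

converter = make_converter()
converter.register_unstructure_hook(MyUUID, str)  # render as a JSON string

print(converter.dumps({"id": MyUUID(123)}))
# {"id": "123"}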


[Sorry to bump an old discussion]
I ran into a related problem today. I work in observability: my job is to take arbitrary data from other people’s libraries and send it, serialized, to a server for storage. So I do not control what I have to serialize, but I must send valid JSON to the server. Sometimes I get floats with values that are not valid in JSON, i.e. +inf, -inf, and nan. JSONEncoder lets me raise a ValueError with allow_nan=False, but that’s not what I want: I want to serialize them to null and leave the decision to discard the data (or not) to people who have a clue about it. simplejson, for example, has an ignore_nan parameter to handle this case, but I see the pitfall of adding a boolean parameter for each such case. As for the original poster, a generic way to override serialization for basic types would have made this issue easy to solve for me.

cattrs will handle this:

>>> from cattrs.preconf.json import make_converter
>>> import math
>>> converter = make_converter()
>>> converter.dumps(float("-inf"))
'-Infinity'
>>> def conv(f):
...     if math.isfinite(f):
...         return f
...     return None
...
>>> converter.register_unstructure_hook(float, conv)
>>> converter.dumps(float("-inf"))
'null'
>>>

This may not be the best approach - I’m not an expert in cattrs, I just put this together in a few minutes from reading the documentation - but it should give you an idea of how to start.

The message remains the same, though: for this sort of application, a pre/post-processor is the best approach, and if you don’t want to write your own, libraries like cattrs exist to do the heavy lifting for you.
