Add json.ExtendedEncoder to the standard library for common types (datetime, UUID, Decimal, set)

Converting many standard values (date, time, bytes, complex numbers, UUIDs, etc.) to text and back from text is quite easy.

Creating a format serializable by JSON is quite easy too: just add type info to each converted value, for example "date:2026-06-03" (BTW, "int:42" solves the JSON key type limitation). What’s quite disappointing and surprising (at least for me) is that there is no real standard for that.
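A minimal sketch of that tagging idea, covering only the two tags mentioned above. The helper names `to_tagged` and `from_tagged` are made up for illustration, and the "type:value" layout is an assumption, not an existing standard:

```python
import json
from datetime import date


def to_tagged(value):
    """Hypothetical helper: turn a value into a 'type:payload' string."""
    if isinstance(value, date):
        return "date:" + value.isoformat()
    if isinstance(value, int):
        # Note: bool is a subclass of int; a real format would need its own tag.
        return f"int:{value}"
    raise TypeError(f"no tag defined for {type(value).__name__}")


def from_tagged(text):
    """Hypothetical helper: reverse of to_tagged."""
    kind, _, payload = text.partition(":")
    if kind == "date":
        return date.fromisoformat(payload)
    if kind == "int":
        return int(payload)
    raise ValueError(f"unknown tag {kind!r}")


# Tagging the key as well lets an int survive as a JSON object key.
doc = json.dumps({to_tagged(42): to_tagged(date(2026, 6, 3))})
restored = {from_tagged(k): from_tagged(v) for k, v in json.loads(doc).items()}
```

The sketch leaves open how to tell a tagged value from an ordinary string that happens to contain a colon, which is exactly the kind of detail a shared standard would have to settle.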

Given the history of TOML, there is a chance that a well thought out solution that suits not 80%, but 99+% becomes a wide-spread standard.

Note that JSON is not the only data format; serialization issues exist across all data formats and are not specific to JSON. Each format has its own data types and constraints.

When data falls outside those constraints, it often indicates a mismatch between the data format and the use case: either the wrong format was chosen or an existing one is being misused. That is not a limitation unique to JSON. In such cases, the issue is typically not with the format itself, but with choosing an inappropriate representation for the data.

For example, using JSON to transmit binary data does not make much sense.

JSON numbers are quite literally decimal numbers. Representing them as strings because of “IEEE 754” is suspect.

If only the issue “json encoder unable to handle decimal” (Issue #60739, python/cpython on GitHub) were completed, then in conjunction with the parse_float argument to json.load, Decimal would be perfectly taken care of.
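To illustrate the asymmetry in the stdlib json module as it stands: decoding into Decimal already works through parse_float, while encoding a Decimal raises TypeError. The sample document below is made up.

```python
import json
from decimal import Decimal

text = '{"price": 1.10}'

# Default decoding goes through float, so the trailing zero is gone.
as_float = json.loads(text)["price"]

# parse_float receives the number's original text, so Decimal sees "1.10".
as_decimal = json.loads(text, parse_float=Decimal)["price"]

# The encoding half is the missing piece: the stdlib encoder rejects Decimal.
try:
    json.dumps({"price": as_decimal})
    decimal_encodes = True
except TypeError:
    decimal_encodes = False
```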

Why isn’t it JSON? The OP is proposing an opinionated way to serialize some Python object to standard JSON, not to create JSON5++.
Anyway, PYON is cool. It reminds me of StarCraft. We can build our own JSON, with blackjack and… lemonade.

I don’t get the point. I have always used JSON in the body of a request to an endpoint, never as a URL param or path.

Notice that the OP didn’t add bytes to the proposal. On the contrary, he also explained why he didn’t add it.


My reasoning differs from that of the original poster and applies to any Python object that cannot be represented as text. JSON is not an appropriate format for transmitting data types that it does not support. Serialization is not an issue with the JSON format itself. The JSON library is already complete. This means that the concern of a JSON library is to support JSON data types in Python, not Python-specific data types or objects within JSON.


ujson used to support datetime until it was eventually thrown out. No-one could agree on a datetime format[1]. It does still support Decimal but that was a mistake – again it’s contested whether it should serialise to a float or string[2].
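For a concrete picture of the disagreement, here is a small sketch that serialises the same datetime both ways through the stdlib `default=` hook. The hook names `as_iso` and `as_timestamp` and the sample value are assumptions for illustration only; this is not how ujson did it.

```python
import json
from datetime import datetime, timezone

moment = datetime(2026, 6, 3, 12, 0, tzinfo=timezone.utc)


def as_iso(obj):
    """Illustrative hook: datetimes become ISO 8601 strings."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"cannot serialise {type(obj).__name__}")


def as_timestamp(obj):
    """Illustrative hook: datetimes become UNIX timestamps (JSON numbers)."""
    if isinstance(obj, datetime):
        return obj.timestamp()
    raise TypeError(f"cannot serialise {type(obj).__name__}")


iso_json = json.dumps({"when": moment}, default=as_iso)
ts_json = json.dumps({"when": moment}, default=as_timestamp)
```

Both outputs are valid JSON, but a consumer has to know in advance which one to expect: one is a string, the other a number.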

Even that’s ambiguous. Should it reject surrogates like most UTF-8 conversions or allow them like other parts of JSON implementations usually do?


  1. Personally I consider UNIX timestamps to be the only sensible format for datetimes but I rarely meet someone who agrees

  2. and everyone who does have an opinion is adamant that theirs is the only one that makes sense


That’s not really a UTF-8 quirk but a Unicode quirk, one that works as intended. Surrogates are weird because they’re valid Unicode Codepoints but not valid Unicode Scalar Values, i.e. it’s perfectly fine to talk about U+D800, but not fine to try to encode it. Both Python and the Python json module seem to respect that.

Python lets you do chr(0xD800) or chr() of any surrogate codepoint value, but not encode it alone, e.g. in UTF-8.

>>> chr(0xD800)
'\ud800'
>>> chr(0xD800).encode()
Traceback (most recent call last):
  File "<python-input-1>", line 1, in <module>
    chr(0xD800).encode()
    ~~~~~~~~~~~~~~~~~~^^
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

Same with .encode('UTF-16') and .encode('UTF-32'). As for the json module:

>>> import json
>>> s = chr(0xD800)
>>> print(json.dumps(s))
"\ud800"
>>> print(json.dumps(s, ensure_ascii=False))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    print(json.dumps(s, ensure_ascii=False))
    ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 1: surrogates not allowed

That’s what I would expect. When ensure_ascii=True, the default behaviour, non-ASCII codepoints are written as ASCII escape sequences of the form "\uWXYZ". That permits ‘talking about’ all codepoints, including surrogate codepoints, without actually trying to encode them.
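A short sketch that separates the two steps. It relies only on the stdlib; the variable names are made up. The point it tries to show is that json.dumps itself succeeds in both modes and the error only appears once the unescaped result is encoded to bytes.

```python
import json

lone = chr(0xD800)  # a lone surrogate codepoint

escaped = json.dumps(lone)                  # ASCII-only JSON text using a \u escape
raw = json.dumps(lone, ensure_ascii=False)  # a str that still contains the surrogate

# The escaped form is plain ASCII, so it encodes and decodes back unchanged.
round_tripped = json.loads(escaped.encode("utf-8")) == lone

# The raw form fails only when something tries to encode it strictly.
try:
    raw.encode("utf-8")
    strict_utf8_ok = True
except UnicodeEncodeError:
    strict_utf8_ok = False

# Opting in with surrogatepass produces bytes, but they are not valid UTF-8.
forced = lone.encode("utf-8", errors="surrogatepass")
```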

To be more explicit, we’re talking about input like b'\xff'. That isn’t a valid UTF-8 encoded string, so you can’t “just use UTF-8”. Remember, the point here is to convert to a valid JSON data type, and there are only really 2 sensible options for that:

  1. An array of integer byte values - list(byte_val).
  2. A string, where we need to choose an encoding like Latin-1 that can correctly decode every byte string - byte_val.decode("latin-1")

In specialised cases where you know your byte string is a character string encoded using a particular encoding (likely UTF-8), then you can decode using that encoding, but that’s not a general solution as you won’t be able to handle arbitrary byte strings.

There are also dedicated encoding methods like Base64, but whether they are appropriate in any given context depends on the application.
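The options above, side by side for the same input. This is only a sketch using stdlib calls; the variable names are made up, and none of the three is being put forward as the right answer.

```python
import base64
import json

blob = b"\xff"  # not valid UTF-8

# Option 1: an array of integer byte values.
as_int_list = json.dumps(list(blob))

# Option 2: a string via Latin-1, which maps every byte to one codepoint.
as_latin1 = json.dumps(blob.decode("latin-1"))

# Dedicated encoding: Base64 text, stored as a JSON string.
as_base64 = json.dumps(base64.b64encode(blob).decode("ascii"))

# Each form is reversible, but only if the reader knows which one was used.
from_int_list = bytes(json.loads(as_int_list))
from_latin1 = json.loads(as_latin1).encode("latin-1")
from_base64 = base64.b64decode(json.loads(as_base64))
```

Nothing in the resulting JSON says which convention was applied, so the choice has to be agreed on outside the document.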

All of which demonstrates that there’s no “obvious” way of serialising arbitrary byte strings to JSON, so a stdlib encoder for them makes no sense; applications should specify their own rules. Which, to be fair, is what the OP said in the first place.


If you’ll allow me the joke, I agree. Personally I prefer RFC because it’s a little more readable with the space instead of the T. And as a matter of raw speed, UNIX timestamps are better.

Anyway, no one will force you to use ExtendedEncoder. If you don’t like the ISO format, just don’t use it.

Just my two cents: if you have a Decimal, you really care about precision, so you probably want a string. Again, if you want a float, just don’t use ExtendedEncoder 🙂
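To make the opt-in nature concrete, here is a rough sketch of what such an encoder could look like as a json.JSONEncoder subclass. The class name `SketchExtendedEncoder` is hypothetical, and the format choices (ISO strings for datetime, strings for UUID and Decimal, sorted lists for sets) are assumptions taken from this discussion, not the proposal’s actual design.

```python
import json
import uuid
from datetime import datetime, timezone
from decimal import Decimal


class SketchExtendedEncoder(json.JSONEncoder):
    """Hypothetical encoder; not part of the stdlib."""

    def default(self, o):
        if isinstance(o, datetime):
            return o.isoformat()  # assumed: ISO 8601 text
        if isinstance(o, (uuid.UUID, Decimal)):
            return str(o)  # assumed: string form, keeps every digit
        if isinstance(o, (set, frozenset)):
            return sorted(o)  # assumed: elements are mutually orderable
        return super().default(o)  # anything else still raises TypeError


payload = {
    "when": datetime(2026, 6, 3, tzinfo=timezone.utc),
    "id": uuid.UUID(int=1),
    "price": Decimal("1.3300000000000001"),
    "tags": {"b", "a"},
}

# Opt in per call; plain json.dumps(payload) keeps raising TypeError.
encoded = json.dumps(payload, cls=SketchExtendedEncoder)
```

Anyone who wants timestamps or numeric Decimals instead would simply not pass `cls=` and supply their own `default=` hook.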

BTW, why are you saying it was a mistake?

Side note: in ujson there’s a reject_bytes param. It’s True by default, probably for backward compatibility:

>>> ujson.dumps(b'\xff', reject_bytes=False)
Traceback (most recent call last):
  File "<python-input-6>", line 1, in <module>
    ujson.dumps(b'\xff', reject_bytes=False)
    ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 1: invalid start byte
>>> ujson.dumps(b'a', reject_bytes=False)
'"a"'

My opinion would be that the most correct encoding of Decimal is to serialise it as a number, and it is the responsibility of the decoder to use json.loads(..., parse_float=str) to preserve the precision.

As happens, for example, in this code fragment:

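import json
import simplejson  # third-party; can write a Decimal as a bare JSON number
from dataclasses import dataclass
from decimal import Decimal

from cattrs import Converter  # third-party

@dataclass(frozen=True)
class C:
    f: float
    x: Decimal

# Same digits, stored once as a binary float and once as an exact Decimal.
c = C(1.3300000000000001, Decimal("1.3300000000000001"))

converter = Converter()
# Numbers arrive as strings (see parse_float=str below); rebuild each type from text.
converter.register_structure_hook(Decimal, lambda v, _: Decimal(v))
converter.register_structure_hook(float, lambda v, _: float(v))

# use_decimal=True writes the Decimal as a JSON number rather than a string.
raw_json = simplejson.dumps(converter.unstructure(c), use_decimal=True)
assert raw_json == '{"f": 1.33, "x": 1.3300000000000001}'

# parse_float=str hands over the exact number text instead of a float.
c_again = converter.structure(json.loads(raw_json, parse_float=str), C)
assert c_again == c
assert c_again.f != c_again.x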

Here c.f loses precision because it is a float, but c.x keeps all its Decimal precision.

Yet I have to import simplejson to write what I consider the ‘proper’ JSON, which means that in practice, in production, we’d almost certainly use the double-stringification of Decimals, e.g. '"1.2345"'.

Even if the Python standard lib picked the (wrong!) double-stringification approach to Decimal, I’d still be happy that dumping objects/dicts with Decimal in them to json had become easier.

But I suppose that this way the Decimal is first converted to a float, then to a string, so the precision is lost.

You could also just skip the intermediate float conversion, but then parse_float=str is a “lie”.
It could be seen as a “white lie” to avoid yet another param.

It’s better than nothing anyway 🙂

Why do you suppose that?

>>> import decimal, json
>>> json.loads("1.23456789012345678901234567890")
1.2345678901234567
>>> json.loads("1.23456789012345678901234567890", parse_float=str)
'1.23456789012345678901234567890'
>>> json.loads("1.23456789012345678901234567890", parse_float=decimal.Decimal)
Decimal('1.23456789012345678901234567890')

How is it a lie?
