CBOR tags for Python object serialization?

I’m working on some IoT client/server stuff (MicroPython on the small stuff, CPython or possibly PyPy on the servers). I’m using CBOR as the data link message format, and I would like to be able to serialize some objects across the wire.

Unfortunately, the CBOR tag registry at Concise Binary Object Representation (CBOR) Tags doesn’t list tags for serializing Python objects or classes. Question: is there an Internet draft (or similar) around on how to do this? If not, would anybody be interested in working on one?

While CBOR has a “private” tag range I don’t want to use it, among other reasons because the range starts at 80000, i.e. the tag would require five bytes. Not a good idea.
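To make the byte counts concrete, here is a rough sketch of how a CBOR tag head (major type 6) grows with the tag number, following the head encoding in RFC 8949. `cbor_tag_head` is a made-up helper name for illustration, not part of any library.

```python
def cbor_tag_head(tag: int) -> bytes:
    """Encode the head of a CBOR tag (major type 6) for the given tag number."""
    if tag < 0 or tag >= 1 << 64:
        raise ValueError("tag number out of range")
    major = 6 << 5  # major type 6 occupies the top three bits of the first byte
    if tag < 24:
        return bytes([major | tag])  # value fits into the initial byte
    if tag < 1 << 8:
        return bytes([major | 24]) + tag.to_bytes(1, "big")
    if tag < 1 << 16:
        return bytes([major | 25]) + tag.to_bytes(2, "big")
    if tag < 1 << 32:
        return bytes([major | 26]) + tag.to_bytes(4, "big")
    return bytes([major | 27]) + tag.to_bytes(8, "big")


for tag in (23, 255, 65535, 80000):
    print(tag, len(cbor_tag_head(tag)))
```

So a tag below 24 costs one byte, while anything from 65536 upwards (including 80000) costs five.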

When you say “serialize some objects across the wire”, are you referring to some form of pickle format which can serialize many Python objects, or are you just interested in sending plain data (e.g. strings, integers, floats)?

For the general serialization approach, you could first pickle the object and then send it as a string. However, this is rather insecure, since unpickling data from arbitrary sources is a security risk.
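A harmless sketch of why this is a risk: a pickle payload can name any importable callable and have it invoked during loading. The class name `Innocent` is made up for this example, and `print` stands in for something far less friendly.

```python
import pickle


class Innocent:
    # __reduce__ tells pickle how to rebuild the object:
    # "call this callable with these arguments" when loading.
    def __reduce__(self):
        return (print, ("this ran during unpickling",))


payload = pickle.dumps(Innocent())

# The receiver never asked to call print(), yet loading the payload does so.
result = pickle.loads(payload)
```

The receiver does not even need the `Innocent` class; the payload only references the callable and its arguments.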

For the plain data type approach, CBOR seems to provide sufficient data type coverage to deal with most situations.

Some alternative ways:

  • send the data as JSON and then use the CBOR Embedded JSON Object tag.
  • write your own CBOR pickler/unpickler (the Python code for pickle is fairly easy to understand)
  • switch to MsgPack as message format; there is support for this on MicroPython and it’s a really compact binary format which can be read from many different languages

Yes I’m interested in some sort of pickle-ish format, and in fact writing a CBOR (un)pickler is exactly what I’m going to do if nobody has done it before. Alternately there’s jsonpickle which might be useable as a template to start from.

The point of my question is that the data format should be able to distinguish plain arrays and maps/hashes from pickled objects. CBOR does that kind of thing with tags. I want to use an “official” tag for “this encodes a Python class or instance”, thus I need to write up some sort of specification (unless there already is one) and submit it to IANA in order to get an ID assigned.

Basic CBOR and msgpack codecs are roughly the same size, as are the resulting byte streams. With msgpack I’d have the exact same problem of not stepping on anybody else’s toes, except that msgpack has no extension registry and exactly one official extension (for dates) instead of, well, a lot of them.

Also, msgpack is only partially streamable (extension data need to be prefixed with their length, thus must be pre-assembled; arrays+maps can’t be streamed from iterators), has no universally-understood way to deal with data structures that have cross references or recursion (CBOR has tags for that), and using extensions incurs more overhead than with CBOR. Bottom line: no msgpack for this project.
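To illustrate the streaming point, here is a minimal sketch of a CBOR indefinite-length array written straight from an iterator, without knowing the element count in advance. It only handles integers 0..23 to stay short; `stream_small_uints` is a hypothetical helper, not a library function.

```python
from typing import Iterable, Iterator


def stream_small_uints(items: Iterable[int]) -> Iterator[bytes]:
    """Yield the bytes of a CBOR indefinite-length array, one chunk at a time."""
    yield b"\x9f"  # array of indefinite length: no element count up front
    for n in items:
        if not 0 <= n < 24:
            raise ValueError("this sketch only handles integers 0..23")
        yield bytes([n])  # major type 0, value stored in the initial byte
    yield b"\xff"  # "break" stop code closes the array


# The generator expression is consumed lazily, element by element.
encoded = b"".join(stream_small_uints(n * n for n in range(4)))
print(encoded.hex())
```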

Finally: No, I’m not using JSON. There’s no point in doing that. It’s a subset of CBOR and has more overhead. Also no bigints, no bytestrings, no integer keys for dicts, …

I guess easiest would be to register just a single tag for “Python pickle data” and then send the data itself as a regular Python pickle string, which would be fairly efficient, since it’s a binary format as well.

If you were to go low level (try to mimic the pickle protocol with existing CBOR tags), you’d have to register quite a few new tags. See https://github.com/python/cpython/blob/main/Lib/pickle.py#L101 for a list of the opcodes it uses. pickle essentially uses a stack machine for its encoding and decoding process.

Well, that’s one possibility; the problem is that pickle isn’t exactly language agnostic.

The other would be to use __getstate__/__setstate__ and __getnewargs_ex__/__new__ to extract a single object’s state, then mark that with a tag when encoding so the decoder knows it’s an object instead of a plain dict or whatever; this seems to require just two tags, one for the former and one for the latter.
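A rough sketch of the `__getstate__`/`__setstate__` half of that idea (the `__getnewargs_ex__` variant is not shown). Everything here is an assumption for illustration: `OBJECT_TAG` is a placeholder number, not a registered tag; `Tagged` merely stands in for whatever tagged-value type a CBOR library provides; and the `[class name, state]` layout is just one possible choice.

```python
from typing import Any, NamedTuple

OBJECT_TAG = 99999  # hypothetical placeholder, NOT a registered CBOR tag


class Tagged(NamedTuple):
    """Stand-in for a CBOR tagged value: tag number plus plain content."""
    tag: int
    value: Any


class Point:
    def __init__(self, x: int, y: int) -> None:
        self.x, self.y = x, y

    def __getstate__(self) -> dict:
        return {"x": self.x, "y": self.y}

    def __setstate__(self, state: dict) -> None:
        self.__dict__.update(state)


# Only classes listed here may be instantiated by the decoder.
REGISTRY = {"Point": Point}


def to_tagged(obj: Any) -> Tagged:
    """Wrap an object's class name and state so it is not mistaken for a plain list."""
    return Tagged(OBJECT_TAG, [type(obj).__name__, obj.__getstate__()])


def from_tagged(item: Tagged) -> Any:
    """Rebuild an object from a tagged value, refusing unknown classes."""
    if item.tag != OBJECT_TAG:
        raise ValueError("not an object tag")
    name, state = item.value
    cls = REGISTRY[name]    # unknown class names raise KeyError
    obj = cls.__new__(cls)  # bypass __init__, as pickle does
    obj.__setstate__(state)
    return obj


wire = to_tagged(Point(1, 2))
copy = from_tagged(wire)
print(wire.value, copy.x, copy.y)
```

A reader that does not know the tag can still treat `wire.value` as an ordinary list containing a string and a map.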

Python classes and objects are not language agnostic either.
Don’t you need to convert your Python objects into existing CBOR tags to be agnostic?

They’re not per se, but it’s absurdly easy to teach e.g. Perl or Javascript to read and write those objects when all you need is a tag value and the proper arguments to __setstate__.

Speaking of Perl, there’s already a nice CBOR tag for its object serialization protocol. IMHO Python should kind of get the same courtesy. :wink:

Is it this? http://cbor.schmorp.de/perl-object

The next tag in that list (http://cbor.schmorp.de/generic-object) seems to be basically the same as the Perl tag, but also “supports” Python.

Would it be sufficient / make sense to use that generic-object tag then, or is there a downside to that?

Have you checked what other iot-python people think about this? It seems premature to register a format that might not be well thought out before discussing with other members of that community what a reasonable format would be.

I’m not a CBOR expert; my understanding stops at it being essentially isomorphic to JSON. I would probably use dataclasses to model serializable data, then use asdict to turn them into a dict, and then serialize the dict as if it were JSON. It seems wasteful to serialize the entire object, but I don’t know exactly how pickle does it, nor do I know your exact needs.
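A minimal sketch of that dataclass round trip, using the standard-library `json` module as the encoder so it runs anywhere. `Reading` is a made-up example type. With a CBOR library installed (for instance cbor2, which offers `dumps`/`loads`), the same dict could presumably be passed to that instead.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Reading:  # hypothetical example type
    sensor: str
    value: float


# Sender: dataclass -> plain dict -> encoded message.
msg = json.dumps(asdict(Reading("temp", 21.5)))

# Receiver: must already know which class the message is meant for,
# since nothing in the plain dict says so.
restored = Reading(**json.loads(msg))
print(restored)
```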

Other than “why should Perl be the only special snowflake to have its own tag” :nerd_face: there’s the additional overhead of prefixing class paths with “py/” if this is intended to be somewhat-standard-conforming.

This is not just about IoT, that’s just the scope of my current project. It’s also about having a Python serialization format that’s as compact as Pickle but can be analyzed (and mangled) in a safe and language agnostic way; even on Python you could easily read and write the format safely by simply not decoding the tag.

NB, CBOR isn’t isomorphic to JSON. It’s a strict superset. It has native bigints (as opposed to interpreting large numbers as floats), bytestrings, non-string keys for dicts …

Thanks for explaining CBOR, I only encountered it in a couple of papers I read during my PhD, always in the same breath as JSON was mentioned and essentially explained as “binary json”. Now I see why you’d actually use it. All of those papers were, unsurprisingly, IoT related too.

I think that maybe there should be more discussion about this before Python (the CBOR-using Python users at least) snags a tag. I think I agree that a Base64 encoded pickled object seems like a good idea if you were to go ahead with it. However, I wonder what the benefit of using CBOR would be in that case instead of just sending a raw string?

That would be a great thing to have, but I think it’s a pretty significant task and the issue there is deciding what that format would actually be. I’d imagine getting such a format into CBOR would be much easier than creating the format in the first place.

I very much agree! Pickle has many unexpected pitfalls and there’s lots of stuff out there on the internet recommending against its use. It may be that some of those problems don’t apply in the cases we’re considering, but I think we’d need to examine that. That is, we need to answer the question “how (if at all) can/should we serialize a Python object” before worrying about getting it into CBOR.

I’m not sure if this is relevant to this discussion, but it seems to me that the cattrs library could be of interest here. It separates structuring from serialization, and its serialization supports CBOR:

cattrs comes with preconfigured converters for a number of serialization libraries, including json, msgpack, cbor2, bson, yaml and toml.

Why would I want to base64-encode anything, given that CBOR has perfectly useable binary strings?

Also pickle is a terrible protocol if you want to do more than (mostly-)faithfully replicate a Python object.

Umm, no. cattrs reads/writes loosely-typed object streams from/to typed subclasses like attr and dataclass. It can’t package arbitrary objects, much less refer to classes.

Definitely. Somebody :disguised_face: should write up a draft. Probably me …

Also, it might make sense to propose a generic read-only-object tag to CBOR. For Python this tag can serve to distinguish bytes/bytearray and tuple/list.