When you walk __reduce_ex__ recursively (for pickling, copying, or a custom serialiser), you eventually need to know when to stop, that is, when a value is irreducible and should not be reduced further. As far as I can tell, the protocol gives no self-describing signal for this. Every reduce returns a (callable, args, ...) recipe whose args are themselves reducible, with no terminal element in the protocol’s own vocabulary. (5).__reduce_ex__(2) is (__newobj__, (int, 5), ...), and the 5 inside reduces again, forever.
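To make the non-termination concrete, here is a minimal sketch, assuming CPython 3 and protocol 2. The helper name newobj_recipe is mine and not part of any API; it only unpacks the single-argument recipe shape described above.

```python
import copyreg

def newobj_recipe(value, protocol=2):
    """Return (cls, arg) from a copyreg.__newobj__-style reduce of `value`."""
    func, args = value.__reduce_ex__(protocol)[:2]
    if func is not copyreg.__newobj__ or len(args) != 2:
        raise ValueError("not a single-argument __newobj__ recipe")
    return args  # the tuple (cls, arg)

cls, arg = newobj_recipe(5)
print(cls.__name__, arg)  # the recipe says "rebuild an int from 5"

# Reducing the argument yields the very same recipe, so nothing bottoms out.
steps = [newobj_recipe(5)]
for _ in range(3):
    steps.append(newobj_recipe(steps[-1][1]))
print(all(step == (int, 5) for step in steps))
```

Every step is the same (int, 5) pair, so a walker with no outside knowledge has nothing to stop on.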
So every consumer terminates recursion by carrying an externally maintained list of “atomic” types, the per-type dispatch mapping in pickle’s pure-Python pickler being the canonical example (dispatch[int], dispatch[str], dispatch[bytes], and so on; this is the type-to-save-method mapping, not the copyreg-backed dispatch_table of reducers). My objection to that approach is that you have to know every atomic type up front and keep the list current. The list is not derivable from the object; it is curated. bytes is a clean illustration: it became a first-class pickled type only in protocol 3 (Python 3.0), explicitly added rather than being intrinsically atomic. So the atomic set is not a fixed fact about the language; it has grown over time. A third-party type that is genuinely irreducible has no way to say so. It just falls through to the generic reduce unless someone registers it.
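For anyone who wants to look at the curated set directly, this is a small sketch that peeks at it. It relies on pickle._Pickler.dispatch, which is a private, CPython-specific detail of the pure-Python pickler, so treat it as illustration only. The Token class is a hypothetical stand-in for a third-party type.

```python
import pickle

# Private and CPython-specific: the pure-Python pickler keeps a class-level
# mapping from exact type to its hand-written save method.
atomic_table = pickle._Pickler.dispatch

class Token:
    """Hypothetical third-party type; absent from the table by construction."""

membership = {t.__name__: (t in atomic_table) for t in (int, str, bytes, Token)}
for name, present in membership.items():
    print(f"{name:6} in dispatch: {present}")
```

The built-in scalars are present because someone listed them; Token is absent because nobody did, and nothing about Token itself can change that.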
What I find myself wanting is for irreducibility to be self-describing and detectable, so I do not have to enumerate types. Interestingly, you can almost derive it today. I have been using this stop condition: a value is terminal if its reduce is (__newobj__, (cls, v), ...) where type(v) is type(original) and v == original, that is, “the reduce reproduces an equal object of my own type, so there is nothing left to unfold.” This works uniformly, and it even handles delegation correctly: bool reduces to (__newobj__, (bool, 1), ...) where the 1 is an int, not a bool, so it does not match and correctly recurses one more level to the int, which does. “Reduce until you can reduce no further” falls out without any type list.
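Here is that stop condition as a minimal sketch, assuming CPython 3 and protocol 2. The names is_terminal and unfold are mine, and the walker only follows the single-argument __newobj__ shape; any other recipe shape is out of scope here.

```python
import copyreg

def is_terminal(value, protocol=2):
    """Stop condition: the reduce rebuilds an equal object of the same type."""
    reduced = value.__reduce_ex__(protocol)
    if not isinstance(reduced, tuple):
        return False  # a str result names a global; not the shape tested here
    func, args = reduced[0], reduced[1]
    if func is not copyreg.__newobj__ or len(args) != 2:
        return False
    _cls, v = args
    return type(v) is type(value) and v == value

def unfold(value, protocol=2):
    """Follow single-argument __newobj__ recipes until a terminal is reached."""
    chain = [value]
    while not is_terminal(chain[-1], protocol):
        func, args = chain[-1].__reduce_ex__(protocol)[:2]
        if func is not copyreg.__newobj__ or len(args) != 2:
            break  # some other recipe shape; this sketch does not handle it
        chain.append(args[1])
    return chain

print([is_terminal(x) for x in (5, "hello", 2.5, b"raw")])
print([type(x).__name__ for x in unfold(True)])  # bool delegates to int
```

Note that the test leans on == for the final comparison, which is the step the rest of this post is about.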
The catch, and my actual question, is the comparison step. To confirm that the reduce reproduced the input, I would like to use is, which is cheap and exact. But I cannot, and the reason is worth spelling out, because the immutable scalar types are inconsistent with each other in a way that surprised me.
Take str, int, float and bytes. Each of them, when reduced, hands back its value as a separate object rather than the original. "hello".__getnewargs__()[0] is a new string equal to the original but at a different address; the same is true for a float, a bytes, and an int outside CPython's small-int cache. In effect each of these reduces to a deep copy of itself. So is against the original fails for them in general (it succeeds only where CPython happens to hand back a cached object, small ints being the obvious case), even though they are about as irreducible as a value can be. If I want to recognise them as terminal by comparison, I am forced onto ==, because identity has been thrown away by the time the reduce hands me the arg.
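A small check of this, as I observe it on CPython 3. The sample values are my own choice and deliberately stay clear of interpreter caches (a small int, for instance, can come back as the same cached object), so this shows the general case rather than every case.

```python
# Values chosen to stay clear of CPython's caches (small ints, for example).
samples = ["hello", 10**30, 2.5, b"hello"]

results = []
for original in samples:
    (rebuilt,) = original.__getnewargs__()
    results.append(
        (type(original).__name__, rebuilt == original, rebuilt is original)
    )

for name, equal, identical in results:
    print(f"{name:5}  equal={equal}  identical={identical}")
```

Each row comes back equal but not identical, which is why an is-based terminal test cannot recognise these values.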
The tuple is the odd one out, and it is the one case that can be argued to be correct. With t = (1, 2, 3), the expression t.__getnewargs__()[0] is t evaluates to True: the tuple hands back itself, not a copy. (I bind the tuple to a name first so that the result does not depend on how the compiler folds two identical literals.) For a deep copy that is arguably the right thing, since the tuple is genuinely immutable and sharing it is unobservable, so preserving identity is harmless and cheaper. But for serialisation it is exactly the wrong thing, because what I actually want is for the tuple to be converted into something I can walk, a list, so that I can recurse into its contents and encode them. As it stands the tuple looks terminal under any identity or equality test (it equals itself and is itself), so a naive walk stops there and never encodes the elements. A tuple holding a mutable, say (1, [2, 3]), makes this concrete: it compares equal to itself and reports its own type, so it looks terminal, yet the inner list still needs walking.
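The tuple case can be reproduced in a few lines. This is a sketch on CPython 3 with protocol 2; the variable names are mine, and list(t) stands in for whatever walkable form a serialiser would actually choose.

```python
import copyreg

t = (1, [2, 3])

(arg,) = t.__getnewargs__()
identity_kept = arg is t  # the tuple hands back itself, not a copy

func, (cls, v) = t.__reduce_ex__(2)[:2]
looks_terminal = (
    func is copyreg.__newobj__ and type(v) is type(t) and v == t
)

# What a serialiser actually needs: the elements, in a form it can walk.
walkable = list(t)

print(identity_kept, looks_terminal, walkable)
```

Both flags come out true, so neither an identity test nor an equality test gives the walker a reason to descend into the inner list.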
So the two problems are separate. The scalars throw away identity, which forces me off is and onto == with all the baggage that == brings (user-defined, potentially expensive, potentially dishonest). The tuple keeps its identity, which is defensible for copying but actively unhelpful for serialisation, where I would want it expanded into a list instead.
My questions:
- Is the non-preservation of identity for immutable newargs (str, int, float, bytes) a documented and relied-upon property, or is it incidental to how each type’s __getnewargs__ was implemented? If incidental, would making it consistent, or at least documenting that consumers must not rely on it, be worthwhile?
- More fundamentally, is there any appetite for a self-describing irreducibility signal in the protocol, some way for a type to declare “I am a terminal, stop here,” so that consumers walking __reduce_ex__ do not each have to hardcode and maintain an atomic-type table? Or is the type-dispatch-table approach considered the deliberate and correct design, with the curated atomic set being a feature rather than a maintenance burden?
I suspect the answer to the second question may be that type dispatch is intended, but I would like to understand why that is preferred over a self-describing terminal, given that the atomic set demonstrably is not fixed (bytes joining in protocol 3) and third-party irreducible types cannot currently announce themselves.
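To be clear about what I mean by a type announcing itself, here is one possible shape, purely as a sketch. The attribute name __reduce_terminal__, the helper declares_terminal, and the Celsius and Plain classes are all hypothetical; nothing in pickle, copy or copyreg looks for any of them today.

```python
TERMINAL_FLAG = "__reduce_terminal__"  # hypothetical name; no such protocol exists

def declares_terminal(value):
    """True if the value's type opts in via the hypothetical class attribute."""
    return getattr(type(value), TERMINAL_FLAG, False) is True

class Celsius:
    __reduce_terminal__ = True  # hypothetical opt-in by a third-party type

    def __init__(self, degrees):
        self.degrees = degrees

class Plain:
    """A type that says nothing, so a walker would keep reducing it."""

flags = [declares_terminal(x) for x in (Celsius(21.5), Plain(), 5)]
print(flags)
```

Under a scheme like this the built-in scalars would have to carry the declaration too (int does not, as the last entry shows); the point is only that the consumer asks the type instead of consulting its own table.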