Could the reduce protocol offer a self-describing "irreducible" signal, instead of consumers maintaining a hardcoded atomic-type table?

When you walk __reduce_ex__ recursively (for pickling, copying, or a custom serialiser), you eventually need to know when to stop, that is, when a value is irreducible and should not be reduced further. As far as I can tell, the protocol gives no self-describing signal for this. Every reduce returns a (callable, args, ...) recipe whose args are themselves reducible, with no terminal element in the protocol’s own vocabulary. (5).__reduce_ex__(2) is (__newobj__, (int, 5), ...), and the 5 inside reduces again, forever.
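To make the missing terminal concrete, here is a minimal sketch, assuming CPython and pickle protocol 2. It reduces an int and then reduces the argument it got back. `copyreg.__newobj__` is the real callable behind the `__newobj__` shorthand above; the variable names are only illustrative.

```python
import copyreg

# Reduce the int 5 with protocol 2 and take the recipe apart.
first = (5).__reduce_ex__(2)
func, args = first[0], first[1]

# The recipe is "call __newobj__(int, 5)"; the 5 inside is itself reducible.
inner = args[1]
second = inner.__reduce_ex__(2)

print(func is copyreg.__newobj__)   # the callable used to rebuild the value
print(args)                         # the class plus the value to rebuild from
print(second[1] == args)            # reducing the inner 5 yields the same recipe
```

Nothing in either recipe marks the inner value as a stopping point; the second reduce looks exactly like the first.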

So every consumer terminates recursion by carrying an externally maintained list of “atomic” types, pickle’s dispatch table being the canonical example (dispatch[int], dispatch[str], dispatch[bytes], and so on). My objection to that approach is that you have to know every atomic type up front and keep the list current. It is not derivable from the object; it is a curated set that evolves over time. bytes is a clean illustration: it became a first-class pickled type only in protocol 3 (Python 3.0), explicitly added rather than being intrinsically atomic. So the atomic set is not a fixed fact about the language; it is a maintained list that has grown. A third-party type that is genuinely irreducible has no way to say so. It just falls through to the generic reduce unless someone registers it.
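As a rough illustration of that curated table, the sketch below peeks at the dispatch dict of CPython's pure-Python pickler. `pickle._Pickler` and its `dispatch` attribute are private implementation details, so treat this as an inspection aid under that assumption, not a supported API. `Meters` is a hypothetical third-party type invented for the example.

```python
import pickle

# pickle._Pickler is the pure-Python pickler; its class-level `dispatch`
# dict maps exact types to dedicated save methods (a CPython detail).
atomic_table = pickle._Pickler.dispatch

for tp in (int, str, bytes, float, tuple):
    print(tp.__name__, tp in atomic_table)


class Meters:
    """Hypothetical third-party value type, used only for illustration."""


# A third-party type is absent from the table, so pickle falls back to
# the generic reduce path for its instances.
print(Meters.__name__, Meters in atomic_table)
```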

What I find myself wanting is for irreducibility to be self-describing and detectable, so I do not have to enumerate types. Interestingly, you can almost derive it today. I have been using this stop condition: a value is terminal if its reduce is (__newobj__, (cls, v), ...) where type(v) is type(original) and v == original, that is, “the reduce reproduces an equal object of my own type, so there is nothing left to unfold.” This works uniformly, and it even handles delegation correctly: bool reduces to (__newobj__, (bool, 1), ...) where the 1 is an int, not a bool, so it does not match and correctly recurses one more level to the int, which does. “Reduce until you can reduce no further” falls out without any type list.
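The stop condition described above can be sketched roughly as follows. The function names `is_terminal` and `unfold` are mine, and `unfold` assumes every step yields a `(cls, v)` recipe, which holds for the scalar cases discussed here but not for arbitrary objects.

```python
import copyreg


def is_terminal(obj):
    """Heuristic from the text: the reduce reproduces an equal object of the same type."""
    reduced = obj.__reduce_ex__(2)
    if isinstance(reduced, str) or reduced[0] is not copyreg.__newobj__:
        return False
    args = reduced[1]
    if len(args) != 2:              # expect exactly (cls, v)
        return False
    cls, v = args
    return cls is type(obj) and type(v) is type(obj) and v == obj


def unfold(obj):
    """Follow the (cls, v) chain until the terminal test succeeds."""
    chain = [obj]
    while not is_terminal(chain[-1]):
        reduced = chain[-1].__reduce_ex__(2)
        chain.append(reduced[1][1])  # step to v; assumes a (cls, v) recipe
    return chain


print(is_terminal(5), is_terminal("hello"), is_terminal(True))
print([type(x).__name__ for x in unfold(True)])
```

The bool case takes one extra step because the value inside its recipe is an int, which is the delegation behaviour described above.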

The catch, and my actual question, is the comparison step. To confirm that the reduce reproduced the input, I would like to use is, which is cheap and exact. But I cannot, and the reason is worth spelling out, because the immutable scalar types are inconsistent with each other in a way that surprised me.

Take str, int, float and bytes. Every one of them, when reduced, hands back its value as a freshly constructed object, not the original. "hello".__getnewargs__()[0] is a new string equal to the original but at a different address; the same is true for an int, a float, and a bytes. In effect each of these reduces to a deep copy of itself. So is against the original always fails for them, even though they are about as irreducible as a value can be. If I want to recognise them as terminal by comparison, I am forced onto ==, because identity has been thrown away by the time the reduce hands me the arg.

The tuple is the odd one out, and it is the one case that can be argued to be correct. (1,2,3).__getnewargs__()[0] is (1,2,3) is True: the tuple hands back itself, not a copy. For a deep copy that is arguably the right thing, since the tuple is genuinely immutable and sharing it is unobservable, so preserving identity is harmless and cheaper. But for serialisation it is exactly the wrong thing, because what I actually want is for the tuple to be converted into something I can walk, a list, so that I can recurse into its contents and encode them. As it stands the tuple looks terminal under any identity or equality test (it equals itself and is itself), so a naive walk stops there and never encodes the elements. A tuple holding a mutable, say (1, [2, 3]), makes this concrete: it compares equal to itself and reports its own type, so it looks terminal, yet the inner list still needs walking.
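The contrast between the scalars and the tuple can be checked with a short sketch like the one below. It reflects CPython behaviour as I understand it, not a documented guarantee, and the helper name `newarg_is_self` is mine. The int is deliberately large, because small ints are cached objects and would muddy the comparison.

```python
def newarg_is_self(value):
    """True if value.__getnewargs__() hands back the very same object."""
    return value.__getnewargs__()[0] is value


samples = {
    "str": "hello world",
    "int": 10**30,            # large on purpose: small ints are cached objects
    "float": 1.5,
    "bytes": b"hello world",
    "tuple": (1, [2, 3]),
}

# Identity check per type, plus an equality check to show nothing was lost.
report = {name: newarg_is_self(v) for name, v in samples.items()}
equal = {name: v.__getnewargs__()[0] == v for name, v in samples.items()}

print(report)
print(equal)
```

Every sample compares equal to its newarg, so only the identity column separates the tuple from the scalars.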

So the two problems are separate. The scalars throw away identity, which forces me off is and onto == with all the baggage that == brings (user-defined, potentially expensive, potentially dishonest). The tuple keeps its identity, which is defensible for copying but actively unhelpful for serialisation, where I would want it expanded into a list instead.

My questions:

  1. Is the non-preservation of identity for immutable newargs (str, int, float, bytes) a documented and relied-upon property, or is it incidental to how each type’s __getnewargs__ was implemented? If incidental, would making it consistent, or at least documenting that consumers must not rely on it, be worthwhile?
  2. More fundamentally, is there any appetite for a self-describing irreducibility signal in the protocol, some way for a type to declare “I am a terminal, stop here,” so that consumers walking __reduce_ex__ do not each have to hardcode and maintain an atomic-type table? Or is the type-dispatch-table approach considered the deliberate and correct design, with the curated atomic set being a feature rather than a maintenance burden?

I suspect the answer to (2) may be that type dispatch is intended, but I would like to understand why that is preferred over a self-describing terminal, given that the atomic set demonstrably is not fixed (bytes joining in protocol 3) and third-party irreducible types cannot currently announce themselves.

For something to be atomic, it needs special handling by the outer library. Right? If the caller doesn’t itself recognize an object as irreducible, then what can the caller do with such an object? Which useful third-party object could mark itself irreducible and still be handled correctly by pickle?

I suspect that int and co only have a __reduce__ implementation for the benefit of subclasses.

You’re right that the consumer has to do something with the signal — but I think that cuts toward my proposal rather than against it. Let me show where the signal lands, because a deep-copier makes it concrete in a way pickle’s name-based machinery obscures.

A reduce-driven deep copy is just: walk __reduce_ex__ recursively, rebuild. The recursion has to terminate somewhere, and “somewhere” is exactly the set of objects that reduce to themselves. Skeleton (the real thing I had working, trimmed):

    def copy(obj):
        if isinstance(obj, type):
            return obj                      # classes: shared, not copied
        reduced = obj.__reduce_ex__(2)
        if isinstance(reduced, str):
            return obj                      # global-by-name: shared, not copied
        callable_, args = reduced[0], reduced[1]
        state = reduced[2] if len(reduced) > 2 else None
        new = callable_(*[
            a if (type(a) is type(obj) and a == obj)   # <-- the stop condition
            else copy(a)
            for a in args
        ])
        if state is not None:
            new.__dict__.update(state)      # (+ listitems/dictitems if present)
        return new

The whole question lives in that one line. For 5, (5).__reduce_ex__(2) yields (copyreg.__newobj__, (int, 5), ...), whose args contain 5 itself. Without the stop condition I recurse into copy(5), which reduces to args containing 5, which reduces again, and never terminates. So I need a terminal test, and right now that test is a heuristic: “is this arg the same type and value as the thing I’m reducing?” That’s me, the consumer, guessing at irreducibility because the protocol won’t tell me.

That heuristic is exactly what’s fragile. It misfires (a tuple whose contents compare == to itself looks like a self-reproduction and gets returned un-walked), and it can’t distinguish “I reduced to myself because I’m atomic” from “I reduced to something that happens to equal me.” A self-describing irreducibility signal would replace the guess with a fact: the consumer checks one flag instead of pattern-matching on type-and-value identity.
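The misfire can be reproduced with a self-contained variant of the skeleton. This is a sketch under my own assumptions, not the exact code from the post: the name `reduce_copy`, the guard that shares classes by reference, and the `listitems` handling are additions made so that it runs on its own.

```python
def reduce_copy(obj):
    """Reduce-driven copy using the type-and-equality stop heuristic."""
    if isinstance(obj, type):
        return obj                              # classes are shared (sketch assumption)
    reduced = obj.__reduce_ex__(2)
    if isinstance(reduced, str):
        return obj                              # global-by-name: shared, not copied
    func, args = reduced[0], reduced[1]
    state = reduced[2] if len(reduced) > 2 else None
    listitems = reduced[3] if len(reduced) > 3 else None
    new = func(*[
        a if (type(a) is type(obj) and a == obj)  # the stop condition
        else reduce_copy(a)
        for a in args
    ])
    if state is not None:
        new.__dict__.update(state)
    if listitems is not None:
        for item in listitems:
            new.append(reduce_copy(item))
    return new


original_list = [1, [2, 3]]
copied_list = reduce_copy(original_list)
print(copied_list[1] is original_list[1])     # inner list was rebuilt?

original_tuple = (1, [2, 3])
copied_tuple = reduce_copy(original_tuple)
print(copied_tuple[1] is original_tuple[1])   # inner list was rebuilt?
```

The list is walked because its reduce exposes its items separately, whereas the tuple's arg matches the stop condition and is handed back with its inner list still shared.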

On your second point: yes, I think you’re largely correct that int/str/tuple carry __reduce_ex__ mainly so subclasses round-trip, and the base types are special-cased inside pickle’s dispatch table rather than relying on their own reduce. That’s the tell: pickle already maintains a hardcoded table of “these are atomic” precisely because the objects can’t say so themselves. My proposal is just to lift that knowledge out of pickle’s private table and let the object declare it, so that a different consumer (a deep-copier, a JSON transcriber, anything walking reduce) doesn’t have to re-derive or re-hardcode the same table. Today every such consumer either imports pickle’s assumptions or invents its own stop-heuristic like the line above.

So “what can a third-party object do by marking itself irreducible?” — it can terminate any reduce-walking consumer correctly, not just pickle. Pickle happens to handle atoms via its dispatch table; a from-scratch consumer has no such table and currently has to guess. The signal is for the consumers pickle doesn’t write for.
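For illustration only, this is roughly what a consumer could look like if such a signal existed. The `__irreducible__` attribute is entirely hypothetical: nothing in Python defines or honours it today, and `Meters`, `Route`, and `leaves` are made-up names. To stay short, the walk descends only through instance state and list items and ignores the reduce args.

```python
def is_declared_terminal(obj):
    """Check a HYPOTHETICAL per-type flag; Python defines no such attribute."""
    return getattr(type(obj), "__irreducible__", False) is True


class Meters:
    """Hypothetical third-party atomic value."""
    __irreducible__ = True            # hypothetical opt-in marker

    def __init__(self, value):
        self.value = value


class Route:
    """Hypothetical container holding atomic values."""

    def __init__(self):
        self.legs = [Meters(3), Meters(4)]


def leaves(obj):
    """Collect declared terminals reachable through state and list items."""
    if is_declared_terminal(obj):
        return [obj]                  # stop: the type said so itself
    reduced = obj.__reduce_ex__(2)
    if isinstance(reduced, str):
        return []
    reduced = tuple(reduced) + (None,) * (5 - len(reduced))
    state, listitems = reduced[2], reduced[3]
    children = list(state.values()) if isinstance(state, dict) else []
    children += list(listitems) if listitems is not None else []
    found = []
    for child in children:
        found.extend(leaves(child))
    return found


result = leaves(Route())
print([m.value for m in result])
```

The consumer never names `Meters`; it only checks the flag, which is the property the proposal is after.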

No. That is not why pickle carries a hardcoded table.

pickle has a hardcoded table because pickle needs to take these atoms and serialize them. That is something pickle needs to know how to do.

What is pickle supposed to do with such an object?

I’m not necessarily trying to serialise an object, just deep copy it to start with. Or to put it another way I’m trying to extract the data part of an object. If I could recurse the reduction until the reduced gave me a callable and the exact same object I would know that was an “irreducible” object.

Yes, but my point is that this ability to not care about the exact types is not particularly widespread. Most use cases, e.g. pickle, JSON libraries, and other serializers (which you used as examples), need to special-case these types. The “requirement” isn’t a bug. They need to be hardcoded for almost all users. Giving third-party libraries a “proper” way to signal irreducibility gives the incorrect impression that this is a supported use case.

Instead there should be a clear list somewhere in the documentation about which types are irreducible.

Although I do think it is a footgun that these types invite infinite recursion when the reduce protocol is used incorrectly. Maybe they should raise a TypeError if called on a non-subclass instance, although that may not be worth the performance cost.

Hi

I wasn’t really suggesting it was a bug; I was more interested in whether there is a reason, down at the C level, that the irreducible elements can’t behave as I suggest. Why can’t the __reduce__ of a string return the exact same string? It must actually be harder to deep copy it down there! And why does the tuple behave as I suggest? Is deep copying it too much work at the C level?

I also agree that even in my use case I’ve added a mechanism to deal with certain types in alternative ways.

I also wanted to point out that there wouldn’t be a need to maintain a list of irreducibles if a mechanism like mine existed.

Sam

Because the code is designed for subclasses. For subclasses, a new string instance has to be created. It’s easier to always do that than to check whether it’s needed. The code could be changed to create a copy only if needed.

Because the generic code it uses to create a new tuple with the same content happens to reuse the tuple object. Specifically it runs something like self[:], and you can verify that for tuples a is a[:] if a is not a subclass. See the code

The int version calls _PyLong_Copy, which doesn’t have this special case (except, of course, for the cached small ints).
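Both observations can be checked from Python with a short sketch. It reflects current CPython behaviour rather than a language guarantee, and `MyTuple` is an illustrative subclass.

```python
class MyTuple(tuple):
    """Illustrative tuple subclass."""


t = (1, 2, 3)
s = MyTuple(t)

# Full slice of an exact tuple versus full slice of a subclass instance.
print(t[:] is t)
print(s[:] is s, type(s[:]).__name__)

# __getnewargs__ on a cached small int versus a large int.
small, big = 5, 10**30
print(small.__getnewargs__()[0] is small)
print(big.__getnewargs__()[0] is big)
```

The subclass case is the one the copy exists for: the slice comes back as a plain tuple, so a fresh object is needed to rebuild the subclass instance.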

Whether or not a new object is returned for the baseclass is not an intentionally designed choice - after all, calling __reduce__ and friends on these objects probably indicates that your code is misbehaving - you should recognize and special case these types, otherwise you would get into an infinite loop.

I remain unconvinced that standardizing on a specific solution is worth it - it would just invite people to use the reduce protocol in unsupported ways.
