I’m working on some code (Sage’s multivariate polynomials) where I need to pickle and unpickle large objects (polynomials with millions of terms), and I’m trying to figure out how to do it well.
The current code converts the polynomial to a dictionary (exponent -> coefficient maps) and pickles the dictionary. Unpickling is done by supporting initialization of the polynomial object from a dictionary.
Obviously, this is problematic for large polynomials, since we need to duplicate all of the data, both when pickling and when unpickling.
Pickling doesn’t seem like too much of a problem - just create an iterator that produces key/value pairs and return it from __reduce__ (as the trailing “dictitems” element of the reduce tuple).
Unpickling is the problem. First, the object is supposed to be immutable and doesn’t currently have a __setitem__ method. A more serious problem is that the data is maintained as a sorted linked list in the underlying C library, so inserting individual items is slow. The current code, when initializing from a dictionary, puts everything into a bucket; once everything is present, it sorts the bucket once and forms the linked list.
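To make the pickling half concrete, here is a minimal sketch of the reduce-tuple approach. Poly is a toy stand-in I made up, not Sage’s actual class: the fifth element of the tuple returned by __reduce__ is an iterator of (key, value) pairs, which pickle streams out in batches and replays via __setitem__ on the new object - so no intermediate dictionary is ever built on the pickling side.

```python
import pickle

class Poly:
    """Toy stand-in for the C-backed polynomial (illustration only)."""
    def __init__(self, items=()):
        # The real object would build its sorted linked list here.
        self._data = dict(items)

    def iteritems(self):
        # Yield (exponent, coefficient) pairs lazily - no dict copy.
        return iter(self._data.items())

    def __reduce__(self):
        # 5-tuple: (callable, args, state, listitems, dictitems).
        # pickle consumes the dictitems iterator incrementally and
        # replays it on the new object via __setitem__ when unpickling.
        return (Poly, (), None, None, self.iteritems())

    def __setitem__(self, key, value):
        # Naive per-item insertion - exactly the cost discussed below.
        self._data[key] = value

p = Poly({(1, 0): 3, (0, 2): 5})
q = pickle.loads(pickle.dumps(p))
```

Note that this sketch relies on a naive per-item __setitem__, which is precisely what the rest of this post is about avoiding.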
So, how to implement __setitem__ efficiently? I’m thinking that it would be best to get some kind of notification when the unpickling is complete, so that the sort can be delayed until then.
Looking at the unpickle code, it seems like __setstate__ is the very last thing that gets called. So, I could implement a __setitem__ method that just sticks everything into a bucket, and a __setstate__ method that finalizes the initialization (sorting and forming the linked list).
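As a toy sketch of that idea (again with a made-up stand-in class, not Sage’s code): __setitem__ only appends to a bucket, and __setstate__ does the single sort. In CPython’s pickler, the listitems/dictitems are written to the stream before the state, so on unpickling __setstate__ does run after all the item insertions:

```python
import pickle

class Poly:
    """Toy model: items buffered during unpickling, one sort at the end."""
    def __init__(self, items=()):
        self._terms = sorted(items)   # stands in for the sorted linked list
        self._bucket = []

    def __reduce__(self):
        # Ship the items as an iterator; a sentinel state value makes
        # pickle call __setstate__ after replaying the items.
        return (Poly, (), True, None, iter(self._terms))

    def __setitem__(self, key, value):
        # During unpickling: just accumulate, no per-item insertion cost.
        self._bucket.append((key, value))

    def __setstate__(self, state):
        # Runs after the dictitems have been replayed: sort once and
        # build the final structure.
        self._terms = sorted(self._bucket)
        self._bucket = []

p = Poly([((2, 0), 7), ((0, 1), 1)])
q = pickle.loads(pickle.dumps(p))
```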
I don’t know if I dare publish such code, though. Would it be reliable? I don’t think there’s any documented guarantee that __setstate__ is called only after all of the items have been inserted.
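One alternative that avoids relying on that ordering at all (my own suggestion, not something in the Sage code): have __reduce__ return a module-level factory that receives the complete item list and builds the structure in one pass, with no __setitem__ or __setstate__ involved:

```python
import pickle

def _rebuild_poly(items):
    # Module-level so pickle can find it by name; builds in one pass.
    return Poly(items)

class Poly:
    """Toy stand-in; _terms stands in for the sorted linked list."""
    def __init__(self, items=()):
        self._terms = sorted(items)

    def __reduce__(self):
        # Factory plus the full item list: no assumptions about the
        # order in which pickle applies items and state.
        return (_rebuild_poly, (list(self._terms),))

p = Poly([((2, 0), 7), ((0, 1), 1)])
q = pickle.loads(pickle.dumps(p))
```

The trade-off is that the items get materialized as one Python list on the pickling side, instead of being streamed in batches the way a dictitems iterator is - so it gives up part of the memory saving in exchange for not depending on pickle internals.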
Any other ideas?