I would like to know how much each of them knows about core development, the various moving parts of CPython, and our collective history. A good way to gauge that is to ask a concrete question. So here is my question: if I were going to submit PEP 574 for pronouncement (not really an unlikely scenario), what detailed reaction or resolution would you propose as a Steering Council member?
I will reiterate my position that the Steering Council should form working groups or topical subcommittees to work on specific matters. So on most matters, I would look at the broad developer consensus, and only weigh in at a technical level if I felt very strongly that something was clearly objectionable.
That being said, I do have questions about this PEP. I admit that I am remiss in not yet having looked at it in much detail, so realistically I need to take some more time to review it. But after spending a couple of hours with it last night, here is my initial feedback.
Technical & Design Questions
This PEP addresses a concrete problem whose motivations I (of all people) am very sympathetic to. Maybe I am missing something obvious, but the existing facilities in pickle and copyreg/dispatch_table seem like they should be sufficient. It seems like the intent of Pickler.persistent_id() is precisely to serve this kind of need, and I don’t fully grok the reasoning for why this is a rejected alternative.
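To make that concrete, here is a minimal sketch (names like BufferStore, and the size threshold, are mine, purely for illustration) of how persistent_id()/persistent_load() could route large buffers through a side channel today, with no new protocol:

```python
import io
import pickle

class BufferStore:
    """Illustrative side channel holding raw buffers referenced by a pickle."""
    def __init__(self):
        self.buffers = []

    def put(self, buf):
        self.buffers.append(buf)
        return len(self.buffers) - 1

class OutOfBandPickler(pickle.Pickler):
    def __init__(self, file, store):
        super().__init__(file, protocol=4)
        self.store = store

    def persistent_id(self, obj):
        # Divert large bytes-like payloads to the side channel; returning
        # None tells pickle to serialize the object normally.
        if isinstance(obj, bytearray) and len(obj) >= 1024:
            return ("buffer", self.store.put(obj))
        return None

class OutOfBandUnpickler(pickle.Unpickler):
    def __init__(self, file, store):
        super().__init__(file)
        self.store = store

    def persistent_load(self, pid):
        tag, index = pid
        if tag == "buffer":
            return self.store.buffers[index]  # zero-copy handoff
        raise pickle.UnpicklingError(f"unknown persistent id: {pid!r}")

store = BufferStore()
payload = {"meta": "small", "data": bytearray(1 << 20)}
f = io.BytesIO()
OutOfBandPickler(f, store).dump(payload)
f.seek(0)
restored = OutOfBandUnpickler(f, store).load()
assert restored["data"] is store.buffers[0]  # same object, never copied
```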
In the PEP, the first stated drawback of persistent_id, namely having N consumers x M producers, does not actually seem that onerous to me, because N and M are quite small. The producers are basically {NumPy, Pandas, Arrow} and the consumers are (I think) {Dask, Arrow, xarray}? I’d have to think about it a bit more, but it seems like the producers could even provide template reducer/constructor functions which the consumers would only need to lightly customize.
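For example (and this is purely a hypothetical sketch, not an existing NumPy or Dask API), a producer could ship a template reducer in which the consumer plugs in only its buffer-export policy via a custom dispatch_table:

```python
import copyreg
import io
import pickle
import numpy as np

def reconstruct_array(buf, dtype, shape):
    # Template constructor the producer would ship.
    return np.frombuffer(buf, dtype=dtype).reshape(shape)

def make_array_reducer(export_buffer):
    """Template reducer; export_buffer(arr) is the consumer's policy."""
    def reduce_array(arr):
        return (reconstruct_array, (export_buffer(arr), arr.dtype.str, arr.shape))
    return reduce_array

# A consumer customizes only the buffer-export step:
table = copyreg.dispatch_table.copy()
table[np.ndarray] = make_array_reducer(lambda arr: arr.tobytes())

f = io.BytesIO()
p = pickle.Pickler(f, protocol=4)
p.dispatch_table = table
p.dump(np.arange(6).reshape(2, 3))
f.seek(0)
print(pickle.load(f))  # [[0 1 2] [3 4 5]]
```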
I know that future libraries may arise, so this cardinality may grow - but it’s been 20 years since Numeric, and we’re still looking at a very small handful of libraries that need to touch this kind of internal detail. Even as more people use large-memory CPU and GPU devices, they are likely to build on things like Dask, Arrow, or xarray rather than building yet another "dict of NumPy arrays" sort of thing. So I expect N and M to grow slowly - I can see XND, CuPy, and maybe xtensor on the horizon.
Additionally, the "potential performance drop" downside cited in the PEP surely applies to any use of persistent_id(). So if even this (very compelling) case doesn’t justify its use, then why should the pickle module provide persistent_id() at all? Should we have a prominent performance warning in the docs?
Moving beyond the question of motivation, the PickleBuffer object itself seems pretty reasonable, but my spidey senses are really tingling at this idea of a "potential weak reference to a hunk of memory whose immutability is indicated by a flag in another object, perhaps on another thread, perhaps in another process". I don’t feel like I have a good mental model of the various interplays between reference vs. copy, mutable vs. immutable, inter-thread shared Numpy buffers vs. Arrow inter-process shared memory, etc.
This all just seems like asking for a pile of pain, because PickleBuffer has to internalize and handle all of the semantics of ownership and mutability across that NxM consumer-producer matrix. It seems to me that these semantics should ideally be handled entirely by the higher-level "consumer" libraries, building on more primitive semantics within the lower-level array "producer" libraries. I don’t see how PickleBuffer can avoid a raft of fairly complex unit tests and special-casing. For instance, what would it look like if I serialized a NumPy array and a Dask array that consists of two NumPy views on that array? If I were using pickle protocol 4 and did the same thing, what would be different (other than performance & memory footprint)?
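To make that question concrete, here is a small probe against the out-of-band API the PEP proposes (runnable only with an interpreter or backport implementing the PEP, plus a NumPy that supports protocol 5); I genuinely do not know what it should print:

```python
import pickle
import numpy as np

base = np.arange(10)
views = (base[:5], base[5:])   # two NumPy views sharing base's memory

# Out-of-band round trip as proposed: buffer_callback collects the
# PickleBuffers, and loads() is handed them back via buffers=.
bufs = []
data = pickle.dumps(views, protocol=5, buffer_callback=bufs.append)
restored = pickle.loads(data, buffers=bufs)

# The questions above, made concrete: do the restored "views" still
# alias each other (or even base)? Are they writeable? The answers
# depend on how each producer library implements __reduce_ex__.
print(np.shares_memory(restored[0], restored[1]))
print(restored[0].flags.writeable)
```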
Has there been some assessment of the difficulty of implementing PickleBuffer correctly on non-CPython interpreters? If this is one more hurdle that makes it harder for PyPy to keep up, or for Jython to ever become Python 3 compatible, then that is something we should keep in mind. For instance, with this PEP, the Jython folks are condemned to build a set of test suites that exercise Arrow<>NumPy<>Pandas<>Dask zero-copy array transport in the context of JVM off-heap memory semantics.
Developer Consensus
It looks like some representative members of the affected libraries & data ecosystem have been consulted in the drafting of this proposal, which is good. The feedback also seems to be positive. I did, however, note this comment from Allan Haldane in the NumPy PR:
"One might object that it’s a bit strange that whether or not an unpickled ndarray is writeable depends on the pickle protocol and whether it was saved to disk or not."
This portends the sort of troubling morass of mutability & ownership semantics that I allude to above. PickleBuffer - by its name - connotes "pickling", which throughout all of Python’s history has implied a serialization process. But it is now going to be an occasional teleporter of shared references and views, and users will need to pay close attention to the serialization path through different libraries in order to reason about basic things like, "can the unpickler modify this object that I’m about to pickle?" That seems like it could be really error-prone, and a really great source of security issues.
User Impact
What are the downstream consequences of bumping a pickle version number? I know that pickles are very widely used as convenient persistence that "mostly just works". It’s much less fuss than fiddling with a database and thinking about long-term schema management. So to that point, people may not be as diligent in tracking their usages of pickles throughout their overall application pipeline.
It should be noted that Python 2’s pickle protocol has been stable for 15 years, across v2.3 - v2.7. So, many people who have been using Python the longest may be least accustomed to thinking about pickle protocol incompatibilities between point revisions of the language.
Python 3 has seen a pickle version bump every few years, so maybe I’m over-worrying and this is not actually a huge burden to users. However, it would be really nice if there were some way to not force people to pay for what they don’t use, e.g. a pickle without any PickleBuffers could be saved as a more-compatible v4 pickle.
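For instance, something like this hypothetical helper (the name and the pickle-twice approach are mine, just to sketch the idea) could fall back to v4 whenever no out-of-band buffers materialize:

```python
import pickle

def dumps_compat(obj):
    """Emit protocol 5 only if out-of-band buffers actually appear."""
    bufs = []
    data = pickle.dumps(obj, protocol=5, buffer_callback=bufs.append)
    if not bufs:
        # Nothing used PickleBuffer: re-emit as a more widely readable
        # v4 pickle. (Pickling twice is wasteful; a real implementation
        # would do better.)
        data = pickle.dumps(obj, protocol=4)
    return data, bufs
```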
Conclusion
With my Steering Council hat on, since it looks like there is broad consensus between core developers and the dev community served by this PEP, I don’t think I have standing to wield power in this case, other than to certify the consensus.
That being said, with my developer hat on, I would issue a strong word of warning. This PEP entangles memory-ownership semantics across many different projects (several of which are still in their formative phase) and across process boundaries, persists them across time, and then invites that persistence into the core language as a thing the core devs will maintain going forward. Whenever a memory leak, race condition, or security vulnerability arises between two external projects, it will very likely require changes or clarifications in PickleBuffer’s metadata. In the future, when someone loads an old pickle from a world-writeable path using some old version of Dask, might they be creating a security hole in their application?
This PEP is essentially an optimization (although an important one), so I don’t understand the urgency of incorporating it into the core language. If various external projects feel so strongly about this issue, then why not create PickleBuffer as a small new package and encourage NumPy, Dask, Arrow, xarray, TensorFlow, PyTorch, CuPy, etc. to all take it on as a dependency? This would allow it to bake against real-world use cases, fleshing out usability and readability issues and surfacing hidden surprises. Its design could then evolve much faster before it is memorialized as a protocol version in Python itself. If desired, I can help facilitate this conversation between multiple parties, as I have contacts with many of the affected projects.