I would like to know how much each of them knows about core development, the various moving parts of CPython, and our collective history. A good way to gauge that is to ask a concrete question. So here is my question: if I were going to submit PEP 574 for pronouncement (not really an unlikely scenario), what detailed reaction or resolution would you propose as a Steering Council member?
I will reiterate my position that the Steering Council should form working groups or topical subcommittees to work on specific matters. So on most matters, I would look at the broad developer consensus, and only weigh in at a technical level if I felt very strongly that something was clearly objectionable.
That being said, I do have questions about this PEP. I admit that I am remiss in not yet having looked at it in much detail, so realistically I need to take some more time to review it. But after spending a couple of hours with it last night, here is my initial feedback.
Technical & Design Questions
This PEP addresses a concrete problem whose motivations I (of all people) am very sympathetic to. Maybe I am missing something obvious, but the existing facilities in pickle and copyreg/dispatch_table seem like they should be sufficient. It seems like the intent of Pickler.persistent_id() is precisely to serve this kind of need, and I don’t fully grok the reasoning for why this is a rejected alternative.
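To make that concrete, here is a minimal sketch (names like BufferStore, and the size threshold, are mine, purely for illustration) of how persistent_id()/persistent_load() could route large buffers through a side channel today, with no new protocol:

```python
import io
import pickle

class BufferStore:
    """Illustrative side channel holding raw buffers referenced by a pickle."""
    def __init__(self):
        self.buffers = []

    def put(self, buf):
        self.buffers.append(buf)
        return len(self.buffers) - 1

class OutOfBandPickler(pickle.Pickler):
    def __init__(self, file, store):
        super().__init__(file, protocol=4)
        self.store = store

    def persistent_id(self, obj):
        # Divert large bytes-like payloads to the side channel; returning
        # None tells pickle to serialize the object normally.
        if isinstance(obj, bytearray) and len(obj) >= 1024:
            return ("buffer", self.store.put(obj))
        return None

class OutOfBandUnpickler(pickle.Unpickler):
    def __init__(self, file, store):
        super().__init__(file)
        self.store = store

    def persistent_load(self, pid):
        tag, index = pid
        if tag == "buffer":
            return self.store.buffers[index]  # zero-copy handoff
        raise pickle.UnpicklingError(f"unknown persistent id: {pid!r}")

store = BufferStore()
payload = {"meta": "small", "data": bytearray(1 << 20)}
f = io.BytesIO()
OutOfBandPickler(f, store).dump(payload)
f.seek(0)
restored = OutOfBandUnpickler(f, store).load()
assert restored["data"] is store.buffers[0]  # same object, never copied
```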
In the PEP, the first stated drawback of persistent_id, namely having N consumers x M producers, does not actually seem that onerous to me, because N and M are quite small. The producers are basically {NumPy, Pandas, Arrow} and the consumers are (I think) {Dask, Arrow, xarray}? I’d have to think about it a bit more, but it seems like the producers could even provide template reducer/constructor functions which the consumers would only need to lightly customize.
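For example (and this is purely a hypothetical sketch, not an existing NumPy or Dask API), a producer could ship a template reducer in which the consumer plugs in only its buffer-export policy via a custom dispatch_table:

```python
import copyreg
import io
import pickle
import numpy as np

def reconstruct_array(buf, dtype, shape):
    # Template constructor the producer would ship.
    return np.frombuffer(buf, dtype=dtype).reshape(shape)

def make_array_reducer(export_buffer):
    """Template reducer; export_buffer(arr) is the consumer's policy."""
    def reduce_array(arr):
        return (reconstruct_array, (export_buffer(arr), arr.dtype.str, arr.shape))
    return reduce_array

# A consumer customizes only the buffer-export step:
table = copyreg.dispatch_table.copy()
table[np.ndarray] = make_array_reducer(lambda arr: arr.tobytes())

f = io.BytesIO()
p = pickle.Pickler(f, protocol=4)
p.dispatch_table = table
p.dump(np.arange(6).reshape(2, 3))
f.seek(0)
print(pickle.load(f))  # [[0 1 2] [3 4 5]]
```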
I know that future libraries may arise, so this cardinality may grow - but it’s been 20 years since Numeric, and we’re still looking at a very small handful of libraries that need to touch this kind of internal detail. Even as more people use large-memory CPU and GPU devices, they are likely to build on things like Dask, Arrow, or xarray rather than building yet another "dict of NumPy arrays" sort of thing. So I expect N and M to grow slowly - I can see XND, CuPy, and maybe xtensor on the horizon.
Additionally, the "potential performance drop" downside cited in the PEP surely applies to any use of persistent_id(). So if even this (very compelling) case doesn’t justify its use, then why should the pickle module provide persistent_id() at all? Should we have a prominent performance warning in the docs?
Moving beyond the question of motivation, the PickleBuffer object itself seems pretty reasonable, but my spidey senses are really tingling at this idea of a "potential weak reference to a hunk of memory whose immutability is indicated by a flag in another object, perhaps on another thread, perhaps in another process". I don’t feel like I have a good mental model of the various interplays between reference vs. copy, mutable vs. immutable, inter-thread shared Numpy buffers vs. Arrow inter-process shared memory, etc.
This all just seems like asking for a pile of pain, because PickleBuffer has to internalize and handle all of the semantics of ownership and mutability across that NxM consumer-producer matrix. It seems to me that these semantics should ideally be handled entirely by the higher-level "consumer" libraries, building on more primitive semantics within the lower-level array "producer" libraries. I don’t see how PickleBuffer can avoid a raft of fairly complex unit tests and special-casing. For instance, what would it look like if I serialized a NumPy array and a Dask array that consists of two NumPy views on that array? If I were using pickle protocol 4 and did the same thing, what would be different (other than performance & memory footprint)?
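To make that question concrete, here is a small probe against the out-of-band API the PEP proposes (runnable only with an interpreter or backport implementing the PEP, plus a NumPy that supports protocol 5); I genuinely do not know what it should print:

```python
import pickle
import numpy as np

base = np.arange(10)
views = (base[:5], base[5:])   # two NumPy views sharing base's memory

# Out-of-band round trip as proposed: buffer_callback collects the
# PickleBuffers, and loads() is handed them back via buffers=.
bufs = []
data = pickle.dumps(views, protocol=5, buffer_callback=bufs.append)
restored = pickle.loads(data, buffers=bufs)

# The questions above, made concrete: do the restored "views" still
# alias each other (or even base)? Are they writeable? The answers
# depend on how each producer library implements __reduce_ex__.
print(np.shares_memory(restored[0], restored[1]))
print(restored[0].flags.writeable)
```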
Has there been some assessment of the difficulty of implementing PickleBuffer correctly on non-CPython interpreters? If this is one more hurdle that makes it harder for PyPy to keep up, or for Jython to ever become Python 3 compatible, then that is something we should keep in mind. For instance, with this PEP, the Jython folks are condemned to build a set of test suites that exercise Arrow<>NumPy<>Pandas<>Dask zero-copy array transport in the context of JVM off-heap memory semantics.
Developer Consensus
It looks like some representative members of the affected libraries & data ecosystem have been consulted in the drafting of this proposal, which is good. The feedback also seems to be positive. I did, however, note this comment from Allan Haldane in the NumPy PR:
"One might object that it’s a bit strange that whether or not an unpickled ndarray is writeable depends on the pickle protocol and whether it was saved to disk or not."
This portends the sort of troubling morass of mutability & ownership semantics that I allude to above. PickleBuffer - by its name - connotes "pickling", which throughout all of Python’s history has implied a serialization process. But it is now going to be an occasional teleporter of shared references and views, and users will need to pay close attention to the serialization path through different libraries in order to reason about basic things like, "can the unpickler modify this object that I’m about to pickle?" That seems like it could be really error-prone, and a really great source of security issues.
User Impact
What are the downstream consequences of bumping a pickle version number? I know that pickles are very widely used as convenient persistence that "mostly just works". It’s much less fuss than fiddling with a database and thinking about long-term schema management. So to that point, people may not be as diligent in tracking their usages of pickles throughout their overall application pipeline.
It should be noted that Python 2’s pickle protocol has been stable for 15 years, across v2.3 - v2.7. So, many people who have been using Python the longest may be least accustomed to thinking about pickle protocol incompatibilities between point revisions of the language.
Python 3 has seen a pickle version bump every few years, so maybe I’m over-worrying and this is not actually a huge burden to users. However, it would be really nice if there were some way to not force people to pay for what they don’t use, e.g. a pickle without any PickleBuffers could be saved as a more-compatible v4 pickle.
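For instance, something like this hypothetical helper (the name and the pickle-twice approach are mine, just to sketch the idea) could fall back to v4 whenever no out-of-band buffers materialize:

```python
import pickle

def dumps_compat(obj):
    """Emit protocol 5 only if out-of-band buffers actually appear."""
    bufs = []
    data = pickle.dumps(obj, protocol=5, buffer_callback=bufs.append)
    if not bufs:
        # Nothing used PickleBuffer: re-emit as a more widely readable
        # v4 pickle. (Pickling twice is wasteful; a real implementation
        # would do better.)
        data = pickle.dumps(obj, protocol=4)
    return data, bufs
```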
Conclusion
With my Steering Council hat on, since it looks like there is broad consensus between core developers and the dev community served by this PEP, I don’t think I have standing to wield power in this case, other than to certify the consensus.
That being said, with my developer hat on, I would issue a strong word of warning. This PEP entangles memory-ownership semantics across many different projects (several of which are still in their formative phase) and across process boundaries, persists them across time, and then invites that persistence into the core language as a thing the core devs will maintain going forward. Whenever a memory leak, race condition, or security vulnerability arises between two external projects, it will very likely require changes or clarifications in PickleBuffer’s metadata. In the future, when someone loads an old pickle from a world-writeable path using some old version of Dask, might they be creating a security hole in their application?
This PEP is essentially an optimization (although an important one), so I don’t understand the urgency of incorporating it into the core language. If various external projects feel so strongly about this issue, then why not create PickleBuffer as a small new package and encourage NumPy, Dask, Arrow, xarray, TensorFlow, PyTorch, CuPy, etc. to all take it on as a dependency? This would allow it to bake against real-world use cases, fleshing out usability and readability issues and surfacing hidden surprises. Its design could then evolve much faster before it is memorialized as a protocol version in Python itself. If desired, I can help facilitate this conversation between multiple parties, as I have contacts with many of the affected projects.