PyArg_ParseTuple() has a rarely used feature: it supports parsing nested sequences. For example, resource.setrlimit used PyArg_ParseTuple(args, "i(OO):setrlimit", &resource, &curobj, &maxobj) to parse its arguments. The problem is that it accepts arbitrary sequences, including mutable ones. Some format units store a borrowed buffer or reference (e.g. “s” and “O”). If the sequence is mutated, the borrowed buffer or reference may no longer be valid. This is an innate flaw of this C API and cannot be fixed in PyArg_ParseTuple(). The only solution is to deprecate accepting non-tuple sequences.
We can make the deprecation more lenient: only reject a non-tuple sequence if the format units store a borrowed buffer or reference. If all format units are like “i” or “s*”, it is safe to accept a mutable sequence.
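To make the hazard concrete, here is a pure-Python analogy rather than what PyArg_ParseTuple() does internally. A weak reference stands in for the borrowed reference that “O” stores, and the names Limit, FreshItems and borrow_item are made up for this sketch.

```python
import gc
import weakref


class Limit:
    """Stand-in for any object a sequence might hand out."""


class FreshItems:
    """A sequence that builds a new object on every index access."""

    def __len__(self):
        return 2

    def __getitem__(self, index):
        if not 0 <= index < 2:
            raise IndexError(index)
        return Limit()


def borrow_item(seq, index):
    """Mimic a borrowed reference: keep only a weak reference to the item."""
    item = seq[index]
    ref = weakref.ref(item)
    del item      # the parser drops its own strong reference
    gc.collect()  # only matters on implementations without refcounting
    return ref


# A tuple keeps its items alive, so the "borrowed" reference stays valid.
kept = (Limit(), Limit())
tuple_ref = borrow_item(kept, 0)

# Nothing keeps the freshly built item alive after the access.
fresh_ref = borrow_item(FreshItems(), 0)

print(tuple_ref() is kept[0])  # True
print(fresh_ref() is None)     # True
```

In C the second case would not give a tidy None but a dangling pointer, which is why the format unit matters more than the caller's intentions.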
I agree that this should be deprecated. Being conservative about how we do that is key to not breaking the world for existing extension users, and to not surfacing deprecation warnings to the wrong people rather than to the owners of the code that needs to change. Anything compiled against the Stable ABI and shipped as a long-lived binary may need this to exist for quite a while. But we can guide new users away from it. I like the idea of limiting it to non-tuple sequences as a starter, but even that may need to be done cautiously.
We may be able to do something like disabling the feature sooner in non-limited-api builds even though legacy ABI users might still be allowed to use it. That might require getting too creative in C; I haven’t thought through how to do that (or looked at your PR).
A survey of top-PyPI package extension module sources to get an idea of how often this feature is used and why would be helpful.
Am I wrong in suspecting that this C API feature remained in place during py3k development, even when tuple parameter unpacking was removed from the language, as it saw so little use that most people did not even realize the feature existed?
It’s not really the same thing, but it sure feels like it. “oops”
There are other such examples (e.g. bytearray), so it’s a partial solution at best. Would we be better off offering safe(r) format units and just documenting the risks of using the others?
I’ve used this function quite a few times recently in a context where I absolutely know that I could do this safely.[1] I don’t like rejecting things that are “fine if used correctly and unsafe if used incorrectly” - I’d prefer to only warn/document in that case.
[1] I couldn’t, because virtually all of my types need converters because the built-in formats don’t work for me, but in theory there’d be nothing wrong with my code if I did.
bytearray is not a safe example. Passing a bytearray of length 2 with format "(OO)" will create borrowed references to temporary integer objects. It only works on CPython because integers from 0 to 255 are cached, but this is an implementation detail. BTW, bytes are already forbidden for unpacking.
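The Python-level view of that bytearray point, as a small sketch: a bytearray of length 2 behaves as a two-item sequence, and each item is an int object produced at access time, since the bytearray itself only stores raw bytes.

```python
data = bytearray(b"\x01\x02")

# Unpacking works like any other two-item sequence.
first, second = data

# The items are int objects created on access; the bytearray holds raw
# bytes, so it does not keep those int objects alive itself.
print(len(data), type(first).__name__, first, second)  # 2 int 1 2
```

Whether two accesses return the very same int object depends on the implementation's small-integer cache, so the sketch deliberately does not rely on it.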
Should we risk it? I am sure that almost nobody will read that documentation. I just fixed a seriously outdated paragraph in that file: it had been incorrect since Python 2.0, and nobody noticed.
I thought about adding an N format unit that stores a strong reference. It would be a replacement for O, which stores a borrowed reference. But there is no “safe” and efficient replacement for s. In any case, we can only force users onto safe format units if we forbid the unsafe ones.
I think that my solution is pretty solid. If all nested format units are safe, you can pass a list or other sequence. If any of them is unsafe, you can only pass a tuple, and I am sure that this is enough for 99.99% of users. I don’t know of any case where unpacking a list would be necessary.
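The rule can be sketched as a small decision function. This is a Python model for discussion only, not the code in the PR: the two sets are my reading of which units store a borrowed buffer or reference, they cover only a handful of units, and split_units and accepts_non_tuple are hypothetical names.

```python
# Units assumed to store a borrowed buffer or a borrowed reference.
BORROWING = {"s", "s#", "z", "z#", "y", "y#", "O", "O!", "S", "U", "Y"}
# Units assumed to copy a C value or fill a Py_buffer that owns a reference.
COPYING = {"i", "I", "l", "k", "n", "d", "f", "p", "s*", "z*", "y*", "w*"}


def split_units(nested):
    """Split the inside of a "(...)" group into format units."""
    units = []
    i = 0
    while i < len(nested):
        unit = nested[i]
        i += 1
        # A unit may carry one modifier character.
        if i < len(nested) and nested[i] in "*#!":
            unit += nested[i]
            i += 1
        units.append(unit)
    return units


def accepts_non_tuple(nested):
    """Return True if a list or other sequence would still be accepted."""
    units = split_units(nested)
    unknown = [u for u in units if u not in BORROWING | COPYING]
    if unknown:
        raise ValueError(f"units not covered by this sketch: {unknown}")
    return not any(u in BORROWING for u in units)


print(accepts_non_tuple("OO"))   # False: tuple only
print(accepts_non_tuple("is*"))  # True: any sequence
```

With this model, the "(OO)" group from the setrlimit example would require a tuple, while a group made only of units like “i” or “s*” would keep accepting lists.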
I meant that accepting a bytearray at all isn’t safe in the same contexts where a mutable sequence would be unsafe (i.e. you’re allowing Python code to run during your function, whether on the same or another thread). So are you proposing to fix that as well? Or perhaps we should come up with a better way to deal with unsafe arguments more generally.
In what case? w* uses the buffer protocol which prevents changing the size of the bytearray. et and c rely on the GIL (and they can be made safe for the GIL-less build inside PyArg_Parse if they are not safe). Y just checks the type of the object.
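The w* point can be checked from Python, using a memoryview as a stand-in for the buffer export that a filled-in Py_buffer holds (an analogy, not the parser's own code path):

```python
data = bytearray(b"abc")

# Hold a buffer export, roughly what w* does while the Py_buffer is live.
view = memoryview(data)
try:
    data.append(0x64)  # resizing is refused while the export exists
except BufferError as exc:
    resize_error = exc
else:
    resize_error = None

# In-place writes that keep the size are still allowed.
data[0] = 0x7A

view.release()
data.append(0x64)  # resizing works again once the export is released

print(type(resize_error).__name__)  # BufferError
print(data)                         # bytearray(b'zbcd')
```

So the contents can still change under the export, but the memory cannot be reallocated out from under the pointer.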
Parsing a mutable sequence with (...) is not safe in general even with the GIL.