Supporting (or not) invalid identifiers in **kwargs

vbrozik · July 17, 2022, 7:18pm

I have only Python 3.10.4 at hand and here it throws a (not unexpected) exception:

…

>>> "my.domain.name".split(**{S('xxx'):'.'})
sep
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
SystemError: null argument to internal routine

The interpreter does not crash.

…ok I did not have experience with SystemError. According to the documentation it should not appear:

https://docs.python.org/3/library/exceptions.html#SystemError

guido · July 17, 2022, 7:31pm

Aha. You should totally file a bug report on GitHub for that. Thanks for finding this!

jeff5 · July 17, 2022, 7:40pm

I meant that I just get a === RESTART: Shell === in 3.8.

However, in 3.10.2 with the same S in a file, after printing sep, I get this traceback:

Traceback (most recent call last):
  File " ... \kwargsbomb.py", line 19, in <module>
    c = u.split(**{S('xxx'):'.'})
SystemError: null argument to internal routine

Amazing how much re-work this code is getting from one version to the next.

jeff5 · July 17, 2022, 8:09pm

And the same on 3.11.0b4

I can do that. (Custom str subclass as keyword argument leads to System Error · Issue #94938 · python/cpython · GitHub)

tjreedy · August 6, 2022, 12:48am

(Custom str subclass as keyword argument leads to System Error · Issue #94938 · python/cpython · GitHub) has been fixed and closed. Jeff’s code above now gives

TypeError: invalid keyword argument for split()

PythonCHB · August 9, 2022, 4:49pm

Glad that bug got fixed! – but where are we on the documentation question. It seems there are two options:

This is a supported feature of Python

or

This is an implementation detail that may or may not be supported in other versions / implementations of Python.

I think not documenting it is a clearly bad idea.

If (1) then it should be documented so other implementations know that it needs to be supported. And some clarity on exactly what is supported would be good.

if (2) then it needs to be documented so users are warned that they are relying on an implementation detail.

Given that there is code in the wild that uses this feature, and has for years, I think (1) is probably the way to go.

That leaves a clear definition of what “this” is. Here’s a stab at it:

In all instances where a python name is specified/stored in a container or variable, the name must be a string, but may be a non-valid python identifier.

Examples:

__dict__ and **kwargs may not have non-string keys, but may have non-valid strings as identifiers (e.g. "this-name").

getattr() and setattr() will except only strings, but those may not be valid identifiers. (e.g. setattr(obj, "this-name", value)

As for the “wild side of Unicode” – I have no idea how to define or describe that – but it would be good to do so – at least to say that the results may be undefined.

Along those lines – I recall a fairly recent conversation somewhere (maybe python-dev?) about the normalization of Unicode in identifiers – e.g. various versions of letters that are essentially a font different (like Blackboard Bold) are normalized to the same letter by the interpreter:

In [6]: ℂhris = 5

In [7]: Chris
Out[7]: 5

(that C is "\u2102" in the first instance, and a regular C in the second)

But this transformation is not done by setattr(), nor I assume (Not tested) by any direct setting to dict and the like.

In [14]: setattr(obj, "ℂhris", 5)

In [15]: getattr(obj, "Chris")
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-15-ddf0ae72b865> in <module>
----> 1 getattr(obj, "Chris")

AttributeError: 'Dummy' object has no attribute 'Chris'

In [16]: getattr(obj, "ℂhris")
Out[16]: 5

This should probably be documented as well.

Gouvernathor · August 10, 2022, 12:18am

I completely agree about the need for a clarification.
I’m not convinced by any of the arguments for allowing and documenting the current behavior, your option (1). I understand the argument for leaving the code as it is, but not for pledging support for it in the future.

I don’t think we ought to support the code that currently uses it, and we should warn of its unstability and that it’s an implementation detail as soon as possible.
To my knowledge, such things have already been done. For example, the following function signature worked in py2 (and possibly later) def place(label, (x, y), color="#f00"), while in py3 it doesn’t. Granted, it’s the py2-3 transition, but I never saw any warning about it in various things I read about that transition, or in the PEPs describing the current arguments syntax (pos-only arguments, keyword-only required arguments…). It was never documented, and deserved to go without a proper funeral when it was deemed necessary.

I think __dict__ would probably be the only exception, I’d even be fine with it accepting any and all keys, much like the globals() dict.
In any other case, it should be left unguaranteed, as an implementation detail, that 1) non-indentifiers and 2) non-strictly-string objects (meaning string subclasses and other classes) are supported. Even if my point 1) goes as documented, we should leave 2) as an implementation detail.

Lastly, I believe the last part about Unicode normalization is another issue. It should be adressed, and probably be deemed an implementation detail, but I don’t think this thread is the place for it - though it’s a good illustration of the dangers of allowing non-identifiers.

guido · August 27, 2022, 11:44pm

(Digging up a relatively old thread that I’d saved as deserving a reply.)

Seems you have not yet found PEP 3113, which is about this specific change.

Anyway, you haven’t convinced me that there’s anything to be gained by officially dropping support for this. I still feel we should continue the current support and make it official (other Python implementations like Jython support it too).

Unicode normalization is a non-issue since at the setattr/getattr level we just use the string passed in as the key, so you get what the unicode __eq__ method gives you. The same goes for str subclasses.

Gouvernathor · August 28, 2022, 2:00am

Apologies about the PEP, I talked too fast.

For supporting or not supporting something I think the burden of proof, so to speak, goes the other way around : this partial restriction on some dicts’ keys (a non-strict type check and no content check) is not mentioned anywhere in the doc, as far as I know. What I found in the doc amounts to this, here (emphasis mine) :

… [unless] **identifier is present; in this case, that formal parameter receives a dictionary containing the excess keyword arguments (using the keywords as keys and the argument values as corresponding values)

and this :

If the syntax **expression appears in the function call, expression must evaluate to a mapping, the contents of which are treated as additional keyword arguments. If a keyword is already present…

The doc describes it as a dict (or mapping) containing keywords, which means valid identifiers. This is shown in the lexical definition square, where a “keyword_item” is described as being made of an identifier, an equal sign, and an expression.

Python developers considering something a feature is one thing, it being a publicly supported and documented feature is another. So, for me, the debate should be whether to start officially supporting it or not, rather than whether to drop a support that isn’t there.
I presented a case to include defensive programming against it, there is a consensus against that, and I accept it. But I think that doesn’t make the behavior anything more than an implementation detail, until argued and unless decided otherwise.

jcgoble3 · August 28, 2022, 4:28am

I have no opinion either way, but I’m going to point out that Python’s own documentation for the unittest.mock library supports and endorses the use of non-identifiers as keyword arguments in the docs for the configure_mock() function.

Does this change any opinions?

guido · August 28, 2022, 9:39pm

I am tired of litigating this, and doubly so if the litigation is based on documentation that was written decades ago. (These docs are also clearly out of date, positional-only arguments are still described as a CPython-only feature for builtins only.)

I never intended this to be an undefined language feature (those are generally a bad idea). Can someone please submit a doc PR that clarifies the situation, so I can approve it?

Gouvernathor · August 28, 2022, 11:38pm

@guido I don’t think any change adding that to the documentation, turning in effect that currently undocumented implementation detail into a documented feature, should be approved by you alone without a general discussion about it where you would have to come up with reasons - after all you’re the one asking for a change, here. And I don’t think you gave any other reason yet than the fact that you always considered it that way.
If you’re tired of arguing, we can always keep the doc the way it is.

Maybe I’m mistaken, and “approving” a PR doesn’t mean merging it but only starting a discussion on it. If it’s the case, my mistake.

@jcgoble3 Congrats on finding that one ! I guess that’s a quite good reason not to change the implementation. But I still have doubts about documenting it.

guido · August 29, 2022, 2:26am

I am hating every minute of this discussion, but I disagree that the feature deserves to be implemented but undocumented. So I don’t want to do what I typically do when – mute or unsubscribe from the discussion.

I would have removed myself from the discussion long ago if I hadn’t disagreed with that outcome – the status quo (having it undocumented) feels wrong to me. I’m honestly at a loss why you’re fighting this – your arguments seem to be procedural (it was undocumented so we can’t document it without a general discussion) rather than substantial (there’s something wrong with the feature).

As far as I’m concerned, we’ve been having the general discussion in this thread, and other than outrage and a vague concern about a potential security issue you haven’t clarified what you think is wrong with the feature. Can you please summarize your argument?

jeff5 · August 29, 2022, 9:41am

I would also prefer a behaviour to be documented. People tend to believe the language is what CPython does, and it is useful to know the limits of what was intended when considering a re-implementation.

I would be happy to do that if no-one has beaten me to the preparation of one. (I couldn’t see an actual PR.) I would draw on what Chris and others have written.

At least to begin with, I would try to clarify the intended freedom in supplying keyword arguments as a mapping to a call, as a language feature not an implementation detail. It would be a few sentences in the documentation of calls.

There is at least one other place where a similar issue arises (the global dictionary of a module), and it might be worth doing there too, but I find this to be too abstract:

At any rate, I wouldn’t be able to make this precise, or find it when wondering I should be outraged along with:

da-woods · August 29, 2022, 10:07am

Going back quite a few messages: “GraalPython” (two "a"s) - GitHub - oracle/graalpython: A Python 3 implementation built on GraalVM

It looks to behave the same as CPython as far as I can tell

Gouvernathor · August 29, 2022, 11:26am

@guido

your arguments seem to be procedural (it was undocumented so we can’t document it without a general discussion)

Yes, that’s a big part of what I’m saying.

you haven’t clarified what you think is wrong with the feature

My initial point was that the feature is wrong and should be removed (from CPython and other interpreters, that is), but there are compelling reason not to do that, so I accept them and that’s not what I’m aguing right now.
That issue is partly moot, but the main reason for my “outrage” (which is a bit of an exaggeration on my part, not to be taken too litteraly) is the fact that (**existing_dict) is an extension of the concept of keyword arguments, and that keyword arguments are named using valid identifiers. This behavior enables more than just that, so I think it’s weird and inconsistent, because that’s not the job the ** syntax was tasked to do.
To use a metaphore, passing arguments is like passing data from one point to another in the code, which is like moving people from one point to another in a car. Parsing/interpreting the function call is like putting people in a car, and interpreting the function signature afterwards is like pulling them out of the car. So, this behavior feels (to me) like if you opened the car’s door in a particular way, it enabled you to fit a building inside a two-seats car. It’s inconsistent with how cars work, with how function arguments work, and with what you can pass through.
And what’s even more, there yet is some form of limited safety in the fact that you need to use strings or string subclasses, and not any and all objects : you can put buildings in the car, but not clouds or other cars. That’s what’s the most inconsistent, to me : the point to which it’s permissive seems as arbitrary as it gets.

Now, since you want to change things, and you want this to be considered a proper documented feature, could you please summarize your argument as to why you want it to be ?
I think that’s the way this discussion should go, and only once you do that can I understand your reasons, and explain why I consider them good or bad. I think you should do this before answering to the reason I wrote above - which your reasons may directly uproot.

@jeff5 your PR’s phrasing may advance this discussion, so please do. But to be clear I think that no PR should be merged without a conclusive discussion about it.

guido · August 29, 2022, 3:31pm

You’re arguing from perceived inconsistency. I have news for you: no matter what you do there will be inconsistency somewhere. This just isn’t a tenable argument with software, you can always come up with an analogy or metaphor that breaks.

My position is simple: it’s always been this way (since **kwds was introduced), it is useful (for interpreter speed, and for users with certain kinds of needs) and it was a pure accident that it was never documented.

I have the impression that you enjoy arguments a bit too much. Please do it elsewhere.

Gouvernathor · August 29, 2022, 3:56pm

Sure, but that doesn’t mean we shouldn’t minimize cases when that happens.

I understand that, and it’s a very good reason to leave the implementation as it is.

I think that argument gets a response in the PEP you linked, 3113, in this section. Basically, if people want the function to receive a mixed-keys dict, they can stay within the boundaries of the (current) doc, and pass the dict as one specific parameter.
NB : the section is called “…if removed”, but in our case it’s rather “…if not used, or advised against”.

Actually, wouldn’t that solution be faster at runtime ?

Believe me, I don’t particularly enjoy this discussion. I just would even less enjoy seeing unjustified changes being unilaterally applied to a language I love, especially when I find these changes to be bad.

jeff5 · August 29, 2022, 5:01pm

PR at Document that keywords in calls need not be identifiers by jeff5 · Pull Request #96393 · python/cpython · GitHub

encukou · August 29, 2022, 7:39pm

I also believe this is a case unclear documentation.

The fact that you can use a keyword like for as argument/attribute/variable name was pretty useful when async became a keyword in 3.7.

Similarly, I think that as Python in the browser becomes more popular, the possibility of putting $ in attribute/argument names will be very useful for interop with some Javascript libraries. At least for a while, before APIs adapt.