Supporting (or not) invalid identifiers in **kwargs

Consider the following function:

def foo(**kwargs):
    print(kwargs)
>>> foo(er=5)
{'er': 5}
>>> foo(**{"er":5})
{'er': 5}

This is normal behavior. But consider the following:

>>> foo(**{"er ":5})
{'er ': 5}
>>> foo(**{"3":5})
{'3': 5}

That’s weird. But what follows is weirder:

>>> foo(**{3:5})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: keywords must be strings

Why does that happen?

There is another instance where Python itself does an identifier:value association with a dict: the globals() builtin. And it allows any value, even numerals, as keys in the returned dictionary.
In that case, the side effect of modifying the global scope through that dict is undocumented (afaik; I also heard removing that side effect was under discussion, but I may be wrong), so if we give the user a dictionary we shouldn’t restrict them from doing whatever they want with it. The dict reads a scope, it does not create one (or at least it’s not documented as such).

But the case of double-stars is different, because passing a double-starred dict creates a scope, and only incidentally stores the leftover identifier:value pairs in a supplementary dict (conventionally named “kwargs”). From that point of view, it makes sense to forbid passing 3 as a key in a double-starred dict, because 3 is not a valid identifier.
But then, if we are already checking that the keys of a double-starred dict are strings, why aren’t we also checking that they are valid identifiers? Why are we letting invalid identifiers such as "3" or "er " or even "arg-tup(fup" pass into the function’s kwargs without raising an exception? ("class" is another problem.)
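To illustrate that scope-creation point, here is a small sketch (the function and keys are invented for the example): keys that match a declared parameter bind to it, and only the leftovers land in kwargs.

```python
def bar(er, **kwargs):
    # "er" matches the declared parameter; any key that matches no
    # parameter name is collected into the kwargs dict instead.
    return er, kwargs

print(bar(**{"er": 1, "er ": 2, "3": 3}))
# → (1, {'er ': 2, '3': 3})
```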

I think that’s an inconsistency which doesn’t make any sense, and that it should be solved either by allowing any dict (with any key type) to be double-starred, or by disallowing any non-valid-identifier key from double-starred dictionaries, typically by calling the equivalent of assert key.isidentifier().
I would personally prefer the second choice, and the possible next step would be to call (the equivalent of) assert not keyword.iskeyword(key).
Also, just to be clear, allowing the user to do kwargs[3] = 6 inside the function is normal, and is not the subject here.
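For what it’s worth, the proposed check is easy to sketch in pure Python (foo_strict is a made-up name; the real thing would live in the interpreter’s call machinery):

```python
import keyword

def foo_strict(**kwargs):
    # Hypothetical version of the proposed check: reject keys that are
    # not valid identifiers, and (the possible next step) reject
    # reserved keywords as well.
    for key in kwargs:
        if not key.isidentifier():
            raise TypeError(f"keyword {key!r} is not a valid identifier")
        if keyword.iskeyword(key):
            raise TypeError(f"keyword {key!r} is a reserved keyword")
    return kwargs

foo_strict(**{"er": 5})       # accepted
# foo_strict(**{"er ": 5})    # would raise TypeError
# foo_strict(**{"class": 5})  # would raise TypeError
```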


Calling functions in Python is slow enough as it is without the extra checks you are asking for. This behaviour seems harmless: I’ve never heard of it causing an actual bug in real code. So if you don’t like that behaviour, the answer is “Don’t do that!”

Because the behaviour is not specified as required by the language, other interpreters may choose to be stricter. Have you tested what PyPy, Jython, IronPython, MicroPython, RustPython, GrailPython, Cinder etc do?


PyPy and IronPython behave exactly like CPython, as far as I tested.
RustPython’s online demo is the same.
Jython’s version is 2.7, so I’m not sure that’s relevant (and it also doesn’t work on my computer for some reason).
MicroPython doesn’t seem testable on my end, Cinder is a web browser apparently, and searching for GrailPython doesn’t return any programming-related result.

My main concern is consistency, and the idea of having a safety net. We need to make a clear choice: if we don’t check the keys, then we shouldn’t check them at all, and passing **{3:5} should be allowed. Otherwise, people testing it may infer from the resulting error a guarantee of safety which isn’t there, regarding the kind of keys you can find in a kwargs dict.
That could turn out to be a security issue: you can pass valid instructions, function calls, even "import stuf;stuf.os_command('DROP_TABLE')" as a string, when only return or break or other relatively harmless things could be passed if you had the identifier-only check (and even those are gone if you go further with the no-keyword test). I don’t have a use case or existing vulnerable code, but who knows, security failures have been based upon far less than this.

It also closes off a corner of the feature space. If we ever want to give a future meaning to writing 3=something in a function call, the current practice of having **{"3":5} work blocks us: either we break working code (based upon an undocumented feature, but working code nonetheless), or we end up with an ambiguous behavior (does **{"3":5} do the same thing as 3=5, or does it store things in the kwargs dict?).
Breaking things like that has been done in the past: for example, def foo((x, y), color): was valid (though possibly undocumented) Python 2 code, yet it was broken at some point during the 3k transition. But breaking that was only possible through the historic 3k breaking transition, and because it wasn’t that widely used.
Since all 4 tested interpreters behave the same, it has the dangerous potential of becoming a widely used “feature”.

It’s totally consistent, harmless, and intentional. The keys of a dict passed using **d must be strings but they don’t have to be identifiers.

>>> class sttr(str):
...     def __repr__(self):
...         return "this could be " + 'malicious code'
...     __str__ = __repr__
... 
>>> foo(**{sttr("whatever"): 5})
{this could be malicious code: 5}

Ok, granted, getting the said malicious code onto the executing machine in the first place is a task in and of itself. But I think it’s important that the type check is not strict (and unless I’m very wrong, an exact type check would actually be faster), so that things which are not exactly strings can be passed.

I don’t see why a string key that’s not a valid identifier would open up an attack vector. Maybe you don’t know how Python is interpreted?

I’m not saying that calling the function itself - that is, exploding the double-starred dict and creating the kwargs dict from it - could trigger malicious code by itself. But at the same time, yes, it’s possible there’s something I’m missing about how Python is interpreted.

What I’m saying about security issues is that people have expectations about the type and value of the keys found in the kwargs dict. You said they could only be strings, and yet even that’s not exactly 100% true.
I see a parallel with a PEP I read about a pseudo-type for literal strings as opposed to non-literal strings (PEP 675, I believe), because only allowing literal strings in a function call (if the check is done in a reliable manner) offers security, typically against SQL injection. In the same spirit, I don’t think it’s a stretch to say that people may expect the kwargs dict to have Any values and Identifier keys (Identifier being a subset of str, which I’m sure you understand).
Even though that’s actually not the case, I don’t think expecting it is foolish, and I believe a lot of programmers naively think it is the case. So there may be a security issue, coming for example from the fact that a SQL injection is impossible from an Identifier, but possible from an arbitrary string. People having that expectation could write vulnerable code when handling potentially dangerous strings.
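To make that concrete, here is a sketch of the kind of naive code such an expectation could produce (insert_row and the schema are invented for the example):

```python
def insert_row(table, **kwargs):
    # Naive code that trusts kwargs keys to be plain identifiers and
    # splices them directly into the SQL text. With identifier-only
    # keys this is comparatively safe; with arbitrary strings it is an
    # injection point.
    columns = ", ".join(kwargs)
    placeholders = ", ".join("?" for _ in kwargs)
    return f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"

print(insert_row("users", name="x"))
# → INSERT INTO users (name) VALUES (?)

# A non-identifier key flows straight into the statement text:
print(insert_row("users", **{"name); DROP TABLE users; --": "x"}))
```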

I’m 75% sure there is Azure SDK code with hyphens in kwargs keys (in some model initialisation).

Is the string requirement for optimisation reasons (matching with explicitly declared parameters)?


There is currently code using that undocumented feature. My initial outrage came from the realization that this line of code worked.

Then please file a bug that it should be documented. I consider it a language feature in good standing, but I don’t have every page of documentation memorized, so I don’t know whether there is explicit documentation of this (though it is probably implied somewhere by something that indicates a dict with string keys is required for **).


Adding to the conversation:

**kwargs, IMHO, is a not-that-special case: it’s passing a dict around.

However, you can also use setattr() with invalid identifiers:

In [22]: class C:
    ...:     pass
    ...: 

In [23]: obj = C()

In [24]: setattr(obj, 'this-that', 5)

In [25]: getattr(obj, 'this-that')
Out[25]: 5

In [26]: obj.__dict__
Out[26]: {'this-that': 5}

In [27]: dir(obj)
Out[27]: 
['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'this-that']

(non-strings will fail)

And, of course, that attribute is not accessible in the usual way.
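A quick sketch of that point (class and attribute name invented for the example):

```python
class C:
    pass

obj = C()
setattr(obj, 'this-that', 5)

# obj.this-that would be parsed as the subtraction (obj.this - that)
# and fail, so the attribute is only reachable dynamically:
assert getattr(obj, 'this-that') == 5
assert vars(obj)['this-that'] == 5
```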

This is certainly known; I don’t think it’s well documented, and I have no idea if it’s intentional.

Personally, I think it’s a bit of a wart, but not a fatal one, and maybe even sometimes useful.


Again, totally intentional, and this could be seen to explain the behavior that **kwds insists on string keys – attribute names are strings, any further validity checks are up to the individual object.

totally intentional,

I’m curious about that intent.

  1. It’s not worth the performance overhead to enforce legal identifiers
    or
  2. Being able to use non-legal-identifier strings as attribute names is a useful feature that should be allowed, and maybe even encouraged.

As for (2) – I’m personally ambivalent – I know that sometimes python object attributes are mapped to attributes from some other system that may have different naming rules (e.g. the netCDF4 library maps netcdf variable attributes to python class attributes, and they are not all valid python identifiers). But I tend to think this is a mistake – crossing the boundary from “data” to “code” inappropriately – so should this use case be allowed/encouraged?

In any case, it should be better documented – it would be great if the OP or other interested parties submitted a doc PR :slight_smile:

I have no direct knowledge, but I suspect it’s a combination of the things you mentioned:

  1. The performance cost of the check is unnecessary.
  2. Most uses of either **kwargs or setattr won’t hit this case, so it’s pointless to worry.
  3. Someone might have a use for it, so why prohibit it?
  4. But it’s an implementation detail, so let’s not make a fuss about it.

Basically, it feels very much like a case of “we’re consenting adults” - if you don’t like being able to do this, then don’t do it, but we’re not going to stop you if you want to.

I’m honestly not sure it should. Documenting it would require all Python implementations to support it, including future versions of CPython. Personally, I’m perfectly happy with it being an implementation detail of CPython that people have to discover for themselves.


So, your answer is not to support it, even though adding a check against it is not on the books?
I’m much more in favor of that view than what @guido said, which was that it should be documented.

I think (if an interpreter check is not going to happen anyway) that the docs should warn about it: that passing anything other than a valid identifier is undocumented behavior, which may happen to work on some platforms but is not guaranteed to work anywhere, nor in future versions.
We can specify that CPython allows it, if you want, since that’s usually what implementation-detail warnings in the docs contain.
That goes for **kwargs as well as for setattr.

What’s the point of leaving it up to the implementation? That just invites gratuitous differences.

What’s the point of taking up the permanent burden of supporting it? What benefit does it bring? Specifying it as an implementation detail / unsupported is exactly how things currently are, but with a supplementary warning about it.

:point_up: +1 to that.

More is not always better. CPython is a complicated piece of software. Its behaviors can already be hard to understand, and the more the docs focus on the subtleties that are important, the better they are.

The most significant benefit of supporting the current behavior is that it is the current behavior. Changing it could potentially make a lot of people’s code stop working. That’s a very strong reason not to change things in the language unless there’s a good case for doing so.

What is the benefit of rejecting invalid identifiers in splat-expansion? Does the benefit of documenting the behavior offset the cost[1] of doing so?


  1. The cost of documentation includes deciding whether it is part of the Python Language or just a CPython detail. ↩︎

The check is being enforced here, in the form exhibited by @Gouvernathor, but the same thing is checked about keywords elsewhere (just grep):

An implementation that wanted to could take advantage of the fact that names that appear in code, including keyword argument names, are always strings. It would be more robust, and might be faster, to strongly type the variables that hold them. This is no obstacle to callers submitting dictionaries with such keys; they just won’t match any parameter (but see below).

It is too costly to check the form of the keys in every dictionary presented here, so arbitrary strings will remain a possibility. I’m +0.5 for documenting that. I’d quite like us not to support strs that walk on the wild side of Unicode (e.g. that contain lone surrogates, encoding text in undefined ways).
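For instance (as far as I can tell in CPython), a lone surrogate sails through the string check even though it cannot be encoded to UTF-8:

```python
def foo(**kwargs):
    return kwargs

# A lone surrogate is a perfectly legal str object, so the
# "keywords must be strings" check accepts it as a ** key:
key = "\ud800"
assert list(foo(**{key: 1})) == [key]
# key.encode("utf-8")  # would raise UnicodeEncodeError
```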

The other uncomfortable aspect of the code is that the type check is not exact, and is followed by a call to a potentially custom __eq__. I don’t see an exploit, but it’s questionable as a feature.

>>> "my.domain.name".split(**{'sep':'.'})
['my', 'domain', 'name']
>>> class S(str):
	def __eq__(self, other):
		print(other)
		return True
	def __hash__(self):
		return 42

	
>>> "my.domain.name".split(**{S('sep'):'.'})
['my', 'domain', 'name']
>>> "my.domain.name".split(**{S('xxx'):'.'})
sep

This last one brings down the interpreter (idle in 3.8.10).


Is that a traceback or a segfault? If the latter, can you repro on main or at least 3.11?