Non-identifier names of kwargs, attributes, variables etc

encukou · September 20, 2022, 10:47am

The Steering Council was recently asked to decide whether arbitrary strings should be allowed as ** arguments in calls, attributes and elsewhere.
The SC ruled that allowing arbitrary strings is a feature of Python, rather than an implementation detail.

The Ideas thread has a lot of background discussion, but now that we have a SC ruling, I think a new thread is in order.
I wrote up a text about how the details should work, which goes beyond a simple SC ruling. Members of the SC generally agree with this direction, but not necessarily all the details. The nitpicking is better done in public, so here goes:

Let’s separate names and identifiers.

The name of a keyword argument, attribute, function, class, module, variable and similar can be any string. This includes, for example:

the empty string
a string with dots, dashes, dollars or other symbols
a language keyword (e.g. for)
a string with \0 or other control characters
emoji!

This is a feature of Python’s object model. It is not CPython-specific.

(The term name can be confusing when used alone. It should generally be qualified as attribute name, argument name, variable name etc. I’ll make an exception in this thread, which lumps these kinds of names together and isn’t about other kinds of names. Better terminology would be welcome.)

Identifiers, currently documented as a synonym for “name”, are a feature of the Python syntax – a part of Python that’s separate from the object model.

While non-identifier names are inaccessible using the Python syntax, in many cases there is a string-based API to work with them, like getattr/setattr, importlib.import_module, call(**...).
Implementation-specific alternative ways to work with objects, like CPython’s C API or the AST, are also not limited to the Python syntax.
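
To make the string-based APIs above concrete, here is a minimal sketch. The names ns and collect are made up for illustration, and the behaviour shown is what I would expect from a current CPython:

```python
import types

# Hypothetical namespace object, used only for illustration.
ns = types.SimpleNamespace()

# Attribute names that the Python syntax cannot spell directly.
for name in ["", "for", "a.b-c$", "✓"]:
    setattr(ns, name, name.upper())

assert getattr(ns, "for") == "FOR"
assert getattr(ns, "") == ""


def collect(**kwargs):
    """Return the keyword arguments exactly as received."""
    return kwargs


# Arbitrary strings passed as keyword-argument names via ** unpacking.
received = collect(**{"for": 1, "not an identifier": 2, "": 3})
```

None of these names can be written as `ns.<name>` or `collect(<name>=...)`, but getattr/setattr and `**` unpacking reach them without trouble.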

Allowing arbitrary strings should help make implementations simpler (as we don’t need potentially expensive checks), and allows straightforward bindings to other languages and object systems.

The following are implementation details, which may be different across implementations, and might change in future CPython versions (with an appropriate deprecation process):

Allowing non-strings that compare equal to strings (including subclasses of str) as names.
Allowing non-strings in namespaces (like __dict__[3.14]). Non-strings are not considered to be names.
Preserving the identity of strings used as names. (For example, namespace implementations may intern the names, or not store names as Python objects at all.)
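
For illustration only, this sketch exercises the first two items on a recent CPython, which is an assumption. Another implementation, or a future CPython, could legitimately reject either operation. The classes Name and Box are hypothetical:

```python
class Name(str):
    """Hypothetical str subclass that overrides nothing."""


class Box:
    """Hypothetical plain class with an instance __dict__."""


box = Box()

# A str subclass used as an attribute name (implementation detail).
setattr(box, Name("payload"), 1)
via_plain_str = getattr(box, "payload", None)

# A non-string key placed directly in the namespace (implementation detail).
vars(box)[3.14] = "not a name"
non_string_keys = [key for key in vars(box) if not isinstance(key, str)]
```

Portable code should not rely on either line succeeding.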

Some kinds of names may have additional restrictions. For example, module names containing a dot (.) will not work well with the import machinery, since the dot separates package names.
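
A small sketch of the dot restriction: importlib treats everything before the last dot as a parent package and tries to import that first. The module name below is hypothetical and assumed not to exist on the import path:

```python
import importlib

# Hypothetical dotted name; assumed not to be importable.
dotted = "no_such_pkg_for_demo.child"

try:
    importlib.import_module(dotted)
except ModuleNotFoundError as exc:
    # The import machinery reports the missing parent package, not the
    # full dotted string, because it splits the name on the dot.
    missing = exc.name
else:
    missing = None
```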

Since we’re only writing this down now, CPython might contain bugs and omissions around non-identifier names, especially ones with embedded NULs. Similarly, the documentation currently doesn’t use the terms “name” and “identifier” as defined here. These should be reported and fixed, eventually.

PEP 8 could be clarified to specify that “all names in the Python standard library MUST be ASCII-only non-keyword identifiers” (except in tests for unusual names). Third-party projects are encouraged to adopt this policy as well.

Note that Python implementations can vary in details of what is considered a string – for example, we currently don’t specify if surrogates or “characters” outside the Unicode range are allowed. This means that the exact set of allowed names is, technically, also implementation-specific.

What are y’all’s thoughts?

storchaka · September 20, 2022, 12:58pm

Not only in module names. Dots will also cause trouble in attribute, class and function names, mostly with the pickle, unittest.mock and pydoc modules.

jeff5 · September 20, 2022, 7:53pm

Thanks for thinking this out to its full breadth. The answer is much better than the question.

Thanks also for spelling out what is not a feature (names that are str subclasses, and preservation of identity). I can see potential optimisations that those would make difficult.

facundo · September 22, 2022, 2:19pm

Hello! I like the resolution, but we’re not there yet, right?

I mean, I can do the first but not the second…

>>> ñ = 3
>>> ✓ = 5
  File "<stdin>", line 1
    ✓ = 5
    ^
SyntaxError: invalid character '✓' (U+2713)

Thanks!!

saaketp · September 22, 2022, 4:01pm

That is addressed by this line I believe,

The string based API in this case is globals()["✓"] = 5 or setattr(sys.modules[__name__], "✓", 5)
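
A rough sketch of the difference, assuming nothing beyond the standard library. The helper is_identifier_name and the module object mod are made up for this example:

```python
import keyword
import types


def is_identifier_name(name: str) -> bool:
    """True if the name can be written directly in Python source."""
    return name.isidentifier() and not keyword.iskeyword(name)


# "ñ" is a valid identifier, "✓" is not, and "for" is a keyword.
checks = {name: is_identifier_name(name) for name in ["ñ", "✓", "for"]}

# Binding a non-identifier name as a global through the string-based API.
globals()["✓"] = 5
check_value = globals()["✓"]

# The same through setattr on a module object (a fresh one here, so the
# example does not depend on how the current module is registered).
mod = types.ModuleType("demo")
setattr(mod, "✓", 5)
```

So the name ✓ is allowed, but only the string-based route can reach it; the syntax still accepts identifiers only.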

Glenn · September 23, 2022, 5:48am

That is presently pleasantly ugly, of course.

Here’s a syntactic-sugary sweeter idea to spice things up: l-string
for locals access and g-string (surely python needs something called
that!) for globals access.

l"✓" = 5
foo = g"?"

In combination with f-strings, and proper precedence (f expansion
first), one could have interesting name indirections, and even
pass-by-name semantics for function calls (just pass the string, but
reference it appropriately):

name = "✓"
lf"{name}" = 5
name = "?"
foo = gf"{name}"

This could get ugly too, but could also be powerful:

g"✓?" = 7
namepart1 = "✓"
namepart2 = "?"
print( gf"{namepart1}{namepart2}") # would print 7

What about module names? Needing an extra parameter for the module name
at first seemed complicated, but we have syntax for that already, and
just have to extend the semantics of l-string to be module local if used
in that syntax:

module.l"✓" = 5

or even gf"{modulename}".l"✓"

OK, take this suggestion with a grain of tongue-in-cheek salt, or maybe
PEPper it.

jeff5 · September 23, 2022, 6:15am

I was trying to think what positive statement one could make about the behaviour of a conforming implementation that allows sub-classes of str to be presented as names, but does not preserve their identity. For object attributes I think it might be that equal strings identify the same attribute:

s1==s2 implies getattr(o, s1) == getattr(o, s2), if either exists, and
after setattr(o, s1, x) then getattr(o, s2) == x.

A similar statement could be made about matching a keyword parameter, or an attribute accessible with ., in relation to the string of the identifier vs some other string equal to it.

jeff5 · September 23, 2022, 6:20am

People are talking about the arbitrary name syntax at Backtics to allow any name

eric.snow · September 23, 2022, 4:48pm

I’m guessing you meant a stronger relationship, that the same exact object would return from getattr(), so:

s1==s2 implies getattr(o, s1) is getattr(o, s2), if either exists, and
after setattr(o, s1, x) then getattr(o, s2) is x.

Is that right?
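
A minimal sketch of that property for plain str names. The class Obj and the attribute name are hypothetical, and s2 is built at runtime so that it is equal to s1 without necessarily being the same object:

```python
class Obj:
    """Hypothetical plain class."""


o = Obj()
x = object()

s1 = "some attr"
s2 = "".join(["some", " ", "attr"])  # equal to s1, built separately

setattr(o, s1, x)

# Lookup goes by string equality, not by the identity of the name object.
same_object = getattr(o, s2) is x
```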

jeff5 · September 23, 2022, 5:04pm

Absolutely right, when the identity of the object matters, yes. Good point.

For value types, we probably don’t have that. (I’m trying to think of a killer stdlib example.)

Edit: Oh here’s one:

>>> c = 1+2j
>>> getattr(c, "imag") is getattr(c, "imag")
False

Not as simple as I hoped.

encukou · September 26, 2022, 8:00pm

For the right side of the implication you can say “getattr(o, s1) is equivalent to getattr(o, s2)” ;)

But the left side is more tricky, because subclasses can do all kinds of black magic. You might need restrictions like:

In the s1==s2, type(s2) == str rather than another subclass
__hash__ method must be defined (and, of course, well-behaved: s1 == s2 implies hash(s1) == hash(s2))
str(s1) == s2 and/or str.__str__(s1) == s2, to extract the character data for interning &c.

Perhaps allow “subclasses of str that don’t override __eq__, __hash__, __str__”? Or others?
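
One way such a restriction could be checked, as a sketch only. The helper is_plain_name and both subclasses are hypothetical, and this is not something any implementation is known to do:

```python
def is_plain_name(name) -> bool:
    """True for str, or a str subclass that keeps str's __eq__, __hash__ and __str__."""
    if not isinstance(name, str):
        return False
    cls = type(name)
    return (
        cls.__eq__ is str.__eq__
        and cls.__hash__ is str.__hash__
        and cls.__str__ is str.__str__
    )


class Tagged(str):
    """Subclass that overrides nothing relevant."""


class CaseInsensitive(str):
    """Subclass doing the kind of black magic a namespace cannot trust."""

    def __eq__(self, other):
        return self.lower() == str(other).lower()

    def __hash__(self):
        return hash(self.lower())
```

With these definitions, is_plain_name accepts "x" and Tagged("x") but rejects CaseInsensitive("x") and non-strings such as 3.14.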

Analyzing this can of worms requires insight into current implementations and possible optimizations, and/or lots of time. I currently don’t have that.
So I went with saying that support for subclasses of str is implementation-specific, which IMO isn’t so bad:

the implementations are free to do something reasonable
avoiding the subclasses is not too much of a burden for portable programs
it’s a simple rule that’s easy to reason about – and implement in linters, for example

But we definitely could do better.