UTS #55 Unicode Source Code Handling

steve.dower · August 10, 2023, 7:59pm

In a few weeks, the Unicode Technical Committee is going to publish UTS #55 (currently in draft form), a new technical standard on handling Unicode text in source code. As one of the contributors, I wanted to provide a bit more context here for our team on what’s in it and how it applies to us.

In brief, the majority of the document and all of the requirements do not apply to language implementations, but it endorses and clarifies UAX #31, which does apply, and which we mostly comply with.

For those implementing source code renderers (e.g. GUI or web-based apps), the rest of the document very much applies to you. In particular, handling bidirectional text safely (one of the motivating factors) is the responsibility of the renderer, not the language. Don’t wait for a PEP or anything describing how to render Unicode Python code properly, because this document is going to cover virtually all of it.
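As a rough illustration of the kind of input a renderer or linter has to watch for, here is a minimal sketch that locates explicit bidirectional control characters in a piece of source text. The helper name `find_bidi_controls` and the sample string are hypothetical, and this only detects the characters; it does not implement the display behaviour the draft describes.

```python
import unicodedata

# Bidi classes of the explicit embedding, override and isolate controls.
EXPLICIT_BIDI_CLASSES = {
    "LRE", "RLE", "LRO", "RLO", "PDF",  # embeddings, overrides, pop
    "LRI", "RLI", "FSI", "PDI",         # isolates, pop isolate
}


def find_bidi_controls(source):
    """Return (line, column, name) for each explicit bidi control found.

    Hypothetical helper for illustration; line and column are 1-based.
    """
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if unicodedata.bidirectional(ch) in EXPLICIT_BIDI_CLASSES:
                name = unicodedata.name(ch, "U+%04X" % ord(ch))
                hits.append((lineno, col, name))
    return hits


# Made-up sample: a right-to-left override hidden inside a string literal.
sample = 'access = "user\u202e" # trailing text\nprint(access)\n'
hits = find_bidi_controls(sample)
for lineno, col, name in hits:
    print("line %d, column %d: %s" % (lineno, col, name))
```

A real renderer would go further and decide how to display such text safely, which is what the rest of the document covers.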

Back to the language implementers, the new section for us is Section 3: Computer Language Specifications. It references UAX #31, which is essential reading if you want to go deep into this space, but I think we’re pretty good there thanks to past work.

A few specific notes for Python as you read through the parts of section 3:

3.1 Identifiers - we already use a forward-compatible profile for identifiers. It does not include the mentioned mathematical profile (which is good), and it also does not include the emoji profile, which might be worth discussing, as our current definition includes some emoji but not all of them.
3.1.1 Normalization and Case - we use NFKC normalisation and are case-sensitive, a combination that is not recommended, because the “K” (compatibility) part of the normalisation equates characters whose differences matter more than case does (e.g. superscript 2, subscript 2 and 2 are all equal under NFKC). I don’t think there’s anything we can do about this now, but most people won’t encounter any issues.
3.1.2 Semantics based on Case - we have none to worry about
3.2 Whitespace and Syntax - we do not follow UAX #31 properly here; doing so would mean accepting (and ignoring) a wider set of characters. In particular, there are specific “ignorable format control” characters that users would use to correct for bidirectional issues.
3.3.3 Changing Normalization and Case - the most interesting thing here is that the example shouldn’t be able to apply to us.
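To make the NFKC point in 3.1.1 concrete, here is a small sketch using only the standard library. It shows the compatibility folding of the digit forms, and that an identifier written with a compatibility character is stored under its normalised name while case is left alone. The ligature identifier is just an illustrative choice.

```python
import unicodedata

# Superscript two and subscript two both fold to the ASCII digit under NFKC.
digit_forms = ["2", "\u00b2", "\u2082"]
folded = [unicodedata.normalize("NFKC", ch) for ch in digit_forms]
print(folded)

# NFKC does not touch case, so identifiers stay case-sensitive.
print(unicodedata.normalize("NFKC", "File") == "file")

# The parser normalises identifiers: assigning to a name spelled with the
# "fi" ligature (U+FB01) binds the plain ASCII name instead.
namespace = {}
exec("\ufb01le = 1", namespace)
names = sorted(k for k in namespace if not k.startswith("__"))
print(names)
```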

Conformance with UTS #55 isn’t a big deal (for us), and compatibility always matters most, so I’m not proposing any particular changes right now. But I wanted to make people aware of it.

rhettinger · August 12, 2023, 3:19pm

Would it make sense to deprecate use of emojis in identifiers? Presumably, the earlier we do this, the less difficult it will be.

gpshead · August 12, 2023, 6:11pm

Possibly. But I can’t see a compelling reason to bother going through with such a syntax deprecation. Unless their presence leads to actually bad, non-contrived user experiences, it’s a feature that is unlikely to be used.

rhettinger · August 12, 2023, 7:15pm

Perhaps just skip the deprecation and remove the emoji identifier support right away.

malemburg · August 14, 2023, 2:13pm

Do those emojis pose any harm with respect to the attack vectors discussed in UTS #55? If not, I don’t see why we would need to change Martin’s original definition.

This is true and was discussed briefly at the time. While it’s not ideal that e.g. a2³ = 123 is (supposed to be converted using NFKC and) interpreted as “a23 = 123” by the interpreter, I also don’t think that many people use such source code.

BTW: I tested this with Python 3.10 and this does not allow using “a2³” as a variable name:

>>> import unicodedata
>>> unicodedata.normalize('NFKC', 'a2³')
'a23'
>>> a23 = 123
>>> print(a23)
123
>>> a2³ = 123
  File "<stdin>", line 1
    a2³ = 123
      ^
SyntaxError: invalid character in identifier

Am I doing something wrong, or is this a bug somewhere?

malemburg · August 14, 2023, 2:41pm

It seems that the parser first applies the XID_Start / XID_Continue check and only then runs the NFKC normalization on the identifier string. The numeric superscript/subscript characters are not included in the XID ranges, so things are actually better than I thought.
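The ordering described above can be checked without the interactive prompt. This is a minimal sketch that uses `str.isidentifier()` as a stand-in for the XID check; the exact wording of the error message varies between Python versions, so it is printed rather than compared.

```python
import unicodedata

name = "a2\u00b3"  # "a2" followed by SUPERSCRIPT THREE

# The raw spelling fails the identifier check...
raw_ok = name.isidentifier()

# ...even though its NFKC form is a perfectly ordinary identifier.
normalized = unicodedata.normalize("NFKC", name)
normalized_ok = normalized.isidentifier()
print(raw_ok, normalized, normalized_ok)

# The compiler rejects the raw spelling before normalisation gets a chance.
try:
    compile(name + " = 123", "<example>", "exec")
except SyntaxError as exc:
    rejected = True
    print("SyntaxError:", exc.msg)
else:
    rejected = False
```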

steve.dower · August 14, 2023, 3:42pm

That’s reassuring.

I’m sure I’ve seen it used once or twice in presentations, but that’s a contrived scenario. I also can’t seem to recreate it myself right now, so possibly it was beyond contrived and was actually entirely hacked up! Or possibly it just doesn’t work on Windows (maybe there’s something in Jupyter to allow it?)

XID_Start and XID_Continue shouldn’t directly contain any emoji characters, so either it was magic or a glitch, and I’m at fault for going from memory of experience instead of just trusting what the spec says.

malemburg · August 14, 2023, 6:58pm

There are no emoji code points in the XID lists.

You can check the complete list of covered code points in this file: https://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt Only the code points with property XID_Start and XID_Continue are regarded as valid identifier code points – still a huge number (272k at the moment).
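For a quick local check that does not involve downloading the file, the interpreter's own Unicode tables can be queried through `str.isidentifier()`. This is a sketch under a few assumptions: the counts reflect whatever Unicode version the running Python ships with (see `unicodedata.unidata_version`), Python additionally accepts `_` as a start character, and the helper names are hypothetical. Since every XID_Start code point is also XID_Continue, the two totals overlap rather than add up.

```python
import sys
import unicodedata


def is_identifier_start(cp):
    """True if Python accepts this code point as the first character."""
    return chr(cp).isidentifier()


def is_identifier_continue(cp):
    """True if Python accepts this code point after a valid first character."""
    return ("a" + chr(cp)).isidentifier()


all_code_points = range(sys.maxunicode + 1)
start_count = sum(1 for cp in all_code_points if is_identifier_start(cp))
continue_count = sum(1 for cp in all_code_points if is_identifier_continue(cp))

print("Unicode version:", unicodedata.unidata_version)
print("valid as first character:", start_count)
print("valid as later character:", continue_count)

# A common emoji (GRINNING FACE) is not accepted in either position.
emoji = "\U0001F600"
print(emoji.isidentifier(), ("a" + emoji).isidentifier())
```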
