I agree that it would be helpful if CPython provided more guidance on how to interpret the UCD data provided by the module, although I think I’m a -1 on the idea of granular links into the standard. That seems hard to maintain reliably, since the version of the Unicode Standard being referenced changes with most point releases, and the purpose of the module is mostly to be a wrapper around the UCD properties.
Pointing out that the UCD and the meaning of the things in the UCD are distinct and giving the user a jumping-off point to read more about the meaning are good ideas, though. Perhaps something along the lines of:
This module provides access to the Unicode Character Database (UCD) which defines character properties for all Unicode characters. The meaning of and interaction between these properties is defined by the Unicode Standard and related specifications.
The data contained in this database is compiled from the UCD version 15.1.0.
I can see room for elaborating on some of the functions in the module, though. A warning attached to unicodedata.combining() explaining that “canonical combining class” does not mean “is this a combining character” would probably be an improvement.
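To illustrate the pitfall (my own sketch, not proposed doc text): combining() returns the canonical combining class as a number, and plenty of genuine combining characters have class 0, so a zero return does not mean “not a combining character”:

```python
import unicodedata

# combining() returns the canonical combining class (ccc),
# not a yes/no "is this a combining character" answer.
print(unicodedata.combining("\N{COMBINING ACUTE ACCENT}"))      # 230

# COMBINING ENCLOSING CIRCLE *is* a combining character
# (General_Category Me), yet its ccc is 0.
print(unicodedata.combining("\N{COMBINING ENCLOSING CIRCLE}"))  # 0
print(unicodedata.category("\N{COMBINING ENCLOSING CIRCLE}"))   # 'Me'
```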
It would be easy to overdo it, though; the subject matter is dense. I think the main benefit would be giving readers a jump-start into the heady Unicode docs and pointing out any identified pitfalls.
Speaking of combining()
…
I’m curious what definition you’re referring to here. I can’t find language like this in the Unicode specifications. I believe the definition of combining class 0 is that it’s the base/default class, and characters in it are never considered part of a reorderable pair, no matter what the other character is. Characters with that combining class are locked in place when applying the Canonical Ordering Algorithm, if you want to think of it that way.
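You can watch the Canonical Ordering Algorithm in action through normalize() (my own sketch): marks with nonzero combining classes get sorted by class, while a class-0 character acts as a wall that nothing reorders across:

```python
import unicodedata

# a + COMBINING ACUTE ACCENT (ccc 230) + COMBINING DOT BELOW (ccc 220):
# NFD sorts the marks by combining class, so the dot below (220)
# moves in front of the acute (230).
s = "a\u0301\u0323"
print(unicodedata.normalize("NFD", s) == "a\u0323\u0301")  # True

# A ccc-0 character between marks is locked in place: the dot below
# cannot reorder across the base letter "b".
t = "a\u0301b\u0323"
print(unicodedata.normalize("NFD", t) == t)  # True, nothing moved
```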
It’s consistent, but there’s a bit of a quirk of jargon here as to what “combining” means. The punchline is: yes, you probably want to test which category the code point is in. Whether you include Lm and Sk in your set depends on whether or not you care if the characters combine graphically.
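Concretely, the category test I have in mind looks something like this (my sketch; the helper name is made up):

```python
import unicodedata

def is_combining(ch: str) -> bool:
    # The Standard's combining characters are exactly General_Category
    # Mn, Mc, and Me -- i.e. the categories starting with "M".
    # Add "Lm" and "Sk" to the test if you also want the modifier
    # characters that do not graphically combine.
    return unicodedata.category(ch).startswith("M")

print(is_combining("\N{COMBINING CEDILLA}"))        # Mn -> True
print(is_combining("\N{MODIFIER LETTER SMALL H}"))  # Lm -> False
print(is_combining("\N{CEDILLA}"))                  # Sk -> False
```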
Members of Lm and Sk are not, formally speaking, combining characters, which the Standard defines as those in category M. The “modifier letters” are considered base characters in their own right and explicitly do not combine graphically with other characters. The Unicode Standard gives a little bit of motivation for their use:
Modifier letters are commonly used in technical phonetic transcriptional systems, where they augment the use of combining marks to make phonetic distinctions. Some of them have been adapted into regular language orthographies as well.
To give a concrete example:
>>> import unicodedata
>>> [unicodedata.category(c) for c in "c\N{CEDILLA}"]
['Ll', 'Sk']
>>> print("c\N{CEDILLA}") # cedilla is not a combiner (Sk), two glyphs
c¸
>>> [unicodedata.category(c) for c in "c\N{COMBINING CEDILLA}"]
['Ll', 'Mn']
>>> print("c\N{COMBINING CEDILLA}") # cedilla is a combiner (Mn), one glyph
ç
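As a follow-up (my addition, same theme): the combining cedilla also composes with its base under NFC, while the Sk cedilla is left alone, since its decomposition is compatibility-only rather than canonical:

```python
import unicodedata

# Mn: the combining cedilla canonically composes with "c" under NFC.
print(unicodedata.normalize("NFC", "c\N{COMBINING CEDILLA}")
      == "\N{LATIN SMALL LETTER C WITH CEDILLA}")  # True

# Sk: the plain CEDILLA has no canonical composition with "c",
# so NFC leaves the two characters as-is.
print(unicodedata.normalize("NFC", "c\N{CEDILLA}") == "c\N{CEDILLA}")  # True
```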
Caveat: a decent amount of the above required me to pore over the Unicode Standard to figure out what’s going on with the combining classes and Lm/Sk, so it’s possible I’ve missed some nuance.