I agree that it would be helpful if CPython provided more guidance on how to interpret the UCD data provided by the module, although I think I’m a -1 on the idea of granular links into the standard. That seems hard to maintain reliably, since the version of the Unicode Standard being referenced changes with most point releases, and the purpose of the module is mostly to be a wrapper around the UCD properties.
Pointing out that the UCD and the meaning of the things in the UCD are distinct and giving the user a jumping-off point to read more about the meaning are good ideas, though. Perhaps something along the lines of:
This module provides access to the Unicode Character Database (UCD) which defines character properties for all Unicode characters. The meaning of and interaction between these properties is defined by the Unicode Standard and related specifications.
The data contained in this database is compiled from the UCD version 15.1.0.
I can see room for elaborating on some of the functions in the module, though. A warning attached to unicodedata.combining() explaining that “canonical combining class” does not mean “is this a combining character” would probably be an improvement.
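To illustrate the pitfall (my own sketch, not proposed doc text): combining() returns the canonical combining class as a number, and plenty of genuine combining characters have class 0, so a zero return does not mean “not a combining character”:

```python
import unicodedata

# combining() returns the canonical combining class (ccc),
# not a yes/no "is this a combining character" answer.
print(unicodedata.combining("\N{COMBINING ACUTE ACCENT}"))      # 230

# COMBINING ENCLOSING CIRCLE *is* a combining character
# (General_Category Me), yet its ccc is 0.
print(unicodedata.combining("\N{COMBINING ENCLOSING CIRCLE}"))  # 0
print(unicodedata.category("\N{COMBINING ENCLOSING CIRCLE}"))   # 'Me'
```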
It would be easy to overdo it, though; the subject matter is dense. I think the main benefit would be giving readers a jump-start into the heady Unicode docs and pointing out any identified pitfalls.
Speaking of combining()
…
I’m curious what definition you’re referring to here. I can’t find language like this in the Unicode specifications. I believe the definition of combining class 0 is that it’s the base/default class, and characters in it are never considered part of a reorderable pair, no matter what the other character is. Characters with that combining class are locked in place when applying the Canonical Ordering Algorithm, if you want to think of it that way.
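You can watch the Canonical Ordering Algorithm in action through normalize() (my own sketch): marks with nonzero combining classes get sorted by class, while a class-0 character acts as a wall that nothing reorders across:

```python
import unicodedata

# a + COMBINING ACUTE ACCENT (ccc 230) + COMBINING DOT BELOW (ccc 220):
# NFD sorts the marks by combining class, so the dot below (220)
# moves in front of the acute (230).
s = "a\u0301\u0323"
print(unicodedata.normalize("NFD", s) == "a\u0323\u0301")  # True

# A ccc-0 character between marks is locked in place: the dot below
# cannot reorder across the base letter "b".
t = "a\u0301b\u0323"
print(unicodedata.normalize("NFD", t) == t)  # True, nothing moved
```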
It’s consistent, but there’s a bit of a quirk of jargon here as to what “combining” means. The punchline is: yes, you probably want to test which category the code point is in. Whether you include Lm and Sk in your set depends on whether or not you care if the characters combine graphically.
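Concretely, the category test I have in mind looks something like this (my sketch; the helper name is made up):

```python
import unicodedata

def is_combining(ch: str) -> bool:
    # The Standard's combining characters are exactly General_Category
    # Mn, Mc, and Me -- i.e. the categories starting with "M".
    # Add "Lm" and "Sk" to the test if you also want the modifier
    # characters that do not graphically combine.
    return unicodedata.category(ch).startswith("M")

print(is_combining("\N{COMBINING CEDILLA}"))        # Mn -> True
print(is_combining("\N{MODIFIER LETTER SMALL H}"))  # Lm -> False
print(is_combining("\N{CEDILLA}"))                  # Sk -> False
```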
Members of Lm and Sk are not, formally speaking, combining characters, which the Standard defines as those in category M. The “modifier letters” are considered base characters in their own right and explicitly do not combine graphically with other characters. The Unicode Standard gives a little bit of motivation for their use:
Modifier letters are commonly used in technical phonetic transcriptional systems, where they augment the use of combining marks to make phonetic distinctions. Some of them have been adapted into regular language orthographies as well.
To give a concrete example:
>>> import unicodedata
>>> [unicodedata.category(c) for c in "c\N{CEDILLA}"]
['Ll', 'Sk']
>>> print("c\N{CEDILLA}") # cedilla is not a combiner (Sk), two glyphs
c¸
>>> [unicodedata.category(c) for c in "c\N{COMBINING CEDILLA}"]
['Ll', 'Mn']
>>> print("c\N{COMBINING CEDILLA}") # cedilla is a combiner (Mn), one glyph
ç
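As a follow-up (my addition, same theme): the combining cedilla also composes with its base under NFC, while the Sk cedilla is left alone, since its decomposition is compatibility-only rather than canonical:

```python
import unicodedata

# Mn: the combining cedilla canonically composes with "c" under NFC.
print(unicodedata.normalize("NFC", "c\N{COMBINING CEDILLA}")
      == "\N{LATIN SMALL LETTER C WITH CEDILLA}")  # True

# Sk: the plain CEDILLA has no canonical composition with "c",
# so NFC leaves the two characters as-is.
print(unicodedata.normalize("NFC", "c\N{CEDILLA}") == "c\N{CEDILLA}")  # True
```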
Caveat: a decent amount of the above required me to pore over the Unicode Standard to figure out what’s going on with the combining classes and Lm/Sk, so it’s possible I’ve missed some nuance.