Yes, and I also looked through the Unicode standard (too big to absorb in
gory detail in a short time) and at the code blocks, without
understanding the difference between code points that have no name but
do have a bunch of aliases, and code points that have both a name and
aliases. I could see there was something different about the
layout, but didn’t find the statement that some code points have no
names… there were “names” there in the control block, though officially
they are “aliases”.
I introduced a bug into a (previously, apparently) working program that
put a control code point in a spot where one had never been seen
before. That triggered an error-message path that had never encountered an
unnamed Unicode code point before… and I went searching.
The issue Serhiy pointed to has enough discussion to enlighten me about the
arcana of Unicode that led to the ValueError raised by
unicodedata.name. Still, it would sure be nice if the documentation
for unicodedata.name pointed out that not all defined code points have
names (the text there is easily read, with an ASS U ME, as only
raising an error for code points that are as yet undefined or in the
reserved ranges). It would be even nicer if it also referenced some
function that returns SOME name for every defined code point, so
that error messages could show something more understandable than the code
point alone. I guess such a function doesn’t exist in Python; it could
appropriately be one result of the issue Serhiy mentioned, along with
other functions that might return aliases of various types, etc.
Meanwhile, I guess I’ll try unicodedata.name and, on except ValueError,
substitute the code point escape.
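A minimal sketch of that workaround, for anyone who lands here with the same error. The helper name describe and the U+XXXX format of the fallback are my own choices for illustration, not anything unicodedata provides:

```python
import unicodedata

def describe(ch):
    """Return the Unicode name of ch, or a U+XXXX escape if it has none.

    describe() is a hypothetical helper, not part of unicodedata.
    """
    try:
        return unicodedata.name(ch)
    except ValueError:
        # unicodedata.name raises ValueError when the code point has no
        # Name property (control characters, for example).
        return "U+%04X" % ord(ch)

print(describe("A"))   # a named character
print(describe("\n"))  # a control character, which has aliases but no name
```

Passing a default as the second argument to unicodedata.name avoids the try/except, at the cost of having to build the fallback string before you know whether it is needed.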
It doesn’t state that some “names” (possibly aliases) are not round-trippable – I’m fine with getting a different, aka canonical, name, but a ValueError for something I just used a name for is confusing.
That would be my preference (not that I have time to do it).
As discussed on the ticket that Serhiy quoted, a new function unicodedata.aliases() would need to be added, exposing defined aliases of Unicode code points to resolve this.
So do you think we should fix that (ancient) bug, or update the docs?
I think the docs should be updated to alert readers that unicodedata.name
doesn’t produce a result for every defined code point, because some code
points defined by Unicode are defined without names. (Not sure how to
state this clearly, but the docs presently don’t give such a warning at
all, except to people who fully understand Unicode code point naming
and all its ramifications in gory detail.)
I think it would be useful to address the (ancient, yes I noticed 2014)
issue. Some of the suggested APIs to expose more of the defined names in
standard ways would be great, so that people who do or want to
understand all the variations in the names, aliases, etc. of Unicode can
access them appropriately.
I also think it would be good to have an API that returns what seems to
be the “best” name. My opinion (probably easily swayed with reasonable
arguments by any of the many people that understand Unicode better than
I) would be to return, based on the version of Unicode supported by the
version of Python, the first of the following items:
1. latest corrected name
2. name
3. latest corrected alias (if such even exists)
4. first alias
This would seem to be a practical name that could be searched for, for
more information, and would seem to generally have the best spelling
(except maybe in the 4th case if the 3rd case doesn’t exist).
Adding any of these additional APIs would require more docs changes, of
course, and perhaps a cross-reference from .name to the newer ones.
Nothing is wrong, if you understand Unicode code point naming in depth.
For the rest of us (all but Unicode committee members and maybe 100
others), it is easy to get misled. unicodedata.name seems like a
useful function to return a meaningful name for the character code of
interest… except that there are gotchas in the standard.
Ethan:
It doesn’t state that some “names” (possibly aliases) are not
round-trippable
If any of the aliases are used, they still won’t be round-trippable via
.name, and probably not any other API that returns only one name of the
many that might be assigned to a particular code point.
Pointing that out would also be helpful, in .name docs, and also for the
docs of any new APIs that return only a single name.
Having an API that returns all the names and their categorizations would
be useful, but not practical for error reporting (or pretty much any
other type of reporting, except educational exploration). Such an API,
if cross-referenced from .name, would probably have saved me some hours
of delving into the standard… I could have asked the REPL for
unicodedata.tell_all("\n") and learned that there is no “name”, and that
LINE FEED was one of many aliases.
By round-trippable I meant getting a meaningful name/alias back when using a name/alias to generate the code point. I.e. the current situation feels like:
>>> 1 + 1 # an alias for the number 2
2
>>> 2 # the number 2
ValueError
unicodedata.lookup will accept either the name or one of the aliases;
unicodedata.name will only return the name (if it exists), never an alias;
there is currently no support to retrieve the list of aliases for a code point;
there is no support for code point labels.
This means there are many code points that we cannot round-trip:
assert 0 <= n <= 0x10FFFF
c = chr(n) # Some code point.
name_or_alias = unicodedata.name(c) # This may succeed.
d = unicodedata.lookup(name_or_alias) # But this may fail.
assert c == d
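To make the asymmetry concrete, here is a small check using only the existing unicodedata.lookup and unicodedata.name calls. It starts from an alias, which lookup accepts, and then tries to get any name back for the resulting code point:

```python
import unicodedata

# lookup() accepts an alias as well as a formal name ...
ch = unicodedata.lookup("LINE FEED")
assert ch == "\n"

# ... but name() only ever returns the formal name, and U+000A has none,
# so the string we just used cannot be recovered.
try:
    unicodedata.name(ch)
    round_trips = True
except ValueError:
    round_trips = False

print(round_trips)
```

This is the "1 + 1 works, 2 raises" situation from earlier in the thread, just spelled with real calls.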
Section 4.8 of the Unicode Standard has this to say about character name APIs:
“An API which is defined as strictly returning the value of the Unicode Name property (the “na” attribute), should return a null string for any Unicode code point other than graphic or format characters, as that is the actual value of the property for such code points. On the other hand, an API which returns a name for Unicode code points, but which is expected to provide useful, unique labels for unassigned, reserved code points and other special code point types, should return the value of the Unicode Name property for any code point for which it is non-null, but should otherwise construct a code point label to stand in for a character name.”
I propose two additions to unicodedata:
An API to return the aliases of a code point.
Since we already support the “strict” Unicode Name API, add support for the “useful, unique labels” API through a keyword-only parameter “strict=True” on unicodedata.name, with the slight modification that we prefer aliases to code point labels.
The first item has a simple interface:
def aliases(char, /):
    """Return a list of Unicode name aliases for the char, which may be empty."""
    ...
The second item is a slight modification to the existing unicodedata.name function.
When strict is true, the function behaves exactly as it does now:
return the proper name of the character (the “na” attribute), if it exists;
otherwise return the supplied argument “default”, if given;
otherwise raise ValueError.
When strict is false, also include aliases:
return the proper name of the character (the “na” attribute), if it exists;
otherwise return an arbitrary alias, if there is one;
otherwise return the supplied argument “default”, if given;
otherwise raise ValueError.
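As a rough sketch of how the strict=False behaviour could look from pure Python: the _ALIASES table below is a tiny hand-copied excerpt standing in for the real alias data, which unicodedata does not currently expose, and this name wrapper is hypothetical rather than a proposed implementation. It takes "an arbitrary alias" to mean the first one listed, and raises ValueError when nothing is found, as unicodedata.name does today.

```python
import unicodedata

# Hypothetical, hand-copied excerpt of the Unicode alias data; a real
# implementation would read the full table from the Unicode database.
_ALIASES = {
    0x0000: ["NULL", "NUL"],
    0x0009: ["CHARACTER TABULATION", "HORIZONTAL TABULATION", "HT", "TAB"],
    0x000A: ["LINE FEED", "NEW LINE", "END OF LINE", "LF", "NL", "EOL"],
}

_MISSING = object()  # sentinel so that None can be passed as a default

def name(char, default=_MISSING, *, strict=True):
    """Sketch of the proposed unicodedata.name(char, default, *, strict)."""
    formal = unicodedata.name(char, None)
    if formal is not None:
        return formal
    if not strict:
        # Fall back to an alias; here simply the first one in the table.
        aliases = _ALIASES.get(ord(char))
        if aliases:
            return aliases[0]
    if default is not _MISSING:
        return default
    raise ValueError("no such name")

print(name("A"))                 # formal name, regardless of strict
print(name("\n", strict=False))  # no formal name, so an alias is used
print(name("\n", "?"))           # strict: no name, so the default is returned
```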
Here is a function to return the code point label (or name):
import unicodedata

def label(c):
    r"""Return the name, or the Code Point Label, of character c.

    If c is a code point with a name, the name is used as the label;
    otherwise the Code Point Label is returned.

    >>> label('Δ')
    'GREEK CAPITAL LETTER DELTA'
    >>> label('\x1F')
    '<control-001F>'
    """
    name = unicodedata.name(c, '')
    if name == '':
        number = ord(c)
        category = unicodedata.category(c)
        assert category in ('Cc', 'Cn', 'Co', 'Cs')
        if category == 'Cc':
            kind = 'control'
        elif category == 'Cn':
            # Noncharacters are U+FDD0..U+FDEF, plus the last two code
            # points (xxFFFE and xxFFFF) of each of the 17 planes.
            if 0xFDD0 <= number <= 0xFDEF or (number & 0xFFFE) == 0xFFFE:
                kind = 'noncharacter'
            else:
                kind = 'reserved'
        elif category == 'Co':
            kind = 'private-use'
        else:
            assert category == 'Cs'
            kind = 'surrogate'
        name = "<%s-%04X>" % (kind, number)
    return name