Unicodedata oddity

Glenn · February 22, 2023, 6:26am

>>> import unicodedata
>>> "\N{LINE FEED}"
'\n'
>>> unicodedata.name("\N{LINE FEED}")
ValueError: no such name

This happens for all code points from 0 to 31. Python knows the name for \N but can’t produce it from unicodedata.name.

I can’t tell from the documentation whether this is intentional.

storchaka · February 22, 2023, 7:47am

See “Unicodedata module should provide access to codepoint aliases” (Issue #62434, python/cpython on GitHub).

guido · February 22, 2023, 4:19pm

Have you tried looking through the source code?

github.com

python/cpython/blob/main/Modules/unicodedata.c

/* ------------------------------------------------------------------------

   unicodedata -- Provides access to the Unicode database.

   The current version number is reported in the unidata_version constant.

   Written by Marc-Andre Lemburg (mal@lemburg.com).
   Modified for Python 2.0 by Fredrik Lundh (fredrik@pythonware.com)
   Modified by Martin v. Löwis (martin@v.loewis.de)

   Copyright (c) Corporation for National Research Initiatives.

   ------------------------------------------------------------------------ */

#ifndef Py_BUILD_CORE_BUILTIN
#  define Py_BUILD_CORE_MODULE 1
#endif

#define PY_SSIZE_T_CLEAN


Glenn · February 22, 2023, 8:20pm

Yes, and I also looked through the Unicode standard (too big to absorb in
gory detail in a short time) and at the code blocks, without
comprehending the difference between code points that have no name but
do have a bunch of aliases, and code points that have both a name and
aliases. I could see there was something different about the layout, but
didn’t find any statement that some code points have no names. There
were “names” in the control block, though officially they are “aliases”.

I introduced a bug into a (previously, apparently) working program that
put a control code point in a spot where one had never been seen before.
That triggered an error message path that had never encountered an
unnamed Unicode code point before, and I went searching.

The issue Serhiy pointed to has enough discussion to enlighten me about
the Unicode arcana behind the ValueError raised by unicodedata.name.
Still, it would be nice if the documentation for unicodedata.name
pointed out that not all defined code points have names. As written, it
is easy to ASS U ME that an error is raised only for code points that
are not yet defined or are in the reserved ranges. It would be even
nicer if the docs also referenced some function that returns SOME name
for every defined code point, so that error messages could show
something more understandable than the code point alone. I guess no such
function exists in Python. It could reasonably come out of the issue
Serhiy mentioned, along with other functions that return aliases of
various types, etc.

Meanwhile, I guess I’ll wrap unicodedata.name in try/except and, on
failure, substitute the code point escape.
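
A minimal sketch of that workaround, for illustration only. The helper name `describe` and the choice of `U+XXXX` notation for the fallback are assumptions, not something specified in the post:

```python
import unicodedata

def describe(ch):
    """Return the Unicode name of ch, or a U+XXXX fallback if it has none."""
    try:
        return unicodedata.name(ch)
    except ValueError:
        # No Name property (for example, control characters): fall back
        # to the code point itself, written in U+XXXX notation.
        return "U+%04X" % ord(ch)

print(describe("A"))   # LATIN CAPITAL LETTER A
print(describe("\n"))  # U+000A
```

This keeps error messages readable without depending on any alias API.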

Thanks.

guido · February 22, 2023, 9:10pm

So do you think we should fix that (ancient) bug, or update the docs?

storchaka · February 22, 2023, 9:31pm

What is wrong with the docs?

unicodedata.name(chr[, default])

Returns the name assigned to the character chr as a string. If no name is defined, default is returned, or, if not given, ValueError is raised.
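
As a small illustration of the documented `default` parameter, the sketch below uses `None` as the default to probe which of the code points 0 to 31 have no name. The variable name `unnamed` is only for this example:

```python
import unicodedata

# Passing a default turns name() into a probe that never raises.
unnamed = [cp for cp in range(32) if unicodedata.name(chr(cp), None) is None]
print(len(unnamed))  # 32: every C0 control code point lacks a name

# Without a default, the same lookup raises ValueError.
try:
    unicodedata.name("\n")
except ValueError as exc:
    print("ValueError:", exc)
```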

stoneleaf · February 22, 2023, 10:32pm

It doesn’t state that some “names” (possibly aliases) are not round-trippable – I’m fine with getting a different, aka canonical, name, but a ValueError for something I just used a name for is confusing.

That would be my preference (not that I have time to do it).

malemburg · February 22, 2023, 11:01pm

As discussed on the ticket that Serhiy quoted, a new function unicodedata.aliases() would need to be added, exposing defined aliases of Unicode code points to resolve this.

The unicodedata.name() function correctly raises a ValueError, because '\n' does not have a code point name (see https://www.unicode.org/Public/15.0.0/ucd/UnicodeData.txt), so this is not a bug.

For more context on Unicode aliases, in a slightly easier-to-understand format than the Unicode standard itself, see the Wikipedia article “Unicode alias names and abbreviations”.

Note that ‘\n’ has quite a few aliases:

000A;LINE FEED;control
000A;NEW LINE;control
000A;END OF LINE;control
000A;LF;abbreviation
000A;NL;abbreviation
000A;EOL;abbreviation

(from https://www.unicode.org/Public/15.0.0/ucd/NameAliases.txt)
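
Since the standard library does not expose these records, one option is to parse the file directly. The sketch below assumes the semicolon-separated layout shown above (code point, alias, type) with `#` starting a comment, and works on an inline copy of the six U+000A lines rather than the downloaded file. `parse_name_aliases` is a hypothetical helper name:

```python
from collections import defaultdict

# The six U+000A records quoted above, in "code point;alias;type" layout.
SAMPLE = """\
000A;LINE FEED;control
000A;NEW LINE;control
000A;END OF LINE;control
000A;LF;abbreviation
000A;NL;abbreviation
000A;EOL;abbreviation
"""

def parse_name_aliases(text):
    """Map each code point to its (alias, type) pairs, in file order."""
    table = defaultdict(list)
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code, alias, kind = line.split(";")
        table[int(code, 16)].append((alias, kind))
    return dict(table)

aliases = parse_name_aliases(SAMPLE)
print([a for a, kind in aliases[0x0A] if kind == "control"])
# ['LINE FEED', 'NEW LINE', 'END OF LINE']
```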

Glenn · February 22, 2023, 11:26pm

Guido:

So do you think we should fix that (ancient) bug, or update the docs?

I think the docs should be updated to point out that unicodedata.name
doesn’t produce a result for every defined code point, because some code
points are defined by Unicode without names. (I’m not sure how to state
this clearly, but the docs presently give no such warning at all, except
to people who already understand Unicode code point naming and all its
ramifications in gory detail.)

I think it would also be useful to address the (ancient; yes, I noticed
2014) issue. Some of the suggested APIs that expose more of the defined
names in standard ways would be great, so that people who understand, or
want to understand, all the variations in the names, aliases, etc. of
Unicode can access them appropriately.

I also think it would be good to have an API that returns what seems to
be the “best” name. My opinion (probably easily swayed with reasonable
arguments by any of the many people that understand Unicode better than
I) would be to return, based on the version of Unicode supported by the
version of Python, the first of the following items:

- latest corrected name
- name
- latest corrected alias (if such even exists)
- first alias

This would seem to be a practical name that could be searched for, for
more information, and would seem to generally have the best spelling
(except maybe in the 4th case if the 3rd case doesn’t exist).
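
A rough sketch of that priority order, under several assumptions: `best_name` and `ALIASES` are hypothetical, the alias table is hand-copied from NameAliases.txt for just two code points, “latest” is taken to mean the last correction listed in the file, and the third case is folded into the first because the file marks corrections with a single `correction` type.

```python
import unicodedata

# Hypothetical, hand-copied alias data: code point -> [(alias, type), ...]
ALIASES = {
    0x000A: [("LINE FEED", "control"), ("NEW LINE", "control"),
             ("END OF LINE", "control"), ("LF", "abbreviation"),
             ("NL", "abbreviation"), ("EOL", "abbreviation")],
    0x01A2: [("LATIN CAPITAL LETTER GHA", "correction")],
}

def best_name(ch, default=None):
    """Pick one display name for ch, following the priority list above."""
    entries = ALIASES.get(ord(ch), [])
    corrections = [alias for alias, kind in entries if kind == "correction"]
    if corrections:
        return corrections[-1]          # latest correction wins
    name = unicodedata.name(ch, None)
    if name is not None:
        return name                     # the Name property itself
    if entries:
        return entries[0][0]            # first alias of any type
    return default

print(best_name("\u01a2"))  # LATIN CAPITAL LETTER GHA
print(best_name("\n"))      # LINE FEED
print(best_name("A"))       # LATIN CAPITAL LETTER A
```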

Adding any of these additional APIs would require more docs changes, of
course, and perhaps a cross-reference from .name to the newer ones.

Glenn · February 22, 2023, 11:35pm

Serhiy:

What is wrong with the docs?

Nothing is wrong, if you understand Unicode code point naming in depth.
For the rest of us (all but Unicode committee members and maybe 100
others), it is easy to get misled. unicodedata.name seems like a
useful function to return a meaningful name for the character code of
interest… except that there are gotchas in the standard.

Ethan:

It doesn’t state that some “names” (possibly aliases) are not
round-trippable

If any of the aliases are used, they still won’t be round-trippable via
.name, and probably not any other API that returns only one name of the
many that might be assigned to a particular code point.

Pointing that out would also be helpful, in .name docs, and also for the
docs of any new APIs that return only a single name.

Having an API that returns all the names and their categorizations would
be useful, but not practical for error reporting (or pretty much any
other type of reporting, except educational exploration). Such an API,
if cross-referenced from .name, would probably have saved me some hours
of delving into the standard… I could have asked the REPL for a
(hypothetical) unicodedata.tell_all("\n") and learned that there is no
“name”, and that LINE FEED was one of many aliases.

stoneleaf · February 23, 2023, 12:17am

By round-trippable I meant getting a meaningful name/alias back when using a name/alias to generate the code point. I.e. the current situation feels like:

>>> 1 + 1    # an alias for the number 2
2
>>> 2        # the number 2
ValueError

steven.daprano · February 23, 2023, 11:12am

Every Unicode code point has:

- zero or exactly one name;
- zero or more aliases;
- zero or exactly one code point label (only for code points that have no name).

See section 4.8 of the Unicode Standard’s chapter on Character Properties.

As I understand it:

- unicodedata.lookup will accept either the name or one of the aliases;
- unicodedata.name will only return the name (if it exists), never an alias;
- there is currently no support to retrieve the list of aliases for a code point;
- there is no support for code point labels.
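
The first two points can be checked directly in a recent Python 3 (alias support in lookup dates from 3.3). A short sketch, with variable names chosen only for this example:

```python
import unicodedata

# lookup() resolves both a proper name and an alias.
via_name = unicodedata.lookup("LATIN SMALL LETTER A")
via_alias = unicodedata.lookup("LINE FEED")   # alias of U+000A
print(repr(via_name), repr(via_alias))

# name() reports only the Name property, so U+000A yields the default.
print(unicodedata.name(via_name, None))
print(unicodedata.name(via_alias, None))
```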

This means there are many code points that we cannot round-trip:


import unicodedata

n = 0x0394  # Example value; any code point in range works.
assert 0 <= n <= 0x10FFFF
c = chr(n)  # Some code point.
name_or_alias = unicodedata.name(c)  # This may succeed.
d = unicodedata.lookup(name_or_alias)  # But this may fail.
assert c == d

Section 4.8 has this to say about character name APIs:

“An API which is defined as strictly returning the value of the Unicode Name property (the “na” attribute), should return a null string for any Unicode code point other than graphic or format characters, as that is the actual value of the property for such code points. On the other hand, an API which returns a name for Unicode code points, but which is expected to provide useful, unique labels for unassigned, reserved code points and other special code point types, should return the value of the Unicode Name property for any code point for which it is non-null, but should otherwise construct a code point label to stand in for a character name.”

I propose two additions to unicodedata:

- An API to return the aliases of a code point.
- We already support the “strict” Unicode Name API; add support for the “useful, unique labels” API by adding a keyword-only parameter “strict=True” to unicodedata.name, with the slight modification that we prefer aliases to code point labels.

The first item has a simple interface:


def aliases(char, /):
    """Return a list of Unicode name aliases for the char, which may be empty."""
    ...

The second item is a slight modification to the existing unicodedata.name function.

When strict is true, the function behaves exactly as it does now:
- return the proper name of the character (the “na” attribute), if it exists
- otherwise return the supplied argument “default”, if given;
- otherwise raise ValueError.
When strict is false, also include aliases:
- return the proper name of the character (the “na” attribute), if it exists
- otherwise return an arbitrary alias, if there is one;
- otherwise return the supplied argument “default”, if given;
- otherwise raise ValueError.
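
The proposed behaviour could be prototyped in pure Python roughly as follows. This is only a sketch: `name_with_aliases` and the one-entry `ALIASES` table are hypothetical stand-ins for the real data, “an arbitrary alias” is taken to be the first one listed, and the function raises ValueError to match what unicodedata.name raises today.

```python
import unicodedata

_MISSING = object()  # sentinel: distinguishes "no default" from default=None

# Hypothetical alias table; a real implementation would cover every code point.
ALIASES = {0x000A: ["LINE FEED", "NEW LINE", "END OF LINE", "LF", "NL", "EOL"]}

def name_with_aliases(char, default=_MISSING, *, strict=True):
    """Return the name of char; with strict=False, fall back to an alias."""
    proper = unicodedata.name(char, None)
    if proper is not None:
        return proper
    if not strict:
        aliases = ALIASES.get(ord(char))
        if aliases:
            return aliases[0]
    if default is not _MISSING:
        return default
    raise ValueError("no such name")

print(name_with_aliases("\n", strict=False))  # LINE FEED
print(name_with_aliases("\n", "<none>"))      # <none>
```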

Thoughts?

storchaka · February 23, 2023, 11:28am

No, it is not correct.

unicodedata.name() always returns a name, not an alias.
unicodedata.lookup() always succeeds with a name returned by unicodedata.name().

name = unicodedata.name(c)  # This may succeed.
d = unicodedata.lookup(name)  # This always succeeds.
assert d == c
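
For anyone who wants to confirm this empirically, here is a small sketch that checks the name-to-lookup round trip over the Basic Multilingual Plane. The helper name `round_trip_failures` is made up for this example, and the range is limited to the BMP only to keep the run short:

```python
import unicodedata

def round_trip_failures(code_points):
    """Return the code points whose name does not look up to themselves."""
    failures = []
    for cp in code_points:
        ch = chr(cp)
        name = unicodedata.name(ch, None)
        if name is None:
            continue  # unnamed code point: nothing to round-trip
        if unicodedata.lookup(name) != ch:
            failures.append(cp)
    return failures

failures = round_trip_failures(range(0x10000))  # BMP only
print(len(failures))
```

Unnamed code points are skipped rather than counted as failures, since the guarantee only concerns names that unicodedata.name actually returns.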

steven.daprano · February 23, 2023, 11:39am

Here is a function to return the code point label (or name):

import unicodedata

def label(c):
    r"""Return the name, or the Code Point Label of character c.

    If c is a code point with a name, the name is used as the label;
    otherwise the Code Point Label is returned.

    >>> label('Δ')
    'GREEK CAPITAL LETTER DELTA'
    >>> label('\x1F')
    '<control-001F>'

    """
    name = unicodedata.name(c, '')
    if name == '':
        number = ord(c)
        category = unicodedata.category(c)
        assert category in ('Cc', 'Cn', 'Co', 'Cs')
        if category == 'Cc':
            kind = 'control'
        elif category == 'Cn':
            # Noncharacters: U+FDD0..U+FDEF plus the last two code
            # points (xxFFFE, xxFFFF) of each of the 17 planes.
            if (0xFDD0 <= number <= 0xFDEF
                    or (number & 0xFFFE) == 0xFFFE):
                kind = 'noncharacter'
            else:
                kind = 'reserved'
        elif category == 'Co':
            kind = 'private-use'
        else:
            assert category == 'Cs'
            kind = 'surrogate'
        name = "<%s-%04X>" % (kind, number)
    return name

steven.daprano · February 23, 2023, 11:47am

@storchaka you are right, I confused myself.