Peculiar behaviour involving unicode: `len(s[0]) != len(s[0].lower())` Is this expected?

eldipa · March 3, 2023, 11:43pm

I have the following code:

s = 'İstan'

first = s[0]

assert len(first) == 1

weird = first.lower()

print("lower() makes 1 char into 2 chars:")
print("before lower() -> c:", first, "n:", ord(first))
print("after lower() -> c:", weird, "len:", len(weird))

print("\nSame but quoting the chars")
print("before lower() -> c:", f"'{first}'", "n:", ord(first))
print("after lower() -> c:", f"'{weird}'", "len:", len(weird))

On my machine it prints:

lower() makes 1 char into 2 chars:
before lower() -> c: İ n: 304
after lower() -> c: i̇ len: 2

Same but quoting the chars
before lower() -> c: 'İ' n: 304
after lower() -> c: 'i̇' len: 2

The first (uppercase) letter is the traditional I but with a “dot” above. Python says that it is 1 character (len(s[0]) == 1).
When I turn it into lowercase now it becomes a string of 2 chars (len(s[0].lower()) == 2).

In the prints above the I with-the-dot becomes a lower case ‘i’ followed-by-a-dot.

I don’t know if this is expected or not but certainly it was a surprise.

Rosuav · March 3, 2023, 11:46pm

There are quite a few places where upper/lowercasing a single letter produces more than one.

>>> "ß".upper()
'SS'

Not sure why this gets a combining dot, but that’ll be down to the Unicode standard, which Python simply imports and follows.
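A few more expansions of this kind can be checked directly. This is only a small sketch using `str.upper()` and `str.lower()`; the `expansions` dictionary is an illustrative name, not anything from the standard library.

```python
# Single code points whose case mapping is longer than one code point.
expansions = {
    "sharp s upper": "\u00df".upper(),      # LATIN SMALL LETTER SHARP S
    "fi ligature upper": "\ufb01".upper(),  # LATIN SMALL LIGATURE FI
    "dotted I lower": "\u0130".lower(),     # LATIN CAPITAL LETTER I WITH DOT ABOVE
}

# Show the code points that make up each result.
for label, result in expansions.items():
    print(label, [f"U+{ord(c):04X}" for c in result])
```

Each input is a single code point and each result has two, so `len()` changes in all three cases.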

MRAB · March 4, 2023, 12:28am

Most languages that use the Latin alphabet have 'I' vs 'i'.

However, Turkish and others have 'I' vs 'ı' (dotless forms) and 'İ' vs 'i' (dotted forms).

Lowercasing 'İ' would normally give you 'i', but uppercasing that again would then give you 'I'. Adding the extra codepoint lets it remember the original dot so that you round-trip it (excepting that it’s 2 codepoints instead of the original 1).

I think that Unicode could do with adding a combining form that removes the dot because currently you can’t round-trip 'ı'; it uppercases to 'I' which then lowercases to 'i'.
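The two round trips described above can be sketched as follows. The variable names are only for illustration; the behaviour shown is that of Python's default, locale-independent `str.lower()` and `str.upper()`.

```python
dotted_upper = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_lower = "\u0131"  # LATIN SMALL LETTER DOTLESS I

# Dotted capital I: lowercases to i + combining dot, uppercases to I + combining dot.
round_trip_dotted = dotted_upper.lower().upper()

# Dotless small i: uppercases to plain I, which lowercases to plain (dotted) i.
round_trip_dotless = dotless_lower.upper().lower()

print([f"U+{ord(c):04X}" for c in round_trip_dotted])
print([f"U+{ord(c):04X}" for c in round_trip_dotless])
```

The dotted form keeps its dot (as two code points), while the dotless form comes back as an ordinary `i`.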

eldipa · March 4, 2023, 12:41am

mmm, adding the following:

weird = weird.upper()
print("after upper() -> c:", weird, "len:", len(weird))
print("after upper() -> c:", f"'{weird}'", "len:", len(weird))

Yields:

after upper() -> c: İ len: 2
after upper() -> c: 'İ' len: 2

So indeed the length-2 lowercase i followed-by-a-dot gets back to its original I with-a-dot form: the “extra” dot appended to the lowercase letter is used to reconstruct the original uppercase letter.

However, for some reason the uppercase version now has a length of 2, not of 1 as it was originally.
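One possible way to get back to a length of 1 is to recompose the result with `unicodedata.normalize`. This is a sketch under the assumption that NFC normalisation is acceptable for the data at hand; it is not necessarily the workaround used in the original program.

```python
import unicodedata

original = "\u0130"                        # single code point, dotted capital I
round_tripped = original.lower().upper()   # I followed by U+0307 (two code points)

# NFC composes I + COMBINING DOT ABOVE back into the single code point U+0130.
recomposed = unicodedata.normalize("NFC", round_tripped)

print(len(original), len(round_tripped), len(recomposed))
```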

Anyway, thanks for the comments. I found this issue last week and I’ve already applied a crude workaround, but I wanted to know if this was a bug or not.

steven.daprano · March 4, 2023, 1:01am

This issue has nothing to do with Python itself; it is a Unicode thing that any programming language using basic Unicode strings will experience.

The character İ (capital I with dot) has two different forms in Unicode:

the single code point U+0130
a pair of two code points, an ordinary I followed by the combining character U+0307

The reason for this is technical and related to the history of pre-Unicode character sets such as Latin-1 and MacRoman.

Whichever version you use will be displayed the same, so it is impossible to tell them apart visually. But you can look at the lengths of the strings:


>>> a = "I\u0307stan"
>>> b = "\u0130stan"
>>> print(a, b)
İstan İstan
>>> print(len(a), len(b))
6 5
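Since the two strings look identical on screen, it can help to list their code points by name. The `describe` helper below is a hypothetical convenience function written for this example; it only relies on `unicodedata.name`.

```python
import unicodedata

def describe(s):
    """Return one 'U+XXXX NAME' entry per code point in s."""
    return [f"U+{ord(c):04X} {unicodedata.name(c, '<unnamed>')}" for c in s]

a = "I\u0307stan"
b = "\u0130stan"

# Only the leading code points differ between the two spellings.
print(describe(a)[:2])
print(describe(b)[:1])
```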

We can convert from one to the other using normalisation forms:


>>> import unicodedata
>>> a == b
False
>>> a == unicodedata.normalize('NFD', b)
True
>>> unicodedata.normalize('NFC', a) == b
True

Now we come to the tricky part. When you extract the first character from the two strings a or b, you will either get a regular I on its own (without the combining dot!), or the dotted I. Lowercasing the regular I will, of course, give a regular i, but lowercasing the dotted I returns the two code point combination: a regular i followed by the combining character \u0307.
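A short sketch of that difference, using the same two spellings of the string. Note that it is the indexing step that separates the I from its combining dot; lowercasing the whole strings gives identical results.

```python
a = "I\u0307stan"   # I followed by COMBINING DOT ABOVE
b = "\u0130stan"    # precomposed dotted capital I

# Indexing works on code points, so a[0] loses its combining dot.
lower_a = a[0].lower()
lower_b = b[0].lower()
print(len(lower_a), len(lower_b))

# Lowercasing the complete strings yields the same code points for both.
whole_equal = a.lower() == b.lower()
print(whole_equal)
```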

I’m not entirely sure why Unicode does this, or why it doesn’t just lowercase U+0130 to U+0069. The Unicode Consortium does not do a good job of explaining the reasons for their decisions.

The lessons here are:

strings that look the same may not be the same;
lowercasing a single character does not always give you a single character back;
Unicode has to deal with the rules from hundreds of languages and dozens of legacy character sets;
unfortunately the so-called “Turkish I” problem makes it impossible to treat the dotted and undotted I completely consistently.

Language is hard, and consequently Unicode is tricky.
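For comparing strings without regard to case or to composed versus decomposed spelling, one option is to combine normalisation with `str.casefold()`. The sketch below follows the approach shown in Python's Unicode HOWTO; it is locale-independent, so it does not apply Turkish rules and a plain `I` still will not match a dotted `İ`.

```python
import unicodedata

def nfd(s):
    """Decompose s into canonical decomposed form."""
    return unicodedata.normalize("NFD", s)

def compare_caseless(s1, s2):
    """Caseless comparison that ignores composed/decomposed differences."""
    return nfd(nfd(s1).casefold()) == nfd(nfd(s2).casefold())

print(compare_caseless("\u0130stan", "I\u0307stan"))   # precomposed vs decomposed
print(compare_caseless("\u0130stan", "i\u0307STAN"))   # differing case
print(compare_caseless("\u0130stan", "Istan"))         # dotted vs plain I
```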

eryksun · March 4, 2023, 1:33am

Using locales is another way to address this problem. This is how WinAPI LCMapStringEx() addresses it. For example:

from _winapi import *

def lower(locale, s):
    return LCMapStringEx(locale, LCMAP_LOWERCASE | LCMAP_LINGUISTIC_CASING, s)

def upper(locale, s):
    return LCMapStringEx(locale, LCMAP_UPPERCASE | LCMAP_LINGUISTIC_CASING, s)

lower_i_dotless = '\N{LATIN SMALL LETTER DOTLESS I}'
upper_i_dotless = '\N{LATIN CAPITAL LETTER I}'
lower_i_dotted = '\N{LATIN SMALL LETTER I}'
upper_i_dotted = '\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}'

Turkish locale:

>>> upper('tr-TR', lower_i_dotless) == upper_i_dotless
True
>>> lower('tr-TR', upper_i_dotless) == lower_i_dotless
True

>>> upper('tr-TR', lower_i_dotted) == upper_i_dotted
True
>>> lower('tr-TR', upper_i_dotted) == lower_i_dotted
True

In an English locale, the basic Latin cases differ from the Turkish locale:

>>> lower('en-UK', upper_i_dotless) == lower_i_dotted
True
>>> upper('en-UK', lower_i_dotted) == upper_i_dotless
True
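On platforms without `LCMapStringEx()`, the Turkish rules for the four I letters can be approximated in pure Python. This is only a rough sketch: `tr_lower` and `tr_upper` are hypothetical helper names, they handle nothing beyond the dotted and dotless I pairs, and they are not a description of how the Windows API works internally.

```python
import unicodedata

def tr_lower(s):
    """Lowercase s with Turkish-style handling of dotted and dotless I."""
    s = unicodedata.normalize("NFC", s)  # fold I + U+0307 into U+0130 first
    return s.replace("\u0130", "i").replace("I", "\u0131").lower()

def tr_upper(s):
    """Uppercase s with Turkish-style handling of dotted and dotless i."""
    s = unicodedata.normalize("NFC", s)
    return s.replace("i", "\u0130").replace("\u0131", "I").upper()

print(tr_lower("\u0130STANBUL"), tr_upper("istanbul"))
print(tr_lower("I") == "\u0131", tr_upper("\u0131") == "I")
```

With these helpers both pairs round-trip as single code points, at the cost of giving results that are wrong for non-Turkish text.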