How to calculate the length of a non-ASCII string for a curses app?

FelixFourcolor · June 19, 2024, 12:56am

I’m learning to write a TUI app with curses, something along the lines of a text editor (but much simpler). To handle horizontal scroll and word wrap I need to measure how wide a string is when displayed, so that I can compare it to the width of the terminal and do the appropriate calculations.

The problem is len is not accurate for non-ASCII characters. In particular, I need emojis to work. For example, len("🙂") is 1, but it actually takes up 2 spaces in the terminal.

My current solution uses the emoji library to detect whether a character is an emoji and, if so, counts its width as 2. It sort of works, but it feels hacky. Surely there must be a more natural solution? Also I’m not sure that all emojis take up 2 spaces, it just seems to me that most do.

Am I approaching the word wrap problem the right way (or is this an XY problem)? If this is the right approach, what’s the best way to solve this? Thank you.

blhsing · June 19, 2024, 1:39am

You can use unicodedata.east_asian_width to determine the width of a Unicode character. If it returns 'F' (fullwidth) or 'W' (wide), the character should have a display width of 2.

In particular, for emoji characters like the one in your question the function should return 'W', according to the Unicode® Standard Annex #11:

ED4. East Asian Wide (W): All other characters that are always wide. These characters occur only in the context of East Asian typography where they are wide characters (such as the Unified Han Ideographs or Squared Katakana Symbols). This category includes characters that have explicit halfwidth counterparts, along with characters that have the [UTS51] property Emoji_Presentation, with the exception of characters that have the [UCD] property Regional_Indicator.
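As a rough illustration of that suggestion, here is a minimal stdlib-only sketch. The helper name display_width is hypothetical, and skipping combining marks is an extra assumption added here, not something east_asian_width does on its own. It does not handle zero-width joiners, conjoining Hangul jamo, or control characters.

```python
import unicodedata


def display_width(text):
    """Approximate the number of terminal columns `text` occupies.

    Hypothetical helper: counts 2 for East Asian Fullwidth/Wide
    characters, 0 for combining marks, and 1 for everything else.
    """
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Combining marks are drawn on top of the previous character.
            continue
        if unicodedata.east_asian_width(ch) in ("F", "W"):
            width += 2
        else:
            width += 1
    return width


print(display_width("abc"))
print(display_width("\U0001F642"))  # the slightly smiling face from the question
```

The result could then be compared against the terminal width reported by curses to decide when to scroll or wrap.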

kknechtel · June 19, 2024, 2:06am

Keep in mind that you’ll also have the opposite problem: for example,

>>> len('é')
2

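That’s the decomposed representation, two distinct code points (the letter e followed by the combining acute accent U+0301) for a single grapheme. The precomposed form (U+00E9) looks identical but has length 1: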

>>> len('é')
1

There are a ton of corner cases in this. Text is hard. You really need a third-party library for this.
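To make the two representations visible, here is a small sketch that writes them with explicit escapes and converts between them using the standard library's unicodedata.normalize. Normalization only helps where a precomposed code point exists, so it is a partial remedy rather than a replacement for a width-aware library.

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

print(len(decomposed), len(composed))
print(composed == "\u00e9")

# NFD goes the other way and splits the precomposed form again.
print(unicodedata.normalize("NFD", composed) == decomposed)
```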

FelixFourcolor · June 19, 2024, 4:17am

Which library do you recommend for this?

blhsing · June 19, 2024, 7:40am

I’d recommend wcwidth.wcswidth:

from wcwidth import wcswidth

for c in "🙂", "é", "각":
    print(len(c), wcswidth(c))

This outputs:

1 2
2 1
3 2
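For the wrapping part of the original question, a per-character width function can drive a simple column-based splitter. The sketch below is only an illustration: wrap_columns and char_width are hypothetical names, the default width function is a stdlib approximation, and it breaks at column boundaries rather than at word boundaries.

```python
import unicodedata


def char_width(ch):
    """Stdlib-only approximation of one character's column width."""
    if unicodedata.combining(ch):
        return 0  # drawn on top of the previous character
    return 2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1


def wrap_columns(text, columns, width_of=char_width):
    """Split `text` into chunks that each fit within `columns` cells."""
    lines, current, used = [], "", 0
    for ch in text:
        w = width_of(ch)
        # Start a new line if this character would overflow the current one.
        if used + w > columns and current:
            lines.append(current)
            current, used = "", 0
        current += ch
        used += w
    if current:
        lines.append(current)
    return lines


print(wrap_columns("ab\U0001F642cd", 3))
```

With wcwidth installed, its per-character wcwidth function could be passed as width_of instead. It returns -1 for non-printable characters, which this sketch does not account for.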