Add space-format characters to str.strip

nuno · March 19, 2021, 10:00pm

str.strip() (i.e. str.strip(None)) removes leading and trailing whitespace, roughly the characters whose Unicode property White_Space is yes, but leaves invisible space-format characters such as joiners, non-joiners, and separators.
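A quick way to see which characters fall on which side of that line is to ask Python directly. This is only a sketch: the code points are the ones used in the example further down, and `describe` is a helper name made up for illustration.

```python
import unicodedata

# Code points taken from the example in this post.
CANDIDATES = [0x34f, 0x180e, 0x200b, 0x200c, 0x200d, 0x2060, 0xfeff]


def describe(cp):
    """Return (label, general category, str.isspace() result) for a code point."""
    ch = chr(cp)
    return (f'U+{cp:04X}', unicodedata.category(ch), ch.isspace())


# SPACE and NO-BREAK SPACE are appended for comparison: str.strip() removes them.
for cp in CANDIDATES + [0x20, 0xa0]:
    print(*describe(cp))
```

The last column is what str.strip() effectively consults when called without arguments: a character is removed only if str.isspace() is true for it.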

IMO, leaving these characters in place is unexpected behavior that can lead to hard-to-detect “errors”, such as strings that visibly begin with an A sorting after Z, or stripping that stops partway through what looks like a run of whitespace.

Also, the purpose of these characters is to control the joining of adjacent characters, so they have no function at the ends of a string, which IMO makes them even better candidates for removal in the default stripping.

As an example:

from enum import IntEnum


class NonWhiteSpace(IntEnum):
    '''Joiners, non-joiners, and separators.
    '''
    CGJ  = 0x34f   # Combining Grapheme Joiner
    MVS  = 0x180e  # Mongolian vowel separator
    ZWSP = 0x200b  # Zero-width space
    ZWJN = 0x200c  # Zero-width non-joiner
    ZWJ  = 0x200d  # Zero-width joiner
    WJ   = 0x2060  # Word joiner
    BOM  = 0xfeff  # Byte Order Mark, formerly ZWNBSP
                   #   (zero-width non-breaking space)


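for c in NonWhiteSpace:
    s = chr(c) + 'A'                                 # invisible character, then 'A'
    ss = f"    {chr(c)}    A     {chr(c)}    "       # spaces around the character
    sorted_list = ''.join(sorted(['A', s, 'Z']))     # sort whole strings
    sorted_string = ''.join(sorted('A' + s + 'Z'))   # sort individual characters

    print(f'''\
    {c.name} - {hex(c)}
      (A == {s.strip()}) is {s.strip() == 'A'}
      no strip:  "{ss}"
      strip:     "{ss.strip()}"
      sorted list:    {sorted_list}
      sorted string:  {sorted_string}
    ''')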

shows:

    CGJ - 0x34f
      (A == ͏A) is False
      no strip:  "    ͏    A     ͏    "
      strip:     "͏    A     ͏"
      sorted list:    AZ͏A
      sorted string:  AAZ͏

… and so on.

I’ve also included the BOM, which is not always invisible (it is usually displayed as the replacement character, a question mark inside a black diamond) but served as a zero-width non-breaking space before Unicode 3.2.
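For what it's worth, the behavior I'm asking for can be approximated in user code today. A minimal sketch, assuming a hypothetical helper `strip_invisible` and a hand-picked constant `SPACE_FORMATTERS` (the same code points as in the enum above); neither exists in the standard library:

```python
# Hypothetical names, for illustration only.
# Same code points as the NonWhiteSpace enum in this post.
SPACE_FORMATTERS = '\u034f\u180e\u200b\u200c\u200d\u2060\ufeff'


def strip_invisible(s):
    """Strip whitespace and the space-format characters from both ends of s."""
    # str.strip(chars) removes only the listed characters, and str.strip()
    # removes only whitespace, so alternate the two until nothing changes.
    while True:
        stripped = s.strip().strip(SPACE_FORMATTERS)
        if stripped == s:
            return s
        s = stripped


print(repr(strip_invisible('    \u034f    A     \u034f    ')))
print(sorted([strip_invisible('\u034fA'), 'Z']))
```

Characters in the middle of the string are left alone, so a joiner between two letters keeps doing its job; only the ends are cleaned up.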