str.strip() (that is, str.strip(None)) removes the leading and trailing characters whose Unicode property White_Space is yes, but leaves invisible formatting characters such as joiners, non-joiners, and separators in place.
IMO, this is unexpected behavior that can lead to hard-to-detect “errors”, such as strings that visibly begin with an A sorting after Z, or whitespace stripping that stops in the middle of what looks like a run of whitespace.
Also, the purpose of these characters is to control how adjacent characters join, so they have no function at the ends of a string, which IMO makes them even better candidates for removal by the default stripping.
As an example:
    from enum import IntEnum


    class NonWhiteSpace(IntEnum):
        '''Joiners, non-joiners, and separators.
        '''
        CGJ = 0x34f    # Combining Grapheme Joiner
        MVS = 0x180e   # Mongolian vowel separator
        ZWSP = 0x200b  # Zero-width space
        ZWNJ = 0x200c  # Zero-width non-joiner
        ZWJ = 0x200d   # Zero-width joiner
        WJ = 0x2060    # Word joiner
        BOM = 0xfeff   # Byte Order Mark, formerly ZWNBSP
                       # (zero-width non-breaking space)


    for c in NonWhiteSpace:
        s = chr(c) + 'A'               # looks like "A", but starts with the formatter
        ss = f" {chr(c)} A {chr(c)} "  # formatter sits between real spaces and the "A"
        sorted_list = ''.join(sorted(['A', s, 'Z']))
        sorted_string = ''.join(sorted('A' + s + 'Z'))
        print(f'''\
    {c.name} - {hex(c)}
    (A == {s.strip()}) is {s.strip() == 'A'}
    no strip: "{ss}"
    strip: "{ss.strip()}"
    sorted list: {sorted_list}
    sorted string: {sorted_string}
    ''')
shows:
CGJ - 0x34f
(A == ͏A) is False
no strip: " ͏ A ͏ "
strip: "͏ A ͏"
sorted list: AZ͏A
sorted string: AAZ͏
… and so on.
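For anyone wondering why these characters survive the default strip, here is a small check of how the Unicode database classifies them. It relies on the documented rule for str.isspace() (general category Zs, or bidirectional class WS, B, or S). As far as I can tell the default strip uses the same notion of whitespace, but treat that as my assumption. The names FORMATTERS and classify are just mine for this sketch.

```python
import unicodedata

# Code points copied from the example above.
FORMATTERS = {
    "CGJ": 0x034F,
    "MVS": 0x180E,
    "ZWSP": 0x200B,
    "ZWNJ": 0x200C,
    "ZWJ": 0x200D,
    "WJ": 0x2060,
    "BOM": 0xFEFF,
}


def classify(cp):
    """Return (general category, bidirectional class, isspace) for a code point."""
    ch = chr(cp)
    return unicodedata.category(ch), unicodedata.bidirectional(ch), ch.isspace()


for name, cp in FORMATTERS.items():
    category, bidi, is_space = classify(cp)
    print(f"{name:5} U+{cp:04X} category={category} bidi={bidi} isspace={is_space}")
```

On a recent Python none of these should report isspace=True, which matches the output above. The exact categories depend on the Unicode version bundled with the interpreter.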
I’ve also included the BOM, which is not always invisible (it is usually displayed as the replacement character, a question mark inside a black diamond) but served as a zero-width non-breaking space before Unicode 3.2.
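Until something like this is the default, a workaround is to strip these characters explicitly. The following is only a sketch: strip_invisible and INVISIBLE_FORMATTERS are hypothetical names of mine, not anything in the standard library, and the character set is just the seven code points from my example rather than a complete list.

```python
# Hypothetical helper, not part of the standard library.
# The set below is only the seven code points from the example; extend as needed.
INVISIBLE_FORMATTERS = "\u034f\u180e\u200b\u200c\u200d\u2060\ufeff"


def strip_invisible(s, extra=INVISIBLE_FORMATTERS):
    """Strip whitespace and the characters in `extra` from both ends of `s`.

    The two strips are repeated until nothing changes, so mixed runs such as
    " <ZWSP> <ZWJ> A" are removed completely.
    """
    while True:
        stripped = s.strip().strip(extra)
        if stripped == s:
            return stripped
        s = stripped


if __name__ == "__main__":
    samples = ["A", "\u034fA", "Z"]
    print(sorted(samples))                               # formatter-prefixed "A" sorts last
    print(sorted(strip_invisible(x) for x in samples))   # plain "A", "A", "Z"
```

Characters in the middle of a string are left alone, so joiners keep doing their job between adjacent characters.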