\d matches “Decimal Number”, but does not match “Letter Number”, e.g. ᛮ (Runic Arlaug Symbol), or “Other Number”, e.g. ৵ (Bengali Currency Numerator Two) or ³.
This means that “[^\W\d_]+” also matches these other numbers and not just letters. If we add a new \something (I had thought \¹, but that might not work), we could write a regular expression that would only match Unicode letters.
I doubt I am the first person to notice this, but I feel there should be a simple re pattern that matches only Unicode letters.
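To make the point concrete, here is a small sketch using only the standard library. It prints the Unicode general category of a few sample characters and whether \d and [^\W\d_] match each one. The helper name describe is made up for illustration.

```python
import re
import unicodedata

# Sample characters: two letters, a decimal digit, a "Letter Number" (Nl),
# two "Other Number" (No) characters, and the underscore.
samples = ["a", "é", "5", "ᛮ", "৵", "³", "_"]

def describe(ch):
    """Return (general category, matched by \\d, matched by [^\\W\\d_])."""
    return (
        unicodedata.category(ch),
        re.fullmatch(r"\d", ch) is not None,
        re.fullmatch(r"[^\W\d_]", ch) is not None,
    )

for ch in samples:
    category, is_decimal, passes = describe(ch)
    print(f"{ch!r}: category={category} \\d={is_decimal} [^\\W\\d_]={passes}")
```

The Nl and No characters are rejected by \d but accepted by [^\W\d_], which is why that class is not a letters-only pattern.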
Isn’t that \w?
No, \w matches letters, digits and '_'.
It’s become commonplace (but absolutely incorrect) to use the term
“unicode” to mean codepoints outside the ASCII set. While it’s
purely a guess, I suspect what they’re really looking for is a way
to exclusively match non-ASCII characters; they merely lack the
terminology to explain it accurately.
In my case, I mean both ASCII and other Unicode letters, what non-programmers think of as the components of words in various languages.
Got it, so you want a shortcut to match everything \w matches except
what \d matches and also not the underscore (_)? Are diacritics
letters, by your thinking? Or would you require strings to be
normalized first (and using which normalization)?
The regex module supports a wide range of Unicode properties. With it you can match letters with \p{L} or \p{Letter} and “letter numbers” with \p{Nl} or \p{Letter Number}.
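A minimal sketch of that approach, assuming the third-party regex package is installed (pip install regex). The helper name is_letters and the unicodedata fallback for when regex is missing are illustrative additions, not something proposed in this thread.

```python
import unicodedata

try:
    import regex  # third-party package, not part of the standard library
except ImportError:
    regex = None

def is_letters(text):
    """Return True if text is non-empty and every character is a Unicode letter."""
    if regex is not None:
        # \p{L} covers the general categories Lu, Ll, Lt, Lm and Lo.
        return regex.fullmatch(r"\p{L}+", text) is not None
    # Fallback (assumption): treat any general category starting with "L" as a letter.
    return bool(text) and all(
        unicodedata.category(ch).startswith("L") for ch in text
    )

print(is_letters("naïve"))  # letters only
print(is_letters("ᛮ"))      # Letter Number (Nl), not a letter
print(is_letters("x³"))     # contains an Other Number (No)
```

Note that combining marks (category Mn) are not in \p{L}, so a decomposed string such as "e" followed by U+0301 is rejected. Normalizing to NFC first may matter, depending on what should count as a letter.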
\w(?<![\d_]) matches letters, but not digits, nor an underscore.
Alternatively, [^\W\d_].
[\W\d_] matches anything that’s \W (neither letter, digit nor underscore) or \d or '_', leaving non-letters. Inverting that matches only letters.
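As a quick check, the two patterns can be compared side by side with the standard re module. This is only a sketch; the sample string and variable names are arbitrary.

```python
import re

# Lookbehind form: match a \w character, then reject it if it was a digit or '_'.
lookbehind = re.compile(r"(?:\w(?<![\d_]))+")
# Negated-class form: anything that is not \W, not \d and not '_'.
negated = re.compile(r"[^\W\d_]+")

text = "abc_123 déf x2"
print(lookbehind.findall(text))  # ['abc', 'déf', 'x']
print(negated.findall(text))     # ['abc', 'déf', 'x']
```

Both forms give the same result here. Neither excludes characters such as ³, because \d only covers decimal digits.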
Oh, I see, you’re looking for all and only. Yeah, that would be a good feature but I guess that’s one for the “grab a more powerful one from PyPI” case.
Definitely an option, but now you can’t add anything else to the class without messy alternation. Good to keep in mind though.
As I said above, [^\W\d_] also matches “Letter Number” e.g. ᛮ (Runic Arlaug Symbol) and “Other Number” e.g. ৵ (Bengali Currency Numerator Two) or ³.
Ah, OK, I missed that bit. I’d just use the regex module.