Add a re class to match numerics that are not decimal numbers

\d matches “Decimal Number”, but does not match “Letter Number” e.g. ᛮ (Runic Arlaug Symbol) and “Other Number” e.g. ৵ (Bengali Currency Numerator Two) or ³.
This means that “[^\W\d_]+” also matches these other numbers and not just letters. If we add a new \something (I had thought \¹, but that might not work), we could write a regular expression that would only match Unicode letters.
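
To make the problem concrete, here is a minimal sketch that runs the pattern above against single characters and reports each one's Unicode general category. The sample characters are the ones mentioned in this post plus a few ordinary ones; the names `NOT_ONLY_LETTERS`, `samples` and `results` are just for illustration.

```python
import re
import unicodedata

# The pattern from the post, applied to one character at a time.
NOT_ONLY_LETTERS = re.compile(r"[^\W\d_]")

samples = [
    "a",       # Latin small letter a
    "\u00e9",  # e with acute, precomposed
    "\u16ee",  # Runic Arlaug Symbol (Letter Number)
    "\u09f5",  # Bengali Currency Numerator Two (Other Number)
    "\u00b3",  # Superscript Three (Other Number)
    "5",       # ASCII digit (Decimal Number)
    "_",       # underscore
]

# Map each character to (general category, whether the pattern matches it).
results = {
    ch: (unicodedata.category(ch), NOT_ONLY_LETTERS.fullmatch(ch) is not None)
    for ch in samples
}

for ch, (category, matched) in results.items():
    print(f"U+{ord(ch):04X} {category} matched={matched}")
```

The Nl and No characters come out as matched even though they are not letters, which is the gap described above.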

I doubt I am the first person to notice this, but I feel there should be a simple re pattern that matches only Unicode letters.

Isn’t that \w ?

No, \w matches letters, digits and '_'.

It’s become commonplace (but absolutely incorrect) to use the term
“unicode” to mean codepoints outside the ASCII set. While it’s
purely a guess, I suspect what they’re really looking for is a way
to exclusively match non-ASCII characters; they merely lack the
terminology to explain it accurately.

In my case, I mean both ASCII and other Unicode letters, what non-programmers think of as the components of words in various languages.

Got it, so you want a shortcut to match everything \w matches except
what \d matches and also not the underscore (_)? Are diacritics
letters, by your thinking? Or would you require strings to be
normalized first (and using which normalization)?
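
The normalization question matters in practice with the stdlib `re` module, because combining marks are not counted as word characters. A minimal sketch, assuming the `[^\W\d_]+` pattern discussed in this thread and using "é" as the test case:

```python
import re
import unicodedata

LETTERS = re.compile(r"[^\W\d_]+")

# The same visible text in two normalization forms.
composed = unicodedata.normalize("NFC", "e\u0301")   # one code point, U+00E9
decomposed = unicodedata.normalize("NFD", "\u00e9")  # "e" followed by U+0301

# True when the whole string consists of characters the pattern accepts.
composed_ok = LETTERS.fullmatch(composed) is not None
decomposed_ok = LETTERS.fullmatch(decomposed) is not None

print(len(composed), composed_ok)
print(len(decomposed), decomposed_ok)
```

The decomposed form fails because U+0301 (category Mn) is not matched by \w, so normalizing to NFC first is one way to treat precomposable diacritics as part of a letter. It does not help for sequences that have no precomposed form.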

The regex module supports a wide range of Unicode properties. With it you can match letters with \p{L} or \p{Letter} and “letter numbers” with \p{Nl} or \p{Letter Number}.

\w(?<![\d_]) matches letters, but not digits, nor an underscore.

Alternatively, [^\W\d_].

[\W\d_] matches anything that’s \W (neither letter, digit nor underscore) or \d or '_', i.e. everything that is not a letter. Inverting that matches only letters.
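
A quick sketch showing that the two spellings behave the same on a small sample. The text and variable names are made up for illustration; note that ³ is kept by both, which is the caveat raised in the original post.

```python
import re

LOOKBEHIND = re.compile(r"\w(?<![\d_])")  # a word char that is not a digit or '_'
NEGATED = re.compile(r"[^\W\d_]")         # not (non-word, digit or '_')

text = "ab_12 cd\u00b3"  # the last character is SUPERSCRIPT THREE

lookbehind_hits = LOOKBEHIND.findall(text)
negated_hits = NEGATED.findall(text)

print(lookbehind_hits)
print(negated_hits)
```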


Oh, I see, you’re looking for all and only. Yeah, that would be a good feature but I guess that’s one for the “grab a more powerful one from PyPI” case.

Definitely an option, but now you can’t add anything else to the class without messy alternation. Good to keep in mind though.
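
As a rough illustration of what that alternation looks like, here is a sketch that accepts letters plus an apostrophe and a hyphen. The choice of extra characters and the sample text are assumptions for the example only.

```python
import re

# Either a "letter" per [^\W\d_], or one of the extra characters.
WORD = re.compile(r"(?:[^\W\d_]|['\-])+")

words = WORD.findall("don't stop-gap 42")
print(words)
```

Because the class is negated, extra characters cannot simply be added inside the brackets, so a non-capturing group with `|` is needed.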

As I said above, [^\W\d_] also matches “Letter Number” e.g. ᛮ (Runic Arlaug Symbol) and “Other Number” e.g. ৵ (Bengali Currency Numerator Two) or ³.

Ah, OK, I missed that bit. I’d just use the regex module.
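
If adding a PyPI dependency is not an option, a stdlib-only check is possible with `unicodedata` instead of a regular expression. This is a different technique from the regex module's \p{L}, not a drop-in replacement: it tests each character's general category and accepts those starting with "L". The helper names `is_letter` and `only_letters` are hypothetical.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    """True if ch is in a Letter category (Lu, Ll, Lt, Lm or Lo)."""
    return unicodedata.category(ch).startswith("L")


def only_letters(text: str) -> bool:
    """True if text is non-empty and every character is a letter."""
    return bool(text) and all(is_letter(ch) for ch in text)


print(only_letters("na\u00efve"))  # precomposed i with diaeresis
print(only_letters("\u16ee"))      # Runic Arlaug Symbol, category Nl
print(only_letters("\u00b3"))      # Superscript Three, category No
```

Like the patterns above, this rejects combining marks, so decomposed text would need normalizing first.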
