Add a re class to match numerics that are not decimal numbers

jwes44 · July 20, 2023, 7:27pm

\d matches “Decimal Number”, but does not match “Letter Number” e.g. ᛮ (Runic Arlaug Symbol) and “Other Number” e.g. ৵ (Bengali Currency Numerator Two) or ³.
This means that “[^\W\d_]+” also matches these other numbers and not just letters. If we add a new \something (I had thought \¹, but that might not work), we could write a regular expression that would only match Unicode letters.
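
A minimal sketch of the behaviour described here, using only the standard library. The sample characters are the ones named in the post plus a few ordinary ones; the variable names are just for illustration.

```python
import re
import unicodedata

# The pattern from the post: word characters minus \d and underscore.
WORD_NO_DIGIT = re.compile(r"[^\W\d_]+")

# 'a', 'é', '5', '_', Runic Arlaug Symbol, Bengali Currency Numerator Two, superscript three
samples = ["a", "\u00e9", "5", "_", "\u16ee", "\u09f5", "\u00b3"]

# Map each sample to whether the pattern matches it.
results = {ch: bool(WORD_NO_DIGIT.fullmatch(ch)) for ch in samples}

for ch in samples:
    # Print the character, its Unicode general category, and the match result.
    print(repr(ch), unicodedata.category(ch), results[ch])
```

The general categories printed for the last three samples are Nl, No and No, yet the pattern still accepts them, which is the gap being reported.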

kknechtel · July 20, 2023, 10:46pm

jwes44 · July 21, 2023, 4:26pm

I did not think I was the first person to notice this, but I feel there should be a simple re pattern that matches only Unicode letters.

Rosuav · July 21, 2023, 5:41pm

Isn’t that \w ?

MRAB · July 21, 2023, 6:03pm

No, \w matches letters, digits and '_'.

fungi · July 21, 2023, 6:20pm

It’s become commonplace (but absolutely incorrect) to use the term
“unicode” to mean codepoints outside the ASCII set. While it’s
purely a guess, I suspect what they’re really looking for is a way
to exclusively match non-ASCII characters; they merely lack the
terminology to explain it accurately.

jwes44 · July 21, 2023, 6:32pm

In my case, I mean both ASCII and other Unicode letters, what non-programmers think of as the components of words in various languages.

fungi · July 21, 2023, 6:38pm

Got it, so you want a shortcut to match everything \w matches except
what \d matches and also not the underscore (_)? Are diacritics
letters, by your thinking? Or would you require strings to be
normalized first (and using which normalization)?
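
To illustrate why the normalization question matters, here is a small sketch (the sample word is an arbitrary choice). It runs the [^\W\d_]+ pattern from earlier in the thread over the same word in NFC and NFD form.

```python
import re
import unicodedata

LETTERS = re.compile(r"[^\W\d_]+")

word = "caf\u00e9"  # 'café' with a precomposed é

nfc = unicodedata.normalize("NFC", word)  # é stays one code point
nfd = unicodedata.normalize("NFD", word)  # é becomes 'e' + U+0301 combining acute

print(LETTERS.findall(nfc))
print(LETTERS.findall(nfd))
```

In the decomposed form the combining accent is not matched by \w in the re module, so the match stops before it.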

MRAB · July 21, 2023, 6:42pm

The regex module supports a wide range of Unicode properties. With it you can match letters with \p{L} or \p{Letter} and “letter numbers” with \p{Nl} or \p{Letter Number}.
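
A short sketch of that suggestion, assuming the third-party regex package is installed (pip install regex). The helper names and the sample string are made up for illustration; the import is guarded so the snippet does nothing when the package is missing.

```python
try:
    import regex  # third-party package from PyPI, not the stdlib re module
except ImportError:
    regex = None  # the demo at the bottom is skipped in this case


def find_letter_runs(text):
    """Return runs of characters whose general category is L (Letter)."""
    return regex.findall(r"\p{L}+", text)


def find_letter_numbers(text):
    """Return characters whose general category is Nl (Letter Number)."""
    return regex.findall(r"\p{Nl}", text)


if regex is not None:
    # 'abc', Runic Arlaug Symbol, Bengali numerator two, superscript three, 'déf_42'
    sample = "abc \u16ee \u09f5 \u00b3 d\u00e9f_42"
    print(find_letter_runs(sample))
    print(find_letter_numbers(sample))
```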

storchaka · July 21, 2023, 6:42pm

\w(?<![\d_]) matches letters, but not digits, nor an underscore.
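
As a quick illustration of how the lookbehind works: \w consumes one character, then (?<![\d_]) looks back at that same character and rejects it if it is a digit or an underscore. The sample string is arbitrary.

```python
import re

# \w consumes a character; the negative lookbehind then vetoes digits and '_'.
WORD_NOT_DIGIT_OR_UNDERSCORE = re.compile(r"\w(?<![\d_])")

found = WORD_NOT_DIGIT_OR_UNDERSCORE.findall("ab_12 cd")
print(found)
```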

MRAB · July 21, 2023, 6:57pm

Alternatively, [^\W\d_].

[\W\d_] matches anything that’s \W (neither letter, digit nor underscore), \d, or '_'; in other words, every non-letter. Inverting that set matches only letters.
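
Both spellings express "\w, but not \d and not underscore", so they should accept exactly the same characters. A rough way to check that, sketched here over the Basic Multilingual Plane only (the range is an arbitrary choice to keep it quick):

```python
import re

LOOKBEHIND = re.compile(r"\w(?<![\d_])")
NEGATED_CLASS = re.compile(r"[^\W\d_]")

# Collect every code point where the two patterns give different answers.
disagreements = [
    cp
    for cp in range(0x10000)
    if bool(LOOKBEHIND.fullmatch(chr(cp))) != bool(NEGATED_CLASS.fullmatch(chr(cp)))
]

print(len(disagreements))
```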

Rosuav · July 21, 2023, 9:01pm

Oh, I see, you’re looking for all and only. Yeah, that would be a good feature but I guess that’s one for the “grab a more powerful one from PyPI” case.

Definitely an option, but now you can’t add anything else to the class without messy alternation. Good to keep in mind though.

jwes44 · July 22, 2023, 12:07am

As I said above, [^\W\d_] also matches “Letter Number” e.g. ᛮ (Runic Arlaug Symbol) and “Other Number” e.g. ৵ (Bengali Currency Numerator Two) or ³.

MRAB · July 22, 2023, 12:29am

Ah, OK, I missed that bit. I’d just use the regex module.
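
For cases where adding a dependency is not an option, one possible stdlib-only workaround is to skip re entirely and group characters by their Unicode general category. This is a different technique from anything proposed in the thread, and the function names are hypothetical.

```python
from itertools import groupby
import unicodedata


def is_letter(ch):
    """True if the character's general category is one of the L* (Letter) classes."""
    return unicodedata.category(ch).startswith("L")


def letter_runs(text):
    """Split text into maximal runs of letters, dropping everything else."""
    return ["".join(group) for key, group in groupby(text, key=is_letter) if key]


# 'abc', Runic Arlaug Symbol, superscript three, 'déf', a space, then digits
print(letter_runs("abc\u16ee\u00b3d\u00e9f 42"))
```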