Regex for unicode letter

mark-summerfield · February 24, 2021, 9:44am

I want to create a regex to match a Unicode letter followed by any number of letters, digits, spaces, hyphens, or underscores.

If the first bit was just an ASCII letter then it is easy: [A-Za-z][-\w ]*

But what do I replace [A-Za-z] with for any Unicode letter?

I know that if I use the regex module from PyPI I could use [\w--[0-9_]] or simply \p{L} but it would be nice to use the std. lib.

facelessuser · February 24, 2021, 2:07pm

The Regex library is a must if you want full Unicode support, but I understand that requirements sometimes rule out external libraries. This specific case can be worked around with a negative lookahead.

Here the \w character class matches all the letters, while the negative lookahead excludes the ASCII digits and the underscore.

>>> import re
>>> re.match(r'(?:(?![0-9_])[\w])+', 'test')
<re.Match object; span=(0, 4), match='test'>
>>> re.match(r'(?:(?![0-9_])[\w])+', 'te3t')
<re.Match object; span=(0, 2), match='te'>

Obviously, this gets more cumbersome if you need more complex Unicode character classes. For this simple case, though, I think the above would work fine.
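As a rough sketch of how the lookahead behaves on non-ASCII input, the snippet below runs it against a few samples and compares it with the negated class [^\W\d_], which is another standard-library idiom and not something proposed in this thread. The helper name leading_letters and the sample strings are made up for illustration. One difference worth knowing: [0-9] only excludes ASCII digits, whereas \d in a str pattern also covers other Unicode decimal digits such as U+0663.

```python
import re

# Lookahead form from the post: a word character that is not an ASCII digit or underscore.
LOOKAHEAD_LETTER = re.compile(r'(?:(?![0-9_])\w)+')
# Alternative idiom (an assumption, not from the thread):
# "not a non-word character, not a digit, not an underscore".
NEGATED_LETTER = re.compile(r'[^\W\d_]+')


def leading_letters(pattern, text):
    """Return the run of letters matched at the start of text, or '' if there is none."""
    m = pattern.match(text)
    return m.group(0) if m else ''


# 'a\u0663' is the letter a followed by ARABIC-INDIC DIGIT THREE.
for sample in ('test', 'te3t', 'straße9', 'мир_1', '_x', 'a\u0663'):
    print(ascii(sample),
          ascii(leading_letters(LOOKAHEAD_LETTER, sample)),
          ascii(leading_letters(NEGATED_LETTER, sample)))
```

The two patterns agree on every sample except the last one, where the lookahead form lets the non-ASCII digit through. Neither form is a true \p{L}: \w also accepts some numeric characters that are not decimal digits, so treat both as approximations.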

facelessuser · February 24, 2021, 2:38pm

I will add that I do hope Python's re will one day offer proper Unicode support, to match other modern languages. It is surprising that after all this time it still does not; even JavaScript supports general categories and scripts at this point.

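EDIT: So a complete solution would look like (?![0-9_])[\w]+. The lookahead is only checked before the first character, so the word must start with a letter but may contain digits and underscores after that. Depending on your use case, you may want to add a word boundary at the beginning: \b(?![0-9_])[\w]+.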
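To see why the word boundary can matter, here is a small sketch using re.findall. The function name find_words and the sample sentence are illustrative only. Without \b, a search is free to start in the middle of a word such as "9lives", skipping the leading digit.

```python
import re

# Pattern from the post: the lookahead is checked once, before the first character.
WORD = re.compile(r'\b(?![0-9_])\w+')
# Same pattern without the word boundary, for comparison.
WORD_NO_BOUNDARY = re.compile(r'(?![0-9_])\w+')


def find_words(text):
    """Return every word that does not start with an ASCII digit or an underscore."""
    return WORD.findall(text)


print(find_words('abc 9lives x2 _private straße'))  # ['abc', 'x2', 'straße']
print(WORD_NO_BOUNDARY.findall('9lives'))           # ['lives']
```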

mark-summerfield · February 25, 2021, 8:40am

Thank you!

In the end I went with: ^(?![0-9_])\w[-/\w ]+ i.e., a Unicode letter followed by one or more hyphens, slashes, letters, digits, underscores or spaces.
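For anyone wanting to try this pattern, a minimal sketch follows. It assumes the whole string should be validated, so it uses fullmatch to anchor the end as well; the function name is_valid_name and the sample inputs are hypothetical. Note that the trailing + means a single letter on its own is rejected; swap it for * if one-character names should pass.

```python
import re

# Pattern from the post. The `+` requires at least one character after the first letter.
NAME = re.compile(r'^(?![0-9_])\w[-/\w ]+')


def is_valid_name(text):
    """True if all of text is a letter followed by one or more allowed characters."""
    # fullmatch (an assumption about the intended use) also rejects trailing junk.
    return NAME.fullmatch(text) is not None


for sample in ('Øre-1 b/c', 'ab', 'a', '9abc', '_abc', 'a!'):
    print(repr(sample), is_valid_name(sample))
```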