Regex for unicode letter

mark-summerfield · February 24, 2021, 9:44am

I want to create a regex to match a Unicode letter followed by any number of letters, digits, spaces, hyphens, or underscores.

If the first bit was just an ASCII letter then it is easy: [A-Za-z][-\w ]*

But what do I replace [A-Za-z] with for any Unicode letter?

I know that if I use the regex module from PyPI I could use [\w--[0-9_]] or simply \p{L} but it would be nice to use the std. lib.

facelessuser · February 24, 2021, 2:07pm

The Regex library is a must if you want to work seriously with Unicode, but I understand that sometimes there are requirements that rule out external libraries. This specific case can be worked around by simply using a negative lookahead.

Here the \w character class matches all the letters, while the negative lookahead stops digits and underscores from matching.

>>> import re
>>> re.match(r'(?:(?![0-9_])[\w])+', 'test')
<re.Match object; span=(0, 4), match='test'>
>>> re.match(r'(?:(?![0-9_])[\w])+', 'te3t')
<re.Match object; span=(0, 2), match='te'>

Obviously, this gets more cumbersome if you are doing far more complex Unicode character classes. For this simple case though, I think the above would work fine.
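For reference, here is a minimal sketch comparing the lookahead form with another standard-library idiom, `[^\W\d_]`, which is not from this thread. The constant names are made up for illustration. One assumption worth checking: in `str` patterns `\w` also matches non-ASCII decimal digits, so a lookahead that only excludes `[0-9_]` lets those through, while `[^\W\d_]` does not.

```python
import re

# Lookahead form from the post above: a word character that is not 0-9 or "_".
LOOKAHEAD_LETTER = re.compile(r'(?![0-9_])\w')
# Alternative stdlib idiom (not from this thread): "not a non-word character,
# not a digit, not an underscore".
NEGATED_LETTER = re.compile(r'[^\W\d_]')

# '\u0663' is ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit.
samples = ['ß', 'я', 'Ω', '7', '_', '\u0663']
for ch in samples:
    print(
        ascii(ch),
        bool(LOOKAHEAD_LETTER.fullmatch(ch)),
        bool(NEGATED_LETTER.fullmatch(ch)),
    )
```

The two forms should agree on letters, ASCII digits, and the underscore, and differ only on the non-ASCII digit. Writing the lookahead as `(?![\d_])` instead of `(?![0-9_])` would close that gap.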

facelessuser · February 24, 2021, 2:38pm

I will add that I do hope Python will one day get proper Unicode support in re, to match other modern languages. It is surprising that after all this time it still does not have it; even JavaScript supports general categories and scripts at this point.

EDIT: So a complete solution would look like (?![0-9_])[\w]+. Depending on your use case, you may want to add a word boundary at the beginning: \b(?![0-9_])[\w]+.
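A small sketch of what the leading `\b` changes when searching inside a longer string. The sample text and variable names are made up for illustration.

```python
import re

text = '3abc def'

# Without \b the match may start in the middle of a word, right after the digit.
without_boundary = re.findall(r'(?![0-9_])\w+', text)
# With \b the match must start at the beginning of a word, so '3abc' is skipped whole.
with_boundary = re.findall(r'\b(?![0-9_])\w+', text)

print(without_boundary)
print(with_boundary)
```

Without the boundary, `abc` is picked out of `3abc`. With it, only `def` qualifies.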

mark-summerfield · February 25, 2021, 8:40am

Thank you!

In the end I went with: ^(?![0-9_])\w[-/\w ]+ i.e., a Unicode letter followed by any number of hyphens, slashes, letters, digits, underscores or spaces.
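A quick sketch for exercising that final pattern. The helper name `is_valid_name` is hypothetical, and using `fullmatch` is an assumption about the intent (validating a whole string). Two things to be aware of: the trailing `+` requires at least one character after the first letter, and the pattern has no end anchor, so plain `match` accepts a valid prefix.

```python
import re

# Final pattern from the post above, unchanged.
FINAL_RE = re.compile(r'^(?![0-9_])\w[-/\w ]+')

def is_valid_name(text):
    """Hypothetical helper: True if the whole string fits the pattern."""
    return FINAL_RE.fullmatch(text) is not None

for text in ['Straße 12', 'a/b-c d', 'x', '_x', '9lives', 'ab!']:
    print(repr(text), is_valid_name(text))
```

If single-letter names should be accepted, changing the trailing `+` to `*` would allow them.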
