A Regular Expression Problem

yeshuo · September 2, 2023, 7:03pm

>>> import re
>>> pattern = re.compile(r'\w+(?!\.com)')
>>> result = pattern.search('www_example123.com')
>>>
>>> print(result.group())
www_example12

I can’t understand why the result is like this.

I hope someone can explain the matching process step by step.

Thanks.

surfaceowl · September 2, 2023, 8:11pm

That result is returned because (?!\.com) is a negative lookahead: it requires that whatever \w+ matches is not immediately followed by “.com”. The regex checks the lookahead after each possible \w+ match. The first attempt, www_example123, is rejected because .com comes immediately after it. The engine then backtracks one character to www_example12. What follows is now “3.com”, which does not start with “.com”, so the lookahead passes and www_example12 is the match.

If you want to capture the full string www_example123, you could change the regex to use a positive lookahead, so that whatever comes after \w+ must be “.com”, something like r'\w+(?=\.com)' (notice the equals sign).
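A small side-by-side sketch of the two lookaheads on the string from the question may help. The variable names here are only for illustration.

```python
import re

text = 'www_example123.com'

# Negative lookahead: the \w+ match must NOT be followed by ".com".
negative = re.search(r'\w+(?!\.com)', text)

# Positive lookahead: the \w+ match MUST be followed by ".com".
positive = re.search(r'\w+(?=\.com)', text)

print(negative.group())  # www_example12
print(positive.group())  # www_example123
```

In both cases the lookahead itself consumes nothing, so “.com” is never part of the returned match.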

If you haven’t already, you could check out sites like https://regex101.com/ and play around with different patterns and target strings to get the intuition for a particular search.

kknechtel · September 3, 2023, 6:19pm

The regular expression means “one or more word characters, that are not followed by .com”.

So we cannot match www_example123, because it is followed by .com. The engine has to backtrack one character from there, to www_example12, which is followed by 3.com rather than .com.
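The backtracking can be imitated by hand. This is only a rough simulation in plain Python, not how the re module is actually implemented, and the names word, candidate and blocked are made up for the example.

```python
text = 'www_example123.com'
word = 'www_example123'  # longest run of word characters starting at index 0

# Try the longest candidate first, then give back one character at a time,
# which is roughly what a greedy \w+ does when the lookahead rejects a match.
for end in range(len(word), 0, -1):
    candidate = text[:end]
    # (?!\.com) fails if the text right after the candidate starts with ".com".
    blocked = text.startswith('.com', end)
    print(candidate, 'rejected' if blocked else 'accepted')
    if not blocked:
        break
```

This prints www_example123 as rejected and then www_example12 as accepted, which is the result seen in the question.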

storchaka · September 5, 2023, 4:12am

You could try a possessive quantifier (supported by the re module in Python 3.11 and later)

r'\w++(?!\.com)'

or an atomic group

r'(?>\w+)(?!\.com)'
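Here is a quick sketch of what these two patterns do with the string from the question. It assumes Python 3.11 or later, since older versions of re reject this syntax, so the calls are guarded by a version check.

```python
import re
import sys

text = 'www_example123.com'

# Possessive quantifiers and atomic groups need Python 3.11+.
if sys.version_info >= (3, 11):
    possessive = re.search(r'\w++(?!\.com)', text)
    atomic = re.search(r'(?>\w+)(?!\.com)', text)
    print(possessive.group())  # com
    print(atomic.group())      # com
```

Both forms stop \w+ from giving characters back, so www_example12 is no longer a possible match. Note that search() then keeps scanning to the right and finds com, which is a run of word characters that is not followed by “.com”.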

gwerbin · September 5, 2023, 4:42am

Another solution might be to use a capture group: (\w+)\.com
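A minimal sketch of the capture-group approach, using the string from the question:

```python
import re

text = 'www_example123.com'

# Match the word characters plus the literal ".com", but capture only the word part.
match = re.search(r'(\w+)\.com', text)
if match:
    print(match.group(1))  # www_example123
    print(match.group(0))  # www_example123.com
```

Unlike the lookahead versions, “.com” is consumed as part of the overall match here, so group(1) is what holds the name on its own.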