Python re lookahead doesn't ignore endpos

MegaIng · June 18, 2024, 8:06pm

I am not sure if this is a bug or an intended but badly documented feature.

I am using \b for simplicity, but the same applies to full lookahead/lookbehind expressions.

import re

pattern = re.compile(r'\b\w+\b') # matches full words

assert pattern.fullmatch("abc")
assert pattern.fullmatch("abc", pos=0, endpos=3)
assert pattern.fullmatch(" abc ", pos=1, endpos=4)
assert not pattern.fullmatch("xabc ", pos=1, endpos=4)
assert not pattern.fullmatch(" abcx", pos=1, endpos=4)  # raises AssertionError: this call matches 'abc'

The problem, in my eyes, is the last one: it shouldn’t match because there is an x after abc, but it seems that \b, and the lookahead machinery in general, don’t look beyond endpos. This conflicts with the behavior of pos, as the "xabc " example shows.
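
The same asymmetry can be reproduced with explicit lookarounds. This is a minimal sketch, assuming `(?<!\w)\w+(?!\w)` as a stand-in for the \b pattern above; the names `before` and `after` are only for illustration:

```python
import re

# Explicit lookarounds instead of \b: no word character directly
# before the match and none directly after it.
pattern = re.compile(r'(?<!\w)\w+(?!\w)')

# The lookbehind still sees the 'x' in front of pos, so there is no match.
before = pattern.fullmatch("xabc ", pos=1, endpos=4)

# The lookahead does not see the 'x' at endpos, so this one matches.
after = pattern.fullmatch(" abcx", pos=1, endpos=4)

print(before)         # None
print(after.group())  # abc
```

So the lookbehind can read characters before pos, while the lookahead treats endpos as the end of the string.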

Is this intended? And if not, is it fixable, or are we at the point of “too much code might rely on this behavior”?

jamestwebber · June 18, 2024, 8:17pm

This seems pretty clearly documented here?

MegaIng · June 18, 2024, 8:22pm

Aha, right, missed the last sentence. I still find it confusing behavior, but I guess it matches the docs, so
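
As far as I can tell from the re documentation for Pattern.search, endpos makes matching behave as if the string were only endpos characters long, while pos is explicitly described as not completely equivalent to slicing. A small sketch of that reading, reusing the pattern from the first post (the variable names are illustrative only):

```python
import re

pattern = re.compile(r'\b\w+\b')

# endpos behaves like cutting off the tail of the string.
s_tail = " abcx"
via_endpos = pattern.fullmatch(s_tail, pos=1, endpos=4)
via_slice_end = pattern.fullmatch(s_tail[:4], pos=1)
print(via_endpos.span(), via_slice_end.span())  # (1, 4) (1, 4)

# pos does not behave like cutting off the head of the string.
s_head = "xabc "
via_pos = pattern.fullmatch(s_head, pos=1, endpos=4)
via_slice_start = pattern.fullmatch(s_head[1:4])
print(via_pos)                  # None
print(via_slice_start.group())  # abc
```

In other words, slicing the end away gives the same result as endpos, but slicing the start away does not give the same result as pos.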