Python regex issues

I’m trying to find duplicate substrings in a string and remove them. It mostly works, but I’ve hit a bug I cannot resolve.

import re

string = 'XX111XX = 1, XX111XX = 3 , XX121XXX = 2 , XX123XXX = "HAPPYYYY" XX124XXX = "HAPPYYYY"'

print(re.sub(r'\b[^\.|^\d](\w+)\b(?=.*\b([\S+]\1)\b)', r'', string))

This matches and removes both XX111XX and HAPPYYYY, leaving the double quotes where HAPPYYYY once was.

I don’t want to replace the values inside double quotes, only duplicates like XX111XX.

How can this be achieved with my existing regex?
I’ve tried the following:

'\b[^\.|^\d|^A-Z](\w+)\b(?=.*\b([\S+]\1)\b)'
'\b[^\.|^\d|^\"](\w+)\b(?=.*\b([\S+]\1)\b)'

How can I avoid matching anything between quotes, or write it so that it only matches substrings that contain both numbers and alphabetical characters, not only alphabetical characters?

To answer my own question:

regex = r'\b[^\.|^\d^](\w+\d+\w*)\b(?=.*\b([\S+]\1)\b)'
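
For anyone who wants to check the difference, here is a minimal sketch that runs the pattern from the question and the pattern above against the sample string. The helper name `remove_earlier_duplicates` is only illustrative. Note that inside a character class `|` and a non-leading `^` are literal characters, so `[^\.|^\d^]` simply means "not a dot, pipe, caret or digit".

```python
import re

string = 'XX111XX = 1, XX111XX = 3 , XX121XXX = 2 , XX123XXX = "HAPPYYYY" XX124XXX = "HAPPYYYY"'

# Pattern from the question: group 1 accepts any run of word characters,
# so a purely alphabetical token such as HAPPYYYY can match too.
original = r'\b[^\.|^\d](\w+)\b(?=.*\b([\S+]\1)\b)'

# Pattern from the answer: group 1 must contain at least one digit (\d+),
# which rules out tokens made only of letters.
fixed = r'\b[^\.|^\d^](\w+\d+\w*)\b(?=.*\b([\S+]\1)\b)'


def remove_earlier_duplicates(pattern, text):
    """Remove every token that the lookahead finds again later in the text."""
    return re.sub(pattern, '', text)


before = remove_earlier_duplicates(original, string)
after = remove_earlier_duplicates(fixed, string)

print(before)
print(after)
```

With this sample string, the first pattern strips the first HAPPYYYY and leaves an empty pair of quotes behind, while the second keeps both quoted values and removes only the first XX111XX. One caveat: because the leading character class consumes one character and `\w+` needs at least one more, the digit has to appear at the third character of the token or later.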

Cool that you’ve worked this out for yourself.

I find that particular regex construct a little confusing (but each to their own); I prefer the construct I used in this thread:
