Replacement argument in re.sub()


I have the following strings that I would like to match and replace. I am able to match the strings but am struggling to carry out the replacement: the output either doesn't change or entire sections of the input string are erased. I have included an example of one of the many options I have tried.

scenario 1

input string: (BXXADDB)(BXXXCAC1)(CXX2A)CANEVER
desired output string: (BXXADDB)(BXXXCAC1)(CXX2A)*CANEVER

pattern = r"^\([A-Za-z0-9]+\)\([A-Za-z0-9]+\)\([^)]*\)[a-zA-Z]+$"
replacement = r"^\([A-Za-z0-9]+\)\([A-Za-z0-9]+\)\([^)]*\)\*[a-zA-Z]+$"

df['column'] = [re.sub(pattern,replacement, str) for str in df['column']]

scenario 2

input string: somefunction(,A,B,2,C)
desired output string: somefunction(A,B,2,C)

pattern = r"^[A-Za-z0-9]+\([^)]*\)$"
replacement = r"^[A-Za-z0-9]+\([^)]*\)$"
df['column'] = [re.sub(pattern,replacement, str) for str in df['column']]

scenario 3

input string: (AUSM)ABCD
desired output string: (AUSM)*ABCD

pattern = r"^\([A-Za-z0-9]+\)[A-Za-z0-9]+$"
replacement = r"^\([A-Za-z0-9]+\)\*[A-Za-z0-9]+$"
df['column'] = [re.sub(pattern,replacement, str) for str in df['column']]


To refer to a match in the replacement you need to use groups:

pattern = r"^(\([A-Za-z0-9]+\)\([A-Za-z0-9]+\)\([^)]*\))([a-zA-Z]+)$"
replacement = r"\1*\2"
input_string = "(BXXADDB)(BXXXCAC1)(CXX2A)CANEVER"
desired_output = "(BXXADDB)(BXXXCAC1)(CXX2A)*CANEVER"
assert re.sub(pattern, replacement, input_string) == desired_output
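The same grouping idea should carry over to scenario 3 and to a whole column of values. The following is only a sketch: the two-group version of the scenario 3 pattern is an assumed adaptation, and the plain list `column` is a hypothetical stand-in for df['column'].

```python
import re

# Scenario 3 pattern with two capturing groups added (an assumed adaptation
# of the original pattern, following the approach shown for scenario 1).
pattern = re.compile(r"^(\([A-Za-z0-9]+\))([A-Za-z0-9]+)$")
replacement = r"\1*\2"  # group 1, a literal asterisk, then group 2

# Hypothetical stand-in for df['column']; the values are illustrative.
column = ["(AUSM)ABCD", "no parentheses here", "(X1)Y2"]

# sub() returns strings that do not match the pattern unchanged.
updated = [pattern.sub(replacement, value) for value in column]
print(updated)
```

The loop variable is called `value` rather than `str` so that the built-in `str` type is not shadowed.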

Oh wow, it worked, but with the replacement as r"\1*".

I got an "invalid group reference 2" error for r"\1*\2".

I will try to do the other scenarios.


Note that the pattern I posted is not identical to yours:

# The escaped parentheses only match literal characters, so there is no group 2 here, hence why \2 doesn't work in the replacement.
your_pattern = r"^\([A-Za-z0-9]+\)\([A-Za-z0-9]+\)\([^)]*\)[a-zA-Z]+$"

#                (                   Group 1               )( Group 2 )
my_pattern =  r"^(\([A-Za-z0-9]+\)\([A-Za-z0-9]+\)\([^)]*\))([a-zA-Z]+)$"

A replacement string of \1* using your original pattern does not produce the output you want.
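One way to check how many groups a pattern defines is the `groups` attribute of the compiled pattern. The sketch below compares the two patterns exactly as posted above; the variable names for the counts and the error message are illustrative only.

```python
import re

your_pattern = r"^\([A-Za-z0-9]+\)\([A-Za-z0-9]+\)\([^)]*\)[a-zA-Z]+$"
my_pattern = r"^(\([A-Za-z0-9]+\)\([A-Za-z0-9]+\)\([^)]*\))([a-zA-Z]+)$"
input_string = "(BXXADDB)(BXXXCAC1)(CXX2A)CANEVER"

# .groups is the number of capturing groups in a compiled pattern.
# Escaped parentheses \( \) match literal characters and are not counted.
your_group_count = re.compile(your_pattern).groups
my_group_count = re.compile(my_pattern).groups
print(your_group_count, my_group_count)

# Referring to a group the pattern does not define raises re.error.
try:
    re.sub(your_pattern, r"\1*\2", input_string)
    error_message = None
except re.error as exc:
    error_message = str(exc)
print(error_message)

# With both groups defined, the same replacement string works.
result = re.sub(my_pattern, r"\1*\2", input_string)
print(result)
```

If the pattern you actually ran wraps everything in a single pair of unescaped parentheses, the count would be 1, which would be consistent with \1 working and \2 failing.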

Yes, I followed your pattern and still got an error.

I have done scenario 3 but am struggling with the grouping for scenario 2.

If you copy what I wrote exactly you will get no error.

For scenario two, consider the following pattern:

pattern = r"^([a-zA-Z0-9]+\(),([^)]*\))$"

# Same pattern with explanations
pattern = re.compile(
    r"""^                 # Match beginning of line
        (                 # Group 1 start
            [a-zA-Z0-9]+  # Match an alphanumerical string of arbitrary length
            \(            # Match an opening parenthesis
        )                 # Group 1 end
        ,                 # Match a single comma
        (                 # Group 2 start
            [^)]*         # Match a string of arbitrary length until reaching a closing parenthesis
            \)            # Match a closing parenthesis
        )                 # Group 2 end
        $                 # Match end of line""",
    re.VERBOSE,           # Ignore the whitespace and comments inside the pattern
)

Can you see what the replacement string should be to get the desired result?

The single matched comma needs to be replaced with "" between groups 1 and 2.

tried that but just left with


instead of somefunction(A,B,2,C)

Not quite. If you remove the comma from the pattern it won’t match the comma, or anything after it, as you noticed.

Instead, note that the two groups contain everything except the first comma inside the parentheses. For the replacement string, you just need to put the two groups together:
re.sub(pattern, r"\1\2", string)
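As a quick check of the scenario 2 replacement, the sketch below prints what each group captures before joining them. The second input, `already_clean`, is a hypothetical value added to show what happens when there is no leading comma.

```python
import re

pattern = re.compile(r"^([a-zA-Z0-9]+\(),([^)]*\))$")
string = "somefunction(,A,B,2,C)"

# Inspect what each group captured; the comma sits between them.
match = pattern.match(string)
captured = match.groups()
print(captured)

# Joining the two groups drops the comma, which is in neither group.
result = pattern.sub(r"\1\2", string)
print(result)

# Hypothetical input without a leading comma: the pattern does not match,
# so sub() returns the string unchanged.
already_clean = "somefunction(A,B,2,C)"
unchanged = pattern.sub(r"\1\2", already_clean)
print(unchanged)
```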

How did I not see that? Thanks. It works now.

Though I wouldn't have thought that you could close group 1 the way you did to isolate the single comma. I learned a lot, thanks so much.