Replacement argument in re.sub()

issa · November 10, 2022, 11:23am

Hello,

I have the following strings that I would like to match and replace. I am able to match the strings but am struggling to carry out the replacement: the output either doesn’t change or whole sections of the input string are erased. I have included one of the many options I have tried.

scenario 1:
input string: (BXXADDB)(BXXXCAC1)(CXX2A)CANEVER
desired output string: (BXXADDB)(BXXXCAC1)(CXX2A)*CANEVER

pattern = r"^\([A-Za-z0-9]+\)\([A-Za-z0-9]+\)\([^)]*\)[a-zA-Z]+$"
replacement = r"^\([A-Za-z0-9]+\)\([A-Za-z0-9]+\)\([^)]*\)\*[a-zA-Z]+$"

df['column'] = [re.sub(pattern,replacement, str) for str in df['column']]

scenario 2

input string: somefunction(,A,B,2,C)
desired output string: somefunction(A,B,2,C)

pattern = r"^[A-Za-z0-9]+\([^)]*\)$"
replacement = r"^[A-Za-z0-9]+\([^)]*\)$"
df['column'] = [re.sub(pattern,replacement, str) for str in df['column']]

scenario 3

input string: (AUSM)ABCD
desired output string: (AUSM)*ABCD

pattern = r"^\([A-Za-z0-9]+\)[A-Za-z0-9]+$"
replacement = r"^\([A-Za-z0-9]+\)\*[A-Za-z0-9]+$"
df['column'] = [re.sub(pattern,replacement, str) for str in df['column']]

Thanks

abessman · November 10, 2022, 12:08pm

To refer to a match in the replacement you need to use groups:

import re

pattern = r"^(\([A-Za-z0-9]+\)\([A-Za-z0-9]+\)\([^)]*\))([a-zA-Z]+)$"
replacement = r"\1*\2"
input_string = "(BXXADDB)(BXXXCAC1)(CXX2A)CANEVER"
desired_output = "(BXXADDB)(BXXXCAC1)(CXX2A)*CANEVER"
assert re.sub(pattern, replacement, input_string) == desired_output
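The same replacement can also be written with the more explicit \g<n> form, or with named groups, which may be easier to read once a pattern has several groups. This is only a sketch of the idea: the group names prefix and tail are my own labels, not anything from your data.

```python
import re

pattern = r"^(\([A-Za-z0-9]+\)\([A-Za-z0-9]+\)\([^)]*\))([a-zA-Z]+)$"
text = "(BXXADDB)(BXXXCAC1)(CXX2A)CANEVER"

# \1 and \g<1> refer to the same capturing group; the \g<n> form stays
# unambiguous even when the next character in the replacement is a digit.
numbered = re.sub(pattern, r"\1*\2", text)
explicit = re.sub(pattern, r"\g<1>*\g<2>", text)

# Named groups (names chosen here purely for illustration).
named_pattern = (
    r"^(?P<prefix>\([A-Za-z0-9]+\)\([A-Za-z0-9]+\)\([^)]*\))"
    r"(?P<tail>[a-zA-Z]+)$"
)
named = re.sub(named_pattern, r"\g<prefix>*\g<tail>", text)

print(numbered)
print(explicit)
print(named)
```

All three calls should print the same string, with the * inserted between the last parenthesised block and the trailing letters.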

issa · November 10, 2022, 12:38pm

Oh wow, it worked, but only with the replacement set to r"\1*".

I got an "invalid group reference 2" error for r"\1*\2".

I will try to do the other scenarios.

thanks!

abessman · November 10, 2022, 2:01pm

Note that the pattern I posted is not identical to yours:

# No capturing groups here, only the whole match (group 0), hence why \2 doesn't work in the replacement.
your_pattern = r"^\([A-Za-z0-9]+\)\([A-Za-z0-9]+\)\([^)]*\)[a-zA-Z]+$"

#                (                   Group 1               )( Group 2 )
my_pattern =  r"^(\([A-Za-z0-9]+\)\([A-Za-z0-9]+\)\([^)]*\))([a-zA-Z]+)$"

With your original pattern, the only thing a replacement can refer to is the whole match. A replacement string of \g<0>* results in the following output:
(BXXADDB)(BXXXCAC1)(CXX2A)CANEVER*
which is not what you want.
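To check what a group-less pattern can and cannot do, here is a small sketch. As far as I know, the whole match is always available as \g<0>, while a numbered reference such as \1 raises an error when the pattern contains no capturing parentheses. The variable names are just for illustration.

```python
import re

ungrouped = r"^\([A-Za-z0-9]+\)\([A-Za-z0-9]+\)\([^)]*\)[a-zA-Z]+$"
text = "(BXXADDB)(BXXXCAC1)(CXX2A)CANEVER"

# \g<0> is the entire match, so the * can only land after the whole string.
whole_match = re.sub(ungrouped, r"\g<0>*", text)
print(whole_match)

# With no capturing groups in the pattern, \1 has nothing to refer to.
try:
    re.sub(ungrouped, r"\1*", text)
    error_message = None
except re.error as exc:
    error_message = str(exc)
print(error_message)
```

To put the * in the middle of the string, the pattern needs at least two groups, one on each side of the insertion point.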

issa · November 10, 2022, 2:11pm

Yes, I followed your pattern and still got an error.

I have done scenario 3 but am struggling with the grouping for scenario 2.

abessman · November 10, 2022, 2:33pm

If you copy what I wrote exactly you will get no error.

For scenario two, consider the following pattern:

pattern = r"^([a-zA-Z0-9]+\(),([^)]*\))$"

# Same pattern with explanations
pattern = re.compile(
    r"""^                 # Match beginning of line
        (                 # Group 1 start
            [a-zA-Z0-9]+  # Match an alphanumerical string of arbitrary length
            \(            # Match an opening parenthesis
        )                 # Group 1 end
        ,                 # Match a single comma
        (                 # Group 2 start
            [^)]*         # Match a string of arbitrary length until reaching a closing parenthesis
            \)            # Match a closing parenthesis
        )                 # Group 2 end
        $                 # Match end of line""",
    re.X,
)
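If it helps, you can inspect what each group captures before deciding on a replacement. The sketch below assumes the scenario 2 input string from your first post and compares the compact pattern with a shortened verbose version of it.

```python
import re

compact = re.compile(r"^([a-zA-Z0-9]+\(),([^)]*\))$")
verbose = re.compile(
    r"""^
        ([a-zA-Z0-9]+\()   # Group 1: name plus opening parenthesis
        ,                  # The comma to drop (outside both groups)
        ([^)]*\))          # Group 2: the rest, up to the closing parenthesis
        $""",
    re.X,
)

text = "somefunction(,A,B,2,C)"

# .groups() returns the captured text of every group as a tuple.
compact_groups = compact.match(text).groups()
verbose_groups = verbose.match(text).groups()

print(compact_groups)
print(verbose_groups)
```

Both patterns should capture the same two pieces, and the leading comma belongs to neither of them.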

Can you see what the replacement string should be to get the desired result?

issa · November 10, 2022, 4:17pm

The matched single comma needs to be replaced with “” between groups 1 and 2.

I tried that but was just left with

somefunction(

instead of somefunction(A,B,2,C)

abessman · November 10, 2022, 5:28pm

Not quite. If you remove the comma from the pattern it won’t match the comma, or anything after it, as you noticed.

Instead, note that the two groups contain everything except the first comma inside the parentheses. For the replacement string, you just need to put the two groups together:
re.sub(pattern, r"\1\2", string)
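Putting the three scenarios together, a minimal sketch could look like the following. The scenario 3 pattern is my guess at the grouping, built from the pattern in your first post, and the names rules, fix_value and samples are made up for this example.

```python
import re

# (compiled pattern, replacement) pairs, one per scenario.
rules = [
    # Scenario 1: insert * after the third parenthesised block.
    (re.compile(r"^(\([A-Za-z0-9]+\)\([A-Za-z0-9]+\)\([^)]*\))([a-zA-Z]+)$"), r"\1*\2"),
    # Scenario 2: drop the comma right after the opening parenthesis.
    (re.compile(r"^([a-zA-Z0-9]+\(),([^)]*\))$"), r"\1\2"),
    # Scenario 3 (assumed grouping): insert * after the single parenthesised block.
    (re.compile(r"^(\([A-Za-z0-9]+\))([A-Za-z0-9]+)$"), r"\1*\2"),
]


def fix_value(value):
    """Apply every rule in turn; strings that match no rule come back unchanged."""
    for pattern, replacement in rules:
        value = pattern.sub(replacement, value)
    return value


samples = [
    "(BXXADDB)(BXXXCAC1)(CXX2A)CANEVER",
    "somefunction(,A,B,2,C)",
    "(AUSM)ABCD",
    "untouched",
]
results = [fix_value(s) for s in samples]
for before, after in zip(samples, results):
    print(before, "->", after)
```

With the DataFrame from your first post, something like df['column'].map(fix_value) should apply the same function to every row, though I have not run it against your data.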

issa · November 10, 2022, 5:48pm

How did I not see that? Thanks. It works now.

Though I wouldn’t have thought that you could close group 1 the way you did to isolate the single comma. Learned a lot, thanks so much.