Replacement argument in re.sub()

issa · November 10, 2022, 11:23am

Hello,

I have the following strings that I would like to match and replace. I am able to match the strings but am struggling to carry out the replacement: the output either doesn’t change or sections of the input string are erased completely. I have included an example of one of the many options I have tried.

scenario 1:
input string: (BXXADDB)(BXXXCAC1)(CXX2A)CANEVER
desired output string: (BXXADDB)(BXXXCAC1)(CXX2A)*CANEVER

pattern = r"^\([A-Za-z0-9]+\)\([A-Za-z0-9]+\)\([^)]*\)[a-zA-Z]+$"
replacement = r"^\([A-Za-z0-9]+\)\([A-Za-z0-9]+\)\([^)]*\)\*[a-zA-Z]+$"

df['column'] = [re.sub(pattern,replacement, str) for str in df['column']]

scenario 2

input string: somefunction(,A,B,2,C)
desired output string: somefunction(A,B,2,C)

pattern = r"^[A-Za-z0-9]+\([^)]*\)$"
replacement = r"^[A-Za-z0-9]+\([^)]*\)$"
df['column'] = [re.sub(pattern,replacement, str) for str in df['column']]

scenario 3

input string: (AUSM)ABCD
desired output string: (AUSM)*ABCD

pattern = r"^\([A-Za-z0-9]+\)[A-Za-z0-9]+$"
replacement = r"^\([A-Za-z0-9]+\)\*[A-Za-z0-9]+$"
df['column'] = [re.sub(pattern,replacement, str) for str in df['column']]

thanks

abessman · November 10, 2022, 12:08pm

To refer to a match in the replacement you need to use groups:

pattern = r"^(\([A-Za-z0-9]+\)\([A-Za-z0-9]+\)\([^)]*\))([a-zA-Z]+)$"
replacement = r"\1*\2"
input_string = "(BXXADDB)(BXXXCAC1)(CXX2A)CANEVER"
desired_output = "(BXXADDB)(BXXXCAC1)(CXX2A)*CANEVER"
assert re.sub(pattern, replacement, input_string) == desired_output
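As a rough sketch of the same idea: the replacement argument is a template, not a second regex, so it only needs references to the groups. The snippet below shows three ways to write it that should be equivalent here. The names `text` and `add_star` are made up for illustration.

```python
import re

pattern = r"^(\([A-Za-z0-9]+\)\([A-Za-z0-9]+\)\([^)]*\))([a-zA-Z]+)$"
text = "(BXXADDB)(BXXXCAC1)(CXX2A)CANEVER"  # illustrative input from scenario 1

# Numbered backreferences in the replacement template.
numbered = re.sub(pattern, r"\1*\2", text)

# \g<N> is the unambiguous spelling of the same backreference.
explicit = re.sub(pattern, r"\g<1>*\g<2>", text)


def add_star(match):
    """Build the replacement from the match object (hypothetical helper)."""
    return match.group(1) + "*" + match.group(2)


# A callable can be passed instead of a template string.
called = re.sub(pattern, add_star, text)

print(numbered)
print(explicit == numbered, called == numbered)
```

The `\g<1>` form is mostly useful when a digit follows the group reference, since `\10` would otherwise be read as group 10.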

issa · November 10, 2022, 12:38pm

oh wow, it worked but with replacement as r"\1*"

got an invalid group reference 2 error for r"\1*\2"

i will try to do the other scenarios

thanks!

abessman · November 10, 2022, 2:01pm

Note that the pattern I posted is not identical to yours:

# No capturing groups here: \( and \) match literal parentheses, so \2 has nothing to refer to.
your_pattern = r"^\([A-Za-z0-9]+\)\([A-Za-z0-9]+\)\([^)]*\)[a-zA-Z]+$"

#                (                   Group 1               )( Group 2 )
my_pattern =  r"^(\([A-Za-z0-9]+\)\([A-Za-z0-9]+\)\([^)]*\))([a-zA-Z]+)$"

If \1* worked for you but \2 did not, your pattern presumably had a single group around the whole match. In that case a replacement string of \1* results in the following output:
(BXXADDB)(BXXXCAC1)(CXX2A)CANEVER*
which is not what you want.
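One way to check this yourself, as a minimal sketch: a compiled pattern exposes the number of capturing groups through its `groups` attribute, and escaped parentheses do not count. The variable names below are only for illustration.

```python
import re

text = "(BXXADDB)(BXXXCAC1)(CXX2A)CANEVER"
ungrouped = r"^\([A-Za-z0-9]+\)\([A-Za-z0-9]+\)\([^)]*\)[a-zA-Z]+$"
grouped = r"^(\([A-Za-z0-9]+\)\([A-Za-z0-9]+\)\([^)]*\))([a-zA-Z]+)$"

# Escaped parentheses are literals; only unescaped ones open a group.
ungrouped_count = re.compile(ungrouped).groups
grouped_count = re.compile(grouped).groups
print(ungrouped_count, grouped_count)

# Referring to a group that does not exist raises re.error.
try:
    re.sub(ungrouped, r"\1*\2", text)
    error_message = None
except re.error as exc:
    error_message = str(exc)
print(error_message)

# \g<0> stands for the whole match, so the asterisk lands at the very end.
whole_match_star = re.sub(ungrouped, r"\g<0>*", text)
print(whole_match_star)
```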

issa · November 10, 2022, 2:11pm

Yes, I followed your pattern and still got an error.

I have done scenario 3 but am struggling with the grouping for scenario 2.

abessman · November 10, 2022, 2:33pm

If you copy what I wrote exactly you will get no error.

For scenario two, consider the following pattern:

pattern = r"^([a-zA-Z0-9]+\(),([^)]*\))$"

# Same pattern with explanations
pattern = re.compile(
    r"""^                 # Match beginning of line
        (                 # Group 1 start
            [a-zA-Z0-9]+  # Match an alphanumerical string of arbitrary length
            \(            # Match an opening parenthesis
        )                 # Group 1 end
        ,                 # Match a single comma
        (                 # Group 2 start
            [^)]*         # Match a string of arbitrary length until reaching a closing parenthesis
            \)            # Match a closing parenthesis
        )                 # Group 2 end
        $                 # Match end of line""",
    re.X,
)
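If it helps, here is a small sketch for inspecting what each group captures, using `match.groups()`. It assumes the scenario 2 input string from the first post; `text`, `compact` and `verbose` are illustrative names.

```python
import re

compact = re.compile(r"^([a-zA-Z0-9]+\(),([^)]*\))$")
verbose = re.compile(
    r"""^
        ([a-zA-Z0-9]+\()   # Group 1: function name plus opening parenthesis
        ,                  # The leading comma, outside both groups
        ([^)]*\))          # Group 2: remaining arguments plus closing parenthesis
        $""",
    re.X,
)

text = "somefunction(,A,B,2,C)"

# groups() returns a tuple with the text captured by each group, in order.
compact_groups = compact.match(text).groups()
verbose_groups = verbose.match(text).groups()
print(compact_groups)
print(verbose_groups == compact_groups)
```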

Can you see what the replacement string should be to get the desired result?

issa · November 10, 2022, 4:17pm

The single comma matched between groups 1 and 2 needs to be replaced with “”.

I tried that but was just left with

somefunction(

instead of somefunction(A,B,2,C)

abessman · November 10, 2022, 5:28pm

Not quite. If you remove the comma from the pattern it won’t match the comma, or anything after it, as you noticed.

Instead, note that the two groups contain everything except the first comma inside the parentheses. For the replacement string, you just need to put the two groups together:
re.sub(pattern, r"\1\2", string)
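Putting the three scenarios together, a possible sketch looks like this. The scenario 3 pattern is my guess at the grouping, by analogy with scenario 1, and `RULES`, `clean` and `values` are made-up names rather than anything from your code.

```python
import re

# (compiled pattern, replacement template) pairs, one per scenario.
RULES = [
    # Scenario 1: insert * after the third parenthesised block.
    (re.compile(r"^(\([A-Za-z0-9]+\)\([A-Za-z0-9]+\)\([^)]*\))([a-zA-Z]+)$"), r"\1*\2"),
    # Scenario 2: drop the comma right after the opening parenthesis.
    (re.compile(r"^([a-zA-Z0-9]+\(),([^)]*\))$"), r"\1\2"),
    # Scenario 3 (assumed grouping): insert * after the single parenthesised block.
    (re.compile(r"^(\([A-Za-z0-9]+\))([A-Za-z0-9]+)$"), r"\1*\2"),
]


def clean(value):
    """Apply every rule in turn; strings that match no rule are returned unchanged."""
    for pattern, replacement in RULES:
        value = pattern.sub(replacement, value)
    return value


values = [
    "(BXXADDB)(BXXXCAC1)(CXX2A)CANEVER",
    "somefunction(,A,B,2,C)",
    "(AUSM)ABCD",
    "unrelated",
]
cleaned = [clean(value) for value in values]
print(cleaned)
```

With a DataFrame, something like `df['column'].map(clean)` should do the same job as the list comprehension, and it avoids using `str` as a loop variable, which shadows the built-in.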

issa · November 10, 2022, 5:48pm

how did I not see that? Thanks. It works now.

Though I wouldn’t have thought that you could close group 1 the way you did to isolate the single comma. Learned a lot, thanks so much.
