Selective replacement of matching text in a string?

I’m having to deal with thousands of rows of strings that can take any of the following forms:

  • a single word
  • a string of words
  • a string of words separated by a delimiter (in this case \\)

I need to be able to replace all instances of matching text with another value, being sure not to inadvertently replace instances of the matching text where it forms part of another delimited string.

As an example:
The problem I need to solve is to replace only the standalone Joe in each record with Joel, without inadvertently affecting ‘Joe Soap’, ‘Joe Bloggs’ or ‘Joe Little’.

So this:

  • Joe
  • Joe Soap\\Joe\\Joe Bloggs\\Joe
  • Joe Soap\\Joe Bloggs\\Joe
  • Joe\\Joe Soap\\Joe\\Joe Little

Becomes this:

  • Joel
  • Joe Soap\\Joel\\Joe Bloggs\\Joel
  • Joe Soap\\Joe Bloggs\\Joel
  • Joel\\Joe Soap\\Joel\\Joe Little

What is the most efficient way of achieving this in Python?

What have you tried?

Maybe something like: iterate over the rows in the file, split each row on \\ (giving a list), replace any item in the list that is equal to “Joe” with “Joel” and leave everything else alone, join the result back into a string using \\ as the separator, and write the result somewhere.
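In code, that could look roughly like this (the file names are just placeholders, and the delimiter is taken to be the literal \\ from the examples):

delim = r"\\"

with open("names.txt", encoding="utf-8") as src, open("names_out.txt", "w", encoding="utf-8") as dst:
    for line in src:
        tokens = line.rstrip("\n").split(delim)
        fixed = ["Joel" if token == "Joe" else token for token in tokens]
        dst.write(delim.join(fixed) + "\n")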

It looks like the strings that you should replace form a regular language. You could do it using re.sub.
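For example (a sketch, assuming the delimiter is the literal \\ shown in the examples; the lookarounds make sure Joe only matches when it is a whole delimited token):

import re

delim = re.escape(r"\\")
pattern = re.compile(rf"(?:^|(?<={delim}))Joe(?={delim}|$)")

print(pattern.sub("Joel", r"Joe Soap\\Joe\\Joe Bloggs\\Joe"))
# Joe Soap\\Joel\\Joe Bloggs\\Joel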

Perhaps the definition of “forms part of another delimited string” should be more precise. Should the occurrence in “Joe \\AAA” be replaced? Note the white space after “Joe”.


Alternatively, you could split each line on the separator with line.split("\\").
Then, for each piece, check whether it is exactly the pattern to replace, and join the string back up.

pattern = "Joe"
replacement = "Joel"
separator = "\\"
result_line = separator.join(
    (replacement if token == pattern else token) for token in line.split(separator)
)
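For example, given line = r"Joe\\Joe Soap\\Joe\\Joe Little", the snippet above produces result_line == r"Joel\\Joe Soap\\Joel\\Joe Little".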

The most efficient way is to use str.split and a dictionary – as @aivarpaalberg and @franklinvp also suggested. Define the dictionary up front for all the string replacements you want to do:

mm = {"Joe": "Joel", "Pete": "Peter"}  # etc

Define a little conversion function:

def convert(s: str, delim: str = r"\\") -> str:
    return delim.join(mm.get(x, x) for x in s.split(delim))
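Calling convert(r"Joe\\Joe Soap\\Joe") then gives r"Joel\\Joe Soap\\Joel", while tokens like "Joe Soap" that merely contain "Joe" are left alone.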

If you only need to replace a few unique strings, you could also use a regex, but that would generally be about twice as slow, and the code would be more complex and harder to maintain and change. So, especially if you have a lot of names you want to change, the above method is probably the simplest, clearest and fastest.
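For comparison, a sketch of the regex route, building one alternation from the same replacement dict as above (shown only to illustrate the trade-off):

import re

mm = {"Joe": "Joel", "Pete": "Peter"}  # same replacement dict as above
delim = re.escape(r"\\")
alternation = "|".join(re.escape(k) for k in mm)
pattern = re.compile(rf"(?:^|(?<={delim}))({alternation})(?={delim}|$)")

def convert_re(s: str) -> str:
    # substitute each whole delimited token that appears as a key in mm
    return pattern.sub(lambda m: mm[m.group(1)], s)

print(convert_re(r"Joe\\Joe Soap\\Pete\\Joe"))  # Joel\\Joe Soap\\Peter\\Joel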

If your strings are in a Pandas DataFrame df, say in column ‘A’, you can simply assign df.A = df.A.apply(convert). This will generally also be faster than using df.A.str.replace with a regex substitution. If a plain substring replacement were enough, df.A.str.replace would be faster, but that’s not the case here since only whole delimited tokens should be replaced.
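For instance (a minimal sketch, assuming the mm dict and convert() from above, with a throwaway DataFrame):

import pandas as pd

df = pd.DataFrame({"A": [r"Joe", r"Joe Soap\\Joe\\Joe Bloggs\\Joe"]})
df.A = df.A.apply(convert)
for value in df.A:
    print(value)
# Joel
# Joe Soap\\Joel\\Joe Bloggs\\Joel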

Thanks all for your replies. The current and replacement values are actually held in a SQLite table that serves as a reference of search-and-replace values. It has thousands of records that look like what I described in my first post. I could import that into a dictionary or a dataframe and use the convert function if it’ll scale sufficiently.

The end solution needs to be able to update a second table that contains around 800k records, and the update needs to be applied to four columns in the table that all hold similar/related data, so potentially 3.2m comparisons and updates.

If I loaded the 2nd table into a dataframe would there be an efficient means of doing the search / replace in a single operation or would I need to do it separately for each column in the df?

The answer would be no, only exact matches should be replaced. If ‘Joe ’ (with the trailing space) was intended to be replaced, it would appear as such in the table I referenced in the first paragraph of this reply.

@hansgeunsmeyer given what you’ve described above and my post that followed, am I correct in inferring that the fastest way to process the lot I refer to would be to create a dataframe from the database and apply df.columnx = df.columnx.apply(convert)?

I ultimately need to write the results back to the underlying table, which I guess I can do using a where clause to ensure I don’t write back data unnecessarily? It feels like a very “expensive” update, but I’m struggling to think of a more efficient way.

What I originally came up with works but was so inefficient as to be laughable and not something anyone with a modicum of self respect would share. :woozy_face:

800k rows is a very small amount (unless you have a very wide table with tons of columns or large amounts of data in single columns). Even when using df.apply or df.A.apply(convert) (df.B.apply etc. on those four columns) – which basically just does a plain, non-vectorized iteration over all rows – this should run in less than two seconds or so on any current laptop. It will not really matter whether you process the columns separately or not, or transfer from one DataFrame into another.

If the conversion function itself is suboptimal (has quite a bit of complex logic, lots of if/else and and/or clauses, or iterates each time over all 1000 possible replacements), then this would not work very well, but with the earlier suggested conversions it should work fine.

Thanks, I presume you meant 4 columns rather than rows?

Yes - corrected it :slight_smile:


Here’s some test code which I believe implements your suggestions:

import pandas as pd
import sqlite3

conn = sqlite3.connect('/tmp/amg/allmusic-wip.db')
dbcursor = conn.cursor()

pd.set_option('display.max_rows', None)

def convert(s: str, delim: str = r"\\") -> str:
    return delim.join(disambiguation_dict.get(x, x) for x in s.split(delim))

dbcursor.execute('''SELECT existing_contributor, replacement_contributor FROM disambiguation WHERE alib_updated = FALSE ;''')

disambiguation_records = dbcursor.fetchall()
disambiguation_count = len(disambiguation_records)

# if there are unprocessed records in disambiguation table
if disambiguation_count > 0:

    # convert list of disambiguation records into a dict
    disambiguation_dict = {key: value for key, value in disambiguation_records}

    # load all artist, performer, albumartist and composer records in alib into a df then process the dataframe
    
    df1 = pd.read_sql_query('SELECT rowid, artist, performer, albumartist, composer from alib where artist is not null order by __path', conn)
    df2 = df1.copy()


    df2.artist = df2.artist.apply(convert)

    # show changed records
    print(df1.compare(df2, keep_shape=True).dropna(how='all'))
    
    # now write changes back to db

else:
    print('No remaining name corrections to process')

The code produces the expected results; however, I have a few questions, if you’ll indulge me:

In the dataset, any of the fields imported from the database table into df1 may be NULL in the table, but there will never be a scenario where all fields in a record are NULL.

Notice that for the purposes of testing I’ve restricted df1 to records where artist IS NOT NULL. This is because convert() fails if presented with a NULL value (None/NaN), throwing the following:

Traceback (most recent call last):
  File "/tmp/amg/testcode.py", line 30, in <module>
    df2.artist = df2.artist.apply(convert)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/pandas/core/series.py", line 4771, in apply
    return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/pandas/core/apply.py", line 1123, in apply
    return self.apply_standard()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/pandas/core/apply.py", line 1174, in apply_standard
    mapped = lib.map_infer(
             ^^^^^^^^^^^^^^
  File "pandas/_libs/lib.pyx", line 2924, in pandas._libs.lib.map_infer
  File "/tmp/amg/testcode.py", line 11, in convert
    return delim.join(disambiguation_dict.get(x, x) for x in s.split(delim))
                                                             ^^^^^^^
AttributeError: 'NoneType' object has no attribute 'split'

I’m trying to figure out the best way to handle this:

  1. Leave convert() as is, in which case I’d have to restrict the df to a single column only and separately import each field from the table into a df WHERE column IS NOT NULL, process it and in turn write back to the database only those records that have changed.
    This has the advantage of processing fewer records for each column and avoids having to evaluate every entry in the df with a call to pd.isna(). But it means I need to repeat this process for every column that may require updates.

  2. Modify convert() as shown below and continue as I am, meaning I only need to create the df once and process each column. On the downside it means evaluating more records and having to evaluate every entry in the df with a call to pd.isna().

def convert2(s: str, delim: str = r"\\") -> str:
    if pd.isna(s):
        return None

    return delim.join(disambiguation_dict.get(x, x) for x in s.split(delim))

  3. Finally, if I’m not mistaken, 2. above means I can make a single call to df.apply(convert2), passing all columns to process?
df2[["artist", "performer", "albumartist", "composer"]] = df2[["artist", "performer", "albumartist", "composer"]].apply(convert2)

  4. Having now attempted df2[["artist", "performer", "albumartist", "composer"]] = df2[["artist", "performer", "albumartist", "composer"]].apply(convert2) it fails with:
Traceback (most recent call last):
  File "/tmp/amg/testcode.py", line 41, in <module>
    df2[["artist", "performer", "albumartist", "composer"]] = df2[["artist", "performer", "albumartist", "composer"]].apply(convert2)
                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/pandas/core/frame.py", line 9568, in apply
    return op.apply().__finalize__(self, method="apply")
           ^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/pandas/core/apply.py", line 764, in apply
    return self.apply_standard()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/pandas/core/apply.py", line 891, in apply_standard
    results, res_index = self.apply_series_generator()
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/pandas/core/apply.py", line 907, in apply_series_generator
    results[i] = self.f(v)
                 ^^^^^^^^^
  File "/tmp/amg/testcode.py", line 13, in convert2
    if pd.isna(s):
  File "/usr/lib/python3.11/site-packages/pandas/core/generic.py", line 1527, in __nonzero__
    raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

But running

df2.artist = df2.artist.apply(convert2)
df2.performer = df2.performer.apply(convert2)
df2.albumartist = df2.albumartist.apply(convert2)
df2.composer = df2.composer.apply(convert2)

works as expected. I’ve no idea why it would fail when applying in a single call?

You’re selecting those four columns and applying convert2 to each column as a whole: with the default axis, DataFrame.apply passes each column to the function as a Series, not as a single value.

That function then calls pd.isna(s), where s is that whole Series, so the result is a Series of true/false values.

But what are you asking it in the if condition? Are you asking whether any of those values are true or whether all of those values are true? That’s the ambiguity.
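You can see the difference with a tiny, made-up example:

import pandas as pd

s = pd.Series(["Joe", None, "Pete"])
print(pd.isna(s))        # a Series of three booleans, not a single True/False
print(pd.isna(s).any())  # True  - at least one value is missing
print(pd.isna(s).all())  # False - not every value is missing
# if pd.isna(s): ...     # this is the line that raises the ValueError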


Thanks, based on that I’m guessing that I cannot process all the columns in a single call, because every value needs to be evaluated in isolation.

You could just modify convert2 to work with a whole Series, i.e. multiple values, instead of only one value.

Easier said than done. If this thread has taught me anything, it’s that I’ve got a veeeery long way to go to become proficient with Python.

Most of my learning has come from having to get something done. I feel like I need to step away and go back to basics of the language.

Try this:

def convert_row(row, delim: str = r"\\"):
    result = []

    # row is a whole Series of values; leave missing values alone and
    # convert everything else token by token
    for item in row:
        if not pd.isna(item):
            item = delim.join(disambiguation_dict.get(x, x) for x in item.split(delim))

        result.append(item)

    return result
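Then, assuming the same column names as in your test code, all four columns can be processed in one call:

cols = ["artist", "performer", "albumartist", "composer"]
df2[cols] = df2[cols].apply(convert_row)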

Thanks everyone for your input.

Here’s the solution I’ve ended up with, hopefully it’ll help others. This was my first foray into Pandas in any meaningful context and I’ve clearly got a hell of a lot to learn on all fronts.

import pandas as pd
import sqlite3

conn = sqlite3.connect('/tmp/amg/allmusic-wip.db')
dbcursor = conn.cursor()

pd.set_option('display.max_rows', None)

def convert(s: str, delim: str = r"\\") -> str:
    return delim.join(disambiguation_dict.get(x, x) for x in s.split(delim))

def convert2(s: str, delim: str = r"\\") -> str:
    if pd.isna(s):
        return None

    return delim.join(disambiguation_dict.get(x, x) for x in s.split(delim))


def convert_dfrow(row, delim: str = r"\\"):
    result = []

    for item in row:
        if not pd.isna(item):
            item = delim.join(disambiguation_dict.get(x, x) for x in item.split(delim))

        result.append(item)

    return result    

def compare_large_dataframes(df1, df2):
    '''Return only the rows in df2 that differ from df1.

    This uses pd.merge with indicator=True and keeps only the rows that are
    present in df2 but not in df1.

    Example usage:
        differing_records = compare_large_dataframes(df1, df2)
    '''

    # Check that the DataFrames have the same shape
    if df1.shape != df2.shape:
        raise ValueError("DataFrames must have the same number of rows and columns")

    # Merge the DataFrames and keep only the rows that are different
    merged_df = pd.merge(df1, df2, how='outer', indicator=True).query('_merge == "right_only"').drop('_merge', axis=1)

    return merged_df


dbcursor.execute('''SELECT existing_contributor, replacement_contributor FROM disambiguation WHERE alib_updated = FALSE ;''')

disambiguation_records = dbcursor.fetchall()
disambiguation_count = len(disambiguation_records)

# if there are unprocessed records in disambiguation table
if disambiguation_count > 0:

    # convert list of disambiguation records into a dict
    disambiguation_dict = {key: value for key, value in disambiguation_records}

    # load all artist, performer, albumartist and composer records in alib into a df then process the dataframe
    df1 = pd.read_sql_query('SELECT rowid AS source_rowid, artist, performer, albumartist, composer from alib order by __path', conn)
    # make a copy against which to apply changes, which we'll subsequently compare with df1 to isolate changes
    df2 = df1.copy()  # by default this is a deep copy, so changes to df2 don't impact df1

    # transform the columns of interest
    df2[['artist', 'performer', 'albumartist', 'composer']] = df2[['artist', 'performer', 'albumartist', 'composer']].apply(convert_dfrow)
    df3 = compare_large_dataframes(df1, df2)

    print(df2.info(verbose=True))
    print(df3.info(verbose=True))  
    input("Press enter to continue")

    # now write changes back to db
    df3.to_sql('disambiguation_updates', conn, if_exists='replace', index=False)
    conn.commit()


else:
    print('No remaining name corrections to process')

dbcursor.close()
conn.close()