Here’s some test code which I believe implements your suggestions:
```python
import sqlite3

import pandas as pd

conn = sqlite3.connect('/tmp/amg/allmusic-wip.db')
dbcursor = conn.cursor()

pd.set_option('display.max_rows', None)


def convert(s: str, delim: str = r"\\") -> str:
    return delim.join(disambiguation_dict.get(x, x) for x in s.split(delim))


dbcursor.execute('''SELECT existing_contributor, replacement_contributor FROM disambiguation WHERE alib_updated = FALSE;''')
disambiguation_records = dbcursor.fetchall()
disambiguation_count = len(disambiguation_records)

# if there are unprocessed records in the disambiguation table
if disambiguation_count > 0:
    # convert the list of disambiguation records into a dict
    disambiguation_dict = {key: value for key, value in disambiguation_records}

    # load all artist, performer, albumartist and composer records in alib into a df, then process the dataframe
    df1 = pd.read_sql_query('SELECT rowid, artist, performer, albumartist, composer FROM alib WHERE artist IS NOT NULL ORDER BY __path', conn)
    df2 = df1.copy()
    df2.artist = df2.artist.apply(convert)

    # show changed records
    print(df1.compare(df2, keep_shape=True).dropna(how='all'))
    # now write changes back to db
else:
    print('No remaining name corrections to process')
```
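As a quick sanity check, the mapping logic in `convert()` behaves as I'd expect when exercised standalone with a made-up dict (the names below are invented for illustration):

```python
# Made-up stand-in for the dict built from the disambiguation table
disambiguation_dict = {"John Smith": "John Smith (guitarist)"}


def convert(s: str, delim: str = r"\\") -> str:
    return delim.join(disambiguation_dict.get(x, x) for x in s.split(delim))


# a single name, and a delimited multi-value tag
print(convert("John Smith"))             # John Smith (guitarist)
print(convert(r"John Smith\\Jane Doe"))  # John Smith (guitarist)\\Jane Doe
```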
The code produces the expected results; however, I have a few questions if you’ll indulge me:
In the dataset, any of the fields imported from the database table into `df1` may be NULL, but there will never be a record in which all fields are NULL. Notice that for testing purposes I’ve restricted `df1` to records `WHERE artist IS NOT NULL`. This is because `convert()` fails when presented with a NaN, throwing the following:
```
Traceback (most recent call last):
  File "/tmp/amg/testcode.py", line 30, in <module>
    df2.artist = df2.artist.apply(convert)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/pandas/core/series.py", line 4771, in apply
    return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/pandas/core/apply.py", line 1123, in apply
    return self.apply_standard()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/pandas/core/apply.py", line 1174, in apply_standard
    mapped = lib.map_infer(
             ^^^^^^^^^^^^^^
  File "pandas/_libs/lib.pyx", line 2924, in pandas._libs.lib.map_infer
  File "/tmp/amg/testcode.py", line 11, in convert
    return delim.join(disambiguation_dict.get(x, x) for x in s.split(delim))
                                                              ^^^^^^^
AttributeError: 'NoneType' object has no attribute 'split'
```
I’m trying to figure out the best way to handle this:

1. Leave `convert()` as is, in which case I’d have to restrict the df to a single column, separately importing each field from the table into a df `WHERE column IS NOT NULL`, processing it, and in turn writing back to the database only those records that have changed. This has the advantage of processing fewer records for each column and avoids having to evaluate every entry in the df with a call to `pd.isna()`, but it means repeating the process for every column that may require updates.
2. Modify `convert()` as shown below and continue as I am, meaning I only need to create the df once and can then process each column. On the downside it means evaluating more records, and every entry in the df has to be checked with a call to `pd.isna()`:

```python
def convert2(s: str, delim: str = r"\\") -> str | None:
    if pd.isna(s):
        return None
    return delim.join(disambiguation_dict.get(x, x) for x in s.split(delim))
```
3. Finally, if I’m not mistaken, option 2 above means I can make a single call to `df.apply(convert2)`, passing all the columns to process?

```python
df2[["artist", "performer", "albumartist", "composer"]] = df2[["artist", "performer", "albumartist", "composer"]].apply(convert2)
```
4. Having now attempted

```python
df2[["artist", "performer", "albumartist", "composer"]] = df2[["artist", "performer", "albumartist", "composer"]].apply(convert2)
```

it fails with:
```
Traceback (most recent call last):
  File "/tmp/amg/testcode.py", line 41, in <module>
    df2[["artist", "performer", "albumartist", "composer"]] = df2[["artist", "performer", "albumartist", "composer"]].apply(convert2)
                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/pandas/core/frame.py", line 9568, in apply
    return op.apply().__finalize__(self, method="apply")
           ^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/pandas/core/apply.py", line 764, in apply
    return self.apply_standard()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/pandas/core/apply.py", line 891, in apply_standard
    results, res_index = self.apply_series_generator()
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/pandas/core/apply.py", line 907, in apply_series_generator
    results[i] = self.f(v)
                 ^^^^^^^^^
  File "/tmp/amg/testcode.py", line 13, in convert2
    if pd.isna(s):
  File "/usr/lib/python3.11/site-packages/pandas/core/generic.py", line 1527, in __nonzero__
    raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
```
But running

```python
df2.artist = df2.artist.apply(convert2)
df2.performer = df2.performer.apply(convert2)
df2.albumartist = df2.albumartist.apply(convert2)
df2.composer = df2.composer.apply(convert2)
```

works as expected. I’ve no idea why it would fail when applied in a single call?
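In case it helps narrow things down, I can reproduce the difference with a toy frame (made-up names and a made-up dict). It looks as though `DataFrame.apply` hands the function a whole column as a Series, whereas applying element-wise works fine:

```python
import pandas as pd

# Made-up stand-in for the dict built from the disambiguation table
disambiguation_dict = {"Artist A": "Artist A (1969)"}


def convert2(s, delim=r"\\"):
    if pd.isna(s):
        return None
    return delim.join(disambiguation_dict.get(x, x) for x in s.split(delim))


df = pd.DataFrame({"artist": [r"Artist A\\Artist B", None],
                   "composer": [None, "Artist A"]})

# DataFrame.apply calls convert2 once per *column*, passing a Series, so
# pd.isna(s) returns a boolean Series and the `if` raises ValueError
try:
    df.apply(convert2)
    raised = False
except ValueError:
    raised = True
print("DataFrame.apply raised ValueError:", raised)

# applymap (renamed DataFrame.map in newer pandas) applies the function
# element-wise, so each call sees a scalar
out = df.applymap(convert2)
print(out)
```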