Help comparing df1 and df2 and creating df3 containing only the changed rows in df2

hobbycoder · October 27, 2024, 10:06pm

I’m generating a dataframe df1, making a copy df2, modifying values in df2, then comparing them and attempting to return the differences as df3

The function that compares df1 and df2, returning the rows that differ between df1 and df2 as df3 is as follows:

def compare_large_dataframes(df1, df2):

    if df1.shape != df2.shape:
        raise ValueError("DataFrames must have the same number of rows and columns")

    merged_df = pd.merge(df1, df2, how='outer', indicator=True).query('_merge == "right_only"').drop('_merge', axis=1)

return merged_df

When I run it I’m getting the following error:

df3 = compare_large_dataframes(df1, df2)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/ztm.py", line 4586, in compare_large_dataframes
    merged_df = pd.merge(df1, df2, how='outer', indicator=True).query('_merge == "right_only"').drop('_merge', axis=1)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 184, in merge
    return op.get_result(copy=copy)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 886, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
                                              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 1151, in _get_join_info
    (left_indexer, right_indexer) = self._get_join_indexers()
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 1125, in _get_join_indexers
    return get_join_indexers(
           ^^^^^^^^^^^^^^^^^^
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 1740, in get_join_indexers
    zipped = zip(*mapped)
             ^^^^^^^^^^^^
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 1737, in <genexpr>
    _factorize_keys(left_keys[n], right_keys[n], sort=sort)
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 2570, in _factorize_keys
    llab, rlab = _sort_labels(uniques, llab, rlab)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 2631, in _sort_labels
    _, new_labels = algos.safe_sort(uniques, labels, use_na_sentinel=True)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/mypy/lib/python3.12/site-packages/pandas/core/algorithms.py", line 1543, in safe_sort
    raise ValueError("values should be unique if codes is not None")
ValueError: values should be unique if codes is not None

df1 is created as follows:

df1 = pd.read_sql_query('SELECT rowid AS alib_rowid, col1, col2, col3, col4, col5, col6, col7 from table order by __path', conn)

col1 through col7 are all text fields from a SQLite table.

df2 is structurally identical to df1.

Any ideas what may be triggering this error?

onePythonUser · October 29, 2024, 10:17pm

Hi,

just an fyi, the return merged_df line should be indented. The way as shown, has no indentation and not part of the function body which appears to be what you intended.