Excel: Comparing two columns and copying the rows where column A matches column O

hi

I have data with headers named A to O, with data starting in row 2. A2 has the value 5 and A3 has 3; similarly, O2 has 3 and O3 has 5. When the program compares O2 with column A and finds the row where the data matches, it has to copy that entire row from A to N and paste it into a new sheet. Similarly, every value in column O has to be compared with the data in column A. The data in columns A and O are unsorted.
I have attached the file here.

I am not getting the desired output.

Output data

My code is as follows:

import pandas as pd

# Load the Excel file
df = pd.read_excel('file_name')

# Create an empty DataFrame to store the selected rows
result_df = pd.DataFrame(columns=df.columns)

# Iterate through the rows and collect those where A equals O
for index, row in df.iterrows():
    if not pd.isna(row['A']) and not pd.isna(row['O']):
        if row['O'] == row['A']:
            result_df = pd.concat([result_df, pd.DataFrame(row).transpose()])
            print(row)

# Save the modified data to a new Excel sheet
result_df.to_excel('new_sheet.xlsx', index=False)

Thanks

Hi Sekar, I’m not sure I fully understand what your desired output is, so I will first try to restate the problem.
Please correct me if I’m wrong.

  • You start with a DataFrame with columns A, B, C… up to O.
  • Columns A and O have numerical values.
  • If the values in the same row in columns A and O are equal (and both not NaN), you want to keep the whole row and copy it to a new DataFrame.

First tip: When testing if a == b with values that can also be NaN, you don’t have to test separately whether or not they are NaN, since nan == (any number) and nan == nan both evaluate to False (!)
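
A quick way to convince yourself of this NaN behaviour:

```python
import numpy as np

nan = np.nan

# Any equality comparison involving NaN is False, including NaN with itself,
# so `a == b` is already False whenever either side is NaN.
print(nan == 5)    # False
print(nan == nan)  # False
print(nan != nan)  # True
```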

Second tip: If you find yourself using

  • for loops
  • .iterrows
  • .itertuples
  • .apply

you are very likely overcomplicating your code, especially if the problem can be described in just a few words (as here). Not just overcomplicating, but also writing code that is probably orders of magnitude (>=100x) slower than a simpler alternative script using pandas vectorization.
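
To illustrate (a small sketch with made-up random data): the row-by-row loop and the vectorized one-liner select exactly the same rows, but the latter does all the work in one boolean mask instead of a Python-level loop:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'A': rng.integers(0, 10, 2_000),
                   'O': rng.integers(0, 10, 2_000)})

# Row-by-row loop, as in the original script
loop_df = pd.DataFrame([row for _, row in df.iterrows() if row.A == row.O])

# Vectorized equivalent
vec_df = df[df.A == df.O]

print(loop_df.equals(vec_df))  # True: identical rows, far less work
```

If you time the two on a large frame (e.g. with `timeit`), the difference grows with the number of rows.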

So, to get a handle on this problem, I think you need to start simpler, with a similar but smaller DataFrame and explore the incredibly powerful, simple and elegant methods that pandas provides:

import pandas as pd
import numpy as np
nan = np.nan
df = pd.DataFrame(dict(A=[5, 3, 1, 2, nan], 
                       B=[1, 2, 3, 4, 5], 
                       O=[5, 3, 2, nan, 1]))

So, we have as input

     A  B    O
0  5.0  1  5.0      # want to keep this
1  3.0  2  3.0      # want to keep this
2  1.0  3  2.0
3  2.0  4  NaN
4  NaN  5  1.0

Ok, to compare two columns, you can simply do this:

>>> df.A == df.O
0     True
1     True
2    False
3    False
4    False
dtype: bool

This is a Series of booleans… and this Series can be used directly as row selector…

>>> df_new = df[df.A == df.O]
>>> df_new
     A  B    O
0  5.0  1  5.0
1  3.0  2  3.0

And that’s it, right?! Assuming I understood the problem correctly.

This way of indexing is the same as df.loc, so df[df.A == df.O] means the same as df.loc[df.A == df.O]. See: Indexing and selecting data — pandas 2.1.1 documentation
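
You can verify that equivalence directly:

```python
import pandas as pd

df = pd.DataFrame({'A': [5, 3, 1], 'B': [1, 2, 3], 'O': [5, 3, 2]})
mask = df.A == df.O

# Plain boolean indexing and .loc with the same mask select identical rows.
print(df[mask].equals(df.loc[mask]))  # True
```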

As to performance: One presentation about pandas vectorization that was an eye opener to me a few years ago, was Sofia Heisler’s PyCon 2017 talk
No more Sad Pandas: Optimizing Pandas Code for Speed and Efficiency

Pandas truly showcases the Zen of Python:

Beautiful is better than ugly.
Simple is better than complex.

And in this case, the simplest code is also the fastest.

[PS - Your code is actually correct too. It behaves the same as the simpler code I suggested. So, the output you showed can not have been generated by it. Are you sure you saved the correct files and did not use other files?]

@hansgeunsmeyer
Thanks for explaining pandas usage in depth and showing me the direction. I hope to learn a lot from you, as I am just a beginner.

regarding this

Sorry.

  1. I made a mistake in the input sheet. Column A will be in serial format, i.e. 1, 2, 3, 4, 5 up to n, where n is dynamic; n can be 500, 550, 400, 600, etc. I just used sample data.
  2. Column O will be an unsorted list.
    Say column A has numbers in A2 to A201, running from 1 to 200 continuously. Cells O2 to O201 will have the same numbers jumbled up (i.e. not in sorted or serial order). The values in both columns A and O are numeric.
  3. I am trying to achieve the following.
    If cell O2 has, let’s say, 75, then 75 has to be searched for in column A. The cell location of 75 will be A76 (as we have a header in the first row). So the row values from A76 to N76 have to be copied and placed in the first row of the new sheet.
    If O3 has 50, the location of 50 in column A will be row 51; the values from A51 to N51 have to be copied and placed in the second row.
    Likewise, each row has to be appended one below another.

Can you check the Google sheet once? I have updated it with an input sheet and an output sheet.
Thanks.

Ok - that’s a different scenario. So, we have

  • columns A and O, both with arbitrary number values, sorted or unsorted
  • if for rows n and m we have df.O[n] == df.A[m], then we want to keep row m

So, the problem is for every value in df.O, to find which cell in df.A matches (if there is at least one). In other words, you want to find all rows where any df.O value is in df.A?

import pandas as pd
import numpy as np
nan = np.nan
df = pd.DataFrame(dict(A=[2, 2, 3, 4, 5], B=list("abcde"), O=[5,4,2,-1,nan]))
>>> df
   A  B    O
0  2  a  5.0
1  2  b  4.0
2  3  c  2.0
3  4  d -1.0
4  5  e  NaN

So, you can almost directly translate the natural language formulation of the problem into code (!):

>>> df.where(df.A.isin(df.O))
     A    B    O
0  2.0    a  5.0
1  2.0    b  4.0
2  NaN  NaN  NaN
3  4.0    d -1.0
4  5.0    e  NaN
#
# we want to get rid of the nans if df.A has a nan
#
>>> df1 = df.where(df.A.isin(df.O))
>>> df1 = df1.dropna(subset=['A'])  # or df1 = df1[~df1.A.isna()]
>>> df1
     A  B    O
0  2.0  a  5.0
1  2.0  b  4.0
3  4.0  d -1.0
4  5.0  e  NaN

# or simply: df1 = df1.dropna() if you also want to exclude NaN in df.O

One slightly annoying thing in this is that the type of column A was changed to float64 (because of the nans), so if you’re a purist you could change that again, after removing the nans from that column.
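
For instance, continuing the small example above: once the NaNs are gone, you can safely cast column A back to an integer dtype:

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(dict(A=[2, 2, 3, 4, 5], B=list("abcde"), O=[5, 4, 2, -1, nan]))

df1 = df.where(df.A.isin(df.O)).dropna(subset=['A'])
print(df1.A.dtype)  # float64, because of the intermediate NaNs

# Safe to cast back now that column A has no NaNs left
df1 = df1.astype({'A': 'int64'})
print(df1.A.dtype)  # int64
```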

Ok, and then you also want to re-order the rows according to the order in O? That’s an extra step… Your output does seem to be correct (I only took a very quick look), so it seems you already found a way to do that.

One way to add the ordering (with the implicit assumption that the values in df.O are unique!):

order = dict((v, i) for (i, v) in df1.O.items())
df1 = df1.sort_values("A", key=lambda series: series.map(order))

Hi,

Column O (the unsorted list) will have the same numbers as column A (the sorted list), and there will not be any NaN in ‘A’ or ‘O’. All values are numeric, hence there is no need to check for NaN.

No, there will be only one occurrence of each value in column A (the values will be unique; no repeats like 2, 2, etc.) when the column O value is checked.

There won’t be any NaN.

Yes, I was looking to re-order the rows according to the order in O.

Yes, the values are unique.

I tried using the above and

import pandas as pd

# Load data from Excel
df = pd.read_excel('J:\\python\\test\\test.xlsx')  # Replace with your actual file path

# Define the order based on column 'O'
order = dict((v, i) for (i, v) in df['O'].items())

# Sort the DataFrame based on the order of 'O' values
df_sorted = df.sort_values("A", key=lambda series: series.map(order))

# Save the sorted data to a new Excel sheet
df_sorted.to_excel('J:\\python\\test\\sorted_output.xlsx', index=False)  # Replace with desired output file name

print("Data sorted and saved successfully!")

This code does not work now. What is the mistake here?
One more clarification: if I have to point the start of the table data to cell A7 for column A and cell O7 for column O, how do I incorporate that in the code?

That code re-orders the rows of the input dataframe so that in the end column A will look identical to the original column O, assuming O and A are permutations of each other. What makes you say that it doesn’t work? What is unexpected in the output?
(So, the code does assume that values in O are unique. And by itself it will also not exclude rows when a value of O is not found in A.)

>>> df
   A    B     C    D     E     F       G    H    I    J     K    L     M     N  O
0  1    a     b    c     d     e       f    g    h    g     h    g     h     g  4
1  2    1     2    3     4     5       6    7    8    9    10   11    12    13  2
2  3   sd  asdf  sfd  fder  dter  fdgsre  rte  hrt  tyr  dtyr  hdf  ftyr  ftyr  5
3  4  101   105  109   113   117     121  125  129  133   137  141   145   149  3
4  5  145   165  185   205   225     245  265  285  305   325  345   365   385  1
>>> order = dict((v, i) for (i, v) in df['O'].items())
>>> order
{4: 0, 2: 1, 5: 2, 3: 3, 1: 4}
>>> df_sorted = df.sort_values("A", key=lambda series: series.map(order))
>>> df_sorted
   A    B     C    D     E     F       G    H    I    J     K    L     M     N  O
3  4  101   105  109   113   117     121  125  129  133   137  141   145   149  3
1  2    1     2    3     4     5       6    7    8    9    10   11    12    13  2
4  5  145   165  185   205   225     245  265  285  305   325  345   365   385  1
2  3   sd  asdf  sfd  fder  dter  fdgsre  rte  hrt  tyr  dtyr  hdf  ftyr  ftyr  5
0  1    a     b    c     d     e       f    g    h    g     h    g     h     g  4
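
As to your other question (table starting at A7/O7): `pd.read_excel` has `skiprows` and `header` parameters for exactly that. If the header row itself sits in row 7 of the sheet, `pd.read_excel(path, skiprows=6)` (or equivalently `header=6`) should do it; if row 7 is already the first data row, the header is presumably row 6 and you would skip one row fewer. The same parameters exist on `read_csv`, which makes for a self-contained sketch (the junk lines stand in for whatever occupies rows 1–6):

```python
import io
import pandas as pd

# Six filler lines, then the real header on line 7 -- mimicking a sheet
# whose table only starts at row 7.
raw = "junk\n" * 6 + "A,B,O\n1,x,3\n2,y,1\n3,z,2\n"

df = pd.read_csv(io.StringIO(raw), skiprows=6)
print(df.columns.tolist())  # ['A', 'B', 'O']
print(len(df))              # 3
```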

It does not work if the data is read from the Excel file. It throws the error below:

Traceback (most recent call last):
  File "C:\Users\HOME\Desktop\terr.py", line 11, in <module>
    df_sorted = df.sort_values("A", key=lambda series: series.map(order))
  File "C:\Users\HOME\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\HOME\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\frame.py", line 6324, in sort_values
    indexer = nargsort(
  File "C:\Users\HOME\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\sorting.py", line 387, in nargsort
    items = ensure_key_mapped(items, key)
  File "C:\Users\HOME\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\sorting.py", line 545, in ensure_key_mapped
    result = key(values.copy())
  File "C:\Users\HOME\Desktop\terr.py", line 11, in <lambda>
    df_sorted = df.sort_values("A", key=lambda series: series.map(order))
  File "C:\Users\HOME\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\series.py", line 4237, in map
    new_values = self._map_values(arg, na_action=na_action)
  File "C:\Users\HOME\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\base.py", line 852, in _map_values
    indexer = mapper.index.get_indexer(values)
  File "C:\Users\HOME\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\indexes\base.py", line 3721, in get_indexer
    raise InvalidIndexError(self._requires_unique_msg)
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects

@hansgeunsmeyer
It works now, after pasting the data into a new workbook.
Thanks. But I don’t know what caused the error.

I don’t completely understand the error here (I would have to look at your script and data). It suggests that the internal index of the mapper somehow had duplicate values. So perhaps the index of column A (not the column A values, but its indices) was somehow reset and had duplicates? The error would then disappear in a new dataframe, since the indices of a fresh sheet are simple range indices (basically numbering all rows from 0 to N). But, I guess, you’re all set now :slight_smile:
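
For reference, the error itself is easy to reproduce: `Series.map` raises exactly this `InvalidIndexError` whenever the mapper it is given has a non-unique index (a sketch with made-up values):

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# A mapper whose index contains duplicate labels
mapper = pd.Series(['x', 'y'], index=[1, 1])

try:
    s.map(mapper)
except pd.errors.InvalidIndexError as exc:
    print(type(exc).__name__)  # InvalidIndexError
```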
