Hi Sekar, I’m not sure I fully understand what your desired output is, so I will first try to restate the problem.
Please correct me if I’m wrong.
- You start with a DataFrame with columns A, B, C… up to O.
- Columns A and O have numerical values.
- If the values in columns A and O of the same row are equal (and neither is NaN), you want to keep that row and copy the whole row to a new DataFrame.
First tip: When testing whether a == b with values that can also be NaN, you don’t have to test separately whether they are NaN, since nan == (any number) and nan == nan both evaluate to False (!)
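To make that first tip concrete, here is a small sketch (the Series names a and o are just made up for illustration) showing that the same rule holds for plain floats and element-wise for pandas Series:

```python
import numpy as np
import pandas as pd

nan = np.nan

# Any equality comparison involving NaN is False, even NaN with itself.
nan_vs_number = (nan == 1.0)
nan_vs_nan = (nan == nan)

# The same rule applies element-wise when comparing two Series.
a = pd.Series([5.0, nan, nan])
o = pd.Series([5.0, 1.0, nan])
mask = (a == o)

print(nan_vs_number, nan_vs_nan, mask.tolist())
```

So a row where either value is NaN can never pass the == test, which is why no separate NaN check is needed here.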
Second tip: If you find yourself using
- for loops
- .iterrows
- .itertuples
- .apply
you are very likely overcomplicating your code, especially if the problem can be described in just a few words (as here). You are also probably writing code that is orders of magnitude (>=100x) slower than a simpler script using pandas vectorization would be.
So, to get a handle on this problem, I think you need to start simpler, with a similar but smaller DataFrame and explore the incredibly powerful, simple and elegant methods that pandas provides:
import pandas as pd
import numpy as np
nan = np.nan
df = pd.DataFrame(dict(A=[5, 3, 1, 2, nan],
                       B=[1, 2, 3, 4, 5],
                       O=[5, 3, 2, nan, 1]))
So, we have as input:
     A  B    O
0  5.0  1  5.0  # want to keep this
1  3.0  2  3.0  # want to keep this
2  1.0  3  2.0
3  2.0  4  NaN
4  NaN  5  1.0
Ok, to compare two columns, you can simply do this:
>>> df.A == df.O
0     True
1     True
2    False
3    False
4    False
dtype: bool
This is a Series of booleans… and this Series can be used directly as a row selector…
>>> df_new = df[df.A == df.O]
>>> df_new
     A  B    O
0  5.0  1  5.0
1  3.0  2  3.0
And that’s it, right?! Assuming I understood the problem correctly.
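In case it helps to see this on something closer to your real data, here is a rough sketch with columns A through O. The DataFrame named wide and its random contents are made up by me; I am only assuming your frame has that column layout. The .copy() at the end is optional and just makes it explicit that the result is an independent new DataFrame.

```python
import string

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 20 rows, columns 'A' .. 'O'.
cols = list(string.ascii_uppercase[:15])
rng = np.random.default_rng(0)
wide = pd.DataFrame(rng.integers(0, 3, size=(20, 15)).astype(float),
                    columns=cols)

# Sprinkle in a few NaN values in the two columns being compared.
wide.loc[[2, 7], "A"] = np.nan
wide.loc[[7, 11], "O"] = np.nan

# Keep only the rows where A equals O; rows with NaN drop out automatically.
wide_new = wide[wide["A"] == wide["O"]].copy()

print(wide_new[["A", "O"]])
```

All fifteen columns are carried over to wide_new; only the rows are filtered.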
With a boolean Series as the selector, this way of indexing is the same as df.loc, so df[df.A == df.O] means the same as df.loc[df.A == df.O]. See: "Indexing and selecting data" in the pandas 2.1.1 documentation.
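A quick way to convince yourself of that equivalence, using the same small example frame (the variable names mask, via_brackets, via_loc and only_b are mine):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [5, 3, 1, 2, np.nan],
                   "B": [1, 2, 3, 4, 5],
                   "O": [5, 3, 2, np.nan, 1]})

mask = df["A"] == df["O"]

via_brackets = df[mask]   # plain [] with a boolean Series
via_loc = df.loc[mask]    # the same selection written with .loc

# .loc additionally lets you pick columns in the same step.
only_b = df.loc[mask, "B"]

print(via_brackets.equals(via_loc))
print(only_b.tolist())
```

I tend to use .loc when I also want to select columns, and plain brackets when I only filter rows, but that is a matter of taste.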
As to performance: one presentation about pandas vectorization that was an eye-opener to me a few years ago was Sofia Heisler’s PyCon 2017 talk "No More Sad Pandas: Optimizing Pandas Code for Speed and Efficiency".
Pandas truly showcases the Zen of Python:
Beautiful is better than ugly.
Simple is better than complex.
And in this case, the simplest code is also the fastest.
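If you want to check the speed difference on your own machine, something like the following sketch should do. The frame named big, its size, and the two helper functions are made up for the comparison, and the actual timings will depend on your hardware and pandas version, so I am not quoting numbers here.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: 5000 rows with small integer values stored as floats.
rng = np.random.default_rng(0)
n = 5_000
big = pd.DataFrame({"A": rng.integers(0, 5, n).astype(float),
                    "O": rng.integers(0, 5, n).astype(float)})


def with_iterrows(frame):
    """Row-by-row selection: collect matching index labels in a Python loop."""
    keep = [idx for idx, row in frame.iterrows() if row["A"] == row["O"]]
    return frame.loc[keep]


def vectorized(frame):
    """Whole-column comparison used directly as a row selector."""
    return frame[frame["A"] == frame["O"]]


t_loop = timeit.timeit(lambda: with_iterrows(big), number=3)
t_vec = timeit.timeit(lambda: vectorized(big), number=3)
print(f"iterrows: {t_loop:.4f}s  vectorized: {t_vec:.4f}s")
```

Both functions select the same rows; only the time they take differs.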
[PS - Your code is actually correct too. It behaves the same as the simpler code I suggested, so the output you showed cannot have been generated by it. Are you sure you saved the correct files and did not use other files?]