Switch values of field in dataframe if found

drewbty · December 4, 2022, 11:05am

I have a dataframe with 3 columns
For each row, I want to search the TicketStatus column for an target number within the String. If found, iiterate through all the other rows in the dataframe, searching for this target number as an INT in all other row’s TicketID column.
If/when we find a match, we will switch the Odds fields for that matching pair.

Example input:
TicketID, Odds, TicketStatus
123456, 4.0, 'Yes 654799'
654799, 11.0, 'Hello'

Example output:
TicketID, Odds, TicketStatus
123456, 11.0, 'Yes 654799'
654799, 4.0, 'Hello'

df['Odds'] = df.apply(lambda x: df.loc[df['TicketStatus'].str.contains(x['TicketID']), 'Odds'].values[0] 
                      if df['TicketStatus'].str.contains(x['TicketID']).any() 
                          else x['Odds'], axis=1)

I’m returning a TypeError: first argument must be string or compiled pattern, but can’t figure out what I’m doing wrong.

drewbty · December 4, 2022, 1:38pm

Solved. TicketID is type INT and needs to be an Object/String

drewbty · December 4, 2022, 11:23pm

@vovavili could you please provide example code? Mine ran overnight for 10 hours and still hasn’t finished. There are 20 million rows in the dataframe. So would have to loop 400 million times minimum. Also I tested on a small sample and I have an error in the code as it hasn’t switched both odds columns, it just replaces the first with the second. (doesn’t replace the second with the first)

vovavili · December 5, 2022, 12:22am

I think I slightly underestimated what you meant to accomplish at first. In fact, that’s actually quite a head-scratcher. My immediate thought process is the following:

Extract digits from TicketStatus column
Merge two columns together in a first dictionary. (Possibly, a structured NumPy array might be faster, not sure).
Combine first needed columns, drop duplicates, keeping both duplicates in separate dataframes for a clever dictionary trick
Create the second hash map
Conditionally replace values

Here is my untested mockup:

# Create needed dictionary for a first pair of recoding

df["Ticket_Only_Digits"] = df["TicketStatus"].str.extract('(\d+)').apply(pd.to_numeric, errors='coerce')
df_temp = df.copy()
df_temp = df_temp.dropna(subset="Ticket_Only_Digits")
hash_map_1 = dict(zip(df_temp["Ticket_Only_Digits"], df_temp["Odds"]))

# The deduplication trick

df["Ticket_Only_Digits"] = df["Ticket_Only_Digits"].combine_first(df["TicketID"])
keys = df.loc[df.duplicated(subset="Ticket_Only_Digits", keep="first")].copy()
values = df.loc[df.duplicated(subset="Ticket_Only_Digits", keep="last")].copy()
hash_map_2 = dict(zip(keys["TicketID"], values["Odds"]))
df = df.drop("Ticket_Only_Digits", axis="columns")

# Map values
df["Odds"] = df["Odds"].mask(df["TicketID"].isin({*hash_map_1}), df["TicketID"].replace(hash_map_1))
df["Odds"] = df["Odds"].mask(df["TicketID"].isin({*hash_map_2}), df["TicketID"].replace(hash_map_2))

Please let me know if this works for you.

drewbty · December 5, 2022, 1:08am

@vovavili Get a ValueError: cannot convert float NaN to integer, even though no N/A values in the TicketStatus column? There are TicketStatus that have no Int though within the String. (Like in the original example ‘Hello’)

have uploaded a little sample of actual data here if any use: https://file.io/Qdp27U780XBL

drewbty · December 5, 2022, 1:49am

First got AttributeError: ‘DataFrame’ object has no attribute ‘drop_na’

Removed the drop n/a part of line 3 so it read like this:

df_temp = df_temp(subset="Ticket_Only_Digits")

Now get Dataframe Object is not callable

Posting the datatypes from df.dtypes if any use:

TicketID int64
Odds float64
TicketStatus object
dtype: object

drewbty · December 5, 2022, 1:53am

If I remove that line completely (which I now realise I should have ) I get the following error

TypeError: Input must be Index or array-like

vovavili · December 5, 2022, 1:55am

I’ve fixed syntax errors now that I have a data sample to work with instead of going from memory.

drewbty · December 5, 2022, 2:18am

edit until can check

vovavili · December 5, 2022, 2:20am

Yes, I’ve slightly fixed the logic by swapping keep_duplicates with df[df.duplicated]. Does it work now?

drewbty · December 5, 2022, 2:30am

Yes I believe it works!

Oh dear though. I just realised now that there was a fatal flaw in my sample code

Deleted - new sample code

It turns out that we are looking for the String in the TicketID, instead of the other way around

The String will always appear in a row before the target TicketID.

When I run the code, it will replace Row 4’s odds with Row 2, but won’t Replace Row 2’s Odds with Row 4.

I’m sorry for wasting time and appreciate the help so far

Topic		Replies	Views
Replace the value row into the previous row Python Help help	3	952	March 17, 2023
How to get value in Dataframe given row & column values? Python Help	5	1679	May 2, 2021
Fault to access column in dataframe Python Help help	2	418	February 7, 2022
New Column If condition Python Help	7	819	June 2, 2020
I need to change if a number in column is -after numbers Python Help	8	1052	September 1, 2020

Switch values of field in dataframe if found

Related Topics