Switch values of field in dataframe if found

  • I have a dataframe with 3 columns
  • For each row, I want to search the TicketStatus column for an target number within the String. If found, iiterate through all the other rows in the dataframe, searching for this target number as an INT in all other row’s TicketID column.
  • If/when we find a match, we will switch the Odds fields for that matching pair.
Example input:
TicketID, Odds, TicketStatus
123456, 4.0, 'Yes 654799'
654799, 11.0, 'Hello'

Example output:
TicketID, Odds, TicketStatus
123456, 11.0, 'Yes 654799'
654799, 4.0, 'Hello'
df['Odds'] = df.apply(lambda x: df.loc[df['TicketStatus'].str.contains(x['TicketID']), 'Odds'].values[0] 
                      if df['TicketStatus'].str.contains(x['TicketID']).any() 
                          else x['Odds'], axis=1)

I’m returning a TypeError: first argument must be string or compiled pattern, but can’t figure out what I’m doing wrong.

1 Like

Solved. TicketID is type INT and needs to be an Object/String

@vovavili could you please provide example code? Mine ran overnight for 10 hours and still hasn’t finished. There are 20 million rows in the dataframe. So would have to loop 400 million times minimum. Also I tested on a small sample and I have an error in the code as it hasn’t switched both odds columns, it just replaces the first with the second. (doesn’t replace the second with the first)

I think I slightly underestimated what you meant to accomplish at first. In fact, that’s actually quite a head-scratcher. My immediate thought process is the following:

  1. Extract digits from TicketStatus column

  2. Merge two columns together in a first dictionary. (Possibly, a structured NumPy array might be faster, not sure).

  3. Combine first needed columns, drop duplicates, keeping both duplicates in separate dataframes for a clever dictionary trick

  4. Create the second hash map

  5. Conditionally replace values

Here is my untested mockup:

# Create needed dictionary for a first pair of recoding

df["Ticket_Only_Digits"] = df["TicketStatus"].str.extract('(\d+)').apply(pd.to_numeric, errors='coerce')
df_temp = df.copy()
df_temp = df_temp.dropna(subset="Ticket_Only_Digits")
hash_map_1 = dict(zip(df_temp["Ticket_Only_Digits"], df_temp["Odds"]))

# The deduplication trick

df["Ticket_Only_Digits"] = df["Ticket_Only_Digits"].combine_first(df["TicketID"])
keys = df.loc[df.duplicated(subset="Ticket_Only_Digits", keep="first")].copy()
values = df.loc[df.duplicated(subset="Ticket_Only_Digits", keep="last")].copy()
hash_map_2 = dict(zip(keys["TicketID"], values["Odds"]))
df = df.drop("Ticket_Only_Digits", axis="columns")

# Map values
df["Odds"] = df["Odds"].mask(df["TicketID"].isin({*hash_map_1}), df["TicketID"].replace(hash_map_1))
df["Odds"] = df["Odds"].mask(df["TicketID"].isin({*hash_map_2}), df["TicketID"].replace(hash_map_2))

Please let me know if this works for you.

1 Like

@vovavili Get a ValueError: cannot convert float NaN to integer, even though no N/A values in the TicketStatus column? There are TicketStatus that have no Int though within the String. (Like in the original example ‘Hello’)

have uploaded a little sample of actual data here if any use: https://file.io/Qdp27U780XBL

First got AttributeError: ‘DataFrame’ object has no attribute ‘drop_na’

Removed the drop n/a part of line 3 so it read like this:

df_temp = df_temp(subset="Ticket_Only_Digits")

Now get Dataframe Object is not callable

Posting the datatypes from df.dtypes if any use:

TicketID int64
Odds float64
TicketStatus object
dtype: object

If I remove that line completely (which I now realise I should have :slight_smile: ) I get the following error

TypeError: Input must be Index or array-like

I’ve fixed syntax errors now that I have a data sample to work with instead of going from memory.

edit until can check

Yes, I’ve slightly fixed the logic by swapping keep_duplicates with df[df.duplicated]. Does it work now?

Yes I believe it works!

Oh dear though. I just realised now that there was a fatal flaw in my sample code

Deleted - new sample code

It turns out that we are looking for the String in the TicketID, instead of the other way around :face_with_diagonal_mouth: :face_with_diagonal_mouth:

The String will always appear in a row before the target TicketID.

image

When I run the code, it will replace Row 4’s odds with Row 2, but won’t Replace Row 2’s Odds with Row 4.

I’m sorry for wasting time and appreciate the help so far

1 Like