Function to sort a dataframe (based on asc date), find the top 30%, 40% and 60% of rows and flag them

rugbynutter · February 17, 2021, 11:27am

Dear Python Community,

I am relatively new to the application of functions and tend to stick to more basic UDF. So your help is greatly appreciated here (I want to use this as an opportunity to learn).

Part 1:
I need to write a function which first sorts a dataframe’s values based on date column in ascending order. The dataframe at present has just 7000 rows (as I have filtered to one month of data for this test, to make it easier).

I then need to loop over the data to find and then create a flag in the dataframe for the first 30% of rows (this will be the first 2100 rows in this case).

I would like to do this looping process to flag the rows in the 40%, 50%, 60%, 70% brackets, creating flags for each of them.

Part 2:
The problem then gets a bit more tricky in that the initial filter needs to be removed so that I can then apply to every month within the dataframe (I am less concerned about this right now, but general ideas would be welcomed).

Thanks in advance for any help

kind regards
James

rugbynutter · February 17, 2021, 2:26pm

def flag_dataframe(df):
df = df.reset_index() #to make sure the row index its still in the right order
df[‘flag’] = pd.Series(None) #create column flag
flags = [30,40,50,60,70,80,90,100] #the flag percentages
for i, row in df.iterrows(): #iterate through the dataframe, i is the index of the row, which is reset on the second line
for flag in flags:
if(i / len(df) * 100 <= flag): #check which flag is the right flag
df.loc[i, “flag”] = f"{flag}%" #setting the flag value of this row
break #break out of this loop so it wont override the flag value for another one
return df

tjol · February 19, 2021, 10:00am

Please use the code environment ("````") to make sure whitespace is preserved in your code examples!

Now to the actual problem:

One of the nice things about pandas (and numpy) is that you can set a lot of values at once without looping. (Of course there is still a loop, it’s just hidden in the C code where you can’t see it, which makes your code easier to read and a lot faster)

You could do something like this:

flag_percent = [30, 40, 50]
df["flag"] = ""
for flag in reversed(flag_percent):
    flag_index = len(df) * flag / 100.
    df["flag"].iloc[:flag_index] = f"{flag}%"

For the larger problem you might be able to use pandas’ groupby feature, but that can be quite tricky and I don’t really know if it makes sense here…