Ranking and Filtering by Rank

Hi,

Is there a convenient way to sort and rank a dataframe on one column to retrieve only the data that is in EITHER the TOP 10 OR the BOTTOM 10 of values by that column? I think I could do it with multiple passes, but I am wondering if there is a more efficient and/or elegant solution. I need BOTH the TOP AND BOTTOM N rows.

Thank you in advance…

# keep rows whose "col" value falls in the top or bottom 10% of the distribution
top_bottom_10_percent = df.loc[
    (df["col"] >= df["col"].quantile(0.9)) | (df["col"] <= df["col"].quantile(0.1))
]

Thank you. I will try to work that into my code. Now, your suggestion is based on quantile, which really gives the top and bottom deciles, meaning the actual number of returned rows will vary depending on how many rows are in the original dataframe. I have a RANK column based on ranking the dataframe on a column’s value. Is there a way to conveniently RANK ascending and take the top 10, then RANK descending and take that new top 10, which would be the bottom of the original?
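
Something like this is what I have in mind, just a rough sketch, assuming the raw values are in "col" as in your snippet:

# df is the original dataframe; "col" is the column the RANK is based on (name assumed)
asc_rank = df["col"].rank(method="first", ascending=True)    # rank 1 = smallest value
desc_rank = df["col"].rank(method="first", ascending=False)  # rank 1 = largest value

# rows in the top 10 of either ranking, i.e. the 10 smallest and 10 largest values
top_bottom_10 = df[(asc_rank <= 10) | (desc_rank <= 10)]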


One way is to just use .iloc on your sorted DataFrame, passing in the numbers 0 through 9 and -10 through -1:

df.sort_values('SomeColumn').iloc[list(range(0, 10)) + list(range(-10, 0))]

Another way is to concatenate the head and tail of the sorted df:

import pandas as pd

sorted_df = df.sort_values("X")
pd.concat([sorted_df.head(10), sorted_df.tail(10)], axis=0)

I see. There is an old trick with np.r_ to do this elegantly:

import pandas as pd
import numpy as np

# np.r_ builds the positional index array [0, 1, ..., 9, -10, -9, ..., -1] in one go
indexes = np.r_[0:10, -10:0]
top_bottom_10 = df.sort_values("RANKED", ignore_index=True).iloc[indexes]

AWESOME INPUT!!! Thank you so much! Now let me get greedy! 🙂 IF my TOP 10 are ‘Optimal’ along some scale and my BOTTOM 10 are ‘SUB Optimal’ on that scale, how can I add a text field to the final output dataframe that reflects that??


top_bottom_10["ranking"] = np.where(top_bottom_10.index <= 10, "SUB Optimal", "Optimal")

Thanks again! One important detail I neglected to mention (my bad) is that those rankings are WITHIN a groupby on DATE, so I need the top and bottom 10 for EACH day written out to the final dataframe. I gave the numpy code a quick try and it appears to yield the top and bottom for the entire dataframe. Can this work WITHIN the group by??

If performance is not a concern, you can apply a function for each group while resetting the index.

def get_top_bottom_10(group):
    # sort the group's rows, renumber them 0..n-1, then take the first and last 10
    group = (
        group.sort_values("RANKED", ignore_index=True)
        .iloc[np.r_[0:10, -10:0]]
    )
    # the first 10 positions of the ascending sort get "SUB Optimal", the last 10 "Optimal"
    group["ranking"] = np.where(group.index < 10, "SUB Optimal", "Optimal")
    return group


top_bottom_10 = (
    df.groupby("date", as_index=False).apply(get_top_bottom_10).reset_index(drop=True)
)

This is not tested.
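
If you want a quick sanity check before wiring it into your real data, a throwaway frame like this should come back with 10 "Optimal" and 10 "SUB Optimal" rows per date (the data is made up, and it assumes the get_top_bottom_10 function above is already defined):

import numpy as np
import pandas as pd

# made-up test data: two dates, 25 rows each, RANKED shuffled within each day
rng = np.random.default_rng(0)
test_df = pd.DataFrame(
    {
        "date": np.repeat(pd.to_datetime(["2024-01-01", "2024-01-02"]), 25),
        "RANKED": np.concatenate([rng.permutation(25), rng.permutation(25)]),
    }
)

result = (
    test_df.groupby("date", as_index=False)
    .apply(get_top_bottom_10)
    .reset_index(drop=True)
)
print(result.groupby(["date", "ranking"]).size())  # expect 10 of each label per date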


Thank you! I’ll get to work on this and try testing and fitting things together. IF performance IS a concern, I’m guessing the whole approach has to be different?

If performance is a concern, use pandas’ built-in groupby methods. You’d group by date, compute the two ends with the nlargest and nsmallest methods respectively, combine the results just like @BrenBarn has shown, and get the original columns back with a left join (if you need them, of course). The trick here is that, on a groupby, nlargest and nsmallest are only available on grouped Series, not grouped DataFrames:

grouped = df.sort_values(["date", "RANKED"]).groupby("date")["RANKED"]
top_10 = pd.DataFrame(grouped.nlargest(10)).assign(ranking="Optimal")
bottom_10 = pd.DataFrame(grouped.nsmallest(10)).assign(ranking="SUB Optimal")
top_bottom_10 = (
    pd.concat([top_10, bottom_10])
    .merge(df, on=["date", "RANKED"], how="left")
    .sort_values(["date", "RANKED"], ascending=[True, False], ignore_index=True)
)
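
One note on why the merge works: grouped.nlargest(10) and grouped.nsmallest(10) return Series indexed by (date, original row label), so date ends up as an index level rather than a column. Recent pandas versions let merge join on index level names as well as columns, which is how on=["date", "RANKED"] picks it up.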