Ranking and Filtering by Rank

Hi,

Is there a convenient way to sort and rank a dataframe on one column to retrieve only the data that is in EITHER the TOP 10 OR the BOTTOM 10 of values by that column? I think I could do it with multiple passes, but I am wondering if there is a more efficient and/or elegant solution. I need BOTH the TOP AND BOTTOM N rows.

Thank you in advance…

# keep rows whose "col" value falls in the top or bottom 10% of the distribution
top_bottom_10_percent = df.loc[
    (df["col"] >= df["col"].quantile(0.9)) | (df["col"] <= df["col"].quantile(0.1))
]

Thank you. I will try to work that into my code. Now, your suggestion is based on quantile, which really gives the top and bottom deciles, meaning the actual number of returned rows will vary depending on how many rows are in the original dataframe. I have a RANK column based on ranking the dataframe on a column’s value. Is there a way to conveniently RANK ascending and take the top 10, then RANK descending and take that new top 10, which would be the bottom of the original?
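
Something like this is what I have in mind, just a rough sketch, assuming the raw values are in "col" as in your snippet:

# df is the original dataframe; "col" is the column the RANK is based on (name assumed)
asc_rank = df["col"].rank(method="first", ascending=True)    # rank 1 = smallest value
desc_rank = df["col"].rank(method="first", ascending=False)  # rank 1 = largest value

# rows in the top 10 of either ranking, i.e. the 10 smallest and 10 largest values
top_bottom_10 = df[(asc_rank <= 10) | (desc_rank <= 10)]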


One way is to just use .iloc on your sorted DataFrame, passing in the numbers 0 through 9 and -10 through -1:

df.sort_values('SomeColumn').iloc[list(range(0, 10)) + list(range(-10, 0))]

Another way is to concatenate the head and tail of the sorted df:

import pandas as pd

sorted_df = df.sort_values("X")
pd.concat([sorted_df.head(10), sorted_df.tail(10)], axis=0)

I see. There is an old trick with np.r_ to do this elegantly:

import pandas as pd
import numpy as np

# np.r_ builds the positional index array [0, 1, ..., 9, -10, -9, ..., -1] in one go
indexes = np.r_[0:10, -10:0]
top_bottom_10 = df.sort_values("RANKED", ignore_index=True).iloc[indexes]

AWESOME INPUT!!! Thank you so much! Now let me get greedy! 🙂 IF my TOP 10 are ‘Optimal’ along some scale and my BOTTOM 10 are ‘SUB Optimal’ on that scale, how can I add a text field to the final output dataframe that reflects that??


top_bottom_10["ranking"] = np.where(top_bottom_10.index <= 10, "SUB Optimal", "Optimal")

Thanks again! One important detail I neglected to mention (my bad) is that those rankings are WITHIN a groupby on DATE, so I need the top and bottom 10 for EACH day written out to the final dataframe. I gave the numpy code a quick try and it appears to yield the top and bottom for the entire dataframe. Can this work WITHIN the group by??

If performance is not a concern, you can apply a function for each group while resetting the index.

def get_top_bottom_10(group):
    # sort the group's rows, renumber them 0..n-1, then take the first and last 10
    group = (
        group.sort_values("RANKED", ignore_index=True)
        .iloc[np.r_[0:10, -10:0]]
    )
    # the first 10 positions of the ascending sort get "SUB Optimal", the last 10 "Optimal"
    group["ranking"] = np.where(group.index < 10, "SUB Optimal", "Optimal")
    return group


top_bottom_10 = (
    df.groupby("date", as_index=False).apply(get_top_bottom_10).reset_index(drop=True)
)

This is not tested.
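
If you want a quick sanity check before wiring it into your real data, a throwaway frame like this should come back with 10 "Optimal" and 10 "SUB Optimal" rows per date (the data is made up, and it assumes the get_top_bottom_10 function above is already defined):

import numpy as np
import pandas as pd

# made-up test data: two dates, 25 rows each, RANKED shuffled within each day
rng = np.random.default_rng(0)
test_df = pd.DataFrame(
    {
        "date": np.repeat(pd.to_datetime(["2024-01-01", "2024-01-02"]), 25),
        "RANKED": np.concatenate([rng.permutation(25), rng.permutation(25)]),
    }
)

result = (
    test_df.groupby("date", as_index=False)
    .apply(get_top_bottom_10)
    .reset_index(drop=True)
)
print(result.groupby(["date", "ranking"]).size())  # expect 10 of each label per date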


Thank you! I’ll get to work on this and try testing and fitting things together. IF performance IS a concern, I’m guessing the whole approach has to be different?

If performance is a concern, use pandas’ built-in groupby methods. You’d group by date, compute the two ends with the nlargest and nsmallest methods respectively, combine the results just like @BrenBarn has shown, and get the original columns back with a left join (if you need them, of course). The trick here is that, on a groupby, nlargest and nsmallest are only available on grouped Series, not grouped DataFrames:

grouped = df.sort_values(["date", "RANKED"]).groupby("date")["RANKED"]
top_10 = pd.DataFrame(grouped.nlargest(10)).assign(ranking="Optimal")
bottom_10 = pd.DataFrame(grouped.nsmallest(10)).assign(ranking="SUB Optimal")
top_bottom_10 = (
    pd.concat([top_10, bottom_10])
    .merge(df, on=["date", "RANKED"], how="left")
    .sort_values(["date", "RANKED"], ascending=[True, False], ignore_index=True)
)
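
One note on why the merge works: grouped.nlargest(10) and grouped.nsmallest(10) return Series indexed by (date, original row label), so date ends up as an index level rather than a column. Recent pandas versions let merge join on index level names as well as columns, which is how on=["date", "RANKED"] picks it up.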