Is there a convenient way to sort and rank a dataframe on one column to retrieve only the data that is in EITHER the TOP 10 OR the BOTTOM 10 of values by that column? I think I could do it with multiple passes but wondering if there is a more efficient and/or elegant solution. I need BOTH the TOP AND BOTTOM N rows.
Thank you. I will try to work that into my code. Now your suggestion is based on quantile which really gives the top and bottom within quantile, meaning the actual number of returned rows might vary depending on how many are in the original dataframe. I have a RANK column based on ranking the dataframe on a column’s value. Is there a way to conveniently RANK ascending and take top 10, then RANK descending and then include the new top 10 which would be the bottom of the original?
AWESOME INPUT!!! Thank you so much! Now let me get greedy! — IF my TOP 10 are ‘Optimal’ along some scale and my BOTTOM 10 are ‘SUB Optimal’ on that scale, how can I insert a text field in the final outputted dataframe that reflects that??
Thanks again! One important detail I neglected to mention(my bad) is that those rankings are WITHIN a groupby on DATE, so I need to have top and bottom 10 for EACH day written out to the final dataframe. I gave a quick try of the numpy code and it appears it yields top and bottom for the entire dataframe. Can this work WITHIN the group by??
Thank you! I’ll get to work on this and try testing and fitting things together. IF performance IS a concern I’m guessing the whole approach has to be different?
If performance is a concern, use pandas’ built-in groupby methods. You’d have to aggregate by date, calculate twice using nlargest and nsmallest methods respectively, combine calculations just like @BrenBarn has shown and get older values back using a left join (if you need them, of course). The trick here is that nlargest and nsmallest are exclusive to grouped Series, not DataFrames: