Pandas groupby doesn't work

marian · February 27, 2023, 8:50pm

hello, any pandas guru here?
I’m trying to groupby a dataframe without any aggregation. I’ve found some solutions, but I’m getting some weird outputs.

import pandas as pd
df = pd.DataFrame( {'key': ['A', 'B', 'A', 'B'], 'value': [2,2,1,1]})
print(df)
print(df.groupby('key').nth[:]) # only this one works
print(df.groupby('key').head()) # doesn't work
print(df.groupby('key').filter(lambda x:True)) # doesn't work
print(df.groupby('key').apply(lambda x:x)) # doesn't work

In the cases where it doesn’t work, it does not throw an error; it just prints the original df dataframe. Any suggestions why?

p.s. the dataframe example and one of the solutions is picked from StackOverflow, but the other guy also commented that he was unable to reproduce the suggested…

CAM-Gerlach · March 1, 2023, 6:31am

It’s not really clear what you mean by “doesn’t work” or exactly what you’re asking here or actually trying to achieve, sorry. This seems like an XY problem—groupby is specifically for split-apply-combine aggregation, so if you aren’t looking to aggregate, another tool is likely more appropriate for the job rather than employing a hacky workaround with groupby.
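For contrast, here is a minimal sketch of the split-apply-combine use that groupby is designed for, using the dataframe from the question. The choice of sum() as the aggregation is only an illustration and is not something the original post asked for.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'A', 'B'], 'value': [2, 2, 1, 1]})

# Split the rows by 'key', apply sum() to each group's 'value' column,
# then combine the per-group results into one row per key.
totals = df.groupby('key')['value'].sum()
print(totals)
```

The result has one entry per group rather than one per original row, which is why a non-aggregating "groupby" tends to need a different tool.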

What I would suggest instead that produces identical output to df.groupby('key').nth[:] (at least on the given dataframe) but is less hacky and more clear and explicit is:

df.set_index('key').sort_index()

Observe:

>>> print(df.groupby('key').nth[:])
     value
key       
A        2
A        1
B        2
B        1
>>> df.set_index('key').sort_index()
     value
key       
A        2
A        1
B        2
B        1

As for why the other calls just print the original dataframe: each of them is doing exactly what it is specified and intended to do, as described in the Pandas documentation:

GroupBy.apply() passes the subsets from the original dataframe group-wise to the function, which are left as is, and then recombines them in the same order as the original index
DataFrameGroupBy.filter() works similarly to apply(), except with a function that returns a boolean of whether the group should be included or not (with all groups being included per the above)
GroupBy.head(n) “returns a subset of rows from the original DataFrame with original index and order preserved”, specifically the first n rows in each group
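A small sketch that checks the behavior described above for head() and filter() on the dataframe from the question. apply() is left out on purpose: its output for a function like lambda x: x depends on the group_keys default, which changed between pandas versions, so the result may not match what was seen in this thread.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'A', 'B'], 'value': [2, 2, 1, 1]})

# head() keeps the first n rows of each group (default n=5), but returns them
# with the original index and in the original row order.
head_result = df.groupby('key').head()

# filter() keeps or drops whole groups; a function that always returns True
# keeps every group, again in the original row order.
filter_result = df.groupby('key').filter(lambda g: True)

print(head_result.equals(df))
print(filter_result.equals(df))
```

Since every group here has fewer than five rows and every group passes the filter, both results should be indistinguishable from df itself, which matches the "it just prints the original dataframe" observation.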

marian · March 1, 2023, 12:27pm

Okay, I will simplify this - consider having the following dataframe:

df = pd.DataFrame({'Animal': ['Parrot', 'Falcon', 'Sparrow', 
                              'Parrot', 'Falcon', 'Sparrow'],
                   'Max Speed': [24., 380., None, 
                                 26., 370., None ]})
'''
    Animal  Max Speed
0   Parrot       24.0
1   Falcon      380.0
2  Sparrow        NaN
3   Parrot       26.0
4   Falcon      370.0
5  Sparrow        NaN
'''

How can I get the rows grouped into this desired output?
(I don’t mean sorted like sort_values(), I want Animals to appear in order they are encountered… )

    Animal  Max Speed
0   Parrot       24.0
3   Parrot       26.0
1   Falcon      380.0
4   Falcon      370.0
2  Sparrow        NaN
5  Sparrow        NaN

Thank you.

tmk · March 1, 2023, 2:09pm

Best I could come up with:

>>> import pandas as pd
>>> df = pd.DataFrame({'Animal': ['Parrot', 'Falcon', 'Sparrow', 
...                               'Parrot', 'Falcon', 'Sparrow'],
...                    'Max Speed': [24., 380., None, 
...                                  26., 370., None ]})
>>> df.set_index("Animal").loc[df["Animal"].unique()].reset_index()
    Animal  Max Speed
0   Parrot       24.0
1   Parrot       26.0
2   Falcon      380.0
3   Falcon      370.0
4  Sparrow        NaN
5  Sparrow        NaN

EDIT: removed .sort_index() which was superfluous as pointed out by C.A.M. Gerlach below.

CAM-Gerlach · March 1, 2023, 2:47pm

Okay, thanks for clarifying—those specific constraints weren’t specified previously, and don’t match your previously stated “working solution” (nor is the motivation for them entirely clear).

You can skip the sort_index() call as the .loc reindexes anyway. Also, the above doesn’t preserve the index values as in @marian's expected output. For that, it needs to either save and restore the existing index values or (cleaner, as I do below) keep them as a multi-index level. Thus:

>>> df.set_index("Animal", append=True).swaplevel().loc[df["Animal"].unique()].reset_index(level=0)
    Animal  Max Speed
0   Parrot       24.0
3   Parrot       26.0
1   Falcon      380.0
4   Falcon      370.0
2  Sparrow        NaN
5  Sparrow        NaN
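For comparison, the other option mentioned above (saving and restoring the existing index values) might look like the following sketch. It assumes the dataframe has a default, unnamed index and no existing column called "index", since reset_index() uses that name for the saved values; neither assumption is stated in the thread.

```python
import pandas as pd

df = pd.DataFrame({'Animal': ['Parrot', 'Falcon', 'Sparrow',
                              'Parrot', 'Falcon', 'Sparrow'],
                   'Max Speed': [24., 380., None,
                                 26., 370., None]})

result = (
    df.reset_index()                  # save the index as a column named "index"
      .set_index("Animal")
      .loc[df["Animal"].unique()]     # pull groups in order of first appearance
      .reset_index()                  # turn "Animal" back into a column
      .set_index("index")             # restore the saved index values
      .rename_axis(None)              # drop the temporary index name
)
print(result)
```

This takes more steps than the multi-index version, which is one reason to prefer keeping the index as an extra level.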

tmk · March 1, 2023, 3:12pm

Ah yes, of course.

You can also skip the .swaplevel() if you use .loc[] like this:

df.set_index("Animal", append=True).loc[:, df["Animal"].unique(), :].reset_index(level=1)

but we’re micro-optimizing now.

marian · March 11, 2023, 12:12pm

just a follow-up,

(@CAM-Gerlach my motivation is simple, I am just learning.)

I have posted this question also on pandas github (groupby weird behavior · Issue #51692 · pandas-dev/pandas · GitHub).
There, someone came up with some elegant solutions to arrange ‘Animals’… using pandas.Categorical, or sort_values by key.

So thank you all for comments, and let’s consider this as solved.
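The exact code from the GitHub issue is not reproduced in this thread, so the following is only a sketch of what the two approaches named above could look like. The helper names rank, cat, by_key and by_cat are made up for illustration, and a stable sort is requested so that rows within each animal keep their original relative order.

```python
import pandas as pd

df = pd.DataFrame({'Animal': ['Parrot', 'Falcon', 'Sparrow',
                              'Parrot', 'Falcon', 'Sparrow'],
                   'Max Speed': [24., 380., None,
                                 26., 370., None]})

# Option 1: sort_values with a key that maps each animal to the position
# at which it was first encountered.
rank = {name: i for i, name in enumerate(df["Animal"].unique())}
by_key = df.sort_values("Animal", key=lambda s: s.map(rank), kind="stable")

# Option 2: an ordered Categorical whose categories are listed in
# first-encountered order; its integer codes then give the group order.
cat = pd.Categorical(df["Animal"], categories=df["Animal"].unique(), ordered=True)
by_cat = df.iloc[cat.codes.argsort(kind="stable")]

print(by_key)
print(by_cat)
```

Both variants keep the original index values and leave the Animal column as plain strings, so no index juggling is needed afterwards.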