hello, any pandas guru here?
I’m trying to groupby a dataframe without any aggregation. I’ve found some solutions, but I’m getting some weird outputs.
import pandas as pd
df = pd.DataFrame({'key': ['A', 'B', 'A', 'B'], 'value': [2, 2, 1, 1]})
print(df)
print(df.groupby('key').nth[:])  # only this one works
print(df.groupby('key').head())  # doesn't work
print(df.groupby('key').filter(lambda x: True))  # doesn't work
print(df.groupby('key').apply(lambda x: x))  # doesn't work
In the cases where it doesn’t work, it does not throw an error; it just prints the original df dataframe. Any suggestions why?
P.S. The dataframe example and one of the solutions are taken from StackOverflow, but another commenter there also said that he was unable to reproduce the suggested…
It’s not really clear what you mean by “doesn’t work” or exactly what you’re asking here or actually trying to achieve, sorry. This seems like an XY problem—groupby is specifically for split-apply-combine aggregation, so if you aren’t looking to aggregate, another tool is likely more appropriate for the job rather than employing a hacky workaround with groupby.
What I would suggest instead, which produces identical output to df.groupby('key').nth[:] (at least on the given dataframe) but is less hacky, clearer and more explicit, is:
df.set_index('key').sort_index()
Observe:
>>> print(df.groupby('key').nth[:])
     value
key
A        2
A        1
B        2
B        1
>>> df.set_index('key').sort_index()
     value
key
A        2
A        1
B        2
B        1
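If the order of rows within each key matters, it may be safer to ask for a stable sort rather than rely on the default algorithm. Here is a minimal sketch of the same set_index/sort_index idea on the dataframe from the question; the kind='mergesort' argument is my addition and not part of the suggestion above.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'A', 'B'], 'value': [2, 2, 1, 1]})

# Move 'key' into the index, then sort by it. Mergesort is a stable
# algorithm, so rows sharing a key keep their original relative order.
result = df.set_index('key').sort_index(kind='mergesort')
print(result)
```

Note that this drops the original 0..3 index labels, since 'key' replaces them.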
As for why the other calls just print the original dataframe: each of them is doing exactly what it is specified and intended to do, as described in the pandas documentation:
GroupBy.apply() passes the subsets from the original dataframe group-wise to the function, which are left as is, and then recombines them in the same order as the original index
DataFrameGroupBy.filter() works similarly to apply(), except with a function that returns a boolean of whether the group should be included or not (and lambda x: True includes every group, so nothing is dropped)
GroupBy.head(n) “returns a subset of rows from the original DataFrame with original index and order preserved”, specifically the first n rows in each group
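To make this concrete, here is a small sketch using the dataframe from the question. It checks that head() and filter() hand back the rows with their original index and order, and then iterates over the GroupBy object to show the per-key subsets that actually exist. I have left apply() out on purpose, since as far as I know its output shape has differed between pandas versions; the variable names are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'A', 'B'], 'value': [2, 2, 1, 1]})
grouped = df.groupby('key')

# head() and filter() act as row filters: surviving rows keep their
# original index labels and original order, so nothing is regrouped.
head_result = grouped.head()                    # default n=5 keeps every row here
filter_result = grouped.filter(lambda g: True)  # every group passes the filter

print(head_result.equals(df))
print(filter_result.equals(df))

# Iterating the GroupBy object exposes the per-key subsets directly.
subsets = {key: subset for key, subset in grouped}
for key, subset in subsets.items():
    print(key, subset.index.tolist())
```

So the grouping does happen internally; these methods simply return rows in the original order, which is why the printed result looks like the untouched df.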
How can I group the rows to get this desired output?
(I don’t mean sorted as with sort_values(); I want the animals to appear in the order they are first encountered… )
    Animal  Max Speed
0   Parrot       24.0
3   Parrot       26.0
1   Falcon      380.0
4   Falcon      370.0
2  Sparrow        NaN
5  Sparrow        NaN
Okay, thanks for clarifying—those specific constraints weren’t specified previously, and don’t match your previously stated “working solution” (nor is the motivation for them entirely clear).
You can skip the sort_index() call, as the .loc reindexes anyway. Also, the above doesn’t preserve the index values as in @marian's expected output. For that, it needs to either save and restore the existing index values or (cleaner, as I do below) keep them as a multi-index level. Thus:
>>> df.set_index("Animal", append=True).swaplevel().loc[df["Animal"].unique()].reset_index(level=0)
    Animal  Max Speed
0   Parrot       24.0
3   Parrot       26.0
1   Falcon      380.0
4   Falcon      370.0
2  Sparrow        NaN
5  Sparrow        NaN
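For comparison, here is a different sketch that reaches the same row order without touching the index at all, using pd.factorize plus a stable argsort. This is an alternative technique, not the approach shown above, and the input dataframe is my reconstruction from the index labels in the expected output, so treat it as an assumption.

```python
import numpy as np
import pandas as pd

# Assumed input, reconstructed from the index labels in the expected output.
df = pd.DataFrame({
    "Animal": ["Parrot", "Falcon", "Sparrow", "Parrot", "Falcon", "Sparrow"],
    "Max Speed": [24.0, 380.0, np.nan, 26.0, 370.0, np.nan],
})

# factorize() numbers each animal by order of first appearance
# (Parrot -> 0, Falcon -> 1, Sparrow -> 2).
codes, _ = pd.factorize(df["Animal"])

# A stable argsort on those codes groups equal animals together while
# keeping their original relative order within each group.
order = np.argsort(codes, kind="stable")

# Positional selection keeps the original index labels attached to each row.
grouped = df.iloc[order]
print(grouped)
```

Because the rows are only reordered by position, the original index labels come along for free and no reset_index step is needed.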