So, in the each group that you want to make, the month should be the same for everything that is in a group, but it does not matter what the year is.
This means that you want to group by month, and not group by year.
So this code should only mention the month column. Pandas lets us pass a single label, instead of making a list for it, so we can simply write:
df1 = df1.groupby('month').sum().reset_index()
Alternately, we can look at the code that was used to create the other columns:
Because of expand = True
, it made two separate columns, and then they got assigned with separate year
and month
names.
This uses .apply
on a Series
(here, the YearMonth
column of the original DataFrame
). To get multiple separate column results, the lambda
would need to return a Series
(representing part of a row). However, we don’t want to make multiple columns. We only need a function that can tell us the month from the input data. That could look like lambda x: x.split('-')[1]
.
So, we can groupby
on the basis of that logic. But there are some problems: if we want to use a function to say how to groupby
, that function will be given the index value. So in our lambda
code, we need to:
- use the index to get the row (with
.loc
)
- get the
YearMonth
cell from the row
- finally split that on
-
and take the [1]
element (which is the month)
Putting all of that together, we can make a small example:
>>> import pandas as pd
>>> df = pd.DataFrame([
... ['1980-03', 109],
... ['1980-04', 95],
... ['1995-03', 188],
... ['1995-04', 195]
... ], columns=['YearMonth', 'Shoe_Sales'])
>>> # vvvvvvvvv use the index to get the row
>>> # vvvvvvvvvv get that cell
>>> # vvvvvvvvvvv split it
>>> # vvv get month
>>> df.groupby(lambda x: df.loc[x].YearMonth.split('-')[1]).sum()
Shoe_Sales
03 297
04 290
The sales for the March 1980 and March 1995 were added together (109+188=297), and similarly for April (95+195).