Pandas add the similar months in a column

shomikc · August 1, 2023, 12:49am

Hello experts,

I have csv file that looks like this
Capture

I want to get a boxplot like this
Capture1
Please help. Ihave no how to get the same months from a column and sum
Groupby only sums all the months in a year.

Ihave separated month and year like this

df1 = df.groupby(['Shoe_Sales', 'YearMonth']).sum().reset_index()
df1['year','month'] = df1['YearMonth'].apply(lambda x: str(x.split('-')))
df1[["year", "month"]] = df1["YearMonth"].str.split("-", expand = True)
df1 = df1.groupby(['year', 'month']).sum().reset_index()
#sums = gb.loc[gb.month.isin([1])].sum()            
#s2 = df1.groupby([lambda x: x.year, lambda x: x.month]).sum()

df1

Capture2

Please give me a hint. Do I use Lambda or .agg. Please help.

kknechtel · August 1, 2023, 5:21am

So, in the each group that you want to make, the month should be the same for everything that is in a group, but it does not matter what the year is.

This means that you want to group by month, and not group by year.

So this code should only mention the month column. Pandas lets us pass a single label, instead of making a list for it, so we can simply write:

df1 = df1.groupby('month').sum().reset_index()

Alternately, we can look at the code that was used to create the other columns:

Because of expand = True, it made two separate columns, and then they got assigned with separate year and month names.

This uses .apply on a Series (here, the YearMonth column of the original DataFrame). To get multiple separate column results, the lambda would need to return a Series (representing part of a row). However, we don’t want to make multiple columns. We only need a function that can tell us the month from the input data. That could look like lambda x: x.split('-')[1].

So, we can groupby on the basis of that logic. But there are some problems: if we want to use a function to say how to groupby, that function will be given the index value. So in our lambda code, we need to:

use the index to get the row (with .loc)
get the YearMonth cell from the row
finally split that on - and take the [1] element (which is the month)

Putting all of that together, we can make a small example:

>>> import pandas as pd
>>> df = pd.DataFrame([
...     ['1980-03', 109],
...     ['1980-04', 95],
...     ['1995-03', 188],
...     ['1995-04', 195]
... ], columns=['YearMonth', 'Shoe_Sales'])
>>> #                    vvvvvvvvv use the index to get the row
>>> #                             vvvvvvvvvv get that cell
>>> #                                       vvvvvvvvvvv split it
>>> #                                                  vvv get month
>>> df.groupby(lambda x: df.loc[x].YearMonth.split('-')[1]).sum()
    Shoe_Sales
03         297
04         290

The sales for the March 1980 and March 1995 were added together (109+188=297), and similarly for April (95+195).

shomikc · August 1, 2023, 6:05am

Thankyou for taking the time. You saved me maybe like 4 days(at least) . I thank you from the functioning part of my head. Thanks

shomikc · August 2, 2023, 1:43am

Hello knowledgeable people,

I am still stuck on my last question.

I didnt word my question properly.

My dataset looks like this.

Capture2

I am looking for

Capture1

I am thinking the Month column has to grouped using the duplicates and each group of duplicates has to plotted in a boxplot.

Is this correct.

Please help.

kknechtel · August 3, 2023, 4:05am

What exactly went wrong with the next steps, after you tried my suggestion?

shomikc · August 3, 2023, 5:01am

Yes. Thanks for replying. Yes I tried your suggestion. It works of course but I get the sum of all January months, the sum of
all February months of say 1980,1981 etc…

I think I need the individual values for each of the January months to make the boxplot for January, and February etc.

Anyway, not a problem. I will keep on trying till I get it.

If not then I guess I will come crying again.

And thanks so much for your help You are amazing.

Thanks and Regards,

Shomik Chakraborty.

shomikc · August 3, 2023, 8:33am

Hi. I have found a really childish way to do it.

Can you please have a look and tell me what I am failing to consider.

The code is like this.

df=pd.read_csv("shoesales.csv")
df1 = df.copy()

df1[['Year', 'Month']] = df1['YearMonth'].str.split('-', expand=True)


df1['Month'] = pd.to_numeric(df1['Month'], errors='coerce')
january = df1[df1['Month'] == 1]
january

plt.figure(figsize=(8, 6))
january.boxplot('Shoe_Sales')
plt.title('January')
plt.ylabel('Shoe Sales')
plt.show()

february = df1[df1['Month'] == 2]

plt.figure(figsize=(8, 6))
february.boxplot('Shoe_Sales')
plt.title('February')
plt.ylabel('Shoe Sales')
plt.show()

It works but I get seperate boxplots. I need to combine the boxplots one figure.

january

february

shomikc · August 5, 2023, 1:51am

Hello,

I have a csv file like this
csv

Now I am trying to get this
monthly sales

This is the code I have written

df=pd.read_csv("shoesales.csv")
df1[['Year', 'Month']] = df1['YearMonth'].str.split('-', expand=True)

f1['Month'] = pd.to_numeric(df1['Month'], errors='coerce')
#january 
january= df1[df1['Month'] == 1]
february= df1[df1['Month'] == 2]
march= df1[df1['Month'] == 3]
april= df1[df1['Month'] == 4]
may= df1[df1['Month'] == 5]
june= df1[df1['Month'] == 6]
july= df1[df1['Month'] == 7]
august= df1[df1['Month'] == 8]
september= df1[df1['Month'] == 9]
october= df1[df1['Month'] == 10]
november= df1[df1['Month'] == 11]
december= df1[df1['Month'] == 12]

january = january.rename(columns={'Shoe_Sales': 'January '})
february= february.rename(columns={'Shoe_Sales': 'February '})
march= march.rename(columns={'Shoe_Sales': 'March '})
april= april.rename(columns={'Shoe_Sales': 'April '})
may= may.rename(columns={'Shoe_Sales': 'May '})
june= june.rename(columns={'Shoe_Sales': 'June '})
july= july.rename(columns={'Shoe_Sales': 'July '})
august= august.rename(columns={'Shoe_Sales': 'August '})
september= september.rename(columns={'Shoe_Sales': 'Sept '})
october= october.rename(columns={'Shoe_Sales': 'Oct '})
november= november.rename(columns={'Shoe_Sales': 'Nov '})
december= december.rename(columns={'Shoe_Sales': 'Dec '})
# Step 2: Combine data into a single DataFrame
combined_df = (pd.concat([january['January '], february['February '], march['March '],april['April '],
                          may['May '], june['June '],july['July '],august['August '],september['Sept '],
                          october['Oct '],november['Nov '],december['Dec ']], axis=1))

# Step 3: Draw boxplots on the same plot
plt.figure(figsize=(8, 6))
sns.boxplot(data=combined_df)
plt.xlabel('Columns')
plt.ylabel('Values')
plt.title('Monthly Sales from years 1980 to 1997')
plt.show()

Please, I need your help to identify a better way. Though it works I am sure there must be a better way.

Thankyou,
Shomik C.