Using groupby() to group TV shows

I want to categorize by titles of TV shows, so I have decided to use groupby(). However, I tried new_df.groupby('Title').sum().reset_index(), but this does not what I want.

The following hypothetical data is provided:

data = {
'Title': ['Suit season 1: hdiuh', 'Suit season 3: dhdg', 'Suit season 1: hiwlp', 'prince seaon 2: ld2', 'prince season 2: hwio', 'prince season 2: jdioy', 'prince season 1: pwjjlk'],
'Duration': [00:00:78:93, 00:00:06:78, 00:00:78:93, 00:00:02:93, 00:00:82:09, 00:00:08:94, 00:00:03:60]
}
df = pd.DataFrame(data)

Here’s what I’m looking for:
Title Duration

Suit season 1 00:01:57:86

Suit season 3 00:00:06:78

prince seaon 2 00:00:93:96

prince season 1 00:00:03:60

What function can I use to get this data? I tried looking it up, but it doesn’t have anything similar.

There are a few problems here. First, Python doesn’t have literals like 00:00:78:93, so we can’t create the DataFrame with this code. If you use strings, then “summing” them won’t work the way you want. Even then, these strings aren’t in the right format to convert to timedelta. Finally, there’s a typo in the titles (“prince seaon 2”).

I guess you wanted:

data = {
'Title': ['Suit season 1: hdiuh', 'Suit season 3: dhdg', 'Suit season 1: hiwlp', 'prince season 2: ld2', 'prince season 2: hwio', 'prince season 2: jdioy', 'prince season 1: pwjjlk'],
'Duration': ['00:00:78.93', '00:00:06.78', '00:00:78.93', '00:00:02.93', '00:00:82.09', '00:00:08.94', '00:00:03.60']
}
df = pd.DataFrame(data)
df['Duration'] = pd.to_timedelta(df['Duration'])

The next step: groupby only groups exact matches. So we need to extract the part of the “titles” that actually tells us the season, so that we can group by season.

df['Season'] = df['Title'].str.split(':', n=1, expand=True)[0]

Now we can get what we want, by grouping by this new column, doing the sum, and taking only the durations:

>>> df.groupby('Season')['Duration'].sum()
Season
Suit season 1     0 days 00:02:37.860000
Suit season 3     0 days 00:00:06.780000
prince season 1   0 days 00:00:03.600000
prince season 2   0 days 00:01:33.960000
Name: Duration, dtype: timedelta64[ns]

Keep in mind that .sum() will sum each column. So the titles (and seasons!) in each group also get summed (it concatenates the strings), and we have to discard those. I don’t use .reset_index because I want to use the season group names to label the data.

1 Like