Pandas + Groupby + extract values from other column (oldest date when category repeated)

I have this dataset:

df=pd.DataFrame({'user':[1,1,2,2,2,3,3,3,3,3,4,4],
                  'date':['1995-09-01','1995-09-02','1995-10-03','1995-10-04','1995-10-05','1995-11-07','1995-11-08','1995-11-09','1995-11-10','1995-11-15','1995-12-18','1995-12-20'],
                  'type':['a','b','a','x','b','a','x','a','x','b','a','b']})

Which gives me:

user    date    type
1   1995-09-01  a
1   1995-09-02  b
2   1995-10-03  a
2   1995-10-04  x
2   1995-10-05  b
3   1995-11-07  a
3   1995-11-08  x
3   1995-11-09  a
3   1995-11-10  x
3   1995-11-15  b
4   1995-12-18  a
4   1995-12-20  b

and I want to get a new column, with dates based on a values. When a user has more than an value type a value , the date of the oldest a value of this user should be selected to show on the new column.

For example, user 3 has several a values on the type column. Thus, on the a_type_date column, the eldest date for the a value is chosen.

Here is the desired output:

user    date      type   a_type_date
    1   1995-09-01  a     1995-09-01
    1   1995-09-02  b     1995-09-01
    2   1995-10-03  a     1995-10-03
    2   1995-10-04  x     1995-10-03
    2   1995-10-05  b     1995-10-03
    3   1995-11-07  a     1995-11-07
    3   1995-11-08  x     1995-11-07
    3   1995-11-09  a     1995-11-07
    3   1995-11-10  x     1995-11-07
    3   1995-11-15  b     1995-11-07
    4   1995-12-18  a     1995-12-18
    4   1995-12-20  b     1995-12-18

Any help will be highly appreciated.

Hi Josep,
The first thing to need to do is to make sure the dates are in the correct date type, otherwise they will be treated as strings, and the sort order will be wrong. You can do this on creation of the df, or you can do
df['date'] = pd.to_datetime(df['date'])

After that, it is probably easiest to filter on just df['type']='a' if they are the only dates you care about. Then you can do a groupby then use min to get the earliest date for each user.

@Irjball, thanks. Date type was properly stated.
df['type']='a' will bring up all a values, however I am interested only in the most recent ones when an user has more than an avalue.
For example, user 3 has several a values on the type column. Thus, on the a_type_date column, the eldest date for the a value is chosen.
Solution implies using groupby.

Here is the answer:

  1. First, convert the date column into a datetime() index:

df.date = pd.to_datetime(df.date)

  1. Next use min() and list comprehension to create a new column containing your earliest (use max() for latest) type a dates per user as follows:

df['a_type_date'] = [min(df[(df.user==i)&(df.type=='a')]['date']).date() for i in df.user]

This will work, although will not scale well as you are filtering on the DataFrame multiple times. You would be better off using groupby as suggested above, something like:

a_dates_df = df[df['user'] = 'a'].groupby('user')['date'].min()

You can the use merge to join a_dates_df to the current df on the ‘user’ column.

1 Like

Thanks @lrjball

Your suggestion is right. Thank you very much for your time and help.

I really appreciate it.