Merge some rows in a pandas dataframe

Hi ,
It’s an easy math question. I want to know how to write the code clearly and efficiently (when the data size is large). We have passengers waiting for boats (in this example, on dates 1,2,3,4,5) but we have boats only on a few of these days (in this example, dates 1,3,5). The passengers arrive at the dock on the day with a boat will depart on that day; if the passenger arrive on other days will wait until the day when there is a boat. Given the data of passenger arrival dates, we want to find the passenger departure dates. Assume that there is a boat on the last day. The input/outputs are in pandas, and I wonder if there is a good way to improve my code. In particular, is there a way to do without a for loop (which I think may be more efficient)?

# Inputs: passenger waiting data and boat schedule
df_passenger_waiting = pd.DataFrame({'date': [5, 2, 3, 4, 1], 'name': [['Bill', 'John'], ['Chris', 'Bob'], ['Alice', 'Rob', 'Ed'], ['Albert'], ['Joe']]})
df_passenger_waiting = df_passenger_waiting.sort_values('date')
boat_date = [1, 5, 3]
boat_date.sort()

# initialize a pandas dataframe for passenger departure
df_passenger_departing = pd.DataFrame(columns = ['date', 'name'])
df_passenger_departing['date'] = df_passenger_waiting['date']

# Let the passenger arrive on no-boat days wait until there is a boat 
passenger_waiting_overnight = []
for i in range(len(df_passenger_departing)):
  df_passenger_departing['name'][i] = []
  if i in boat_date:
    df_passenger_departing['name'][i] += (df_passenger_waiting['name'][i] + passenger_waiting_overnight)
    passenger_waiting_overnight = []
  else:
    passenger_waiting_overnight += df_passenger_waiting['name'][i]

# drop the dates when there is no boat, but unfortunately the following line does not work
df_passenger_departing.dropna(inplace = True)   
  • Did you check that your current script works correctly? I have doubts about that.
  • In your loop you are ignoring an important Panda warning - it’s sometimes harmless, and annoying, but it always indicates a potentially serious problem that can mess up the actual behavior/result:
    A value is trying to be set on a copy of a slice from a DataFrame
    Whenever you see this, you probably should make sure that it is harmless. But better yet, just always rewrite code to prevent this warning.
  • I agree that looping is kind of bad style here: it leads to convoluted code. Already in this simple example the code is difficult to follow or verify just by code inspection. (And if under the covers pandas could use vectorization, then you’d also be missing out on that.)

So, how to improve… Hm, now I’ve to come off my little opinionated sandbox and do actual coding… :slight_smile:

First, I think you can simplify by keeping everything in just one dataframe:

df = pd.DataFrame({'date': [5, 2, 3, 4, 1], 
                   'name': [['Bill', 'John'], ['Chris', 'Bob'], ['Alice', 'Rob', 'Ed'], ['Albert'], ['Joe']]})

Then you add the dates in

dates = [1,5,3]
df["boat"] = df.date.apply(lambda x: x in dates)

Lists as values are kind of cumbersome in pandas, so I would explode that df:

df = df.explode("name")

This gives

   date    name   boat
0     5    Bill   True
0     5    John   True
1     2   Chris  False
1     2     Bob  False
2     3   Alice   True
2     3     Rob   True
2     3      Ed   True
3     4  Albert  False
4     1     Joe   True

Now it’s easy:

df["depart"] = df.apply(lambda row: row.date if row.boat else row.date + 1, 
                        axis=1)

And if you want to group those together again, you get

df.groupby('depart').agg({'name': list}).reset_index()

Which gives:

   depart                          name
0       1                         [Joe]
1       3  [Chris, Bob, Alice, Rob, Ed]
2       5          [Bill, John, Albert]

Or to also show that everything works as intended:

>>> df.groupby('depart').agg({'name': list, 'date': list}).reset_index()
   depart                          name             date
0       1                         [Joe]              [1]
1       3  [Chris, Bob, Alice, Rob, Ed]  [2, 2, 3, 3, 3]
2       5          [Bill, John, Albert]        [5, 5, 4]

The .apply function, btw, is not really more efficient than looping - under the covers it just loops. But it makes the code clearer, I think.

This should be more efficient then .apply:

df["boat"] = df.date.isin(dates)  # much better than the apply call
df["depart"] = df.date - df.boat.astype(int) + 1 # also better perhaps
                                                 # perhaps less clear?

The last two examples also illustrate why it’s better to bring all processing over into one dataframe. Anyway, apart from never using loops, it’s also always better (if possible) to prevent .apply calls. Better, in terms of both readability and performance.

There is a bug in the above code, since I didn’t exactly follow your spec, so for different input dates, the ‘depart’ col may not get the right values, but I’ll leave that to you to fix :rofl: (Easiest is to go back to using an apply, otherwise you have to define a helper function that operates on a Series.)

2 Likes

I didn’t immediately see a way to fix the bug without using .apply calls, but after a bit of experimentation, it turned out to be ridiculously simple (I just never before had had an opportunity to use the ‘bfill’ method of replacing values):

import pandas as pd
import numpy as np

df = pd.DataFrame({'date': [5, 2, 3, 4, 1], 
                   'name': [['Bill', 'John'], ['Chris', 'Bob'], 
                            ['Alice', 'Rob', 'Ed'], ['Albert'], ['Joe']]})
df.sort_values("date", inplace=True)  # the sort is required for correctness later

boat_depart_dates = [1, 2, 5]  # note that I changed this, and updated the name

# first a partial solution for dates when there are boats
df["depart"] = df.date.isin(boat_depart_dates) * df.date

# complete solution, including for passengers arriving earlier at the dock
df.depart = df.depart.replace(0, method="bfill")

To see how this works, print out the dfs after each operation or sub-operation. Implicit assumption here is of course that a date is always a positive, non-zero integer.
In this solution, I’m not grouping by departure date, but if you wanted to do that, that’s pretty straightforward, based on the final dataframe:

df.explode('name').groupby('depart').agg(list)  # or agg(set) if you prefer

Every time I do anything with pandas, I’m again amazed how beautiful, powerful - and ultimately how simple – those pandas APIs are…

1 Like