Filling NaN values with the average of the period before and after

Purple_Leprichaun · June 5, 2023, 12:49pm

I have a big dataframe that consists of financial data. Now, this dataframe contains some columns with categorical values and with standard numerical values. I want to fill the NaN values of numerical columns with less than 20k NaN values with the average of the value of that column in the period before and after (and it needs to be from the same company which makes it even more complex). And only those periods, so if the period is 2022-Q3, the before value should always be taken from 2022-Q2.

I tried the following code:

def average_col(merged_df_nan):
    #Filling NaN values for columns with less than 20k NaNs with the average of the period before and after
    na_counts = merged_df_nan.isna().sum()
    columns_to_fill2 = na_counts[(na_counts < 20000) & (~na_counts.index.isin(['Text', 'gvkey', 'datadate', 'prccq', 'naics', 'sic', 'processed_text', 'stemmed_text', 'sentiment_score', 'fyr', 'indfmt','consol','popsrc','tic','conm','apdedateq','fdateq','rdq']))].index
    
    #Iterate over the filtered columns and fill NA values with the average of previous and next non-NA values
    for column in tqdm(columns_to_fill2, desc = 'Columns done'):
        na_mask = merged_df_nan[column].isna()
        for index, row in merged_df_nan.iterrows():
            if na_mask[index]:
                fyearq = row['fyearq']
                fqtr = row['fqtr']
                cik = row['cik']
                   
                #Looking for the values and taking the average of them
                prev_value = merged_df_nan.loc[(merged_df_nan['cik'] == cik) & ((merged_df_nan['fyearq'] < fyearq) | ((merged_df_nan['fyearq'] == fyearq) & (merged_df_nan['fqtr'] < fqtr)))].sort_values(['fyearq', 'fqtr'], ascending=[False, False]).head(1)[column].item()
                next_value = merged_df_nan.loc[(merged_df_nan['cik'] == cik) & ((df['fyearq'] > fyearq) | ((merged_df_nan['fyearq'] == fyearq) & (merged_df_nan['fqtr'] > fqtr)))].sort_values(['fyearq', 'fqtr'], ascending=[True, True]).head(1)[column].item()
                average_value = np.mean([prev_value, next_value])
                merged_df_nan.loc[index, column] = average_value
                
    return merged_df_nan

I have a big dataframe that consists of financial data. Now, this dataframe contains some columns with categorical values and with standard numerical values. I want to fill the NaN values of numerical columns with less than 20k NaN values with the average of the value of that column in the period before and after (and it needs to be from the same company which makes it even more complex). And only those periods, so if the period is 2022-Q3, the before value should always be taken from 2022-Q2.

I tried the following code:

def average_col(merged_df_nan):
    #Filling NaN values for columns with less than 20k NaNs with the average of the period before and after
    na_counts = merged_df_nan.isna().sum()
    columns_to_fill2 = na_counts[(na_counts < 20000) & (~na_counts.index.isin(['Text', 'gvkey', 'datadate', 'prccq', 'naics', 'sic', 'processed_text', 'stemmed_text', 'sentiment_score', 'fyr', 'indfmt','consol','popsrc','tic','conm','apdedateq','fdateq','rdq']))].index
    
    #Iterate over the filtered columns and fill NA values with the average of previous and next non-NA values
    for column in tqdm(columns_to_fill2, desc = 'Columns done'):
        na_mask = merged_df_nan[column].isna()
        for index, row in merged_df_nan.iterrows():
            if na_mask[index]:
                fyearq = row['fyearq']
                fqtr = row['fqtr']
                cik = row['cik']
                   
                #Looking for the values and taking the average of them
                prev_value = merged_df_nan.loc[(merged_df_nan['cik'] == cik) & ((merged_df_nan['fyearq'] < fyearq) | ((merged_df_nan['fyearq'] == fyearq) & (merged_df_nan['fqtr'] < fqtr)))].sort_values(['fyearq', 'fqtr'], ascending=[False, False]).head(1)[column].item()
                next_value = merged_df_nan.loc[(merged_df_nan['cik'] == cik) & ((df['fyearq'] > fyearq) | ((merged_df_nan['fyearq'] == fyearq) & (merged_df_nan['fqtr'] > fqtr)))].sort_values(['fyearq', 'fqtr'], ascending=[True, True]).head(1)[column].item()
                average_value = np.mean([prev_value, next_value])
                merged_df_nan.loc[index, column] = average_value
                
    return merged_df_nan

But it is both incredibly slow and I am not sure if it really does what I want it to do (because I couldn’t test it due to the processing time being over 20 hours).

Is there anyone that could help me to create what I want to create?