Groupby on a column that exists in a data frame raises KeyError

elenora · July 8, 2024, 7:01pm

I have a data frame called subset_df and I want to group it by a column named “rolling_agg_timestep”. Here is my code:

def get_nodeToIndexMap(self, subset_df):     #,rolling_agg_timestep):
        
        print('subset_df:',subset_df.columns)

        if rolling_agg_timestep not in subset_df.columns:
            raise KeyError(f"Column '{rolling_agg_timestep}' does not exist in the DataFrame.")
        all_mappings = {}
        subset_group = subset_df.groupby("rolling_agg_timestep")

Here is what I see in the console after running the code:

subset_df: Index(['id', 'pos', 'lane', 'rolling_agg_timestep', 'rolling_mean_speed',
       'rolling_std_speed', 'rolling_mean_accel', 'rolling_std_accel',
       'rolling_std_y', 'rolling_mean_y', 'label'],
      dtype='object')
Traceback (most recent call last):
  File "/Users/Documents/conference-code/STGCN.py", line 815, in <module>
    nodeToIndexList = gInfo.get_nodeToIndexMap(subset_df)    #,rolling_agg_timestep
  File "/Users/Documents/conference-code/STGCN.py", line 245, in get_nodeToIndexMap
    raise KeyError(f"Column '{rolling_agg_timestep}' does not exist in the DataFrame.")
KeyError: "Column '1' does not exist in the DataFrame."

I would appreciate it if anyone could help me group this data frame by rolling_agg_timestep.

cameron · July 8, 2024, 9:48pm

It looks like your dataframe literally has a column named
“rolling_agg_timestep”. But you’re using the variable
rolling_agg_timestep, which contains the string "Column '1'" that
appears in your exception message.

Your .groupby() call uses the literal "rolling_agg_timestep", so it
would probably work. Your pretest on the variable is incorrect,
because the variable holds the wrong string.

Do you know how the variable came to hold the string "Column '1'" ?

elenora · July 9, 2024, 12:29am

Thank you for your answer. Here is what I do to get subset_df before calling the method:

df = pd.read_excel('/Users/SUMOTutorials/m3.xlsx')

selected_columns = ['id','pos','lane','rolling_agg_timestep','rolling_mean_speed','rolling_std_speed','label']

subset_df = df[selected_columns].copy()

If I group by ‘rolling_agg_timestep’, I get 239 groups. Now, after running the code, the error is:

File "/Users/Documents/conference-code/STGCN.py", line 838, in <module>
    nodeToIndexList = gInfo.get_nodeToIndexMap(subset_df_raw)    #,rolling_agg_timestep
  File "/Users/Documents/conference-code/STGCN.py", line 245, in get_nodeToIndexMap
    raise KeyError(f"Column '{rolling_agg_timestep}' does not exist in the DataFrame.")
KeyError: "Column '239' does not exist in the DataFrame."

cameron · July 9, 2024, 12:36am

Your subset_df looks sane.

What I’m concerned by is your test:

if rolling_agg_timestep not in subset_df.columns:
    raise KeyError(f"Column '{rolling_agg_timestep}' does not exist in the DataFrame.")

which isn’t looking for a column named "rolling_agg_timestep" in the columns. It looks for a column named by whatever value is in the global variable rolling_agg_timestep, which in your example above has the string "Column '239'".

This line:

subset_group = subset_df.groupby("rolling_agg_timestep")

looks like it should work just fine. Try commenting out the test.
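
If you would rather keep a pre-check than comment it out, one option is to pass the column name in as a string and test that string. The following is only a sketch under assumptions: the helper name group_by_timestep, its column parameter, and the tiny demo_df are all made up for illustration and are not from the original STGCN.py.

```python
import pandas as pd


def group_by_timestep(subset_df, column="rolling_agg_timestep"):
    """Group subset_df by the column whose *name* is given as a string.

    Hypothetical helper for illustration; not part of the original code.
    """
    # Test the column name (a string), not the value of some global variable.
    if column not in subset_df.columns:
        raise KeyError(f"Column {column!r} does not exist in the DataFrame.")
    return subset_df.groupby(column)


# Made-up stand-in data with two distinct timesteps.
demo_df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "rolling_agg_timestep": [1, 1, 2, 2],
        "rolling_mean_speed": [10.0, 12.0, 9.5, 11.0],
    }
)

groups = group_by_timestep(demo_df)
print(groups.ngroups)  # number of distinct timesteps in demo_df
```

Because the name arrives as an argument with a string default, the check no longer depends on whatever a global called rolling_agg_timestep happens to hold.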

cameron · July 9, 2024, 12:39am

The distinction I’m making here is a bit like this:

column_names = ["foo", "bar", "baz"]
foo = "Column '1'"
assert "foo" in column_names   # this should be ok
assert foo in column_names     # this looks up "Column '1'" and fails

MRAB · July 9, 2024, 1:07am

I don’t think that rolling_agg_timestep is "Column '239'", but instead is 239, because the raise statement says:

raise KeyError(f"Column '{rolling_agg_timestep}' does not exist in the DataFrame.")

The value of rolling_agg_timestep is between the single quotes.
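
A small sketch of that reading, using plain Python only. The value 239 and the shortened column list are stand-ins chosen to mirror the traceback; the thread never shows where the global rolling_agg_timestep is actually assigned, so that part is an assumption.

```python
# Assumed: a global that holds an integer rather than a column name.
rolling_agg_timestep = 239

# Shortened stand-in for subset_df.columns.
columns = ["id", "pos", "lane", "rolling_agg_timestep", "label"]

# The f-string inserts the variable's value between the single quotes.
message = f"Column '{rolling_agg_timestep}' does not exist in the DataFrame."
print(message)  # Column '239' does not exist in the DataFrame.

# KeyError displays the repr of its argument, which adds the outer
# double quotes seen in the traceback.
print(str(KeyError(message)))

# Checking the name (a string) succeeds; checking the variable's value fails.
name_found = "rolling_agg_timestep" in columns
value_found = rolling_agg_timestep in columns
print(name_found, value_found)  # True False
```

So the pre-check would pass if it tested the quoted name "rolling_agg_timestep" instead of the unquoted variable.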