Correlation Matrix

Here is the code:

import numpy as np

# Create correlation matrix
corr_matrix = heads.corr()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(abs(upper[column]) > 0.95)]

to_drop

It threw me this message:

/var/folders/7y/rvhwt04n48g8nsfq443cymkh0000gn/T/ipykernel_18239/2672702943.py:2: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  corr_matrix = heads.corr()

Could someone explain this message? How can I fix the line corr_matrix = heads.corr()?

Thanks.

The default value of numeric_only in DataFrame.corr

This means: when you called DataFrame.corr without passing a value for the numeric_only parameter, it used the default value for that parameter. In other words, it decided for you that it should look only at the numeric columns when calculating the correlation.

A correlation requires numbers to correlate, so it can only use numeric columns. It is guessing that you mean every numeric column.

is deprecated. In a future version, it will default to False.

This means: if you update Pandas to 2.0, the code as written will break whenever heads contains non-numeric columns. It will try to use every column, including the ones that aren’t numeric, causing an error.

Select only valid columns or specify the value of numeric_only to silence this warning.

Exactly as it says. You can tell it explicitly to use the numeric columns: heads.corr(numeric_only=True). Or you can tell it which columns to use, by selecting them first (to make a new DataFrame that only has numeric columns).
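
Here is a minimal sketch of both options. The heads DataFrame below is a made-up stand-in (your real data will look different), and the column names a, b and label are assumptions for illustration only. It assumes a Pandas version that accepts numeric_only in corr (1.5 or later).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `heads`: two numeric columns and one text column.
heads = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.5],
    "label": ["w", "x", "y", "z"],
})

# Option 1: tell corr() explicitly to use only the numeric columns.
corr_explicit = heads.corr(numeric_only=True)

# Option 2: select the numeric columns first, then correlate the result.
numeric_heads = heads.select_dtypes(include="number")
corr_selected = numeric_heads.corr()

# The rest of the original code works unchanged on either matrix.
upper = corr_explicit.where(
    np.triu(np.ones(corr_explicit.shape), k=1).astype(bool)
)
to_drop = [column for column in upper.columns if any(abs(upper[column]) > 0.95)]
```

Both options should give the same matrix here. The first is the smaller change to your existing line; the second makes the column choice visible in the code, which can help if you later want to exclude some numeric columns too.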


Thank you!