ML Zoomcamp 2023 – Machine Learning for Classification – Part 7

Feature importance: Correlation

For measuring feature importance for numerical variables, one common approach is to use the correlation coefficient, specifically Pearson’s correlation coefficient. The Pearson correlation coefficient quantifies the degree of linear dependency between two numerical variables.

The correlation coefficient (often denoted as “r”) has a range of -1 to 1:

  • A negative correlation (r < 0) indicates an inverse relationship, where one variable tends to decrease as the other increases; r = -1 is a perfect negative linear relationship.
  • A positive correlation (r > 0) indicates that both variables tend to increase together; r = 1 is a perfect positive linear relationship.
  • An r value close to 0 suggests a weak or no linear relationship between the variables.
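To make the three cases concrete, here is a minimal sketch that computes r by hand as the covariance divided by the product of the standard deviations. The helper name pearson_r and the toy arrays are made up for illustration; in practice you would rely on pandas or NumPy.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r: sum of centered products over the product of centered norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator

x = np.array([1, 2, 3, 4, 5])

r_up = pearson_r(x, 2 * x + 1)          # points on an increasing line
r_down = pearson_r(x, -3 * x + 10)      # points on a decreasing line
r_flat = pearson_r(x, [2, 1, 3, 1, 2])  # symmetric pattern with no linear trend

print(r_up, r_down, r_flat)
```

The first two values come out at 1 and -1 (up to floating-point rounding), and the third at 0, matching the three bullet points above.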

As a rough rule of thumb, the strength of the correlation is indicated by the absolute value of r:

  • 0.0 ≤ |r| < 0.2: Low correlation
  • 0.2 ≤ |r| < 0.5: Moderate correlation
  • 0.5 ≤ |r| ≤ 1.0: Strong correlation

You can calculate the Pearson correlation coefficient between your numerical variables and the target variable (churn) to assess their importance. Higher absolute values of r indicate a stronger linear relationship, which can be interpreted as higher feature importance.

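To calculate the Pearson correlation coefficient in Python, you can use the corrwith() method from pandas, which correlates each column of a DataFrame with a given Series. Unlike corr(), it does not require churn to be one of the selected columns:

correlation = df[numerical_vars].corrwith(df['churn'])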

This gives you a Series with one correlation value per numerical variable. Sorting it by absolute value in descending order puts the variables with the strongest linear relationship to churn first.
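As a small end-to-end sketch, the snippet below runs the calculation on a tiny hand-made DataFrame. The values are invented for illustration and are not the course dataset; the column names simply mirror the ones used in this post. It uses corrwith() so that churn does not have to be part of numerical_vars.

```python
import pandas as pd

# Toy data, made up for illustration only (not the course dataset).
df = pd.DataFrame({
    "tenure":         [1, 2, 5, 12, 24, 36, 48, 60],
    "monthlycharges": [80, 95, 70, 60, 85, 45, 30, 25],
    "churn":          [1, 1, 1, 0, 1, 0, 0, 0],
})
numerical_vars = ["tenure", "monthlycharges"]

# Pearson correlation of each numerical column with the binary target.
correlation = df[numerical_vars].corrwith(df["churn"])

# Signed sort: most positive first, most negative last.
correlation_sorted = correlation.sort_values(ascending=False)
print(correlation_sorted)
```

In this toy data, churned customers have shorter tenure and higher monthly charges, so tenure gets a negative coefficient and monthlycharges a positive one.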

In the context of churn prediction:

  • Positive Correlation: A positive correlation between a numerical variable (e.g., monthlycharges) and churn means that as the numerical variable increases (e.g., higher monthly charges), the likelihood of churn (1) tends to increase.
  • Negative Correlation: A negative correlation between a numerical variable (e.g., tenure) and churn means that as the numerical variable increases (e.g., longer tenure), the likelihood of churn (1) tends to decrease.
  • Zero Correlation: A correlation close to zero suggests that there is no significant linear relationship between the numerical variable and churn.
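Because churn is a 0/1 variable, the sign of r always agrees with the sign of the difference between the average value among churned customers and the average among those who stayed. The sketch below checks this on invented toy data; the DataFrame and the results dictionary are illustrative names, not part of the course code.

```python
import pandas as pd

# Toy data, made up for illustration only.
df = pd.DataFrame({
    "tenure":         [1, 3, 4, 10, 30, 40, 55, 70],
    "monthlycharges": [90, 85, 75, 50, 95, 40, 35, 20],
    "churn":          [1, 1, 1, 0, 1, 0, 0, 0],
})

results = {}
for col in ["tenure", "monthlycharges"]:
    r = df[col].corr(df["churn"])  # Pearson is the default method
    mean_churned = df.loc[df["churn"] == 1, col].mean()
    mean_stayed = df.loc[df["churn"] == 0, col].mean()
    gap = mean_churned - mean_stayed
    results[col] = (r, gap)
    print(f"{col}: r = {r:.3f}, mean(churned) - mean(stayed) = {gap:.2f}")
```

A negative r for tenure therefore means that churned customers have, on average, shorter tenure than retained ones.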

This analysis helps you identify which numerical variables are most strongly associated with churn. Keep in mind that the correlation coefficient captures only linear association and does not by itself show that a variable causes churn.

df_full_train[numerical]

# Output:
#       tenure  monthlycharges  totalcharges
# 0         12           19.70        258.35
# 1         42           73.90       3160.55
# 2         71           65.15       4681.75
# 3         71           85.45       6300.85
# 4         30           70.40       2044.75
# ...      ...             ...           ...
# 5629       9          100.50        918.60
# 5630      60           19.95       1189.90
# 5631      28          105.70       2979.50
# 5632       2           54.40        114.10
# 5633      16           68.25       1114.85
#
# 5634 rows × 3 columns

df_full_train[numerical].corrwith(df_full_train.churn)

# Output:
# tenure           -0.351885
# monthlycharges    0.196805
# totalcharges     -0.196353
# dtype: float64

If you care about the strength of a relationship rather than its direction, look at the absolute values of the correlation coefficients (|r|). This lets you rank numerical variables by how strongly they are linearly related to churn, regardless of whether the relationship is positive or negative.

Sorting the absolute values in descending order puts the variables with the greatest magnitude of correlation first. This simplifies interpretation by treating positive and negative correlations of the same size as equally important.

df_full_train[numerical].corrwith(df_full_train.churn).abs()

# Output:
# tenure            0.351885
# monthlycharges    0.196805
# totalcharges      0.196353
# dtype: float64
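One way to do the ranking while keeping the sign visible is sketched below. The Series is rebuilt by hand from the coefficients printed earlier in this post; the names correlations, ranking and summary are illustrative.

```python
import pandas as pd

# Coefficients copied from the corrwith() output shown earlier in this post.
correlations = pd.Series({
    "tenure": -0.351885,
    "monthlycharges": 0.196805,
    "totalcharges": -0.196353,
})

# Rank by magnitude only.
ranking = correlations.abs().sort_values(ascending=False)

# Keep the signed value next to its magnitude so the direction is not lost.
summary = pd.DataFrame({
    "r": correlations,
    "abs_r": correlations.abs(),
}).sort_values("abs_r", ascending=False)

print(summary)
```

Keeping both columns lets you rank by magnitude and still see at a glance that tenure and totalcharges move against churn while monthlycharges moves with it.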

It seems that tenure is the most important numerical variable. Let’s look at some examples:

df_full_train[df_full_train.tenure <= 2].churn.mean()
# Output: 0.5953420669577875

df_full_train[df_full_train.tenure > 2].churn.mean()
# Output: 0.22478269658378816

df_full_train[(df_full_train.tenure > 2) & (df_full_train.tenure <= 12)].churn.mean()
# Output: 0.3994413407821229

df_full_train[df_full_train.tenure > 12].churn.mean()
# Output: 0.17634908339788277

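Regarding the variable “tenure,” the customers most likely to churn are those with a tenure of 2 months or less: their churn rate is almost 60%, compared with about 22% for everyone else. The rate drops to roughly 40% for tenures above 2 and up to 12 months, and to about 18% beyond 12 months, which is consistent with the negative correlation.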
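The hand-written filters above can also be expressed as a single grouped calculation with pd.cut and groupby. This is a sketch on invented toy data: the bin edges follow the thresholds used above, while the labels and the tenure_group column name are assumptions made for illustration.

```python
import pandas as pd

# Toy data, made up for illustration only (not the course dataset).
df = pd.DataFrame({
    "tenure": [1, 2, 2, 5, 8, 12, 20, 36, 48, 60],
    "churn":  [1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
})

# Buckets: [0, 2], (2, 12], (12, inf). include_lowest keeps tenure == 0 in the first bucket.
bins = [0, 2, 12, float("inf")]
labels = ["<=2", ">2 to 12", ">12"]
df["tenure_group"] = pd.cut(df["tenure"], bins=bins, labels=labels, include_lowest=True)

# Churn rate and group size per bucket.
churn_by_group = df.groupby("tenure_group", observed=True)["churn"].agg(["mean", "count"])
print(churn_by_group)
```

Reporting the group size next to the mean helps you judge how much weight each rate deserves. The same pattern works for monthlycharges with bin edges such as 0, 20 and 50.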

Let’s look at another example:

df_full_train[df_full_train.monthlycharges <= 20].churn.mean()
# Output: 0.08795411089866156

df_full_train[(df_full_train.monthlycharges > 20) & (df_full_train.monthlycharges <= 50)].churn.mean()
# Output: 0.18340943683409436

df_full_train[df_full_train.monthlycharges > 50].churn.mean()
# Output: 0.32499341585462205

Concerning the variable “monthlycharges,” the customers most likely to churn are those with monthly charges greater than $50, with a churn rate of 32.5%, compared with about 9% for charges of $20 or less. Churn rises with monthly charges, which is consistent with the positive correlation.
