Feature importance: Churn rate and risk ratio
Feature importance analysis is part of exploratory data analysis (EDA). Its goal is to identify which features affect the target variable. The measures covered are:
- Churn rate
- Risk ratio
- Mutual information (covered later)
Churn rate
Last time, we examined the global churn rate. Now we focus on the churn rate within different groups. For example, we want to know the churn rate for each gender.
# Selecting the subset of female customers
df_full_train[df_full_train.gender == 'female']
| | customerid | gender | seniorcitizen | partner | dependents | tenure | phoneservice | multiplelines | internetservice | onlinesecurity | … | deviceprotection | techsupport | streamingtv | streamingmovies | contract | paperlessbilling | paymentmethod | monthlycharges | totalcharges | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 6261-rcvns | female | 0 | no | no | 42 | yes | no | dsl | yes | … | yes | yes | no | yes | one_year | no | credit_card_(automatic) | 73.90 | 3160.55 | 1 |
| 5 | 4765-oxppd | female | 0 | yes | yes | 9 | yes | no | dsl | yes | … | yes | yes | no | no | month-to-month | no | mailed_check | 65.00 | 663.05 | 1 |
| 9 | 1732-vhubq | female | 1 | yes | yes | 47 | yes | no | fiber_optic | no | … | no | no | no | no | month-to-month | no | bank_transfer_(automatic) | 70.55 | 3309.25 | 1 |
| 11 | 7017-vfuly | female | 0 | yes | no | 2 | yes | no | no | no_internet_service | … | no_internet_service | no_internet_service | no_internet_service | no_internet_service | month-to-month | no | bank_transfer_(automatic) | 20.10 | 43.15 | 0 |
| 13 | 1374-dmzui | female | 1 | no | no | 4 | yes | yes | fiber_optic | no | … | no | no | yes | yes | month-to-month | yes | electronic_check | 94.30 | 424.45 | 1 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 5618 | 8065-ykxkd | female | 0 | no | no | 10 | yes | yes | fiber_optic | no | … | no | no | no | no | month-to-month | yes | electronic_check | 74.75 | 799.65 | 1 |
| 5619 | 5627-tvbpp | female | 0 | no | yes | 35 | yes | no | no | no_internet_service | … | no_internet_service | no_internet_service | no_internet_service | no_internet_service | one_year | yes | credit_card_(automatic) | 20.10 | 644.50 | 0 |
| 5626 | 3262-eidhv | female | 0 | yes | yes | 72 | yes | yes | dsl | yes | … | yes | yes | yes | yes | two_year | no | credit_card_(automatic) | 84.70 | 5893.90 | 0 |
| 5627 | 7446-sfaoa | female | 0 | yes | no | 37 | yes | no | no | no_internet_service | … | no_internet_service | no_internet_service | no_internet_service | no_internet_service | one_year | yes | bank_transfer_(automatic) | 19.85 | 717.50 | 0 |
| 5633 | 5840-nvdcg | female | 0 | yes | yes | 16 | yes | no | dsl | yes | … | no | yes | no | yes | two_year | no | bank_transfer_(automatic) | 68.25 | 1114.85 | 0 |
The following snippet computes the global churn rate and then the churn rates for female and male customers. The female churn rate (about 27.7%) is slightly higher than the global rate (about 27.0%), while the male churn rate (about 26.3%) is slightly lower. The gap is under one percentage point in each direction, so women are only marginally more likely to churn.
global_churn = df_full_train.churn.mean()
global_churn
# Output: 0.26996805111821087
churn_female = df_full_train[df_full_train.gender == 'female'].churn.mean()
churn_female
# Output: 0.27682403433476394
global_churn - churn_female
# Output: -0.006855983216553063
churn_male = df_full_train[df_full_train.gender == 'male'].churn.mean()
churn_male
# Output: 0.2632135306553911
global_churn - churn_male
# Output: 0.006754520462819769
Let’s check the churn rate of another group (with partner vs. without partner).
df_full_train.partner.value_counts()
# Output:
# no 2932
# yes 2702
# Name: partner, dtype: int64
When examining this group, we notice that customers with partners are noticeably less likely to churn. Their churn rate is approximately 20.5%, compared with the global churn rate of almost 27%. Customers without partners, on the other hand, churn at about 33%, well above the global rate.
global_churn = df_full_train.churn.mean()
global_churn
# Output: 0.26996805111821087
churn_partner = df_full_train[df_full_train.partner == 'yes'].churn.mean()
churn_partner
# Output: 0.20503330866025166
global_churn - churn_partner
# Output: 0.06493474245795922
churn_no_partner = df_full_train[df_full_train.partner == 'no'].churn.mean()
churn_no_partner
# Output: 0.3298090040927694
global_churn - churn_no_partner
# Output: -0.05984095297455855
This observation suggests that the partner variable may be more influential for predicting churn than the gender variable.
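The manual comparison above repeats the same steps for every value: filter, average, subtract. As a minimal sketch, these steps could be wrapped in a helper. The function name `churn_rate_by_value` and the small `toy` DataFrame are made up for illustration; they are not part of the original notebook, and the numbers are not from the real dataset.

```python
import pandas as pd


def churn_rate_by_value(df, column, target="churn"):
    """Return {value: (group_rate, global_rate - group_rate)} for one column.

    Hypothetical helper, not part of the original notebook. The difference
    uses the same sign convention as the snippets above (global minus group).
    """
    global_rate = df[target].mean()
    result = {}
    for value in df[column].unique():
        group_rate = df[df[column] == value][target].mean()
        result[value] = (group_rate, global_rate - group_rate)
    return result


# Toy data for illustration only (not the real dataset).
toy = pd.DataFrame({
    "partner": ["yes", "yes", "yes", "yes", "no", "no", "no", "no"],
    "churn":   [0,     0,     0,     1,     1,    1,    0,    0],
})

rates = churn_rate_by_value(toy, "partner")
for value, (rate, diff) in rates.items():
    print(f"{value}: churn rate {rate:.2f}, global - group = {diff:+.3f}")
```

With this toy data, the helper reports a churn rate of 0.25 for customers with a partner and 0.50 for those without.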
Risk ratio
In the context of machine learning and classification, the “risk ratio” typically refers to a statistical measure used to assess the likelihood or probability of a certain event occurring in one group compared to another. It’s a useful concept in various fields, including healthcare, finance, and customer churn analysis.
In the specific context of churn rate, the risk ratio can help you understand the relative risk of churn (i.e., customers leaving) for different groups or segments within your dataset. It can provide insights into which features or factors are associated with a higher or lower risk of churn.
Here’s a simplified explanation of how risk ratio works in the context of churn rate:
- Definition of Risk Ratio: The risk ratio (also known as the relative risk) is defined as the probability of an event occurring in one group divided by the probability of the same event occurring in another group. For churn, the groups are defined by a feature value rather than by the outcome: for example, customers without a partner versus customers with a partner. In this article, each group’s churn rate is compared against the global churn rate instead of against a second group.
- Interpretation: A risk ratio greater than 1 suggests that the event (churn in this case) is more likely in the first group compared to the second group. A risk ratio less than 1 suggests the event is less likely in the first group. A risk ratio equal to 1 means there is no difference in risk between the two groups.
- Application: We can use risk ratios to assess how different features or interventions relate to churn. For example, we might calculate the risk ratio of churn for customers who received a promotional offer versus those who did not. If the risk ratio is clearly less than 1, customers who received the offer churned less often, which suggests the offer helped reduce churn. A ratio greater than 1 would suggest the opposite.
- Statistical Significance: It’s important to also consider statistical significance when interpreting risk ratios. Statistical tests such as chi-squared tests or confidence intervals can help determine if the observed differences in churn rates are statistically significant.
So the risk ratio is a valuable tool for assessing the impact of different factors or features on churn rate in classification tasks. It helps you quantify and compare the relative risk of churn between different groups, providing insights that can inform decision-making and strategies for reducing churn.
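To make the definition and the point about confidence intervals concrete, here is a rough sketch of the classical two-group risk ratio (no partner versus partner) with an approximate 95% confidence interval based on the standard error of the log risk ratio. The churned-customer counts are not reported directly in this article; they are reconstructed by multiplying each group size by its churn rate, so treat them as an assumption.

```python
import math

# Counts reconstructed from figures in this article (group size x churn rate,
# rounded to whole customers); they are derived here, not taken from the notebook.
n_no_partner, churned_no_partner = 2932, 967   # 967 / 2932 is about 0.3298
n_partner, churned_partner = 2702, 554         # 554 / 2702 is about 0.2050

risk_no_partner = churned_no_partner / n_no_partner
risk_partner = churned_partner / n_partner

# Classical risk ratio: risk in one group divided by risk in the other.
risk_ratio = risk_no_partner / risk_partner

# Approximate 95% confidence interval using the standard error of log(RR).
se_log_rr = math.sqrt(
    1 / churned_no_partner - 1 / n_no_partner
    + 1 / churned_partner - 1 / n_partner
)
z = 1.96  # normal quantile for a two-sided 95% interval
ci_low = math.exp(math.log(risk_ratio) - z * se_log_rr)
ci_high = math.exp(math.log(risk_ratio) + z * se_log_rr)

print(f"risk ratio = {risk_ratio:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

With these counts the ratio is roughly 1.6 and the interval lies entirely above 1, which is consistent with customers without partners churning more often.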
Let’s compare the risk ratio for churning between people with partners and those without partners.
churn_no_partner / global_churn
# Output: 1.2216593879412643
churn_partner / global_churn
# Output: 0.7594724924338315
This demonstrates that the churn rate for people without partners is 22% higher, whereas for people with partners, it is 24% lower than the global churn rate.
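The percentages quoted above come from subtracting 1 from the ratio and multiplying by 100. A small sketch of that conversion, using the rates printed earlier (the helper name `relative_change_pct` is made up for this example):

```python
# Values copied from the outputs shown earlier in this article.
global_churn = 0.26996805111821087
churn_no_partner = 0.3298090040927694
churn_partner = 0.20503330866025166


def relative_change_pct(group_rate, global_rate):
    """Percent by which a group's rate is above (+) or below (-) the global rate."""
    return (group_rate / global_rate - 1) * 100


pct_no_partner = relative_change_pct(churn_no_partner, global_churn)
pct_partner = relative_change_pct(churn_partner, global_churn)

print(f"no partner: {pct_no_partner:+.1f}%")
print(f"partner:    {pct_partner:+.1f}%")
```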
Let’s take the data and group it by gender. For each value of the gender variable, we calculate the average churn rate within that group, then the difference from the global rate and the risk ratio. We can perform this analysis for all the variables, not just gender.
The equivalent SQL query would look roughly like this, where global_churn stands for the precomputed global churn rate:
SELECT
    gender,
    AVG(churn),
    AVG(churn) - global_churn AS diff,
    AVG(churn) / global_churn AS risk
FROM
    data
GROUP BY
    gender;
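The query above treats global_churn as if it were already known. In an actual database it has to be computed, for example with a scalar subquery. The following sketch runs such a query against an in-memory SQLite table from Python; the table contents are toy rows invented for illustration, and the column aliases are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (gender TEXT, churn INTEGER)")
# Toy rows for illustration only (not the real dataset).
conn.executemany(
    "INSERT INTO data (gender, churn) VALUES (?, ?)",
    [("female", 1), ("female", 0), ("female", 1), ("female", 0),
     ("male", 0), ("male", 0), ("male", 1), ("male", 0)],
)

# The scalar subquery plays the role of global_churn.
query = """
SELECT
    gender,
    AVG(churn) AS mean,
    AVG(churn) - (SELECT AVG(churn) FROM data) AS diff,
    AVG(churn) / (SELECT AVG(churn) FROM data) AS risk
FROM data
GROUP BY gender
ORDER BY gender;
"""
rows = conn.execute(query).fetchall()
for gender, mean, diff, risk in rows:
    print(f"{gender}: mean={mean:.3f} diff={diff:+.3f} risk={risk:.3f}")
conn.close()
```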
df_full_train.groupby('gender').churn.mean()
# Output:
# gender
# female 0.276824
# male 0.263214
# Name: churn, dtype: float64
# agg takes a list of different aggregations
df_full_train.groupby('gender').churn.agg(['mean', 'count'])
| gender | mean | count |
|---|---|---|
| female | 0.276824 | 2796 |
| male | 0.263214 | 2838 |
df_group = df_full_train.groupby('gender').churn.agg(['mean', 'count'])
df_group['diff'] = df_group['mean'] - global_churn
df_group['risk'] = df_group['mean'] / global_churn
df_group
| gender | mean | count | diff | risk |
|---|---|---|---|---|
| female | 0.276824 | 2796 | 0.006856 | 1.025396 |
| male | 0.263214 | 2838 | -0.006755 | 0.974980 |
This table is interesting, but it only displays information for the gender groups. Now, let’s extend this analysis to include all the categorical columns.
from IPython.display import display

# Build the same mean/count/diff/risk table for every categorical column
for c in categorical:
    # print(c)
    df_group = df_full_train.groupby(c).churn.agg(['mean', 'count'])
    df_group['diff'] = df_group['mean'] - global_churn
    df_group['risk'] = df_group['mean'] / global_churn
    display(df_group)
    print()
    print()
| gender | mean | count | diff | risk |
|---|---|---|---|---|
| female | 0.276824 | 2796 | 0.006856 | 1.025396 |
| male | 0.263214 | 2838 | -0.006755 | 0.974980 |
| seniorcitizen | mean | count | diff | risk |
|---|---|---|---|---|
| 0 | 0.242270 | 4722 | -0.027698 | 0.897403 |
| 1 | 0.413377 | 912 | 0.143409 | 1.531208 |
| partner | mean | count | diff | risk |
|---|---|---|---|---|
| no | 0.329809 | 2932 | 0.059841 | 1.221659 |
| yes | 0.205033 | 2702 | -0.064935 | 0.759472 |
| dependents | mean | count | diff | risk |
|---|---|---|---|---|
| no | 0.313760 | 3968 | 0.043792 | 1.162212 |
| yes | 0.165666 | 1666 | -0.104302 | 0.613651 |
| phoneservice | mean | count | diff | risk |
|---|---|---|---|---|
| no | 0.241316 | 547 | -0.028652 | 0.893870 |
| yes | 0.273049 | 5087 | 0.003081 | 1.011412 |
| multiplelines | mean | count | diff | risk |
|---|---|---|---|---|
| no | 0.257407 | 2700 | -0.012561 | 0.953474 |
| no_phone_service | 0.241316 | 547 | -0.028652 | 0.893870 |
| yes | 0.290742 | 2387 | 0.020773 | 1.076948 |
| internetservice | mean | count | diff | risk |
|---|---|---|---|---|
| dsl | 0.192347 | 1934 | -0.077621 | 0.712482 |
| fiber_optic | 0.425171 | 2479 | 0.155203 | 1.574895 |
| no | 0.077805 | 1221 | -0.192163 | 0.288201 |
| onlinesecurity | mean | count | diff | risk |
|---|---|---|---|---|
| no | 0.420921 | 2801 | 0.150953 | 1.559152 |
| no_internet_service | 0.077805 | 1221 | -0.192163 | 0.288201 |
| yes | 0.153226 | 1612 | -0.116742 | 0.567570 |
| onlinebackup | mean | count | diff | risk |
|---|---|---|---|---|
| no | 0.404323 | 2498 | 0.134355 | 1.497672 |
| no_internet_service | 0.077805 | 1221 | -0.192163 | 0.288201 |
| yes | 0.217232 | 1915 | -0.052736 | 0.804660 |
| deviceprotection | mean | count | diff | risk |
|---|---|---|---|---|
| no | 0.395875 | 2473 | 0.125907 | 1.466379 |
| no_internet_service | 0.077805 | 1221 | -0.192163 | 0.288201 |
| yes | 0.230412 | 1940 | -0.039556 | 0.853480 |
| techsupport | mean | count | diff | risk |
|---|---|---|---|---|
| no | 0.418914 | 2781 | 0.148946 | 1.551717 |
| no_internet_service | 0.077805 | 1221 | -0.192163 | 0.288201 |
| yes | 0.159926 | 1632 | -0.110042 | 0.592390 |
| streamingtv | mean | count | diff | risk |
|---|---|---|---|---|
| no | 0.342832 | 2246 | 0.072864 | 1.269897 |
| no_internet_service | 0.077805 | 1221 | -0.192163 | 0.288201 |
| yes | 0.302723 | 2167 | 0.032755 | 1.121328 |
| streamingmovies | mean | count | diff | risk |
|---|---|---|---|---|
| no | 0.338906 | 2213 | 0.068938 | 1.255358 |
| no_internet_service | 0.077805 | 1221 | -0.192163 | 0.288201 |
| yes | 0.307273 | 2200 | 0.037305 | 1.138182 |
| contract | mean | count | diff | risk |
|---|---|---|---|---|
| month-to-month | 0.431701 | 3104 | 0.161733 | 1.599082 |
| one_year | 0.120573 | 1186 | -0.149395 | 0.446621 |
| two_year | 0.028274 | 1344 | -0.241694 | 0.104730 |
| paperlessbilling | mean | count | diff | risk |
|---|---|---|---|---|
| no | 0.172071 | 2313 | -0.097897 | 0.637375 |
| yes | 0.338151 | 3321 | 0.068183 | 1.252560 |
| paymentmethod | mean | count | diff | risk |
|---|---|---|---|---|
| bank_transfer_(automatic) | 0.168171 | 1219 | -0.101797 | 0.622928 |
| credit_card_(automatic) | 0.164339 | 1217 | -0.105630 | 0.608733 |
| electronic_check | 0.455890 | 1893 | 0.185922 | 1.688682 |
| mailed_check | 0.193870 | 1305 | -0.076098 | 0.718121 |
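Scanning this many separate tables by eye is tedious. One option, sketched here under assumptions, is to stack the per-variable tables into a single frame and sort it by risk so the most extreme groups appear first. The `toy` DataFrame and the `categorical_toy` list are hypothetical stand-ins for `df_full_train` and `categorical`; the numbers are not from the real dataset.

```python
import pandas as pd

# Toy data for illustration only (not the real dataset).
toy = pd.DataFrame({
    "contract": ["monthly", "monthly", "monthly", "monthly",
                 "yearly", "yearly", "yearly", "yearly"],
    "gender":   ["female", "male", "female", "male",
                 "female", "male", "female", "male"],
    "churn":    [1, 1, 1, 0, 0, 0, 0, 1],
})
categorical_toy = ["contract", "gender"]  # hypothetical stand-in for `categorical`
global_rate = toy.churn.mean()

frames = []
for c in categorical_toy:
    g = toy.groupby(c).churn.agg(["mean", "count"])
    g["diff"] = g["mean"] - global_rate
    g["risk"] = g["mean"] / global_rate
    # Move the group label into a regular column so tables can be stacked.
    g = g.reset_index().rename(columns={c: "value"})
    g.insert(0, "variable", c)
    frames.append(g)

summary = pd.concat(frames, ignore_index=True).sort_values("risk", ascending=False)
print(summary)
```

In the toy data, the contract groups end up at the top and bottom of the sorted table, while both gender groups sit at a risk of exactly 1.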
Summary
This article has covered the difference and the risk ratio as two important tools for assessing feature importance.
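Concerning the difference, this article uses two sign conventions. In the first snippets we subtracted the group’s churn rate from the global churn rate (global minus group), so values smaller than 0 indicate a higher likelihood to churn and values larger than 0 a lower one. In the grouped tables, the diff column is the group’s churn rate minus the global churn rate, so the signs are reversed: positive values indicate a higher likelihood to churn. Either way, we are primarily interested in large differences; the gender difference, at under one percentage point, is too small to be informative.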
As for the risk ratio, it is obtained by dividing the group’s churn rate by the global churn rate. Values greater than 1 suggest a higher likelihood to churn, whereas values less than 1 suggest a lower likelihood to churn.
In essence, both difference and risk ratio convey similar information but in different ways, providing insights into the importance of features with respect to churn prediction.
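The link between the two measures can be written down directly. Taking the difference as group rate minus global rate (the convention of the diff column in the tables), the risk ratio equals 1 plus the difference divided by the global rate. A quick numerical check with the rates for customers without a partner, copied from the outputs earlier in this article:

```python
# Values copied from the outputs shown earlier in this article.
global_churn = 0.26996805111821087
churn_no_partner = 0.3298090040927694

diff = churn_no_partner - global_churn   # group minus global, as in the tables
risk = churn_no_partner / global_churn

# risk = mean / global = (global + diff) / global = 1 + diff / global
risk_from_diff = 1 + diff / global_churn

print(risk, risk_from_diff)
```

The two printed values agree up to floating-point rounding, so the difference is an absolute view and the risk ratio a relative view of the same gap.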
We observe certain categories in which people tend to churn more or less frequently compared to the global average. These are the types of variables we are interested in and want to use in machine learning algorithms. While it’s informative to see this for individual variables in each table, it would be valuable to have a measure that quantifies the overall importance of each variable.
The next step is to find a way to tell whether, say, the “contract” variable is more or less important than “streamingmovies.”