ML Zoomcamp 2023 – Machine Learning for Classification – Part 5

Feature importance: Churn rate and risk ratio

Feature importance analysis is a part of exploratory data analysis (EDA) and involves identifying which features affect our target variable. Three measures are used for this:

  • Churn rate
  • Risk ratio
  • Mutual information – later

Churn rate

Last time, we examined the global churn rate. Now, we are focusing on the churn rate within different groups. For example, we can compute the churn rate separately for female and male customers.

# Selecting the subset of female customers
df_full_train[df_full_train.gender == 'female']
      customerid  gender  seniorcitizen  partner  dependents  tenure  phoneservice  multiplelines  internetservice  onlinesecurity       deviceprotection     techsupport          streamingtv          streamingmovies      contract        paperlessbilling  paymentmethod              monthlycharges  totalcharges  churn
1     6261-rcvns  female  0              no       no          42      yes           no             dsl              yes                  yes                  yes                  no                   yes                  one_year        no                credit_card_(automatic)    73.90           3160.55       1
5     4765-oxppd  female  0              yes      yes         9       yes           no             dsl              yes                  yes                  yes                  no                   no                   month-to-month  no                mailed_check               65.00           663.05        1
9     1732-vhubq  female  1              yes      yes         47      yes           no             fiber_optic      no                   no                   no                   no                   no                   month-to-month  no                bank_transfer_(automatic)  70.55           3309.25       1
11    7017-vfuly  female  0              yes      no          2       yes           no             no               no_internet_service  no_internet_service  no_internet_service  no_internet_service  no_internet_service  month-to-month  no                bank_transfer_(automatic)  20.10           43.15         0
13    1374-dmzui  female  1              no       no          4       yes           yes            fiber_optic      no                   no                   no                   yes                  yes                  month-to-month  yes               electronic_check           94.30           424.45        1
...
5618  8065-ykxkd  female  0              no       no          10      yes           yes            fiber_optic      no                   no                   no                   no                   no                   month-to-month  yes               electronic_check           74.75           799.65        1
5619  5627-tvbpp  female  0              no       yes         35      yes           no             no               no_internet_service  no_internet_service  no_internet_service  no_internet_service  no_internet_service  one_year        yes               credit_card_(automatic)    20.10           644.50        0
5626  3262-eidhv  female  0              yes      yes         72      yes           yes            dsl              yes                  yes                  yes                  yes                  yes                  two_year        no                credit_card_(automatic)    84.70           5893.90       0
5627  7446-sfaoa  female  0              yes      no          37      yes           no             no               no_internet_service  no_internet_service  no_internet_service  no_internet_service  no_internet_service  one_year        yes               bank_transfer_(automatic)  19.85           717.50        0
5633  5840-nvdcg  female  0              yes      yes         16      yes           no             dsl              yes                  no                   yes                  no                   yes                  two_year        no                bank_transfer_(automatic)  68.25           1114.85       0

2796 rows × 21 columns (the onlinebackup column is not shown in this display)

The following snippet displays the global churn rate and, for comparison, the churn rates of the female and male groups. The female churn rate is slightly higher than the global rate and the male churn rate is slightly lower, but both differ from it by less than one percentage point. Women are therefore only marginally more likely to churn, and gender appears to have little effect.

global_churn = df_full_train.churn.mean()
global_churn
# Output: 0.26996805111821087

churn_female = df_full_train[df_full_train.gender == 'female'].churn.mean()
churn_female
# Output: 0.27682403433476394

global_churn - churn_female
# Output: -0.006855983216553063

churn_male = df_full_train[df_full_train.gender == 'male'].churn.mean()
churn_male
# Output: 0.2632135306553911

global_churn - churn_male
# Output: 0.006754520462819769

Let’s check the churn rate of another group (with partner vs. without partner).

df_full_train.partner.value_counts()

# Output:
# no     2932
# yes    2702
# Name: partner, dtype: int64

When examining this group, we notice that customers with partners are noticeably less likely to churn: their churn rate is approximately 20.5%, compared with the global churn rate of almost 27%. Customers without partners, on the other hand, churn at about 33%, well above the global rate.

global_churn = df_full_train.churn.mean()
global_churn
# Output: 0.26996805111821087

churn_partner = df_full_train[df_full_train.partner == 'yes'].churn.mean()
churn_partner
# Output: 0.20503330866025166

global_churn - churn_partner
# Output: 0.06493474245795922

churn_no_partner = df_full_train[df_full_train.partner == 'no'].churn.mean()
churn_no_partner
# Output: 0.3298090040927694

global_churn - churn_no_partner
# Output: -0.05984095297455855

This observation suggests that the partner variable may be more influential for predicting churn than the gender variable.
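The gender and partner comparisons above repeat the same steps: filter to one group, average the churn column, and compare with the global rate. Here is a minimal sketch of that pattern as a helper. The function name `group_churn` and the toy data are illustrative assumptions, not part of the course code; it only assumes a DataFrame with a 0/1 `churn` column.

```python
import pandas as pd

# Toy data standing in for df_full_train (hypothetical values).
df_toy = pd.DataFrame({
    'partner': ['yes', 'yes', 'yes', 'yes', 'no', 'no', 'no', 'no'],
    'churn':   [0,     0,     0,     1,     1,    1,    0,    0],
})

def group_churn(df, column, value):
    """Return (group churn rate, global rate - group rate) for one category."""
    global_rate = df.churn.mean()
    group_rate = df[df[column] == value].churn.mean()
    return group_rate, global_rate - group_rate

rate_yes, diff_yes = group_churn(df_toy, 'partner', 'yes')
rate_no, diff_no = group_churn(df_toy, 'partner', 'no')

print(rate_yes, diff_yes)  # group rate below the global rate, positive difference
print(rate_no, diff_no)    # group rate above the global rate, negative difference
```

With the real data, `group_churn(df_full_train, 'partner', 'yes')` should reproduce the values computed by hand above.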

Risk ratio

In the context of machine learning and classification, the “risk ratio” typically refers to a statistical measure used to assess the likelihood or probability of a certain event occurring in one group compared to another. It’s a useful concept in various fields, including healthcare, finance, and customer churn analysis.

In the specific context of churn rate, the risk ratio can help you understand the relative risk of churn (i.e., customers leaving) for different groups or segments within your dataset. It can provide insights into which features or factors are associated with a higher or lower risk of churn.

Here’s a simplified explanation of how risk ratio works in the context of churn rate:

  1. Definition of Risk Ratio: The risk ratio (also known as the relative risk) is defined as the probability of an event occurring in one group divided by the probability of the same event occurring in another group. For churn, the groups are defined by a feature value (e.g., customers with a partner versus customers without one), not by the outcome itself. In this article, each group's churn rate is compared against the global churn rate.
  2. Interpretation: A risk ratio greater than 1 suggests that the event (churn in this case) is more likely in the first group compared to the second group. A risk ratio less than 1 suggests the event is less likely in the first group. A risk ratio equal to 1 means there is no difference in risk between the two groups.
  3. Application: We can use risk ratios to assess how different features or interventions relate to churn rate. For example, we might calculate the risk ratio of churn for customers who received a promotional offer versus those who did not. A risk ratio clearly below 1 would indicate that customers who received the offer churn less often; a ratio above 1 would indicate that they churn more often.
  4. Statistical Significance: It’s important to also consider statistical significance when interpreting risk ratios. Statistical tests such as chi-squared tests or confidence intervals can help determine if the observed differences in churn rates are statistically significant.
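
As a rough illustration of the significance check in point 4, the sketch below computes a Pearson chi-squared statistic by hand for the partner groups. The churned counts (967 and 554) are derived from the group sizes and churn rates reported earlier (2932 × 0.3298 and 2702 × 0.2050). The helper name `chi2_2x2` is made up for this example, and 3.841 is the standard critical value for one degree of freedom at the 5% level. This is not part of the original course material.

```python
def chi2_2x2(table):
    """Pearson chi-squared statistic for a 2x2 table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if churn were independent of the group
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: no partner, partner. Columns: churned, stayed.
observed = [
    [967, 2932 - 967],
    [554, 2702 - 554],
]

stat = chi2_2x2(observed)
CRITICAL_5PCT_1DOF = 3.841  # chi-squared critical value, 1 degree of freedom, alpha = 0.05

print(round(stat, 1), stat > CRITICAL_5PCT_1DOF)
```

The statistic lands far above the threshold, so the gap between the two partner groups is very unlikely to be a sampling accident. In practice a library routine would normally replace the hand-written helper.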

So the risk ratio is a valuable tool for assessing the impact of different factors or features on churn rate in classification tasks. It helps you quantify and compare the relative risk of churn between different groups, providing insights that can inform decision-making and strategies for reducing churn.

Let’s compare the risk ratio for churning between people with partners and those without partners.

churn_no_partner / global_churn
# Output: 1.2216593879412643

churn_partner / global_churn
# Output: 0.7594724924338315

This demonstrates that the churn rate for people without partners is 22% higher, whereas for people with partners, it is 24% lower than the global churn rate.
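
The percentages in the previous sentence come straight from the ratios: subtract 1 and multiply by 100. Here is a small sketch of that arithmetic using the values computed above; the helper name `percent_vs_global` is an illustrative assumption.

```python
# Values taken from the outputs shown earlier in this article.
global_churn = 0.26996805111821087
churn_no_partner = 0.3298090040927694
churn_partner = 0.20503330866025166

def percent_vs_global(group_rate, global_rate):
    """Relative difference of a group's rate from the global rate, in percent."""
    return (group_rate / global_rate - 1) * 100

print(round(percent_vs_global(churn_no_partner, global_churn), 1))  # 22.2
print(round(percent_vs_global(churn_partner, global_churn), 1))     # -24.1
```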

Let’s group the data by gender and, for each value of gender, calculate the average churn rate together with its difference from and its ratio to the global churn rate. The same analysis works for every categorical variable, not just gender.

The SQL query would look like:

-- global_churn is a placeholder for the overall churn rate
SELECT
    gender,
    AVG(churn),
    AVG(churn) - global_churn AS diff,
    AVG(churn) / global_churn AS risk
FROM
    data
GROUP BY
    gender;
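
The query above treats global_churn as a known constant. To try the idea end to end, here is a sketch that runs an equivalent query with Python's built-in sqlite3 module, replacing the constant with a subquery. The in-memory table and its eight toy rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE data (gender TEXT, churn INTEGER)')
conn.executemany(
    'INSERT INTO data VALUES (?, ?)',
    [('female', 1), ('female', 1), ('female', 0), ('female', 0),
     ('male', 1), ('male', 0), ('male', 0), ('male', 0)],
)

# The subquery computes the global churn rate over the whole table.
query = """
SELECT
    gender,
    AVG(churn) AS mean,
    AVG(churn) - (SELECT AVG(churn) FROM data) AS diff,
    AVG(churn) / (SELECT AVG(churn) FROM data) AS risk
FROM data
GROUP BY gender
ORDER BY gender;
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```

In the toy table, the female group has a churn rate above the overall rate, so its diff is positive and its risk is above 1. The opposite holds for the male group.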

In pandas, the same grouping starts with groupby:

df_full_train.groupby('gender').churn.mean()

# Output:
# gender
# female    0.276824
# male      0.263214
# Name: churn, dtype: float64

# agg takes a list of different aggregations
df_full_train.groupby('gender').churn.agg(['mean', 'count'])
gender      mean  count
female  0.276824   2796
male    0.263214   2838

df_group = df_full_train.groupby('gender').churn.agg(['mean', 'count'])
df_group['diff'] = df_group['mean'] - global_churn
df_group['risk'] = df_group['mean'] / global_churn
df_group
gender      mean  count       diff      risk
female  0.276824   2796   0.006856  1.025396
male    0.263214   2838  -0.006755  0.974980

mean, count, diff, and risk for gender column

This table is interesting, but it only displays information for the gender groups. Now, let’s extend this analysis to include all the categorical columns.

from IPython.display import display

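# categorical: list of categorical column names, defined outside this section
for c in categorical:
    # Churn rate and group size for each value of column c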
    df_group = df_full_train.groupby(c).churn.agg(['mean', 'count'])
    df_group['diff'] = df_group['mean'] - global_churn
    df_group['risk'] = df_group['mean'] / global_churn
    display(df_group)
    print()
    print()
gender      mean  count       diff      risk
female  0.276824   2796   0.006856  1.025396
male    0.263214   2838  -0.006755  0.974980
mean, count, diff, and risk for gender column

seniorcitizen      mean  count       diff      risk
0              0.242270   4722  -0.027698  0.897403
1              0.413377    912   0.143409  1.531208
mean, count, diff, and risk for senior citizen column

partner      mean  count       diff      risk
no       0.329809   2932   0.059841  1.221659
yes      0.205033   2702  -0.064935  0.759472
mean, count, diff, and risk for partner column

dependents      mean  count       diff      risk
no          0.313760   3968   0.043792  1.162212
yes         0.165666   1666  -0.104302  0.613651
mean, count, diff, and risk for dependents column

phoneservice      mean  count       diff      risk
no            0.241316    547  -0.028652  0.893870
yes           0.273049   5087   0.003081  1.011412
mean, count, diff, and risk for phone service column

multiplelines         mean  count       diff      risk
no                0.257407   2700  -0.012561  0.953474
no_phone_service  0.241316    547  -0.028652  0.893870
yes               0.290742   2387   0.020773  1.076948
mean, count, diff, and risk for multiple lines column

internetservice      mean  count       diff      risk
dsl              0.192347   1934  -0.077621  0.712482
fiber_optic      0.425171   2479   0.155203  1.574895
no               0.077805   1221  -0.192163  0.288201
mean, count, diff, and risk for internet service column

onlinesecurity           mean  count       diff      risk
no                   0.420921   2801   0.150953  1.559152
no_internet_service  0.077805   1221  -0.192163  0.288201
yes                  0.153226   1612  -0.116742  0.567570
mean, count, diff, and risk for online security column

onlinebackup             mean  count       diff      risk
no                   0.404323   2498   0.134355  1.497672
no_internet_service  0.077805   1221  -0.192163  0.288201
yes                  0.217232   1915  -0.052736  0.804660
mean, count, diff, and risk for onlinebackup column

deviceprotection         mean  count       diff      risk
no                   0.395875   2473   0.125907  1.466379
no_internet_service  0.077805   1221  -0.192163  0.288201
yes                  0.230412   1940  -0.039556  0.853480
mean, count, diff, and risk for device protection column

techsupport              mean  count       diff      risk
no                   0.418914   2781   0.148946  1.551717
no_internet_service  0.077805   1221  -0.192163  0.288201
yes                  0.159926   1632  -0.110042  0.592390
mean, count, diff, and risk for tech support column

streamingtv              mean  count       diff      risk
no                   0.342832   2246   0.072864  1.269897
no_internet_service  0.077805   1221  -0.192163  0.288201
yes                  0.302723   2167   0.032755  1.121328
mean, count, diff, and risk for streaming tv column

streamingmovies          mean  count       diff      risk
no                   0.338906   2213   0.068938  1.255358
no_internet_service  0.077805   1221  -0.192163  0.288201
yes                  0.307273   2200   0.037305  1.138182
mean, count, diff, and risk for streaming movies column

contract            mean  count       diff      risk
month-to-month  0.431701   3104   0.161733  1.599082
one_year        0.120573   1186  -0.149395  0.446621
two_year        0.028274   1344  -0.241694  0.104730
mean, count, diff, and risk for contract column

paperlessbilling      mean  count       diff      risk
no                0.172071   2313  -0.097897  0.637375
yes               0.338151   3321   0.068183  1.252560
mean, count, diff, and risk for paperless billing column

paymentmethod                  mean  count       diff      risk
bank_transfer_(automatic)  0.168171   1219  -0.101797  0.622928
credit_card_(automatic)    0.164339   1217  -0.105630  0.608733
electronic_check           0.455890   1893   0.185922  1.688682
mailed_check               0.193870   1305  -0.076098  0.718121
mean, count, diff, and risk for payment method column
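
One way to read these per-column tables side by side is to stack them into a single frame and sort by risk. The sketch below does that on toy data. The helper name `risk_table`, the toy values, and the `feature` and `value` column names are assumptions for illustration; with the real data you would pass `df_full_train` and `categorical` instead.

```python
import pandas as pd

# Toy data with two hypothetical categorical columns.
df_toy = pd.DataFrame({
    'contract': ['monthly', 'monthly', 'monthly', 'monthly',
                 'yearly', 'yearly', 'yearly', 'yearly'],
    'gender':   ['female', 'male', 'female', 'male',
                 'female', 'male', 'female', 'male'],
    'churn':    [1, 1, 1, 0, 0, 0, 0, 1],
})

def risk_table(df, columns, target='churn'):
    """Stack the mean/count/diff/risk tables of several columns into one frame."""
    global_rate = df[target].mean()
    parts = []
    for c in columns:
        g = df.groupby(c)[target].agg(['mean', 'count'])
        g['diff'] = g['mean'] - global_rate
        g['risk'] = g['mean'] / global_rate
        g.index.name = 'value'      # category label within the column
        g = g.reset_index()
        g.insert(0, 'feature', c)   # remember which column the row came from
        parts.append(g)
    return pd.concat(parts, ignore_index=True)

table = risk_table(df_toy, ['contract', 'gender'])
print(table.sort_values('risk', ascending=False))
```

Sorting by risk puts the categories with the highest relative churn at the top, which makes strong and weak categories easier to spot than scrolling through separate tables.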

Summary

This article has covered the difference and the risk ratio as two important tools for assessing feature importance.

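Concerning the difference, the manual examples calculate it by subtracting the group’s churn rate from the global churn rate (global minus group). Here, we are primarily interested in large differences, unlike in the gender case. With this convention, values smaller than 0 indicate a higher likelihood to churn, while values larger than 0 indicate a lower likelihood to churn. Note that the diff column in the tables is computed the other way round (group mean minus global churn rate), so there a positive diff means the group churns more than average.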

As for the risk ratio, it is obtained by dividing the group’s churn rate by the global churn rate. Values greater than 1 suggest a higher likelihood to churn, whereas values less than 1 suggest a lower likelihood to churn.

In essence, both difference and risk ratio convey similar information but in different ways, providing insights into the importance of features with respect to churn prediction.
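
To make the link between the two measures concrete, here is a short sketch using the female group's numbers from earlier. It shows both sign conventions for the difference and checks that the risk ratio equals 1 plus the relative difference. The variable names `diff_global_minus_group` and `diff_group_minus_global` are illustrative.

```python
# Values taken from the outputs shown earlier in this article.
global_churn = 0.26996805111821087
churn_female = 0.27682403433476394

# Convention of the manual examples: global minus group (negative = churns more).
diff_global_minus_group = global_churn - churn_female

# Convention of the 'diff' column in the tables: group minus global (positive = churns more).
diff_group_minus_global = churn_female - global_churn

risk = churn_female / global_churn

print(diff_global_minus_group)
print(diff_group_minus_global)
print(risk)

# The risk ratio is 1 plus the difference expressed relative to the global rate.
print(abs(risk - (1 + diff_group_minus_global / global_churn)) < 1e-12)
```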

We observe certain categories in which people tend to churn more or less frequently compared to the global average. These are the types of variables we are interested in and want to use in machine learning algorithms. While it’s informative to see this for individual variables in each table, it would be valuable to have a measure that quantifies the overall importance of each variable.

Such a measure would let us decide, for example, whether the “contract” variable is more or less important than “streamingmovies”. Mutual information, listed at the beginning, serves this purpose and is covered later.
