Feature importance: Churn rate and risk ratio

Feature importance analysis is a part of exploratory data analysis (EDA) and involves identifying which features affect our target variable.

Churn rate
Risk ratio
Mutual information – later

Churn rate

Last time, we examined the global churn rate. Now we focus on the churn rate within individual groups. For example, we want to know the churn rate for each gender.

# Selecting the subset of female customers
df_full_train[df_full_train.gender == 'female']

	customerid	gender	seniorcitizen	partner	dependents	tenure	phoneservice	multiplelines	internetservice	onlinesecurity	…	deviceprotection	techsupport	streamingtv	streamingmovies	contract	paperlessbilling	paymentmethod	monthlycharges	totalcharges	churn
1	6261-rcvns	female	0	no	no	42	yes	no	dsl	yes	…	yes	yes	no	yes	one_year	no	credit_card_(automatic)	73.90	3160.55	1
5	4765-oxppd	female	0	yes	yes	9	yes	no	dsl	yes	…	yes	yes	no	no	month-to-month	no	mailed_check	65.00	663.05	1
9	1732-vhubq	female	1	yes	yes	47	yes	no	fiber_optic	no	…	no	no	no	no	month-to-month	no	bank_transfer_(automatic)	70.55	3309.25	1
11	7017-vfuly	female	0	yes	no	2	yes	no	no	no_internet_service	…	no_internet_service	no_internet_service	no_internet_service	no_internet_service	month-to-month	no	bank_transfer_(automatic)	20.10	43.15	0
13	1374-dmzui	female	1	no	no	4	yes	yes	fiber_optic	no	…	no	no	yes	yes	month-to-month	yes	electronic_check	94.30	424.45	1
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
5618	8065-ykxkd	female	0	no	no	10	yes	yes	fiber_optic	no	…	no	no	no	no	month-to-month	yes	electronic_check	74.75	799.65	1
5619	5627-tvbpp	female	0	no	yes	35	yes	no	no	no_internet_service	…	no_internet_service	no_internet_service	no_internet_service	no_internet_service	one_year	yes	credit_card_(automatic)	20.10	644.50	0
5626	3262-eidhv	female	0	yes	yes	72	yes	yes	dsl	yes	…	yes	yes	yes	yes	two_year	no	credit_card_(automatic)	84.70	5893.90	0
5627	7446-sfaoa	female	0	yes	no	37	yes	no	no	no_internet_service	…	no_internet_service	no_internet_service	no_internet_service	no_internet_service	one_year	yes	bank_transfer_(automatic)	19.85	717.50	0
5633	5840-nvdcg	female	0	yes	yes	16	yes	no	dsl	yes	…	no	yes	no	yes	two_year	no	bank_transfer_(automatic)	68.25	1114.85	0

2796 rows × 21 columns

The following snippets compute the global churn rate and, for comparison, the churn rates for female and male customers. The female churn rate (about 27.7%) is slightly higher than the global rate (about 27.0%), while the male churn rate (about 26.3%) is slightly lower. Both differ from the global rate by less than one percentage point, which suggests that women are only marginally more likely to churn. Note that these snippets compute the global rate minus the group rate, so a negative result means the group churns more than average.

global_churn = df_full_train.churn.mean()
global_churn
# Output: 0.26996805111821087

churn_female = df_full_train[df_full_train.gender == 'female'].churn.mean()
churn_female
# Output: 0.27682403433476394

global_churn - churn_female
# Output: -0.006855983216553063

churn_male = df_full_train[df_full_train.gender == 'male'].churn.mean()
churn_male
# Output: 0.2632135306553911

global_churn - churn_male
# Output: 0.006754520462819769

Let’s check the churn rate of another group (with partner vs. without partner).

df_full_train.partner.value_counts()

# Output:
# no     2932
# yes    2702
# Name: partner, dtype: int64

Looking at this variable, we notice that customers with partners are noticeably less likely to churn: their churn rate is approximately 20.5%, compared with the global churn rate of almost 27%. Customers without partners, on the other hand, churn at about 33%, well above the global rate.

global_churn = df_full_train.churn.mean()
global_churn
# Output: 0.26996805111821087

churn_partner = df_full_train[df_full_train.partner == 'yes'].churn.mean()
churn_partner
# Output: 0.20503330866025166

global_churn - churn_partner
# Output: 0.06493474245795922

churn_no_partner = df_full_train[df_full_train.partner == 'no'].churn.mean()
churn_no_partner
# Output: 0.3298090040927694

global_churn - churn_no_partner
# Output: -0.05984095297455855

This observation suggests that the partner variable may be more influential for predicting churn than the gender variable.

Risk ratio

In the context of machine learning and classification, the “risk ratio” typically refers to a statistical measure used to assess the likelihood or probability of a certain event occurring in one group compared to another. It’s a useful concept in various fields, including healthcare, finance, and customer churn analysis.

In the specific context of churn rate, the risk ratio can help you understand the relative risk of churn (i.e., customers leaving) for different groups or segments within your dataset. It can provide insights into which features or factors are associated with a higher or lower risk of churn.

Here’s a simplified explanation of how risk ratio works in the context of churn rate:

Definition of Risk Ratio: The risk ratio (also known as the relative risk) is the probability of an event occurring in one group divided by the probability of the same event occurring in another group. For churn, the event is a customer churning, and the groups are defined by a feature value (e.g., customers without a partner versus customers with a partner). In this article we use a simplified variant that compares each group’s churn rate with the global churn rate.
Interpretation: A risk ratio greater than 1 suggests that the event (churn in this case) is more likely in the first group compared to the second group. A risk ratio less than 1 suggests the event is less likely in the first group. A risk ratio equal to 1 means there is no difference in risk between the two groups.
Application: We can use risk ratios to assess the impact of different features or interventions on churn rate. For example, we might calculate the risk ratio of churn for customers who received a promotional offer versus those who did not. If the risk ratio is well below 1, churn is less common among customers who received the offer, which is consistent with the offer reducing churn. A risk ratio above 1 would mean the opposite.
Statistical Significance: It’s important to also consider statistical significance when interpreting risk ratios. Statistical tests such as chi-squared tests or confidence intervals can help determine if the observed differences in churn rates are statistically significant.

So the risk ratio is a valuable tool for assessing the impact of different factors or features on churn rate in classification tasks. It helps you quantify and compare the relative risk of churn between different groups, providing insights that can inform decision-making and strategies for reducing churn.
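To make the point about statistical significance concrete, here is a minimal sketch of a Pearson chi-squared test on a 2x2 table, written with the standard library only. The function name `chi2_2x2` is hypothetical, and the churned-customer counts are not printed in this article: they are derived by multiplying each group’s churn rate by its group size (for example, 0.329809 × 2932 ≈ 967). Treat it as an illustration rather than a complete analysis.

```python
import math

def chi2_2x2(churned_a, total_a, churned_b, total_b):
    """Pearson chi-squared test (no continuity correction) for a 2x2 table.

    Hypothetical helper for illustration; returns (statistic, p_value).
    """
    observed = [
        [churned_a, total_a - churned_a],
        [churned_b, total_b - churned_b],
    ]
    n = total_a + total_b
    row_totals = [total_a, total_b]
    col_totals = [churned_a + churned_b, n - churned_a - churned_b]

    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if churn were independent of the group
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Counts derived from the churn rates and group sizes reported in this article
partner_stat, partner_p = chi2_2x2(967, 2932, 554, 2702)  # no partner vs. partner
gender_stat, gender_p = chi2_2x2(774, 2796, 747, 2838)    # female vs. male

print("partner:", partner_stat, partner_p)
print("gender: ", gender_stat, gender_p)
```

Under these assumptions the partner statistic comes out far above the usual 5% critical value of 3.84 for one degree of freedom, while the gender statistic stays below it, which matches the impression from the raw churn rates.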

Let’s compute the risk ratio for people without partners and for people with partners, each relative to the global churn rate.

churn_no_partner / global_churn
# Output: 1.2216593879412643

churn_partner / global_churn
# Output: 0.7594724924338315

This demonstrates that the churn rate for people without partners is 22% higher, whereas for people with partners, it is 24% lower than the global churn rate.
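The ratios above compare each group with the global rate. The textbook definition compares two groups directly, and a sketch of that calculation, together with an approximate 95% confidence interval on the log scale, might look as follows. The helper name `risk_ratio_ci` is hypothetical, and the churned counts (967 and 554) are derived from the rates and group sizes shown earlier rather than taken from the original notebook.

```python
import math

def risk_ratio_ci(events_a, total_a, events_b, total_b, z=1.96):
    """Risk ratio of group A vs. group B with a log-scale normal-approximation CI.

    Hypothetical helper for illustration; z=1.96 gives roughly a 95% interval.
    """
    risk_a = events_a / total_a
    risk_b = events_b / total_b
    rr = risk_a / risk_b

    # Standard error of log(RR) for two independent binomial samples
    se_log = math.sqrt(1 / events_a - 1 / total_a + 1 / events_b - 1 / total_b)

    low = math.exp(math.log(rr) - z * se_log)
    high = math.exp(math.log(rr) + z * se_log)
    return rr, low, high

# Group A: customers without a partner, group B: customers with a partner
rr, low, high = risk_ratio_ci(967, 2932, 554, 2702)
print(round(rr, 2), round(low, 2), round(high, 2))
```

With these numbers the ratio is about 1.6, meaning customers without partners churn roughly 60% more often than customers with partners, and the interval stays clearly above 1.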

Let’s take the data and group it by gender. For each value of the gender variable, we calculate the average churn rate within that group, then the difference from the global rate and the risk. We can perform this analysis for all the variables, not just the gender variable.

In SQL, the query would look roughly like this (here global_churn stands for the precomputed global churn rate and data for the table name):

SELECT
    gender,
    AVG(churn),
    AVG(churn) - global_churn AS diff,
    AVG(churn) / global_churn AS risk
FROM
    data
GROUP BY
    gender;

df_full_train.groupby('gender').churn.mean()

# Output:
# gender
# female    0.276824
# male      0.263214
# Name: churn, dtype: float64

# agg takes a list of different aggregations
df_full_train.groupby('gender').churn.agg(['mean', 'count'])

gender	mean	count
female	0.276824	2796
male	0.263214	2838

df_group = df_full_train.groupby('gender').churn.agg(['mean', 'count'])
df_group['diff'] = df_group['mean'] - global_churn
df_group['risk'] = df_group['mean'] / global_churn
df_group

gender	mean	count	diff	risk
female	0.276824	2796	0.006856	1.025396
male	0.263214	2838	-0.006755	0.974980

mean, count, diff, and risk for gender column

This table is interesting, but it only displays information for the gender groups. Now, let’s extend this analysis to include all the categorical columns.

from IPython.display import display

for c in categorical:
    #print(c)
    df_group = df_full_train.groupby(c).churn.agg(['mean', 'count'])
    df_group['diff'] = df_group['mean'] - global_churn
    df_group['risk'] = df_group['mean'] / global_churn
    display(df_group)
    print()
    print()

gender	mean	count	diff	risk
female	0.276824	2796	0.006856	1.025396
male	0.263214	2838	-0.006755	0.974980

mean, count, diff, and risk for gender column

seniorcitizen	mean	count	diff	risk
0	0.242270	4722	-0.027698	0.897403
1	0.413377	912	0.143409	1.531208

mean, count, diff, and risk for senior citizen column

partner	mean	count	diff	risk
no	0.329809	2932	0.059841	1.221659
yes	0.205033	2702	-0.064935	0.759472

mean, count, diff, and risk for partner column

dependents	mean	count	diff	risk
no	0.313760	3968	0.043792	1.162212
yes	0.165666	1666	-0.104302	0.613651

mean, count, diff, and risk for dependents column

phoneservice	mean	count	diff	risk
no	0.241316	547	-0.028652	0.893870
yes	0.273049	5087	0.003081	1.011412

mean, count, diff, and risk for phone service column

multiplelines	mean	count	diff	risk
no	0.257407	2700	-0.012561	0.953474
no_phone_service	0.241316	547	-0.028652	0.893870
yes	0.290742	2387	0.020773	1.076948

mean, count, diff, and risk for multiple lines column

internetservice	mean	count	diff	risk
dsl	0.192347	1934	-0.077621	0.712482
fiber_optic	0.425171	2479	0.155203	1.574895
no	0.077805	1221	-0.192163	0.288201

mean, count, diff, and risk for internet service column

onlinesecurity	mean	count	diff	risk
no	0.420921	2801	0.150953	1.559152
no_internet_service	0.077805	1221	-0.192163	0.288201
yes	0.153226	1612	-0.116742	0.567570

mean, count, diff, and risk for online security column

onlinebackup	mean	count	diff	risk
no	0.404323	2498	0.134355	1.497672
no_internet_service	0.077805	1221	-0.192163	0.288201
yes	0.217232	1915	-0.052736	0.804660

mean, count, diff, and risk for online backup column

deviceprotection	mean	count	diff	risk
no	0.395875	2473	0.125907	1.466379
no_internet_service	0.077805	1221	-0.192163	0.288201
yes	0.230412	1940	-0.039556	0.853480

mean, count, diff, and risk for device protection column

techsupport	mean	count	diff	risk
no	0.418914	2781	0.148946	1.551717
no_internet_service	0.077805	1221	-0.192163	0.288201
yes	0.159926	1632	-0.110042	0.592390

mean, count, diff, and risk for tech support column

streamingtv	mean	count	diff	risk
no	0.342832	2246	0.072864	1.269897
no_internet_service	0.077805	1221	-0.192163	0.288201
yes	0.302723	2167	0.032755	1.121328

mean, count, diff, and risk for streaming tv column

streamingmovies	mean	count	diff	risk
no	0.338906	2213	0.068938	1.255358
no_internet_service	0.077805	1221	-0.192163	0.288201
yes	0.307273	2200	0.037305	1.138182

mean, count, diff, and risk for streaming movies column

contract	mean	count	diff	risk
month-to-month	0.431701	3104	0.161733	1.599082
one_year	0.120573	1186	-0.149395	0.446621
two_year	0.028274	1344	-0.241694	0.104730

mean, count, diff, and risk for contract column

paperlessbilling	mean	count	diff	risk
no	0.172071	2313	-0.097897	0.637375
yes	0.338151	3321	0.068183	1.252560

mean, count, diff, and risk for paperless billing column

paymentmethod	mean	count	diff	risk
bank_transfer_(automatic)	0.168171	1219	-0.101797	0.622928
credit_card_(automatic)	0.164339	1217	-0.105630	0.608733
electronic_check	0.455890	1893	0.185922	1.688682
mailed_check	0.193870	1305	-0.076098	0.718121

mean, count, diff, and risk for payment method column
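The loop above can also be wrapped in a small reusable function. The following is a minimal sketch, assuming pandas is installed; the function name `churn_stats` and the toy DataFrame are made up for illustration and are not part of the course dataset.

```python
import pandas as pd

def churn_stats(df, column, target="churn"):
    """Per-group churn rate, group size, difference, and risk vs. the global rate.

    Hypothetical helper for illustration; `target` must be a 0/1 column.
    """
    global_rate = df[target].mean()
    stats = df.groupby(column)[target].agg(["mean", "count"])
    stats["diff"] = stats["mean"] - global_rate  # group rate minus global rate
    stats["risk"] = stats["mean"] / global_rate  # group rate relative to global rate
    return stats.sort_values("risk", ascending=False)

# Toy data: 8 customers, 4 of whom churned (global churn rate 0.5)
toy = pd.DataFrame({
    "contract": ["monthly"] * 4 + ["yearly"] * 4,
    "churn": [1, 1, 1, 0, 0, 0, 1, 0],
})

toy_stats = churn_stats(toy, "contract")
print(toy_stats)
```

On the real data, the same call would be `churn_stats(df_full_train, c)` for each column `c` in `categorical`. Sorting by risk puts the groups most prone to churn at the top of each table.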

Summary

This article has covered the difference and the risk ratio as two important tools for assessing feature importance.

Concerning the difference, the tables above compute it as the group’s churn rate minus the global churn rate. Here, we are primarily interested in large differences, unlike in the gender case. With this convention, values larger than 0 indicate a higher likelihood to churn, while values smaller than 0 indicate a lower likelihood to churn. (The first snippets in the churn rate section subtract in the opposite order, global minus group, so the signs there are reversed.)

As for the risk ratio, it is obtained by dividing the group’s churn rate by the global churn rate. Values greater than 1 suggest a higher likelihood to churn, whereas values less than 1 suggest a lower likelihood to churn.

In essence, both difference and risk ratio convey similar information but in different ways, providing insights into the importance of features with respect to churn prediction.
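One way to see that the two measures carry the same information: when the difference is taken as group rate minus global rate, the risk equals 1 plus the difference divided by the global rate. The short check below uses the churn rates reported earlier for customers without partners. It is only a sanity check of that identity, not an additional analysis step.

```python
# Values copied from the outputs shown earlier in this article
global_churn = 0.26996805111821087
churn_no_partner = 0.3298090040927694

diff = churn_no_partner - global_churn  # group rate minus global rate
risk = churn_no_partner / global_churn  # group rate relative to global rate

# risk = (global + diff) / global = 1 + diff / global
risk_from_diff = 1 + diff / global_churn

print(diff, risk, risk_from_diff)
assert abs(risk - risk_from_diff) < 1e-12
```

The difference is expressed in absolute percentage points, while the risk is relative to the baseline, which is why the risk is often easier to compare across datasets with different global churn rates.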

We observe certain categories in which people tend to churn more or less frequently compared to the global average. These are the types of variables we are interested in and want to use in machine learning algorithms. While it’s informative to see this for individual variables in each table, it would be valuable to have a measure that quantifies the overall importance of each variable.

How can we assess whether the “contract” variable is more or less important than “streamingmovies”? That requires a single measure per variable, such as mutual information, which is covered later.

ML Zoomcamp 2023 – Machine Learning for Classification– Part 5
