EDA – Exploratory Data Analysis
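This section covers three topics: checking missing values, looking at the target variable (churn), and looking at numerical and categorical variables.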
Checking missing values
The following snippet indicates that the dataset ‘df_full_train’ contains no missing values:
df_full_train.isnull().sum()
# Output:
# customerid 0
# gender 0
# seniorcitizen 0
# partner 0
# dependents 0
# tenure 0
# phoneservice 0
# multiplelines 0
# internetservice 0
# onlinesecurity 0
# onlinebackup 0
# deviceprotection 0
# techsupport 0
# streamingtv 0
# streamingmovies 0
# contract 0
# paperlessbilling 0
# paymentmethod 0
# monthlycharges 0
# totalcharges 0
# churn 0
# dtype: int64
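For contrast, here is a minimal sketch of what the same check reports when values are missing. The toy frame is hypothetical (it is not the churn dataset) and only reuses a few of its column names for illustration. Summing the per-column counts once more gives a single number that is handy for a quick sanity check.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with deliberate gaps (not the churn dataset).
toy = pd.DataFrame({
    "tenure": [1, 24, np.nan, 60],
    "contract": ["month-to-month", None, "two_year", None],
    "churn": [1, 0, 0, 0],
})

# Per-column count of missing values, the same call as above.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Total number of missing cells; 0 would mean the frame is complete.
total_missing = int(missing_per_column.sum())
print(total_missing)
```

Both NaN and None count as missing here, so a column of strings is checked the same way as a numeric one.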
Looking at the target variable (churn)
df_full_train.churn
# Output:
# 0 0
# 1 1
# 2 0
# 3 0
# 4 0
# ..
# 5629 1
# 5630 0
# 5631 1
# 5632 1
# 5633 0
# Name: churn, Length: 5634, dtype: int64
The first thing we can check is the distribution of our target variable ‘churn’: how many customers are churning and how many are not.
df_full_train.churn.value_counts()
# Output:
# 0 4113
# 1 1521
# Name: churn, dtype: int64
The dataset describes 5634 customers in total. Of these, 1521 churned and the remaining 4113 did not. Checking the distribution of the target variable is an essential early step in any data analysis or modeling task, because it reveals the class balance, which influences modeling decisions and the choice of evaluation metrics.
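To see why class balance matters for evaluation, consider a hypothetical baseline that predicts "no churn" for every customer. The sketch below rebuilds a label array from the counts reported above (4113 and 1521); the baseline itself is an illustrative assumption, not a model used in this section.

```python
import numpy as np

# Class counts taken from the value_counts() output above.
n_not_churn = 4113
n_churn = 1521

# Rebuild a label array with the same class balance.
y = np.concatenate([
    np.zeros(n_not_churn, dtype=int),
    np.ones(n_churn, dtype=int),
])

# Hypothetical baseline: predict "no churn" (0) for every customer.
y_pred = np.zeros_like(y)

# Share of correct predictions and number of churners the baseline catches.
accuracy = (y_pred == y).mean()
churners_found = int(((y_pred == 1) & (y == 1)).sum())

print(round(accuracy, 4))
print(churners_found)
```

The baseline is right for roughly 73% of customers while identifying none of the churners, which is why accuracy alone can be misleading on an imbalanced target.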
Calling the value_counts() method with normalize=True returns proportions instead of counts. The proportion for class 1 is the churn rate: the share of churning customers among all customers. In our case, the churn rate is almost 27%.
df_full_train.churn.value_counts(normalize=True)
# Output:
# 0 0.730032
# 1 0.269968
# Name: churn, dtype: float64
There is another way to calculate the global churn rate: we can simply call the mean() method, as shown in the next snippet.
global_churn_rate = df_full_train.churn.mean()
round(global_churn_rate, 2)
# Output: 0.27
This is the same value as the churn rate obtained from value_counts(normalize=True). Let’s explore why this works here.
The formula for the mean is given by:
mean = (1/n) ∑ x_i
In this case, where each x_i ∈ {0, 1}, it simplifies to:
mean = (number of ones) / n
And that is indeed the churn rate. The same holds for any variable encoded as 0 and 1: its mean equals the proportion of ones, which in this context is the churn rate.
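The equivalence can be checked numerically. This is a minimal sketch that rebuilds a 0/1 series from the class counts reported earlier (4113 zeros and 1521 ones) rather than loading the real dataset; the variable names are illustrative.

```python
import pandas as pd

# Rebuild a 0/1 series with the class counts reported above;
# the order of the rows does not affect any of the results.
churn = pd.Series([0] * 4113 + [1] * 1521, name="churn")

# Three routes to the same number.
rate_from_mean = churn.mean()
rate_from_counts = churn.value_counts(normalize=True).loc[1]
rate_by_hand = (churn == 1).sum() / len(churn)

print(round(rate_from_mean, 6))
print(round(rate_from_counts, 6))
print(round(rate_by_hand, 6))
```

All three expressions compute the number of ones divided by n, so they agree up to floating-point rounding.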
Looking at numerical and categorical variables
To identify the categorical and numerical variables in your dataset, you can inspect the dtypes attribute, as mentioned earlier, or filter columns by type with the select_dtypes() method. Here’s how you can use the latter in general:
numerical_vars = df.select_dtypes(include=['int64', 'float64'])
categorical_vars = df.select_dtypes(include=['object'])
print("Numerical Variables:")
print(numerical_vars.columns)
print("\nCategorical Variables:")
print(categorical_vars.columns)
This code will help you separate and display the numerical and categorical variables in your dataset, making it easier to understand the data’s structure and plan your data analysis accordingly. Let’s look at our dataframe.
df_full_train.dtypes
# Output:
# customerid object
# gender object
# seniorcitizen int64
# partner object
# dependents object
# tenure int64
# phoneservice object
# multiplelines object
# internetservice object
# onlinesecurity object
# onlinebackup object
# deviceprotection object
# techsupport object
# streamingtv object
# streamingmovies object
# contract object
# paperlessbilling object
# paymentmethod object
# monthlycharges float64
# totalcharges float64
# churn int64
# dtype: object
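The dtypes output lists five numeric columns, but only three of them are genuine numerical variables: tenure, monthlycharges, and totalcharges. The seniorcitizen column is a 0/1 flag that we treat as categorical, and churn is the target. Let’s define the numerical and categorical columns.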
df_full_train.columns
# Output:
# Index(['customerid', 'gender', 'seniorcitizen', 'partner', 'dependents',
# 'tenure', 'phoneservice', 'multiplelines', 'internetservice',
# 'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
# 'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
# 'paymentmethod', 'monthlycharges', 'totalcharges', 'churn'],
# dtype='object')
numerical = ['tenure', 'monthlycharges', 'totalcharges']
# All remaining columns, excluding the identifier 'customerid', the three
# numerical columns, and the target 'churn'
categorical = ['gender', 'seniorcitizen', 'partner', 'dependents',
               'phoneservice', 'multiplelines', 'internetservice',
               'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
               'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
               'paymentmethod']
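The categorical list can also be derived instead of typed by hand. The sketch below uses a hypothetical three-row sample with a subset of the columns; the values are made up for illustration. It also shows why a purely dtype-based split is not enough here: seniorcitizen and churn are stored as integers, so they would land among the numerical columns.

```python
import pandas as pd

# Hypothetical three-row sample with a subset of the columns.
sample = pd.DataFrame({
    "customerid": ["a-1", "b-2", "c-3"],
    "gender": ["female", "male", "female"],
    "seniorcitizen": [0, 1, 0],
    "tenure": [1, 34, 2],
    "contract": ["month-to-month", "one_year", "month-to-month"],
    "monthlycharges": [29.85, 56.95, 53.85],
    "totalcharges": [29.85, 1889.5, 108.15],
    "churn": [0, 0, 1],
})

# A dtype-based split also picks up the 0/1 flag and the target.
by_dtype = sample.select_dtypes(include="number").columns.tolist()
print(by_dtype)

numerical = ["tenure", "monthlycharges", "totalcharges"]
excluded = set(numerical) | {"customerid", "churn"}  # identifier and target

# Keep the original column order; everything not excluded is categorical.
categorical = [col for col in sample.columns if col not in excluded]
print(categorical)
```

Deriving the list this way keeps it in sync if columns are added or renamed, at the cost of having to maintain the exclusion set.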
To determine the number of unique values for each categorical variable, we can use the nunique() method.
df_full_train[categorical].nunique()
# Output:
# gender 2
# seniorcitizen 2
# partner 2
# dependents 2
# phoneservice 2
# multiplelines 3
# internetservice 3
# onlinesecurity 3
# onlinebackup 3
# deviceprotection 3
# techsupport 3
# streamingtv 3
# streamingmovies 3
# contract 3
# paperlessbilling 2
# paymentmethod 4
# dtype: int64
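The counts say how many distinct values a column has, but not what they are. As a minimal sketch, the loop below prints the values behind each count for a hypothetical two-column sample; the category labels are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample; the category labels are illustrative assumptions.
sample = pd.DataFrame({
    "phoneservice": ["yes", "no", "yes", "yes"],
    "multiplelines": ["no", "no_phone_service", "yes", "no"],
})

# Number of distinct values per column, the same call as above.
unique_counts = sample.nunique()

# Show the distinct values behind each count.
for col in sample.columns:
    values = sorted(sample[col].unique())
    print(f"{col}: {unique_counts[col]} unique -> {values}")
```

Note that nunique() ignores missing values by default, so a column with gaps would report one value fewer than unique() returns.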