ML Zoomcamp 2023 – Machine Learning for Classification – Part 4

EDA – Exploratory Data Analysis

The topics that we cover in this section are:

  1. EDA – Exploratory Data Analysis
    1. Checking missing values
    2. Looking at the target variable (churn)
    3. Looking at numerical and categorical variables

Checking missing values

The following snippet indicates that the dataset ‘df_full_train’ contains no missing values:

df_full_train.isnull().sum()

# Output:
# customerid          0
# gender              0
# seniorcitizen       0
# partner             0
# dependents          0
# tenure              0
# phoneservice        0
# multiplelines       0
# internetservice     0
# onlinesecurity      0
# onlinebackup        0
# deviceprotection    0
# techsupport         0
# streamingtv         0
# streamingmovies     0
# contract            0
# paperlessbilling    0
# paymentmethod       0
# monthlycharges      0
# totalcharges        0
# churn               0
# dtype: int64
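
To see what the same check reports when values are actually missing, here is a small sketch on a made-up frame. The `toy` data is hypothetical and not taken from the churn dataset.

```python
import pandas as pd

# Hypothetical toy data with one missing tenure value
toy = pd.DataFrame({
    "tenure": [1, 12, None],
    "contract": ["month-to-month", "one_year", "two_year"],
})

# Number of missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single yes/no answer: is anything missing anywhere in the frame?
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```

A column with a non-zero count here is one that would need imputation or filtering before modeling.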

Looking at the target variable (churn)

df_full_train.churn

# Output:
# 0       0
# 1       1
# 2       0
# 3       0
# 4       0
#        ..
# 5629    1
# 5630    0
# 5631    1
# 5632    1
# 5633    0
# Name: churn, Length: 5634, dtype: int64

The first thing we can check is the distribution of our target variable ‘churn’: how many customers are churning and how many are not.

df_full_train.churn.value_counts()

# Output:
# 0    4113
# 1    1521
# Name: churn, dtype: int64

The dataset contains information about 5634 customers in total. Of these, 1521 are churning customers, while the remaining 4113 are not churning. Understanding the distribution of the target variable is an essential step in any data analysis or modeling task, because it reveals the class balance of the data, which can influence modeling decisions and the choice of evaluation metrics.

Using the value_counts function with the normalize=True parameter provides the churn rate, which represents the proportion of churning customers relative to the total number of customers. In our case, we’ve calculated that the churn rate is almost 27%.

df_full_train.churn.value_counts(normalize=True)

# Output:
# 0    0.730032
# 1    0.269968
# Name: churn, dtype: float64

There is another way to calculate the global churn rate; we can simply use the mean() function, as shown in the next snippet.

global_churn_rate = df_full_train.churn.mean()
round(global_churn_rate, 2)

# Output: 0.27

This is the same value as the churn rate we obtained from value_counts(normalize=True). Let’s explore why this works here.

The formula for the mean is given by:

$$\text{mean} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

In this case, where $x_i \in \{0, 1\}$, it simplifies to:

$$\text{mean} = \frac{\text{number of ones}}{n}$$

And that is indeed the churn rate. This principle holds true for all binary datasets, because the mean of binary values corresponds directly to the proportion of ones in the dataset, which is essentially the churn rate in this context.
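
The identity can be checked with plain Python. This is a minimal sketch that rebuilds a 0/1 list from the counts reported by value_counts() above; the variable names are chosen here for illustration.

```python
# Counts taken from the value_counts() output above
n_not_churn = 4113
n_churn = 1521

# Rebuild the target as a list of zeros and ones
churn = [0] * n_not_churn + [1] * n_churn
n = len(churn)

# Mean of the binary values
mean = sum(churn) / n

# Proportion of ones, computed directly from the counts
proportion_of_ones = n_churn / n

print(round(mean, 2))              # 0.27
print(mean == proportion_of_ones)  # True
```

Summing a list of zeros and ones simply counts the ones, which is why the two quantities coincide.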

Looking at numerical and categorical variables

To identify the categorical and numerical variables in your dataset, you can inspect the dtypes attribute, as mentioned earlier, or filter columns by type with the select_dtypes method. Here’s how you can use it in general:

numerical_vars = df.select_dtypes(include=['int64', 'float64'])
categorical_vars = df.select_dtypes(include=['object'])

print("Numerical Variables:")
print(numerical_vars.columns)

print("\nCategorical Variables:")
print(categorical_vars.columns)

This code will help you separate and display the numerical and categorical variables in your dataset, making it easier to understand the data’s structure and plan your data analysis accordingly. Let’s look at our dataframe.

df_full_train.dtypes

# Output:
# customerid           object
# gender               object
# seniorcitizen         int64
# partner              object
# dependents           object
# tenure                int64
# phoneservice         object
# multiplelines        object
# internetservice      object
# onlinesecurity       object
# onlinebackup         object
# deviceprotection     object
# techsupport          object
# streamingtv          object
# streamingmovies      object
# contract             object
# paperlessbilling     object
# paymentmethod        object
# monthlycharges      float64
# totalcharges        float64
# churn                 int64
# dtype: object
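
Selecting columns purely by dtype has a caveat worth seeing on a small example. The sketch below uses a made-up frame (the `toy` data is hypothetical) with a few columns named after those in the churn dataset. It shows that an integer-coded flag such as seniorcitizen is picked up as numeric.

```python
import pandas as pd

# Hypothetical toy data mirroring a few columns of the churn dataset
toy = pd.DataFrame({
    "gender": ["female", "male", "male"],
    "seniorcitizen": [0, 1, 0],
    "tenure": [1, 34, 2],
    "monthlycharges": [29.85, 56.95, 53.85],
})

# include="number" selects every numeric dtype (ints and floats)
numerical_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric is treated as categorical here
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numerical_cols)    # ['seniorcitizen', 'tenure', 'monthlycharges']
print(categorical_cols)  # ['gender']
```

Because the dtype alone cannot tell a 0/1 flag from a real measurement, the final column lists are usually reviewed by hand.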

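As we know, there are three numerical variables: tenure, monthlycharges, and totalcharges. Although seniorcitizen is stored as int64, it is a binary 0/1 flag, so we treat it as categorical. The column churn is the target and customerid is an identifier, so neither is used as a feature. Let’s define the numerical and categorical columns.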

df_full_train.columns

# Output:
# Index(['customerid', 'gender', 'seniorcitizen', 'partner', 'dependents',
#       'tenure', 'phoneservice', 'multiplelines', 'internetservice',
#       'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
#       'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
#       'paymentmethod', 'monthlycharges', 'totalcharges', 'churn'],
#      dtype='object')

numerical = ['tenure', 'monthlycharges', 'totalcharges']

# All columns except 'customerid', the numerical columns, and the target 'churn'
categorical = ['gender', 'seniorcitizen', 'partner', 'dependents',
       'phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
       'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
       'paymentmethod']
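
The same list can also be derived from the column names instead of being typed by hand. This is a minimal sketch, assuming the column order printed above; the `excluded` set is a name introduced here for illustration.

```python
# Column names as printed by df_full_train.columns above
columns = ['customerid', 'gender', 'seniorcitizen', 'partner', 'dependents',
           'tenure', 'phoneservice', 'multiplelines', 'internetservice',
           'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
           'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
           'paymentmethod', 'monthlycharges', 'totalcharges', 'churn']

numerical = ['tenure', 'monthlycharges', 'totalcharges']

# Identifier, target, and numerical columns are not categorical features
excluded = set(numerical) | {'customerid', 'churn'}

# Keep the original column order while filtering
categorical = [c for c in columns if c not in excluded]

print(len(categorical))  # 16
print(categorical)
```

Deriving the list this way avoids typos and keeps it in sync if the set of numerical columns changes.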

To determine the number of unique values for each of the categorical variables, we can use the nunique() method.

df_full_train[categorical].nunique()

# Output:
# gender              2
# seniorcitizen       2
# partner             2
# dependents          2
# phoneservice        2
# multiplelines       3
# internetservice     3
# onlinesecurity      3
# onlinebackup        3
# deviceprotection    3
# techsupport         3
# streamingtv         3
# streamingmovies     3
# contract            3
# paperlessbilling    2
# paymentmethod       4
# dtype: int64
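
The counts alone do not show which values occur. As a hedged sketch on a made-up frame (the `toy` data and its category labels are hypothetical), nunique() can be paired with unique() to list the distinct values of a single column.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns
toy = pd.DataFrame({
    "gender": ["female", "male", "male", "female"],
    "contract": ["month-to-month", "one_year", "two_year", "month-to-month"],
})

# Number of distinct values per column
unique_counts = toy.nunique()
print(unique_counts)

# The distinct values of one column, sorted for stable display
contract_values = sorted(toy["contract"].unique())
print(contract_values)
```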
