ML Zoomcamp 2023 – Machine Learning for Classification– Part 2

Data preparation

The topics that we cover in this section are:

  1. Data preparation
    1. Downloading the data
    2. Reading the data
    3. Making column names and values look uniform
    4. Verify that all the columns have been read correctly
    5. Checking if the churn variable needs any preparation

Downloading the data

First, we import all the necessary packages. Then, we can download our CSV file using the ‘wget’ command. When using Jupyter Notebook, it’s important to note that ‘!’ indicates the execution of a shell command, and the ‘$’ symbol, as seen in ‘$data,’ is the way to reference data within this shell command.

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

data = "https://..."
!wget $data -O data-week-3.csv

Reading the data

When reading the data this time, we can see that there are a lot of columns – 21 in total. The three dots ‘…’ in the header row indicate that not all columns are shown, making it a bit more challenging to get a complete overview.

df = pd.read_csv('data-week-3.csv')
df.head()
customerIDgenderSeniorCitizenPartnerDependentstenurePhoneServiceMultipleLinesInternetService
OnlineSecurityDeviceProtectionTechSupportStreamingTVStreamingMoviesContractPaperlessBillingPaymentMethodMonthlyChargesTotalChargesChurn
07590-VHVEGFemale0YesNo1NoNo phone serviceDSLNoNoNoNoNoMonth-to-monthYesElectronic check29.8529.85No
15575-GNVDEMale0NoNo34YesNoDSLYesYesNoNoNoOne yearNoMailed check56.951889.5No
23668-QPYBKMale0NoNo2YesNoDSLYesNoNoNoNoMonth-to-monthYesMailed check53.85108.15Yes
37795-CFOCWMale0NoNo45NoNo phone serviceDSLYesYesYesNoNoOne yearNoBank transfer (automatic)42.301840.75No
49237-HQITUFemale0NoNo2YesNoFiber opticNoNoNoNoNoMonth-to-monthYesElectronic check70.70151.65Yes
5 rows × 21 columns

To display all of them simultaneously, we can use the transpose function. This will switch the rows to become columns and the columns to become rows.

df.head().T
01234
customerID7590-VHVEG5575-GNVDE3668-QPYBK7795-CFOCW9237-HQITU
genderFemaleMaleMaleMaleFemale
SeniorCitizen00000
PartnerYesNoNoNoNo
DependentsNoNoNoNoNo
tenure1342452
PhoneServiceNoYesYesNoYes
MultipleLinesNo phone serviceNoNoNo phone serviceNo
InternetServiceDSLDSLDSLDSLFiber optic
OnlineSecurityNoYesYesYesNo
OnlineBackupYesNoYesNoNo
DeviceProtectionNoYesNoYesNo
TechSupportNoNoNoYesNo
StreamingTVNoNoNoNoNo
StreamingMoviesNoNoNoNoNo
ContractMonth-to-monthOne yearMonth-to-monthOne yearMonth-to-month
PaperlessBillingYesNoYesNoYes
PaymentMethodElectronic checkMailed checkMailed checkBank transfer (automatic)Electronic check
MonthlyCharges29.8556.9553.8542.370.7
TotalCharges29.851889.5108.151840.75151.65
ChurnNoNoYesNoYes

Making column names and values look uniform

What we can observe here is that the data is not consistent or uniform. So, we’ll follow a similar approach as we did before in the car price prediction project.

df.columns = df.columns.str.lower().str.replace(' ', '_')

categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)

for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')

df.head().T
01234
customerid7590-vhveg5575-gnvde3668-qpybk7795-cfocw9237-hqitu
genderfemalemalemalemalefemale
seniorcitizen00000
partneryesnononono
dependentsnonononono
tenure1342452
phoneservicenoyesyesnoyes
multiplelinesno_phone_servicenonono_phone_serviceno
internetservicedsldsldsldslfiber_optic
onlinesecuritynoyesyesyesno
onlinebackupyesnoyesnono
deviceprotectionnoyesnoyesno
techsupportnononoyesno
streamingtvnonononono
streamingmoviesnonononono
contractmonth-to-monthone_yearmonth-to-monthone_yearmonth-to-month
paperlessbillingyesnoyesnoyes
paymentmethodelectronic_checkmailed_checkmailed_checkbank_transfer_(automatic)electronic_check
monthlycharges29.8556.9553.8542.370.7
totalcharges29.851889.5108.151840.75151.65
churnnonoyesnoyes

Verify that all the columns have been read correctly

Now, all the column names are uniform, and all spaces have been replaced with underscores. Next, let’s examine the data types we have.

df.dtypes
# Output:
# customerid object
# gender object
# seniorcitizen int64
# partner object
# dependents object
# tenure int64
# phoneservice object
# multiplelines object
# internetservice object
# onlinesecurity object
# onlinebackup object
# deviceprotection object
# techsupport object
# streamingtv object
# streamingmovies object
# contract object
# paperlessbilling object
# paymentmethod object
# monthlycharges float64
# totalcharges object
# churn object
# dtype: object

We notice a few interesting things here. ‘Seniorcitizen’ is represented as a numerical value (0 or 1) rather than a string (‘yes’ or ‘no’), and ‘totalcharges’ is currently classified as an object but should be a numerical data type.

df.totalcharges
# Output:
# 0 29.85
# 1 1889.5
# 2 108.15
# 3 1840.75
# 4 151.65
# …
# 7038 1990.5
# 7039 7362.9
# 7040 346.45
# 7041 306.6
# 7042 6844.5
# Name: totalcharges, Length: 7043, dtype: object

Indeed, ‘totalcharges’ appears to be numeric in nature. It seems that some of the values are not in numeric format. Let’s attempt to convert them into numbers…

pd.to_numeric(df.totalcharges)
# Output: ValueError: Unable to parse string “_” at position 488

This means that the column doesn’t only contain numbers but also includes characters like ‘_’, which are not numeric. The reason for this is that in this dataset, a space represents ‘not available,’ indicating missing data. We have replaced these spaces with underscores. We can utilize Pandas for this task. If Pandas encounters a string that it cannot parse as a number, we can instruct it to ignore the string and replace it with ‘NaN’ (Not a number). This can be achieved by using a parameter called ‘errors,’ and setting it to ‘coerce,’ which means Pandas will disregard such errors.

# tc means total charges
tc = pd.to_numeric(df.totalcharges, errors=’coerce’)
tc
# Output:
# 0 29.85
# 1 1889.50
# 2 108.15
# 3 1840.75
# 4 151.65
# …
# 7038 1990.50
# 7039 7362.90
# 7040 346.45
# 7041 306.60
# 7042 6844.50
# Name: totalcharges, Length: 7043, dtype: float64

tc.isnull().sum()
# Output: 11

df[tc.isnull()]
# it's not easy to see so we need a different one
df[tc.isnull()][['customerid', 'totalcharges']]
customeridtotalcharges
4884472-lvygi_
7533115-czmzd_
9365709-lvoeq_
10824367-nuyao_
13401371-dwpaz_
33317644-omvmy_
38263213-vvolg_
43802520-sgtta_
52182923-arzlg_
66704075-wkniu_
67542775-sefee_

Now let’s take of that values. We can fill those values with 0. However, it’s important to keep in mind that 0 is not always the ideal choice, but in many practical scenarios, it’s an acceptable one.

df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce')

df.totalcharges = df.totalcharges.fillna(0)
df.totalcharges.isnull().sum()
# Output: 0

Checking if the churn variable needs any preparation

The last thing we wanted to do is look at the churn variable. We see that the values are either “yes” or “no.” In machine learning, for classification, we are interested in numerical values. In this case, we can represent “churn” as 1 and “not churn” as 0.

df.churn.head()
# Output:
# 0 no
# 1 no
# 2 yes
# 3 no
# 4 yes
# Name: churn, dtype: object

To obtain these numerical values, we can replace “yes” with 1 and “no” with 0, as demonstrated in the following snippet.

(df.churn == ‘yes’).astype(int).head()
# Output:
# 0 0
# 1 0
# 2 1
# 3 0
# 4 1
# Name: churn, dtype: int64

Now, let’s apply this transformation to all values in this column and update it in the df.churn column.

df.churn = (df.churn == ‘yes’).astype(int)
df.churn
# Output:
# 0 0
# 1 0
# 2 1
# 3 0
# 4 1
# ..
# 7038 0
# 7039 0
# 7040 0
# 7041 1
# 7042 0
# Name: churn, Length: 7043, dtype: int64

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.