Data preparation

The topics that we cover in this section are:

Data preparation

Downloading the data

First, we import all the necessary packages. Then, we can download our CSV file using the ‘wget’ command. When using Jupyter Notebook, it’s important to note that ‘!’ indicates the execution of a shell command, and the ‘$’ symbol, as seen in ‘$data,’ is the way to reference data within this shell command.

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

data = "https://..."
!wget $data -O data-week-3.csv

Reading the data

When reading the data this time, we can see that there are a lot of columns – 21 in total. The three dots ‘…’ in the header row indicate that not all columns are shown, making it a bit more challenging to get a complete overview.

df = pd.read_csv('data-week-3.csv')
df.head()

	customerID	gender	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	…	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
0	7590-VHVEG	Female	Yes	No	1	No	No phone service	DSL	No	…	No	No	No	No	Month-to-month	Yes	Electronic check	29.85	29.85	No
1	5575-GNVDE	Male	No	No	34	Yes	No	DSL	Yes	…	Yes	No	No	No	One year	No	Mailed check	56.95	1889.5	No
2	3668-QPYBK	Male	No	No	2	Yes	No	DSL	Yes	…	No	No	No	No	Month-to-month	Yes	Mailed check	53.85	108.15	Yes
3	7795-CFOCW	Male	No	No	45	No	No phone service	DSL	Yes	…	Yes	Yes	No	No	One year	No	Bank transfer (automatic)	42.30	1840.75	No
4	9237-HQITU	Female	No	No	2	Yes	No	Fiber optic	No	…	No	No	No	No	Month-to-month	Yes	Electronic check	70.70	151.65	Yes

5 rows × 21 columns

To display all of them simultaneously, we can use the transpose function. This will switch the rows to become columns and the columns to become rows.

df.head().T

	0	1	2	3	4
customerID	7590-VHVEG	5575-GNVDE	3668-QPYBK	7795-CFOCW	9237-HQITU
gender	Female	Male	Male	Male	Female
SeniorCitizen	0	0	0	0	0
Partner	Yes	No	No	No	No
Dependents	No	No	No	No	No
tenure	1	34	2	45	2
PhoneService	No	Yes	Yes	No	Yes
MultipleLines	No phone service	No	No	No phone service	No
InternetService	DSL	DSL	DSL	DSL	Fiber optic
OnlineSecurity	No	Yes	Yes	Yes	No
OnlineBackup	Yes	No	Yes	No	No
DeviceProtection	No	Yes	No	Yes	No
TechSupport	No	No	No	Yes	No
StreamingTV	No	No	No	No	No
StreamingMovies	No	No	No	No	No
Contract	Month-to-month	One year	Month-to-month	One year	Month-to-month
PaperlessBilling	Yes	No	Yes	No	Yes
PaymentMethod	Electronic check	Mailed check	Mailed check	Bank transfer (automatic)	Electronic check
MonthlyCharges	29.85	56.95	53.85	42.3	70.7
TotalCharges	29.85	1889.5	108.15	1840.75	151.65
Churn	No	No	Yes	No	Yes

Making column names and values look uniform

What we can observe here is that the data is not consistent or uniform. So, we’ll follow a similar approach as we did before in the car price prediction project.

df.columns = df.columns.str.lower().str.replace(' ', '_')

categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)

for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')

df.head().T

	0	1	2	3	4
customerid	7590-vhveg	5575-gnvde	3668-qpybk	7795-cfocw	9237-hqitu
gender	female	male	male	male	female
seniorcitizen	0	0	0	0	0
partner	yes	no	no	no	no
dependents	no	no	no	no	no
tenure	1	34	2	45	2
phoneservice	no	yes	yes	no	yes
multiplelines	no_phone_service	no	no	no_phone_service	no
internetservice	dsl	dsl	dsl	dsl	fiber_optic
onlinesecurity	no	yes	yes	yes	no
onlinebackup	yes	no	yes	no	no
deviceprotection	no	yes	no	yes	no
techsupport	no	no	no	yes	no
streamingtv	no	no	no	no	no
streamingmovies	no	no	no	no	no
contract	month-to-month	one_year	month-to-month	one_year	month-to-month
paperlessbilling	yes	no	yes	no	yes
paymentmethod	electronic_check	mailed_check	mailed_check	bank_transfer_(automatic)	electronic_check
monthlycharges	29.85	56.95	53.85	42.3	70.7
totalcharges	29.85	1889.5	108.15	1840.75	151.65
churn	no	no	yes	no	yes

Verify that all the columns have been read correctly

Now, all the column names are uniform, and all spaces have been replaced with underscores. Next, let’s examine the data types we have.

df.dtypes
# Output:
# customerid object
# gender object
# seniorcitizen int64
# partner object
# dependents object
# tenure int64
# phoneservice object
# multiplelines object
# internetservice object
# onlinesecurity object
# onlinebackup object
# deviceprotection object
# techsupport object
# streamingtv object
# streamingmovies object
# contract object
# paperlessbilling object
# paymentmethod object
# monthlycharges float64
# totalcharges object
# churn object
# dtype: object

We notice a few interesting things here. ‘Seniorcitizen’ is represented as a numerical value (0 or 1) rather than a string (‘yes’ or ‘no’), and ‘totalcharges’ is currently classified as an object but should be a numerical data type.

df.totalcharges
# Output:
# 0 29.85
# 1 1889.5
# 2 108.15
# 3 1840.75
# 4 151.65
# …
# 7038 1990.5
# 7039 7362.9
# 7040 346.45
# 7041 306.6
# 7042 6844.5
# Name: totalcharges, Length: 7043, dtype: object

Indeed, ‘totalcharges’ appears to be numeric in nature. It seems that some of the values are not in numeric format. Let’s attempt to convert them into numbers…

pd.to_numeric(df.totalcharges)
# Output: ValueError: Unable to parse string “_” at position 488

This means that the column doesn’t only contain numbers but also includes characters like ‘_’, which are not numeric. The reason for this is that in this dataset, a space represents ‘not available,’ indicating missing data. We have replaced these spaces with underscores. We can utilize Pandas for this task. If Pandas encounters a string that it cannot parse as a number, we can instruct it to ignore the string and replace it with ‘NaN’ (Not a number). This can be achieved by using a parameter called ‘errors,’ and setting it to ‘coerce,’ which means Pandas will disregard such errors.

# tc means total charges
tc = pd.to_numeric(df.totalcharges, errors=’coerce’)
tc
# Output:
# 0 29.85
# 1 1889.50
# 2 108.15
# 3 1840.75
# 4 151.65
# …
# 7038 1990.50
# 7039 7362.90
# 7040 346.45
# 7041 306.60
# 7042 6844.50
# Name: totalcharges, Length: 7043, dtype: float64

tc.isnull().sum()
# Output: 11

df[tc.isnull()]
# it's not easy to see so we need a different one
df[tc.isnull()][['customerid', 'totalcharges']]

	customerid	totalcharges
488	4472-lvygi	_
753	3115-czmzd	_
936	5709-lvoeq	_
1082	4367-nuyao	_
1340	1371-dwpaz	_
3331	7644-omvmy	_
3826	3213-vvolg	_
4380	2520-sgtta	_
5218	2923-arzlg	_
6670	4075-wkniu	_
6754	2775-sefee	_

Now let’s take of that values. We can fill those values with 0. However, it’s important to keep in mind that 0 is not always the ideal choice, but in many practical scenarios, it’s an acceptable one.

df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce')

df.totalcharges = df.totalcharges.fillna(0)
df.totalcharges.isnull().sum()
# Output: 0

Checking if the churn variable needs any preparation

The last thing we wanted to do is look at the churn variable. We see that the values are either “yes” or “no.” In machine learning, for classification, we are interested in numerical values. In this case, we can represent “churn” as 1 and “not churn” as 0.

df.churn.head()
# Output:
# 0 no
# 1 no
# 2 yes
# 3 no
# 4 yes
# Name: churn, dtype: object

To obtain these numerical values, we can replace “yes” with 1 and “no” with 0, as demonstrated in the following snippet.

(df.churn == ‘yes’).astype(int).head()
# Output:
# 0 0
# 1 0
# 2 1
# 3 0
# 4 1
# Name: churn, dtype: int64

Now, let’s apply this transformation to all values in this column and update it in the df.churn column.

df.churn = (df.churn == ‘yes’).astype(int)
df.churn
# Output:
# 0 0
# 1 0
# 2 1
# 3 0
# 4 1
# ..
# 7038 0
# 7039 0
# 7040 0
# 7041 1
# 7042 0
# Name: churn, Length: 7043, dtype: int64

ML Zoomcamp 2023 – Machine Learning for Classification– Part 2

Data preparation

Downloading the data

Reading the data

Making column names and values look uniform

Verify that all the columns have been read correctly

Checking if the churn variable needs any preparation

Leave a comment Cancel reply

Data preparation

Downloading the data

Reading the data

Making column names and values look uniform

Verify that all the columns have been read correctly

Checking if the churn variable needs any preparation

Teilen mit:

Related

Leave a comment Cancel reply