Data preparation
The topics that we cover in this section are:
Downloading the data
First, we import all the necessary packages. Then, we can download our CSV file using the ‘wget’ command. When using Jupyter Notebook, it’s important to note that ‘!’ indicates the execution of a shell command, and the ‘$’ symbol, as seen in ‘$data,’ is the way to reference data within this shell command.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = "https://..."
!wget $data -O data-week-3.csv
Reading the data
When reading the data this time, we can see that there are a lot of columns – 21 in total. The three dots ‘…’ in the header row indicate that not all columns are shown, making it a bit more challenging to get a complete overview.
df = pd.read_csv('data-week-3.csv')
df.head()
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | … | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | … | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | … | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | … | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | … | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | … | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
To display all of them simultaneously, we can use the transpose function. This will switch the rows to become columns and the columns to become rows.
df.head().T
| 0 | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|
| customerID | 7590-VHVEG | 5575-GNVDE | 3668-QPYBK | 7795-CFOCW | 9237-HQITU |
| gender | Female | Male | Male | Male | Female |
| SeniorCitizen | 0 | 0 | 0 | 0 | 0 |
| Partner | Yes | No | No | No | No |
| Dependents | No | No | No | No | No |
| tenure | 1 | 34 | 2 | 45 | 2 |
| PhoneService | No | Yes | Yes | No | Yes |
| MultipleLines | No phone service | No | No | No phone service | No |
| InternetService | DSL | DSL | DSL | DSL | Fiber optic |
| OnlineSecurity | No | Yes | Yes | Yes | No |
| OnlineBackup | Yes | No | Yes | No | No |
| DeviceProtection | No | Yes | No | Yes | No |
| TechSupport | No | No | No | Yes | No |
| StreamingTV | No | No | No | No | No |
| StreamingMovies | No | No | No | No | No |
| Contract | Month-to-month | One year | Month-to-month | One year | Month-to-month |
| PaperlessBilling | Yes | No | Yes | No | Yes |
| PaymentMethod | Electronic check | Mailed check | Mailed check | Bank transfer (automatic) | Electronic check |
| MonthlyCharges | 29.85 | 56.95 | 53.85 | 42.3 | 70.7 |
| TotalCharges | 29.85 | 1889.5 | 108.15 | 1840.75 | 151.65 |
| Churn | No | No | Yes | No | Yes |
Making column names and values look uniform
What we can observe here is that the data is not consistent or uniform. So, we’ll follow a similar approach as we did before in the car price prediction project.
df.columns = df.columns.str.lower().str.replace(' ', '_')
categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)
for c in categorical_columns:
df[c] = df[c].str.lower().str.replace(' ', '_')
df.head().T
| 0 | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|
| customerid | 7590-vhveg | 5575-gnvde | 3668-qpybk | 7795-cfocw | 9237-hqitu |
| gender | female | male | male | male | female |
| seniorcitizen | 0 | 0 | 0 | 0 | 0 |
| partner | yes | no | no | no | no |
| dependents | no | no | no | no | no |
| tenure | 1 | 34 | 2 | 45 | 2 |
| phoneservice | no | yes | yes | no | yes |
| multiplelines | no_phone_service | no | no | no_phone_service | no |
| internetservice | dsl | dsl | dsl | dsl | fiber_optic |
| onlinesecurity | no | yes | yes | yes | no |
| onlinebackup | yes | no | yes | no | no |
| deviceprotection | no | yes | no | yes | no |
| techsupport | no | no | no | yes | no |
| streamingtv | no | no | no | no | no |
| streamingmovies | no | no | no | no | no |
| contract | month-to-month | one_year | month-to-month | one_year | month-to-month |
| paperlessbilling | yes | no | yes | no | yes |
| paymentmethod | electronic_check | mailed_check | mailed_check | bank_transfer_(automatic) | electronic_check |
| monthlycharges | 29.85 | 56.95 | 53.85 | 42.3 | 70.7 |
| totalcharges | 29.85 | 1889.5 | 108.15 | 1840.75 | 151.65 |
| churn | no | no | yes | no | yes |
Verify that all the columns have been read correctly
Now, all the column names are uniform, and all spaces have been replaced with underscores. Next, let’s examine the data types we have.
df.dtypes
# Output:
# customerid object
# gender object
# seniorcitizen int64
# partner object
# dependents object
# tenure int64
# phoneservice object
# multiplelines object
# internetservice object
# onlinesecurity object
# onlinebackup object
# deviceprotection object
# techsupport object
# streamingtv object
# streamingmovies object
# contract object
# paperlessbilling object
# paymentmethod object
# monthlycharges float64
# totalcharges object
# churn object
# dtype: object
We notice a few interesting things here. ‘Seniorcitizen’ is represented as a numerical value (0 or 1) rather than a string (‘yes’ or ‘no’), and ‘totalcharges’ is currently classified as an object but should be a numerical data type.
df.totalcharges
# Output:
# 0 29.85
# 1 1889.5
# 2 108.15
# 3 1840.75
# 4 151.65
# …
# 7038 1990.5
# 7039 7362.9
# 7040 346.45
# 7041 306.6
# 7042 6844.5
# Name: totalcharges, Length: 7043, dtype: object
Indeed, ‘totalcharges’ appears to be numeric in nature. It seems that some of the values are not in numeric format. Let’s attempt to convert them into numbers…
pd.to_numeric(df.totalcharges)
# Output: ValueError: Unable to parse string “_” at position 488
This means that the column doesn’t only contain numbers but also includes characters like ‘_’, which are not numeric. The reason for this is that in this dataset, a space represents ‘not available,’ indicating missing data. We have replaced these spaces with underscores. We can utilize Pandas for this task. If Pandas encounters a string that it cannot parse as a number, we can instruct it to ignore the string and replace it with ‘NaN’ (Not a number). This can be achieved by using a parameter called ‘errors,’ and setting it to ‘coerce,’ which means Pandas will disregard such errors.
# tc means total charges
tc = pd.to_numeric(df.totalcharges, errors=’coerce’)
tc
# Output:
# 0 29.85
# 1 1889.50
# 2 108.15
# 3 1840.75
# 4 151.65
# …
# 7038 1990.50
# 7039 7362.90
# 7040 346.45
# 7041 306.60
# 7042 6844.50
# Name: totalcharges, Length: 7043, dtype: float64
tc.isnull().sum()
# Output: 11
df[tc.isnull()]
# it's not easy to see so we need a different one
df[tc.isnull()][['customerid', 'totalcharges']]
| customerid | totalcharges | |
|---|---|---|
| 488 | 4472-lvygi | _ |
| 753 | 3115-czmzd | _ |
| 936 | 5709-lvoeq | _ |
| 1082 | 4367-nuyao | _ |
| 1340 | 1371-dwpaz | _ |
| 3331 | 7644-omvmy | _ |
| 3826 | 3213-vvolg | _ |
| 4380 | 2520-sgtta | _ |
| 5218 | 2923-arzlg | _ |
| 6670 | 4075-wkniu | _ |
| 6754 | 2775-sefee | _ |
Now let’s take of that values. We can fill those values with 0. However, it’s important to keep in mind that 0 is not always the ideal choice, but in many practical scenarios, it’s an acceptable one.
df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce')
df.totalcharges = df.totalcharges.fillna(0)
df.totalcharges.isnull().sum()
# Output: 0
Checking if the churn variable needs any preparation
The last thing we wanted to do is look at the churn variable. We see that the values are either “yes” or “no.” In machine learning, for classification, we are interested in numerical values. In this case, we can represent “churn” as 1 and “not churn” as 0.
df.churn.head()
# Output:
# 0 no
# 1 no
# 2 yes
# 3 no
# 4 yes
# Name: churn, dtype: object
To obtain these numerical values, we can replace “yes” with 1 and “no” with 0, as demonstrated in the following snippet.
(df.churn == ‘yes’).astype(int).head()
# Output:
# 0 0
# 1 0
# 2 1
# 3 0
# 4 1
# Name: churn, dtype: int64
Now, let’s apply this transformation to all values in this column and update it in the df.churn column.
df.churn = (df.churn == ‘yes’).astype(int)
df.churn
# Output:
# 0 0
# 1 0
# 2 1
# 3 0
# 4 1
# ..
# 7038 0
# 7039 0
# 7040 0
# 7041 1
# 7042 0
# Name: churn, Length: 7043, dtype: int64