Overview
This post recaps the code that the rest of this chapter builds on: the imports, data preparation, the split into training, validation, and test sets, separating the target variable ‘churn’, training the logistic regression model, and finally validating the model on the validation data and reporting its accuracy.
The first snippet contains the imports: Pandas, NumPy, and Matplotlib, plus three from the Scikit-Learn package. These are ‘train_test_split’ for splitting the data, ‘DictVectorizer’ for one-hot encoding, and ‘LogisticRegression’ from the linear model module.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
First, the dataset is read into the ‘df’ dataframe. The second line standardizes the column names by converting them to lowercase and replacing spaces with underscores. All columns with ‘dtypes == object’ are then collected in ‘categorical_columns’, and the loop applies the same lowercase and underscore normalization to their string values. Note that ‘totalcharges’ is treated as categorical here even though it is, in fact, numerical. To correct this, the column is converted with the Pandas function ‘to_numeric’; the parameter ‘errors='coerce'’ turns any value that cannot be parsed into NaN instead of raising an error. These missing values are then filled with ‘0’. Finally, the ‘churn’ column is converted to integers: ‘yes’ becomes ‘1’ and ‘no’ becomes ‘0’.
df = pd.read_csv('data-week-3.csv')
df.columns = df.columns.str.lower().str.replace(' ', '_')
categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)
for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')
df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce')
df.totalcharges = df.totalcharges.fillna(0)
df.churn = (df.churn == 'yes').astype(int)
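To make the ‘totalcharges’ fix concrete, here is a minimal sketch on a made-up Series. The sample values are hypothetical and not taken from the dataset; the ‘_’ entry is assumed to stand in for a blank that the space-to-underscore replacement has already rewritten.

```python
import pandas as pd

# Hypothetical sample values (not from the real dataset).
# '_' is assumed to represent a blank entry after the string cleanup.
raw = pd.Series(['29.85', '_', '1889.5'])

# errors='coerce' turns unparseable values into NaN instead of raising.
numeric = pd.to_numeric(raw, errors='coerce')

# Replace the resulting NaN values with 0, as in the recap code.
filled = numeric.fillna(0)

print(numeric.isna().tolist())  # [False, True, False]
print(filled.tolist())          # [29.85, 0.0, 1889.5]
```

Filling with 0 is a simple choice rather than the only one; it keeps every row in the data at the cost of treating an unknown charge as zero.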
Next, we call the ‘train_test_split’ function twice. The first call splits the dataset into ‘full_train’ (80%) and ‘test’ (20%). The second call divides ‘full_train’ into training and validation sets; ‘test_size=0.25’ is used here because 25% of the 80% equals 20% of the original data. The result is a 60%-20%-20% split: 60% for training and 20% each for validation and testing. The parameter ‘random_state=1’ makes the random split reproducible.
Following this, the next three lines reset the indices in each of the three datasets. Because the split shuffles the rows, each subset keeps a non-contiguous selection of the original index; ‘reset_index’ renumbers the rows from 0, and ‘drop=True’ discards the old index instead of keeping it as a column.
Then, for each dataset, the target variable ‘y’ is extracted, which in this case is the ‘churn’ column. Finally, the target column is removed from the dataframes to prevent accidental usage during training.
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_test.churn.values
del df_train['churn']
del df_val['churn']
del df_test['churn']
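The 60%-20%-20% arithmetic can be checked on a toy dataframe. This is only a sketch: ‘df_demo’ and its 1000 rows are invented for illustration, and the ‘_demo’ suffixes are there so the names do not clash with the real variables.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataframe with 1000 rows (hypothetical, for checking sizes only).
df_demo = pd.DataFrame({'x': np.arange(1000)})

# Step 1: hold out 20% of all rows as the test set.
full_train_demo, test_demo = train_test_split(
    df_demo, test_size=0.2, random_state=1)

# Step 2: 25% of the remaining 80% equals 20% of the original data.
train_demo, val_demo = train_test_split(
    full_train_demo, test_size=0.25, random_state=1)

print(len(train_demo), len(val_demo), len(test_demo))  # 600 200 200
```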
In the next snippet, we define two lists, ‘numerical’ and ‘categorical’, which hold the names of the columns used as features: ‘numerical’ contains the numerical columns and ‘categorical’ the categorical ones.
numerical = ['tenure', 'monthlycharges', 'totalcharges']
categorical = ['gender', 'seniorcitizen', 'partner', 'dependents',
               'phoneservice', 'multiplelines', 'internetservice',
               'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
               'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
               'paymentmethod']
Next, we create a DictVectorizer instance. We convert the training dataframe into a list of dictionaries, one per row, and pass it to ‘fit_transform(train_dict)’. The ‘fit’ part shows the DictVectorizer how the data is structured, so it can learn the column names and their values and set up one-hot encoding accordingly. Importantly, the DictVectorizer distinguishes between string values and numeric values: strings are one-hot encoded, while numeric values are passed through unchanged. The ‘transform’ part converts the dictionaries into a matrix suitable for machine learning.
After preparing the data, we move on to model creation. In this case, a Logistic Regression model is used. The ‘model.fit’ function is then employed to train the model on the training data.
dv = DictVectorizer(sparse=False)
train_dict = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)
model = LogisticRegression()
model.fit(X_train, y_train)
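To see what the DictVectorizer produces, here is a small sketch with two hand-written records. The records and the ‘_demo’ names are made up for illustration; only the column names ‘contract’ and ‘tenure’ come from the lists above.

```python
from sklearn.feature_extraction import DictVectorizer

# Two hypothetical records: one string feature, one numeric feature.
records = [
    {'contract': 'month-to-month', 'tenure': 1},
    {'contract': 'two_year', 'tenure': 72},
]

dv_demo = DictVectorizer(sparse=False)
X_demo = dv_demo.fit_transform(records)

# String values become one column per category; 'tenure' stays numeric.
print(dv_demo.feature_names_)
# ['contract=month-to-month', 'contract=two_year', 'tenure']
print(X_demo.tolist())
# [[1.0, 0.0, 1.0], [0.0, 1.0, 72.0]]

# A category not seen during fit gets no column, so it encodes as all zeros.
X_new = dv_demo.transform([{'contract': 'one_year', 'tenure': 5}])
print(X_new.tolist())
# [[0.0, 0.0, 5.0]]
```

The last call shows why the vectorizer is fitted on the training data only: later data is mapped onto the columns learned there, and anything unknown is silently dropped.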
We can then validate the trained model on the validation data. The validation DataFrame is prepared in the same way as the training DataFrame: it is converted into dictionaries and passed to the DictVectorizer. This time we call only ‘transform’, not ‘fit_transform’, since the DictVectorizer already knows the data structure from the training data. The transformed output serves as the input for prediction.
For prediction, we use the ‘predict_proba’ function of the model, which returns probabilities in two columns: the first for class 0 (no churn) and the second for class 1 (churn). Here, we are interested in the second column. We turn these probabilities into decisions using a threshold of ‘>=0.5’: the ‘churn_decision’ variable contains ‘True’ for any prediction greater than or equal to the threshold, and ‘False’ otherwise. Accuracy is the share of decisions that match ‘y_val’, calculated with the ‘mean’ function; in this case it is approximately 80%.
val_dict = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dict)
y_pred = model.predict_proba(X_val)[:, 1]
churn_decision = (y_pred >= 0.5)
(y_val == churn_decision).mean()
# Output: 0.8034066713981547
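The thresholding and accuracy calculation can be traced by hand on a few values. The arrays below are invented for illustration and are not model output.

```python
import numpy as np

# Hypothetical labels and predicted churn probabilities.
y_val_demo = np.array([0, 1, 1, 0, 1])
y_pred_demo = np.array([0.1, 0.8, 0.4, 0.6, 0.5])

# Probabilities at or above 0.5 count as a churn prediction.
decision_demo = (y_pred_demo >= 0.5)
print(decision_demo.tolist())  # [False, True, False, True, True]

# Comparing 0/1 labels with booleans works because True == 1 and False == 0;
# the mean of the matches is the fraction of correct decisions.
accuracy_demo = (y_val_demo == decision_demo).mean()
print(accuracy_demo)  # 0.6
```

Three of the five decisions match the labels, which gives 0.6. The same expression on the real validation data yields the value shown above.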