ML Zoomcamp 2023 – Deploying Machine Learning Models – Part 2

  1. Saving and loading the model
    1. Saving the model to pickle
    2. Loading the model with Pickle
    3. Turning our notebook into a Python script

Saving and loading the model

Let’s take a moment to recap what we’ve accomplished so far. Before we can save a model, the crucial first step is training it. I’ve extensively covered model training in previous articles, where we explored various techniques, including K-Fold cross-validation. Below, I’ve included all the code necessary for model training. While there’s nothing new here, it serves as a useful recap of our progress.

The first code snippet contains all the necessary imports.

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

The next snippet is about data preparation: we read the CSV file, make the column names consistent, and handle the categorical and numerical values.

# Data preparation

df = pd.read_csv('data-week-3.csv')

df.columns = df.columns.str.lower().str.replace(' ', '_')

categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)

for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')

df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce')
df.totalcharges = df.totalcharges.fillna(0)

df.churn = (df.churn == 'yes').astype(int)

The next snippet is about data splitting. Again we use the train_test_split function, this time to divide the dataset into full_train and test data.

# Data splitting

df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)

The next snippet lists the numerical and the categorical column names.

numerical = ['tenure', 'monthlycharges', 'totalcharges']

categorical = ['gender', 'seniorcitizen', 'partner', 'dependents',
       'phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
       'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
       'paymentmethod']

The next snippet is about the train function. It has three arguments: the training dataframe, the target values y_train, and C, the regularization parameter of LogisticRegression. The first step is to create dictionaries from the categorical and numerical columns; the DictVectorizer one-hot encodes the categorical features and passes the numerical ones through unchanged. Next we create a DictVectorizer instance and call its fit_transform function on the dictionaries, which gives us X_train. Then we create our logistic regression model and train it (the fit function) on the training data (X_train and y_train). To be able to apply the model later, we return both the DictVectorizer and the model.

def train(df_train, y_train, C=1.0):
    dicts = df_train[categorical + numerical].to_dict(orient='records')

    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(dicts)

    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)

    return dv, model

As mentioned above, to use the model we also need the DictVectorizer. Both are arguments of the predict function shown in the next snippet, together with the dataframe we want predictions for. The first step is the same as in the train function: we build the dictionaries. The DictVectorizer then transforms them into X, the feature matrix we predict on. The function returns the predicted probability of churning.

def predict(df, dv, model):
    dicts = df[categorical + numerical].to_dict(orient='records')

    X = dv.transform(dicts)
    y_pred = model.predict_proba(X)[:, 1]

    return y_pred

The next snippet sets up two parameters. The first is the C value for the logistic regression model, and the n_splits parameter tells us how many splits to use in K-Fold cross-validation. Here, we use 5 splits.

C = 1.0
n_splits = 5

The next snippet shows the K-Fold cross-validation, using the parameters from the previous snippet. The for loop iterates over all folds and trains a model on each. After that we calculate the roc_auc_score and collect the value for each fold. At the end, the mean score and the standard deviation across all folds are printed.

kfold = KFold(n_splits=n_splits, shuffle=True, random_state=1)  

scores = []

for train_idx, val_idx in kfold.split(df_full_train):
    df_train = df_full_train.iloc[train_idx]
    df_val = df_full_train.iloc[val_idx]

    y_train = df_train.churn.values
    y_val = df_val.churn.values

    dv, model = train(df_train, y_train, C=C)
    y_pred = predict(df_val, dv, model)

    auc = roc_auc_score(y_val, y_pred)
    scores.append(auc)

print('C=%s %.3f +- %.3f' % (C, np.mean(scores), np.std(scores)))

# Output: C=1.0 0.841 +- 0.008

The previous snippet only prints the mean and standard deviation, not the score for each fold separately; here you can see each value.

scores

# Output: 
# [0.8438508214866044,
# 0.8450763971659383,
# 0.8327513546056594,
# 0.8301724275756219,
# 0.8521461516739357]

The last step is to train the final model on the full_train data. The steps are similar to those above: first train the model, then predict on the test data, and finally calculate the roc_auc_score. We get a value of 85.7%, slightly higher than the K-Fold average of 84.1%, but not by much.

dv, model = train(df_full_train, df_full_train.churn.values, C=1.0)
y_pred = predict(df_test, dv, model)
y_test = df_test.churn.values

auc = roc_auc_score(y_test, y_pred)
auc

# Output: 0.8572386167896259

Until now the model lives only in our Jupyter notebook, so we cannot simply take it and put it into a web service. Remember, we want to expose this model as a web service so that the marketing team can use it to score customers. That means we now need to save the model so that we can load it later.

Saving the model to pickle

For saving the model we’ll use pickle, which is a built-in Python module for serializing Python objects.

import pickle

First, we need to choose a name for the model file before we can write to it. The following snippet demonstrates two ways of building the file name.

output_file = 'model_C=%s.bin' % C
output_file
# Output: 'model_C=1.0.bin'

output_file = f'model_C={C}.bin'
output_file
# Output: 'model_C=1.0.bin'

Now we want to create a file with that file name. ‘wb’ means write binary. We save the DictVectorizer together with the model, because with the model alone we would not be able to translate a customer into a feature matrix. Closing the file is crucial; otherwise we cannot be certain that the content has actually been written to disk.

f_out = open(output_file, 'wb')

pickle.dump((dv, model), f_out)

f_out.close()

To avoid accidentally forgetting to close the file, we can use the ‘with’ statement, which ensures that the file is closed automatically. Everything we do inside the ‘with’ statement keeps the file open. However, once we exit this statement, the file is automatically closed.

with open(output_file, 'wb') as f_out:
    pickle.dump((dv, model), f_out)

Loading the model with Pickle

For loading the model we’ll also use pickle.

import pickle
model_file = 'model_C=1.0.bin'

We also utilize the ‘with’ statement for loading the model. Here, ‘rb’ denotes Read Binary. We employ the ‘load’ function from pickle, which returns both the DictVectorizer and the model.

with open(model_file, 'rb') as f_in:
    dv, model = pickle.load(f_in)

dv, model
# Output: (DictVectorizer(sparse=False), LogisticRegression(max_iter=1000))

After loading the model, let’s use it to score one sample customer.

customer = {
    'gender': 'female',
    'seniorcitizen': 0,
    'partner': 'yes',
    'dependents': 'no',
    'phoneservice': 'no',
    'multiplelines': 'no_phone_service',
    'internetservice': 'dsl',
    'onlinesecurity': 'no',
    'onlinebackup': 'yes',
    'deviceprotection': 'no',
    'techsupport': 'no',
    'streamingtv': 'no',
    'streamingmovies': 'no',
    'contract': 'month-to-month',
    'paperlessbilling': 'yes',
    'paymentmethod': 'electronic_check',
    'tenure': 1,
    'monthlycharges': 29.85,
    'totalcharges': 29.85
}

Before we can score this customer we need to turn it into a feature matrix. The DictVectorizer expects a list of dictionaries, which is why we wrap the single customer in a list.

X = dv.transform([customer])
X

# Output: 
# array([[ 1.  ,  0.  ,  0.  ,  1.  ,  0.  ,  1.  ,  0.  ,  0.  ,  1.  ,
#         0.  ,  1.  ,  0.  ,  0.  , 29.85,  0.  ,  1.  ,  0.  ,  0.  ,
#         0.  ,  1.  ,  1.  ,  0.  ,  0.  ,  0.  ,  1.  ,  0.  ,  1.  ,
#         0.  ,  0.  ,  1.  ,  0.  ,  1.  ,  0.  ,  0.  ,  1.  ,  0.  ,
#         0.  ,  1.  ,  0.  ,  0.  ,  1.  ,  0.  ,  0.  ,  1.  , 29.85]])

We use the predict_proba function to get the probability that this particular customer is going to churn. We’re interested in the probability of the positive class, so we select row 0 and column 1.

model.predict_proba(X)
# Output: array([[0.36364158, 0.63635842]])

model.predict_proba(X)[0,1]
# Output: 0.6363584152758612
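The whole save, load, and score cycle can be checked end to end. The snippet below is a self-contained sketch: the data is a tiny made-up stand-in for the churn dataframe, and the file path is temporary, but the pickle mechanics are exactly the ones used above.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny synthetic dataset standing in for the churn data
dicts = [
    {'contract': 'month-to-month', 'tenure': 1},
    {'contract': 'two_year', 'tenure': 60},
    {'contract': 'month-to-month', 'tenure': 3},
    {'contract': 'one_year', 'tenure': 24},
]
y = [1, 0, 1, 0]

# Fit the vectorizer and the model, just like in the train function
dv = DictVectorizer(sparse=False)
X = dv.fit_transform(dicts)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Save the (DictVectorizer, model) tuple and load it back
path = os.path.join(tempfile.mkdtemp(), 'model_check.bin')
with open(path, 'wb') as f_out:
    pickle.dump((dv, model), f_out)

with open(path, 'rb') as f_in:
    dv_loaded, model_loaded = pickle.load(f_in)

# The reloaded pair produces identical predictions
print(bool((model_loaded.predict_proba(X) == model.predict_proba(X)).all()))
# prints True
```

Because pickle stores the fitted coefficients as-is, the reloaded model is bit-for-bit identical to the original one.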

Turning our notebook into a Python script

We can turn the Jupyter notebook into a Python file. One easy way of doing this is to click “File” -> “Download as” -> “Python (.py)” in the Jupyter interface.
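The same conversion can also be done from the command line with nbconvert, which ships with Jupyter. The notebook name below is a placeholder; substitute your own file name.

```shell
# Convert the notebook to a plain .py script (writes notebook.py next to it)
jupyter nbconvert --to script notebook.ipynb
```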
