ML Zoomcamp 2023 – Evaluation metrics for classification – Part 7

  1. Cross-Validation
    1. Evaluating the same model on different subsets of data
    2. Getting the average prediction and the spread within predictions
    3. Parameter Tuning

Cross-Validation

Evaluating the same model on different subsets of data

In this article, I’ll discuss cross-validation and parameter tuning, which means selecting the optimal value for a model parameter. Typically, we start by splitting the entire dataset into three parts: training, validation, and testing. We use the validation dataset to determine the best parameters for our model g(xi).
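
As a reminder, here is a minimal sketch of such a three-way split with scikit-learn's train_test_split; the dataframe name df and the 60/20/20 proportions are assumptions for illustration, following the convention used earlier in this series.

from sklearn.model_selection import train_test_split

# Hypothetical 60/20/20 split: first carve out 20% for test,
# then split the remaining 80% into train (75%) and validation (25%)
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)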

For the time being, we set the test set aside and continue working with our combined training and validation data – the so-called full_train dataset. Next, we divide this data into ‘k’ parts, with ‘k’ equal to 3 in this example.

FULL TRAIN
1 2 3

We can train our model on parts 1 and 2, using part 3 for validation. We then compute the AUC on the validation part (3).

TRAIN VAL
1 2 3

The next step is to train another model on parts 1 and 3 and validate it on part 2. Again, we compute the AUC on the validation part (2).

TRAIN VAL
1 3 2

Then we train a third model on parts 2 and 3 and validate it on part 1. Again, we compute the AUC on the validation part (1).

TRAIN VAL
2 3 1

After obtaining three AUC values, we calculate their mean and standard deviation. The standard deviation reflects the model’s stability, i.e., how much the scores vary across the different folds.

K-Fold Cross-Validation is a method for assessing the same model on various subsets of our dataset.
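
To make the picture above concrete, here is a minimal sketch of the manual 3-fold procedure. It assumes the df_full_train dataframe from the split above and reuses the train and predict helper functions defined just below; in practice we let scikit-learn's KFold handle the splitting, as shown later in this article.

import numpy as np
from sklearn.metrics import roc_auc_score

# Manual 3-fold evaluation: cut the row index into 3 chunks, hold one chunk out
# for validation each time, and average the resulting AUCs
folds = np.array_split(np.arange(len(df_full_train)), 3)
aucs = []

for i in range(3):
    val_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(3) if j != i])

    df_tr = df_full_train.iloc[train_idx]
    df_va = df_full_train.iloc[val_idx]

    dv, model = train(df_tr, df_tr.churn.values)
    y_pred = predict(df_va, dv, model)
    aucs.append(roc_auc_score(df_va.churn.values, y_pred))

print('%.3f +- %.3f' % (np.mean(aucs), np.std(aucs)))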

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def train(df_train, y_train):
    # Turn the feature columns into dicts and one-hot encode them
    dicts = df_train[categorical + numerical].to_dict(orient='records')

    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(dicts)

    model = LogisticRegression()
    model.fit(X_train, y_train)

    return dv, model

dv, model = train(df_train, y_train)

def predict(df, dv, model):
    dicts = df[categorical + numerical].to_dict(orient='records')

    # Reuse the DictVectorizer fitted on the training data: transform, not fit_transform
    X = dv.transform(dicts)
    y_pred = model.predict_proba(X)[:, 1]

    return y_pred

y_pred = predict(df_val, dv, model)
y_pred

# Output: array([0.00899722, 0.20451861, 0.2122173 , ..., 0.13639118, 0.79976555, 0.83740295])

We now have the ‘train’ and ‘predict’ functions in place. Let’s proceed to implement K-Fold Cross-Validation.

from sklearn.model_selection import KFold

kfold = KFold(n_splits=10, shuffle=True, random_state=1) 

kfold.split(df_full_train)
# Output: <generator object _BaseKFold.split at 0x2838baf20>

train_idx, val_idx = next(kfold.split(df_full_train))
len(train_idx), len(val_idx)
# Output: (5070, 564)

len(df_full_train)
# Output: 5634

# We can use iloc to select a part of this dataframe
df_train = df_full_train.iloc[train_idx]
df_val = df_full_train.iloc[val_idx]

The following code snippet runs the full loop over 10 folds. For each fold, we use the ‘roc_auc_score’ function to calculate the score and collect it in a list.

from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

kfold = KFold(n_splits=10, shuffle=True, random_state=1)  
scores = []

for train_idx, val_idx in kfold.split(df_full_train):
    df_train = df_full_train.iloc[train_idx]
    df_val = df_full_train.iloc[val_idx]

    y_train = df_train.churn.values
    y_val = df_val.churn.values

    dv, model = train(df_train, y_train)
    y_pred = predict(df_val, dv, model)

    auc = roc_auc_score(y_val, y_pred)
    scores.append(auc)

scores
# Output:
# [0.8479398247539081,
# 0.8410581683168317,
# 0.8557214756739697,
# 0.8333552794008724,
# 0.8262717121588089,
# 0.8342657342657342,
# 0.8412569195701727,
# 0.8186669829222013,
# 0.8452349192233585,
# 0.8621054754462034]

Here is the same implementation, this time using the tqdm package to display a progress bar.

!pip3 install tqdm

from sklearn.model_selection import KFold
from tqdm.auto import tqdm

kfold = KFold(n_splits=10, shuffle=True, random_state=1)
scores = []

# total=10 lets tqdm show progress, since kfold.split() is a generator without a length
for train_idx, val_idx in tqdm(kfold.split(df_full_train), total=10):
    df_train = df_full_train.iloc[train_idx]
    df_val = df_full_train.iloc[val_idx]

    y_train = df_train.churn.values
    y_val = df_val.churn.values

    dv, model = train(df_train, y_train)
    y_pred = predict(df_val, dv, model)

    auc = roc_auc_score(y_val, y_pred)
    scores.append(auc)

scores
# Output: 
# [0.8479398247539081,
# 0.8410581683168317,
# 0.8557214756739697,
# 0.8333552794008724,
# 0.8262717121588089,
# 0.8342657342657342,
# 0.8412569195701727,
# 0.8186669829222013,
# 0.8452349192233585,
# 0.8621054754462034]

Getting the average prediction and the spread within predictions

We can use the generated scores to compute the average AUC across the 10 folds – 0.841 – with a standard deviation of 0.012.

print('%.3f +- %.3f' % (np.mean(scores), np.std(scores)))
# Output: 0.841 +- 0.012
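
As a side note, scikit-learn can produce the same kind of cross-validated estimate in a single call with cross_val_score, as long as the DictVectorizer and the model are wrapped in a Pipeline. This is only a sketch, assuming df_full_train, categorical, and numerical are defined as above; the course sticks with the explicit loop, which makes each step easier to inspect.

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Wrap vectorization and the model into one estimator, so each fold
# refits the DictVectorizer only on its own training part
pipeline = make_pipeline(
    DictVectorizer(sparse=False),
    LogisticRegression(max_iter=1000),
)

dicts_full_train = df_full_train[categorical + numerical].to_dict(orient='records')
y_full_train = df_full_train.churn.values

cv = KFold(n_splits=10, shuffle=True, random_state=1)
cv_scores = cross_val_score(pipeline, dicts_full_train, y_full_train,
                            cv=cv, scoring='roc_auc')
print('%.3f +- %.3f' % (cv_scores.mean(), cv_scores.std()))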

Parameter Tuning

Now let’s turn to parameter tuning, in particular the ‘C’ parameter of our LogisticRegression model. ‘C’ controls regularization (it is the inverse of the regularization strength) and has a default value of 1.0; the smaller ‘C’ is, the stronger the regularization. We can expose ‘C’ as an argument of our ‘train’ function. Additionally, we can silence the convergence warning by setting ‘max_iter’ to 1000.

def train(df_train, y_train, C=1.0):
    dicts = df_train[categorical + numerical].to_dict(orient='records')

    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(dicts)

    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)

    return dv, model

dv, model = train(df_train, y_train, C=0.001)

We can iterate over various values of ‘C’, keeping in mind that ‘C’ cannot be set to 0.0: scikit-learn raises an ‘InvalidParameterError’, since the ‘C’ parameter of LogisticRegression must be a float in the range (0.0, inf].
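
Before running the sweep, here is a quick, optional illustration of that constraint – a sketch only; the exact exception type and message depend on your scikit-learn version (recent releases raise InvalidParameterError, older ones a plain ValueError).

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical check: parameter validation rejects C=0.0 when fit() is called
try:
    LogisticRegression(C=0.0).fit(np.array([[0.0], [1.0]]), np.array([0, 1]))
except Exception as e:
    print(type(e).__name__, ':', e)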

from sklearn.model_selection import KFold

kfold = KFold(n_splits=10, shuffle=True, random_state=1)  

for C in [0.001, 0.01, 0.1, 0.5, 1, 5, 10]:
    
    scores = []

    for train_idx, val_idx in kfold.split(df_full_train):
        df_train = df_full_train.iloc[train_idx]
        df_val = df_full_train.iloc[val_idx]

        y_train = df_train.churn.values
        y_val = df_val.churn.values

        dv, model = train(df_train, y_train, C=C)
        y_pred = predict(df_val, dv, model)

        auc = roc_auc_score(y_val, y_pred)
        scores.append(auc)

    print('C=%s %.3f +- %.3f' % (C, np.mean(scores), np.std(scores)))

# Output:
# C=0.001 0.826 +- 0.012
# C=0.01 0.840 +- 0.012
# C=0.1 0.841 +- 0.011
# C=0.5 0.841 +- 0.011
# C=1 0.840 +- 0.012
# C=5 0.841 +- 0.012
# C=10 0.841 +- 0.012

We can run the same procedure with the ‘tqdm’ package for a nicer progress display. Note that this run uses 5 folds instead of 10, which is why the numbers below differ slightly from the previous output.

from sklearn.model_selection import KFold

n_splits = 5

for C in tqdm([0.001, 0.01, 0.1, 0.5, 1, 5, 10]):   
    scores = []

    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=1)  

    for train_idx, val_idx in kfold.split(df_full_train):
        df_train = df_full_train.iloc[train_idx]
        df_val = df_full_train.iloc[val_idx]

        y_train = df_train.churn.values
        y_val = df_val.churn.values

        dv, model = train(df_train, y_train, C=C)
        y_pred = predict(df_val, dv, model)

        auc = roc_auc_score(y_val, y_pred)
        scores.append(auc)

    print('C=%s %.3f +- %.3f' % (C, np.mean(scores), np.std(scores)))

# Output:
#  14%|█▍        | 1/7 [00:01<00:06,  1.03s/it]
# C=0.001 0.825 +- 0.009
# 29%|██▊       | 2/7 [00:02<00:05,  1.07s/it]
# C=0.01 0.840 +- 0.009
# 43%|████▎     | 3/7 [00:03<00:04,  1.06s/it]
# C=0.1 0.840 +- 0.008
# 57%|█████▋    | 4/7 [00:04<00:03,  1.08s/it]
# C=0.5 0.841 +- 0.006
# 71%|███████▏  | 5/7 [00:05<00:02,  1.13s/it]
# C=1 0.841 +- 0.008
# 86%|████████▌ | 6/7 [00:06<00:01,  1.15s/it]
# C=5 0.841 +- 0.007
# 100%|██████████| 7/7 [00:07<00:00,  1.10s/it]
# C=10 0.841 +- 0.008

Afterward, we train the final model on the entire training dataset (df_full_train) and evaluate it on the test dataset.

dv, model = train(df_full_train, df_full_train.churn.values, C=1.0)
y_pred = predict(df_test, dv, model)

auc = roc_auc_score(y_test, y_pred)
auc
# Output: 0.8572386167896259

We observe that the AUC on the test set is slightly higher than the mean AUC from k-fold cross-validation, but not by much. A small difference like this is expected and is not a cause for concern.

In terms of when to use cross-validation versus traditional hold-out validation, for larger datasets, standard hold-out validation is often sufficient. However, if your dataset is smaller or you require insight into the model’s stability and variation across folds, then cross-validation is more appropriate. For larger datasets, consider using fewer splits (e.g., 2 or 3), while for smaller datasets, a higher number of splits (e.g., 10) may be beneficial.
