ML Zoomcamp 2023 – Machine Learning for Classification – Part 3

Setting up the validation framework

Perform the train/validation/test split with Scikit-Learn

You can use the train_test_split function from the sklearn.model_selection module to split your data into training, validation, and test sets. Import it first:

from sklearn.model_selection import train_test_split

# in Jupyter/IPython, append ? to see the documentation
train_test_split?

With test_size=0.2, the train_test_split function divides the dataframe into two parts: 80% for the full train set and 20% for the test set. Setting random_state=1 fixes the shuffle, so the split is reproducible.

df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
len(df_full_train), len(df_test)

# Output: (5634, 1409)
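As a minimal sketch of what random_state provides, the snippet below splits a small made-up dataframe twice with the same seed and checks that both calls select the same rows. The toy dataframe and its column x are assumptions for illustration only; they are not part of the churn dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real dataframe
toy = pd.DataFrame({"x": range(10)})

# Two independent calls with the same seed
first_train, first_test = train_test_split(toy, test_size=0.2, random_state=1)
second_train, second_test = train_test_split(toy, test_size=0.2, random_state=1)

# Same seed, same rows in the same order
same_split = first_test.index.equals(second_test.index)
sizes = (len(first_train), len(first_test))
print(same_split, sizes)
```

Changing the seed, or leaving random_state unset, would generally select a different set of rows on each run.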

To obtain three sets (train, validation, and test), we split the full train set again. This time, we want the train set to hold 60% of the original data and the validation set 20%. We can't reuse test_size=0.2, because the full train set is only 80% of the data, and 20% of 80% would be just 16% of the original. Instead, we need the fraction of the full train set that equals 20% of the original data: 20% / 80% = 25%. Therefore, for the validation set, we use test_size=0.25.

df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)
len(df_train), len(df_val)

# Output: (4225, 1409)

The validation set and the test set now have the same size.

len(df_train), len(df_val), len(df_test)

# Output: (4225, 1409, 1409)
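The arithmetic behind test_size=0.25 can be sketched as a small helper. The function name relative_fraction is hypothetical and only illustrates the calculation; it is not part of Scikit-Learn.

```python
def relative_fraction(target_share, remaining_share):
    """Fraction of the remaining data that equals target_share of the whole."""
    return target_share / remaining_share

# The validation set should be 20% of all data; the full train set is 80% of it.
val_test_size = relative_fraction(target_share=0.2, remaining_share=0.8)
print(val_test_size)  # 0.25

# Shares of the original data after the second split
val_share = 0.8 * val_test_size          # about 0.2
train_share = 0.8 * (1 - val_test_size)  # about 0.6
```

The same helper works for other layouts: for a 70/15/15 split, for example, the second call would use 0.15 / 0.85 as its test_size.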

By default, train_test_split shuffles the rows before splitting, so the index of each resulting dataframe is no longer sequential. To reset it, use the reset_index method with drop=True, which discards the old index instead of adding it as a new column:

df_full_train = df_full_train.reset_index(drop=True)
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
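To see the effect of reset_index on a small scale, here is a sketch on a made-up dataframe (the toy data and column x are assumptions for illustration). It records the index before and after the reset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real dataframe
toy = pd.DataFrame({"x": range(10)})
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=1)

# After the split, the index still carries the original row labels
index_before = list(toy_train.index)

# drop=True discards the old labels instead of keeping them as a column
toy_train = toy_train.reset_index(drop=True)
index_after = list(toy_train.index)

print(index_before)
print(index_after)
```

Without drop=True, pandas would keep the old labels in a new column named index, which could then leak into the feature set.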

Now we need to extract our target variable y, which is the churn column. Here's how to do it for each of the four datasets:

y_full_train = df_full_train.churn.values
y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_test.churn.values

To prevent accidentally using the “churn” variable as a feature when building a model, we remove it from our dataframes. Here’s how to delete the “churn” column from each of the four datasets:

del df_full_train['churn']
del df_train['churn']
del df_val['churn']
del df_test['churn']

After these operations, the “churn” column is no longer present in any of the dataframes, so the model cannot be trained on it by accident.
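As an alternative sketch, pandas also offers DataFrame.pop, which returns a column and removes it from the dataframe in one step. This is not the approach used above, just an equivalent shortcut, shown here on a made-up dataframe whose tenure values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data; only the column name churn mirrors the real dataset
toy = pd.DataFrame({"tenure": [1, 24, 60], "churn": [1, 0, 0]})

# pop returns the column as a Series and deletes it from the dataframe
y_toy = toy.pop("churn").values

print(y_toy)
print(list(toy.columns))
```

Either way, the target ends up in a NumPy array and the feature dataframe no longer contains it.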
