Setting up the validation framework
Perform the train/validation/test split with Scikit-Learn
You can use the train_test_split function from the sklearn.model_selection module to split your data. Each call splits the data into two parts, so producing training, validation, and test sets takes two calls. First, import the function:
from sklearn.model_selection import train_test_split
# In Jupyter or IPython, appending ? shows the documentation
train_test_split?
Here, train_test_split divides the dataframe into two parts: 80% for the full train set and 20% for the test set. Setting random_state=1 fixes the shuffle, so the split is reproducible.
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
len(df_full_train), len(df_test)
# Output: (5634, 1409)
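To illustrate what random_state does, here is a minimal sketch on toy data (the list named data is made up and only stands in for the dataframe). Calling train_test_split twice with the same random_state should give identical splits:

```python
from sklearn.model_selection import train_test_split

data = list(range(10))  # toy data, stands in for the dataframe

# Same random_state: the two calls produce identical splits.
train_a, test_a = train_test_split(data, test_size=0.2, random_state=1)
train_b, test_b = train_test_split(data, test_size=0.2, random_state=1)

print(train_a == train_b and test_a == test_b)
# Output: True
print(len(train_a), len(test_a))
# Output: 8 2
```

Without a fixed random_state, each run may shuffle the records differently, so the sets (and any metrics computed on them) can change between runs.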
To obtain three sets (train, validation, and test), we split the full train set again. This time we want 60% of the original data for the train set and 20% for the validation set. We can't use test_size=0.2 as before, because the full train set holds only 80% of the data. The validation set should be 20% of the original data, and 20% out of 80% is 20/80 = 25%. Therefore, we use test_size=0.25 for the second split.
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)
len(df_train), len(df_val)
# Output: (4225, 1409)
The validation set and the test set are now the same size.
len(df_train), len(df_val), len(df_test)
# Output: (4225, 1409, 1409)
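The 25% used for the second split follows from simple arithmetic, sketched below. The helper name relative_val_size is hypothetical, and the total row count is inferred from the sizes printed above (5634 + 1409). The size check assumes the held-out part is rounded up, which is consistent with the outputs shown.

```python
import math

def relative_val_size(val_share, test_share):
    """Fraction of the full train set to hold out for validation.

    val_share and test_share are fractions of the ORIGINAL dataset.
    (Hypothetical helper, not part of Scikit-Learn.)
    """
    full_train_share = 1 - test_share
    return val_share / full_train_share

val_size = relative_val_size(val_share=0.2, test_share=0.2)
print(val_size)
# Output: 0.25

# Sanity check against the sizes printed above,
# assuming the held-out part is rounded up.
n_total = 5634 + 1409
n_test = math.ceil(0.2 * n_total)
n_full_train = n_total - n_test
n_val = math.ceil(val_size * n_full_train)
n_train = n_full_train - n_val
print(n_train, n_val, n_test)
# Output: 4225 1409 1409
```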
train_test_split shuffles the records before splitting, so the index of each resulting dataframe is shuffled rather than sequential. To reset it, use the reset_index method with drop=True, which discards the old index instead of adding it as a new column:
df_full_train = df_full_train.reset_index(drop=True)
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
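To see what reset_index(drop=True) changes, here is a small sketch on a made-up dataframe (the name df_demo, the column name, and the values are purely illustrative, not taken from the churn dataset). The rows keep their order, and only the index labels are replaced:

```python
import pandas as pd

# Toy dataframe whose index is out of order, as after a shuffled split.
df_demo = pd.DataFrame({'value': [12, 5, 40]}, index=[2, 0, 1])

# drop=True discards the old index instead of keeping it as a column.
df_reset = df_demo.reset_index(drop=True)

print(df_demo.index.tolist())
# Output: [2, 0, 1]
print(df_reset.index.tolist())
# Output: [0, 1, 2]
print(df_reset.columns.tolist())
# Output: ['value']
```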
Next, we extract the target variable y (churn) as a NumPy array for each of the four datasets:
y_full_train = df_full_train.churn.values
y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_test.churn.values
To prevent accidental use of the churn variable as a feature when building a model, we remove it from the dataframes. Here's how to delete the churn column from each of the four datasets:
del df_full_train['churn']
del df_train['churn']
del df_val['churn']
del df_test['churn']
After these operations, the churn column is no longer in any of the dataframes, so a model cannot accidentally use the target as a feature.
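As an alternative to extracting and deleting in two separate steps, the pandas DataFrame.pop method removes a column and returns it in one call. A minimal sketch on a made-up dataframe (df_demo, y_demo, and the values are illustrative only):

```python
import pandas as pd

df_demo = pd.DataFrame({'value': [12, 5, 40], 'churn': [0, 1, 0]})

# pop removes the column from the dataframe and returns it as a Series.
y_demo = df_demo.pop('churn').values

print(y_demo.tolist())
# Output: [0, 1, 0]
print(df_demo.columns.tolist())
# Output: ['value']
```

Either approach leaves the dataframe without the target column. The two-step version keeps the extraction and the deletion visible as separate actions, which some readers find easier to follow.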