Setting up the validation framework

Perform the train/validation/test split with Scikit-Learn

Use the train_test_split function from the sklearn.model_selection module to split your data into training, validation, and test sets. Import it first:

from sklearn.model_selection import train_test_split

# in Jupyter or IPython, append ? to display the documentation
train_test_split?

With test_size=0.2, train_test_split divides the dataframe into two parts: 80% for the full train set and 20% for the test set. Setting random_state=1 fixes the shuffle, so the results are reproducible.

df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
len(df_full_train), len(df_test)

# Output: (5634, 1409)
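
As a minimal sketch of what random_state buys us, the snippet below splits a small made-up dataframe twice with the same seed and checks that both calls select the same rows. The toy_df dataframe and its columns are assumptions for illustration only, not the churn dataset used in this post.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy dataframe standing in for the real churn data
toy_df = pd.DataFrame({"customer": range(10), "churn": [0, 1] * 5})

# Two calls with the same seed should pick the same rows in the same order
train_a, test_a = train_test_split(toy_df, test_size=0.2, random_state=1)
train_b, test_b = train_test_split(toy_df, test_size=0.2, random_state=1)

same_split = train_a.index.equals(train_b.index) and test_a.index.equals(test_b.index)
sizes = (len(train_a), len(test_a))
print(same_split, sizes)
```

Without a fixed random_state, each run may shuffle the rows differently, which makes results harder to compare between runs.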

To obtain three sets (train, validation, and test), we split the full train set again. This time we want 60% of the original data in the train set and 20% in the validation set. We cannot reuse test_size=0.2, because the full train set holds only 80% of the data, and 20% of that would be just 16% of the original. Instead, we need the validation share relative to the full train set: 20% / 80% = 25%. Therefore, we use test_size=0.25.

df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)
len(df_train), len(df_val)

# Output: (4225, 1409)
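
The arithmetic behind test_size=0.25 can be sketched in a few lines. The variable names here are illustrative assumptions; only the 20%/20%/60% targets come from the text above.

```python
import math

# Shares of the ORIGINAL dataset we want in each part
test_frac = 0.2
val_frac = 0.2

# Share of the data that remains after the test set is removed
full_train_frac = 1 - test_frac

# Validation share expressed relative to the full train set
val_size_relative = val_frac / full_train_frac

# Sanity check: 25% of the remaining 80% is 20% of the original data
val_share_of_total = val_size_relative * full_train_frac
print(round(val_size_relative, 4), round(val_share_of_total, 4))
```

The same formula works for other proportions: for a 70/15/15 split, the second call would need test_size of 0.15 / 0.85.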

The validation set and the test set now have the same size.

len(df_train), len(df_val), len(df_test)

# Output: (4225, 1409, 1409)

The train_test_split function shuffles the rows before splitting, so the index of each resulting dataframe is no longer sequential. To reset it, use the reset_index method with drop=True, which discards the old index instead of adding it as a new column:

df_full_train = df_full_train.reset_index(drop=True)
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
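
To see what drop=True changes, here is a small sketch on a made-up dataframe whose index is out of order, much like the result of a shuffled split. The shuffled dataframe and its tenure column are hypothetical.

```python
import pandas as pd

# Hypothetical frame with a non-sequential index, as after a shuffled split
shuffled = pd.DataFrame({"tenure": [12, 3, 40]}, index=[5, 0, 2])

# drop=True discards the old index and numbers the rows 0..n-1
clean = shuffled.reset_index(drop=True)

# Without drop=True the old index is kept as a new column named "index"
kept = shuffled.reset_index()

print(list(clean.index), list(clean.columns))
print(list(kept.columns))
```

Leaving out drop=True would add an extra "index" column to every dataframe, which could later be mistaken for a feature.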

Now we extract the target variable y (churn) from each of the four datasets:

y_full_train = df_full_train.churn.values
y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_test.churn.values
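
The .values attribute returns the column as a NumPy array rather than a pandas Series. The sketch below illustrates this on a made-up dataframe (toy and its contents are assumptions) and shows to_numpy(), which the pandas documentation suggests as an alternative accessor.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one of the split dataframes
toy = pd.DataFrame({"tenure": [1, 24, 60], "churn": [1, 0, 0]})

# .values hands back the underlying data as a NumPy array
y = toy.churn.values
is_array = isinstance(y, np.ndarray)

# to_numpy() gives the same data; shown here only for comparison
y_alt = toy["churn"].to_numpy()
same_values = np.array_equal(y, y_alt)

print(is_array, same_values, y.shape)
```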

To prevent accidental use of the churn variable as a feature when building a model, we remove it from our dataframes. The following statements delete the churn column from each of the four datasets:

del df_full_train['churn']
del df_train['churn']
del df_val['churn']
del df_test['churn']

After these operations, the churn column is gone from all four dataframes, so you can proceed with building your model without the risk of accidentally using the target as a feature.
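
As an aside, extracting the target and deleting its column can also be done in one step with the DataFrame.pop method. This is an alternative to the two-step approach used in this post, sketched on a made-up dataframe (toy is hypothetical).

```python
import pandas as pd

# Hypothetical stand-in for one of the split dataframes
toy = pd.DataFrame({"tenure": [1, 24, 60], "churn": [1, 0, 0]})

# pop returns the column as a Series and removes it from the dataframe in place
y = toy.pop("churn").values

has_target = "churn" in toy.columns
print(has_target, y.tolist())
```

Either way, the check "churn" in toy.columns is a quick way to confirm that the target can no longer be picked up as a feature.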

ML Zoomcamp 2023 – Machine Learning for Classification – Part 3
