ML Zoomcamp 2023 – Machine Learning for Regression – Part 3

Setting up the validation framework

To validate the model, we split the dataset into three parts: train, validation, and test (60%/20%/20%). Why this is useful was covered in an earlier blog post. We train the model on the training set, check how well it works on the validation set, and keep the test set for the very end, touching it only rarely to confirm that the final model performs well. For each of the three parts, we create a feature matrix X and a target vector y (X_train, y_train, X_val, y_val, X_test, y_test). The first step is to work out how many records 20% corresponds to.

# Returns the number of records of the whole dataset
len(df)
# Output: 11914

# Calculate 20% of the whole dataset
int(len(df) * 0.2)
# Output: 2382

With the preliminary numbers from the last code snippet, we can complete the calculation and slice the dataframe into the three parts.

n = len(df)
n_val = n_test = int(n * 0.2)
n_train = n - n_val - n_test
n, n_val + n_test + n_train
# Output: (11914, 11914)

# Sizes of our three dataframes
n_val, n_test, n_train
# Output: (2382, 2382, 7150)

df_train = df.iloc[:n_train]
df_val = df.iloc[n_train:n_train + n_val]
df_test = df.iloc[n_train + n_val:]

You might think this concludes the split, but there is one crucial problem: the split is sequential. If the records come in some order, the three parts will not be representative; in our dataset, for example, all BMWs could end up in just one of them. We therefore need to shuffle the data before splitting, which in general is always a good idea.

idx = np.arange(n)
idx
# Output: array([    0,     1,     2, ..., 11911, 11912, 11913])

# Fix the random seed so the shuffle is reproducible
np.random.seed(2)
np.random.shuffle(idx)
idx
# Output: array([11545,  7488,   263, ...,  3119,  1696,  9053])
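
As an aside, np.random.permutation(n) would achieve the same result; it builds and shuffles the index array in one step and returns a shuffled copy instead of mutating an existing array in place.

# Equivalent alternative: returns a shuffled copy of arange(n)
idx = np.random.permutation(n)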

Using this shuffled index, we can create the shuffled datasets for training, validation, and testing.

# Create the shuffled datasets with the correct sizes
df_train = df.iloc[idx[:n_train]]
df_val = df.iloc[idx[n_train:n_train + n_val]]
df_test = df.iloc[idx[n_train + n_val:]]
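
A quick sanity check (my addition, not part of the original notebook) confirms that the three parts are disjoint and together cover every row of the original dataframe:

# Before resetting the index, every original row index should appear
# in exactly one split (assuming df still has its default 0..n-1 index)
assert len(df_train) + len(df_val) + len(df_test) == n
assert set(df_train.index) | set(df_val.index) | set(df_test.index) == set(range(n))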

Since the rows were drawn in shuffled order, the indices of the new dataframes are no longer in any meaningful order, so we reset the index and drop the old index column.

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
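
For reference, scikit-learn's train_test_split shuffles and splits in one call. The sketch below (assuming scikit-learn is installed; the seed of 2 is my choice, not part of the original notebook) reproduces the same 60/20/20 layout with two calls:

from sklearn.model_selection import train_test_split

# Split off 20% for testing, then take 25% of the remaining 80%
# (i.e. 20% of the full dataset) as the validation set
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=2)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=2)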

As I mentioned in the last blog article, we apply the log1p transformation to the price column (msrp) to help the model perform well.

y_train = np.log1p(df_train.msrp.values)
y_val = np.log1p(df_val.msrp.values)
y_test = np.log1p(df_test.msrp.values)
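
Since the targets are now on the log scale, any predictions the model produces will be too; np.expm1 is the exact inverse of np.log1p and maps them back to dollar prices. A minimal round-trip check:

# expm1 undoes log1p (up to floating-point rounding), so this
# recovers the original msrp values of the first training rows
np.expm1(y_train[:3])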

There is one final but very important step: we remove the msrp column from the dataframes (df_train, df_val, df_test) to make sure we don't accidentally use the target as a feature during training.

del df_train['msrp']
del df_val['msrp']
del df_test['msrp']
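
As a last check (my addition), we can verify that the target column really is gone from all three feature dataframes:

# None of the feature dataframes should still contain the target
'msrp' in df_train.columns, 'msrp' in df_val.columns, 'msrp' in df_test.columns
# Output: (False, False, False)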
