Tuning the model

This article is about finding the best regularization parameter for our linear regression model. We saw that the parameter r affects the quality of the model, so now we look for the best value of r.
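
The train_linear_regression_reg() function used below comes from an earlier part and is not shown here. As a rough sketch of what such a function might do, assume it solves the regularized normal equation (XᵀX + rI)w = Xᵀy with a bias column prepended. The name ridge_fit_sketch and the toy data are hypothetical and only illustrate why a small r stabilizes the solution when feature columns are duplicated.

```python
import numpy as np

def ridge_fit_sketch(X, y, r=0.0):
    """Hypothetical stand-in: solve (X^T X + r*I) w = X^T y with a bias column."""
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])             # prepend the bias column
    XTX = X.T.dot(X) + r * np.eye(X.shape[1])  # r is added to the diagonal
    w_full = np.linalg.solve(XTX, X.T.dot(y))
    return w_full[0], w_full[1:]

# Toy data with a duplicated feature column, which makes X^T X singular
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, x])
y = 1.0 + 2.0 * x

# Compare how well-conditioned the matrix is without and with regularization
X1 = np.column_stack([np.ones(len(x)), X])
cond_plain = np.linalg.cond(X1.T.dot(X1))
cond_reg = np.linalg.cond(X1.T.dot(X1) + 0.001 * np.eye(3))
print("condition number without / with r:", cond_plain, cond_reg)

# With r > 0 the system can be solved reliably
w0, w = ridge_fit_sketch(X, y, r=0.001)
print("bias:", w0, "weights:", w)  # the two weights sum to roughly the true slope of 2
```

The enormous condition number without r is the kind of situation in which the bias term and the weights can blow up, as in the r=0 run in the output below.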

for r in [0.0, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10]:
    X_train = prepare_X(df_train)
    w0, w = train_linear_regression_reg(X_train, y_train, r=r)

    X_val = prepare_X(df_val)
    y_pred = w0 + X_val.dot(w)

    score = rmse(y_val, y_pred)
    
    print("reg parameter: ",r, "bias term: ",w0, "rmse: ",score)

# Output:
# reg parameter:  0.0 bias term:  2.6643718859809136e+16 rmse:  292.5054633101075
# reg parameter:  1e-05 bias term:  6.099552653959844 rmse:  0.456883648941604
# reg parameter:  0.0001 bias term:  6.8929434420779865 rmse:  0.4568834231183306
# reg parameter:  0.001 bias term:  6.900647539490208 rmse:  0.4568807317131709
# reg parameter:  0.01 bias term:  6.885494975398419 rmse:  0.45685446091134857
# reg parameter:  0.1 bias term:  6.7419125296313265 rmse:  0.45665036676484794
# reg parameter:  1 bias term:  5.908895080537622 rmse:  0.4569676895885577
# reg parameter:  10 bias term:  4.234139685166065 rmse:  0.47376448953457045

As you can see, r=0 makes both the bias term and the RMSE huge. For r between 0.00001 and 0.1 the RMSE barely changes (it stays at about 0.457), so a small value such as 0.001 is a good choice for r.
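
Instead of reading the scores off the printout, the comparison can also be done in code. The following is a minimal sketch: the (r, RMSE) pairs are copied from the output above, and the variable names are only illustrative.

```python
# (r, rmse) pairs copied from the printed output above
results = [
    (0.0, 292.5054633101075),
    (1e-05, 0.456883648941604),
    (0.0001, 0.4568834231183306),
    (0.001, 0.4568807317131709),
    (0.01, 0.45685446091134857),
    (0.1, 0.45665036676484794),
    (1, 0.4569676895885577),
    (10, 0.47376448953457045),
]

# The pair with the smallest RMSE
best_r, best_rmse = min(results, key=lambda pair: pair[1])

# How much worse is the value chosen in this article (r = 0.001)?
rmse_at_chosen = dict(results)[0.001]
gap = rmse_at_chosen - best_rmse

print("lowest rmse at r =", best_r)
print("gap between r=0.001 and the lowest rmse:", gap)
```

Strictly speaking, the lowest score in this run is at r=0.1, but the gap to r=0.001 is only about 0.0002, so either value is a defensible choice.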

r = 0.001
X_train = prepare_X(df_train)
w0, w = train_linear_regression_reg(X_train, y_train, r=r)

X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)

score = rmse(y_val, y_pred)
   
print("rmse: ",score)
# Output: rmse:  0.4568807317131709

Using the model

Above we found the best regularization parameter for the linear regression. Now we’ll train the model again and use it. So far, we trained our model on the training dataset and evaluated it on the validation dataset, using the RMSE to measure its performance.

What we want to do now is train our final model on both the training dataset and the validation dataset. We call this the full train. After that we make the final evaluation on the test dataset to make sure that our model works fine and to check the RMSE. It shouldn’t be too different from what we saw on the validation dataset.

Combining datasets

The first step is getting our data: we need to combine df_train and df_val into one dataset. We can use the Pandas concat() function, which takes a list of dataframes and concatenates them.

df_full_train = pd.concat([df_train, df_val])

We also need to concatenate y_train and y_val to get y_full_train. These are NumPy arrays, so this time we use the concatenate() function of the NumPy library.

y_full_train = np.concatenate([y_train, y_val])
y_full_train
# Output: array([10.40262514, 10.06032035,  7.60140233, ..., 10.3837818 ,
#        10.3663092 , 10.37101938])

Resetting index

When combining two dataframes, each one keeps its original index, so the index of the result may not be sequential. We can fix this with the reset_index() function we already know.

df_full_train = df_full_train.reset_index(drop=True)
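
To see why the reset is needed, here is a small sketch with two hypothetical toy dataframes (df_a and df_b are made up for illustration). It also shows that passing ignore_index=True to pd.concat() gives the same result in one step.

```python
import pandas as pd

# Two toy dataframes with shuffled, non-sequential index labels
df_a = pd.DataFrame({"engine_hp": [310.0, 170.0]}, index=[5, 2])
df_b = pd.DataFrame({"engine_hp": [165.0]}, index=[7])

# concat keeps the original index labels of both inputs
df_combined = pd.concat([df_a, df_b])
print(list(df_combined.index))  # [5, 2, 7]

# Option 1: reset the index afterwards, dropping the old labels
df_reset = df_combined.reset_index(drop=True)

# Option 2: let concat build a fresh 0..n-1 index directly
df_direct = pd.concat([df_a, df_b], ignore_index=True)

print(list(df_reset.index))     # [0, 1, 2]
```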

Getting feature matrix X

Now we again have a single coherent dataset for training, and we can prepare it the same way as before. The prepare_X() function still works fine.

X_full_train = prepare_X(df_full_train) 
X_full_train
# Output:
# array([[310.,   8.,  18., ...,   0.,   0.,   0.],
#        [170.,   4.,  32., ...,   0.,   0.,   0.],
#        [165.,   6.,  15., ...,   0.,   0.,   0.],
#        ...,
#        [295.,   8.,  19., ...,   0.,   0.,   0.],
#        [283.,   6.,  25., ...,   0.,   0.,   0.],
#        [182.,   4.,  32., ...,   0.,   0.,   0.]])

Train the final model

The next step is to train the final model on the combined dataset. We use the regularized train_linear_regression_reg() function to get the bias term w₀ and the weight vector w.

w0, w = train_linear_regression_reg(X_full_train, y_full_train, r=0.001)
w0, w
# Output:
# (6.78312259616272,
#  array([ 1.46535912e-03,  1.06314995e-01, -3.46567859e-02,  1.34536223e-02,
#         -5.29907921e-05, -1.00251712e-01, -1.12652502e+00, -1.30755218e+00,
#         -9.91483092e-01, -2.98403708e-02,  1.71081845e-01,  8.84475226e-03,
#         -1.21180790e-01, -1.07844220e-01, -4.76163384e-01,  6.26604228e-02,
#         -3.21665624e-01, -5.42738768e-01,  4.29611924e-02,  1.16474177e+00,
#          9.88804373e-01,  1.20687968e+00,  2.79900016e+00,  6.21654168e-01,
#          1.78916282e+00,  1.63554122e+00,  1.73947001e+00,  1.61859437e+00,
#         -8.10459522e-02,  3.06406210e-02, -3.41386920e-02, -2.42013404e-02,
#          3.75251434e-02,  2.33450124e+00,  2.22007067e+00,  2.22820812e+00,
#          4.14224325e-02,  4.99735446e-02,  2.45833450e-01,  3.81450761e-01,
#         -1.16344690e-01]))

Applying model to test data

Now comes the big moment for the final model: it must pass the final test. For this purpose we use the test data, again prepared with the prepare_X() function. Then we apply the model to the test data and calculate the RMSE.

X_test = prepare_X(df_test)
y_pred = w0 + X_test.dot(w)

score = rmse(y_test, y_pred)
   
print("rmse: ",score)
# Output: rmse:  0.5094518818513973

RMSE_test = 0.5094518818513973 is not far from RMSE_val = 0.4568807317131709. That suggests the model generalizes quite well and didn’t get its validation score by chance. Now we have our final model and we can use it to predict the price of an unseen car, that is, a car the model hasn’t seen during training.
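
To put "not far" into numbers, the gap between the two scores can be computed directly. This is a small sketch using only the two RMSE values reported above; the variable names are illustrative.

```python
rmse_val = 0.4568807317131709   # validation RMSE reported above
rmse_test = 0.5094518818513973  # test RMSE reported above

abs_gap = rmse_test - rmse_val
rel_gap = abs_gap / rmse_val

print(f"absolute gap: {abs_gap:.4f}, relative gap: {rel_gap:.1%}")
# absolute gap: 0.0526, relative gap: 11.5%
```

Whether a gap of this size is acceptable depends on the application. Here it is treated as close enough.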

Using the model

Using the model means:

Extracting all the features (getting feature vector of the car)
Applying our final model to this feature vector & predicting the price

Feature Extraction

For this step we can take any car from our test dataset and pretend it’s a new car. Let’s just take one car.

df_test.iloc[20]
# Output:
# make                                           saab
# model                                   9-3_griffin
# year                                           2012
# engine_fuel_type     premium_unleaded_(recommended)
# engine_hp                                     220.0
# engine_cylinders                                4.0
# transmission_type                            manual
# driven_wheels                       all_wheel_drive
# number_of_doors                                 4.0
# market_category                              luxury
# vehicle_size                                compact
# vehicle_style                                 wagon
# highway_mpg                                      30
# city_mpg                                         20
# popularity                                      376
# Name: 20, dtype: object

In practice we usually don’t get a dataframe here. Instead, it could be a Python dictionary with all the information about the car. Imagine a website or an app where people enter all the values. The website then sends a request with this information (as a dictionary) to the model, and the model replies with the predicted price.
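
The request/response idea can be sketched as a single function that takes a dictionary and returns a price. Everything here is an assumption made for illustration: prepare_X_sketch is a simplified stand-in for the article’s prepare_X() that uses only five numeric columns, the toy weights are not the trained model, and the target is assumed to be log1p(price).

```python
import numpy as np
import pandas as pd

# Hypothetical, simplified stand-in for prepare_X(): numeric columns only
NUMERIC = ["engine_hp", "engine_cylinders", "highway_mpg", "city_mpg", "popularity"]

def prepare_X_sketch(df):
    return df[NUMERIC].fillna(0).values

def predict_price(car, w0, w, prepare=prepare_X_sketch):
    """Turn one request (a dict) into a predicted price in dollars."""
    df_one = pd.DataFrame([car])      # single-row dataframe
    X_one = prepare(df_one)           # feature matrix of shape (1, n_features)
    y_log = w0 + X_one.dot(w)         # prediction on the log1p scale
    return float(np.expm1(y_log[0]))  # undo log1p to get dollars

# Toy weights, NOT the trained model from this article
w0_toy = 9.0
w_toy = np.array([0.004, 0.0, 0.0, 0.0, 0.0])

car = {
    "make": "saab",
    "engine_hp": 220.0,
    "engine_cylinders": 4.0,
    "highway_mpg": 30,
    "city_mpg": 20,
    "popularity": 376,
}

price = predict_price(car, w0_toy, w_toy)
print(round(price, 2))
```

With the real prepare_X(), w0, and w in place of the toy versions, the same function would serve a single request from a website or an app.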

For this example we turn this data of our car into a dictionary.

car = df_test.iloc[20].to_dict()
car
# Output:
# {'make': 'saab',
#  'model': '9-3_griffin',
#  'year': 2012,
#  'engine_fuel_type': 'premium_unleaded_(recommended)',
#  'engine_hp': 220.0,
#  'engine_cylinders': 4.0,
#  'transmission_type': 'manual',
#  'driven_wheels': 'all_wheel_drive',
#  'number_of_doors': 4.0,
#  'market_category': 'luxury',
#  'vehicle_size': 'compact',
#  'vehicle_style': 'wagon',
#  'highway_mpg': 30,
#  'city_mpg': 20,
#  'popularity': 376}

The car is our request. Remember that the prepare_X() function expects a dataframe, so we need to create a dataframe with a single row for our request.

df_small = pd.DataFrame([car])
df_small

	make	model	year	engine_fuel_type	engine_hp	engine_cylinders	transmission_type	driven_wheels	number_of_doors	market_category	vehicle_size	vehicle_style	highway_mpg	city_mpg	popularity
0	saab	9-3_griffin	2012	premium_unleaded_(recommended)	220	4.0	manual	all_wheel_drive	4.0	luxury	compact	wagon	30	20	376

DataFrame of our requested car

We can use this single-row DataFrame as input for the prepare_X() function to get the feature matrix. Because it has only one row, the feature matrix is effectively a feature vector.

X_small = prepare_X(df_small)
X_small
# Output:
# array([[220.,   4.,  30.,  20., 376.,   5.,   0.,   0.,   1.,   0.,   0.,
#               0.,   0.,   0.,   0.,   0.,   1.,   0.,   0.,   0.,   1.,   0.,
#               0.,   0.,   0.,   0.,   1.,   0.,   0.,   0.,   1.,   0.,   0.,
#               1.,   0.,   0.,   0.,   0.,   0.,   0.,   0.]])

Predicting the price

The final step is to apply the final model to our requested car (feature vector) and predict the price.

y_pred = w0 + X_small.dot(w)
# We don't need an array, just its first (and only) item
y_pred = y_pred[0]
y_pred
# Output: 9.954435569951846

9.95 is not yet the price in $, because the model predicts the logarithm of the price. To get the real price we need to undo the logarithm.

np.expm1(y_pred)
# Output: 21044.363844829495

After undoing the logarithm we get the price in $. So the model estimates that a car with these characteristics should cost about $21,044.36.
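
The use of np.expm1() suggests that the target was transformed with np.log1p() during data preparation. That is an assumption here, since the preparation step is not part of this article. Under that assumption, the following sketch shows that the two functions are inverses of each other and reproduces the conversion above. The example price of 20000.0 is made up.

```python
import numpy as np

price = 20000.0            # a made-up example price in dollars
y = np.log1p(price)        # log(1 + price), the assumed training target
restored = np.expm1(y)     # exp(y) - 1, back to dollars

y_pred = 9.954435569951846 # model output for the requested car (from above)
predicted_price = np.expm1(y_pred)

print(restored)
print(predicted_price)
```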

Lastly, to get a sense of the model’s performance, let’s compare the predicted price to the actual price of this car.

np.expm1(y_test[20])
# Output: 34975.0
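
The difference between the two prices can be quantified as follows. This is a small sketch that uses only the two values shown above; the variable names are illustrative.

```python
predicted = 21044.363844829495  # np.expm1(y_pred) from above
actual = 34975.0                # np.expm1(y_test[20]) from above

abs_error = actual - predicted
rel_error = abs_error / actual

print(f"off by ${abs_error:,.2f} ({rel_error:.1%} of the actual price)")
# off by $13,930.64 (39.8% of the actual price)
```

For this particular car the model underestimates the price considerably. A single example says little about overall quality, though. The RMSE on the test set remains the more reliable measure.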

ML Zoomcamp 2023 – Machine Learning for Regression – Part 12
