Tuning the model
This article is about finding the best regularization parameter for our linear regression model. We have seen that the parameter r affects the quality of the model, so now we try several values of r and compare the resulting RMSE on the validation dataset.
for r in [0.0, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10]:
    X_train = prepare_X(df_train)
    w0, w = train_linear_regression_reg(X_train, y_train, r=r)
    X_val = prepare_X(df_val)
    y_pred = w0 + X_val.dot(w)
    score = rmse(y_val, y_pred)
    print("reg parameter: ",r, "bias term: ",w0, "rmse: ",score)
# Output:
# reg parameter: 0.0 bias term: 2.6643718859809136e+16 rmse: 292.5054633101075
# reg parameter: 1e-05 bias term: 6.099552653959844 rmse: 0.456883648941604
# reg parameter: 0.0001 bias term: 6.8929434420779865 rmse: 0.4568834231183306
# reg parameter: 0.001 bias term: 6.900647539490208 rmse: 0.4568807317131709
# reg parameter: 0.01 bias term: 6.885494975398419 rmse: 0.45685446091134857
# reg parameter: 0.1 bias term: 6.7419125296313265 rmse: 0.45665036676484794
# reg parameter: 1 bias term: 5.908895080537622 rmse: 0.4569676895885577
# reg parameter: 10 bias term: 4.234139685166065 rmse: 0.47376448953457045
The output shows that r = 0 makes both the bias term and the RMSE explode, while any small positive r brings them back to sensible values.
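One plausible explanation for the blow-up at r = 0, assuming train_linear_regression_reg() solves the normal equations and adds r to the diagonal of XᵀX (its implementation is not shown here), is that some feature columns are linearly dependent. The sketch below uses a made-up toy matrix, not the car data, to show how a small r improves the conditioning of XᵀX.

```python
import numpy as np

# Toy feature matrix (made up for illustration): the third column
# duplicates the second, so the columns are linearly dependent.
X = np.array([
    [1.0, 2.0, 2.0],
    [1.0, 3.0, 3.0],
    [1.0, 5.0, 5.0],
    [1.0, 7.0, 7.0],
])

XTX = X.T.dot(X)

# Regularization as assumed above: add r to the diagonal of the Gram matrix.
r = 0.001
XTX_reg = XTX + r * np.eye(XTX.shape[0])

# A huge (or infinite) condition number means the matrix is numerically
# singular, so inverting it produces unreliable, possibly enormous weights.
cond_plain = np.linalg.cond(XTX)
cond_reg = np.linalg.cond(XTX_reg)

print("condition number without regularization:", cond_plain)
print("condition number with r = 0.001:", cond_reg)
```

The regularized matrix has a far smaller condition number, which is consistent with the stable bias terms seen for every r greater than zero.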
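The RMSE barely changes for r between 0.00001 and 0.1 and only gets clearly worse at r = 10, so the exact choice matters little in this range. We pick r = 0.001.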
r = 0.001
X_train = prepare_X(df_train)
w0, w = train_linear_regression_reg(X_train, y_train, r=r)
X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)
score = rmse(y_val, y_pred)
print("rmse: ",score)
# Output: rmse: 0.4568807317131709
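Instead of reading the best value off the printed table, the scores can be collected and compared in code. The following is a minimal sketch on synthetic data: the train_linear_regression_reg() and rmse() definitions are assumptions about how the helpers from earlier articles might look, and the data is randomly generated rather than the car dataset.

```python
import numpy as np

def train_linear_regression_reg(X, y, r=0.0):
    # Assumed implementation: normal equations with r added to the diagonal.
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])
    XTX = X.T.dot(X) + r * np.eye(X.shape[1])
    w_full = np.linalg.solve(XTX, X.T.dot(y))
    return w_full[0], w_full[1:]

def rmse(y, y_pred):
    # Root mean squared error between targets and predictions.
    return np.sqrt(((y - y_pred) ** 2).mean())

# Synthetic stand-in for the training and validation data.
rng = np.random.default_rng(42)
true_w = np.array([0.5, -0.2, 0.1])

def make_data(n):
    X = rng.normal(size=(n, 3))
    y = 1.0 + X.dot(true_w) + rng.normal(scale=0.1, size=n)
    return X, y

X_train, y_train = make_data(200)
X_val, y_val = make_data(50)

# Record the validation RMSE for every candidate r.
scores = {}
for r in [0.0, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10]:
    w0, w = train_linear_regression_reg(X_train, y_train, r=r)
    y_pred = w0 + X_val.dot(w)
    scores[r] = rmse(y_val, y_pred)

# Pick the r with the lowest validation RMSE.
best_r = min(scores, key=scores.get)
print("best r:", best_r, "rmse:", scores[best_r])
```

When several values score almost identically, as in the table above, the minimum is decided by tiny differences, so treating the whole flat region as equally good is a reasonable reading.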
Using the model
In the previous article we found the best regularization parameter for our linear regression model. In this one we train the model again and use it. So far we have trained the model on the training dataset and evaluated it on the validation dataset by calculating the RMSE.
Now we want to train the final model on the training and validation datasets combined. We call this the full train. After that we make the final evaluation on the test dataset to check that the model works and to see its RMSE, which shouldn't be too different from what we saw on the validation dataset.
Combining datasets
The first step is to get our data, so we need to combine df_train and df_val into one dataset. We can use the Pandas concat() function, which takes a list of dataframes and concatenates them.
df_full_train = pd.concat([df_train, df_val])
We also need to concatenate y_train and y_val to get y_full_train. These are NumPy arrays, so this time we use NumPy's concatenate() function.
y_full_train = np.concatenate([y_train, y_val])
y_full_train
# Output: array([10.40262514, 10.06032035, 7.60140233, ..., 10.3837818 ,
#         10.3663092 , 10.37101938])
Resetting index
When two dataframes are combined, each keeps its original index, so the index of the result is usually not sequential and may contain duplicates. We can fix this with the reset_index() function we already know.
df_full_train = df_full_train.reset_index(drop=True)
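To make the index issue concrete, here is a small sketch with two made-up dataframes (the values are for illustration only). It also shows that pd.concat() accepts ignore_index=True, which produces the same sequential index in a single step.

```python
import pandas as pd

# Two tiny made-up dataframes, each with its own index starting at 0.
df_a = pd.DataFrame({"engine_hp": [310.0, 170.0]})
df_b = pd.DataFrame({"engine_hp": [165.0, 295.0]})

# Plain concatenation keeps the original index labels.
combined = pd.concat([df_a, df_b])
print(list(combined.index))  # [0, 1, 0, 1]

# reset_index(drop=True) replaces them with 0..n-1 and discards the old index.
reset = combined.reset_index(drop=True)
print(list(reset.index))  # [0, 1, 2, 3]

# Alternative: let concat() build the new index directly.
direct = pd.concat([df_a, df_b], ignore_index=True)
print(list(direct.index))  # [0, 1, 2, 3]
```

Both approaches leave the rows in the same order, which matters here because y_full_train is concatenated in that order too.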
Getting feature matrix X
Now we have a single coherent dataset for training again, and we can prepare it as before. The prepare_X() function works on it unchanged.
X_full_train = prepare_X(df_full_train)
X_full_train
# Output:
# array([[310., 8., 18., ..., 0., 0., 0.],
#        [170., 4., 32., ..., 0., 0., 0.],
#        [165., 6., 15., ..., 0., 0., 0.],
#        ...,
#        [295., 8., 19., ..., 0., 0., 0.],
#        [283., 6., 25., ..., 0., 0., 0.],
#        [182., 4., 32., ..., 0., 0., 0.]])
Train the final model
The next step is to train the final model on the combined dataset. We use the train_linear_regression_reg() function again, with r=0.001, to get the bias term w0 and the weight vector w.
w0, w = train_linear_regression_reg(X_full_train, y_full_train, r=0.001)
w0, w
# Output:
# (6.78312259616272,
#  array([ 1.46535912e-03, 1.06314995e-01, -3.46567859e-02, 1.34536223e-02,
#         -5.29907921e-05, -1.00251712e-01, -1.12652502e+00, -1.30755218e+00,
#         -9.91483092e-01, -2.98403708e-02, 1.71081845e-01, 8.84475226e-03,
#         -1.21180790e-01, -1.07844220e-01, -4.76163384e-01, 6.26604228e-02,
#         -3.21665624e-01, -5.42738768e-01, 4.29611924e-02, 1.16474177e+00,
#         9.88804373e-01, 1.20687968e+00, 2.79900016e+00, 6.21654168e-01,
#         1.78916282e+00, 1.63554122e+00, 1.73947001e+00, 1.61859437e+00,
#         -8.10459522e-02, 3.06406210e-02, -3.41386920e-02, -2.42013404e-02,
#         3.75251434e-02, 2.33450124e+00, 2.22007067e+00, 2.22820812e+00,
#         4.14224325e-02, 4.99735446e-02, 2.45833450e-01, 3.81450761e-01,
#         -1.16344690e-01]))
Applying model to test data
Now comes the final test for our model. We prepare the test data with the prepare_X() function, apply the model to it, and calculate the RMSE.
X_test = prepare_X(df_test)
y_pred = w0 + X_test.dot(w)
score = rmse(y_test, y_pred)
print("rmse: ",score)
# Output: rmse: 0.5094518818513973
RMSE_test = 0.5094518818513973 is not far from RMSE_val = 0.4568807317131709. This suggests that the model generalizes reasonably well and that the validation score was not a lucky accident. Now we have our final model and can use it to predict the price of an unseen car, where unseen means that the model did not see this car during training.
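How close the two scores are can be put into numbers. The short sketch below uses only the two RMSE values reported above; what counts as "close enough" remains a judgment call and is not something the code decides.

```python
# RMSE values copied from the outputs above.
rmse_val = 0.4568807317131709
rmse_test = 0.5094518818513973

# Absolute and relative gap between test and validation RMSE.
abs_gap = rmse_test - rmse_val
rel_gap = abs_gap / rmse_val

print(f"absolute gap: {abs_gap:.4f}")
print(f"relative gap: {rel_gap:.1%}")
```

The test RMSE comes out roughly 11.5% higher than the validation RMSE.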
Using the model
Using the model means:
- Extracting all the features (getting feature vector of the car)
- Applying our final model to this feature vector & predicting the price
Feature Extraction
For this step we can take any car from our test dataset and pretend it’s a new car. Let’s just take one car.
df_test.iloc[20]
# Output:
# make saab
# model 9-3_griffin
# year 2012
# engine_fuel_type premium_unleaded_(recommended)
# engine_hp 220.0
# engine_cylinders 4.0
# transmission_type manual
# driven_wheels all_wheel_drive
# number_of_doors 4.0
# market_category luxury
# vehicle_size compact
# vehicle_style wagon
# highway_mpg 30
# city_mpg 20
# popularity 376
# Name: 20, dtype: object
In practice we usually don't receive a dataframe. Instead we might get a Python dictionary with all the information about the car. Imagine a website or an app where people enter all the values. The website then sends a request with this information (as a dictionary) to the model, and the model replies with the predicted price.
For this example we turn the data of our car into a dictionary.
car = df_test.iloc[20].to_dict()
car
# Output:
# {'make': 'saab',
# 'model': '9-3_griffin',
# 'year': 2012,
# 'engine_fuel_type': 'premium_unleaded_(recommended)',
# 'engine_hp': 220.0,
# 'engine_cylinders': 4.0,
# 'transmission_type': 'manual',
# 'driven_wheels': 'all_wheel_drive',
# 'number_of_doors': 4.0,
# 'market_category': 'luxury',
# 'vehicle_size': 'compact',
# 'vehicle_style': 'wagon',
# 'highway_mpg': 30,
# 'city_mpg': 20,
# 'popularity': 376}
The car dictionary is our request. Remember that the prepare_X() function expects a dataframe, so we create a dataframe with a single row from the request.
df_small = pd.DataFrame([car])
df_small
| | make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | saab | 9-3_griffin | 2012 | premium_unleaded_(recommended) | 220 | 4.0 | manual | all_wheel_drive | 4.0 | luxury | compact | wagon | 30 | 20 | 376 |
We can use this single row DataFrame as input for the prepare_X() function to get the feature matrix. In this case our feature matrix is a feature vector.
X_small = prepare_X(df_small)
X_small
# Output:
# array([[220., 4., 30., 20., 376., 5., 0., 0., 1., 0., 0.,
# 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0.,
# 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0.,
# 1., 0., 0., 0., 0., 0., 0., 0.]])
Predicting the price
The final step is to apply the final model to our requested car (feature vector) and predict the price.
y_pred = w0 + X_small.dot(w)
# We don't need the array, only its first (and only) item
y_pred = y_pred[0]
y_pred
# Output: 9.954435569951846
The value 9.95 is not yet a price in dollars, because the model predicts the logarithm of the price. To get the actual price we undo the logarithm with np.expm1().
np.expm1(y_pred)
# Output: 21044.363844829495
After undoing the logarithm we get the price in dollars. The model estimates that a car with these characteristics should cost about $21,044.36.
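The steps above (dictionary, single-row dataframe, feature vector, prediction, undoing the logarithm) can be wrapped in one function. This is a minimal sketch under stated assumptions: prepare_X_simple() is a hypothetical, heavily simplified stand-in for the article's prepare_X() that uses only five numeric columns, and the weights are made up, so the resulting price is not meaningful.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for prepare_X(): numeric columns only.
BASE = ["engine_hp", "engine_cylinders", "highway_mpg", "city_mpg", "popularity"]

def prepare_X_simple(df):
    # Select the numeric features and replace missing values with 0.
    return df[BASE].fillna(0).values

def predict_price(car, w0, w):
    # Turn the request (a dict) into a single-row dataframe.
    df_small = pd.DataFrame([car])
    X_small = prepare_X_simple(df_small)
    # Linear model prediction in log space; take the only item.
    y_pred = (w0 + X_small.dot(w))[0]
    # Undo the logarithm to get a price in dollars.
    return np.expm1(y_pred)

# Made-up weights for illustration only, not the trained model.
w0 = 7.0
w = np.array([0.01, 0.05, 0.0, 0.0, 0.0])

car = {
    "engine_hp": 220.0,
    "engine_cylinders": 4.0,
    "highway_mpg": 30,
    "city_mpg": 20,
    "popularity": 376,
}

price = predict_price(car, w0, w)
print(f"predicted price: ${price:,.2f}")
```

With the real prepare_X(), w0 and w in place of the stand-ins, the same function would serve the website scenario described earlier: a dictionary goes in and a price comes out.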
Finally, to get a sense of how the model performs on this car, let's compare the predicted price with its actual price.
np.expm1(y_test[20])
# Output: 34975.0
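The comparison can be quantified with a few lines of arithmetic. The sketch below uses only the two prices shown above and computes the error both in dollars and in log space, the scale on which the RMSE was measured.

```python
import numpy as np

# Values copied from the outputs above.
predicted = 21044.363844829495
actual = 34975.0

# Error in dollars and relative to the actual price.
abs_error = actual - predicted
rel_error = abs_error / actual

# Error on the log scale (log1p is the inverse of expm1).
log_error = np.log1p(actual) - np.log1p(predicted)

print(f"absolute error: ${abs_error:,.2f}")
print(f"relative error: {rel_error:.1%}")
print(f"log-scale error: {log_error:.3f}")
```

The model underestimates this car by roughly 40% in dollars. On the log scale the error is about 0.51, which is in line with the test RMSE, so this car looks like a fairly typical miss rather than an outlier.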