ML Zoomcamp 2023 – Machine Learning for Regression – Part 7

  1. Car price baseline model
    1. Value extraction
    2. Missing values
    3. Plotting model performance

Car price baseline model

This article is about building a baseline model for car price prediction. We'll reuse the code implemented in the last article and start with a simple model that uses only the numerical columns.

The next code snippet shows the dtype of every column, which lets us identify the numerical ones. The same information identifies the categorical columns; a programmatic way to do the split follows the output.

df_train.dtypes

# Output:
# make                  object
# model                 object
# year                   int64
# engine_fuel_type      object
# engine_hp            float64
# engine_cylinders     float64
# transmission_type     object
# driven_wheels         object
# number_of_doors      float64
# market_category       object
# vehicle_size          object
# vehicle_style         object
# highway_mpg            int64
# city_mpg               int64
# popularity             int64
# dtype: object
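If you want to split the columns programmatically rather than by eye, pandas' select_dtypes can do it. A small sketch, not part of the course code, just for illustration:

numerical = df_train.select_dtypes(include='number').columns.tolist()
categorical = df_train.select_dtypes(include='object').columns.tolist()

numerical
# Output:
# ['year', 'engine_hp', 'engine_cylinders', 'number_of_doors',
#  'highway_mpg', 'city_mpg', 'popularity']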

df_train.columns
# Output: 
# Index(['make', 'model', 'year', 'engine_fuel_type', 'engine_hp',
#       'engine_cylinders', 'transmission_type', 'driven_wheels',
#       'number_of_doors', 'market_category', 'vehicle_size', 'vehicle_style',
#       'highway_mpg', 'city_mpg', 'popularity'],
#      dtype='object')

We choose the columns engine_hp, engine_cylinders, highway_mpg, city_mpg, and popularity for our base model.

base = ['engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg', 'popularity']

df_train[base].head()
   engine_hp  engine_cylinders  highway_mpg  city_mpg  popularity
0      310.0               8.0           18        13        1851
1      170.0               4.0           32        24         640
2      165.0               6.0           15        13         549
3      150.0               4.0           39        28         873
4      510.0               8.0           19        13         258

(df_train[base] as a whole has 7150 rows × 5 columns.)

Value extraction

We need to extract the underlying values as a NumPy array so we can use them for training.

X_train = df_train[base].values
X_train
# Output:
# array([[ 310.,    8.,   18.,   13., 1851.],
#        [ 170.,    4.,   32.,   24.,  640.],
#        [ 165.,    6.,   15.,   13.,  549.],
#        ...,
#        [ 342.,    8.,   24.,   17.,  454.],
#        [ 170.,    4.,   28.,   23., 2009.],
#        [ 160.,    6.,   19.,   14.,  586.]])

Missing values

Missing values are a problem when training a model, so you should always check whether any are present.

df_train[base].isnull().sum()
# Output:
# engine_hp           42
# engine_cylinders    17
# highway_mpg          0
# city_mpg             0
# popularity           0
# dtype: int64
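If you prefer the share of missing values over raw counts, pandas can give you that too (a small aside, not part of the course code; the values follow from 42 and 17 missing entries out of 7150 rows):

df_train[base].isnull().mean()
# Output (approximately):
# engine_hp           0.0059
# engine_cylinders    0.0024
# highway_mpg         0.0000
# city_mpg            0.0000
# popularity          0.0000
# dtype: float64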

As you can see, two columns have missing values. The easiest fix is to fill them with zeros. But notice that filling with 0 makes the model ignore that feature for the affected observation, because

g(x_i) = w_0 + x_{i1} w_1 + x_{i2} w_2

and if x_{i1} = 0, the equation simplifies to

g(x_i) = w_0 + 0 · w_1 + x_{i2} w_2 = w_0 + x_{i2} w_2

However, 0 is not always the best way to deal with missing values, because it effectively pretends we observed a car with 0 cylinders or 0 horsepower, which does not make much sense. For the current example, though, this procedure is sufficient.
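A common alternative, which we don't use in this article, is to fill missing values with each column's mean, so a missing entry looks like an average car instead of an impossible one. A minimal sketch (note that the means should be computed on the training data only):

# fill missing values with the per-column mean of the training data
means = df_train[base].mean()
X_train_mean = df_train[base].fillna(means).values

For this article we stick with zeros: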

df_train[base].fillna(0).isnull().sum()
# Output:
# engine_hp           0
# engine_cylinders    0
# highway_mpg         0
# city_mpg            0
# popularity          0
# dtype: int64

As you can see in the last snippet, the missing values have disappeared. Note that fillna returns a new DataFrame rather than changing df_train in place, so we use its values directly when building X_train.

X_train = df_train[base].fillna(0).values
X_train
# Output:
# array([[ 310.,    8.,   18.,   13., 1851.],
#        [ 170.,    4.,   32.,   24.,  640.],
#        [ 165.,    6.,   15.,   13.,  549.],
#        ...,
#        [ 342.,    8.,   24.,   17.,  454.],
#        [ 170.,    4.,   28.,   23., 2009.],
#        [ 160.,    6.,   19.,   14.,  586.]])

y_train
# Output:
# array([10.40262514, 10.06032035,  7.60140233, ..., 10.92181127,
#         9.91100919,  8.10892416])

Now we can train our model using the train_linear_regression function that we implemented in the last article. The function returns the bias w0 and an array for the weight vector w.

w0, w = train_linear_regression(X_train, y_train)
w0, w
# Output:
# (7.471835414587793,
#  array([ 9.30959186e-03, -1.19533938e-01,  4.68925224e-02, -1.13441722e-02,
#         -1.01631104e-05]))
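As a reminder, the function from the last article is based on the normal equation. A compact sketch of that approach (see the previous article for the exact implementation):

import numpy as np

def train_linear_regression(X, y):
    # prepend a column of ones so the bias w0 is learned as well
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    # normal equation: w = (X^T X)^(-1) X^T y
    XTX = X.T.dot(X)
    XTX_inv = np.linalg.inv(XTX)
    w_full = XTX_inv.dot(X.T).dot(y)

    return w_full[0], w_full[1:]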

We can use these two variables to apply the model to the training dataset and see how well it has learned the training data. We do not want the model to simply memorize the data, but to capture the underlying relationships. Later, we'll also apply the model to unseen data so that we can rule out memorization.

y_pred = w0 + X_train.dot(w)
y_pred
# Output: 
# array([10.07931663,  9.79812647,  8.84104849, ..., 10.62739988,
#         9.60798726,  8.97035041])
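As a quick sanity check (a small aside, not from the course code), we can print the first few predictions next to the actual targets:

# compare the first five predictions with the actual targets
for pred, actual in zip(y_pred[:5], y_train[:5]):
    print(f'predicted: {pred:.2f}   actual: {actual:.2f}')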

Plotting model performance

We now have both the actual values of y (y_train) and the predicted values (y_pred). Comparing them one by one, as in the quick check above, only goes so far; Seaborn provides a better, visual way to compare the two distributions as a whole. The next snippet plots both as overlapping histograms.

import matplotlib.pyplot as plt
import seaborn as sns

# alpha controls the transparency of the bars
# bins sets the number of bars
sns.histplot(y_pred, color='red', alpha=0.5, bins=50)
sns.histplot(y_train, color='blue', alpha=0.5, bins=50)
plt.show()

You can see from this plot that the model is not ideal, but it's better to have an objective way of saying whether the model is good or not. When we start improving the model, we also want to ensure that we are really improving it. The next article is about an objective way to evaluate the performance of a regression model.
