ML Zoomcamp 2023 – Machine Learning for Regression – Part 9

Simple feature engineering

Suppose we want to derive a new feature from the existing ones in the feature matrix X. For example, we can turn the year column into an age. Let’s assume the current year is 2017, so a car’s age is 2017 minus its year.

2017 - df_train.year
# Output:
# 0        8
# 1        1
# 2       22
# 3        1
# 4        3
#         ..
# 7145     1
# 7146    10
# 7147     6
# 7148     3
# 7149    19
# Name: year, Length: 7150, dtype: int64
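
The reference year does not have to be typed in by hand. As a minimal sketch, assuming the newest car in the data defines "today", it can be derived from the column itself. The years below are made up for illustration and only stand in for df_train.year; the names years and reference_year are hypothetical.

```python
import pandas as pd

# Made-up years standing in for df_train.year (illustration only)
years = pd.Series([2009, 2016, 1995, 2017], name="year")

# Assumption: the newest car in the data defines the reference year
reference_year = int(years.max())

# Element-wise subtraction: one age value per row
age = reference_year - years

print(reference_year)
print(age.tolist())
```

Hard-coding 2017, as done in this post, keeps the feature stable even if newer cars are added to the data later.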

We can add this new feature ‘age’ to our prepare_X function. One important remark: it’s good practice to copy the dataframe inside prepare_X. Otherwise, adding the column to df modifies the original data, which is usually not wanted.

base = ['engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg', 'popularity']

def prepare_X(df):
    df = df.copy()
    
    df['age'] = 2017 - df.year
    features = base + ['age']
    
    df_num = df[features]
    df_num = df_num.fillna(0)
    # extracting the Numpy array
    X = df_num.values
    return X
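
To see why the copy matters, here is a small sketch on a toy dataframe (the values are made up, and the helper names add_age_inplace and add_age_copy are hypothetical). It compares a function that writes the new column directly into its argument with one that copies first.

```python
import pandas as pd

# Toy frame standing in for df_train (values are made up for illustration)
df_toy = pd.DataFrame({"year": [2009, 2016, 1995]})

def add_age_inplace(df):
    # No copy: the new column is written into the caller's dataframe
    df["age"] = 2017 - df["year"]
    return df

def add_age_copy(df):
    # Copy first: the caller's dataframe stays untouched
    df = df.copy()
    df["age"] = 2017 - df["year"]
    return df

with_copy = add_age_copy(df_toy)
copy_left_original_intact = "age" not in df_toy.columns

add_age_inplace(df_toy)
inplace_modified_original = "age" in df_toy.columns

print(copy_left_original_intact, inplace_modified_original)
```

Both flags come out True: the copying version leaves df_toy as it was, while the in-place version silently adds an ‘age’ column to it.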

X_train = prepare_X(df_train)
X_train
# Output:
# array([[3.100e+02, 8.000e+00, 1.800e+01, 1.300e+01, 1.851e+03, 8.000e+00],
#       [1.700e+02, 4.000e+00, 3.200e+01, 2.400e+01, 6.400e+02, 1.000e+00],
#       [1.650e+02, 6.000e+00, 1.500e+01, 1.300e+01, 5.490e+02, 2.200e+01],
#       ...,
#       [3.420e+02, 8.000e+00, 2.400e+01, 1.700e+01, 4.540e+02, 6.000e+00],
#       [1.700e+02, 4.000e+00, 2.800e+01, 2.300e+01, 2.009e+03, 3.000e+00],
#       [1.600e+02, 6.000e+00, 1.900e+01, 1.400e+01, 5.860e+02, 1.900e+01]])

The output of the last snippet shows a two-dimensional NumPy array. Each row has 6 values: the 5 numerical base columns and our new ‘age’ column. Let’s train a new model and see how it performs.

X_train = prepare_X(df_train)
w0, w = train_linear_regression(X_train, y_train)

X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)

rmse(y_val, y_pred)
# Output: 0.5153662333982238

We can see a clear improvement: the RMSE decreased from 0.7328022115111966 to 0.5153662333982238. Let’s see whether this improvement is visible in the plots as well.
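
The same effect can be reproduced on synthetic data. The following is only a sketch under stated assumptions: train_linear_regression and rmse are re-implemented here in a common form (normal equation with a bias term, root of the mean squared error) and may differ from the versions defined earlier in this series, and the data is randomly generated so that the target really depends on age.

```python
import numpy as np

def train_linear_regression(X, y):
    # Assumed implementation: normal equation with an added bias column
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])
    w_full = np.linalg.solve(X.T.dot(X), X.T.dot(y))
    return w_full[0], w_full[1:]

def rmse(y, y_pred):
    # Assumed implementation: root of the mean squared error
    return np.sqrt(((y - y_pred) ** 2).mean())

# Synthetic data (made up): the target depends on engine_hp and on age
rng = np.random.default_rng(42)
n = 1000
engine_hp = rng.uniform(100, 400, size=n)
year = rng.integers(1993, 2018, size=n)  # years 1993..2017
age = 2017 - year
y = 10 + 0.005 * engine_hp - 0.1 * age + rng.normal(0, 0.1, size=n)

# Simple split: first 800 rows for training, the rest for validation
train, val = slice(0, 800), slice(800, None)

X_base = engine_hp.reshape(-1, 1)
X_age = np.column_stack([engine_hp, age])

def validate(X):
    # Train on the training rows, report RMSE on the validation rows
    w0, w = train_linear_regression(X[train], y[train])
    return rmse(y[val], w0 + X[val].dot(w))

rmse_base = validate(X_base)
rmse_with_age = validate(X_age)
print(rmse_base, rmse_with_age)
```

Because the synthetic target was built from age, the model that sees the age column ends up with a much lower validation RMSE, mirroring the drop observed on the car data.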

sns.histplot(y_pred, color='red', alpha=0.5, bins=50)
sns.histplot(y_val, color='blue', alpha=0.5, bins=50)

Here, too, a clear improvement can be seen. Many car prices are now predicted much better, but there is still room for improvement.
