ML Zoomcamp 2023 – Machine Learning for Regression – Part 11

Regularization

The topic for this part is regularization, a way to deal with the problem of duplicated columns in our data. Remember the formula for the normal equation:

w = (XᵀX)⁻¹ Xᵀy
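
In NumPy this can be written directly. The following is a minimal sketch; it assumes X is the feature matrix (already containing a column of ones for the bias term) and y is the target vector:

# Gram matrix
XTX = X.T.dot(X)
# normal equation: w = (X^T X)^-1 X^T y
w = np.linalg.inv(XTX).dot(X.T).dot(y)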

The problem we have is connected with the first part, (XᵀX)⁻¹: we need to take the inverse of the Gram matrix XᵀX. Sometimes this inverse doesn’t exist. This happens when there are duplicate features in X. Let’s take an easy example.

# You see here 2nd and 3rd columns are identical
X = [
    [4, 4, 4],
    [3, 5, 5],
    [5, 1, 1],
    [5, 4, 4],
    [7, 5, 5],
    [4, 5, 5]
]

X = np.array(X)
X
# Output:
# array([[4, 4, 4],
#        [3, 5, 5],
#        [5, 1, 1],
#        [5, 4, 4],
#        [7, 5, 5],
#        [4, 5, 5]])

XTX = X.T.dot(X)
XTX
# Output:
# array([[140, 111, 111],
#        [111, 108, 108],
#        [111, 108, 108]])

The output of the last snippet shows the XTX matrix. You can see that its 2nd and 3rd rows and columns are identical, and in this case the inverse doesn’t exist.
Remember: in linear algebra we say that one column is a linear combination of the other columns. That means column number 3 can be expressed through the other columns of the matrix, because it is just a duplicate of column 2.
Therefore the next code snippet raises the error “Singular matrix”.

np.linalg.inv(XTX)
# Output: LinAlgError: Singular matrix
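
We can confirm directly that the third column of X is just a copy of the second one, i.e. a trivial linear combination of the other columns:

# the column with index 2 is identical to the column with index 1
np.allclose(X[:, 2], X[:, 1])
# Output: True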

The code from the last article didn’t raise that error, so the inverse of that Gram matrix exists. But the reason for the very big RMSE value is that our data is not very clean.
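
For reference, the unregularized training function from the previous part looks roughly like this (a sketch; the exact code is in the earlier article):

def train_linear_regression(X, y):
    # add a column of ones for the bias term w0
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    # normal equation: w = (X^T X)^-1 X^T y
    XTX = X.T.dot(X)
    XTX_inv = np.linalg.inv(XTX)
    w_full = XTX_inv.dot(X.T).dot(y)

    return w_full[0], w_full[1:]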

Let’s go back to our example, but this time with an X similar to before, just with a little noise added.

X = [
    [4, 4, 4],
    [3, 5, 5],
    [5, 1, 1],
    [5, 4, 4],
    [7, 5, 5],
    [4, 5, 5.0000001],
]

X = np.array(X)
y = [1, 2, 3, 1, 2, 3]

XTX = X.T.dot(X)
XTX
# Output:
# array([[140.       , 111.       , 111.0000004],
#        [111.       , 108.       , 108.0000005],
#        [111.0000004, 108.0000005, 108.000001 ]])

XTX_inv = np.linalg.inv(XTX)
XTX_inv
# Output:
# array([[ 3.93617174e-02, -1.76703046e+05,  1.76703004e+05],
#        [-1.76703046e+05,  4.02107113e+13, -4.02107110e+13],
#        [ 1.76703004e+05, -4.02107110e+13,  4.02107106e+13]])

As we can see from this example, a little noise is enough to make the two columns no longer identical. As a result, inverting the Gram matrix no longer throws an error, and we can calculate the vector w again.

w = XTX_inv.dot(X.T).dot(y)
w
# Output: array([ 2.85838502e-01, -5.04106388e+06,  5.04106425e+06])

The first element of w (the one for the unique feature) looks OK, but the second and the third elements (the ones for the near-duplicate columns) are very big numbers. That’s because we have duplicates in our feature matrix: the noise only prevents the error from being thrown, and the model still performs poorly due to the duplicates. What we can do to fix the problem is to add a small number (called alpha) to the diagonal of XTX. Let’s demonstrate this on a simpler example of XTX.

# Here a small number (0.0001) has already been added to the diagonal.
# This helps to keep the inverse under control, so the values of w become smaller.
XTX = [
    [1.0001, 2, 2],
    [2, 1.0001, 1.0000001],
    [2, 1.0000001, 1.0001]
]

XTX = np.array(XTX)
np.linalg.inv(XTX)
# Output:
# array([[-3.33366691e-01,  3.33350007e-01,  3.33350007e-01],
#        [ 3.33350007e-01,  5.00492166e+03, -5.00508835e+03],
#        [ 3.33350007e-01, -5.00508835e+03,  5.00492166e+03]])

The larger the number alpha we add to the diagonal, the more we keep these weights under control. The reason this works is that adding alpha to the diagonal makes the columns no longer exact copies of each other, so the matrix becomes invertible and better conditioned.

XTX = [
    [1.01, 2, 2],
    [2, 1.01, 1.0000001],
    [2, 1.0000001, 1.01]
]

XTX = np.array(XTX)
np.linalg.inv(XTX)
# Output:
# array([[ -0.33668908,   0.33501399,   0.33501399],
#        [  0.33501399,  49.91590897, -50.08509104],
#        [  0.33501399, -50.08509104,  49.91590897]])
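
To see this trend at a glance, here is a small sketch that starts from the unregularized matrix (with ones on the diagonal) and prints the largest absolute value of its regularized inverse for a few values of alpha:

# base matrix with two (almost) identical rows and columns
XTX_base = np.array([
    [1, 2, 2],
    [2, 1, 1.0000001],
    [2, 1.0000001, 1]
])

# the larger alpha, the smaller the values in the inverse
for alpha in [0.0001, 0.01, 0.1, 1]:
    XTX_inv = np.linalg.inv(XTX_base + alpha * np.eye(3))
    print(alpha, np.abs(XTX_inv).max())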

Let’s implement this.

XTX = [
    [1, 2, 2],
    [2, 1, 1.0000001],
    [2, 1.0000001, 1]
]

XTX =  np.array(XTX)
XTX
# Output:
# array([[1.       , 2.       , 2.       ],
#        [2.       , 1.       , 1.0000001],
#        [2.       , 1.0000001, 1.       ]])

Remember, there was the eye function to get an identity matrix. Maybe we can use this…

np.eye(3)

# Adding this identity matrix to XTX adds one to each diagonal element
XTX + np.eye(3)

# We can multiply this identity matrix by a small number instead.
# Note: these additions accumulate when the cell is re-run (see also the
# commented-out lines below), which is why the diagonal in the output
# shows 2.13 rather than 1.01.
XTX = XTX + 0.01*np.eye(3)
XTX
# Output:
# array([[2.13     , 2.       , 2.       ],
#        [2.       , 2.13     , 1.0000001],
#        [2.       , 1.0000001, 2.13     ]])

#XTX = XTX + 0.1*np.eye(3)
#XTX

#XTX = XTX + 1*np.eye(3)
#XTX
np.linalg.inv(XTX)
# Output:
# array([[-2.34791133,  1.50026279,  1.50026279],
#        [ 1.50026279, -0.35641202, -1.24136785],
#        [ 1.50026279, -1.24136785, -0.35641202]])

Solving the problem this way is called regularization, which in this case means controlling: we’re controlling the weights so that they don’t grow too much. Alpha (here 0.01) is a parameter: the larger it is, the larger the numbers on the diagonal, and the larger the numbers on the diagonal, the smaller the values in the inverse of XTX.

This leads us to reimplementing the train_linear_regression function again.

# reg = regularized
# the parameter r (short for regularization) is the amount added to the diagonal
def train_linear_regression_reg(X, y, r=0.001):
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])
    
    XTX = X.T.dot(X)
    XTX = XTX + r*np.eye(XTX.shape[0])
    
    XTX_inv = np.linalg.inv(XTX)
    w_full = XTX_inv.dot(X.T).dot(y)
    
    return w_full[0], w_full[1:]
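
To get a feeling for the effect, we can apply this function to the small noisy example from above (a quick check; X and y are the arrays defined earlier):

# X is the noisy 6x3 array, y = [1, 2, 3, 1, 2, 3]
w0, w = train_linear_regression_reg(X, y, r=0.001)
w0, w
# the weights should now be of moderate size (no more values in the millions)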

To test the effect of regularization, we need to re-train our model and apply it to the validation data. Then we can calculate the rmse.

X_train = prepare_X(df_train)
w0, w = train_linear_regression_reg(X_train, y_train, r=0.01)

X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)

rmse(y_val, y_pred)
# Output: 0.45685446091134857

This is the best result we have had so far, but we don’t know yet whether an even better one exists.
How to find a good value for r is the topic of the next article.
