ML Zoomcamp 2023 – Machine Learning for Regression – Part 10

Categorical variables

Categorical variables are variables whose values are categories (typically strings). Here these are: make, model, engine_fuel_type, transmission_type, driven_wheels, market_category, vehicle_size and vehicle_style. But there is also one variable that looks numerical and isn't: number_of_doors is stored as a number, yet it really encodes a category.

df_train.dtypes
# Output:
# make                  object
# model                 object
# year                   int64
# engine_fuel_type      object
# engine_hp            float64
# engine_cylinders     float64
# transmission_type     object
# driven_wheels         object
# number_of_doors      float64
# market_category       object
# vehicle_size          object
# vehicle_style         object
# highway_mpg            int64
# city_mpg               int64
# popularity             int64
# dtype: object
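If we want this list of categorical columns programmatically instead of reading it off the output, we can select the object-typed columns (a small sketch; it returns exactly the string columns listed above):

list(df_train.dtypes[df_train.dtypes == 'object'].index)
# Output: ['make', 'model', 'engine_fuel_type', 'transmission_type',
#  'driven_wheels', 'market_category', 'vehicle_size', 'vehicle_style']

Back to number_of_doors: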
df_train.number_of_doors
# Output: 
# 0       4.0
# 1       4.0
# 2       3.0
# 3       4.0
# 4       4.0
#        ... 
# 7145    4.0
# 7146    2.0
# 7147    4.0
# 7148    4.0
# 7149    2.0
# Name: number_of_doors, Length: 7150, dtype: float64

df_train.number_of_doors == 2
# Output: 
# 0       False
# 1       False
# 2       False
# 3       False
# 4       False
#         ...  
# 7145    False
# 7146     True
# 7147    False
# 7148    False
# 7149     True
# Name: number_of_doors, Length: 7150, dtype: bool

The typical way of encoding such categorical variables is to represent them with a bunch of binary columns, the so-called one-hot encoding: each distinct value gets its own column.

Num of doors | num_doors_2 | num_doors_3 | num_doors_4
2            | 1           | 0           | 0
3            | 0           | 1           | 0
4            | 0           | 0           | 1
2            | 1           | 0           | 0

One-hot encoding for the feature Num of doors
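As an aside, pandas ships a helper that produces this kind of encoding directly; a minimal sketch (we won't use it here and instead build the columns by hand, which keeps the logic explicit):

pd.get_dummies(df_train.number_of_doors, prefix='num_doors')
# Creates one binary column per distinct value; since the dtype here is
# float64, the columns come out as num_doors_2.0, num_doors_3.0, num_doors_4.0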

We can imitate this encoding by turning the booleans from the last snippet into integers (1 and 0) and creating a new variable for each number of doors.

df_train['num_doors_2'] = (df_train.number_of_doors == 2).astype('int')
df_train['num_doors_3'] = (df_train.number_of_doors == 3).astype('int')
df_train['num_doors_4'] = (df_train.number_of_doors == 4).astype('int')

But we can do this more easily with string formatting.

'num_doors_%s' % 4
# Output: 'num_doors_4'
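In modern Python the same thing is usually written with an f-string:

v = 4
f'num_doors_{v}'
# Output: 'num_doors_4'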

# With such a format string we can write a loop
for v in [2, 3, 4]:
    df_train['num_doors_%s' % v] = (df_train.number_of_doors == v).astype('int')

# We delete these columns again because we'll create them inside prepare_X instead
for v in [2, 3, 4]:
    del df_train['num_doors_%s' % v]

Let’s use this string formatting method in our prepare_X function.

def prepare_X(df):
    df = df.copy()
    features = base.copy()
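    # base is the list of numerical base features defined in the earlier parts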
    
    df['age'] = 2017 - df.year
    features.append('age')
    
    for v in [2, 3, 4]:
        df['num_doors_%s' % v] = (df.number_of_doors == v).astype('int')
        features.append('num_doors_%s' % v)
    
    df_num = df[features]
    df_num = df_num.fillna(0)
    # extracting the NumPy array
    X = df_num.values
    return X

prepare_X(df_train)
# Output:
# array([[310.,   8.,  18., ...,   0.,   0.,   1.],
#        [170.,   4.,  32., ...,   0.,   0.,   1.],
#        [165.,   6.,  15., ...,   0.,   1.,   0.],
#        ...,
#        [342.,   8.,  24., ...,   0.,   0.,   1.],
#        [170.,   4.,  28., ...,   0.,   0.,   1.],
#        [160.,   6.,  19., ...,   1.,   0.,   0.]])

When you look at the output of the last snippet, you can see that at the end of each row there are three new entries, one for each number of doors (2, 3, 4). Now we can check whether the model performance has improved with the new features.
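We can also sanity-check the number of columns. Assuming base holds the five numeric features used in the earlier parts (engine_hp, engine_cylinders, highway_mpg, city_mpg, popularity), prepare_X should produce 5 + 1 (age) + 3 (doors) = 9 columns:

prepare_X(df_train).shape
# Expected under that assumption: (7150, 9)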

X_train = prepare_X(df_train)
w0, w = train_linear_regression(X_train, y_train)

X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)

rmse(y_val, y_pred)
# Output: 0.5139733981046036
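As a reminder, train_linear_regression and rmse are the helpers from the earlier parts of this series. Assuming the same definitions, they look roughly like this:

import numpy as np

def train_linear_regression(X, y):
    # prepend a column of ones so that w0 becomes the bias term
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])
    # normal equation: w = (X^T X)^-1 X^T y
    XTX = X.T.dot(X)
    XTX_inv = np.linalg.inv(XTX)
    w_full = XTX_inv.dot(X.T).dot(y)
    return w_full[0], w_full[1:]

def rmse(y, y_pred):
    # root mean squared error between targets and predictions
    error = y - y_pred
    return np.sqrt((error ** 2).mean())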

Compared to the last training run with an RMSE of 0.5153662333982238, this is only a slight, almost negligible improvement, so the number of doors feature is not that useful. Maybe the 'make' information is more useful.

df.make.nunique() 
# Output: 48

df.make
# Output:
# 0            bmw
# 1            bmw
# 2            bmw
# 3            bmw
# 4            bmw
#           ...   
# 11909      acura
# 11910      acura
# 11911      acura
# 11912      acura
# 11913    lincoln
# Name: make, Length: 11914, dtype: object

There are 48 unique values in the 'make' column. That might be too many. Let's look at the most popular ones.

df.make.value_counts().head()
# Output:
# chevrolet     1123
# ford           881
# volkswagen     809
# toyota         746
# dodge          626
# Name: make, dtype: int64

# If we want to get the actual values, we use the index property
df.make.value_counts().head().index
# Wrap it in a plain Python list
makes = list(df.make.value_counts().head().index)
makes
# Output: ['chevrolet', 'ford', 'volkswagen', 'toyota', 'dodge']

We can now adapt our prepare_X function again to add the new features.

def prepare_X(df):
    df = df.copy()
    features = base.copy()
    
    df['age'] = 2017 - df.year
    features.append('age')
    
    for v in [2, 3, 4]:
        df['num_doors_%s' % v] = (df.number_of_doors == v).astype('int')
        features.append('num_doors_%s' % v)
        
    for v in makes:
        df['make_%s' % v] = (df.make == v).astype('int')
        features.append('make_%s' % v)
    
    df_num = df[features]
    df_num = df_num.fillna(0)
    # extracting the NumPy array
    X = df_num.values
    return X

Now we can use our new prepare_X function and train and validate again.

X_train = prepare_X(df_train)
w0, w = train_linear_regression(X_train, y_train)

X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)

rmse(y_val, y_pred)
# Output: 0.5058837299788781

The model performance has once again improved somewhat. How about adding all the other categorical variables now? This should improve the performance even more, right? Let’s try.

categorical_variables = [
    'make', 'engine_fuel_type', 'transmission_type', 'driven_wheels', 
    'market_category', 'vehicle_size', 'vehicle_style'
]

# The dictionary categories will contain, for each categorical
# variable, its top 5 most common values
categories = {}

for c in categorical_variables:
    categories[c] = list(df[c].value_counts().head().index)
    
categories
# Output:
# {'make': ['chevrolet', 'ford', 'volkswagen', 'toyota', 'dodge'],
#  'engine_fuel_type': ['regular_unleaded',
#   'premium_unleaded_(required)',
#   'premium_unleaded_(recommended)',
#   'flex-fuel_(unleaded/e85)',
#   'diesel'],
#  'transmission_type': ['automatic',
#   'manual',
#   'automated_manual',
#   'direct_drive',
#   'unknown'],
#  'driven_wheels': ['front_wheel_drive',
#   'rear_wheel_drive',
#   'all_wheel_drive',
#   'four_wheel_drive'],
#  'market_category': ['crossover',
#   'flex_fuel',
#   'luxury',
#   'luxury,performance',
#   'hatchback'],
#  'vehicle_size': ['compact', 'midsize', 'large'],
#  'vehicle_style': ['sedan',
#   '4dr_suv',
#   'coupe',
#   'convertible',
#   '4dr_hatchback']}

The next snippet shows how to add the new features to our prepare_X function. This time we need two nested loops, as described in the inline comments.

def prepare_X(df):
    # work on a copy, otherwise using df would modify the original data,
    # which is usually not wanted
    df = df.copy()
    features = base.copy()
    
    df['age'] = 2017 - df.year
    features.append('age')
    
    for v in [2, 3, 4]:
        df['num_doors_%s' % v] = (df.number_of_doors == v).astype('int')
        features.append('num_doors_%s' % v)

    # The outer loop goes over each key of the categories dictionary.
    # The inner loop goes over each of that category's values.
    # For each of these values we create a new binary column.
    for c, values in categories.items():    
        for v in values:
            df['%s_%s' % (c, v)] = (df[c] == v).astype('int')
            features.append('%s_%s' % (c, v))
    
    df_num = df[features]
    df_num = df_num.fillna(0)
    # extracting the NumPy array
    X = df_num.values
    return X

Now we can train the model again and apply it to the validation data to see how the model performs.

X_train = prepare_X(df_train)
w0, w = train_linear_regression(X_train, y_train)

X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)

rmse(y_val, y_pred)
# Output: 292.5054633101075

This time the model performance is very bad: the RMSE (292.505) is huge, so something went wrong. In the next article we'll see why that happened and how to fix it.
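One quick way to get a hint is to check how close X^T X is to being singular, since train_linear_regression inverts it; a small diagnostic sketch (np.linalg.cond is standard NumPy, the interpretation threshold is a judgment call):

X_train = prepare_X(df_train)
XTX = X_train.T.dot(X_train)
np.linalg.cond(XTX)
# A very large condition number means XTX is nearly singular, so
# inverting it blows small numerical errors up into huge weights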
