Categorical variables
Categorical variables are variables whose values are categories (typically strings).
In our dataset these are: make, model, engine_fuel_type, transmission_type, driven_wheels, market_category, vehicle_size and vehicle_style. There is also one column that looks like a numerical variable but isn't: number_of_doors is really a categorical variable, not a numerical one.
df_train.dtypes
# Output:
# make object
# model object
# year int64
# engine_fuel_type object
# engine_hp float64
# engine_cylinders float64
# transmission_type object
# driven_wheels object
# number_of_doors float64
# market_category object
# vehicle_size object
# vehicle_style object
# highway_mpg int64
# city_mpg int64
# popularity int64
# dtype: object
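As a side note, we can pull the string (object) columns out of the dtypes programmatically. A minimal sketch, assuming df_train is the training DataFrame from above:
# keep only the columns whose dtype is 'object' (i.e. strings)
string_columns = list(df_train.dtypes[df_train.dtypes == 'object'].index)
string_columns
# Output:
# ['make', 'model', 'engine_fuel_type', 'transmission_type',
#  'driven_wheels', 'market_category', 'vehicle_size', 'vehicle_style']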
df_train.number_of_doors
# Output:
# 0 4.0
# 1 4.0
# 2 3.0
# 3 4.0
# 4 4.0
# ...
# 7145 4.0
# 7146 2.0
# 7147 4.0
# 7148 4.0
# 7149 2.0
# Name: number_of_doors, Length: 7150, dtype: float64
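We can quickly confirm that this column only takes a handful of distinct values; value_counts shows each value together with its frequency (the exact counts are not important here):
# how often does each door count occur in the training data?
df_train.number_of_doors.value_counts()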
df_train.number_of_doors == 2
# Output:
# 0 False
# 1 False
# 2 False
# 3 False
# 4 False
# ...
# 7145 False
# 7146 True
# 7147 False
# 7148 False
# 7149 True
# Name: number_of_doors, Length: 7150, dtype: bool
The typical way of encoding such categorical variables is to represent them with a set of binary columns, one column for each possible value. This is called one-hot encoding (see also the pandas sketch after the table below).
| Num of doors | num_doors_2 | num_doors_3 | num_doors_4 |
|---|---|---|---|
| 2 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 |
| 4 | 0 | 0 | 1 |
| 2 | 1 | 0 | 0 |
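As an aside, pandas can produce this kind of encoding for us with get_dummies. A minimal sketch, not the approach we take below, assuming pandas is imported as pd:
# one binary column per distinct door count; since the column is float64,
# the generated names look like num_doors_2.0, num_doors_3.0, num_doors_4.0
pd.get_dummies(df_train.number_of_doors, prefix='num_doors')
We won't use get_dummies here, though: building the columns by hand gives us full control over which values get a column, which will matter once we limit ourselves to only the most frequent categories.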
We can imitate this encoding ourselves by turning booleans like the ones in the comparison snippet above into integers (1 and 0) and creating a new column for each number of doors.
df_train['num_doors_2'] = (df_train.number_of_doors == 2).astype('int')
df_train['num_doors_3'] = (df_train.number_of_doors == 3).astype('int')
df_train['num_doors_4'] = (df_train.number_of_doors == 4).astype('int')
But we can do this more conveniently with string formatting.
'num_doors_%s' % 4
# Output: 'num_doors_4'
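The same string can also be built with an f-string, which is the more modern style; we stick with the % operator here to stay consistent with the rest of the series:
f'num_doors_{4}'
# Output: 'num_doors_4'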
# With this formatting we can write a loop
for v in [2, 3, 4]:
    df_train['num_doors_%s' % v] = (df_train.number_of_doors == v).astype('int')
# We delete these columns again because we'll add them inside prepare_X instead
for v in [2, 3, 4]:
    del df_train['num_doors_%s' % v]
Let’s use this string formatting approach in our prepare_X function.
def prepare_X(df):
    df = df.copy()
    features = base.copy()
    df['age'] = 2017 - df.year
    features.append('age')
    for v in [2, 3, 4]:
        df['num_doors_%s' % v] = (df.number_of_doors == v).astype('int')
        features.append('num_doors_%s' % v)
    df_num = df[features]
    df_num = df_num.fillna(0)
    # extracting the NumPy array
    X = df_num.values
    return X
prepare_X(df_train)
# Output:
# array([[310., 8., 18., ..., 0., 0., 1.],
# [170., 4., 32., ..., 0., 0., 1.],
# [165., 6., 15., ..., 0., 1., 0.],
# ...,
# [342., 8., 24., ..., 0., 0., 1.],
# [170., 4., 28., ..., 0., 0., 1.],
# [160., 6., 19., ..., 1., 0., 0.]])
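Besides eyeballing the array, we can also check its shape. A quick sketch; the exact number of columns depends on how many features are in the base list defined earlier:
X_train = prepare_X(df_train)
# the number of columns should be len(base) + 1 (for age) + 3 (for the door columns)
X_train.shape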
When you look at the output of prepare_X above, you can see three new items at the end of each row, one for each number of doors (2, 3, 4). Now we can check whether the model performance has improved with the new features.
X_train = prepare_X(df_train)
w0, w = train_linear_regression(X_train, y_train)
X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)
rmse(y_val, y_pred)
# Output: 0.5139733981046036
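Both train_linear_regression and rmse come from the previous parts of this series. As a reminder, rmse can be implemented roughly like this (a minimal sketch, assuming NumPy is imported as np):
import numpy as np

def rmse(y, y_pred):
    # root mean squared error between the true and the predicted values
    error = y - y_pred
    return np.sqrt((error ** 2).mean())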
Compared to the previous training run with an RMSE of 0.5153662333982238, this is only a slight, almost negligible improvement, so the number of doors feature is not that useful. Maybe the 'make' information is more useful.
df.make.nunique()
# Output: 48
df.make
# Output:
# 0 bmw
# 1 bmw
# 2 bmw
# 3 bmw
# 4 bmw
# ...
# 11909 acura
# 11910 acura
# 11911 acura
# 11912 acura
# 11913 lincoln
# Name: make, Length: 11914, dtype: object
There are 48 unique values in the 'make' column. That could be too many to encode them all, so let's look at the most popular ones.
df.make.value_counts().head()
# Output:
# chevrolet 1123
# ford 881
# volkswagen 809
# toyota 746
# dodge 626
# Name: make, dtype: int64
# If we want to get the actual values, we use the index property
df.make.value_counts().head().index
# Wrap it in a regular Python list
makes = list(df.make.value_counts().head().index)
makes
# Output: ['chevrolet', 'ford', 'volkswagen', 'toyota', 'dodge']
We can now adapt our prepare_X function again to add the new features.
def prepare_X(df):
    df = df.copy()
    features = base.copy()
    df['age'] = 2017 - df.year
    features.append('age')
    for v in [2, 3, 4]:
        df['num_doors_%s' % v] = (df.number_of_doors == v).astype('int')
        features.append('num_doors_%s' % v)
    for v in makes:
        df['make_%s' % v] = (df.make == v).astype('int')
        features.append('make_%s' % v)
    df_num = df[features]
    df_num = df_num.fillna(0)
    # extracting the NumPy array
    X = df_num.values
    return X
Now we can use our new prepare_X function and train and validate again.
X_train = prepare_X(df_train)
w0, w = train_linear_regression(X_train, y_train)
X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)
rmse(y_val, y_pred)
# Output: 0.5058837299788781
The model performance has once again improved somewhat. How about adding all the other categorical variables now? This should improve the performance even more, right? Let’s try.
categorical_variables = [
'make', 'engine_fuel_type', 'transmission_type', 'driven_wheels',
'market_category', 'vehicle_size', 'vehicle_style'
]
# The dictionary categories will contain, for each categorical variable,
# its five most common values
categories = {}
for c in categorical_variables:
    categories[c] = list(df[c].value_counts().head().index)
categories
# Output:
# {'make': ['chevrolet', 'ford', 'volkswagen', 'toyota', 'dodge'],
# 'engine_fuel_type': ['regular_unleaded',
# 'premium_unleaded_(required)',
# 'premium_unleaded_(recommended)',
# 'flex-fuel_(unleaded/e85)',
# 'diesel'],
# 'transmission_type': ['automatic',
# 'manual',
# 'automated_manual',
# 'direct_drive',
# 'unknown'],
# 'driven_wheels': ['front_wheel_drive',
# 'rear_wheel_drive',
# 'all_wheel_drive',
# 'four_wheel_drive'],
# 'market_category': ['crossover',
# 'flex_fuel',
# 'luxury',
# 'luxury,performance',
# 'hatchback'],
# 'vehicle_size': ['compact', 'midsize', 'large'],
# 'vehicle_style': ['sedan',
# '4dr_suv',
# 'coupe',
# 'convertible',
# '4dr_hatchback']}
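Just to get a feeling for how many binary columns this will add, we can count the values stored in the dictionary:
# five values per variable, except driven_wheels (4) and vehicle_size (3)
sum(len(values) for values in categories.values())
# Output: 32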
The next snippet shows how to add these new features in our prepare_X function. This time we need two nested loops, as described in the inline comments.
def prepare_X(df):
    # work on a copy, otherwise we would modify the original DataFrame,
    # which is usually not what we want
    df = df.copy()
    features = base.copy()
    df['age'] = 2017 - df.year
    features.append('age')
    for v in [2, 3, 4]:
        df['num_doors_%s' % v] = (df.number_of_doors == v).astype('int')
        features.append('num_doors_%s' % v)
    # The outer loop goes over each key of the categories dictionary.
    # The inner loop goes over the values stored for that key.
    # For each of these values we create a new binary column.
    for c, values in categories.items():
        for v in values:
            df['%s_%s' % (c, v)] = (df[c] == v).astype('int')
            features.append('%s_%s' % (c, v))
    df_num = df[features]
    df_num = df_num.fillna(0)
    # extracting the NumPy array
    X = df_num.values
    return X
Now we can train the model again and apply it to the validation data to see how the model performs.
X_train = prepare_X(df_train)
w0, w = train_linear_regression(X_train, y_train)
X_val = prepare_X(df_val)
y_pred = w0 + X_val.dot(w)
rmse(y_val, y_pred)
# Output: 292.5054633101075
This time the model performance is very bad: as you can see, the RMSE (292.505) is huge, so something went wrong. In the next article we'll see why that happened and how to fix it.
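If you want a first hint already, you can inspect the learned parameters; a minimal sketch, assuming NumPy is imported as np (your exact numbers will differ, but extreme values for the bias and the weights are a typical symptom of this kind of failure):
# look at the bias term and the largest weight (in absolute value)
print(w0)
print(np.abs(w).max())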