ML Zoomcamp 2023 – Machine Learning for Classification – Part 8

One-hot encoding

One-hot encoding is a technique used in machine learning to convert categorical (non-numeric) data into a numeric format that can be used by machine learning algorithms. It’s particularly useful when working with algorithms that require numerical input, such as many classification and regression models. Scikit-Learn, a popular machine learning library in Python, provides convenient tools for performing one-hot encoding.

  • Problem: Categorical data, such as “color” with categories like “red,” “green,” and “blue,” cannot be directly used as input for most machine learning algorithms because they require numerical data. One-hot encoding solves this problem by converting categorical data into binary vectors.
  • How It Works: For each categorical feature, one-hot encoding creates a new binary (0 or 1) feature for each category within that feature. Each binary feature represents the presence or absence of a specific category. For example, in the “color” example, you’d create three binary features: “IsRed,” “IsGreen,” and “IsBlue.” If an observation belongs to the “red” category, the “IsRed” feature is set to 1, while “IsGreen” and “IsBlue” are set to 0.
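The mapping described above can be sketched in a few lines of plain Python (the list of observations and the category order here are made up for illustration):

```python
# One-hot encoding the "color" example by hand: each category
# becomes its own 0/1 column ("IsRed", "IsGreen", "IsBlue").
colors = ["red", "green", "blue", "red"]
categories = ["red", "green", "blue"]

one_hot = [[1 if color == cat else 0 for cat in categories] for color in colors]
print(one_hot)
# [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

In practice you would not do this by hand — that is exactly what Scikit-Learn's encoders automate.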

Use Scikit-Learn to encode categorical features

The Pandas to_dict method with the orient parameter set to 'records' transforms the dataframe into a list of dictionaries, one dictionary per row.

# turns each column into a dictionary --> but the result is not what we want here
# df_train[['gender', 'contract']].iloc[:100].to_dict()

dicts = df_train[['gender', 'contract']].iloc[:100].to_dict(orient='records')
dicts

# Output:
# [{'gender': 'female', 'contract': 'two_year'},
# {'gender': 'male', 'contract': 'month-to-month'},
# {'gender': 'female', 'contract': 'month-to-month'},
# {'gender': 'female', 'contract': 'month-to-month'},
# {'gender': 'female', 'contract': 'two_year'},
# {'gender': 'male', 'contract': 'month-to-month'},
# {'gender': 'male', 'contract': 'month-to-month'},
# {'gender': 'female', 'contract': 'month-to-month'},
# {'gender': 'female', 'contract': 'two_year'},
# {'gender': 'female', 'contract': 'month-to-month'},
# {'gender': 'female', 'contract': 'two_year'},
# {'gender': 'male', 'contract': 'month-to-month'},
# {'gender': 'female', 'contract': 'two_year'},
# {'gender': 'female', 'contract': 'month-to-month'},
# {'gender': 'female', 'contract': 'month-to-month'},
# {'gender': 'male', 'contract': 'month-to-month'},
# {'gender': 'female', 'contract': 'two_year'},
# {'gender': 'female', 'contract': 'month-to-month'},
# {'gender': 'male', 'contract': 'one_year'},
# {'gender': 'male', 'contract': 'two_year'},
# {'gender': 'male', 'contract': 'month-to-month'},
# {'gender': 'female', 'contract': 'one_year'},
# {'gender': 'female', 'contract': 'month-to-month'},
# {'gender': 'female', 'contract': 'two_year'},
# {'gender': 'male', 'contract': 'month-to-month'},
...
# {'gender': 'male', 'contract': 'one_year'},
# {'gender': 'female', 'contract': 'month-to-month'},
# {'gender': 'male', 'contract': 'month-to-month'},
# {'gender': 'male', 'contract': 'one_year'},
# {'gender': 'male', 'contract': 'month-to-month'}]

Using DictVectorizer

First, we create a new instance of this class. Then we fit our DictVectorizer instance: we present the data to it so that it can infer the feature names and their corresponding values, and from this information it builds the one-hot feature matrix. It's worth noting that DictVectorizer detects numerical variables and passes them through unchanged, since they don't require one-hot encoding.

from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer()

dv.fit(dicts)

Then we need to transform the dictionaries.

dv.transform(dicts)
# Output: 
# <100x5 sparse matrix of type '<class 'numpy.float64'>'
#	with 200 stored elements in Compressed Sparse Row format>

By default, the transform method returns a sparse matrix, as shown in the last snippet. A sparse matrix is a memory-efficient way of encoding data when there are many zeros in the dataset. But we won't use a sparse matrix here.

In the next snippet, with sparse=False, the DictVectorizer returns a regular NumPy array, where the first three columns represent the "contract" variable and the last two columns represent the "gender" variable.

dv = DictVectorizer(sparse=False)
dv.fit(dicts)

dv.get_feature_names_out()
# Output:
# array(['contract=month-to-month', 'contract=one_year',
#             'contract=two_year', 'gender=female', 'gender=male'], dtype=object)

dv.transform(dicts)
# Output:
# array([[0., 0., 1., 1., 0.],
#       [1., 0., 0., 0., 1.],
#       [1., 0., 0., 1., 0.],
#       [1., 0., 0., 1., 0.],
#       [0., 0., 1., 1., 0.],
#       [1., 0., 0., 0., 1.],
#       [1., 0., 0., 0., 1.],
#       [1., 0., 0., 1., 0.],
#       [0., 0., 1., 1., 0.],
#       [1., 0., 0., 1., 0.],
#       [0., 0., 1., 1., 0.],
#       [1., 0., 0., 0., 1.],
#       [0., 0., 1., 1., 0.],
#       [1., 0., 0., 1., 0.],
#       [1., 0., 0., 1., 0.],
#       [1., 0., 0., 0., 1.],
#       [0., 0., 1., 1., 0.],
#       [1., 0., 0., 1., 0.],
#       [0., 1., 0., 0., 1.],
#       [0., 0., 1., 0., 1.],
#       [1., 0., 0., 0., 1.],
#       [0., 1., 0., 1., 0.],
#       [1., 0., 0., 1., 0.],
#       [0., 0., 1., 1., 0.],
#       [1., 0., 0., 0., 1.],
#       ....
#       [0., 1., 0., 0., 1.],
#       [1., 0., 0., 1., 0.],
#       [1., 0., 0., 0., 1.],
#       [0., 1., 0., 0., 1.],
#       [1., 0., 0., 0., 1.]])

Let’s bring it all together

train_dicts = df_train[categorical + numerical].to_dict(orient='records')
train_dicts[0]
# Output:
# {'gender': 'female',
#  'seniorcitizen': 0,
#  'partner': 'yes',
#  'dependents': 'yes',
#  'phoneservice': 'yes',
#  'multiplelines': 'yes',
#  'internetservice': 'fiber_optic',
#  'onlinesecurity': 'yes',
#  'onlinebackup': 'yes',
#  'deviceprotection': 'yes',
#  'techsupport': 'yes',
#  'streamingtv': 'yes',
#  'streamingmovies': 'yes',
#  'contract': 'two_year',
#  'paperlessbilling': 'yes',
#  'paymentmethod': 'electronic_check',
#  'tenure': 72,
#  'monthlycharges': 115.5,
#  'totalcharges': 8425.15}

dv = DictVectorizer(sparse=False)
dv.fit(train_dicts)

dv.get_feature_names_out()
# Output:
# array(['contract=month-to-month', 'contract=one_year',
#       'contract=two_year', 'dependents=no', 'dependents=yes',
#       'deviceprotection=no', 'deviceprotection=no_internet_service',
#       'deviceprotection=yes', 'gender=female', 'gender=male',
#       'internetservice=dsl', 'internetservice=fiber_optic',
#       'internetservice=no', 'monthlycharges', 'multiplelines=no',
#       'multiplelines=no_phone_service', 'multiplelines=yes',
#       'onlinebackup=no', 'onlinebackup=no_internet_service',
#       'onlinebackup=yes', 'onlinesecurity=no',
#       'onlinesecurity=no_internet_service', 'onlinesecurity=yes',
#       'paperlessbilling=no', 'paperlessbilling=yes', 'partner=no',
#       'partner=yes', 'paymentmethod=bank_transfer_(automatic)',
#       'paymentmethod=credit_card_(automatic)',
#       'paymentmethod=electronic_check', 'paymentmethod=mailed_check',
#       'phoneservice=no', 'phoneservice=yes', 'seniorcitizen',
#       'streamingmovies=no', 'streamingmovies=no_internet_service',
#       'streamingmovies=yes', 'streamingtv=no',
#       'streamingtv=no_internet_service', 'streamingtv=yes',
#       'techsupport=no', 'techsupport=no_internet_service',
#       'techsupport=yes', 'tenure', 'totalcharges'], dtype=object)

Short version without long outputs:

from sklearn.feature_extraction import DictVectorizer

train_dicts = df_train[categorical + numerical].to_dict(orient='records')
dv = DictVectorizer(sparse=False)

dv.fit(train_dicts)
X_train = dv.transform(train_dicts)
# instead of last two lines, you can also use
# X_train = dv.fit_transform(train_dicts)

X_train.shape
# Output: (4225, 45)

When dealing with validation data, we reuse the same DictVectorizer instance that we created before. Instead of calling the fit function followed by the transform function, we apply only the transform function to the validation data. This guarantees that the validation data is encoded with exactly the same columns as the training data.

val_dicts = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dicts)
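One consequence of reusing the fitted DictVectorizer: a category value that never appeared in the training data does not get a new column at validation time. In this minimal sketch (toy data), the unseen value simply maps to a row of zeros:

```python
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer(sparse=False)
dv.fit([{"contract": "two_year"}, {"contract": "one_year"}])

# "month-to-month" was not seen during fit, so transform ignores it
# rather than adding a column -- the matrix width stays fixed.
print(dv.transform([{"contract": "month-to-month"}]))
# [[0. 0.]]
```

This is exactly the consistency we want: the model always sees feature matrices of the same shape and column order as during training.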
