ML Zoomcamp 2023 – Decision Trees and Ensemble Learning– Part 3

  1. Data cleaning and preparation – Data Transformation and Splitting – Part 2/2
    1. Re-encoding the categorical variables
      1. Missing values
    2. Performing the train/validation/test split

Building upon the necessary preparation steps outlined in the previous article, this section is dedicated to two critical processes: re-encoding categorical variables and performing the train/validation/test split, a crucial step in preparing our data for modeling and evaluation.

Data cleaning and preparation – Data Transformation and Splitting – Part 2/2

Re-encoding the categorical variables

To handle the categorical variables, we need to address a few important considerations. The R file in the repository provides insights into preprocessing this data.

Here are the key points:

  1. Missing Values: The missing values are encoded as a series of nines (99999999). We’ll need to address how to handle these missing values.
  2. Categorical Variable Information: The R file also offers information about the categorical variables. For example:
    • ‘Status’ is encoded as ‘good’ (1) and ‘bad’ (2).
    • ‘Home’ includes categories like ‘rent,’ ‘owner,’ ‘priv,’ ‘ignore,’ ‘parents,’ and ‘other.’
    • ‘Marital’ encompasses ‘single,’ ‘married,’ ‘widow,’ ‘separated,’ and ‘divorced.’
    • ‘Records’ has ‘yes’ and ‘no.’
    • ‘Job’ includes ‘fixed,’ ‘partime,’ ‘freelance,’ and ‘other.’

To proceed, we’ll need to translate these numerical values back into their respective categorical strings. This ensures our data is more interpretable and ready for analysis.

To address the ‘status’ variable, we examine the possible values: 1, 2, and 0. As previously mentioned, 1 corresponds to ‘good,’ and 2 corresponds to ‘bad.’ However, we also have one record with the value 0, which we’ll designate as ‘unknown.’

To map these values accordingly, we can use the ‘map’ method. This method takes a dictionary that maps each original dataframe value to a new value.

df.status.value_counts()
# Output:
# 1    3200
# 2    1254
# 0       1
# Name: status, dtype: int64

status_values = {
    1: 'ok',
    2: 'default',
    0: 'unk'
}
df.status = df.status.map(status_values)
df.head()
statusseniorityhometimeagemaritalrecordsjobexpensesincomeassetsdebtamountprice
0ok9160302137312900800846
1ok1716058311481310010001658
2default1023646223902003000020002985
3ok01602411163182250009001325
4ok0136261114610700310910
Dataframe with mapped values for status column

The same re-encoding process applied to the ‘status’ column should also be carried out for the remaining categorical columns: ‘home_values,’ ‘marital_values,’ ‘records_values,’ and ‘job_values’.

home_values = {
    1: 'rent',
    2: 'owner',
    3: 'private',
    4: 'ignore',
    5: 'parents',
    6: 'other',
    0: 'unk'
}
df.home = df.home.map(home_values)

marital_values = {
    1: 'single', 
    2: 'married', 
    3: 'widow', 
    4: 'separated',
    5: 'divorced',
    0: 'unk'
}
df.marital = df.marital.map(marital_values)

records_values = {
    1: 'no',
    2: 'yes',
    0: 'unk'
}
df.records = df.records.map(records_values)

job_values = {
    1: 'fixed', 
    2: 'partime', 
    3: 'freelance', 
    4: 'others',
    0: 'unk'
}
df.job = df.job.map(job_values)

df.head()
statusseniorityhometimeagemaritalrecordsjobexpensesincomeAssetsdebtamountprice
0ok9rent6030marriednofreelance7312900800846
1ok17rent6058widownofixed481310010001658
2default10owner3646marriedyesfreelance902003000020002985
3ok0rent6024singlenofixed63182250009001325
4ok0rent3626singlenofixed4610700310910
Dataframe with mapped values for all columns

Missing values

With all categorical variables decoded back to strings, the next step is to address the missing values.

df.describe().round()
senioritytimeageexpensesincomeassetsdebtamountprice
count4455.04455.04455.04455.04455.04455.04455.04455.04455.0
mean8.046.037.056.0763317.01060341.0404382.01039.01463.0
std8.015.011.020.08703625.010217569.06344253.0475.0628.0
min0.06.018.035.00.00.00.0100.0105.0
25%2.036.028.035.080.00.00.0700.01118.0
50%5.048.036.051.0120.03500.00.01000.01400.0
75%12.060.045.072.0166.06000.00.01300.01692.0
max48.072.068.0180.099999999.099999999.099999999.05000.011140.0
3 columns with missing values

It’s apparent that ‘income,’ ‘assets,’ and ‘debt’ columns contain extremely large values (e.g., 99999999.0) as maximum values. To address this issue, we need to replace these outlier values. Let’s explore the replacement process.

df.income.max()
# Output: 99999999

df.income.replace(to_replace=99999999, value=np.nan)

df.income.replace(to_replace=99999999, value=np.nan).max()
# Output: 959.0

We’ll address the outlier values in a loop for all three columns: ‘income,’ ‘assets,’ and ‘debt’.

for c in ['income', 'assets', 'debt']:
    df[c] = df[c].replace(to_replace=99999999, value=np.nan)

df.describe().round()
senioritytimeageexpensesincomeassetsdebtamountprice
count4455.04455.04455.04455.04421.04408.04437.04455.04455.0
mean8.046.037.056.0131.05403.0343.01039.01463.0
std8.015.011.020.086.011573.01246.0475.0628.0
min0.06.018.035.00.00.00.0100.0105.0
25%2.036.028.035.080.00.00.0700.01118.0
50%5.048.036.051.0120.03000.00.01000.01400.0
75%12.060.045.072.0165.06000.00.01300.01692.0
max48.072.068.0180.0959.0300000.030000.05000.011140.0
Columns ‘income,’ ‘assets,’ and ‘debt’ without 99999999 values

While analyzing the ‘status’ column, we discovered a single record with the value ‘unk,’ representing a missing or unknown status. Since our focus is solely on ‘ok’ and ‘default’ values, we can safely remove this record from the dataframe.

df.status.value_counts()

# Output:
# ok         3200
# default    1254
# unk           1
# Name: status, dtype: int64

df = df[df.status != 'unk'].reset_index(drop=True)
df
statusseniorityhometimeagemaritalrecordsjobexpensesincomeassetsdebtamountprice
0ok9rent6030marriednofreelance73129.00.00.0800846
1ok17rent6058widownofixed48131.00.00.010001658
2default10owner3646marriedyesfreelance90200.03000.00.020002985
3ok0rent6024singlenofixed63182.02500.00.09001325
4ok0rent3626singlenofixed46107.00.00.0310910
4449default1rent6039marriednofixed6992.00.00.09001020
4450ok22owner6046marriednofixed6075.03000.0600.09501263
4451default0owner2437marriednopartime6090.03500.00.0500963
4452ok0rent4823singlenofreelance49140.00.00.0550550
4453ok5owner6032marriednofreelance60140.04000.01000.013501650
4454 rows × 14 columns

Performing the train/validation/test split

The final step in our data preparation is to split the dataset into training, validation, and test sets. We achieve this using the ‘train_test_split’ function from scikit-learn.

Here’s the code for the split:

from sklearn.model_selection import train_test_split

df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=11)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=11)

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

df_train.status
# Output: 
# 0       default
# 1       default
# 2            ok
# 3       default
# 4            ok
#         ...   
# 2667         ok
# 2668         ok
# 2669         ok
# 2670         ok
# 2671         ok
# Name: status, Length: 2672, dtype: object

To predict a probability, we need to convert our target variable ‘status’ into a numerical format.

(df_train.status == 'default').astype('int')

# Output:
# 0       1
# 1       1
# 2       0
# 3       1
# 4       0
#        ..
# 2667    0
# 2668    0
# 2669    0
# 2670    0
# 2671    0
# Name: status, Length: 2672, dtype: int64

To complete our data preparation, we need to assign target variables for the training, validation, and test sets. Additionally, to prevent accidental use of the target variable during training, we should remove it from ‘df_train,’ ‘df_val,’ and ‘df_test.’

y_train = (df_train.status == 'default').astype('int').values
y_val = (df_val.status == 'default').astype('int').values
y_test = (df_test.status == 'default').astype('int').values

del df_train['status']
del df_val['status']
del df_test['status']

df_train
seniorityhometimeagemaritalrecordsjobexpensesincomeassetsdebtamountprice
010owner3636marriednofreelance750.010000.00.010001400
16parents4832singleyesfixed3585.00.00.011001330
21parents4840marriednofixed75121.00.00.013201600
31parents4823singlenopartime3572.00.00.010781079
45owner3646marriednofreelance60100.04000.00.011001897
266718private3645marriednofixed45220.020000.00.08001600
26687private6029marriednofixed6051.03500.0500.010001290
26691parents2419singlenofixed3528.00.00.0400600
267015owner4843marriednofreelance60100.018000.00.025002976
267112owner4827marriedyesfixed45110.05000.01300.04501636
2672 rows × 13 columns

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.