Data cleaning and preparation – Data Transformation and Splitting – Part 2/2

Building upon the preparation steps outlined in the previous article, this section covers two critical processes: re-encoding the categorical variables and performing the train/validation/test split, both essential for preparing our data for modeling and evaluation.
Re-encoding the categorical variables
To handle the categorical variables, we need to address a few important considerations. The R file in the repository provides insights into preprocessing this data.
Here are the key points:
- Missing Values: The missing values are encoded as a series of nines (99999999). We’ll need to address how to handle these missing values.
- Categorical Variable Information: The R file also offers information about the categorical variables. For example:
- ‘Status’ is encoded as ‘good’ (1) and ‘bad’ (2).
- ‘Home’ includes categories like ‘rent,’ ‘owner,’ ‘priv,’ ‘ignore,’ ‘parents,’ and ‘other.’
- ‘Marital’ encompasses ‘single,’ ‘married,’ ‘widow,’ ‘separated,’ and ‘divorced.’
- ‘Records’ has ‘yes’ and ‘no.’
- ‘Job’ includes ‘fixed,’ ‘partime,’ ‘freelance,’ and ‘other.’
To proceed, we’ll need to translate these numerical values back into their respective categorical strings. This ensures our data is more interpretable and ready for analysis.
To address the ‘status’ variable, we examine the possible values: 1, 2, and 0. As previously mentioned, 1 corresponds to ‘good,’ and 2 corresponds to ‘bad.’ However, we also have one record with the value 0, which we’ll designate as ‘unknown.’
To map these values accordingly, we can use the ‘map’ method. This method takes a dictionary that maps each original dataframe value to a new value.
df.status.value_counts()
# Output:
# 1 3200
# 2 1254
# 0 1
# Name: status, dtype: int64
status_values = {
1: 'ok',
2: 'default',
0: 'unk'
}
df.status = df.status.map(status_values)
df.head()
|   | status | seniority | home | time | age | marital | records | job | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ok | 9 | 1 | 60 | 30 | 2 | 1 | 3 | 73 | 129 | 0 | 0 | 800 | 846 |
| 1 | ok | 17 | 1 | 60 | 58 | 3 | 1 | 1 | 48 | 131 | 0 | 0 | 1000 | 1658 |
| 2 | default | 10 | 2 | 36 | 46 | 2 | 2 | 3 | 90 | 200 | 3000 | 0 | 2000 | 2985 |
| 3 | ok | 0 | 1 | 60 | 24 | 1 | 1 | 1 | 63 | 182 | 2500 | 0 | 900 | 1325 |
| 4 | ok | 0 | 1 | 36 | 26 | 1 | 1 | 1 | 46 | 107 | 0 | 0 | 310 | 910 |
The same re-encoding applied to the ‘status’ column should also be carried out for the remaining categorical columns – ‘home,’ ‘marital,’ ‘records,’ and ‘job’ – using the mapping dictionaries ‘home_values,’ ‘marital_values,’ ‘records_values,’ and ‘job_values’.
home_values = {
1: 'rent',
2: 'owner',
3: 'private',
4: 'ignore',
5: 'parents',
6: 'other',
0: 'unk'
}
df.home = df.home.map(home_values)
marital_values = {
1: 'single',
2: 'married',
3: 'widow',
4: 'separated',
5: 'divorced',
0: 'unk'
}
df.marital = df.marital.map(marital_values)
records_values = {
1: 'no',
2: 'yes',
0: 'unk'
}
df.records = df.records.map(records_values)
job_values = {
1: 'fixed',
2: 'partime',
3: 'freelance',
4: 'others',
0: 'unk'
}
df.job = df.job.map(job_values)
df.head()
|   | status | seniority | home | time | age | marital | records | job | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ok | 9 | rent | 60 | 30 | married | no | freelance | 73 | 129 | 0 | 0 | 800 | 846 |
| 1 | ok | 17 | rent | 60 | 58 | widow | no | fixed | 48 | 131 | 0 | 0 | 1000 | 1658 |
| 2 | default | 10 | owner | 36 | 46 | married | yes | freelance | 90 | 200 | 3000 | 0 | 2000 | 2985 |
| 3 | ok | 0 | rent | 60 | 24 | single | no | fixed | 63 | 182 | 2500 | 0 | 900 | 1325 |
| 4 | ok | 0 | rent | 36 | 26 | single | no | fixed | 46 | 107 | 0 | 0 | 310 | 910 |
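Since the four map calls follow an identical pattern, they can also be written as a single loop over a dictionary of mappings. Below is a minimal sketch on a toy dataframe (the column names and codes mirror the ones above, but the data is made up). One caveat worth knowing: any code absent from a mapping dictionary becomes NaN after `map`, which is why 0 (‘unk’) must be included in every dictionary.

```python
import pandas as pd

# Toy dataframe with integer-coded categorical columns (made-up rows)
df = pd.DataFrame({
    'home':    [1, 2, 6, 0],
    'marital': [1, 2, 3, 5],
})

# One mapping dictionary per column, mirroring the dicts above
mappings = {
    'home':    {1: 'rent', 2: 'owner', 3: 'private', 4: 'ignore',
                5: 'parents', 6: 'other', 0: 'unk'},
    'marital': {1: 'single', 2: 'married', 3: 'widow',
                4: 'separated', 5: 'divorced', 0: 'unk'},
}

# Apply every mapping in one pass
for col, values in mappings.items():
    df[col] = df[col].map(values)

print(df.home.tolist())     # ['rent', 'owner', 'other', 'unk']
print(df.marital.tolist())  # ['single', 'married', 'widow', 'divorced']
```

Whether the loop or the explicit per-column calls read better is a matter of taste; the explicit version used in this article makes each mapping easy to inspect.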
Missing values
With all categorical variables decoded back to strings, the next step is to address the missing values.
df.describe().round()
|   | seniority | time | age | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|
| count | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 |
| mean | 8.0 | 46.0 | 37.0 | 56.0 | 763317.0 | 1060341.0 | 404382.0 | 1039.0 | 1463.0 |
| std | 8.0 | 15.0 | 11.0 | 20.0 | 8703625.0 | 10217569.0 | 6344253.0 | 475.0 | 628.0 |
| min | 0.0 | 6.0 | 18.0 | 35.0 | 0.0 | 0.0 | 0.0 | 100.0 | 105.0 |
| 25% | 2.0 | 36.0 | 28.0 | 35.0 | 80.0 | 0.0 | 0.0 | 700.0 | 1118.0 |
| 50% | 5.0 | 48.0 | 36.0 | 51.0 | 120.0 | 3500.0 | 0.0 | 1000.0 | 1400.0 |
| 75% | 12.0 | 60.0 | 45.0 | 72.0 | 166.0 | 6000.0 | 0.0 | 1300.0 | 1692.0 |
| max | 48.0 | 72.0 | 68.0 | 180.0 | 99999999.0 | 99999999.0 | 99999999.0 | 5000.0 | 11140.0 |
It’s apparent that the ‘income,’ ‘assets,’ and ‘debt’ columns have a maximum of 99999999.0. This is the missing-value placeholder mentioned earlier, not a real observation, so we need to replace it with a proper NaN. Let’s explore the replacement process.
import numpy as np

df.income.max()
# Output: 99999999

# replace() returns a new Series; df itself is not modified by this line
df.income.replace(to_replace=99999999, value=np.nan).max()
# Output: 959.0
We’ll replace the placeholder values in a loop over all three columns: ‘income,’ ‘assets,’ and ‘debt’.
for c in ['income', 'assets', 'debt']:
    df[c] = df[c].replace(to_replace=99999999, value=np.nan)
df.describe().round()
|   | seniority | time | age | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|
| count | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4421.0 | 4408.0 | 4437.0 | 4455.0 | 4455.0 |
| mean | 8.0 | 46.0 | 37.0 | 56.0 | 131.0 | 5403.0 | 343.0 | 1039.0 | 1463.0 |
| std | 8.0 | 15.0 | 11.0 | 20.0 | 86.0 | 11573.0 | 1246.0 | 475.0 | 628.0 |
| min | 0.0 | 6.0 | 18.0 | 35.0 | 0.0 | 0.0 | 0.0 | 100.0 | 105.0 |
| 25% | 2.0 | 36.0 | 28.0 | 35.0 | 80.0 | 0.0 | 0.0 | 700.0 | 1118.0 |
| 50% | 5.0 | 48.0 | 36.0 | 51.0 | 120.0 | 3000.0 | 0.0 | 1000.0 | 1400.0 |
| 75% | 12.0 | 60.0 | 45.0 | 72.0 | 165.0 | 6000.0 | 0.0 | 1300.0 | 1692.0 |
| max | 48.0 | 72.0 | 68.0 | 180.0 | 959.0 | 300000.0 | 30000.0 | 5000.0 | 11140.0 |
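Comparing the two summaries makes the effect concrete: a handful of 99999999 sentinels is enough to drag a column’s mean up by several orders of magnitude (‘income’ dropped from roughly 763317 to 131), and `count` now reflects the rows that are genuinely observed. A minimal sketch with made-up numbers, not the real dataset:

```python
import numpy as np
import pandas as pd

# Made-up income values: four plausible entries plus one 99999999 sentinel
income = pd.Series([80, 120, 165, 959, 99999999], dtype='float64')

print(income.mean())  # ~20000264.6 -- the single sentinel dominates the mean

cleaned = income.replace(to_replace=99999999, value=np.nan)
print(cleaned.mean())   # 331.0 -- mean over the 4 non-missing values
print(cleaned.count())  # 4 -- NaN entries are excluded from count()
```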
While analyzing the ‘status’ column, we discovered a single record with the value ‘unk,’ representing a missing or unknown status. Since our focus is solely on ‘ok’ and ‘default’ values, we can safely remove this record from the dataframe.
df.status.value_counts()
# Output:
# ok 3200
# default 1254
# unk 1
# Name: status, dtype: int64
df = df[df.status != 'unk'].reset_index(drop=True)
df
|   | status | seniority | home | time | age | marital | records | job | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ok | 9 | rent | 60 | 30 | married | no | freelance | 73 | 129.0 | 0.0 | 0.0 | 800 | 846 |
| 1 | ok | 17 | rent | 60 | 58 | widow | no | fixed | 48 | 131.0 | 0.0 | 0.0 | 1000 | 1658 |
| 2 | default | 10 | owner | 36 | 46 | married | yes | freelance | 90 | 200.0 | 3000.0 | 0.0 | 2000 | 2985 |
| 3 | ok | 0 | rent | 60 | 24 | single | no | fixed | 63 | 182.0 | 2500.0 | 0.0 | 900 | 1325 |
| 4 | ok | 0 | rent | 36 | 26 | single | no | fixed | 46 | 107.0 | 0.0 | 0.0 | 310 | 910 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 4449 | default | 1 | rent | 60 | 39 | married | no | fixed | 69 | 92.0 | 0.0 | 0.0 | 900 | 1020 |
| 4450 | ok | 22 | owner | 60 | 46 | married | no | fixed | 60 | 75.0 | 3000.0 | 600.0 | 950 | 1263 |
| 4451 | default | 0 | owner | 24 | 37 | married | no | partime | 60 | 90.0 | 3500.0 | 0.0 | 500 | 963 |
| 4452 | ok | 0 | rent | 48 | 23 | single | no | freelance | 49 | 140.0 | 0.0 | 0.0 | 550 | 550 |
| 4453 | ok | 5 | owner | 60 | 32 | married | no | freelance | 60 | 140.0 | 4000.0 | 1000.0 | 1350 | 1650 |
Performing the train/validation/test split
The final step in our data preparation is to split the dataset into training, validation, and test sets in a 60%/20%/20% ratio. We achieve this with the ‘train_test_split’ function from scikit-learn, applied twice: first to carve off the test set, then to split the remainder into training and validation sets.
Here’s the code for the split:
from sklearn.model_selection import train_test_split
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=11)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=11)
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
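Note that the second call uses test_size=0.25 because df_full_train holds 80% of the data, and 25% of 80% is the 20% we want for validation, leaving 60% for training. A quick sanity check of the proportions on a synthetic dataframe (not the credit data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic dataframe of 1000 rows, used only to check split proportions
df = pd.DataFrame({'x': range(1000)})

# 80/20 split, then 75/25 of the 80% -> 60/20/20 overall
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=11)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=11)

print(len(df_train), len(df_val), len(df_test))  # 600 200 200
```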
df_train.status
# Output:
# 0 default
# 1 default
# 2 ok
# 3 default
# 4 ok
# ...
# 2667 ok
# 2668 ok
# 2669 ok
# 2670 ok
# 2671 ok
# Name: status, Length: 2672, dtype: object
To train a model that predicts the probability of default, we need to convert the target variable ‘status’ into a binary numerical format: ‘default’ becomes 1 and ‘ok’ becomes 0.
(df_train.status == 'default').astype('int')
# Output:
# 0 1
# 1 1
# 2 0
# 3 1
# 4 0
# ..
# 2667 0
# 2668 0
# 2669 0
# 2670 0
# 2671 0
# Name: status, Length: 2672, dtype: int64
To complete our data preparation, we extract the target arrays for the training, validation, and test sets. Additionally, to prevent accidentally using the target variable during training, we remove the ‘status’ column from ‘df_train,’ ‘df_val,’ and ‘df_test.’
y_train = (df_train.status == 'default').astype('int').values
y_val = (df_val.status == 'default').astype('int').values
y_test = (df_test.status == 'default').astype('int').values
del df_train['status']
del df_val['status']
del df_test['status']
df_train
|   | seniority | home | time | age | marital | records | job | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | owner | 36 | 36 | married | no | freelance | 75 | 0.0 | 10000.0 | 0.0 | 1000 | 1400 |
| 1 | 6 | parents | 48 | 32 | single | yes | fixed | 35 | 85.0 | 0.0 | 0.0 | 1100 | 1330 |
| 2 | 1 | parents | 48 | 40 | married | no | fixed | 75 | 121.0 | 0.0 | 0.0 | 1320 | 1600 |
| 3 | 1 | parents | 48 | 23 | single | no | partime | 35 | 72.0 | 0.0 | 0.0 | 1078 | 1079 |
| 4 | 5 | owner | 36 | 46 | married | no | freelance | 60 | 100.0 | 4000.0 | 0.0 | 1100 | 1897 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 2667 | 18 | private | 36 | 45 | married | no | fixed | 45 | 220.0 | 20000.0 | 0.0 | 800 | 1600 |
| 2668 | 7 | private | 60 | 29 | married | no | fixed | 60 | 51.0 | 3500.0 | 500.0 | 1000 | 1290 |
| 2669 | 1 | parents | 24 | 19 | single | no | fixed | 35 | 28.0 | 0.0 | 0.0 | 400 | 600 |
| 2670 | 15 | owner | 48 | 43 | married | no | freelance | 60 | 100.0 | 18000.0 | 0.0 | 2500 | 2976 |
| 2671 | 12 | owner | 48 | 27 | married | yes | fixed | 45 | 110.0 | 5000.0 | 1300.0 | 450 | 1636 |
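The target-extraction steps above follow the same pattern for all three sets, so they can be bundled into a small helper. This is a sketch rather than part of the original code, and the function name extract_target is my own:

```python
import pandas as pd

def extract_target(df, target_col='status', positive_value='default'):
    """Return (features, y): a binary target array and the dataframe without it."""
    y = (df[target_col] == positive_value).astype('int').values
    features = df.drop(columns=[target_col])
    return features, y

# Usage on a toy dataframe (made-up rows)
toy = pd.DataFrame({'status': ['ok', 'default', 'ok'],
                    'amount': [800, 2000, 310]})
X, y = extract_target(toy)
print(list(y))          # [0, 1, 0]
print(list(X.columns))  # ['amount'] -- 'status' is gone
```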