Data cleaning and preparation – Data Transformation and Splitting – Part 2/2

Building upon the preparation steps outlined in the previous article, this section covers two critical processes: re-encoding the categorical variables and performing the train/validation/test split, both essential for preparing our data for modeling and evaluation.
Re-encoding the categorical variables
To handle the categorical variables, we need to address a few important considerations. The R file in the repository provides insights into preprocessing this data.
Here are the key points:
- Missing Values: The missing values are encoded as a series of nines (99999999). We’ll need to address how to handle these missing values.
- Categorical Variable Information: The R file also offers information about the categorical variables. For example:
- ‘Status’ is encoded as ‘good’ (1) and ‘bad’ (2).
- ‘Home’ includes categories like ‘rent,’ ‘owner,’ ‘priv,’ ‘ignore,’ ‘parents,’ and ‘other.’
- ‘Marital’ encompasses ‘single,’ ‘married,’ ‘widow,’ ‘separated,’ and ‘divorced.’
- ‘Records’ has ‘yes’ and ‘no.’
- ‘Job’ includes ‘fixed,’ ‘partime,’ ‘freelance,’ and ‘other.’
To proceed, we’ll need to translate these numerical values back into their respective categorical strings. This ensures our data is more interpretable and ready for analysis.
To address the ‘status’ variable, we examine the possible values: 1, 2, and 0. As previously mentioned, 1 corresponds to ‘good,’ and 2 corresponds to ‘bad.’ However, we also have one record with the value 0, which we’ll designate as ‘unknown.’
To map these values accordingly, we can use the ‘map’ method. This method takes a dictionary that maps each original dataframe value to a new value.
df.status.value_counts()
# Output:
# 1 3200
# 2 1254
# 0 1
# Name: status, dtype: int64
status_values = {
1: 'ok',
2: 'default',
0: 'unk'
}
df.status = df.status.map(status_values)
df.head()
|   | status | seniority | home | time | age | marital | records | job | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ok | 9 | 1 | 60 | 30 | 2 | 1 | 3 | 73 | 129 | 0 | 0 | 800 | 846 |
| 1 | ok | 17 | 1 | 60 | 58 | 3 | 1 | 1 | 48 | 131 | 0 | 0 | 1000 | 1658 |
| 2 | default | 10 | 2 | 36 | 46 | 2 | 2 | 3 | 90 | 200 | 3000 | 0 | 2000 | 2985 |
| 3 | ok | 0 | 1 | 60 | 24 | 1 | 1 | 1 | 63 | 182 | 2500 | 0 | 900 | 1325 |
| 4 | ok | 0 | 1 | 36 | 26 | 1 | 1 | 1 | 46 | 107 | 0 | 0 | 310 | 910 |
The same re-encoding applied to the ‘status’ column should also be carried out for the remaining categorical columns – ‘home,’ ‘marital,’ ‘records,’ and ‘job’ – using the mapping dictionaries ‘home_values,’ ‘marital_values,’ ‘records_values,’ and ‘job_values’.
home_values = {
1: 'rent',
2: 'owner',
3: 'private',
4: 'ignore',
5: 'parents',
6: 'other',
0: 'unk'
}
df.home = df.home.map(home_values)
marital_values = {
1: 'single',
2: 'married',
3: 'widow',
4: 'separated',
5: 'divorced',
0: 'unk'
}
df.marital = df.marital.map(marital_values)
records_values = {
1: 'no',
2: 'yes',
0: 'unk'
}
df.records = df.records.map(records_values)
job_values = {
1: 'fixed',
2: 'partime',
3: 'freelance',
4: 'others',
0: 'unk'
}
df.job = df.job.map(job_values)
df.head()
|   | status | seniority | home | time | age | marital | records | job | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ok | 9 | rent | 60 | 30 | married | no | freelance | 73 | 129 | 0 | 0 | 800 | 846 |
| 1 | ok | 17 | rent | 60 | 58 | widow | no | fixed | 48 | 131 | 0 | 0 | 1000 | 1658 |
| 2 | default | 10 | owner | 36 | 46 | married | yes | freelance | 90 | 200 | 3000 | 0 | 2000 | 2985 |
| 3 | ok | 0 | rent | 60 | 24 | single | no | fixed | 63 | 182 | 2500 | 0 | 900 | 1325 |
| 4 | ok | 0 | rent | 36 | 26 | single | no | fixed | 46 | 107 | 0 | 0 | 310 | 910 |
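Since the four map calls follow an identical pattern, they can also be written as a single loop over a dictionary of mappings. Below is a minimal sketch on a toy dataframe (the column names and codes mirror the ones above, but the data is made up). One caveat worth knowing: any code absent from a mapping dictionary becomes NaN after `map`, which is why 0 (‘unk’) must be included in every dictionary.

```python
import pandas as pd

# Toy dataframe with integer-coded categorical columns (made-up rows)
df = pd.DataFrame({
    'home':    [1, 2, 6, 0],
    'marital': [1, 2, 3, 5],
})

# One mapping dictionary per column, mirroring the dicts above
mappings = {
    'home':    {1: 'rent', 2: 'owner', 3: 'private', 4: 'ignore',
                5: 'parents', 6: 'other', 0: 'unk'},
    'marital': {1: 'single', 2: 'married', 3: 'widow',
                4: 'separated', 5: 'divorced', 0: 'unk'},
}

# Apply every mapping in one pass
for col, values in mappings.items():
    df[col] = df[col].map(values)

print(df.home.tolist())     # ['rent', 'owner', 'other', 'unk']
print(df.marital.tolist())  # ['single', 'married', 'widow', 'divorced']
```

Whether the loop or the explicit per-column calls read better is a matter of taste; the explicit version used in this article makes each mapping easy to inspect.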
Missing values
With all categorical variables decoded back to strings, the next step is to address the missing values.
df.describe().round()
|   | seniority | time | age | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|
| count | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 |
| mean | 8.0 | 46.0 | 37.0 | 56.0 | 763317.0 | 1060341.0 | 404382.0 | 1039.0 | 1463.0 |
| std | 8.0 | 15.0 | 11.0 | 20.0 | 8703625.0 | 10217569.0 | 6344253.0 | 475.0 | 628.0 |
| min | 0.0 | 6.0 | 18.0 | 35.0 | 0.0 | 0.0 | 0.0 | 100.0 | 105.0 |
| 25% | 2.0 | 36.0 | 28.0 | 35.0 | 80.0 | 0.0 | 0.0 | 700.0 | 1118.0 |
| 50% | 5.0 | 48.0 | 36.0 | 51.0 | 120.0 | 3500.0 | 0.0 | 1000.0 | 1400.0 |
| 75% | 12.0 | 60.0 | 45.0 | 72.0 | 166.0 | 6000.0 | 0.0 | 1300.0 | 1692.0 |
| max | 48.0 | 72.0 | 68.0 | 180.0 | 99999999.0 | 99999999.0 | 99999999.0 | 5000.0 | 11140.0 |
It’s apparent that the ‘income,’ ‘assets,’ and ‘debt’ columns have a maximum of 99999999.0. This is the missing-value placeholder mentioned earlier, not a real observation, so we need to replace it with a proper NaN. Let’s explore the replacement process.
import numpy as np

df.income.max()
# Output: 99999999

# replace() returns a new Series; df itself is not modified by this line
df.income.replace(to_replace=99999999, value=np.nan).max()
# Output: 959.0
We’ll replace the placeholder values in a loop over all three columns: ‘income,’ ‘assets,’ and ‘debt’.
for c in ['income', 'assets', 'debt']:
    df[c] = df[c].replace(to_replace=99999999, value=np.nan)
df.describe().round()
|   | seniority | time | age | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|
| count | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4421.0 | 4408.0 | 4437.0 | 4455.0 | 4455.0 |
| mean | 8.0 | 46.0 | 37.0 | 56.0 | 131.0 | 5403.0 | 343.0 | 1039.0 | 1463.0 |
| std | 8.0 | 15.0 | 11.0 | 20.0 | 86.0 | 11573.0 | 1246.0 | 475.0 | 628.0 |
| min | 0.0 | 6.0 | 18.0 | 35.0 | 0.0 | 0.0 | 0.0 | 100.0 | 105.0 |
| 25% | 2.0 | 36.0 | 28.0 | 35.0 | 80.0 | 0.0 | 0.0 | 700.0 | 1118.0 |
| 50% | 5.0 | 48.0 | 36.0 | 51.0 | 120.0 | 3000.0 | 0.0 | 1000.0 | 1400.0 |
| 75% | 12.0 | 60.0 | 45.0 | 72.0 | 165.0 | 6000.0 | 0.0 | 1300.0 | 1692.0 |
| max | 48.0 | 72.0 | 68.0 | 180.0 | 959.0 | 300000.0 | 30000.0 | 5000.0 | 11140.0 |
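Comparing the two summaries makes the effect concrete: a handful of 99999999 sentinels is enough to drag a column’s mean up by several orders of magnitude (‘income’ dropped from roughly 763317 to 131), and `count` now reflects the rows that are genuinely observed. A minimal sketch with made-up numbers, not the real dataset:

```python
import numpy as np
import pandas as pd

# Made-up income values: four plausible entries plus one 99999999 sentinel
income = pd.Series([80, 120, 165, 959, 99999999], dtype='float64')

print(income.mean())  # ~20000264.6 -- the single sentinel dominates the mean

cleaned = income.replace(to_replace=99999999, value=np.nan)
print(cleaned.mean())   # 331.0 -- mean over the 4 non-missing values
print(cleaned.count())  # 4 -- NaN entries are excluded from count()
```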
While analyzing the ‘status’ column, we discovered a single record with the value ‘unk,’ representing a missing or unknown status. Since our focus is solely on ‘ok’ and ‘default’ values, we can safely remove this record from the dataframe.
df.status.value_counts()
# Output:
# ok 3200
# default 1254
# unk 1
# Name: status, dtype: int64
df = df[df.status != 'unk'].reset_index(drop=True)
df
|   | status | seniority | home | time | age | marital | records | job | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ok | 9 | rent | 60 | 30 | married | no | freelance | 73 | 129.0 | 0.0 | 0.0 | 800 | 846 |
| 1 | ok | 17 | rent | 60 | 58 | widow | no | fixed | 48 | 131.0 | 0.0 | 0.0 | 1000 | 1658 |
| 2 | default | 10 | owner | 36 | 46 | married | yes | freelance | 90 | 200.0 | 3000.0 | 0.0 | 2000 | 2985 |
| 3 | ok | 0 | rent | 60 | 24 | single | no | fixed | 63 | 182.0 | 2500.0 | 0.0 | 900 | 1325 |
| 4 | ok | 0 | rent | 36 | 26 | single | no | fixed | 46 | 107.0 | 0.0 | 0.0 | 310 | 910 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 4449 | default | 1 | rent | 60 | 39 | married | no | fixed | 69 | 92.0 | 0.0 | 0.0 | 900 | 1020 |
| 4450 | ok | 22 | owner | 60 | 46 | married | no | fixed | 60 | 75.0 | 3000.0 | 600.0 | 950 | 1263 |
| 4451 | default | 0 | owner | 24 | 37 | married | no | partime | 60 | 90.0 | 3500.0 | 0.0 | 500 | 963 |
| 4452 | ok | 0 | rent | 48 | 23 | single | no | freelance | 49 | 140.0 | 0.0 | 0.0 | 550 | 550 |
| 4453 | ok | 5 | owner | 60 | 32 | married | no | freelance | 60 | 140.0 | 4000.0 | 1000.0 | 1350 | 1650 |
Performing the train/validation/test split
The final step in our data preparation is to split the dataset into training, validation, and test sets in a 60%/20%/20% ratio. We achieve this with the ‘train_test_split’ function from scikit-learn, applied twice: first to carve off the test set, then to split the remainder into training and validation sets.
Here’s the code for the split:
from sklearn.model_selection import train_test_split
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=11)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=11)
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
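Note that the second call uses test_size=0.25 because df_full_train holds 80% of the data, and 25% of 80% is the 20% we want for validation, leaving 60% for training. A quick sanity check of the proportions on a synthetic dataframe (not the credit data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic dataframe of 1000 rows, used only to check split proportions
df = pd.DataFrame({'x': range(1000)})

# 80/20 split, then 75/25 of the 80% -> 60/20/20 overall
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=11)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=11)

print(len(df_train), len(df_val), len(df_test))  # 600 200 200
```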
df_train.status
# Output:
# 0 default
# 1 default
# 2 ok
# 3 default
# 4 ok
# ...
# 2667 ok
# 2668 ok
# 2669 ok
# 2670 ok
# 2671 ok
# Name: status, Length: 2672, dtype: object
To train a model that predicts the probability of default, we need to convert the target variable ‘status’ into a binary numerical format: ‘default’ becomes 1 and ‘ok’ becomes 0.
(df_train.status == 'default').astype('int')
# Output:
# 0 1
# 1 1
# 2 0
# 3 1
# 4 0
# ..
# 2667 0
# 2668 0
# 2669 0
# 2670 0
# 2671 0
# Name: status, Length: 2672, dtype: int64
To complete our data preparation, we extract the target arrays for the training, validation, and test sets. Additionally, to prevent accidentally using the target variable during training, we remove the ‘status’ column from ‘df_train,’ ‘df_val,’ and ‘df_test.’
y_train = (df_train.status == 'default').astype('int').values
y_val = (df_val.status == 'default').astype('int').values
y_test = (df_test.status == 'default').astype('int').values
del df_train['status']
del df_val['status']
del df_test['status']
df_train
|   | seniority | home | time | age | marital | records | job | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | owner | 36 | 36 | married | no | freelance | 75 | 0.0 | 10000.0 | 0.0 | 1000 | 1400 |
| 1 | 6 | parents | 48 | 32 | single | yes | fixed | 35 | 85.0 | 0.0 | 0.0 | 1100 | 1330 |
| 2 | 1 | parents | 48 | 40 | married | no | fixed | 75 | 121.0 | 0.0 | 0.0 | 1320 | 1600 |
| 3 | 1 | parents | 48 | 23 | single | no | partime | 35 | 72.0 | 0.0 | 0.0 | 1078 | 1079 |
| 4 | 5 | owner | 36 | 46 | married | no | freelance | 60 | 100.0 | 4000.0 | 0.0 | 1100 | 1897 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 2667 | 18 | private | 36 | 45 | married | no | fixed | 45 | 220.0 | 20000.0 | 0.0 | 800 | 1600 |
| 2668 | 7 | private | 60 | 29 | married | no | fixed | 60 | 51.0 | 3500.0 | 500.0 | 1000 | 1290 |
| 2669 | 1 | parents | 24 | 19 | single | no | fixed | 35 | 28.0 | 0.0 | 0.0 | 400 | 600 |
| 2670 | 15 | owner | 48 | 43 | married | no | freelance | 60 | 100.0 | 18000.0 | 0.0 | 2500 | 2976 |
| 2671 | 12 | owner | 48 | 27 | married | yes | fixed | 45 | 110.0 | 5000.0 | 1300.0 | 450 | 1636 |
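The target-extraction steps above follow the same pattern for all three sets, so they can be bundled into a small helper. This is a sketch rather than part of the original code, and the function name extract_target is my own:

```python
import pandas as pd

def extract_target(df, target_col='status', positive_value='default'):
    """Return (features, y): a binary target array and the dataframe without it."""
    y = (df[target_col] == positive_value).astype('int').values
    features = df.drop(columns=[target_col])
    return features, y

# Usage on a toy dataframe (made-up rows)
toy = pd.DataFrame({'status': ['ok', 'default', 'ok'],
                    'amount': [800, 2000, 310]})
X, y = extract_target(toy)
print(list(y))          # [0, 1, 0]
print(list(X.columns))  # ['amount'] -- 'status' is gone
```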