In part 1 of this chapter, “Decision Trees and Ensemble Learning,” we introduced the project, which is a binary classification problem aimed at predicting the probability of a client defaulting on a loan. Part 2 of this chapter is divided into two main sections.
Preparation Steps
In the first part, we cover the necessary preparation steps: importing essential libraries, downloading the dataset, previewing the CSV file, and performing an initial column format adaptation to ensure uniformity in our data.
Data Transformation and Splitting
The second part is dedicated to re-encoding categorical variables and performing the train/validation/test split, a crucial step in preparing our data for modeling and evaluation.
Preparation Steps – Part 1/2
Imports for this project
For this project, we’ll need to import several essential libraries that we’re already familiar with. These libraries provide the foundation for our data analysis and machine learning tasks. The necessary libraries include:
- NumPy: NumPy is a fundamental library for numerical and array operations in Python.
- Pandas: Pandas is used for data manipulation and analysis, allowing us to work with structured data efficiently.
- Scikit-Learn: Scikit-Learn is a powerful machine learning library that provides a wide range of tools and algorithms for our classification task. We’ll import this library at a later point.
- Matplotlib: Matplotlib is essential for data visualization, enabling us to create informative plots and charts.
- Seaborn: Seaborn complements Matplotlib and simplifies the creation of aesthetically pleasing statistical visualizations.
By ensuring that we have these libraries at our disposal, we’ll be well-equipped to tackle the various tasks involved in our credit risk scoring project.
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
Downloading the dataset
To begin our analysis, we first need to obtain the dataset, which we download with the ‘wget’ command. In a Jupyter notebook, a line that starts with ‘!’ is run as a console command, and ‘$data’ inserts the value of the Python variable ‘data’, which holds the URL of the CSV file.
data = 'https://github.com/gastonstat/CreditScoring/blob/master/CreditScoring.csv'
!wget $data
# Output
# --2023-10-08 19:13:49-- https://github.com/gastonstat/CreditScoring/blob/master/CreditScoring.csv
# Resolving github.com (github.com)... 140.82.121.4
# Connecting to github.com (github.com)|140.82.121.4|:443... connected.
# HTTP request sent, awaiting response... 200 OK
# Length: 321064 (314K) [text/plain]
# Saving to: ‘CreditScoring.csv’
# CreditScoring.csv 100%[===================>] 313,54K --.-KB/s in 0,09s
# 2023-10-08 19:13:50 (3,24 MB/s) - ‘CreditScoring.csv’ saved [321064/321064]
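After a download like this, it can be worth confirming that the saved file really is a CSV. GitHub URLs that contain ‘/blob/’ usually point to the web page that displays a file rather than to the raw file, so the result may be HTML. The following is a minimal sketch of such a check; the helper name `looks_like_csv` and the expected first column name are assumptions made for illustration, not part of the original notebook.

```python
import csv

def looks_like_csv(path, expected_first_column="Status"):
    """Return True if the first line of `path` parses as a CSV header
    whose first field matches `expected_first_column` (case-insensitive)."""
    with open(path, newline="", encoding="utf-8") as f:
        # Read only the first row; an empty file yields an empty list.
        header = next(csv.reader(f), [])
    return bool(header) and header[0].strip().lower() == expected_first_column.lower()
```

In the notebook, `looks_like_csv('CreditScoring.csv')` would be expected to return `True` for the real dataset and `False` if an HTML page was saved instead.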
Previewing the CSV File
Just as we ran ‘wget’ as a console command, we can get a first look at the CSV file with the ‘head’ command. It works on text files and prints their first lines (ten by default), much like the ‘head()’ method does for Pandas dataframes. Afterwards, we read the file into a dataframe and preview it.
!head CreditScoring.csv
#df = pd.read_csv(data)
df = pd.read_csv('CreditScoring.csv')
df.head()
| | Status | Seniority | Home | Time | Age | Marital | Records | Job | Expenses | Income | Assets | Debt | Amount | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9 | 1 | 60 | 30 | 2 | 1 | 3 | 73 | 129 | 0 | 0 | 800 | 846 |
| 1 | 1 | 17 | 1 | 60 | 58 | 3 | 1 | 1 | 48 | 131 | 0 | 0 | 1000 | 1658 |
| 2 | 2 | 10 | 2 | 36 | 46 | 2 | 2 | 3 | 90 | 200 | 3000 | 0 | 2000 | 2985 |
| 3 | 1 | 0 | 1 | 60 | 24 | 1 | 1 | 1 | 63 | 182 | 2500 | 0 | 900 | 1325 |
| 4 | 1 | 0 | 1 | 36 | 26 | 1 | 1 | 1 | 46 | 107 | 0 | 0 | 310 | 910 |
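On systems where the ‘head’ console command is not available, the same preview can be approximated in plain Python. This is only a sketch; the function name `head` is chosen here to mirror the console command and is not part of the original notebook.

```python
from itertools import islice

def head(path, n=10):
    """Return the first `n` lines of a text file, without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        # islice stops after n lines, so the rest of the file is never read.
        return [line.rstrip("\n") for line in islice(f, n)]
```

For example, `print("\n".join(head('CreditScoring.csv', 5)))` prints the first five lines of the file. Within Pandas, `pd.read_csv('CreditScoring.csv', nrows=5)` similarly loads only the first five data rows.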
Adapting Column Format
We’ve observed that some of the categorical variables, such as ‘status’, ‘home’, ‘marital’, ‘records’, and ‘job’, are currently encoded as numbers, which makes their values hard to interpret at a glance. To make the data more understandable, we’ll convert these columns into text format.
As a first step, we’ll lowercase all the column names for consistency:
df.columns = df.columns.str.lower()
df.head()
This change makes the column names uniform and easier to read, as the preview shows:
| | status | seniority | home | time | age | marital | records | job | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9 | 1 | 60 | 30 | 2 | 1 | 3 | 73 | 129 | 0 | 0 | 800 | 846 |
| 1 | 1 | 17 | 1 | 60 | 58 | 3 | 1 | 1 | 48 | 131 | 0 | 0 | 1000 | 1658 |
| 2 | 2 | 10 | 2 | 36 | 46 | 2 | 2 | 3 | 90 | 200 | 3000 | 0 | 2000 | 2985 |
| 3 | 1 | 0 | 1 | 60 | 24 | 1 | 1 | 1 | 63 | 182 | 2500 | 0 | 900 | 1325 |
| 4 | 1 | 0 | 1 | 36 | 26 | 1 | 1 | 1 | 46 | 107 | 0 | 0 | 310 | 910 |
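To illustrate the idea of turning numeric codes into text, here is a small sketch on a toy dataframe. The values in `toy` and the labels in `status_labels` are hypothetical placeholders, not the real meaning of the codes in this dataset; only the technique (lowercasing the column names, then `Series.map` with a dictionary) is the point.

```python
import pandas as pd

# Toy frame mimicking two of the columns above (values are illustrative).
toy = pd.DataFrame({"Status": [1, 2, 1], "Home": [1, 2, 3]})

# Lowercase the column names, as done for the real dataset.
toy.columns = toy.columns.str.lower()

# Hypothetical code-to-label mapping; the real labels are defined by the dataset.
status_labels = {1: "label_a", 2: "label_b"}
toy["status"] = toy["status"].map(status_labels)

print(toy.columns.tolist())    # ['status', 'home']
print(toy["status"].tolist())  # ['label_a', 'label_b', 'label_a']
```

Note that `map` turns any code missing from the dictionary into `NaN`, so it is worth making sure the mapping covers every value that occurs in the column.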