Preparation Steps – Part 1/2

In part 1 of this chapter, “Decision Trees and Ensemble Learning,” we introduced the project: a binary classification problem that predicts the probability of a client defaulting on a loan. Part 2 of this chapter is divided into two main sections.

Preparation Steps

In the first part, we focus on the necessary preparation steps: importing essential libraries, downloading the dataset, previewing the CSV file, and performing an initial column format adaptation to ensure uniformity in our data.

Data Transformation and Splitting

The second part is dedicated to re-encoding categorical variables and performing the train/validation/test split, a crucial step in preparing our data for modeling and evaluation.

Preparation Steps – Part 1/2

Imports for this project

For this project, we’ll import several libraries that we’re already familiar with. They provide the foundation for our data analysis and machine learning tasks:

NumPy: NumPy is a fundamental library for numerical and array operations in Python.
Pandas: Pandas is used for data manipulation and analysis, allowing us to work with structured data efficiently.
Scikit-Learn: Scikit-Learn is a powerful machine learning library that provides a wide range of tools and algorithms for our classification task. We’ll import it at a later point.
Matplotlib: Matplotlib is essential for data visualization, enabling us to create informative plots and charts.
Seaborn: Seaborn complements Matplotlib and simplifies the creation of aesthetically pleasing statistical visualizations.

By ensuring that we have these libraries at our disposal, we’ll be well-equipped to tackle the various tasks involved in our credit risk scoring project.

import pandas as pd
import numpy as np

import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

Downloading the dataset

To begin our analysis, we first download the dataset with the ‘wget’ command. In a Jupyter notebook, a line that starts with ‘!’ is run as a console command, and ‘$data’ is replaced by the value of the Python variable ‘data’, which holds the URL of the CSV file.

# Use the /raw/ path: a /blob/ URL returns GitHub's HTML page, not the CSV itself
data = 'https://github.com/gastonstat/CreditScoring/raw/master/CreditScoring.csv'
!wget $data

# Output
# --2023-10-08 19:13:49--  https://github.com/gastonstat/CreditScoring/blob/master/CreditScoring.csv
# Resolving github.com (github.com)... 140.82.121.4
# Connecting to github.com (github.com)|140.82.121.4|:443... connected.
# HTTP request sent, awaiting response... 200 OK
# Length: 321064 (314K) [text/plain]
# Saving to: ‘CreditScoring.csv’

# CreditScoring.csv   100%[===================>] 313,54K  --.-KB/s    in 0,09s   

# 2023-10-08 19:13:50 (3,24 MB/s) - ‘CreditScoring.csv’ saved [321064/321064]
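Where ‘wget’ is not installed (for example on a default Windows setup), the download step could be sketched with Python’s standard library instead. This is only an illustration: the helper name ‘download_if_missing’ is hypothetical and not part of the original notebook, and it assumes the URL points directly at the raw CSV file.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url: str, target: str) -> Path:
    """Download `url` to `target` unless the file already exists.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    path = Path(target)
    if not path.exists():
        # urlretrieve copies the response body to the given local filename
        urlretrieve(url, str(path))
    return path


# Example call (commented out because it needs network access);
# `data` is the URL variable defined in the notebook cell above:
# download_if_missing(data, "CreditScoring.csv")
```

Skipping the download when the file already exists makes it safe to re-run the notebook cell without fetching the data again.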

Previewing the CSV File

Just as we ran ‘wget’ as a console command, we can get a first look at the CSV file with the ‘head’ command. It works on any text file and prints its first 10 lines, much like the ‘head()’ method shows the first rows of a Pandas dataframe. We then load the file into a dataframe with ‘pd.read_csv’ and preview it.

!head CreditScoring.csv

# Alternative: read straight from a raw CSV URL instead of the local file
# df = pd.read_csv(data)
df = pd.read_csv('CreditScoring.csv')
df.head()

	Status	Seniority	Home	Time	Age	Marital	Records	Job	Expenses	Income	Assets	Amount	Price
0	1	9	1	60	30	2	1	3	73	129	0	800	846
1	1	17	1	60	58	3	1	1	48	131	0	1000	1658
2	2	10	2	36	46	2	2	3	90	200	3000	2000	2985
3	1	0	1	60	24	1	1	1	63	182	2500	900	1325
4	1	0	1	36	26	1	1	1	46	107	0	310	910
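If no console ‘head’ is available, the same kind of preview can be approximated in plain Python. The sketch below is an assumption-laden stand-in, not the notebook’s method: ‘preview_lines’ is a hypothetical helper name, and it assumes the file is UTF-8 text.

```python
from itertools import islice


def preview_lines(path: str, n: int = 10) -> list[str]:
    """Return the first `n` lines of a text file without reading the rest.

    Hypothetical helper for illustration; the default of 10 mirrors `head`.
    """
    with open(path, encoding="utf-8") as handle:
        # islice stops after n lines, so large files are not loaded fully
        return [line.rstrip("\n") for line in islice(handle, n)]


# Example call (commented out because it needs the downloaded file):
# for line in preview_lines("CreditScoring.csv"):
#     print(line)
```

Looking at the raw lines before calling ‘pd.read_csv’ is a quick way to confirm the delimiter and that the file really contains CSV data rather than, say, an HTML page.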

Adapting Column Format

We’ve observed that some of the categorical variables, such as ‘status,’ ‘home,’ ‘marital,’ ‘records,’ and ‘job,’ are currently encoded as numerical values, which can be less intuitive. To make the data more understandable, we’ll convert these columns into text format.

As a first step, we’ll lowercase all the column names for consistency:

df.columns = df.columns.str.lower()
df.head()

This change ensures uniformity in our data and improves readability.

	status	seniority	home	time	age	marital	records	job	expenses	income	assets	amount	price
0	1	9	1	60	30	2	1	3	73	129	0	800	846
1	1	17	1	60	58	3	1	1	48	131	0	1000	1658
2	2	10	2	36	46	2	2	3	90	200	3000	2000	2985
3	1	0	1	60	24	1	1	1	63	182	2500	900	1325
4	1	0	1	36	26	1	1	1	46	107	0	310	910
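To see the renaming step in isolation, here is a minimal sketch on a toy dataframe. The frame ‘toy’ is an assumption made for illustration: it uses three of the dataset’s column names with values copied from two rows of the preview above.

```python
import pandas as pd

# Toy stand-in for the real dataframe (three columns, two rows from the preview)
toy = pd.DataFrame({"Status": [1, 2], "Seniority": [9, 10], "Home": [1, 2]})

# .str.lower() acts on the column labels only; the cell values are untouched
toy.columns = toy.columns.str.lower()

columns_after = list(toy.columns)
print(columns_after)  # ['status', 'seniority', 'home']
```

Because only the labels change, the numeric codes in columns such as ‘status’ and ‘home’ still need to be mapped to text in a separate step.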

ML Zoomcamp 2023 – Decision Trees and Ensemble Learning– Part 2
