ML Zoomcamp 2023 – Decision Trees and Ensemble Learning – Part 2

  1. Preparation Steps – Part 1/2
    1. Imports for this project
    2. Downloading the dataset
    3. Previewing the CSV File
    4. Adapting Column Format

In part 1 of this chapter, “Decision Trees and Ensemble Learning,” we introduced the project, which is a binary classification problem aimed at predicting the probability of a client defaulting on a loan. Part 2 of this chapter is divided into two main sections.

Preparation Steps

In the first part, we focus on the necessary preparation steps: importing essential libraries, downloading the dataset, previewing the CSV file, and adapting the column format so that our data is uniform.

Data Transformation and Splitting

The second part is dedicated to re-encoding categorical variables and performing the train/validation/test split, a crucial step in preparing our data for modeling and evaluation.

Preparation Steps – Part 1/2

Imports for this project

For this project, we’ll need to import several essential libraries that we’re already familiar with. These libraries provide the foundation for our data analysis and machine learning tasks. The necessary libraries include:

  1. NumPy: NumPy is a fundamental library for numerical and array operations in Python.
  2. Pandas: Pandas is used for data manipulation and analysis, allowing us to work with structured data efficiently.
  3. Scikit-Learn: Scikit-Learn is a powerful machine learning library that provides a wide range of tools and algorithms for our classification task. We’ll import this library at a later point.
  4. Matplotlib: Matplotlib is essential for data visualization, enabling us to create informative plots and charts.
  5. Seaborn: Seaborn complements Matplotlib and simplifies the creation of aesthetically pleasing statistical visualizations.

By ensuring that we have these libraries at our disposal, we’ll be well-equipped to tackle the various tasks involved in our credit risk scoring project.

import pandas as pd
import numpy as np

import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

Downloading the dataset

To begin our analysis, we first need the dataset, which we download with the ‘wget’ command. In a Jupyter notebook, a line that starts with ‘!’ is passed to the shell as a console command, and ‘$data’ is replaced by the value of the Python variable ‘data’, which holds the URL of the CSV file.

data = 'https://github.com/gastonstat/CreditScoring/blob/master/CreditScoring.csv'
!wget $data

# Output
# --2023-10-08 19:13:49--  https://github.com/gastonstat/CreditScoring/blob/master/CreditScoring.csv
# Resolving github.com (github.com)... 140.82.121.4
# Connecting to github.com (github.com)|140.82.121.4|:443... connected.
# HTTP request sent, awaiting response... 200 OK
# Length: 321064 (314K) [text/plain]
# Saving to: ‘CreditScoring.csv’

# CreditScoring.csv   100%[===================>] 313,54K  --.-KB/s    in 0,09s   

# 2023-10-08 19:13:50 (3,24 MB/s) - ‘CreditScoring.csv’ saved [321064/321064]

Previewing the CSV File

Just as we ran ‘wget’ from the notebook, we can get a first look at the CSV file with the ‘head’ console command. It prints the first lines of a text file (10 by default), much like the head() method does for Pandas dataframes. After that, we load the file into a dataframe and display its first rows.

!head CreditScoring.csv

#df = pd.read_csv(data)
df = pd.read_csv('CreditScoring.csv')
df.head()
   Status  Seniority  Home  Time  Age  Marital  Records  Job  Expenses  Income  Assets  Debt  Amount  Price
0       1          9     1    60   30        2        1    3        73     129       0     0     800    846
1       1         17     1    60   58        3        1    1        48     131       0     0    1000   1658
2       2         10     2    36   46        2        2    3        90     200    3000     0    2000   2985
3       1          0     1    60   24        1        1    1        63     182    2500     0     900   1325
4       1          0     1    36   26        1        1    1        46     107       0     0     310    910
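
The ‘head’ command is not available in every environment (for example, a plain Windows console). As a minimal sketch, the same kind of raw-text preview can be done in Python. The helper name preview_lines and the small temporary demo file are assumptions made for illustration; they are not part of the course notebook.

```python
from itertools import islice
import os
import tempfile


def preview_lines(path, n=10):
    """Return the first n lines of a text file, similar to `head -n`."""
    with open(path, encoding="utf-8") as f:
        # islice stops after n lines, so the rest of the file is never read.
        return [line.rstrip("\n") for line in islice(f, n)]


# Demo on a small temporary file so the sketch runs without the dataset.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('"Status","Seniority","Home"\n1,9,1\n1,17,1\n2,10,2\n')
    demo_path = tmp.name

first_two = preview_lines(demo_path, n=2)
print(first_two)

# Clean up the temporary demo file.
os.remove(demo_path)
```

With the real dataset, calling preview_lines('CreditScoring.csv') would return the header line followed by the first nine data rows as plain strings.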

Adapting Column Format

We’ve observed that some of the categorical variables, such as ‘status,’ ‘home,’ ‘marital,’ ‘records,’ and ‘job,’ are currently encoded as numerical values, which can be less intuitive. To make the data more understandable, we’ll convert these columns into text format.

As a first step, we’ll lowercase all the column names for consistency:

df.columns = df.columns.str.lower()
df.head()

This change ensures uniformity in our data and improves readability.

   status  seniority  home  time  age  marital  records  job  expenses  income  assets  debt  amount  price
0       1          9     1    60   30        2        1    3        73     129       0     0     800    846
1       1         17     1    60   58        3        1    1        48     131       0     0    1000   1658
2       2         10     2    36   46        2        2    3        90     200    3000     0    2000   2985
3       1          0     1    60   24        1        1    1        63     182    2500     0     900   1325
4       1          0     1    36   26        1        1    1        46     107       0     0     310    910
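
To make the two steps concrete, here is a small sketch that lowercases the column names and then replaces numeric codes with text in one column. It uses a hand-typed sample of the first three preview rows (four columns only) so that it runs without the downloaded file. The text labels in status_labels are an assumption chosen for illustration; this section does not state what the codes mean.

```python
import pandas as pd

# First three rows of the preview, typed in by hand; only four of the
# fourteen columns are kept to keep the sketch short.
sample = pd.DataFrame({
    "Status": [1, 1, 2],
    "Seniority": [9, 17, 10],
    "Home": [1, 1, 2],
    "Income": [129, 131, 200],
})

# Step 1: lowercase all column names, as done in the article.
sample.columns = sample.columns.str.lower()

# Step 2: replace numeric codes with text labels.
# ASSUMPTION: these labels are illustrative placeholders, not taken
# from this section of the article.
status_labels = {1: "ok", 2: "default", 0: "unk"}
sample["status"] = sample["status"].map(status_labels)

print(sample)
```

Series.map returns NaN for any code that is missing from the dictionary, so it is worth checking for missing values after a conversion like this.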
