- Exploratory data analysis (EDA) – General Information
- Exploratory data analysis (EDA) – Car price prediction project
Exploratory data analysis (EDA) – General Information
Exploratory data analysis (EDA) is an essential step in the data analysis process. It involves summarizing and visualizing the main characteristics of a dataset to gain insights and identify patterns or trends. By exploring the data, researchers can uncover hidden relationships between variables and make informed decisions.
One common technique in EDA is to calculate summary statistics like mean, median, and standard deviation to understand the distribution of the data. These statistics provide a general overview of the dataset and can help identify potential outliers or unusual patterns.
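As a small illustration of how summary statistics expose outliers, the sketch below uses a tiny made-up price column (the values are hypothetical, not from the project dataset). Note how a single extreme value pulls the mean far above the median:

```python
import pandas as pd

# tiny hypothetical dataset for illustration
df = pd.DataFrame({"price": [1000, 1200, 1100, 50000, 1300]})

print(df["price"].mean())    # 10920.0 - dragged up by the 50000 outlier
print(df["price"].median())  # 1200.0  - robust to the outlier
print(df["price"].std())     # spread around the mean

# describe() reports count, mean, std, min, quartiles and max in one call
print(df["price"].describe())
```

A large gap between mean and median like this is often the first hint that the distribution is skewed or contains outliers.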
Visualizations also play a crucial role in EDA. Graphical representations such as histograms, scatter plots, and box plots help visualize the data distribution, identify clusters or groups, and detect any unusual patterns or trends. Visualizations can be particularly helpful in identifying relationships between variables or finding patterns that may not be immediately apparent.
Another important aspect of EDA is data cleaning. This involves handling missing values, outliers, and inconsistencies in the dataset. By carefully examining the data, researchers can decide how to handle missing values (e.g., imputing or removing them) and identify and address outliers or errors.
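The two common choices mentioned above, removing rows or imputing values, can be sketched as follows (the column and values are hypothetical, just to show the pandas calls):

```python
import pandas as pd
import numpy as np

# hypothetical column with one missing value and one suspicious outlier
df = pd.DataFrame({"engine_hp": [150.0, np.nan, 200.0, 9999.0]})

# option 1: drop rows that contain missing values
dropped = df.dropna()

# option 2: impute the missing value, here with the median
# (the median is robust against the 9999.0 outlier)
filled = df.fillna(df["engine_hp"].median())
print(filled["engine_hp"].tolist())  # [150.0, 200.0, 200.0, 9999.0]
```

Which option is better depends on how many values are missing and whether the missingness itself carries information.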
EDA is not a one-time process but rather an iterative one. As researchers delve deeper into the data, they may uncover additional questions or areas of interest that require further exploration. Through this iterative process, researchers refine their understanding of the data and uncover valuable insights.
In conclusion, exploratory data analysis is a crucial step in the data analysis process. By summarizing, visualizing, and cleaning the data, researchers can uncover patterns, identify relationships, and make informed decisions. It provides the foundation for more advanced data analysis techniques and helps in the formation of hypotheses for further investigation.
Exploratory data analysis (EDA) – Car price prediction project
This section covers a few of the topics mentioned above to give a better understanding of what you can do during EDA.
Getting an overview
First we want to understand what the data looks like, just to get a feeling for the values it contains. This helps us learn more about the problem. One thing you can do is look at each column and print a few values.
for col in df.columns:
    print(col)
    print(df[col].head())
    print()
The output is not very informative, but what about unique values?
for col in df.columns:
    print(col)
    # print only the first 5 unique values
    # print(df[col].unique()[:5])
    print(df[col].unique())
    print("number of unique values:", df[col].nunique())
    print()
Distribution of price
Next we want to look at the price and visualize this column.
# For plotting we use two libraries
import matplotlib.pyplot as plt
import seaborn as sns
# this line is needed to display the plots in notebooks
%matplotlib inline
# bins = number of bars in the histogram
# in the diagram 1e6 means 10^6 = 1,000,000
sns.histplot(df.msrp, bins=50)

What you can see in the histogram is that there are a lot of pretty cheap prices but only a few very expensive cars. This is a long-tail distribution (many prices in a small range, but a few prices spread over a wide range). We need to zoom in a bit to "ignore" the long tail with too few data points.
sns.histplot(df.msrp[df.msrp < 100000], bins=50)

This kind of distribution (long tail plus the peak) is not good for ML models, because it will confuse them. There is a way to get rid of the long tail: applying the logarithm to the price. This results in more compact values.
import numpy as np

# the problem with the logarithm: log(0) does not exist
# np.log([0, 1, 10, 1000, 100000])  # -> -inf for the 0 entry

# workaround: add 1 to every value before taking the log
np.log([0 + 1, 1 + 1, 10 + 1, 1000 + 1, 100000 + 1])
# Output: array([ 0.        ,  0.69314718,  2.39789527,  6.90875478, 11.51293546])

# to avoid always adding 1 by hand, NumPy has a function for this
np.log1p([0, 1, 10, 1000, 100000])
# Output: array([ 0.        ,  0.69314718,  2.39789527,  6.90875478, 11.51293546])

price_logs = np.log1p(df.msrp)
sns.histplot(price_logs, bins=50)

You can see the long tail is gone and there is a nice bell curve shape of a so-called normal distribution, which is ideal for ML models. But the strange peak is still there. This could be the platform's minimum price of $1,000.
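One practical consequence of training on log-transformed prices: the model's predictions come out in log space, so they have to be transformed back. NumPy's expm1 (exp(x) − 1) is the exact inverse of log1p. A minimal sketch:

```python
import numpy as np

prices = np.array([0.0, 1000.0, 100000.0])

logs = np.log1p(prices)  # log(x + 1), applied before training
back = np.expm1(logs)    # exp(x) - 1, applied to the predictions

print(np.allclose(back, prices))  # True: the round trip recovers the prices
```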
Missing values
As the title suggests, this is about finding missing values (NaN values). We can use the function in the following snippet to find those values. The sum function sums down each column and shows, for each column, how many missing values there are. This information is important when training a model.
df.isnull().sum()
# Output:
# make 0
# model 0
# year 0
# engine_fuel_type 3
# engine_hp 69
# engine_cylinders 30
# transmission_type 0
# driven_wheels 0
# number_of_doors 6
# market_category 3742
# vehicle_size 0
# vehicle_style 0
# highway_mpg 0
# city_mpg 0
# popularity 0
# msrp 0
# dtype: int64
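Once you know which columns have missing values, one simple option before training is to fill them with 0 (the column names below match the dataset, but the values are hypothetical):

```python
import pandas as pd
import numpy as np

# hypothetical slice of the dataset with a missing horsepower value
df = pd.DataFrame({"engine_hp": [268.0, np.nan, 148.0],
                   "msrp": [46135, 40650, 29450]})

# fill NaN with 0; the model can then treat
# "missing horsepower" as its own signal
df_filled = df.fillna(0)
print(df_filled["engine_hp"].tolist())  # [268.0, 0.0, 148.0]
```

Filling with 0 is just one choice; imputing with the column mean or median is another common option, and which works better depends on the model and the data.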