Car price prediction project
This chapter covers the implementation of an ML project and the steps to consider along the way. The goal is to predict car prices based on a Kaggle dataset.
The steps will be described in individual blog posts and include:
- Data preparation
- EDA (Exploratory Data Analysis)
- Using linear regression for price prediction (MSRP – manufacturer's suggested retail price)
- Understanding the internals of linear regression
- Evaluating the model with RMSE (root mean squared error)
- Feature Engineering (creating new features)
- Regularization
- Using the model
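One of the steps above evaluates the model with RMSE. Since the formula is short, a minimal NumPy sketch may already help here; the price values below are made-up examples, not from the Kaggle dataset:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    error = y_true - y_pred
    return np.sqrt((error ** 2).mean())

# toy example with made-up prices
y_true = np.array([30000.0, 42000.0, 18500.0])
y_pred = np.array([28000.0, 45000.0, 20000.0])
print(rmse(y_true, y_pred))
```

RMSE is expressed in the same unit as the target (dollars here), which makes it easy to interpret the average prediction error.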
Data preparation – General Information
In the data preparation phase, several important steps need to be followed to ensure the dataset is suitable for analysis and modeling. Here are some key considerations:
- Data Cleaning: This involves handling missing values, dealing with outliers, and ensuring consistency in data formats. Missing values can be imputed using techniques such as mean/median imputation or more advanced methods like k-nearest neighbors. Outliers may need to be removed or transformed to fall within a reasonable range.
- Data Integration: If you have multiple datasets related to car prices, you may need to combine them into a single dataset. This can involve matching and merging records based on common identifiers or performing data joins based on shared attributes.
- Data Transformation: Sometimes, the existing variables may not be in a suitable format for analysis. In such cases, feature engineering techniques can be applied to create new variables that may have a better relationship with the target variable, such as transforming categorical variables into numerical ones using one-hot encoding or label encoding.
- Feature Scaling: It is crucial to make sure that the features are on a similar scale to avoid bias in the model. Common techniques for feature scaling include standardization (mean of 0 and standard deviation of 1) or normalization (scaling values between 0 and 1).
- Train-Validation Split: Before building the predictive model, it is essential to split the dataset into training and validation subsets. Typically, the majority of the data is used for training, while a smaller portion is reserved for evaluating the model's performance. As I mentioned in past articles, a train-validate-test split might provide more reliable results.
By following these steps diligently, you can ensure that the data is well-prepared and ready for the subsequent stages of the car price prediction project.
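The steps above can be sketched in a few lines of pandas and NumPy. This is a standalone toy example, not the Kaggle data: the column names and values are made up, and the split is done by hand instead of with a library helper:

```python
import numpy as np
import pandas as pd

# toy dataset with a missing value and a categorical column
# (names and values are made up for illustration)
df_toy = pd.DataFrame({
    'engine_hp': [268.0, np.nan, 300.0, 230.0],
    'transmission_type': ['manual', 'automatic', 'automatic', 'manual'],
    'msrp': [30000, 42000, 46000, 27000],
})

# data cleaning: impute the missing value with the column mean
df_toy['engine_hp'] = df_toy['engine_hp'].fillna(df_toy['engine_hp'].mean())

# data transformation: one-hot encode the categorical column
df_toy = pd.get_dummies(df_toy, columns=['transmission_type'])

# feature scaling: standardize to mean 0 and standard deviation 1
mean, std = df_toy['engine_hp'].mean(), df_toy['engine_hp'].std()
df_toy['engine_hp'] = (df_toy['engine_hp'] - mean) / std

# train-validation split: shuffle the rows, hold out 25% for validation
idx = np.arange(len(df_toy))
np.random.seed(42)
np.random.shuffle(idx)
n_val = int(0.25 * len(df_toy))
df_val = df_toy.iloc[idx[:n_val]]
df_train = df_toy.iloc[idx[n_val:]]
```

Shuffling before the split avoids any ordering in the raw data leaking into the subsets; fixing the seed makes the split reproducible.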
Data preparation – Car price prediction project
This section covers a few of the topics mentioned above, to give a better understanding of what happens in the preparation step.
Loading data and getting an overview
import pandas as pd
import numpy as np
# reading csv file after downloading...
df = pd.read_csv('data.csv')
# ... and getting a first overview about the data
df.head()
# As you can see, there is some inconsistency
# in the naming of the columns
# -> some column names have underscores, some do not,
# some have capital letters, some do not
#
# df['Transmission Type'] works
# df.Transmission Type does not work because of the space
Cleaning
To make the columns more consistent, we might decide to make them all lowercase and replace spaces with underscores. The following code snippets show how to achieve this.
# A Pandas DataFrame has an attribute called columns
# that contains the column names.
# columns is an Index, a special data structure
# in Pandas (very similar to a Series)
df.columns
# Like a Series, it also has the str accessor for string
# manipulation, so we can apply the same string
# functions to all column names
df.columns = df.columns.str.lower().str.replace(' ','_')
df.head()
We have the same problem with the values. Before we can apply the same transformation, we need to detect all string columns, because the str accessor only works on strings.
# dtypes returns the type of each column;
# here we're interested in "object" columns
#
# In the case of CSV files, "object" columns
# can only contain strings
df.dtypes
df.dtypes == 'object'
# to select only the objects
df.dtypes[df.dtypes == 'object']
The last line of the snippet above outputs a series whose values are the dtypes and whose index holds the column names. We're not interested in the values here, only in the names.
# Get access to the index of that series
# and convert it to a Python list of column names
strings = list(df.dtypes[df.dtypes == 'object'].index)
strings
Similar to what we've done with the column names, we now want to apply the same transformation to the values of those columns.
# For a single column:
df['make'] = df['make'].str.lower().str.replace(' ','_')

# Better: loop over all string columns
for col in strings:
    df[col] = df[col].str.lower().str.replace(' ','_')
df.head()
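A quick way to convince yourself that the normalization works is to run the whole recipe on a small self-contained frame. The column names and values below are illustrative, not the full Kaggle schema:

```python
import pandas as pd

# small demo frame mimicking the inconsistent naming in the raw data
df_demo = pd.DataFrame({'Transmission Type': ['MANUAL', 'AUTOMATIC'],
                        'Vehicle Style': ['4dr SUV', 'Sedan']})

# normalize the column names ...
df_demo.columns = df_demo.columns.str.lower().str.replace(' ', '_')

# ... and the values of all string (object) columns
string_cols = list(df_demo.dtypes[df_demo.dtypes == 'object'].index)
for col in string_cols:
    df_demo[col] = df_demo[col].str.lower().str.replace(' ', '_')

print(df_demo.columns.tolist())
print(df_demo['vehicle_style'].tolist())
```

After this, every column name and every string value follows the same lowercase, underscore-separated convention, which makes later feature engineering less error-prone.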