- Environment
- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment
Precision in Understanding: EDA Techniques
A crucial aspect of data understanding is Exploratory Data Analysis (EDA), an approach to analyzing datasets in order to summarize their main characteristics. It combines numerical and graphical summaries to paint a comprehensive picture of your dataset.
Numerical Summaries: Decoding Data Through Numbers
Depending on how many variables are involved, numerical summaries are divided into univariate measures (mean, median, percentiles, …) and bivariate measures (correlation, …). Univariate measures describe a single variable, while bivariate measures describe the relationship between two variables.
Mean and Median
The mean is the arithmetic average, while the median is the middle value of the sorted data. Assessing both helps you understand the central tendency of the data. The mean takes every value into account but is sensitive to extreme values, whereas the median depends only on the order of the values and is robust to extreme values.
Python Code:
import numpy as np
data = np.array([1, 2, 3, 4, 5])
mean_value = np.mean(data)
median_value = np.median(data)
print(f"Mean: {mean_value}, Median: {median_value}")
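To make the sensitivity to extreme values concrete, here is a minimal sketch that adds one extreme value to the same array. The value 100 is an arbitrary, hypothetical outlier chosen only for illustration.

```python
import numpy as np

# Same five values as above, plus one hypothetical extreme value (100)
data = np.array([1, 2, 3, 4, 5])
data_with_outlier = np.append(data, 100)

mean_before, median_before = np.mean(data), np.median(data)
mean_after, median_after = np.mean(data_with_outlier), np.median(data_with_outlier)

print(f"Without outlier: mean={mean_before}, median={median_before}")
print(f"With outlier:    mean={mean_after:.2f}, median={median_after}")
```

The mean jumps from 3.0 to roughly 19.17, while the median only moves from 3.0 to 3.5.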
Percentiles
Dividing the data into percentiles reveals the distribution’s spread, highlighting values below which a certain percentage of observations fall.
Python Code:
import numpy as np
data = np.array([1, 2, 3, 4, 5])
percentile_25 = np.percentile(data, 25)
percentile_75 = np.percentile(data, 75)
print(f"25th Percentile: {percentile_25}, 75th Percentile: {percentile_75}")
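Percentiles are often combined into a five-number summary, and the distance between the 25th and 75th percentiles (the interquartile range, IQR) is a common measure of spread. The sketch below assumes NumPy's default linear interpolation between data points; the variable names are illustrative.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

# Minimum, lower quartile, median, upper quartile, maximum in one call
five_numbers = np.percentile(data, [0, 25, 50, 75, 100])

# Interquartile range: the width of the middle 50% of the data
iqr = five_numbers[3] - five_numbers[1]

print(f"Five-number summary: {five_numbers}")
print(f"Interquartile range: {iqr}")
```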
Standard Deviation
The standard deviation measures how far data points typically lie from the mean, indicating the dataset’s variability. Like the mean, it uses every value and is therefore sensitive to extreme values.
Python Code:
import numpy as np
data = np.array([1, 2, 3, 4, 5])
std_deviation = np.std(data)  # population standard deviation (ddof=0 is NumPy's default)
print(f"Standard Deviation: {std_deviation}")
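One detail worth knowing: `np.std` divides by n by default (the population formula). When the data is a sample from a larger population, the sample estimate that divides by n - 1 is usually preferred, which NumPy exposes through the `ddof` parameter. A small sketch comparing the two:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

# Population standard deviation: divides the squared deviations by n
population_std = np.std(data)

# Sample standard deviation: divides by n - 1 (ddof = "delta degrees of freedom")
sample_std = np.std(data, ddof=1)

print(f"Population std: {population_std:.4f}")
print(f"Sample std:     {sample_std:.4f}")
```

For small datasets the difference is noticeable; it shrinks as the number of observations grows.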
Correlation Coefficients
Explore relationships between variables by calculating correlation coefficients. These coefficients quantify the strength and direction of linear relationships.
- 0 < r ≤ 1 indicates a positive linear relationship (r = 1 is a perfect one)
- r = 0 indicates no linear relationship
- -1 ≤ r < 0 indicates a negative linear relationship (r = -1 is a perfect one)
Python Code:
import numpy as np
data_x = np.array([1, 2, 3, 4, 5])
data_y = np.array([5, 4, 3, 2, 1])
correlation_coefficient = np.corrcoef(data_x, data_y)[0, 1]
print(f"Correlation Coefficient: {correlation_coefficient}")
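Because the coefficient only captures linear relationships, a value near zero does not mean the variables are unrelated. The following sketch uses made-up data in which `y_quadratic` depends entirely on `x`, yet its correlation with `x` is zero.

```python
import numpy as np

x = np.array([-2, -1, 0, 1, 2])
y_linear = 2 * x + 1      # perfectly linear in x
y_quadratic = x ** 2      # fully determined by x, but not linearly

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(f"Linear relationship:    r = {r_linear:.2f}")
print(f"Quadratic relationship: r = {r_quadratic:.2f}")
```

This is one reason to pair correlation coefficients with a scatter plot before drawing conclusions.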
Skewness and Kurtosis
Skewness indicates the asymmetry of the data distribution, while kurtosis measures the tail heaviness. Both metrics offer insights into the dataset’s shape.
Python Code:
import numpy as np
from scipy.stats import skew, kurtosis
data = np.array([1, 2, 3, 4, 5])
skewness_value = skew(data)
kurtosis_value = kurtosis(data)  # excess (Fisher) kurtosis by default: a normal distribution scores 0
print(f"Skewness: {skewness_value}, Kurtosis: {kurtosis_value}")
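The array used above is symmetric, so its skewness is zero. To see a non-zero value, the sketch below compares it with a hypothetical right-skewed sample, and also shows the `fisher=False` option of `scipy.stats.kurtosis`, which reports kurtosis without subtracting 3.

```python
import numpy as np
from scipy.stats import skew, kurtosis

symmetric = np.array([1, 2, 3, 4, 5])
right_skewed = np.array([1, 1, 2, 2, 3, 20])  # hypothetical sample with a long right tail

skew_symmetric = skew(symmetric)
skew_right = skew(right_skewed)

# Excess kurtosis (default) versus "raw" Pearson kurtosis
excess_kurtosis = kurtosis(symmetric)
pearson_kurtosis = kurtosis(symmetric, fisher=False)

print(f"Skewness (symmetric):    {skew_symmetric:.3f}")
print(f"Skewness (right-skewed): {skew_right:.3f}")
print(f"Excess kurtosis: {excess_kurtosis:.2f}, Pearson kurtosis: {pearson_kurtosis:.2f}")
```

A positive skewness points to a longer right tail, a negative one to a longer left tail.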
Graphical Summaries: Crafting a Visual Narrative
While numerical summaries distill information into digits, graphical summaries paint a vivid picture of the data landscape. These visualizations enhance our intuitive understanding and bring to light patterns that might elude numerical scrutiny.
Histograms
Histograms showcase the distribution of a single variable, revealing the frequency of different ranges. This graphical tool is fundamental for grasping the shape of the data.
Use: Identify the distribution of a variable and understand the frequency of values.
Use Case: Analyzing the distribution of exam scores in a class to understand the spread and identify common performance ranges.
Python Code:
import matplotlib.pyplot as plt
import seaborn as sns
# Assuming 'data' is your dataset and 'exam_scores' is the variable of interest
sns.histplot(data['exam_scores'], kde=True)
plt.title('Histogram of Exam Scores')
plt.show()
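Under the hood, a histogram is just a count of values per bin. The sketch below computes those counts with `np.histogram` for a small set of made-up exam scores; both the scores and the bin edges are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical exam scores for ten students
scores = np.array([55, 62, 67, 71, 73, 78, 81, 84, 88, 95])

# Bin edges in steps of ten points; the last bin includes its right edge
bin_edges = [50, 60, 70, 80, 90, 100]
counts, edges = np.histogram(scores, bins=bin_edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>3}-{right:<3}: {count} student(s)")
```

Changing the bin edges changes the picture, so it is worth trying a few bin widths before interpreting the shape.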
Distribution Plots
Kernel density plots or distribution plots provide a smoothed representation of the data distribution, aiding in the identification of patterns.
Use: Understand the continuous distribution of a variable.
Use Case: Analyzing the distribution of customer purchase amounts in an e-commerce dataset.
Python Code:
import matplotlib.pyplot as plt
import seaborn as sns
# Assuming 'data' is your DataFrame and 'purchase_amount' is the variable
sns.kdeplot(data['purchase_amount'], fill=True)  # 'fill' replaces the deprecated 'shade' argument
plt.title('Kernel Density Plot of Purchase Amounts')
plt.show()
Box Plots
Box plots, or box-and-whisker plots, offer a visual depiction of the dataset’s spread and central tendency. They are effective for identifying outliers and assessing the data’s variability.
Use: Visualize the spread of a variable and identify outliers.
Use Case: Comparing the distribution of salaries among different departments in a company.
Python Code:
import matplotlib.pyplot as plt
import seaborn as sns
# Assuming 'data' is your DataFrame and 'department' and 'salary' are the variables
sns.boxplot(x='department', y='salary', data=data)
plt.title('Box Plot of Salaries by Department')
plt.show()
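The points a box plot draws as outliers are commonly those lying more than 1.5 times the interquartile range beyond the quartiles. As a rough sketch of that convention, the same rule can be applied numerically; the salary figures (in thousands) are invented for this example.

```python
import numpy as np

# Hypothetical salaries in thousands; 120 is deliberately far from the rest
salaries = np.array([42, 45, 47, 50, 52, 55, 58, 120])

q1, q3 = np.percentile(salaries, [25, 75])
iqr = q3 - q1

# Whisker limits under the 1.5 * IQR convention
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = salaries[(salaries < lower_limit) | (salaries > upper_limit)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Limits: [{lower_limit}, {upper_limit}], outliers: {outliers}")
```

Flagged values are candidates for a closer look, not automatically errors.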
Scatter Plots
Scatter plots are invaluable for exploring relationships between two variables. They uncover trends, clusters, or anomalies, guiding further analysis.
Use: Identify relationships between two variables.
Use Case: Investigating the correlation between hours of study and exam scores for students.
Python Code:
import matplotlib.pyplot as plt
import seaborn as sns
# Assuming 'data' is your DataFrame and 'study_hours' and 'exam_scores' are the variables
plt.scatter(data['study_hours'], data['exam_scores'])
plt.title('Scatter Plot of Study Hours vs. Exam Scores')
plt.xlabel('Study Hours')
plt.ylabel('Exam Scores')
plt.show()
Bar Plot / Count Plot
Bar plots or count plots show how often each category of a categorical variable occurs, providing insights into the distribution of categories.
Use: Visualize the distribution of categorical variables.
Use Case: Showing the frequency of different car models in a dealership.
Python Code:
import matplotlib.pyplot as plt
import seaborn as sns
# Assuming 'data' is your DataFrame and 'car_model' is the variable
sns.countplot(x='car_model', data=data)
plt.title('Count Plot of Car Models')
plt.show()
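The numbers behind a count plot are plain category frequencies. A minimal sketch using only the standard library, with a made-up list of car models:

```python
from collections import Counter

# Hypothetical list of cars on the lot
car_models = ["Sedan", "SUV", "Sedan", "Hatchback", "SUV", "Sedan"]

# Count occurrences and sort from most to least frequent
model_counts = Counter(car_models).most_common()

for model, count in model_counts:
    print(f"{model}: {count}")
```

With a pandas DataFrame, `data['car_model'].value_counts()` gives the same kind of table.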
Density Plot
Density plots showcase the distribution of a continuous variable similar to histograms but with a smoothed curve.
Use: Understand the continuous distribution of a variable.
Use Case: Visualizing the density of temperatures recorded over a month.
Python Code:
import matplotlib.pyplot as plt
import seaborn as sns
# Assuming 'data' is your DataFrame and 'temperature' is the variable
sns.kdeplot(data['temperature'], fill=True)
plt.title('Density Plot of Temperatures')
plt.show()
Line Plot
Line plots represent the trend of a variable over a continuous dimension, often time.
Use: Visualize trends over a continuous dimension.
Use Case: Showing the sales trend of a product over different months.
Python Code:
import matplotlib.pyplot as plt
import seaborn as sns
# Assuming 'data' is your DataFrame and 'month' and 'sales' are the variables
sns.lineplot(x='month', y='sales', data=data)
plt.title('Line Plot of Sales Over Months')
plt.show()
Heatmap
A heatmap is a graphical representation of data where values are depicted using colors. It’s useful for visualizing relationships in a matrix format.
Use: Visualize relationships in a matrix format.
Use Case: Analyzing the correlation between various financial indicators in a company.
Python Code:
import matplotlib.pyplot as plt
import seaborn as sns
# Assuming 'data' is your DataFrame; numeric_only=True skips non-numeric columns
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Heatmap of Financial Indicators Correlation')
plt.show()
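The matrix that the heatmap colors is a table of pairwise correlation coefficients. The sketch below builds one with `np.corrcoef` from three invented indicator series; the names `revenue`, `costs`, and `debt` and their values are purely illustrative.

```python
import numpy as np

# Hypothetical yearly figures for three financial indicators
revenue = np.array([10, 12, 15, 18, 20])
costs = np.array([8, 9, 11, 13, 14])
debt = np.array([9, 7, 6, 4, 3])

# Each row passed to np.corrcoef is treated as one variable
corr_matrix = np.corrcoef(np.vstack([revenue, costs, debt]))

labels = ["revenue", "costs", "debt"]
for label, row in zip(labels, corr_matrix):
    print(f"{label:>8}: " + "  ".join(f"{value:+.2f}" for value in row))
```

The diagonal is always 1 (each variable against itself) and the matrix is symmetric, so only one triangle carries new information.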
Violin Plot
Violin plots combine aspects of box plots and density plots, providing a richer description of the data distribution.
Use: Visualize the distribution of a numerical variable across different categories.
Use Case: Comparing the distribution of house prices among different neighborhoods.
Python Code:
import matplotlib.pyplot as plt
import seaborn as sns
# Assuming 'data' is your DataFrame and 'neighborhood' and 'house_price' are the variables
sns.violinplot(x='neighborhood', y='house_price', data=data)
plt.title('Violin Plot of House Prices by Neighborhood')
plt.show()
Subplot and Pair Plot
Use: Subplots help display multiple plots in a single figure, and pair plots show pairwise relationships between variables.
Use Case: Examining the distribution, scatter plots, and relationships between multiple variables.
Python Code:
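import matplotlib.pyplot as plt
import seaborn as sns
# Assuming 'data' is your DataFrame; the column names below are placeholders
# Subplots: a 2x2 grid of plots in a single figure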
plt.subplot(2, 2, 1)
sns.histplot(data['column'])
plt.title('Histogram')
plt.subplot(2, 2, 2)
sns.scatterplot(x='x_variable', y='y_variable', data=data)
plt.title('Scatter Plot')
plt.subplot(2, 2, 3)
sns.boxplot(x='category_column', y='numeric_column', data=data)
plt.title('Box Plot')
plt.subplot(2, 2, 4)
sns.kdeplot(data['column'], fill=True)  # 'fill' replaces the deprecated 'shade' argument
plt.title('Kernel Density Plot')
plt.tight_layout()
plt.show()
# Pair Plot
sns.pairplot(data[['variable1', 'variable2', 'variable3']])
plt.suptitle('Pair Plot of Variables', y=1.02)  # y > 1 keeps the title above the grid
plt.show()
By integrating numerical and graphical summaries into the fabric of EDA, we equip ourselves with a comprehensive toolkit. This arsenal empowers data explorers to unravel the intricacies of datasets, ensuring that no nuance escapes their scrutiny. Numerical precision and visual intuition converge, forming a symbiotic relationship that propels the journey from raw data to informed insights.
As we traverse the landscape of data understanding and sourcing, each step is a revelation. The journey is not just a progression; it’s an evolution. Embrace the evolving odyssey, where the possibilities are as limitless as the precision with which we navigate the data-driven frontier.
Unlocking Efficiency: The Power of Automated EDA in Python
Ready to elevate your data exploration game? I’ve prepared a special article where you can dive into the world of Automated Exploratory Data Analysis (EDA) in Python. Uncover the tools and techniques that streamline your workflow, making data comprehension a breeze. Stay tuned for a journey into the realm where efficiency meets insights. The future of EDA is automated: be part of the revolution.