Precision in Understanding: EDA Techniques

A crucial part of data understanding is Exploratory Data Analysis (EDA), an approach to analyzing datasets in order to summarize their main characteristics. It combines numerical and graphical summaries to paint a comprehensive picture of your dataset.

Numerical Summaries: Decoding Data Through Numbers

Depending on how many variables are involved, numerical summaries are divided into univariate measures (mean, median, percentiles, …) and bivariate measures (correlation, …). Univariate measures describe a single variable, while bivariate measures describe the relationship between two variables.

Mean and Median

The mean provides the average value, while the median is the middle value of the sorted data. Assessing both helps understand the central tendency of the data. The mean takes every value into account but is sensitive to extreme values, while the median depends only on the order of the values and is therefore robust to extreme values.

Python Code:

import numpy as np

data = np.array([1, 2, 3, 4, 5])
mean_value = np.mean(data)
median_value = np.median(data)

print(f"Mean: {mean_value}, Median: {median_value}")
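
To make the sensitivity to extreme values concrete, here is a minimal sketch. The second array is a made-up variation of the sample above in which the last value is replaced by an outlier (100); it is not data from a real project.

```python
import numpy as np

# Original sample and a hypothetical variant with one extreme value
data = np.array([1, 2, 3, 4, 5])
data_with_outlier = np.array([1, 2, 3, 4, 100])

mean_clean = np.mean(data)
median_clean = np.median(data)
mean_outlier = np.mean(data_with_outlier)
median_outlier = np.median(data_with_outlier)

print(f"Without outlier: mean={mean_clean}, median={median_clean}")
print(f"With outlier:    mean={mean_outlier}, median={median_outlier}")
```

The mean jumps from 3.0 to 22.0, while the median stays at 3.0. A large gap between the two is often a first hint of outliers or a skewed distribution.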

Percentiles

Dividing the data into percentiles reveals the distribution’s spread, highlighting values below which a certain percentage of observations fall.

Python Code:

import numpy as np

data = np.array([1, 2, 3, 4, 5])
percentile_25 = np.percentile(data, 25)
percentile_75 = np.percentile(data, 75)

print(f"25th Percentile: {percentile_25}, 75th Percentile: {percentile_75}")
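
As a small extension, several percentiles can be requested in one call, and the distance between the 25th and 75th percentile (the interquartile range, IQR) gives a simple measure of spread. This is a sketch on the same toy data; the variable names are my own.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

# np.percentile accepts a list of percentiles and returns one value per entry
q1, q2, q3 = np.percentile(data, [25, 50, 75])

# Interquartile range: the spread of the middle 50% of the data
iqr = q3 - q1

print(f"Q1: {q1}, Median (Q2): {q2}, Q3: {q3}, IQR: {iqr}")
```

Note that the 50th percentile is the median, so percentiles generalize the idea of a "middle point" to any share of the data.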

Standard Deviation

The standard deviation measures how far data points typically lie from the mean, indicating the dataset’s variability. Because it is computed from every value, it is, like the mean, sensitive to extreme values.

Python Code:

import numpy as np

data = np.array([1, 2, 3, 4, 5])
std_deviation = np.std(data)

print(f"Standard Deviation: {std_deviation}")
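
One detail worth knowing: by default np.std divides by n (population standard deviation). If your data is a sample from a larger population, you will usually want the sample standard deviation, which divides by n - 1. A minimal sketch of the difference, on the same toy data:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

# ddof=0 (the default) divides the squared deviations by n
population_std = np.std(data)

# ddof=1 divides by n - 1, the usual choice for a sample
sample_std = np.std(data, ddof=1)

print(f"Population std: {population_std:.4f}, Sample std: {sample_std:.4f}")
```

The difference shrinks as the dataset grows, but on small samples it is clearly visible.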

Correlation Coefficients

Explore relationships between variables by calculating correlation coefficients. These coefficients quantify the strength and direction of linear relationships.

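The coefficient r always lies between -1 and 1:

0 < r ≤ 1 refers to a positive relationship (r = 1 is a perfect positive linear relationship)
r = 0 refers to no linear relationship
-1 ≤ r < 0 refers to a negative relationship (r = -1 is a perfect negative linear relationship)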

Python Code:

import numpy as np

data_x = np.array([1, 2, 3, 4, 5])
data_y = np.array([5, 4, 3, 2, 1])

correlation_coefficient = np.corrcoef(data_x, data_y)[0, 1]

print(f"Correlation Coefficient: {correlation_coefficient}")
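
Keep in mind that the correlation coefficient only captures linear relationships. The following sketch uses made-up data in which y depends perfectly on x, yet the coefficient is (numerically) zero because the relationship is quadratic, not linear.

```python
import numpy as np

# Hypothetical data: y is fully determined by x, but not linearly
x = np.array([-2, -1, 0, 1, 2])
y = x ** 2

r = np.corrcoef(x, y)[0, 1]

print(f"Correlation between x and x^2: {abs(r):.2f}")
```

A coefficient near zero therefore does not prove that two variables are unrelated. A scatter plot is a good companion check.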

Skewness and Kurtosis

Skewness indicates the asymmetry of the data distribution, while kurtosis measures the tail heaviness. Both metrics offer insights into the dataset’s shape.

Python Code:

import numpy as np
from scipy.stats import skew, kurtosis

data = np.array([1, 2, 3, 4, 5])

skewness_value = skew(data)
kurtosis_value = kurtosis(data)

print(f"Skewness: {skewness_value}, Kurtosis: {kurtosis_value}")
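
To see where these numbers come from, both measures can be written in terms of central moments. The sketch below uses the biased (population) formulas, which should match the defaults of scipy's skew and kurtosis (kurtosis reported as excess kurtosis, so a normal distribution scores 0). The helper name shape_measures and the skewed sample are my own illustration.

```python
import numpy as np

def shape_measures(values):
    """Return (skewness, excess kurtosis) computed from central moments."""
    deviations = values - values.mean()
    m2 = np.mean(deviations ** 2)  # second central moment (population variance)
    m3 = np.mean(deviations ** 3)  # third central moment
    m4 = np.mean(deviations ** 4)  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skewness, excess_kurtosis

symmetric = np.array([1, 2, 3, 4, 5])
# Hypothetical right-skewed sample: mostly small values, one large one
right_skewed = np.array([1, 2, 2, 3, 3, 3, 4, 20])

sym_skew, sym_kurt = shape_measures(symmetric)
skewed_skew, skewed_kurt = shape_measures(right_skewed)

print(f"Symmetric:    skewness={sym_skew:.2f}, excess kurtosis={sym_kurt:.2f}")
print(f"Right-skewed: skewness={skewed_skew:.2f}, excess kurtosis={skewed_kurt:.2f}")
```

The symmetric sample has a skewness of zero, while the long right tail of the second sample produces a positive skewness.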

Graphical Summaries: Crafting a Visual Narrative

While numerical summaries distill information into digits, graphical summaries paint a vivid picture of the data landscape. These visualizations enhance our intuitive understanding and bring to light patterns that might elude numerical scrutiny.

Histograms

Histograms showcase the distribution of a single variable by splitting its value range into intervals (bins) and showing how many observations fall into each one. This graphical tool is fundamental for grasping the shape of the data.

Use: Identify the distribution of a variable and understand the frequency of values.
Use Case: Analyzing the distribution of exam scores in a class to understand the spread and identify common performance ranges.

Python Code:

import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'data' is your dataset and 'exam_scores' is the variable of interest
sns.histplot(data['exam_scores'], kde=True)
plt.title('Histogram of Exam Scores')
plt.show()
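
If you want to see the numbers behind the bars, np.histogram performs the binning without drawing anything. A minimal sketch with made-up exam scores; the bin count and range are arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical exam scores, made up for illustration
exam_scores = np.array([52, 58, 61, 64, 67, 70, 71, 73, 75, 78, 82, 85, 91])

# Five equal-width bins between 50 and 100
counts, bin_edges = np.histogram(exam_scores, bins=5, range=(50, 100))

# Print each bin with the number of scores that fall into it
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.0f}-{right:.0f}: {count}")
```

Changing the number of bins can change the apparent shape of the distribution, so it is worth trying a few values.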

Distribution Plots

Kernel density plots or distribution plots provide a smoothed representation of the data distribution, aiding in the identification of patterns.

Use: Understand the continuous distribution of a variable.
Use Case: Analyzing the distribution of customer purchase amounts in an e-commerce dataset.

Python Code:

import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'data' is your DataFrame and 'purchase_amount' is the variable
sns.kdeplot(data['purchase_amount'], fill=True)  # 'fill' replaces 'shade', which is deprecated in recent seaborn versions
plt.title('Kernel Density Plot of Purchase Amounts')
plt.show()

Box Plots

Box plots, or box-and-whisker plots, offer a visual depiction of the dataset’s spread and central tendency. They are effective for identifying outliers and assessing the data’s variability.

Use: Visualize the spread of a variable and identify outliers.
Use Case: Comparing the distribution of salaries among different departments in a company.

Python Code:

import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'data' is your DataFrame and 'department' and 'salary' are the variables
sns.boxplot(x='department', y='salary', data=data)
plt.title('Box Plot of Salaries by Department')
plt.show()
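
The points a box plot draws as outliers typically come from the 1.5 × IQR rule, which, as far as I know, is the default whisker setting in matplotlib and seaborn. The sketch below applies that rule by hand to made-up salary values.

```python
import numpy as np

# Hypothetical salaries (in thousands), made up for illustration
salaries = np.array([42, 45, 47, 50, 52, 55, 58, 60, 120])

q1, q3 = np.percentile(salaries, [25, 75])
iqr = q3 - q1

# Values beyond these fences are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = salaries[(salaries < lower_fence) | (salaries > upper_fence)]

print(f"Fences: [{lower_fence}, {upper_fence}], Outliers: {outliers}")
```

Here only the value 120 lies outside the fences. Whether such a point is an error or a genuine extreme value is a question for domain knowledge, not for the plot.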

Scatter Plots

Scatter plots are invaluable for exploring relationships between two variables. They uncover trends, clusters, or anomalies, guiding further analysis.

Use: Identify relationships between two variables.
Use Case: Investigating the correlation between hours of study and exam scores for students.

Python Code:

import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'data' is your DataFrame and 'study_hours' and 'exam_scores' are the variables
plt.scatter(data['study_hours'], data['exam_scores'])
plt.title('Scatter Plot of Study Hours vs. Exam Scores')
plt.xlabel('Study Hours')
plt.ylabel('Exam Scores')
plt.show()

Bar Plot / Count Plot

Bar plots or count plots show how often each category of a categorical variable occurs, providing insights into the distribution of categories.

Use: Visualize the distribution of categorical variables.
Use Case: Showing the frequency of different car models in a dealership.

Python Code:

import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'data' is your DataFrame and 'car_model' is the variable
sns.countplot(x='car_model', data=data)
plt.title('Count Plot of Car Models')
plt.show()
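
The bar heights of a count plot are simply category frequencies. A minimal sketch of that counting step using only the standard library; the car model names are placeholders.

```python
from collections import Counter

# Hypothetical car models in a dealership, made up for illustration
car_models = ["Golf", "Polo", "Golf", "Passat", "Golf", "Polo"]

# Count how often each category occurs
model_counts = Counter(car_models)

# most_common() sorts the categories by frequency, highest first
for model, count in model_counts.most_common():
    print(f"{model}: {count}")
```

Sorting by frequency, as done here, often makes the corresponding plot easier to read as well.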

Density Plot

Density plots showcase the distribution of a continuous variable similar to histograms but with a smoothed curve.

Use: Understand the continuous distribution of a variable.
Use Case: Visualizing the density of temperatures recorded over a month.

Python Code:

import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'data' is your DataFrame and 'temperature' is the variable
sns.kdeplot(data['temperature'], fill=True)
plt.title('Density Plot of Temperatures')
plt.show()

Line Plot

Line plots represent the trend of a variable over a continuous dimension, often time.

Use: Visualize trends over a continuous dimension.
Use Case: Showing the sales trend of a product over different months.

Python Code:

import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'data' is your DataFrame and 'month' and 'sales' are the variables
sns.lineplot(x='month', y='sales', data=data)
plt.title('Line Plot of Sales Over Months')
plt.show()

Heatmap

A heatmap is a graphical representation of data where values are depicted using colors. It’s useful for visualizing relationships in a matrix format.

Use: Visualize relationships in a matrix format.
Use Case: Analyzing the correlation between various financial indicators in a company.

Python Code:

import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'data' is your DataFrame
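# numeric_only=True (pandas 1.5+) skips non-numeric columns, which would otherwise raise an error in pandas 2.x
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')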
plt.title('Heatmap of Financial Indicators Correlation')
plt.show()

Violin Plot

Violin plots combine aspects of box plots and density plots, providing a richer description of the data distribution.

Use: Visualizing the distribution of a numerical variable across different categories.
Use Case: Comparing the distribution of house prices among different neighborhoods.

Python Code:

import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'data' is your DataFrame and 'neighborhood' and 'house_price' are the variables
sns.violinplot(x='neighborhood', y='house_price', data=data)
plt.title('Violin Plot of House Prices by Neighborhood')
plt.show()

Subplot and Pair Plot

Use: Subplots help display multiple plots in a single figure, and pair plots show pairwise relationships between variables.
Use Case: Examining the distribution, scatter plots, and relationships between multiple variables.

Python Code:

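import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'data' is your DataFrame; the column names below are placeholders

# Subplots: a 2x2 grid with one plot per cell
plt.subplot(2, 2, 1)
sns.histplot(data['column'])
plt.title('Histogram')

plt.subplot(2, 2, 2)
sns.scatterplot(x='x_variable', y='y_variable', data=data)
plt.title('Scatter Plot')

plt.subplot(2, 2, 3)
sns.boxplot(x='category_column', y='numeric_column', data=data)
plt.title('Box Plot')

plt.subplot(2, 2, 4)
sns.kdeplot(data['column'], fill=True)  # 'fill' replaces the deprecated 'shade'
plt.title('Kernel Density Plot')

plt.tight_layout()
plt.show()

# Pair Plot: pairplot creates its own figure, which suptitle then labels
sns.pairplot(data[['variable1', 'variable2', 'variable3']])
plt.suptitle('Pair Plot of Variables', y=1.02)  # y > 1 keeps the title above the grid
plt.show()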

By integrating numerical and graphical summaries into the fabric of EDA, we equip ourselves with a comprehensive toolkit. This arsenal empowers data explorers to unravel the intricacies of datasets, ensuring that no nuance escapes their scrutiny. Numerical precision and visual intuition converge, forming a symbiotic relationship that propels the journey from raw data to informed insights.

As we traverse the landscape of data understanding and sourcing, each step is a revelation. The journey is not just a progression; it’s an evolution. Embrace the evolving odyssey, where the possibilities are as limitless as the precision with which we navigate the data-driven frontier.

Unlocking Efficiency: The Power of Automated EDA in Python

Ready to elevate your data exploration game? I’ve prepared a special article, where you can dive into the world of Automated Exploratory Data Analysis (EDA) in Python. Uncover the tools and techniques that streamline your workflow, making data comprehension a breeze. Stay tuned for a journey into the realm where efficiency meets insights. The future of EDA is automated—be part of the revolution.

Project Guide – Data Understanding & Data Sourcing (2/2) – Part 5
