Overview:
Introduction to Pandas – Part 1/2
The last part of the Introduction to ML covers the Python package Pandas.
Pandas is a powerful and versatile data manipulation library in Python. It provides data structures and functions that are essential for data analysis and preprocessing tasks. With Pandas, you can easily load, manipulate, and analyze structured data.
One of the key data structures in Pandas is the DataFrame. It is a two-dimensional table-like structure that allows you to store and manipulate data in a row-column format. You can think of it as a spreadsheet or a SQL table. The DataFrame is designed to handle both homogeneous and heterogeneous data, making it suitable for a wide range of applications.
In addition to the DataFrame, Pandas also provides Series, which is a one-dimensional labeled array. It is similar to a column in a DataFrame and can be used to store and manipulate a single variable. Series are particularly useful when you need to perform operations on a specific column or extract a subset of data from a DataFrame.
Pandas also offers a rich set of functions for data manipulation. You can perform operations such as filtering, sorting, transforming, and aggregating data with ease. Additionally, Pandas integrates well with other Python libraries such as NumPy and Matplotlib, allowing you to seamlessly combine data manipulation, numerical computation, and data visualization tasks.
Whether you are working on a small data analysis project or dealing with large datasets, Pandas provides efficient and intuitive tools to handle your data. Its extensive documentation and active community support make it a popular choice among data scientists and analysts.
Pandas DataFrame
A Pandas DataFrame is basically a table. In the following snippet data is a list of lists. Each sublist is a row in the table. Each row represents one car.
data = [
['Nissan', 'Stanza', 1991, 138, 4, 'MANUAL', 'sedan', 2000],
['Hyundai', 'Sonata', 2017, None, 4, 'AUTOMATIC', 'Sedan', 27150],
['Lotus', 'Elise', 2010, 218, 4, 'MANUAL', 'convertible', 54990],
['GMC', 'Acadia', 2017, 194, 4, 'AUTOMATIC', '4dr SUV', 34450],
['Nissan', 'Frontier', 2017, 261, 6, 'MANUAL', 'Pickup', 32340],
]
columns = [
'Make', 'Model', 'Year', 'Engine HP', 'Engine Cylinders',
'Transmission Type', 'Vehicle_Style', 'MSRP'
]
pd.DataFrame(data)
The last line outputs a Pandas Dataframe but without column names.
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |
To display the correct output with column names, the columns parameter must be set.
df = pd.DataFrame(data, columns = columns)
df
| Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP | |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |
There is another way to create a DataFrame. This time we use a list of dictionaries. In a dictionary we explicitly specify the value for each column. That means a dictionary is a key-value structure, where the keys are the column names and the values are the car specific value for that key.
data = [
{
"Make": "Nissan",
"Model": "Stanza",
"Year": 1991,
"Engine HP": 138.0,
"Engine Cylinders": 4,
"Transmission Type": "MANUAL",
"Vehicle_Style": "sedan",
"MSRP": 2000
},
{
"Make": "Hyundai",
"Model": "Sonata",
"Year": 2017,
"Engine HP": None,
"Engine Cylinders": 4,
"Transmission Type": "AUTOMATIC",
"Vehicle_Style": "Sedan",
"MSRP": 27150
},
{
"Make": "Lotus",
"Model": "Elise",
"Year": 2010,
"Engine HP": 218.0,
"Engine Cylinders": 4,
"Transmission Type": "MANUAL",
"Vehicle_Style": "convertible",
"MSRP": 54990
},
{
"Make": "GMC",
"Model": "Acadia",
"Year": 2017,
"Engine HP": 194.0,
"Engine Cylinders": 4,
"Transmission Type": "AUTOMATIC",
"Vehicle_Style": "4dr SUV",
"MSRP": 34450
},
{
"Make": "Nissan",
"Model": "Frontier",
"Year": 2017,
"Engine HP": 261.0,
"Engine Cylinders": 6,
"Transmission Type": "MANUAL",
"Vehicle_Style": "Pickup",
"MSRP": 32340
}
]
pd.DataFrame(data)
| Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP | |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |
After loading a DataFrame from csv file or from sql query the first thing to do is to look at the first rows to get a fast overview about the data.
df.head()
| Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP | |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |
df.head(n=2)
| Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP | |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
Pandas Series
A Pandas Series is a one-dimensional labeled array that can hold data of any type. It is similar to a column in a DataFrame and can be thought of as a single variable. Series provide a powerful way to manipulate and analyze specific columns or subsets of data from a DataFrame.
You can create a Series by passing a list or array of values to the pd.Series() function. For example, let’s create a Series of the “Make” column from the DataFrame:
make_series = pd.Series(df['Make'])
This will create a Series where each element corresponds to the “Make” value of a row in the DataFrame. You can then perform various operations on the Series, such as filtering, sorting, or computing statistics.
For instance, to filter the Series and select only the rows where the make is “Nissan”, you can use boolean indexing:
nissan_cars = make_series[make_series == 'Nissan']
You can also perform mathematical operations on Series, such as adding or multiplying values. For example, let’s say you have a Series representing the MSRP (Manufacturer’s Suggested Retail Price) of the cars:
msrp_series = pd.Series(df['MSRP'])
You can calculate the average MSRP using the mean() method:
average_msrp = msrp_series.mean()
You can also apply a function to each element of the Series using the apply() method. This allows you to perform custom operations on the data. For example, let’s say you want to convert the MSRP values from dollars to euros:
def convert_to_euros(value):
return value * 0.85
msrp_in_euros = msrp_series.apply(convert_to_euros)
This will create a new Series msrp_in_euros where each value is the MSRP converted to euros.
In addition to the above operations, Pandas Series also provide a wide range of methods for data manipulation and analysis. You can find more information and examples in the official Pandas documentation.
By leveraging the power of Pandas Series, you can easily extract, transform, and analyze specific columns or subsets of data from a DataFrame, making it a versatile tool for data manipulation and analysis in Python.
# To access one serie you can use the dot notation...
df.Make
# Output:
# 0 Nissan
# 1 Hyundai
# 2 Lotus
# 3 GMC
# 4 Nissan
# Name: Make, dtype: object
# ... or the bracket notation
df['Make']
# Output:
# 0 Nissan
# 1 Hyundai
# 2 Lotus
# 3 GMC
# 4 Nissan
# Name: Make, dtype: object
You can create a new subset of the DataFrame with a selection of columns using the bracket notation.
df[['Make', 'Model', 'MSRP']]
| Make | Model | MSRP | |
|---|---|---|---|
| 0 | Nissan | Stanza | 2000 |
| 1 | Hyundai | Sonata | 27150 |
| 2 | Lotus | Elise | 54990 |
| 3 | GMC | Acadia | 34450 |
| 4 | Nissan | Frontier | 32340 |
Adding columns to DataFrames
Sometimes it’s necessary to add one more column to the DataFrame. The next snippet shows how you can add a column ‘id’ with predefined values to the DataFrame.
df['id'] = [1, 2, 3, 4, 5]
df
| Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP | id | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 | 1 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 | 2 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 | 3 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 | 4 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 | 5 |
df['id']
# Output:
# 0 1
# 1 2
# 2 3
# 3 4
# 4 5
# Name: id, dtype: int64
df['id'] = [10, 20, 30, 40, 50]
df
| Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP | id | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 | 10 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 | 20 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 | 30 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 | 40 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 | 50 |
Deleting columns from DataFrames
To delete a column you can use the del operator.
del df['id']
df
| Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP | |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |
Index
The numbers in the first column refer to each row (called Index).
df.index
# Output: RangeIndex(start=0, stop=5, step=1)
Accessing elements
Using this index we can access the elements of the DataFrame. You can output only one element as shown in the first line of next snippet. But you can also return multiple rows as shown in the last line.
df.loc[1]
# Output:
# Make Hyundai
# Model Sonata
# Year 2017
# Engine HP NaN
# Engine Cylinders 4
# Transmission Type AUTOMATIC
# Vehicle_Style Sedan
# MSRP 27150
# Name: 1, dtype: object
df.loc[[1, 2]]
| Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP | |
|---|---|---|---|---|---|---|---|---|
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
We can also replace the index…
df.index = ['a', 'b', 'c', 'd', 'e']
df
| Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP | |
|---|---|---|---|---|---|---|---|---|
| a | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| b | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| c | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| d | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| e | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |
# usual index
df.loc[['b', 'c']]
| Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP | |
|---|---|---|---|---|---|---|---|---|
| b | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| c | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
# Still we can refer to a positional index (what we usually use in lists or NumPy arrays)
df.iloc[[1, 2, 4]]
| Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP | |
|---|---|---|---|---|---|---|---|---|
| b | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| c | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| e | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |
To reset the index we can use reset_index function. This keeps the previous index and creates a new column called index.
df.reset_index()
| index | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | a | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | b | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | c | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | d | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | e | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |
If the old index is not needed it can simply be dropped.
df = df.reset_index(drop=True)
df
| Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP | |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |
Element-wise operations
When looking at data, NaN values can occur. NaN means “not a number”. The following example illustrates this.
# NaN denotes a missing number
df['Engine HP']
# Output:
# 0 138.0
# 1 NaN
# 2 218.0
# 3 194.0
# 4 261.0
# Name: Engine HP, dtype: float64
We can do everything we can do in NumPy, but here we operate on series from Pandas (instead of NumPy arrays). The main difference is, that you have here an index and a name. Under the hood Pandas actually uses NumPy. The next snippet shows the devision of every element by 100.
df['Engine HP'] / 100
# Output:
# 0 1.38
# 1 NaN
# 2 2.18
# 3 1.94
# 4 2.61
# Name: Engine HP, dtype: float64
But it’s also possible to use logical operators.
df.Year
# Output:
# 0 1991
# 1 2017
# 2 2010
# 3 2017
# 4 2017
# Name: Year, dtype: int64
df.Year >= 2015
# Output:
# 0 False
# 1 True
# 2 False
# 3 True
# 4 True
# Name: Year, dtype: bool