ML Zoomcamp 2023 – Introduction to Machine Learning – Part 12

Overview:

  1. Introduction to Pandas – Part 1/2
    1. Pandas DataFrame
    2. Pandas Series
    3. Adding columns to DataFrames
    4. Deleting columns from DataFrames
    5. Index
    6. Accessing elements
    7. Element-wise operations

Introduction to Pandas – Part 1/2

The last part of the Introduction to ML covers the Python package Pandas.

Pandas is a powerful and versatile data manipulation library in Python. It provides data structures and functions that are essential for data analysis and preprocessing tasks. With Pandas, you can easily load, manipulate, and analyze structured data.

One of the key data structures in Pandas is the DataFrame. It is a two-dimensional table-like structure that allows you to store and manipulate data in a row-column format. You can think of it as a spreadsheet or a SQL table. The DataFrame is designed to handle both homogeneous and heterogeneous data, making it suitable for a wide range of applications.

In addition to the DataFrame, Pandas also provides Series, which is a one-dimensional labeled array. It is similar to a column in a DataFrame and can be used to store and manipulate a single variable. Series are particularly useful when you need to perform operations on a specific column or extract a subset of data from a DataFrame.

Pandas also offers a rich set of functions for data manipulation. You can perform operations such as filtering, sorting, transforming, and aggregating data with ease. Additionally, Pandas integrates well with other Python libraries such as NumPy and Matplotlib, allowing you to seamlessly combine data manipulation, numerical computation, and data visualization tasks.

Whether you are working on a small data analysis project or dealing with large datasets, Pandas provides efficient and intuitive tools to handle your data. Its extensive documentation and active community support make it a popular choice among data scientists and analysts.

Pandas DataFrame

A Pandas DataFrame is basically a table. In the following snippet data is a list of lists. Each sublist is a row in the table. Each row represents one car.

data = [
    ['Nissan', 'Stanza', 1991, 138, 4, 'MANUAL', 'sedan', 2000],
    ['Hyundai', 'Sonata', 2017, None, 4, 'AUTOMATIC', 'Sedan', 27150],
    ['Lotus', 'Elise', 2010, 218, 4, 'MANUAL', 'convertible', 54990],
    ['GMC', 'Acadia',  2017, 194, 4, 'AUTOMATIC', '4dr SUV', 34450],
    ['Nissan', 'Frontier', 2017, 261, 6, 'MANUAL', 'Pickup', 32340],
]

columns = [
    'Make', 'Model', 'Year', 'Engine HP', 'Engine Cylinders',
    'Transmission Type', 'Vehicle_Style', 'MSRP'
]

pd.DataFrame(data)

The last line outputs a Pandas Dataframe but without column names.

01234567
0NissanStanza1991138.04MANUALsedan2000
1HyundaiSonata2017NaN4AUTOMATICSedan27150
2LotusElise2010218.04MANUALconvertible54990
3GMCAcadia2017194.04AUTOMATIC4dr SUV34450
4NissanFrontier2017261.06MANUALPickup32340
DataFrame without any column names

To display the correct output with column names, the columns parameter must be set.

df = pd.DataFrame(data, columns = columns)
df
MakeModelYearEngine HPEngine CylindersTransmission TypeVehicle_StyleMSRP
0NissanStanza1991138.04MANUALsedan2000
1HyundaiSonata2017NaN4AUTOMATICSedan27150
2LotusElise2010218.04MANUALconvertible54990
3GMCAcadia2017194.04AUTOMATIC4dr SUV34450
4NissanFrontier2017261.06MANUALPickup32340
DataFrame with column names

There is another way to create a DataFrame. This time we use a list of dictionaries. In a dictionary we explicitly specify the value for each column. That means a dictionary is a key-value structure, where the keys are the column names and the values are the car specific value for that key.

data = [
    {
        "Make": "Nissan",
        "Model": "Stanza",
        "Year": 1991,
        "Engine HP": 138.0,
        "Engine Cylinders": 4,
        "Transmission Type": "MANUAL",
        "Vehicle_Style": "sedan",
        "MSRP": 2000
    },
    {
        "Make": "Hyundai",
        "Model": "Sonata",
        "Year": 2017,
        "Engine HP": None,
        "Engine Cylinders": 4,
        "Transmission Type": "AUTOMATIC",
        "Vehicle_Style": "Sedan",
        "MSRP": 27150
    },
    {
        "Make": "Lotus",
        "Model": "Elise",
        "Year": 2010,
        "Engine HP": 218.0,
        "Engine Cylinders": 4,
        "Transmission Type": "MANUAL",
        "Vehicle_Style": "convertible",
        "MSRP": 54990
    },
    {
        "Make": "GMC",
        "Model": "Acadia",
        "Year": 2017,
        "Engine HP": 194.0,
        "Engine Cylinders": 4,
        "Transmission Type": "AUTOMATIC",
        "Vehicle_Style": "4dr SUV",
        "MSRP": 34450
    },
    {
        "Make": "Nissan",
        "Model": "Frontier",
        "Year": 2017,
        "Engine HP": 261.0,
        "Engine Cylinders": 6,
        "Transmission Type": "MANUAL",
        "Vehicle_Style": "Pickup",
        "MSRP": 32340
    }
]

pd.DataFrame(data)
MakeModelYearEngine HPEngine CylindersTransmission TypeVehicle_StyleMSRP
0NissanStanza1991138.04MANUALsedan2000
1HyundaiSonata2017NaN4AUTOMATICSedan27150
2LotusElise2010218.04MANUALconvertible54990
3GMCAcadia2017194.04AUTOMATIC4dr SUV34450
4NissanFrontier2017261.06MANUALPickup32340

After loading a DataFrame from csv file or from sql query the first thing to do is to look at the first rows to get a fast overview about the data.

df.head()
MakeModelYearEngine HPEngine CylindersTransmission TypeVehicle_StyleMSRP
0NissanStanza1991138.04MANUALsedan2000
1HyundaiSonata2017NaN4AUTOMATICSedan27150
2LotusElise2010218.04MANUALconvertible54990
3GMCAcadia2017194.04AUTOMATIC4dr SUV34450
4NissanFrontier2017261.06MANUALPickup32340
df.head() outputs the first rows of the DataFrame
df.head(n=2)
MakeModelYearEngine HPEngine CylindersTransmission TypeVehicle_StyleMSRP
0NissanStanza1991138.04MANUALsedan2000
1HyundaiSonata2017NaN4AUTOMATICSedan27150
df.head(n=2) outputs only the first two rows of the DataFrame

Pandas Series

A Pandas Series is a one-dimensional labeled array that can hold data of any type. It is similar to a column in a DataFrame and can be thought of as a single variable. Series provide a powerful way to manipulate and analyze specific columns or subsets of data from a DataFrame.

You can create a Series by passing a list or array of values to the pd.Series() function. For example, let’s create a Series of the “Make” column from the DataFrame:

make_series = pd.Series(df['Make'])

This will create a Series where each element corresponds to the “Make” value of a row in the DataFrame. You can then perform various operations on the Series, such as filtering, sorting, or computing statistics.

For instance, to filter the Series and select only the rows where the make is “Nissan”, you can use boolean indexing:

nissan_cars = make_series[make_series == 'Nissan']

You can also perform mathematical operations on Series, such as adding or multiplying values. For example, let’s say you have a Series representing the MSRP (Manufacturer’s Suggested Retail Price) of the cars:

msrp_series = pd.Series(df['MSRP'])

You can calculate the average MSRP using the mean() method:

average_msrp = msrp_series.mean()

You can also apply a function to each element of the Series using the apply() method. This allows you to perform custom operations on the data. For example, let’s say you want to convert the MSRP values from dollars to euros:

def convert_to_euros(value):
    return value * 0.85

msrp_in_euros = msrp_series.apply(convert_to_euros)

This will create a new Series msrp_in_euros where each value is the MSRP converted to euros.

In addition to the above operations, Pandas Series also provide a wide range of methods for data manipulation and analysis. You can find more information and examples in the official Pandas documentation.

By leveraging the power of Pandas Series, you can easily extract, transform, and analyze specific columns or subsets of data from a DataFrame, making it a versatile tool for data manipulation and analysis in Python.

# To access one serie you can use the dot notation...
df.Make
# Output:
# 0     Nissan
# 1    Hyundai
# 2      Lotus
# 3        GMC
# 4     Nissan
# Name: Make, dtype: object

# ... or the bracket notation 
df['Make']
# Output:
# 0     Nissan
# 1    Hyundai
# 2      Lotus
# 3        GMC
# 4     Nissan
# Name: Make, dtype: object

You can create a new subset of the DataFrame with a selection of columns using the bracket notation.

df[['Make', 'Model', 'MSRP']]
MakeModelMSRP
0NissanStanza2000
1HyundaiSonata27150
2LotusElise54990
3GMCAcadia34450
4NissanFrontier32340
Output of selected columns ‘Make’, ‘Model’, and ‘MSRP’

Adding columns to DataFrames

Sometimes it’s necessary to add one more column to the DataFrame. The next snippet shows how you can add a column ‘id’ with predefined values to the DataFrame.

df['id'] = [1, 2, 3, 4, 5]
df
MakeModelYearEngine HPEngine CylindersTransmission TypeVehicle_StyleMSRPid
0NissanStanza1991138.04MANUALsedan20001
1HyundaiSonata2017NaN4AUTOMATICSedan271502
2LotusElise2010218.04MANUALconvertible549903
3GMCAcadia2017194.04AUTOMATIC4dr SUV344504
4NissanFrontier2017261.06MANUALPickup323405
Adds a new column ‘id’ to the DataFrame and sets values for this new column
df['id']
# Output:
# 0    1
# 1    2
# 2    3
# 3    4
# 4    5
# Name: id, dtype: int64

df['id'] = [10, 20, 30, 40, 50]
df
MakeModelYearEngine HPEngine CylindersTransmission TypeVehicle_StyleMSRPid
0NissanStanza1991138.04MANUALsedan200010
1HyundaiSonata2017NaN4AUTOMATICSedan2715020
2LotusElise2010218.04MANUALconvertible5499030
3GMCAcadia2017194.04AUTOMATIC4dr SUV3445040
4NissanFrontier2017261.06MANUALPickup3234050
Output with changed values for the id column

Deleting columns from DataFrames

To delete a column you can use the del operator.

del df['id']
df
MakeModelYearEngine HPEngine CylindersTransmission TypeVehicle_StyleMSRP
0NissanStanza1991138.04MANUALsedan2000
1HyundaiSonata2017NaN4AUTOMATICSedan27150
2LotusElise2010218.04MANUALconvertible54990
3GMCAcadia2017194.04AUTOMATIC4dr SUV34450
4NissanFrontier2017261.06MANUALPickup32340
Output without id column

Index

The numbers in the first column refer to each row (called Index).

df.index
# Output: RangeIndex(start=0, stop=5, step=1)

Accessing elements

Using this index we can access the elements of the DataFrame. You can output only one element as shown in the first line of next snippet. But you can also return multiple rows as shown in the last line.

df.loc[1]
# Output:
# Make                   Hyundai
# Model                   Sonata
# Year                      2017
# Engine HP                  NaN
# Engine Cylinders             4
# Transmission Type    AUTOMATIC
# Vehicle_Style            Sedan
# MSRP                     27150
# Name: 1, dtype: object

df.loc[[1, 2]]
MakeModelYearEngine HPEngine CylindersTransmission TypeVehicle_StyleMSRP
1HyundaiSonata2017NaN4AUTOMATICSedan27150
2LotusElise2010218.04MANUALconvertible54990
Returning rows by accessing the index

We can also replace the index…

df.index = ['a', 'b', 'c', 'd', 'e']
df
MakeModelYearEngine HPEngine CylindersTransmission TypeVehicle_StyleMSRP
aNissanStanza1991138.04MANUALsedan2000
bHyundaiSonata2017NaN4AUTOMATICSedan27150
cLotusElise2010218.04MANUALconvertible54990
dGMCAcadia2017194.04AUTOMATIC4dr SUV34450
eNissanFrontier2017261.06MANUALPickup32340
Changing index to a categorical one
# usual index
df.loc[['b', 'c']]
MakeModelYearEngine HPEngine CylindersTransmission TypeVehicle_StyleMSRP
bHyundaiSonata2017NaN4AUTOMATICSedan27150
cLotusElise2010218.04MANUALconvertible54990
Returning rows by accessing a categorical index
# Still we can refer to a positional index (what we usually use in lists or NumPy arrays)
df.iloc[[1, 2, 4]]
MakeModelYearEngine HPEngine CylindersTransmission TypeVehicle_StyleMSRP
bHyundaiSonata2017NaN4AUTOMATICSedan27150
cLotusElise2010218.04MANUALconvertible54990
eNissanFrontier2017261.06MANUALPickup32340
Returning rows by accessing a positional index with iloc

To reset the index we can use reset_index function. This keeps the previous index and creates a new column called index.

df.reset_index()
indexMakeModelYearEngine HPEngine CylindersTransmission TypeVehicle_StyleMSRP
0aNissanStanza1991138.04MANUALsedan2000
1bHyundaiSonata2017NaN4AUTOMATICSedan27150
2cLotusElise2010218.04MANUALconvertible54990
3dGMCAcadia2017194.04AUTOMATIC4dr SUV34450
4eNissanFrontier2017261.06MANUALPickup32340
DataFrame after resetting the index

If the old index is not needed it can simply be dropped.

df = df.reset_index(drop=True)
df
MakeModelYearEngine HPEngine CylindersTransmission TypeVehicle_StyleMSRP
0NissanStanza1991138.04MANUALsedan2000
1HyundaiSonata2017NaN4AUTOMATICSedan27150
2LotusElise2010218.04MANUALconvertible54990
3GMCAcadia2017194.04AUTOMATIC4dr SUV34450
4NissanFrontier2017261.06MANUALPickup32340
DataFrame after removing the old index

Element-wise operations

When looking at data, NaN values can occur. NaN means “not a number”. The following example illustrates this.

# NaN denotes a missing number
df['Engine HP']
# Output:
# 0    138.0
# 1      NaN
# 2    218.0
# 3    194.0
# 4    261.0
# Name: Engine HP, dtype: float64

We can do everything we can do in NumPy, but here we operate on series from Pandas (instead of NumPy arrays). The main difference is, that you have here an index and a name. Under the hood Pandas actually uses NumPy. The next snippet shows the devision of every element by 100.

df['Engine HP'] / 100
# Output:
# 0    1.38
# 1     NaN
# 2    2.18
# 3    1.94
# 4    2.61
# Name: Engine HP, dtype: float64

But it’s also possible to use logical operators.

df.Year
# Output:
# 0    1991
# 1    2017
# 2    2010
# 3    2017
# 4    2017
# Name: Year, dtype: int64

df.Year >= 2015
# Output:
# 0    False
# 1     True
# 2    False
# 3     True
# 4     True
# Name: Year, dtype: bool

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.