Overview:

Introduction to Pandas – Part 1/2

Introduction to Pandas – Part 1/2

The last part of the Introduction to ML covers the Python package Pandas.

Pandas is a powerful and versatile data manipulation library in Python. It provides data structures and functions that are essential for data analysis and preprocessing tasks. With Pandas, you can easily load, manipulate, and analyze structured data.

One of the key data structures in Pandas is the DataFrame. It is a two-dimensional table-like structure that allows you to store and manipulate data in a row-column format. You can think of it as a spreadsheet or a SQL table. The DataFrame is designed to handle both homogeneous and heterogeneous data, making it suitable for a wide range of applications.

In addition to the DataFrame, Pandas also provides Series, which is a one-dimensional labeled array. It is similar to a column in a DataFrame and can be used to store and manipulate a single variable. Series are particularly useful when you need to perform operations on a specific column or extract a subset of data from a DataFrame.

Pandas also offers a rich set of functions for data manipulation. You can perform operations such as filtering, sorting, transforming, and aggregating data with ease. Additionally, Pandas integrates well with other Python libraries such as NumPy and Matplotlib, allowing you to seamlessly combine data manipulation, numerical computation, and data visualization tasks.

Whether you are working on a small data analysis project or dealing with large datasets, Pandas provides efficient and intuitive tools to handle your data. Its extensive documentation and active community support make it a popular choice among data scientists and analysts.

Pandas DataFrame

A Pandas DataFrame is basically a table. In the following snippet data is a list of lists. Each sublist is a row in the table. Each row represents one car.

data = [
    ['Nissan', 'Stanza', 1991, 138, 4, 'MANUAL', 'sedan', 2000],
    ['Hyundai', 'Sonata', 2017, None, 4, 'AUTOMATIC', 'Sedan', 27150],
    ['Lotus', 'Elise', 2010, 218, 4, 'MANUAL', 'convertible', 54990],
    ['GMC', 'Acadia',  2017, 194, 4, 'AUTOMATIC', '4dr SUV', 34450],
    ['Nissan', 'Frontier', 2017, 261, 6, 'MANUAL', 'Pickup', 32340],
]

columns = [
    'Make', 'Model', 'Year', 'Engine HP', 'Engine Cylinders',
    'Transmission Type', 'Vehicle_Style', 'MSRP'
]

pd.DataFrame(data)

The last line outputs a Pandas Dataframe but without column names.

	0	1	2	3	4	5	6	7
0	Nissan	Stanza	1991	138.0	4	MANUAL	sedan	2000
1	Hyundai	Sonata	2017	NaN	4	AUTOMATIC	Sedan	27150
2	Lotus	Elise	2010	218.0	4	MANUAL	convertible	54990
3	GMC	Acadia	2017	194.0	4	AUTOMATIC	4dr SUV	34450
4	Nissan	Frontier	2017	261.0	6	MANUAL	Pickup	32340

DataFrame without any column names

To display the correct output with column names, the columns parameter must be set.

df = pd.DataFrame(data, columns = columns)
df

	Make	Model	Year	Engine HP	Engine Cylinders	Transmission Type	Vehicle_Style	MSRP
0	Nissan	Stanza	1991	138.0	4	MANUAL	sedan	2000
1	Hyundai	Sonata	2017	NaN	4	AUTOMATIC	Sedan	27150
2	Lotus	Elise	2010	218.0	4	MANUAL	convertible	54990
3	GMC	Acadia	2017	194.0	4	AUTOMATIC	4dr SUV	34450
4	Nissan	Frontier	2017	261.0	6	MANUAL	Pickup	32340

DataFrame with column names

There is another way to create a DataFrame. This time we use a list of dictionaries. In a dictionary we explicitly specify the value for each column. That means a dictionary is a key-value structure, where the keys are the column names and the values are the car specific value for that key.

data = [
    {
        "Make": "Nissan",
        "Model": "Stanza",
        "Year": 1991,
        "Engine HP": 138.0,
        "Engine Cylinders": 4,
        "Transmission Type": "MANUAL",
        "Vehicle_Style": "sedan",
        "MSRP": 2000
    },
    {
        "Make": "Hyundai",
        "Model": "Sonata",
        "Year": 2017,
        "Engine HP": None,
        "Engine Cylinders": 4,
        "Transmission Type": "AUTOMATIC",
        "Vehicle_Style": "Sedan",
        "MSRP": 27150
    },
    {
        "Make": "Lotus",
        "Model": "Elise",
        "Year": 2010,
        "Engine HP": 218.0,
        "Engine Cylinders": 4,
        "Transmission Type": "MANUAL",
        "Vehicle_Style": "convertible",
        "MSRP": 54990
    },
    {
        "Make": "GMC",
        "Model": "Acadia",
        "Year": 2017,
        "Engine HP": 194.0,
        "Engine Cylinders": 4,
        "Transmission Type": "AUTOMATIC",
        "Vehicle_Style": "4dr SUV",
        "MSRP": 34450
    },
    {
        "Make": "Nissan",
        "Model": "Frontier",
        "Year": 2017,
        "Engine HP": 261.0,
        "Engine Cylinders": 6,
        "Transmission Type": "MANUAL",
        "Vehicle_Style": "Pickup",
        "MSRP": 32340
    }
]

pd.DataFrame(data)

	Make	Model	Year	Engine HP	Engine Cylinders	Transmission Type	Vehicle_Style	MSRP
0	Nissan	Stanza	1991	138.0	4	MANUAL	sedan	2000
1	Hyundai	Sonata	2017	NaN	4	AUTOMATIC	Sedan	27150
2	Lotus	Elise	2010	218.0	4	MANUAL	convertible	54990
3	GMC	Acadia	2017	194.0	4	AUTOMATIC	4dr SUV	34450
4	Nissan	Frontier	2017	261.0	6	MANUAL	Pickup	32340

After loading a DataFrame from csv file or from sql query the first thing to do is to look at the first rows to get a fast overview about the data.

df.head()

	Make	Model	Year	Engine HP	Engine Cylinders	Transmission Type	Vehicle_Style	MSRP
0	Nissan	Stanza	1991	138.0	4	MANUAL	sedan	2000
1	Hyundai	Sonata	2017	NaN	4	AUTOMATIC	Sedan	27150
2	Lotus	Elise	2010	218.0	4	MANUAL	convertible	54990
3	GMC	Acadia	2017	194.0	4	AUTOMATIC	4dr SUV	34450
4	Nissan	Frontier	2017	261.0	6	MANUAL	Pickup	32340

df.head() outputs the first rows of the DataFrame

df.head(n=2)

	Make	Model	Year	Engine HP	Engine Cylinders	Transmission Type	Vehicle_Style	MSRP
0	Nissan	Stanza	1991	138.0	4	MANUAL	sedan	2000
1	Hyundai	Sonata	2017	NaN	4	AUTOMATIC	Sedan	27150

df.head(n=2) outputs only the first two rows of the DataFrame

Pandas Series

A Pandas Series is a one-dimensional labeled array that can hold data of any type. It is similar to a column in a DataFrame and can be thought of as a single variable. Series provide a powerful way to manipulate and analyze specific columns or subsets of data from a DataFrame.

You can create a Series by passing a list or array of values to the pd.Series() function. For example, let’s create a Series of the “Make” column from the DataFrame:

make_series = pd.Series(df['Make'])

This will create a Series where each element corresponds to the “Make” value of a row in the DataFrame. You can then perform various operations on the Series, such as filtering, sorting, or computing statistics.

For instance, to filter the Series and select only the rows where the make is “Nissan”, you can use boolean indexing:

nissan_cars = make_series[make_series == 'Nissan']

You can also perform mathematical operations on Series, such as adding or multiplying values. For example, let’s say you have a Series representing the MSRP (Manufacturer’s Suggested Retail Price) of the cars:

msrp_series = pd.Series(df['MSRP'])

You can calculate the average MSRP using the mean() method:

average_msrp = msrp_series.mean()

You can also apply a function to each element of the Series using the apply() method. This allows you to perform custom operations on the data. For example, let’s say you want to convert the MSRP values from dollars to euros:

def convert_to_euros(value):
    return value * 0.85

msrp_in_euros = msrp_series.apply(convert_to_euros)

This will create a new Series msrp_in_euros where each value is the MSRP converted to euros.

In addition to the above operations, Pandas Series also provide a wide range of methods for data manipulation and analysis. You can find more information and examples in the official Pandas documentation.

By leveraging the power of Pandas Series, you can easily extract, transform, and analyze specific columns or subsets of data from a DataFrame, making it a versatile tool for data manipulation and analysis in Python.

# To access one serie you can use the dot notation...
df.Make
# Output:
# 0     Nissan
# 1    Hyundai
# 2      Lotus
# 3        GMC
# 4     Nissan
# Name: Make, dtype: object

# ... or the bracket notation 
df['Make']
# Output:
# 0     Nissan
# 1    Hyundai
# 2      Lotus
# 3        GMC
# 4     Nissan
# Name: Make, dtype: object

You can create a new subset of the DataFrame with a selection of columns using the bracket notation.

df[['Make', 'Model', 'MSRP']]

	Make	Model	MSRP
0	Nissan	Stanza	2000
1	Hyundai	Sonata	27150
2	Lotus	Elise	54990
3	GMC	Acadia	34450
4	Nissan	Frontier	32340

Output of selected columns ‘Make’, ‘Model’, and ‘MSRP’

Adding columns to DataFrames

Sometimes it’s necessary to add one more column to the DataFrame. The next snippet shows how you can add a column ‘id’ with predefined values to the DataFrame.

df['id'] = [1, 2, 3, 4, 5]
df

	Make	Model	Year	Engine HP	Engine Cylinders	Transmission Type	Vehicle_Style	MSRP	id
0	Nissan	Stanza	1991	138.0	4	MANUAL	sedan	2000	1
1	Hyundai	Sonata	2017	NaN	4	AUTOMATIC	Sedan	27150	2
2	Lotus	Elise	2010	218.0	4	MANUAL	convertible	54990	3
3	GMC	Acadia	2017	194.0	4	AUTOMATIC	4dr SUV	34450	4
4	Nissan	Frontier	2017	261.0	6	MANUAL	Pickup	32340	5

Adds a new column ‘id’ to the DataFrame and sets values for this new column

df['id']
# Output:
# 0    1
# 1    2
# 2    3
# 3    4
# 4    5
# Name: id, dtype: int64

df['id'] = [10, 20, 30, 40, 50]
df

	Make	Model	Year	Engine HP	Engine Cylinders	Transmission Type	Vehicle_Style	MSRP	id
0	Nissan	Stanza	1991	138.0	4	MANUAL	sedan	2000	10
1	Hyundai	Sonata	2017	NaN	4	AUTOMATIC	Sedan	27150	20
2	Lotus	Elise	2010	218.0	4	MANUAL	convertible	54990	30
3	GMC	Acadia	2017	194.0	4	AUTOMATIC	4dr SUV	34450	40
4	Nissan	Frontier	2017	261.0	6	MANUAL	Pickup	32340	50

Output with changed values for the id column

Deleting columns from DataFrames

To delete a column you can use the del operator.

del df['id']
df

	Make	Model	Year	Engine HP	Engine Cylinders	Transmission Type	Vehicle_Style	MSRP
0	Nissan	Stanza	1991	138.0	4	MANUAL	sedan	2000
1	Hyundai	Sonata	2017	NaN	4	AUTOMATIC	Sedan	27150
2	Lotus	Elise	2010	218.0	4	MANUAL	convertible	54990
3	GMC	Acadia	2017	194.0	4	AUTOMATIC	4dr SUV	34450
4	Nissan	Frontier	2017	261.0	6	MANUAL	Pickup	32340

Output without id column

Index

The numbers in the first column refer to each row (called Index).

df.index
# Output: RangeIndex(start=0, stop=5, step=1)

Accessing elements

Using this index we can access the elements of the DataFrame. You can output only one element as shown in the first line of next snippet. But you can also return multiple rows as shown in the last line.

df.loc[1]
# Output:
# Make                   Hyundai
# Model                   Sonata
# Year                      2017
# Engine HP                  NaN
# Engine Cylinders             4
# Transmission Type    AUTOMATIC
# Vehicle_Style            Sedan
# MSRP                     27150
# Name: 1, dtype: object

df.loc[[1, 2]]

	Make	Model	Year	Engine HP	Engine Cylinders	Transmission Type	Vehicle_Style	MSRP
1	Hyundai	Sonata	2017	NaN	4	AUTOMATIC	Sedan	27150
2	Lotus	Elise	2010	218.0	4	MANUAL	convertible	54990

Returning rows by accessing the index

We can also replace the index…

df.index = ['a', 'b', 'c', 'd', 'e']
df

	Make	Model	Year	Engine HP	Engine Cylinders	Transmission Type	Vehicle_Style	MSRP
a	Nissan	Stanza	1991	138.0	4	MANUAL	sedan	2000
b	Hyundai	Sonata	2017	NaN	4	AUTOMATIC	Sedan	27150
c	Lotus	Elise	2010	218.0	4	MANUAL	convertible	54990
d	GMC	Acadia	2017	194.0	4	AUTOMATIC	4dr SUV	34450
e	Nissan	Frontier	2017	261.0	6	MANUAL	Pickup	32340

Changing index to a categorical one

# usual index
df.loc[['b', 'c']]

	Make	Model	Year	Engine HP	Engine Cylinders	Transmission Type	Vehicle_Style	MSRP
b	Hyundai	Sonata	2017	NaN	4	AUTOMATIC	Sedan	27150
c	Lotus	Elise	2010	218.0	4	MANUAL	convertible	54990

Returning rows by accessing a categorical index

# Still we can refer to a positional index (what we usually use in lists or NumPy arrays)
df.iloc[[1, 2, 4]]

	Make	Model	Year	Engine HP	Engine Cylinders	Transmission Type	Vehicle_Style	MSRP
b	Hyundai	Sonata	2017	NaN	4	AUTOMATIC	Sedan	27150
c	Lotus	Elise	2010	218.0	4	MANUAL	convertible	54990
e	Nissan	Frontier	2017	261.0	6	MANUAL	Pickup	32340

Returning rows by accessing a positional index with iloc

To reset the index we can use reset_index function. This keeps the previous index and creates a new column called index.

df.reset_index()

	index	Make	Model	Year	Engine HP	Engine Cylinders	Transmission Type	Vehicle_Style	MSRP
0	a	Nissan	Stanza	1991	138.0	4	MANUAL	sedan	2000
1	b	Hyundai	Sonata	2017	NaN	4	AUTOMATIC	Sedan	27150
2	c	Lotus	Elise	2010	218.0	4	MANUAL	convertible	54990
3	d	GMC	Acadia	2017	194.0	4	AUTOMATIC	4dr SUV	34450
4	e	Nissan	Frontier	2017	261.0	6	MANUAL	Pickup	32340

DataFrame after resetting the index

If the old index is not needed it can simply be dropped.

df = df.reset_index(drop=True)
df

	Make	Model	Year	Engine HP	Engine Cylinders	Transmission Type	Vehicle_Style	MSRP
0	Nissan	Stanza	1991	138.0	4	MANUAL	sedan	2000
1	Hyundai	Sonata	2017	NaN	4	AUTOMATIC	Sedan	27150
2	Lotus	Elise	2010	218.0	4	MANUAL	convertible	54990
3	GMC	Acadia	2017	194.0	4	AUTOMATIC	4dr SUV	34450
4	Nissan	Frontier	2017	261.0	6	MANUAL	Pickup	32340

DataFrame after removing the old index

Element-wise operations

When looking at data, NaN values can occur. NaN means “not a number”. The following example illustrates this.

# NaN denotes a missing number
df['Engine HP']
# Output:
# 0    138.0
# 1      NaN
# 2    218.0
# 3    194.0
# 4    261.0
# Name: Engine HP, dtype: float64

We can do everything we can do in NumPy, but here we operate on series from Pandas (instead of NumPy arrays). The main difference is, that you have here an index and a name. Under the hood Pandas actually uses NumPy. The next snippet shows the devision of every element by 100.

df['Engine HP'] / 100
# Output:
# 0    1.38
# 1     NaN
# 2    2.18
# 3    1.94
# 4    2.61
# Name: Engine HP, dtype: float64

But it’s also possible to use logical operators.

df.Year
# Output:
# 0    1991
# 1    2017
# 2    2010
# 3    2017
# 4    2017
# Name: Year, dtype: int64

df.Year >= 2015
# Output:
# 0    False
# 1     True
# 2    False
# 3     True
# 4     True
# Name: Year, dtype: bool

ML Zoomcamp 2023 – Introduction to Machine Learning – Part 12

Introduction to Pandas – Part 1/2

Pandas DataFrame

Pandas Series

Adding columns to DataFrames

Deleting columns from DataFrames

Index

Accessing elements

Element-wise operations

Leave a comment Cancel reply

Introduction to Pandas – Part 1/2

Pandas DataFrame

Pandas Series

Adding columns to DataFrames

Deleting columns from DataFrames

Index

Accessing elements

Element-wise operations

Teilen mit:

Related

Leave a comment Cancel reply