ML Zoomcamp 2023 – Introduction to Machine Learning – Part 13

Overview:

  1. Introduction to Pandas – Part 2/2
    1. Filtering
    2. String operations
    3. Summarizing operations
      1. Numerical Columns
        1. Describe function
      2. Categorical Columns
    4. Missing values
    5. Grouping
    6. Getting the NumPy arrays
    7. Getting a list of dictionaries

Introduction to Pandas – Part 2/2

Filtering

Filtering refers to the process of selecting specific rows or columns from a DataFrame based on certain conditions. In Pandas, we can use various techniques to filter our data.

One common technique is to use boolean indexing. This involves creating a boolean mask that specifies the conditions we want to apply to our data. The mask is a DataFrame or Series of the same shape as the original data, where each element is either True or False depending on whether the corresponding element in the original data satisfies the condition.

For example, if we have a DataFrame df with a column named Year, and we want to select all rows where the year is greater than or equal to 2015, we can create our condition as follows:

condition = df.Year >= 2015
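
For the example DataFrame used in this post (with the years 1991, 2017, 2010, 2017 and 2017), this mask would be a boolean Series along these lines:

condition
# Output:
# 0    False
# 1     True
# 2    False
# 3     True
# 4     True
# Name: Year, dtype: bool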

We can then use this condition to select the desired rows from the DataFrame:

filtered_df = df[condition]

# or like this
df[df.Year >= 2015]

Another useful technique for filtering is the .query() method. It allows us to filter rows based on a string expression, similar to writing SQL queries. For example, to select all rows where the year is greater than or equal to 2015 using .query(), we can do:

filtered_df = df.query('Year >= 2015')
   Make     Model     Year  Engine HP  Engine Cylinders  Transmission Type  Vehicle_Style  MSRP
1  Hyundai  Sonata    2017  NaN        4                 AUTOMATIC          Sedan          27150
3  GMC      Acadia    2017  194.0      4                 AUTOMATIC          4dr SUV        34450
4  Nissan   Frontier  2017  261.0      6                 MANUAL             Pickup         32340
Using filtering Year >= 2015

The next example is a filter on the make Nissan.

df[
    df.Make == 'Nissan'
]
   Make     Model     Year  Engine HP  Engine Cylinders  Transmission Type  Vehicle_Style  MSRP
0  Nissan   Stanza    1991  138.0      4                 MANUAL             sedan          2000
4  Nissan   Frontier  2017  261.0      6                 MANUAL             Pickup         32340
Using filtering Make == ‘Nissan’

It is also possible to combine several conditions; for example, if you want all Nissans built after 2015:

df[
    (df.Make == 'Nissan') & (df.Year > 2015)
]
   Make     Model     Year  Engine HP  Engine Cylinders  Transmission Type  Vehicle_Style  MSRP
4  Nissan   Frontier  2017  261.0      6                 MANUAL             Pickup         32340
Using chained filtering Make == ‘Nissan’ & Year > 2015
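
As a side note, the same combined filter can also be written with the .query() method, for example:

df.query("Make == 'Nissan' and Year > 2015")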

By using these filtering techniques, we can easily extract the data we need from our DataFrame and perform further analysis or computations on it. This can be particularly useful when working with large datasets or when dealing with complex conditions.

String operations

Next, let’s explore string operations in Pandas. This is something NumPy doesn’t offer, since NumPy is mostly designed for processing numbers; in Pandas, however, we often have to deal with strings.

df
   Make     Model     Year  Engine HP  Engine Cylinders  Transmission Type  Vehicle_Style  MSRP
0  Nissan   Stanza    1991  138.0      4                 MANUAL             sedan          2000
1  Hyundai  Sonata    2017  NaN        4                 AUTOMATIC          Sedan          27150
2  Lotus    Elise     2010  218.0      4                 MANUAL             convertible    54990
3  GMC      Acadia    2017  194.0      4                 AUTOMATIC          4dr SUV        34450
4  Nissan   Frontier  2017  261.0      6                 MANUAL             Pickup         32340
Unchanged DataFrame

The next snippet shows the Vehicle_Style column. You can immediately see the inconsistent formatting: the same words are written with different capitalization, and some values contain spaces.

df['Vehicle_Style']
# Output:
# 0          sedan
# 1          Sedan
# 2    convertible
# 3        4dr SUV
# 4         Pickup
# Name: Vehicle_Style, dtype: object

Python strings have a lower() method that converts them to lowercase, as you can see here:

'STRr'.lower()
# Output: 'strr'

Applied to the column via the .str accessor, the Vehicle_Style values look more consistent, but the spaces are still there…

df['Vehicle_Style'].str.lower()
# Output:
# 0          sedan
# 1          sedan
# 2    convertible
# 3        4dr suv
# 4         pickup
# Name: Vehicle_Style, dtype: object

A typical pre-processing step when you work with text is to replace all spaces with underscores.

'machine learning zoomcamp'.replace(' ','_')
# Output: 'machine_learning_zoomcamp'

Applied to the column, this replaces the spaces with underscores, although the inconsistent capitalization remains:

df['Vehicle_Style'].str.replace(' ','_')
# Output:
# 0          sedan
# 1          Sedan
# 2    convertible
# 3        4dr_SUV
# 4         Pickup
# Name: Vehicle_Style, dtype: object

# Both operations can be chained
df['Vehicle_Style'] = df['Vehicle_Style'].str.lower().str.replace(' ','_')
df
   Make     Model     Year  Engine HP  Engine Cylinders  Transmission Type  Vehicle_Style  MSRP
0  Nissan   Stanza    1991  138.0      4                 MANUAL             sedan          2000
1  Hyundai  Sonata    2017  NaN        4                 AUTOMATIC          sedan          27150
2  Lotus    Elise     2010  218.0      4                 MANUAL             convertible    54990
3  GMC      Acadia    2017  194.0      4                 AUTOMATIC          4dr_suv        34450
4  Nissan   Frontier  2017  261.0      6                 MANUAL             pickup         32340
Applying string operations to Vehicle_Style column

Summarizing operations

Pandas also provides summarizing operations such as min(), max() and mean(). They can be applied to a single column, for example MSRP:

df.MSRP
# Output:
# 0     2000
# 1    27150
# 2    54990
# 3    34450
# 4    32340
# Name: MSRP, dtype: int64

df.MSRP.min()
# Output: 2000

df.MSRP.max()
# Output: 54990

df.MSRP.mean()
# Output: 30186.0
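
Other summarizing functions work the same way. For example, sum() adds up the whole column (the result follows from the MSRP values above):

df.MSRP.sum()
# Output: 150930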

Numerical Columns

Describe function

It can be very helpful to get an overview with the describe function. This gives you a lot of information at once (count, mean, std, min, max, and the 25%, 50% and 75% quantiles). By default it is applied only to numerical columns. You can use it for one column…

df.MSRP.describe()
# Output:
# count        5.000000
# mean     30186.000000
# std      18985.044904
# min       2000.000000
# 25%      27150.000000
# 50%      32340.000000
# 75%      34450.000000
# max      54990.000000
# Name: MSRP, dtype: float64

… or apply it on the DataFrame itself. In this case it finds all numerical columns and computes the statistics.

df.describe()
       Year         Engine HP  Engine Cylinders  MSRP
count  5.000000     4.00000    5.000000          5.000000
mean   2010.400000  202.75000  4.400000          30186.000000
std    11.260551    51.29896   0.894427          18985.044904
min    1991.000000  138.00000  4.000000          2000.000000
25%    2010.000000  180.00000  4.000000          27150.000000
50%    2017.000000  206.00000  4.000000          32340.000000
75%    2017.000000  228.75000  4.000000          34450.000000
max    2017.000000  261.00000  6.000000          54990.000000
Data have many digits after decimal point

This overview is a bit confusing because of the many decimal digits. With round(2) you can round all numbers to two decimal digits. The output looks much nicer now.

df.describe().round(2)
       Year     Engine HP  Engine Cylinders  MSRP
count  5.00     4.00       5.00              5.00
mean   2010.40  202.75     4.40              30186.00
std    11.26    51.30      0.89              18985.04
min    1991.00  138.00     4.00              2000.00
25%    2010.00  180.00     4.00              27150.00
50%    2017.00  206.00     4.00              32340.00
75%    2017.00  228.75     4.00              34450.00
max    2017.00  261.00     6.00              54990.00
Data with rounded values
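
If you also want to see the non-numerical columns in this overview, describe() accepts an include parameter; a small sketch (output omitted here):

df.describe(include='all')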

Categorical Columns

There are also some functions for string values (the next function works for numerical values, too). nunique() returns the number of unique values. You can apply it to single columns or to the whole DataFrame.

# Returns the number of unique values of column Make
df.Make.nunique()
# Output: 4

# Returns the number of unique values for all columns of the DataFrame
df.nunique()
# Output:
# Make                 4
# Model                5
# Year                 3
# Engine HP            4
# Engine Cylinders     2
# Transmission Type    2
# Vehicle_Style        4
# MSRP                 5
# dtype: int64

However, if you are interested in the unique values themselves, you can use the unique() function.

df.Year.unique()
# Output:
# array([1991, 2017, 2010])
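
Another handy function for categorical columns is value_counts(), which shows how often each value occurs. For the Make column of this DataFrame it looks roughly like this (the exact formatting depends on the Pandas version):

df.Make.value_counts()
# Output:
# Nissan     2
# Hyundai    1
# Lotus      1
# GMC        1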

Missing values

Missing values can make our lives more difficult. That’s why it makes sense to take a look at this problem. The function isnull() returns True for each cell where the value is missing.

df.isnull()
   Make   Model  Year   Engine HP  Engine Cylinders  Transmission Type  Vehicle_Style  MSRP
0  False  False  False  False      False             False              False          False
1  False  False  False  True       False             False              False          False
2  False  False  False  False      False             False              False          False
3  False  False  False  False      False             False              False          False
4  False  False  False  False      False             False              False          False
Showing missing values for each cell

This representation is hard to read, therefore it’s more useful to sum up the number of missing values for each column.

df.isnull().sum()
# Output:
# Make                 0
# Model                0
# Year                 0
# Engine HP            1
# Engine Cylinders     0
# Transmission Type    0
# Vehicle_Style        0
# MSRP                 0
# dtype: int64
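
A simple way to deal with the missing Engine HP value is to replace it, for example with 0 (whether that is a sensible choice depends on the use case); a small sketch:

# fillna() returns a new Series, df itself stays unchanged
df['Engine HP'].fillna(0)
# Output:
# 0    138.0
# 1      0.0
# 2    218.0
# 3    194.0
# 4    261.0
# Name: Engine HP, dtype: float64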

Grouping

Grouping is another option that is suitable to get an overview. In this query we are interested in the Transmission Type and MSRP columns, grouped by transmission type: the grouping gives the average price for each transmission type. In SQL it would look like this:

SELECT
    transmission_type,
    AVG(MSRP)
FROM
    cars
GROUP BY
    transmission_type

The Pandas equivalent is:

df.groupby('Transmission Type').MSRP.mean()
# Output:
# Transmission Type
# AUTOMATIC    30800.000000
# MANUAL       29776.666667
# Name: MSRP, dtype: float64

There are many other interesting applications of grouping, e.g. min() or max(). Even describe() can be used here.

df.groupby('Transmission Type').MSRP.min()
# Output:
# Transmission Type
# AUTOMATIC    27150
# MANUAL        2000
# Name: MSRP, dtype: int64

df.groupby('Transmission Type').MSRP.max()
# Output:
# Transmission Type
# AUTOMATIC    34450
# MANUAL       54990
# Name: MSRP, dtype: int64

df.groupby('Transmission Type').MSRP.describe()
Transmission Type  count  mean          std           min      25%      50%      75%      max
AUTOMATIC          2.0    30800.000000  5161.879503   27150.0  28975.0  30800.0  32625.0  34450.0
MANUAL             3.0    29776.666667  26587.836191  2000.0   17170.0  32340.0  43665.0  54990.0
Using grouping of Transmission Type
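
If you only need a few statistics instead of the full describe() output, groupby also works with agg(); a small sketch using standard aggregation names:

df.groupby('Transmission Type').MSRP.agg(['min', 'mean', 'max'])
# Output (roughly):
#                      min          mean    max
# Transmission Type
# AUTOMATIC          27150  30800.000000  34450
# MANUAL              2000  29776.666667  54990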

Getting the NumPy arrays

Sometimes it is necessary to convert a Pandas column back to the underlying NumPy array.

df.MSRP.values
# Output:
# array([ 2000, 27150, 54990, 34450, 32340])
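
Newer Pandas versions also offer to_numpy(), which returns the same array and is generally recommended over .values:

df.MSRP.to_numpy()
# Output: array([ 2000, 27150, 54990, 34450, 32340])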

Getting a list of dictionaries

To convert a Pandas DataFrame back into a list of dictionaries, you can use to_dict with orient='records'. This form can be used, for example, to save the data to a file.

df.to_dict(orient='records')
# Output:
# [{'Make': 'Nissan',
#  'Model': 'Stanza',
#  'Year': 1991,
#  'Engine HP': 138.0,
#  'Engine Cylinders': 4,
#  'Transmission Type': 'MANUAL',
#  'Vehicle_Style': 'sedan',
#  'MSRP': 2000},
# {'Make': 'Hyundai',
#  'Model': 'Sonata',
#  'Year': 2017,
#  'Engine HP': nan,
#  'Engine Cylinders': 4,
#  'Transmission Type': 'AUTOMATIC',
#  'Vehicle_Style': 'sedan',
#  'MSRP': 27150},
# {'Make': 'Lotus',
#  'Model': 'Elise',
#  'Year': 2010,
#  'Engine HP': 218.0,
#  'Engine Cylinders': 4,
#  'Transmission Type': 'MANUAL',
#  'Vehicle_Style': 'convertible',
#  'MSRP': 54990},
# {'Make': 'GMC',
#  'Model': 'Acadia',
#  'Year': 2017,
#  'Engine HP': 194.0,
#  'Engine Cylinders': 4,
#  'Transmission Type': 'AUTOMATIC',
#  'Vehicle_Style': '4dr_suv',
#  'MSRP': 34450},
# {'Make': 'Nissan',
#  'Model': 'Frontier',
#  'Year': 2017,
#  'Engine HP': 261.0,
#  'Engine Cylinders': 6,
#  'Transmission Type': 'MANUAL',
#  'Vehicle_Style': 'pickup',
#  'MSRP': 32340}]
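
The conversion also works the other way around: such a list of dictionaries can be turned back into a DataFrame (a small sketch, assuming the list from above is stored in a variable called records):

records = df.to_dict(orient='records')
pd.DataFrame(records)  # pd is the pandas alias imported earlier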
