Overview:
Introduction to Pandas – Part 2/2
Filtering
Filtering refers to the process of selecting specific rows or columns from a DataFrame based on certain conditions. In Pandas, we can use various techniques to filter our data.
One common technique is to use boolean indexing. This involves creating a boolean mask that specifies the conditions we want to apply to our data. The mask is a DataFrame or Series of the same shape as the original data, where each element is either True or False depending on whether the corresponding element in the original data satisfies the condition.
For example, if we have a DataFrame df with a column named Year, and we want to filter out all rows where the year is greater or equal to 2015, we can create our condition as follows:
condition = df.Year >= 2015
We can then use this condition to select the desired rows from the DataFrame:
filtered_df = df[condition]
# or like this
df[df.Year >= 2015]
Another useful technique for filtering is using the .query() method. This method allows us to filter rows based on a string expression, similar to writing SQL queries. For example, to filter out all rows where the year is greater or equal to 2015 using .query(), we can do:
filtered_df = df.query('year >= 2015')
| Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP | |
|---|---|---|---|---|---|---|---|---|
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |
The next example is a filter on the make Nissan.
df[
df.Make == 'Nissan'
]
| Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP | |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |
It is also possible to combine several condition, for example, you want to have all Nissans after the year 2015.
df[
(df.Make == 'Nissan') & (df.Year > 2015)
]
| Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP | |
|---|---|---|---|---|---|---|---|---|
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |
By using these filtering techniques, we can easily extract the data we need from our DataFrame and perform further analysis or computations on it. This can be particularly useful when working with large datasets or when dealing with complex conditions.
String operations
Next, let’s explore string operations in Pandas. That is something NumPy doesn’t have, because NumPy is mostly used for processing numbers. In Pandas we often have to deal with strings.
df
| Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP | |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |
The next snippet shows the Vehicle_Style column. You can immediately see the different formatting. There are different spellings of the same words and there are spaces.
df['Vehicle_Style']
# Output:
# 0 sedan
# 1 Sedan
# 2 convertible
# 3 4dr SUV
# 4 Pickup
# Name: Vehicle_Style, dtype: object
There is a string operator to lower the string as you can see here:
'STRr'.lower()
# Output: 'strr'
Using this string operator, the output of Vehicle_Style looks more consistent, but still there are the spaces…
df['Vehicle_Style'].str.lower()
# Output:
# 0 sedan
# 1 sedan
# 2 convertible
# 3 4dr suv
# 4 pickup
# Name: Vehicle_Style, dtype: object
A typical pre-processing step when you work with text is to replace all spaces with underscores.
'machine learning zoomcamp'.replace(' ','_')
# Output: 'machine_learning_zoomcamp'
Using this string operator, the output of Vehicle_Style looks perfect.
df['Vehicle_Style'].str.replace(' ','_')
# Output:
# 0 sedan
# 1 Sedan
# 2 convertible
# 3 4dr_SUV
# 4 Pickup
# Name: Vehicle_Style, dtype: object
# Both operations can be chained
df['Vehicle_Style'] = df['Vehicle_Style'].str.lower().str.replace(' ','_')
df
| Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP | |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr_suv | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | pickup | 32340 |
Summarizing operations
df.MSRP
# Output:
# 0 2000
# 1 27150
# 2 54990
# 3 34450
# 4 32340
# Name: MSRP, dtype: int64
df.MSRP.min()
# Output: 2000
df.MSRP.max()
# Output: 54990
df.MSRP.mean()
# Output: 30186.0
Numerical Columns
Describe function
It can be very helpful to get an overview with the describe function. This gives you a lot of information at once (count, mean, std, min, max, and the quantiles for 25%, 50% and 50%). There it can be used only for numerical columns. You can use it for one column…
df.MSRP.describe()
# Output:
# count 5.000000
# mean 30186.000000
# std 18985.044904
# min 2000.000000
# 25% 27150.000000
# 50% 32340.000000
# 75% 34450.000000
# max 54990.000000
# Name: MSRP, dtype: float64
… or apply it on the DataFrame itself. In this case it finds all numerical columns and computes the statistics.
df.describe()
| Year | Engine HP | Engine Cylinders | MSRP | |
|---|---|---|---|---|
| count | 5.000000 | 4.00000 | 5.000000 | 5.000000 |
| mean | 2010.400000 | 202.75000 | 4.400000 | 30186.000000 |
| std | 11.260551 | 51.29896 | 0.894427 | 18985.044904 |
| min | 1991.000000 | 138.00000 | 4.000000 | 2000.000000 |
| 25% | 2010.000000 | 180.00000 | 4.000000 | 27150.000000 |
| 50% | 2017.000000 | 206.00000 | 4.000000 | 32340.000000 |
| 75% | 2017.000000 | 228.75000 | 4.000000 | 34450.000000 |
| max | 2017.000000 | 261.00000 | 6.000000 | 54990.000000 |
This overview is a bit confusing because of the many decimal digits. With round(2) you can round all numbers to two decimal digits. The output looks much nicer now.
df.describe().round(2)
| Year | Engine HP | Engine Cylinders | MSRP | |
|---|---|---|---|---|
| count | 5.00 | 4.00 | 5.00 | 5.00 |
| mean | 2010.40 | 202.75 | 4.40 | 30186.00 |
| std | 11.26 | 51.30 | 0.89 | 18985.04 |
| min | 1991.00 | 138.00 | 4.00 | 2000.00 |
| 25% | 2010.00 | 180.00 | 4.00 | 27150.00 |
| 50% | 2017.00 | 206.00 | 4.00 | 32340.00 |
| 75% | 2017.00 | 228.75 | 4.00 | 34450.00 |
| max | 2017.00 | 261.00 | 6.00 | 54990.00 |
Categorical Columns
There are also some functions for string values. The next function can be used for numerical values, too. Nunique() returns the number of unique values. You can apply this on single columns or DataFrames.
# Returns the number of unique values of column Make
df.Make.nunique()
# Output: 4
# Returns the number of unique values for all columns of the DataFrame
df.nunique()
# Output:
# Make 4
# Model 5
# Year 3
# Engine HP 4
# Engine Cylinders 2
# Transmission Type 2
# Vehicle_Style 4
# MSRP 5
# dtype: int64
However, if you are interested in the unique values themselves, you can use the unique() function.
df.Year.unique()
# Output
array([1991, 2017, 2010])
Missing values
Missing values can make our lives more difficult. That’s why it makes sense to take a look at this problem. The function isnull() returns true for each value/cell that is missing.
df.isnull()
| Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP | |
|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False |
| 1 | False | False | False | True | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False |
The representation is very confusing therefor it’s more useful to sums up the number of missing values for each column.
df.isnull().sum()
# Output:
# Make 0
# Model 0
# Year 0
# Engine HP 1
# Engine Cylinders 0
# Transmission Type 0
# Vehicle_Style 0
# MSRP 0
# dtype: int64
Grouping
Grouping is another option that is suitable to get an overview. In this query we are interested in the information from the transmission_type and MSRP columns, sorted by transmission_type. The grouping gives the average price for each transmission_type.
SELECT
transmission_type,
AVG(MSRP)
FROM
cars
GROUP BY
transmission_type
df.groupby('Transmission Type').MSRP.mean()
# Output:
# Transmission Type
# AUTOMATIC 30800.000000
# MANUAL 29776.666667
# Name: MSRP, dtype: float64
There are many other interesting applications of grouping, e.g. min() or max(). Even describe() can be used here.
df.groupby('Transmission Type').MSRP.min()
# Output:
# Transmission Type
# AUTOMATIC 27150
# MANUAL 2000
# Name: MSRP, dtype: int64
df.groupby('Transmission Type').MSRP.max()
# Output:
# Transmission Type
# AUTOMATIC 34450
# MANUAL 54990
# Name: MSRP, dtype: int64
df.groupby('Transmission Type').MSRP.describe()
| Transmission Type | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| AUTOMATIC | 2.0 | 30800.000000 | 5161.879503 | 27150.0 | 28975.0 | 30800.0 | 32625.0 | 34450.0 |
| MANUAL | 3.0 | 29776.666667 | 26587.836191 | 2000.0 | 17170.0 | 32340.0 | 43665.0 | 54990.0 |
Getting the NumPy arrays
Sometimes it is necessary to convert a Pandas DataFrame back to the underlying NumPy array.
df.MSRP.values
# Output:
# array([ 2000, 27150, 54990, 34450, 32340])
Getting a list of dictionaries
To convert a Pandas DataFrame back to the form of a list of dictionaries. This form is used to save it to a file.
df.to_dict(orient='records')
# Output:
# [{'Make': 'Nissan',
# 'Model': 'Stanza',
# 'Year': 1991,
# 'Engine HP': 138.0,
# 'Engine Cylinders': 4,
# 'Transmission Type': 'MANUAL',
# 'Vehicle_Style': 'sedan',
# 'MSRP': 2000},
# {'Make': 'Hyundai',
# 'Model': 'Sonata',
# 'Year': 2017,
# 'Engine HP': nan,
# 'Engine Cylinders': 4,
# 'Transmission Type': 'AUTOMATIC',
# 'Vehicle_Style': 'sedan',
# 'MSRP': 27150},
# {'Make': 'Lotus',
# 'Model': 'Elise',
# 'Year': 2010,
# 'Engine HP': 218.0,
# 'Engine Cylinders': 4,
# 'Transmission Type': 'MANUAL',
# 'Vehicle_Style': 'convertible',
# 'MSRP': 54990},
# {'Make': 'GMC',
# 'Model': 'Acadia',
# 'Year': 2017,
# 'Engine HP': 194.0,
# 'Engine Cylinders': 4,
# 'Transmission Type': 'AUTOMATIC',
# 'Vehicle_Style': '4dr_suv',
# 'MSRP': 34450},
# {'Make': 'Nissan',
# 'Model': 'Frontier',
# 'Year': 2017,
# 'Engine HP': 261.0,
# 'Engine Cylinders': 6,
# 'Transmission Type': 'MANUAL',
# 'Vehicle_Style': 'pickup',
# 'MSRP': 32340}]