ML Zoomcamp 2023 – Introduction to Machine Learning – Part 13

Overview:

  1. Introduction to Pandas – Part 2/2
    1. Filtering
    2. String operations
    3. Summarizing operations
      1. Numerical Columns
        1. Describe function
      2. Categorical Columns
    4. Missing values
    5. Grouping
    6. Getting the NumPy arrays
    7. Getting a list of dictionaries

Introduction to Pandas – Part 2/2

Filtering

Filtering refers to the process of selecting specific rows or columns from a DataFrame based on certain conditions. In Pandas, we can use various techniques to filter our data.

One common technique is to use boolean indexing. This involves creating a boolean mask that specifies the conditions we want to apply to our data. The mask is a DataFrame or Series of the same shape as the original data, where each element is either True or False depending on whether the corresponding element in the original data satisfies the condition.

For example, if we have a DataFrame df with a column named Year, and we want to select all rows where the year is greater than or equal to 2015, we can create our condition as follows:

condition = df.Year >= 2015
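
For the example DataFrame used in this post (with the years 1991, 2017, 2010, 2017 and 2017), this mask would be a boolean Series along these lines:

condition
# Output:
# 0    False
# 1     True
# 2    False
# 3     True
# 4     True
# Name: Year, dtype: bool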

We can then use this condition to select the desired rows from the DataFrame:

filtered_df = df[condition]

# or like this
df[df.Year >= 2015]

Another useful technique for filtering is the .query() method. It allows us to filter rows based on a string expression, similar to writing SQL queries. For example, to select all rows where the year is greater than or equal to 2015 using .query(), we can do:

filtered_df = df.query('Year >= 2015')
   Make     Model     Year  Engine HP  Engine Cylinders  Transmission Type  Vehicle_Style  MSRP
1  Hyundai  Sonata    2017  NaN        4                 AUTOMATIC          Sedan          27150
3  GMC      Acadia    2017  194.0      4                 AUTOMATIC          4dr SUV        34450
4  Nissan   Frontier  2017  261.0      6                 MANUAL             Pickup         32340
Using filtering Year >= 2015

The next example is a filter on the make Nissan.

df[
    df.Make == 'Nissan'
]
   Make     Model     Year  Engine HP  Engine Cylinders  Transmission Type  Vehicle_Style  MSRP
0  Nissan   Stanza    1991  138.0      4                 MANUAL             sedan          2000
4  Nissan   Frontier  2017  261.0      6                 MANUAL             Pickup         32340
Using filtering Make == ‘Nissan’

It is also possible to combine several conditions; for example, if you want all Nissans built after 2015:

df[
    (df.Make == 'Nissan') & (df.Year > 2015)
]
   Make     Model     Year  Engine HP  Engine Cylinders  Transmission Type  Vehicle_Style  MSRP
4  Nissan   Frontier  2017  261.0      6                 MANUAL             Pickup         32340
Using chained filtering Make == ‘Nissan’ & Year > 2015
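
As a side note, the same combined filter can also be written with the .query() method, for example:

df.query("Make == 'Nissan' and Year > 2015")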

By using these filtering techniques, we can easily extract the data we need from our DataFrame and perform further analysis or computations on it. This can be particularly useful when working with large datasets or when dealing with complex conditions.

String operations

Next, let’s explore string operations in Pandas. This is something NumPy doesn’t offer, since NumPy is mostly designed for processing numbers; in Pandas, however, we often have to deal with strings.

df
   Make     Model     Year  Engine HP  Engine Cylinders  Transmission Type  Vehicle_Style  MSRP
0  Nissan   Stanza    1991  138.0      4                 MANUAL             sedan          2000
1  Hyundai  Sonata    2017  NaN        4                 AUTOMATIC          Sedan          27150
2  Lotus    Elise     2010  218.0      4                 MANUAL             convertible    54990
3  GMC      Acadia    2017  194.0      4                 AUTOMATIC          4dr SUV        34450
4  Nissan   Frontier  2017  261.0      6                 MANUAL             Pickup         32340
Unchanged DataFrame

The next snippet shows the Vehicle_Style column. You can immediately see the inconsistent formatting: the same words are written with different capitalization, and some values contain spaces.

df['Vehicle_Style']
# Output:
# 0          sedan
# 1          Sedan
# 2    convertible
# 3        4dr SUV
# 4         Pickup
# Name: Vehicle_Style, dtype: object

Python strings have a lower() method that converts them to lowercase, as you can see here:

'STRr'.lower()
# Output: 'strr'

Applied to the column via the .str accessor, the Vehicle_Style values look more consistent, but the spaces are still there…

df['Vehicle_Style'].str.lower()
# Output:
# 0          sedan
# 1          sedan
# 2    convertible
# 3        4dr suv
# 4         pickup
# Name: Vehicle_Style, dtype: object

A typical pre-processing step when you work with text is to replace all spaces with underscores.

'machine learning zoomcamp'.replace(' ','_')
# Output: 'machine_learning_zoomcamp'

Applied to the column, this replaces the spaces with underscores, although the inconsistent capitalization remains:

df['Vehicle_Style'].str.replace(' ','_')
# Output:
# 0          sedan
# 1          Sedan
# 2    convertible
# 3        4dr_SUV
# 4         Pickup
# Name: Vehicle_Style, dtype: object

# Both operations can be chained
df['Vehicle_Style'] = df['Vehicle_Style'].str.lower().str.replace(' ','_')
df
   Make     Model     Year  Engine HP  Engine Cylinders  Transmission Type  Vehicle_Style  MSRP
0  Nissan   Stanza    1991  138.0      4                 MANUAL             sedan          2000
1  Hyundai  Sonata    2017  NaN        4                 AUTOMATIC          sedan          27150
2  Lotus    Elise     2010  218.0      4                 MANUAL             convertible    54990
3  GMC      Acadia    2017  194.0      4                 AUTOMATIC          4dr_suv        34450
4  Nissan   Frontier  2017  261.0      6                 MANUAL             pickup         32340
Applying string operations to Vehicle_Style column

Summarizing operations

Pandas also provides summarizing operations such as min(), max() and mean(). They can be applied to a single column, for example MSRP:

df.MSRP
# Output:
# 0     2000
# 1    27150
# 2    54990
# 3    34450
# 4    32340
# Name: MSRP, dtype: int64

df.MSRP.min()
# Output: 2000

df.MSRP.max()
# Output: 54990

df.MSRP.mean()
# Output: 30186.0
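
Other summarizing functions work the same way. For example, sum() adds up the whole column (the result follows from the MSRP values above):

df.MSRP.sum()
# Output: 150930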

Numerical Columns

Describe function

It can be very helpful to get an overview with the describe function. This gives you a lot of information at once (count, mean, std, min, max, and the 25%, 50% and 75% quantiles). By default it is applied only to numerical columns. You can use it for one column…

df.MSRP.describe()
# Output:
# count        5.000000
# mean     30186.000000
# std      18985.044904
# min       2000.000000
# 25%      27150.000000
# 50%      32340.000000
# 75%      34450.000000
# max      54990.000000
# Name: MSRP, dtype: float64

… or apply it on the DataFrame itself. In this case it finds all numerical columns and computes the statistics.

df.describe()
       Year         Engine HP  Engine Cylinders  MSRP
count  5.000000     4.00000    5.000000          5.000000
mean   2010.400000  202.75000  4.400000          30186.000000
std    11.260551    51.29896   0.894427          18985.044904
min    1991.000000  138.00000  4.000000          2000.000000
25%    2010.000000  180.00000  4.000000          27150.000000
50%    2017.000000  206.00000  4.000000          32340.000000
75%    2017.000000  228.75000  4.000000          34450.000000
max    2017.000000  261.00000  6.000000          54990.000000
Data have many digits after decimal point

This overview is a bit confusing because of the many decimal digits. With round(2) you can round all numbers to two decimal digits. The output looks much nicer now.

df.describe().round(2)
       Year     Engine HP  Engine Cylinders  MSRP
count  5.00     4.00       5.00              5.00
mean   2010.40  202.75     4.40              30186.00
std    11.26    51.30      0.89              18985.04
min    1991.00  138.00     4.00              2000.00
25%    2010.00  180.00     4.00              27150.00
50%    2017.00  206.00     4.00              32340.00
75%    2017.00  228.75     4.00              34450.00
max    2017.00  261.00     6.00              54990.00
Data with rounded values
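
If you also want to see the non-numerical columns in this overview, describe() accepts an include parameter; a small sketch (output omitted here):

df.describe(include='all')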

Categorical Columns

There are also some functions for string values (the next function works for numerical values, too). nunique() returns the number of unique values. You can apply it to single columns or to the whole DataFrame.

# Returns the number of unique values of column Make
df.Make.nunique()
# Output: 4

# Returns the number of unique values for all columns of the DataFrame
df.nunique()
# Output:
# Make                 4
# Model                5
# Year                 3
# Engine HP            4
# Engine Cylinders     2
# Transmission Type    2
# Vehicle_Style        4
# MSRP                 5
# dtype: int64

However, if you are interested in the unique values themselves, you can use the unique() function.

df.Year.unique()
# Output:
# array([1991, 2017, 2010])
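
Another handy function for categorical columns is value_counts(), which shows how often each value occurs. For the Make column of this DataFrame it looks roughly like this (the exact formatting depends on the Pandas version):

df.Make.value_counts()
# Output:
# Nissan     2
# Hyundai    1
# Lotus      1
# GMC        1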

Missing values

Missing values can make our lives more difficult. That’s why it makes sense to take a look at this problem. The function isnull() returns True for each cell where the value is missing.

df.isnull()
   Make   Model  Year   Engine HP  Engine Cylinders  Transmission Type  Vehicle_Style  MSRP
0  False  False  False  False      False             False              False          False
1  False  False  False  True       False             False              False          False
2  False  False  False  False      False             False              False          False
3  False  False  False  False      False             False              False          False
4  False  False  False  False      False             False              False          False
Showing missing values for each cell

This representation is hard to read, therefore it’s more useful to sum up the number of missing values for each column.

df.isnull().sum()
# Output:
# Make                 0
# Model                0
# Year                 0
# Engine HP            1
# Engine Cylinders     0
# Transmission Type    0
# Vehicle_Style        0
# MSRP                 0
# dtype: int64
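
A simple way to deal with the missing Engine HP value is to replace it, for example with 0 (whether that is a sensible choice depends on the use case); a small sketch:

# fillna() returns a new Series, df itself stays unchanged
df['Engine HP'].fillna(0)
# Output:
# 0    138.0
# 1      0.0
# 2    218.0
# 3    194.0
# 4    261.0
# Name: Engine HP, dtype: float64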

Grouping

Grouping is another option that is suitable to get an overview. In this query we are interested in the Transmission Type and MSRP columns, grouped by transmission type: the grouping gives the average price for each transmission type. In SQL it would look like this:

SELECT
    transmission_type,
    AVG(MSRP)
FROM
    cars
GROUP BY
    transmission_type

The Pandas equivalent is:

df.groupby('Transmission Type').MSRP.mean()
# Output:
# Transmission Type
# AUTOMATIC    30800.000000
# MANUAL       29776.666667
# Name: MSRP, dtype: float64

There are many other interesting applications of grouping, e.g. min() or max(). Even describe() can be used here.

df.groupby('Transmission Type').MSRP.min()
# Output:
# Transmission Type
# AUTOMATIC    27150
# MANUAL        2000
# Name: MSRP, dtype: int64

df.groupby('Transmission Type').MSRP.max()
# Output:
# Transmission Type
# AUTOMATIC    34450
# MANUAL       54990
# Name: MSRP, dtype: int64

df.groupby('Transmission Type').MSRP.describe()
Transmission Type  count  mean          std           min      25%      50%      75%      max
AUTOMATIC          2.0    30800.000000  5161.879503   27150.0  28975.0  30800.0  32625.0  34450.0
MANUAL             3.0    29776.666667  26587.836191  2000.0   17170.0  32340.0  43665.0  54990.0
Using grouping of Transmission Type
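
If you only need a few statistics instead of the full describe() output, groupby also works with agg(); a small sketch using standard aggregation names:

df.groupby('Transmission Type').MSRP.agg(['min', 'mean', 'max'])
# Output (roughly):
#                      min          mean    max
# Transmission Type
# AUTOMATIC          27150  30800.000000  34450
# MANUAL              2000  29776.666667  54990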

Getting the NumPy arrays

Sometimes it is necessary to convert a Pandas column back to the underlying NumPy array.

df.MSRP.values
# Output:
# array([ 2000, 27150, 54990, 34450, 32340])
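
Newer Pandas versions also offer to_numpy(), which returns the same array and is generally recommended over .values:

df.MSRP.to_numpy()
# Output: array([ 2000, 27150, 54990, 34450, 32340])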

Getting a list of dictionaries

To convert a Pandas DataFrame back into a list of dictionaries, you can use to_dict with orient='records'. This form can be used, for example, to save the data to a file.

df.to_dict(orient='records')
# Output:
# [{'Make': 'Nissan',
#  'Model': 'Stanza',
#  'Year': 1991,
#  'Engine HP': 138.0,
#  'Engine Cylinders': 4,
#  'Transmission Type': 'MANUAL',
#  'Vehicle_Style': 'sedan',
#  'MSRP': 2000},
# {'Make': 'Hyundai',
#  'Model': 'Sonata',
#  'Year': 2017,
#  'Engine HP': nan,
#  'Engine Cylinders': 4,
#  'Transmission Type': 'AUTOMATIC',
#  'Vehicle_Style': 'sedan',
#  'MSRP': 27150},
# {'Make': 'Lotus',
#  'Model': 'Elise',
#  'Year': 2010,
#  'Engine HP': 218.0,
#  'Engine Cylinders': 4,
#  'Transmission Type': 'MANUAL',
#  'Vehicle_Style': 'convertible',
#  'MSRP': 54990},
# {'Make': 'GMC',
#  'Model': 'Acadia',
#  'Year': 2017,
#  'Engine HP': 194.0,
#  'Engine Cylinders': 4,
#  'Transmission Type': 'AUTOMATIC',
#  'Vehicle_Style': '4dr_suv',
#  'MSRP': 34450},
# {'Make': 'Nissan',
#  'Model': 'Frontier',
#  'Year': 2017,
#  'Engine HP': 261.0,
#  'Engine Cylinders': 6,
#  'Transmission Type': 'MANUAL',
#  'Vehicle_Style': 'pickup',
#  'MSRP': 32340}]
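
The conversion also works the other way around: such a list of dictionaries can be turned back into a DataFrame (a small sketch, assuming the list from above is stored in a variable called records):

records = df.to_dict(orient='records')
pd.DataFrame(records)  # pd is the pandas alias imported earlier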
