Overview:

This second part covers the difference between Machine Learning (ML) and rule-based systems. A spam filter serves as the example, starting with what an implementation would look like without ML.

Rule-based systems

Without ML, you define rules that distinguish ham (legitimate mail) from spam. For a while everything works fine. However, as new kinds of spam appear you have to adjust the rule set again and again, and you end up on a hamster wheel of constant reconfiguration. The growing rule set also becomes harder and harder to maintain.
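A minimal sketch of such a rule-based filter, to make the maintenance problem concrete. The function name `is_spam_rule_based` and the two rules are hypothetical; they only reuse the sender address and keyword that appear in the feature list of this post.

```python
def is_spam_rule_based(sender: str, title: str, body: str) -> bool:
    """Return True as soon as one hand-written rule fires (rules are hypothetical)."""
    # Rule 1: a sender address that is known to send spam.
    if sender == "promotions@online.com":
        return True
    # Rule 2: a suspicious keyword in the title or the body.
    text = (title + " " + body).lower()
    if "deposit" in text:
        return True
    # Every new spam pattern requires another hand-written rule here.
    return False


print(is_spam_rule_based("promotions@online.com", "Big sale", "Buy now"))
print(is_spam_rule_based("friend@example.com", "Lunch?", "See you at noon"))
```

Each new spam pattern means editing this function by hand, which is exactly the hamster wheel described above.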

Machine Learning

The second way to implement the spam filter is to use ML instead of hard-coded rules. You collect the data, define and calculate (extract) the features, then train the model and use it to classify messages as spam or not spam.

Collect the data

The data is collected through the “SPAM” button of your mail system: each message a user marks as spam becomes a labeled example.

Define & calculate (extract) the features

To create the features, start with the rules you would use in a rule-based system.

Features:

Length of title > 10? true/false
Length of body > 10? true/false
Sender “promotions@online.com”? true/false
Sender “hpYOSKmL@test.com”? true/false
Sender domain “test.com”? true/false
Description contains “deposit”? true/false

All six features here are binary, so each mail can be encoded as a binary vector such as [1, 1, 0, 0, 1, 1]. In addition, every email has a label¹ / target (spam = 1, no spam = 0), which is the desired output.
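The six features above can be sketched as a small extraction function. This is an illustration under assumptions: the function name `extract_features` is hypothetical, “Description” is interpreted as the mail body, and the sample mail is made up so that it reproduces the vector [1, 1, 0, 0, 1, 1].

```python
def extract_features(sender: str, title: str, body: str) -> list[int]:
    """Encode one mail as the six binary features listed above."""
    return [
        int(len(title) > 10),                   # Length of title > 10?
        int(len(body) > 10),                    # Length of body > 10?
        int(sender == "promotions@online.com"), # Sender "promotions@online.com"?
        int(sender == "hpYOSKmL@test.com"),     # Sender "hpYOSKmL@test.com"?
        int(sender.endswith("@test.com")),      # Sender domain "test.com"?
        int("deposit" in body.lower()),         # Body contains "deposit"? (assumed meaning of "Description")
    ]


# Made-up example mail.
features = extract_features(
    sender="info@test.com",
    title="Waiting for your reply",
    body="Please send a deposit of $50.",
)
print(features)  # [1, 1, 0, 0, 1, 1]
```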

Training

This data is used to train the model. This process is often called fitting a model.

Training is similar to solving a very complex system of equations with many parameters. The features are weighted against each other in such a way that the correct classification comes out at the end. Correct in this example means 1 for spam and 0 for no spam. More precisely, we get a probability for the label rather than a hard 0 or 1. The trained model contains exactly the information that best solves the equations, namely the weights that are applied to the individual features to get the correct result.²
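To illustrate what finding the weights can look like, here is a minimal sketch that fits a logistic regression with plain gradient descent. The choice of logistic regression is an assumption (the post does not name a model), the names `fit` and `predict_proba` are hypothetical, and the six training rows and their labels are made-up toy data.

```python
import math


def sigmoid(z: float) -> float:
    """Squash a weighted sum into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Probability of spam: sigmoid of the weighted feature sum plus bias."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit(X, y, learning_rate=0.5, epochs=2000):
    """Find weights and bias by gradient descent on the logistic loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, target in zip(X, y):
            error = predict_proba(weights, bias, x) - target
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        # Move each parameter a small step against the average gradient.
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Made-up toy data: one row of six binary features per mail.
X = [
    [1, 1, 0, 0, 1, 1],
    [1, 1, 1, 0, 0, 1],
    [0, 1, 0, 1, 1, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
]
y = [1, 1, 1, 0, 0, 0]  # label: spam = 1, no spam = 0

weights, bias = fit(X, y)
for x, target in zip(X, y):
    print(target, round(predict_proba(weights, bias, x), 3))
```

In practice a library implementation would normally be used instead of a hand-written loop; the loop is only there to show that the trained model is nothing more than a set of weights.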

Apply the model

If the model is now applied to new, unseen messages, the result is a probability that the message is spam. To finally decide how to categorize the mail, a threshold is used (e.g. 0.5): everything greater than or equal to the threshold is declared spam.
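The thresholding step can be sketched in a few lines. The function name `classify` and the example probabilities are hypothetical; the default threshold of 0.5 is the value mentioned above.

```python
def classify(probability: float, threshold: float = 0.5) -> int:
    """Return 1 (spam) if the probability reaches the threshold, else 0 (no spam)."""
    return 1 if probability >= threshold else 0


# Made-up model outputs for three mails.
for p in [0.12, 0.5, 0.93]:
    print(p, "->", "spam" if classify(p) == 1 else "no spam")
```

A higher threshold marks fewer mails as spam, a lower one marks more, so the value can be tuned to how costly a wrongly filtered mail is.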

¹ The term “label” is not used in this video.
² This is not described in this video.

ML Zoomcamp 2023 – Introduction to Machine Learning – Part 2
