ML Zoomcamp 2023 – Evaluation metrics for classification – Part 6

  1. ROC AUC – Area under the ROC curve
    1. Useful metric
    2. AUC interpretation

ROC AUC – Area under the ROC curve

Useful metric

One way to quantify how close we are to the ideal point is to measure the area under the ROC curve (AUC). AUC equals 0.5 for a random baseline and 1.0 for an ideal curve, so a useful model’s AUC should fall between 0.5 and 1.0. An AUC below 0.5 usually means something went wrong, for example the labels or scores got inverted (there is a quick sanity check after the shortcut below). As a rough guide, AUC = 0.8 is considered good, 0.9 is great, and 0.6 is poor. We can calculate AUC with scikit-learn’s auc function. This function is not specific to ROC curves: given x and y values, it computes the area under any curve.

from sklearn.metrics import auc, roc_curve

# auc needs the x-axis and y-axis values of the curve
auc(fpr, tpr)
# Output: 0.843850505725819

# fpr/tpr computed manually in the previous part (df_scores)
auc(df_scores.fpr, df_scores.tpr)
# Output: 0.8438732975754537

# the ideal curve from the previous part (df_ideal)
auc(df_ideal.fpr, df_ideal.tpr)
# Output: 0.9999430203759136

# recompute fpr/tpr with roc_curve and take the area under it
fpr, tpr, thresholds = roc_curve(y_val, y_pred)
auc(fpr, tpr)
# Output: 0.843850505725819

There is a shortcut in the scikit-learn package: roc_auc_score takes the labels and the scores and computes the area directly.

from sklearn.metrics import roc_auc_score

roc_auc_score(y_val, y_pred)

# Output: 0.843850505725819
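
A quick sanity check on the claim that an AUC below 0.5 means we made a mistake: negating the scores reverses the ranking of every example, so the AUC flips to exactly 1 - AUC. A minimal sketch, using the same y_val and y_pred as above:

# negating the scores reverses the ranking, flipping the AUC to 1 - AUC
roc_auc_score(y_val, -y_pred)
# 1 - 0.843850505725819 = 0.156149494274181

So if you ever see an AUC well below 0.5, the scores or labels are most likely inverted.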

AUC interpretation

AUC tells us the probability that a randomly selected positive example has a score that is higher than a randomly selected negative example.

import random

# scores the model assigned to negative and positive examples
neg = y_pred[y_val == 0]
pos = y_pred[y_val == 1]

# pick one random positive and one random negative example
pos_ind = random.randint(0, len(pos) - 1)
neg_ind = random.randint(0, len(neg) - 1)

We want to compare the score of this positive example with the score of the negative example.

pos[pos_ind] > neg[neg_ind]
# Output: True

So for this random pair, the positive example got the higher score. We can repeat the experiment 100,000 times and see how often that happens.

n = 100000
success = 0

# repeatedly draw a random positive/negative pair and count how often
# the positive example gets the higher score
for i in range(n):
    pos_ind = random.randint(0, len(pos) - 1)
    neg_ind = random.randint(0, len(neg) - 1)

    if pos[pos_ind] > neg[neg_ind]:
        success += 1

success / n

# Output: 0.84389

That result is quite close to roc_auc_score(y_val, y_pred) = 0.843850505725819.

Instead of looping manually, we can vectorize this with NumPy. Be aware that in np.random.randint(low, high, size, dtype), ‘low’ is inclusive and ‘high’ is exclusive, so we can pass len(pos) and len(neg) directly instead of subtracting 1.

import numpy as np

n = 50000

np.random.seed(1)
# draw n random positive and n random negative indices at once
pos_ind = np.random.randint(0, len(pos), size=n)
neg_ind = np.random.randint(0, len(neg), size=n)
pos[pos_ind] > neg[neg_ind]
# Output: array([False,  True,  True, ...,  True,  True,  True])

(pos[pos_ind] > neg[neg_ind]).mean()
# Output: 0.84646
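
We can also compute this probability exactly rather than estimating it by sampling: compare every positive score against every negative score and take the mean. This is a minimal sketch using NumPy broadcasting; it materializes a len(pos) x len(neg) boolean matrix, so it is only practical for moderately sized validation sets, and it counts a tied pair as a failure, whereas roc_auc_score counts ties as 0.5.

# exact version: compare every positive score with every negative score;
# pos[:, None] has shape (len(pos), 1) and neg[None, :] has shape
# (1, len(neg)), so the comparison broadcasts to a boolean matrix
(pos[:, None] > neg[None, :]).mean()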

Because of this interpretation, AUC is a popular way of measuring the performance of binary classification models. It is intuitive, and it tells us how well the model ranks positive examples above negative ones, that is, how well it separates the two classes.
