ML Zoomcamp 2023 – Machine Learning for Classification – Part 10

Training logistic regression with Scikit-Learn

  1. Training logistic regression with Scikit-Learn
    1. Train a model with Scikit-Learn
    2. Apply model to the validation dataset
    3. Calculate the accuracy

Train a model with Scikit-Learn

When you want to train a logistic regression model, the process is quite similar to training a linear regression model.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)


You can use the ‘coef_’ attribute to display the weights (coefficients) in a logistic regression model.

model.coef_
# Output:
# array([[ 4.74725393e-01, -1.74869739e-01, -4.07533674e-01,
#        -2.96832307e-02, -7.79947901e-02,  6.26830488e-02,
#        -8.89697670e-02, -8.13913026e-02, -3.43104989e-02,
#        -7.33675219e-02, -3.35206588e-01,  3.16498334e-01,
#        -8.89697670e-02,  3.67393252e-03, -2.58133752e-01,
#         1.41436648e-01,  9.01908316e-03,  6.25300062e-02,
#        -8.89697670e-02, -8.12382600e-02,  2.65582755e-01,
#        -8.89697670e-02, -2.84291008e-01, -2.31202837e-01,
#         1.23524816e-01, -1.66018462e-01,  5.83404413e-02,
#        -8.70075565e-02, -3.20578701e-02,  7.04875625e-02,
#        -5.91001566e-02,  1.41436648e-01, -2.49114669e-01,
#         2.15471208e-01, -1.20363620e-01, -8.89697670e-02,
#         1.01655367e-01, -7.08936452e-02, -8.89697670e-02,
#         5.21853914e-02,  2.13378878e-01, -8.89697670e-02,
#        -2.32087131e-01, -7.04067163e-02,  3.82395921e-04]])

The ‘coef_’ attribute of a logistic regression model returns a 2-dimensional array. If you’re interested in the weight vector ‘w’, you can get it by indexing the first row: ‘coef_[0]’.

model.coef_[0].round(3)

# Output: 
# array([ 0.475, -0.175, -0.408, -0.03 , -0.078,  0.063, -0.089, -0.081,
#             -0.034, -0.073, -0.335,  0.316, -0.089,  0.004, -0.258,  0.141,
#              0.009,  0.063, -0.089, -0.081,  0.266, -0.089, -0.284, -0.231,
#              0.124, -0.166,  0.058, -0.087, -0.032,  0.07 , -0.059,  0.141,
#            -0.249,  0.215, -0.12 , -0.089,  0.102, -0.071, -0.089,  0.052,
#              0.213, -0.089, -0.232, -0.07 ,  0.   ])

You can use the ‘intercept_’ attribute to display the bias term (intercept) in a logistic regression model.

model.intercept_
# Output: array([-0.10903301])

# actually it's an array with one element
model.intercept_[0]
# Output: -0.10903300803603666

Now that we have a trained logistic regression model, we can apply it to a dataset. Let’s begin by testing it on the training data.

model.predict(X_train)

# Output: array([0, 1, 1, ..., 1, 0, 1])

We observe that the model provides hard predictions, meaning it assigns either zeros (representing “not churn”) or ones (representing “churn”). These are called hard predictions because each customer receives a definite class label rather than a probability.
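For binary problems, the hard predictions from predict are equivalent to thresholding the soft probabilities at 0.5. A quick self-contained sketch on synthetic data (not the churn dataset) illustrates this:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=200, random_state=1)
model = LogisticRegression().fit(X, y)

hard = model.predict(X)              # class labels: 0 or 1
soft = model.predict_proba(X)[:, 1]  # probabilities of class 1

# predict() agrees with thresholding the probabilities at 0.5
print((hard == (soft > 0.5).astype(int)).all())  # True
```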

Instead of hard predictions, we can generate soft predictions by using the predict_proba function, as demonstrated in the following snippet.

model.predict_proba(X_train)

# Output:
# array([[0.90451975, 0.09548025],
#       [0.32068109, 0.67931891],
#       [0.36632967, 0.63367033],
#       ...,
#       [0.46839952, 0.53160048],
#       [0.95745572, 0.04254428],
#       [0.30127894, 0.69872106]])

When using the predict_proba function in logistic regression, the output contains two columns: the first represents the probability of belonging to the negative class (0), and the second represents the probability of belonging to the positive class (1). In the context of churn prediction, we are typically interested in the second column, which is the probability of churn.
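The column order of predict_proba follows model.classes_, and each row of probabilities sums to 1. A minimal self-contained check on synthetic data (not the churn dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=100, random_state=1)
model = LogisticRegression().fit(X, y)

proba = model.predict_proba(X)
print(model.classes_)                        # column order: [0 1]
print(np.allclose(proba.sum(axis=1), 1.0))   # each row sums to 1: True
```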

Hence, you can simply extract the second column to obtain the probabilities of churn. Then, to make the final decision about whether to classify individuals as churned or not, you can choose a threshold. People with probabilities above this threshold are classified as churned, while those below it are classified as not churned. The choice of threshold can affect the model’s precision, recall, and other performance metrics, so it’s an important consideration when making predictions with logistic regression.
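To see how the threshold changes the decision, consider a handful of made-up churn probabilities: lowering the threshold flags more customers as churning (catching more true churners, but also more false alarms), while raising it flags fewer.

```python
import numpy as np

# Made-up churn probabilities, for illustration only
probs = np.array([0.05, 0.30, 0.45, 0.55, 0.70, 0.95])

print((probs >= 0.5).sum())  # 3 customers flagged at threshold 0.5
print((probs >= 0.3).sum())  # 5 customers flagged at the looser threshold 0.3
```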

Apply model to the validation dataset

y_pred = model.predict_proba(X_val)[:,1]
y_pred

# Output: 
# array([0.00899701, 0.20452226, 0.21222307, ..., 0.13638772, 0.79975934,
#       0.83739781])

The result is an array of churn probabilities, one for each customer in the validation set. To proceed, define a threshold and select all customers whose probability is above it: these are the customers the model predicts are likely to churn. Customers below the threshold are classified as not churning.

churn_decision = y_pred > 0.5
churn_decision
# Output: array([False, False, False, ..., False,  True,  True])

df_val[churn_decision]
|      | customerid | gender | seniorcitizen | partner | dependents | tenure | phoneservice | multiplelines | internetservice | onlinesecurity | onlinebackup | deviceprotection | techsupport | streamingtv | streamingmovies | contract | paperlessbilling | paymentmethod | monthlycharges | totalcharges |
|------|------------|--------|---------------|---------|------------|--------|--------------|---------------|-----------------|----------------|--------------|------------------|-------------|-------------|-----------------|----------|------------------|---------------|----------------|--------------|
| 3    | 8433-wxgna | male   | 0 | no  | no  | 2  | yes | no  | fiber_optic | yes | no  | no  | no | no  | no  | month-to-month | yes | electronic_check | 75.70 | 189.20 |
| 8    | 3440-jpscl | female | 0 | no  | no  | 6  | yes | no  | fiber_optic | no  | no  | yes | yes | yes | yes | month-to-month | yes | mailed_check | 99.95 | 547.65 |
| 11   | 2637-fkfsy | female | 0 | yes | no  | 3  | yes | no  | dsl         | no  | no  | no  | no | no  | no  | month-to-month | yes | mailed_check | 46.10 | 130.15 |
| 12   | 7228-omtpn | male   | 0 | no  | no  | 4  | yes | no  | fiber_optic | no  | no  | no  | no | yes | yes | month-to-month | yes | electronic_check | 88.45 | 370.65 |
| 19   | 6711-fldfb | female | 0 | no  | no  | 7  | yes | yes | fiber_optic | no  | no  | no  | no | no  | no  | month-to-month | yes | electronic_check | 74.90 | 541.15 |
| …    | …          | …      | … | …   | …   | …  | …   | …   | …           | …   | …   | …   | …  | …   | …   | …              | …   | …                | …     | …      |
| 1397 | 5976-jcjrh | male   | 0 | yes | no  | 10 | yes | no  | fiber_optic | no  | no  | no  | no | no  | no  | month-to-month | yes | electronic_check | 70.30 | 738.20 |
| 1398 | 2034-cgrhz | male   | 1 | no  | no  | 24 | yes | yes | fiber_optic | no  | yes | yes | no | yes | yes | month-to-month | yes | credit_card_(automatic) | 102.95 | 2496.70 |
| 1399 | 5276-kqwhg | female | 1 | no  | no  | 2  | yes | no  | fiber_optic | no  | no  | no  | no | no  | no  | month-to-month | yes | electronic_check | 69.60 | 131.65 |
| 1407 | 6521-yytyi | male   | 0 | no  | yes | 1  | yes | yes | fiber_optic | no  | no  | no  | no | yes | yes | month-to-month | yes | electronic_check | 93.30 | 93.30 |
| 1408 | 3049-solay | female | 0 | yes | no  | 3  | yes | yes | fiber_optic | no  | no  | no  | no | yes | yes | month-to-month | yes | electronic_check | 95.20 | 292.85 |

311 rows × 20 columns

These are the individuals who will receive a promotional email with a discount. The process involves selecting all the rows for which the churn_decision is true, indicating that the model predicts them as likely to churn based on the chosen threshold.

df_val[churn_decision].customerid

# Output:
# 3       8433-wxgna
# 8       3440-jpscl
# 11      2637-fkfsy
# 12      7228-omtpn
# 19      6711-fldfb
#            ...    
# 1397    5976-jcjrh
# 1398    2034-cgrhz
# 1399    5276-kqwhg
# 1407    6521-yytyi
# 1408    3049-solay
# Name: customerid, Length: 311, dtype: object
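The indexing above relies on pandas boolean masking: passing a boolean array to df[...] keeps only the rows where the mask is True. A minimal self-contained sketch (the second customer ID is made up for illustration):

```python
import numpy as np
import pandas as pd

# Two IDs from the validation output plus one made-up ID
df = pd.DataFrame({'customerid': ['8433-wxgna', '1111-aaaaa', '3440-jpscl']})
mask = np.array([True, False, True])

# Boolean mask keeps only rows where mask is True
print(df[mask].customerid.tolist())  # ['8433-wxgna', '3440-jpscl']
```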

Calculate the accuracy

Let’s assess the accuracy of our predictions. This time, we’ll use the accuracy metric instead of root mean squared error (RMSE). Accuracy is a common metric for evaluating classification models like logistic regression.

To calculate the accuracy, compare the actual values y_val with your predictions as integers. Since churn_decision is a boolean array (the result of thresholding the probabilities at 0.5), astype(int) converts True to 1 and False to 0, so the predictions can be compared directly with the labels.

y_val
# Output: array([0, 0, 0, ..., 0, 1, 1])

churn_decision.astype(int)
# Output: array([0, 0, 0, ..., 0, 1, 1])

You can check what fraction of your predictions matches the actual y_val values. Comparing the two arrays element-wise produces a boolean array, and taking its mean gives the fraction of correct predictions, which is exactly the accuracy.

(y_val == churn_decision).mean()

# Output: 0.8034066713981547
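The same number can be obtained with scikit-learn's accuracy_score, which handles the boolean-to-integer comparison for you. A small self-contained check with made-up labels (not the validation data):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and decisions, for illustration only
y_val = np.array([0, 0, 1, 1, 0])
churn_decision = np.array([False, False, True, False, False])

acc_manual = (y_val == churn_decision).mean()
acc_sklearn = accuracy_score(y_val, churn_decision)
print(acc_manual, acc_sklearn)  # 0.8 0.8
```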

Let’s examine how the last line of the last snippet works internally.

df_pred = pd.DataFrame()
df_pred['probability'] = y_pred
df_pred['prediction'] = churn_decision.astype(int)
df_pred['actual'] = y_val
df_pred['correct'] = df_pred.prediction == df_pred.actual
df_pred
|      | probability | prediction | actual | correct |
|------|-------------|------------|--------|---------|
| 0    | 0.008997    | 0          | 0      | True    |
| 1    | 0.204522    | 0          | 0      | True    |
| 2    | 0.212223    | 0          | 0      | True    |
| 3    | 0.543039    | 1          | 1      | True    |
| 4    | 0.213786    | 0          | 0      | True    |
| …    | …           | …          | …      | …       |
| 1404 | 0.313668    | 0          | 0      | True    |
| 1405 | 0.039359    | 0          | 1      | False   |
| 1406 | 0.136388    | 0          | 0      | True    |
| 1407 | 0.799759    | 1          | 1      | True    |
| 1408 | 0.837398    | 1          | 1      | True    |

1409 rows × 4 columns
df_pred.correct.mean()

# Output: 0.8034066713981547

In this context, the mean() function calculates the fraction of True values in the boolean array: True is automatically converted to 1 and False to 0, so the mean of the ‘correct’ column is exactly the accuracy.
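As a quick sanity check of that boolean-to-number conversion:

```python
import numpy as np

correct = np.array([True, True, False, True])
print(correct.mean())  # True counts as 1, False as 0, so the mean is 0.75
```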

The model has an accuracy of about 80%, meaning it predicts the correct outcome for 80% of the customers in the validation set. This indicates that the model performs reasonably well at classifying whether customers will churn, given the chosen threshold.
