Training logistic regression with Scikit-Learn
Train a model with Scikit-Learn
When you want to train a logistic regression model, the process is quite similar to training a linear regression model.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
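LogisticRegression works out of the box with its defaults, but a few constructor parameters are worth knowing about. The snippet below is only a sketch: the specific values (and the random_state) are illustrative choices, not settings taken from this example.
# Optional: commonly adjusted parameters of LogisticRegression
# solver       - optimization algorithm ('liblinear' works well for smaller datasets)
# C            - inverse of regularization strength (smaller C means stronger regularization)
# max_iter     - maximum number of optimization iterations
# random_state - makes results reproducible for solvers that involve randomness
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=1)
model.fit(X_train, y_train)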
You can use the ‘coef_’ attribute to display the weights (coefficients) in a logistic regression model.
model.coef_
# Output:
# array([[ 4.74725393e-01, -1.74869739e-01, -4.07533674e-01,
# -2.96832307e-02, -7.79947901e-02, 6.26830488e-02,
# -8.89697670e-02, -8.13913026e-02, -3.43104989e-02,
# -7.33675219e-02, -3.35206588e-01, 3.16498334e-01,
# -8.89697670e-02, 3.67393252e-03, -2.58133752e-01,
# 1.41436648e-01, 9.01908316e-03, 6.25300062e-02,
# -8.89697670e-02, -8.12382600e-02, 2.65582755e-01,
# -8.89697670e-02, -2.84291008e-01, -2.31202837e-01,
# 1.23524816e-01, -1.66018462e-01, 5.83404413e-02,
# -8.70075565e-02, -3.20578701e-02, 7.04875625e-02,
# -5.91001566e-02, 1.41436648e-01, -2.49114669e-01,
# 2.15471208e-01, -1.20363620e-01, -8.89697670e-02,
# 1.01655367e-01, -7.08936452e-02, -8.89697670e-02,
# 5.21853914e-02, 2.13378878e-01, -8.89697670e-02,
# -2.32087131e-01, -7.04067163e-02, 3.82395921e-04]])
The ‘coef_’ attribute in logistic regression returns a 2-dimensional array; for binary classification it has shape (1, n_features), with a single row of coefficients. If you’re interested in the weight vector ‘w’, you can get it by taking that first row with ‘coef_[0]’.
model.coef_[0].round(3)
# Output:
# array([ 0.475, -0.175, -0.408, -0.03 , -0.078, 0.063, -0.089, -0.081,
# -0.034, -0.073, -0.335, 0.316, -0.089, 0.004, -0.258, 0.141,
# 0.009, 0.063, -0.089, -0.081, 0.266, -0.089, -0.284, -0.231,
# 0.124, -0.166, 0.058, -0.087, -0.032, 0.07 , -0.059, 0.141,
# -0.249, 0.215, -0.12 , -0.089, 0.102, -0.071, -0.089, 0.052,
# 0.213, -0.089, -0.232, -0.07 , 0. ])
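On their own these numbers are hard to interpret. If the feature matrix was produced with a fitted DictVectorizer (assumed here to be stored in a variable called dv, which does not appear in this section), you can pair each coefficient with its feature name to see which features push the prediction towards churn. Positive weights increase the predicted probability of churn; negative weights decrease it.
# dv is the fitted DictVectorizer used to build X_train (hypothetical variable name)
dict(zip(dv.get_feature_names_out(), model.coef_[0].round(3)))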
You can use the ‘intercept_’ attribute to display the bias term (intercept) in a logistic regression model.
model.intercept_
# Output: array([-0.10903301])
# actually it's an array with one element
model.intercept_[0]
# Output: -0.10903300803603666
Now that we have our trained logistic regression model, we can apply it to a dataset. Let’s begin by testing it on the training data.
model.predict(X_train)
# Output: array([0, 1, 1, ..., 1, 0, 1])
We observe that the model provides hard predictions, meaning it assigns either zeros (representing “not churn”) or ones (representing “churn”). They are called hard predictions because the model commits to a single class for each customer rather than reporting how confident it is in that assignment.
Instead of hard predictions, we can generate soft predictions by using the predict_proba function, as demonstrated in the following snippet.
model.predict_proba(X_train)
# Output:
# array([[0.90451975, 0.09548025],
# [0.32068109, 0.67931891],
# [0.36632967, 0.63367033],
# ...,
# [0.46839952, 0.53160048],
# [0.95745572, 0.04254428],
# [0.30127894, 0.69872106]])
Indeed, when using the predict_proba function in logistic regression, the output contains two columns. The first column represents the probability of belonging to the negative class (0), while the second column represents the probability of belonging to the positive class (1). In the context of churn prediction, we are typically interested in the second column, which represents the probability of churn.
Hence, you can simply extract the second column to obtain the probabilities of churn. Then, to make the final decision about whether to classify individuals as churned or not, you can choose a threshold. People with probabilities above this threshold are classified as churned, while those below it are classified as not churned. The choice of threshold can affect the model’s precision, recall, and other performance metrics, so it’s an important consideration when making predictions with logistic regression.
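Under the hood, these probabilities come from applying the sigmoid function to the linear score w·x + b. As a sanity check, you can reproduce the second column of predict_proba manually; this is a minimal sketch that assumes NumPy is imported as np and that X_train supports the dot product shown.
import numpy as np
# linear score z = Xw + b, then the sigmoid turns scores into probabilities
z = X_train.dot(model.coef_[0]) + model.intercept_[0]
soft_predictions = 1 / (1 + np.exp(-z))
# should match model.predict_proba(X_train)[:, 1]
soft_predictions.round(3)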
Apply model to the validation dataset
y_pred = model.predict_proba(X_val)[:,1]
y_pred
# Output:
# array([0.00899701, 0.20452226, 0.21222307, ..., 0.13638772, 0.79975934,
# 0.83739781])
The result is an array of churn probabilities, one per customer in the validation set. To proceed, you can define your chosen threshold and use it to select all customers for whom the model predicts churn. This lets you identify the customers the model suggests are likely to churn based on the chosen threshold.
churn_decision = y_pred > 0.5
churn_decision
# Output: array([False, False, False, ..., False, True, True])
df_val[churn_decision]
| | customerid | gender | seniorcitizen | partner | dependents | tenure | phoneservice | multiplelines | internetservice | onlinesecurity | onlinebackup | deviceprotection | techsupport | streamingtv | streamingmovies | contract | paperlessbilling | paymentmethod | monthlycharges | totalcharges |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 8433-wxgna | male | 0 | no | no | 2 | yes | no | fiber_optic | yes | no | no | no | no | no | month-to-month | yes | electronic_check | 75.70 | 189.20 |
| 8 | 3440-jpscl | female | 0 | no | no | 6 | yes | no | fiber_optic | no | no | yes | yes | yes | yes | month-to-month | yes | mailed_check | 99.95 | 547.65 |
| 11 | 2637-fkfsy | female | 0 | yes | no | 3 | yes | no | dsl | no | no | no | no | no | no | month-to-month | yes | mailed_check | 46.10 | 130.15 |
| 12 | 7228-omtpn | male | 0 | no | no | 4 | yes | no | fiber_optic | no | no | no | no | yes | yes | month-to-month | yes | electronic_check | 88.45 | 370.65 |
| 19 | 6711-fldfb | female | 0 | no | no | 7 | yes | yes | fiber_optic | no | no | no | no | no | no | month-to-month | yes | electronic_check | 74.90 | 541.15 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 1397 | 5976-jcjrh | male | 0 | yes | no | 10 | yes | no | fiber_optic | no | no | no | no | no | no | month-to-month | yes | electronic_check | 70.30 | 738.20 |
| 1398 | 2034-cgrhz | male | 1 | no | no | 24 | yes | yes | fiber_optic | no | yes | yes | no | yes | yes | month-to-month | yes | credit_card_(automatic) | 102.95 | 2496.70 |
| 1399 | 5276-kqwhg | female | 1 | no | no | 2 | yes | no | fiber_optic | no | no | no | no | no | no | month-to-month | yes | electronic_check | 69.60 | 131.65 |
| 1407 | 6521-yytyi | male | 0 | no | yes | 1 | yes | yes | fiber_optic | no | no | no | no | yes | yes | month-to-month | yes | electronic_check | 93.30 | 93.30 |
| 1408 | 3049-solay | female | 0 | yes | no | 3 | yes | yes | fiber_optic | no | no | no | no | yes | yes | month-to-month | yes | electronic_check | 95.20 | 292.85 |
These are the individuals who will receive a promotional email with a discount. The process involves selecting all the rows for which the churn_decision is true, indicating that the model predicts them as likely to churn based on the chosen threshold.
df_val[churn_decision].customerid
# Output:
# 3 8433-wxgna
# 8 3440-jpscl
# 11 2637-fkfsy
# 12 7228-omtpn
# 19 6711-fldfb
# ...
# 1397 5976-jcjrh
# 1398 2034-cgrhz
# 1399 5276-kqwhg
# 1407 6521-yytyi
# 1408 3049-solay
# Name: customerid, Length: 311, dtype: object
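If you wanted to hand this list to a marketing team, one possible next step (not part of the original example) is to export the selected customer IDs to a file:
# hypothetical follow-up: save the IDs of customers flagged for the promotional email
df_val[churn_decision].customerid.to_csv('churn_candidates.csv', index=False)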
Calculate the accuracy
Let’s assess the accuracy of our predictions. This time, we’ll use the accuracy metric instead of root mean squared error (RMSE). Accuracy is a common metric for evaluating classification models like logistic regression.
To calculate the accuracy, you need the actual values y_val and your predictions as integers. The thresholding already happened when you compared the probabilities with 0.5, so churn_decision is a boolean array; applying astype(int) converts True to 1 and False to 0. You can then compare these integer predictions with the actual values to calculate the accuracy.
y_val
# Output: array([0, 0, 0, ..., 0, 1, 1])
churn_decision.astype(int)
# Output: array([0, 0, 0, ..., 0, 1, 1])
To calculate accuracy, you can check what fraction of your predictions match the actual y_val values. Taking the mean of the element-wise comparison is a shortcut for counting the True (or 1) values and dividing by the total number of predictions.
(y_val == churn_decision).mean()
# Output: 0.8034066713981547
Let’s examine how that one-line accuracy calculation works internally.
import pandas as pd

df_pred = pd.DataFrame()
df_pred['probability'] = y_pred
df_pred['prediction'] = churn_decision.astype(int)
df_pred['actual'] = y_val
df_pred['correct'] = df_pred.prediction == df_pred.actual
df_pred
| | probability | prediction | actual | correct |
|---|---|---|---|---|
| 0 | 0.008997 | 0 | 0 | True |
| 1 | 0.204522 | 0 | 0 | True |
| 2 | 0.212223 | 0 | 0 | True |
| 3 | 0.543039 | 1 | 1 | True |
| 4 | 0.213786 | 0 | 0 | True |
| … | … | … | … | … |
| 1404 | 0.313668 | 0 | 0 | True |
| 1405 | 0.039359 | 0 | 1 | False |
| 1406 | 0.136388 | 0 | 0 | True |
| 1407 | 0.799759 | 1 | 1 | True |
| 1408 | 0.837398 | 1 | 1 | True |
df_pred.correct.mean()
# Output: 0.8034066713981547
In this context, the mean() function calculates the fraction of ones in a boolean array: True values are treated as 1 and False values as 0, so the mean is exactly the share of correct predictions. This automatic conversion makes calculating accuracy a one-liner.
The model has an accuracy of about 80%, which means its prediction is correct for roughly 80% of the customers in the validation set. This indicates that the model performs reasonably well at classifying whether customers will churn, given the chosen threshold.
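For completeness, Scikit-Learn also ships a ready-made accuracy function that returns the same number, which becomes convenient once you start computing several metrics at once:
from sklearn.metrics import accuracy_score
# same value as (y_val == churn_decision).mean()
accuracy_score(y_val, churn_decision.astype(int))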