ML Zoomcamp 2023 – Machine Learning for Classification – Part 10

Training logistic regression with Scikit-Learn

  1. Training logistic regression with Scikit-Learn
    1. Train a model with Scikit-Learn
    2. Apply model to the validation dataset
    3. Calculate the accuracy

Train a model with Scikit-Learn

When you want to train a logistic regression model, the process is quite similar to training a linear regression model.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)


You can use the ‘coef_’ attribute to display the weights (coefficients) in a logistic regression model.

model.coef_
# Output:
# array([[ 4.74725393e-01, -1.74869739e-01, -4.07533674e-01,
#        -2.96832307e-02, -7.79947901e-02,  6.26830488e-02,
#        -8.89697670e-02, -8.13913026e-02, -3.43104989e-02,
#        -7.33675219e-02, -3.35206588e-01,  3.16498334e-01,
#        -8.89697670e-02,  3.67393252e-03, -2.58133752e-01,
#         1.41436648e-01,  9.01908316e-03,  6.25300062e-02,
#        -8.89697670e-02, -8.12382600e-02,  2.65582755e-01,
#        -8.89697670e-02, -2.84291008e-01, -2.31202837e-01,
#         1.23524816e-01, -1.66018462e-01,  5.83404413e-02,
#        -8.70075565e-02, -3.20578701e-02,  7.04875625e-02,
#        -5.91001566e-02,  1.41436648e-01, -2.49114669e-01,
#         2.15471208e-01, -1.20363620e-01, -8.89697670e-02,
#         1.01655367e-01, -7.08936452e-02, -8.89697670e-02,
#         5.21853914e-02,  2.13378878e-01, -8.89697670e-02,
#        -2.32087131e-01, -7.04067163e-02,  3.82395921e-04]])

The ‘coef_’ attribute of a logistic regression model returns a 2-dimensional array. If you’re interested in the weight vector ‘w’, you can get it by indexing the first row: ‘coef_[0]’.

model.coef_[0].round(3)

# Output: 
# array([ 0.475, -0.175, -0.408, -0.03 , -0.078,  0.063, -0.089, -0.081,
#             -0.034, -0.073, -0.335,  0.316, -0.089,  0.004, -0.258,  0.141,
#              0.009,  0.063, -0.089, -0.081,  0.266, -0.089, -0.284, -0.231,
#              0.124, -0.166,  0.058, -0.087, -0.032,  0.07 , -0.059,  0.141,
#            -0.249,  0.215, -0.12 , -0.089,  0.102, -0.071, -0.089,  0.052,
#              0.213, -0.089, -0.232, -0.07 ,  0.   ])

You can use the ‘intercept_’ attribute to display the bias term (intercept) in a logistic regression model.

model.intercept_
# Output: array([-0.10903301])

# actually it's an array with one element
model.intercept_[0]
# Output: -0.10903300803603666

Now that we have a trained logistic regression model, we can apply it to a dataset. Let’s begin by testing it on the training data.

model.predict(X_train)

# Output: array([0, 1, 1, ..., 1, 0, 1])

We observe that the model provides hard predictions, meaning it assigns either zeros (representing “not churn”) or ones (representing “churn”). These are called hard predictions because each customer receives a definite class label rather than a probability.
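For binary problems, the hard predictions from predict are equivalent to thresholding the soft probabilities at 0.5. A quick self-contained sketch on synthetic data (not the churn dataset) illustrates this:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=200, random_state=1)
model = LogisticRegression().fit(X, y)

hard = model.predict(X)              # class labels: 0 or 1
soft = model.predict_proba(X)[:, 1]  # probabilities of class 1

# predict() agrees with thresholding the probabilities at 0.5
print((hard == (soft > 0.5).astype(int)).all())  # True
```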

Instead of hard predictions, we can generate soft predictions by using the predict_proba function, as demonstrated in the following snippet.

model.predict_proba(X_train)

# Output:
# array([[0.90451975, 0.09548025],
#       [0.32068109, 0.67931891],
#       [0.36632967, 0.63367033],
#       ...,
#       [0.46839952, 0.53160048],
#       [0.95745572, 0.04254428],
#       [0.30127894, 0.69872106]])

When using the predict_proba function in logistic regression, the output contains two columns: the first represents the probability of belonging to the negative class (0), and the second represents the probability of belonging to the positive class (1). In the context of churn prediction, we are typically interested in the second column, which is the probability of churn.
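The column order of predict_proba follows model.classes_, and each row of probabilities sums to 1. A minimal self-contained check on synthetic data (not the churn dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=100, random_state=1)
model = LogisticRegression().fit(X, y)

proba = model.predict_proba(X)
print(model.classes_)                        # column order: [0 1]
print(np.allclose(proba.sum(axis=1), 1.0))   # each row sums to 1: True
```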

Hence, you can simply extract the second column to obtain the probabilities of churn. Then, to make the final decision about whether to classify individuals as churned or not, you can choose a threshold. People with probabilities above this threshold are classified as churned, while those below it are classified as not churned. The choice of threshold can affect the model’s precision, recall, and other performance metrics, so it’s an important consideration when making predictions with logistic regression.
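To see how the threshold changes the decision, consider a handful of made-up churn probabilities: lowering the threshold flags more customers as churning (catching more true churners, but also more false alarms), while raising it flags fewer.

```python
import numpy as np

# Made-up churn probabilities, for illustration only
probs = np.array([0.05, 0.30, 0.45, 0.55, 0.70, 0.95])

print((probs >= 0.5).sum())  # 3 customers flagged at threshold 0.5
print((probs >= 0.3).sum())  # 5 customers flagged at the looser threshold 0.3
```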

Apply model to the validation dataset

y_pred = model.predict_proba(X_val)[:,1]
y_pred

# Output: 
# array([0.00899701, 0.20452226, 0.21222307, ..., 0.13638772, 0.79975934,
#       0.83739781])

The result is an array of churn probabilities, one for each customer in the validation set. To proceed, define a threshold and select all customers whose probability is above it: these are the customers the model predicts are likely to churn. Customers below the threshold are classified as not churning.

churn_decision = y_pred > 0.5
churn_decision
# Output: array([False, False, False, ..., False,  True,  True])

df_val[churn_decision]
|      | customerid | gender | seniorcitizen | partner | dependents | tenure | phoneservice | multiplelines | internetservice | onlinesecurity | onlinebackup | deviceprotection | techsupport | streamingtv | streamingmovies | contract | paperlessbilling | paymentmethod | monthlycharges | totalcharges |
|------|------------|--------|---------------|---------|------------|--------|--------------|---------------|-----------------|----------------|--------------|------------------|-------------|-------------|-----------------|----------|------------------|---------------|----------------|--------------|
| 3    | 8433-wxgna | male   | 0 | no  | no  | 2  | yes | no  | fiber_optic | yes | no  | no  | no | no  | no  | month-to-month | yes | electronic_check | 75.70 | 189.20 |
| 8    | 3440-jpscl | female | 0 | no  | no  | 6  | yes | no  | fiber_optic | no  | no  | yes | yes | yes | yes | month-to-month | yes | mailed_check | 99.95 | 547.65 |
| 11   | 2637-fkfsy | female | 0 | yes | no  | 3  | yes | no  | dsl         | no  | no  | no  | no | no  | no  | month-to-month | yes | mailed_check | 46.10 | 130.15 |
| 12   | 7228-omtpn | male   | 0 | no  | no  | 4  | yes | no  | fiber_optic | no  | no  | no  | no | yes | yes | month-to-month | yes | electronic_check | 88.45 | 370.65 |
| 19   | 6711-fldfb | female | 0 | no  | no  | 7  | yes | yes | fiber_optic | no  | no  | no  | no | no  | no  | month-to-month | yes | electronic_check | 74.90 | 541.15 |
| …    | …          | …      | … | …   | …   | …  | …   | …   | …           | …   | …   | …   | …  | …   | …   | …              | …   | …                | …     | …      |
| 1397 | 5976-jcjrh | male   | 0 | yes | no  | 10 | yes | no  | fiber_optic | no  | no  | no  | no | no  | no  | month-to-month | yes | electronic_check | 70.30 | 738.20 |
| 1398 | 2034-cgrhz | male   | 1 | no  | no  | 24 | yes | yes | fiber_optic | no  | yes | yes | no | yes | yes | month-to-month | yes | credit_card_(automatic) | 102.95 | 2496.70 |
| 1399 | 5276-kqwhg | female | 1 | no  | no  | 2  | yes | no  | fiber_optic | no  | no  | no  | no | no  | no  | month-to-month | yes | electronic_check | 69.60 | 131.65 |
| 1407 | 6521-yytyi | male   | 0 | no  | yes | 1  | yes | yes | fiber_optic | no  | no  | no  | no | yes | yes | month-to-month | yes | electronic_check | 93.30 | 93.30 |
| 1408 | 3049-solay | female | 0 | yes | no  | 3  | yes | yes | fiber_optic | no  | no  | no  | no | yes | yes | month-to-month | yes | electronic_check | 95.20 | 292.85 |

311 rows × 20 columns

These are the individuals who will receive a promotional email with a discount. The process involves selecting all the rows for which the churn_decision is true, indicating that the model predicts them as likely to churn based on the chosen threshold.

df_val[churn_decision].customerid

# Output:
# 3       8433-wxgna
# 8       3440-jpscl
# 11      2637-fkfsy
# 12      7228-omtpn
# 19      6711-fldfb
#            ...    
# 1397    5976-jcjrh
# 1398    2034-cgrhz
# 1399    5276-kqwhg
# 1407    6521-yytyi
# 1408    3049-solay
# Name: customerid, Length: 311, dtype: object
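The indexing above relies on pandas boolean masking: passing a boolean array to df[...] keeps only the rows where the mask is True. A minimal self-contained sketch (the second customer ID is made up for illustration):

```python
import numpy as np
import pandas as pd

# Two IDs from the validation output plus one made-up ID
df = pd.DataFrame({'customerid': ['8433-wxgna', '1111-aaaaa', '3440-jpscl']})
mask = np.array([True, False, True])

# Boolean mask keeps only rows where mask is True
print(df[mask].customerid.tolist())  # ['8433-wxgna', '3440-jpscl']
```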

Calculate the accuracy

Let’s assess the accuracy of our predictions. This time, we’ll use the accuracy metric instead of root mean squared error (RMSE). Accuracy is a common metric for evaluating classification models like logistic regression.

To calculate the accuracy, compare the actual values y_val with your predictions as integers. Since churn_decision is a boolean array (the result of thresholding the probabilities at 0.5), astype(int) converts True to 1 and False to 0, so the predictions can be compared directly with the labels.

y_val
# Output: array([0, 0, 0, ..., 0, 1, 1])

churn_decision.astype(int)
# Output: array([0, 0, 0, ..., 0, 1, 1])

You can check what fraction of your predictions matches the actual y_val values. Comparing the two arrays element-wise produces a boolean array, and taking its mean gives the fraction of correct predictions, which is exactly the accuracy.

(y_val == churn_decision).mean()

# Output: 0.8034066713981547
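The same number can be obtained with scikit-learn's accuracy_score, which handles the boolean-to-integer comparison for you. A small self-contained check with made-up labels (not the validation data):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and decisions, for illustration only
y_val = np.array([0, 0, 1, 1, 0])
churn_decision = np.array([False, False, True, False, False])

acc_manual = (y_val == churn_decision).mean()
acc_sklearn = accuracy_score(y_val, churn_decision)
print(acc_manual, acc_sklearn)  # 0.8 0.8
```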

Let’s examine how the last line of the last snippet works internally.

df_pred = pd.DataFrame()
df_pred['probability'] = y_pred
df_pred['prediction'] = churn_decision.astype(int)
df_pred['actual'] = y_val
df_pred['correct'] = df_pred.prediction == df_pred.actual
df_pred
|      | probability | prediction | actual | correct |
|------|-------------|------------|--------|---------|
| 0    | 0.008997    | 0          | 0      | True    |
| 1    | 0.204522    | 0          | 0      | True    |
| 2    | 0.212223    | 0          | 0      | True    |
| 3    | 0.543039    | 1          | 1      | True    |
| 4    | 0.213786    | 0          | 0      | True    |
| …    | …           | …          | …      | …       |
| 1404 | 0.313668    | 0          | 0      | True    |
| 1405 | 0.039359    | 0          | 1      | False   |
| 1406 | 0.136388    | 0          | 0      | True    |
| 1407 | 0.799759    | 1          | 1      | True    |
| 1408 | 0.837398    | 1          | 1      | True    |

1409 rows × 4 columns
df_pred.correct.mean()

# Output: 0.8034066713981547

In this context, the mean() function calculates the fraction of True values in the boolean array: True is automatically converted to 1 and False to 0, so the mean of the ‘correct’ column is exactly the accuracy.
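As a quick sanity check of that boolean-to-number conversion:

```python
import numpy as np

correct = np.array([True, True, False, True])
print(correct.mean())  # True counts as 1, False as 0, so the mean is 0.75
```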

The model has an accuracy of about 80%, meaning it predicts the correct outcome for 80% of the customers in the validation set. This indicates that the model performs reasonably well at classifying whether customers will churn, given the chosen threshold.
