Using the model
- train a model on the full_train dataset (the combined training and validation data)
df_full_train
|   | customerid | gender | seniorcitizen | partner | dependents | tenure | phoneservice | multiplelines | internetservice | onlinesecurity | … | deviceprotection | techsupport | streamingtv | streamingmovies | contract | paperlessbilling | paymentmethod | monthlycharges | totalcharges | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5442-pptjy | male | 0 | yes | yes | 12 | yes | no | no | no_internet_service | … | no_internet_service | no_internet_service | no_internet_service | no_internet_service | two_year | no | mailed_check | 19.70 | 258.35 | 0 |
| 1 | 6261-rcvns | female | 0 | no | no | 42 | yes | no | dsl | yes | … | yes | yes | no | yes | one_year | no | credit_card_(automatic) | 73.90 | 3160.55 | 1 |
| 2 | 2176-osjuv | male | 0 | yes | no | 71 | yes | yes | dsl | yes | … | no | yes | no | no | two_year | no | bank_transfer_(automatic) | 65.15 | 4681.75 | 0 |
| 3 | 6161-erdgd | male | 0 | yes | yes | 71 | yes | yes | dsl | yes | … | yes | yes | yes | yes | one_year | no | electronic_check | 85.45 | 6300.85 | 0 |
| 4 | 2364-ufrom | male | 0 | no | no | 30 | yes | no | dsl | yes | … | no | yes | yes | no | one_year | no | electronic_check | 70.40 | 2044.75 | 0 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 5629 | 0781-lkxbr | male | 1 | no | no | 9 | yes | yes | fiber_optic | no | … | yes | no | yes | yes | month-to-month | yes | electronic_check | 100.50 | 918.60 | 1 |
| 5630 | 3507-gasnp | male | 0 | no | yes | 60 | yes | no | no | no_internet_service | … | no_internet_service | no_internet_service | no_internet_service | no_internet_service | two_year | no | mailed_check | 19.95 | 1189.90 | 0 |
| 5631 | 8868-wozgu | male | 0 | no | no | 28 | yes | yes | fiber_optic | no | … | yes | no | yes | yes | month-to-month | yes | electronic_check | 105.70 | 2979.50 | 1 |
| 5632 | 1251-krreg | male | 0 | no | no | 2 | yes | yes | dsl | no | … | no | no | no | no | month-to-month | yes | mailed_check | 54.40 | 114.10 | 1 |
| 5633 | 5840-nvdcg | female | 0 | yes | yes | 16 | yes | no | dsl | yes | … | no | yes | no | yes | two_year | no | bank_transfer_(automatic) | 68.25 | 1114.85 | 0 |
First, we convert the combined dataframe into a list of dictionaries, one per customer.
dicts_full_train = df_full_train[categorical + numerical].to_dict(orient='records')
dicts_full_train[:3]
# Output:
# [{'gender': 'male',
# 'seniorcitizen': 0,
# 'partner': 'yes',
# 'dependents': 'yes',
# 'phoneservice': 'yes',
# 'multiplelines': 'no',
# 'internetservice': 'no',
# 'onlinesecurity': 'no_internet_service',
# 'onlinebackup': 'no_internet_service',
# 'deviceprotection': 'no_internet_service',
# 'techsupport': 'no_internet_service',
# 'streamingtv': 'no_internet_service',
# 'streamingmovies': 'no_internet_service',
# 'contract': 'two_year',
# 'paperlessbilling': 'no',
# 'paymentmethod': 'mailed_check',
# 'tenure': 12,
# 'monthlycharges': 19.7,
# 'totalcharges': 258.35},
# {'gender': 'female',
# 'seniorcitizen': 0,
# 'partner': 'no',
# 'dependents': 'no',
# 'phoneservice': 'yes',
# 'multiplelines': 'no',
# ...
# 'paperlessbilling': 'no',
# 'paymentmethod': 'bank_transfer_(automatic)',
# 'tenure': 71,
# 'monthlycharges': 65.15,
# 'totalcharges': 4681.75}]
# imports needed for the vectorizer and the model
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# create the DictVectorizer
dv = DictVectorizer(sparse=False)
# from these dictionaries we get the feature matrix
X_full_train = dv.fit_transform(dicts_full_train)
# then we train a model on this feature matrix
y_full_train = df_full_train.churn.values
model = LogisticRegression()
model.fit(X_full_train, y_full_train)
# do the same for the test data
dicts_test = df_test[categorical + numerical].to_dict(orient='records')
X_test = dv.transform(dicts_test)
# make predictions on the test set
y_pred = model.predict_proba(X_test)[:, 1]
# compute accuracy
churn_decision = (y_pred >= 0.5)
(churn_decision == y_test).mean()
# Output: 0.815471965933286
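The same number can also be obtained with scikit-learn's accuracy_score function; a quick sketch:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, churn_decision)
# same value as the manual computation above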
An accuracy of 81.5% on the test data is slightly better than the accuracy we obtained on the validation data. Minor differences in performance are acceptable, but large gaps between training and validation/test performance can indicate problems with the model, such as overfitting. Making sure that the model performs consistently across datasets is an important part of model evaluation and generalization.
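As an extra sanity check, we can also compute the accuracy on the training data itself and compare it with the test accuracy; a much higher training accuracy would hint at overfitting. A quick sketch:
y_pred_full_train = model.predict_proba(X_full_train)[:, 1]
((y_pred_full_train >= 0.5) == y_full_train).mean()
# training accuracy; for a well-behaved model it should be close to the test accuracy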
Let’s imagine that we want to deploy the logistic regression model on a website to predict whether a customer is likely to leave (churn). When a customer visits the website and provides their information, this data is sent as a dictionary over the network to the server hosting the model. The server uses the model to compute a churn probability and returns it, so we can decide whether the customer is likely to churn. This approach allows us to make real-time predictions about customer churn and take appropriate actions, such as sending promotional offers to customers who are likely to leave.
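A minimal sketch of what such a server-side prediction step could look like; the function name predict_single is only an illustration, and the web framework around it is omitted here:
def predict_single(customer, dv, model):
    # customer is a plain dict, exactly like the one sent over the network
    X = dv.transform([customer])             # single-row feature matrix
    return model.predict_proba(X)[0, 1]      # probability of churning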
Let’s take a sample customer from our test set:
customer = dicts_test[10]
customer
# Output:
# {'gender': 'male',
# 'seniorcitizen': 1,
# 'partner': 'yes',
# 'dependents': 'yes',
# 'phoneservice': 'yes',
# 'multiplelines': 'no',
# 'internetservice': 'fiber_optic',
# 'onlinesecurity': 'no',
# 'onlinebackup': 'yes',
# 'deviceprotection': 'no',
# 'techsupport': 'no',
# 'streamingtv': 'yes',
# 'streamingmovies': 'yes',
# 'contract': 'month-to-month',
# 'paperlessbilling': 'yes',
# 'paymentmethod': 'mailed_check',
# 'tenure': 32,
# 'monthlycharges': 93.95,
# 'totalcharges': 2861.45}
Because dv.transform expects a list of dictionaries, we wrap this single customer's dictionary in a list to get its feature matrix.
X_small = dv.transform([customer])
X_small
# Output:
# array([[1.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 1.00000e+00,
# 1.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 1.00000e+00,
# 0.00000e+00, 1.00000e+00, 0.00000e+00, 9.39500e+01, 1.00000e+00,
# 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 1.00000e+00,
# 1.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 1.00000e+00,
# 0.00000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
# 1.00000e+00, 0.00000e+00, 1.00000e+00, 1.00000e+00, 0.00000e+00,
# 0.00000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00, 1.00000e+00,
# 1.00000e+00, 0.00000e+00, 0.00000e+00, 3.20000e+01, 2.86145e+03]])
X_small.shape
# Output: (1, 45)
# one customer with 45 features
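To inspect what those 45 features are, the vectorizer can list their names; depending on the scikit-learn version the method is get_feature_names_out() (newer releases) or get_feature_names() (older ones). A quick sketch:
dv.get_feature_names_out()
# one-hot encoded categories show up as e.g. 'contract=month-to-month',
# while numerical columns keep their names: 'tenure', 'monthlycharges', 'totalcharges'
With the single-row feature matrix ready, we ask the model for the churn probability: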
model.predict_proba(X_small)[0,1]
# Output: 0.4056810977975889
This customer has a churn probability of only about 40%, so we predict that they are not going to churn.
# Let's check the actual value...
y_test[10]
# Output: 0
Our decision not to send a promotional email to this customer was correct. Now let's look at a customer who is going to churn.
customer = dicts_test[-1]
X_small = dv.transform([customer])
model.predict_proba(X_small)[0,1]
# Output: 0.5968852088398422
This customer has a churn probability of almost 60%, so we predict that they are going to churn.
# Let's check the actual value...
y_test[-1]
# Output: 1
The prediction is correct.
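Putting the pieces together, the decision logic for a single customer can be wrapped in a small helper; a sketch where the function name should_send_promo and the 0.5 threshold are illustrative choices:
def should_send_promo(customer, dv, model, threshold=0.5):
    # send a promotional offer if the predicted churn probability crosses the threshold
    X = dv.transform([customer])
    churn_probability = model.predict_proba(X)[0, 1]
    return churn_probability >= threshold

should_send_promo(dicts_test[-1], dv, model)
# True for this last customer, since 0.597 >= 0.5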