Using the model
- train a model on the full_train dataset (the combined training and validation data)
df_full_train
|   | customerid | gender | seniorcitizen | partner | dependents | tenure | phoneservice | multiplelines | internetservice | onlinesecurity | … | deviceprotection | techsupport | streamingtv | streamingmovies | contract | paperlessbilling | paymentmethod | monthlycharges | totalcharges | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5442-pptjy | male | 0 | yes | yes | 12 | yes | no | no | no_internet_service | … | no_internet_service | no_internet_service | no_internet_service | no_internet_service | two_year | no | mailed_check | 19.70 | 258.35 | 0 |
| 1 | 6261-rcvns | female | 0 | no | no | 42 | yes | no | dsl | yes | … | yes | yes | no | yes | one_year | no | credit_card_(automatic) | 73.90 | 3160.55 | 1 |
| 2 | 2176-osjuv | male | 0 | yes | no | 71 | yes | yes | dsl | yes | … | no | yes | no | no | two_year | no | bank_transfer_(automatic) | 65.15 | 4681.75 | 0 |
| 3 | 6161-erdgd | male | 0 | yes | yes | 71 | yes | yes | dsl | yes | … | yes | yes | yes | yes | one_year | no | electronic_check | 85.45 | 6300.85 | 0 |
| 4 | 2364-ufrom | male | 0 | no | no | 30 | yes | no | dsl | yes | … | no | yes | yes | no | one_year | no | electronic_check | 70.40 | 2044.75 | 0 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 5629 | 0781-lkxbr | male | 1 | no | no | 9 | yes | yes | fiber_optic | no | … | yes | no | yes | yes | month-to-month | yes | electronic_check | 100.50 | 918.60 | 1 |
| 5630 | 3507-gasnp | male | 0 | no | yes | 60 | yes | no | no | no_internet_service | … | no_internet_service | no_internet_service | no_internet_service | no_internet_service | two_year | no | mailed_check | 19.95 | 1189.90 | 0 |
| 5631 | 8868-wozgu | male | 0 | no | no | 28 | yes | yes | fiber_optic | no | … | yes | no | yes | yes | month-to-month | yes | electronic_check | 105.70 | 2979.50 | 1 |
| 5632 | 1251-krreg | male | 0 | no | no | 2 | yes | yes | dsl | no | … | no | no | no | no | month-to-month | yes | mailed_check | 54.40 | 114.10 | 1 |
| 5633 | 5840-nvdcg | female | 0 | yes | yes | 16 | yes | no | dsl | yes | … | no | yes | no | yes | two_year | no | bank_transfer_(automatic) | 68.25 | 1114.85 | 0 |
First, we convert the combined dataframe into a list of dictionaries, one per customer.
dicts_full_train = df_full_train[categorical + numerical].to_dict(orient='records')
dicts_full_train[:3]
# Output:
# [{'gender': 'male',
# 'seniorcitizen': 0,
# 'partner': 'yes',
# 'dependents': 'yes',
# 'phoneservice': 'yes',
# 'multiplelines': 'no',
# 'internetservice': 'no',
# 'onlinesecurity': 'no_internet_service',
# 'onlinebackup': 'no_internet_service',
# 'deviceprotection': 'no_internet_service',
# 'techsupport': 'no_internet_service',
# 'streamingtv': 'no_internet_service',
# 'streamingmovies': 'no_internet_service',
# 'contract': 'two_year',
# 'paperlessbilling': 'no',
# 'paymentmethod': 'mailed_check',
# 'tenure': 12,
# 'monthlycharges': 19.7,
# 'totalcharges': 258.35},
# {'gender': 'female',
# 'seniorcitizen': 0,
# 'partner': 'no',
# 'dependents': 'no',
# 'phoneservice': 'yes',
# 'multiplelines': 'no',
# ...
# 'paperlessbilling': 'no',
# 'paymentmethod': 'bank_transfer_(automatic)',
# 'tenure': 71,
# 'monthlycharges': 65.15,
# 'totalcharges': 4681.75}]
# imports needed for the vectorizer and the model
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# create the DictVectorizer
dv = DictVectorizer(sparse=False)
# from these dictionaries we get the feature matrix
X_full_train = dv.fit_transform(dicts_full_train)
# then we train a model on this feature matrix
y_full_train = df_full_train.churn.values
model = LogisticRegression()
model.fit(X_full_train, y_full_train)
# do the same for the test data
dicts_test = df_test[categorical + numerical].to_dict(orient='records')
X_test = dv.transform(dicts_test)
# make predictions on the test set
y_pred = model.predict_proba(X_test)[:, 1]
# compute accuracy
churn_decision = (y_pred >= 0.5)
(churn_decision == y_test).mean()
# Output: 0.815471965933286
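The same number can also be obtained with scikit-learn's accuracy_score function; a quick sketch:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, churn_decision)
# same value as the manual computation above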
An accuracy of 81.5% on the test data is slightly better than the accuracy we obtained on the validation data. Minor differences in performance are acceptable, but large gaps between training and validation/test performance can indicate problems with the model, such as overfitting. Making sure that the model performs consistently across datasets is an important part of model evaluation and generalization.
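As an extra sanity check, we can also compute the accuracy on the training data itself and compare it with the test accuracy; a much higher training accuracy would hint at overfitting. A quick sketch:
y_pred_full_train = model.predict_proba(X_full_train)[:, 1]
((y_pred_full_train >= 0.5) == y_full_train).mean()
# training accuracy; for a well-behaved model it should be close to the test accuracy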
Let’s imagine that we want to deploy the logistic regression model on a website to predict whether a customer is likely to leave (churn). When a customer visits the website and provides their information, this data is sent as a dictionary over the network to the server hosting the model. The server uses the model to compute a churn probability and returns it, so we can decide whether the customer is likely to churn. This approach allows us to make real-time predictions about customer churn and take appropriate actions, such as sending promotional offers to customers who are likely to leave.
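A minimal sketch of what such a server-side prediction step could look like; the function name predict_single is only an illustration, and the web framework around it is omitted here:
def predict_single(customer, dv, model):
    # customer is a plain dict, exactly like the one sent over the network
    X = dv.transform([customer])             # single-row feature matrix
    return model.predict_proba(X)[0, 1]      # probability of churning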
Let’s take a sample customer from our test set:
customer = dicts_test[10]
customer
# Output:
# {'gender': 'male',
# 'seniorcitizen': 1,
# 'partner': 'yes',
# 'dependents': 'yes',
# 'phoneservice': 'yes',
# 'multiplelines': 'no',
# 'internetservice': 'fiber_optic',
# 'onlinesecurity': 'no',
# 'onlinebackup': 'yes',
# 'deviceprotection': 'no',
# 'techsupport': 'no',
# 'streamingtv': 'yes',
# 'streamingmovies': 'yes',
# 'contract': 'month-to-month',
# 'paperlessbilling': 'yes',
# 'paymentmethod': 'mailed_check',
# 'tenure': 32,
# 'monthlycharges': 93.95,
# 'totalcharges': 2861.45}
Because dv.transform expects a list of dictionaries, we wrap this single customer's dictionary in a list to get its feature matrix.
X_small = dv.transform([customer])
X_small
# Output:
# array([[1.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 1.00000e+00,
# 1.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 1.00000e+00,
# 0.00000e+00, 1.00000e+00, 0.00000e+00, 9.39500e+01, 1.00000e+00,
# 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 1.00000e+00,
# 1.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 1.00000e+00,
# 0.00000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
# 1.00000e+00, 0.00000e+00, 1.00000e+00, 1.00000e+00, 0.00000e+00,
# 0.00000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00, 1.00000e+00,
# 1.00000e+00, 0.00000e+00, 0.00000e+00, 3.20000e+01, 2.86145e+03]])
X_small.shape
# Output: (1, 45)
# one customer with 45 features
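To inspect what those 45 features are, the vectorizer can list their names; depending on the scikit-learn version the method is get_feature_names_out() (newer releases) or get_feature_names() (older ones). A quick sketch:
dv.get_feature_names_out()
# one-hot encoded categories show up as e.g. 'contract=month-to-month',
# while numerical columns keep their names: 'tenure', 'monthlycharges', 'totalcharges'
With the single-row feature matrix ready, we ask the model for the churn probability: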
model.predict_proba(X_small)[0,1]
# Output: 0.4056810977975889
This customer has a churn probability of only about 40%, so we predict that they are not going to churn.
# Let's check the actual value...
y_test[10]
# Output: 0
Our decision not to send a promotional email to this customer was correct. Now let's look at a customer who is going to churn.
customer = dicts_test[-1]
X_small = dv.transform([customer])
model.predict_proba(X_small)[0,1]
# Output: 0.5968852088398422
This customer has a churn probability of almost 60%, so we predict that they are going to churn.
# Let's check the actual value...
y_test[-1]
# Output: 1
The prediction is correct.
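Putting the pieces together, the decision logic for a single customer can be wrapped in a small helper; a sketch where the function name should_send_promo and the 0.5 threshold are illustrative choices:
def should_send_promo(customer, dv, model, threshold=0.5):
    # send a promotional offer if the predicted churn probability crosses the threshold
    X = dv.transform([customer])
    churn_probability = model.predict_proba(X)[0, 1]
    return churn_probability >= threshold

should_send_promo(dicts_test[-1], dv, model)
# True for this last customer, since 0.597 >= 0.5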