The purpose of this project is to aid Megaline, a mobile carrier, in developing a model that will analyze subscribers' behavior. Once the behavior is analyzed, subscribers on a legacy plan can be recommended one of Megaline's newer plans: Smart or Ultra. We are provided with behavior data from subscribers who have already switched to the new plans. A successful model will recommend the correct new plan to a legacy plan customer. We will be working with data we have used previously, so the data is already clean.
We understand that this is a classification problem, as we are determining which plan to recommend: Smart or Ultra. Since accuracy is our metric of interest, we want to maximize the total number of correct recommendations. Consequently, we will likely go with a random forest model, as prediction speed is not a crucial factor. We set the minimum accuracy threshold at 0.75, meaning we must correctly recommend a plan for at least 75% of customers. Since we do not have a separate test dataset, we will split our source data to create validation and test datasets, each with the conventional 20% of the data.
# !pip install --user -U plotly_express
# import necessary libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import classification_report
from sklearn import metrics
import plotly_express as px
import plotly.graph_objects as go
# Read dataframe
df = pd.read_csv('datasets/users_behavior.csv')
# Look at dataframe
df
 | calls | minutes | messages | mb_used | is_ultra |
---|---|---|---|---|---|
0 | 40.0 | 311.90 | 83.0 | 19915.42 | 0 |
1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
... | ... | ... | ... | ... | ... |
3209 | 122.0 | 910.98 | 20.0 | 35124.90 | 1 |
3210 | 25.0 | 190.36 | 0.0 | 3275.61 | 0 |
3211 | 97.0 | 634.44 | 70.0 | 13974.06 | 0 |
3212 | 64.0 | 462.32 | 90.0 | 31239.78 | 0 |
3213 | 80.0 | 566.09 | 6.0 | 29480.52 | 1 |
3214 rows × 5 columns
# Change column types to integer
df.calls = df.calls.astype('int')
df.messages = df.messages.astype('int')
# Confirm data type change, look at data summary
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   calls     3214 non-null   int32
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   int32
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64
dtypes: float64(2), int32(2), int64(1)
memory usage: 100.6 KB
# Ensure no missing values
df.isna().sum()
calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64
# Ensure no duplicates
df.duplicated().sum()
0
# Values counts of each plan, 1 is Ultra
df.is_ultra.value_counts()
0    2229
1     985
Name: is_ultra, dtype: int64
# Splitting dataset into 3
features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=19) # split off 20% of the data for the test set
features_train, features_valid, target_train, target_valid = train_test_split(
features_train, target_train, test_size=0.25, random_state=19) # 0.25 x 0.8 = 0.2
# Visual of the split data
print(features_train.shape)
print(target_train.shape)
print(features_valid.shape)
print(target_valid.shape)
print(features_test.shape)
print(target_test.shape)
(1928, 4)
(1928,)
(643, 4)
(643,)
(643, 4)
(643,)
# Decision Tree with loop for depth
best_model = None
best_result = 0
best_depth = 0
for depth in range(1, 101):
    model = DecisionTreeClassifier(random_state=19, max_depth=depth)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    result = accuracy_score(target_valid, predictions_valid) ** 0.5 # square root of the validation accuracy (monotonic, so the ranking of depths is unaffected, but the printed value is higher than the raw accuracy)
    if result > best_result:
        best_model = model
        best_result = result
        best_depth = depth
print("Best Depth:", best_depth, "," " Accuracy of the best model on the validation set:", best_result)
Best Depth: 8 , Accuracy of the best model on the validation set: 0.8870942657868514
# Accuracy of decision tree
# Note: 'model' here is the last tree from the loop (max_depth=100), not best_model,
# and 'features'/'target' are the full dataset rather than the training split
train_predictions = model.predict(features)
valid_predictions = model.predict(features_valid)
print('Accuracy')
print('Training set:', accuracy_score(target, train_predictions))
print('Validation set:', accuracy_score(target_valid, valid_predictions))
Accuracy
Training set: 0.8864343497199751
Validation set: 0.71850699844479
The decision tree was accurate on the training data, yet noticeably less accurate when making predictions on the validation set, which suggests overfitting. Consequently, we will try another model to see if we can achieve higher accuracy than our decision tree.
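To see where the tree starts to overfit, we can also track training and validation accuracy side by side as the depth grows. Below is a minimal sketch reusing the split variables defined above; the depth range is illustrative.
# Sketch: compare training vs. validation accuracy across tree depths
# to see where the gap (overfitting) opens up
for depth in range(1, 21):
    tree = DecisionTreeClassifier(random_state=19, max_depth=depth)
    tree.fit(features_train, target_train)
    train_acc = accuracy_score(target_train, tree.predict(features_train))
    valid_acc = accuracy_score(target_valid, tree.predict(features_valid))
    print(f"depth={depth:2d}  train={train_acc:.3f}  valid={valid_acc:.3f}")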
# Logistic Regression
model3 = LogisticRegression(random_state=19, solver='liblinear')
model3.fit(features_train, target_train)
score_train = model3.score(features_train, target_train)
score_valid = model3.score(features_valid, target_valid)
# Accuracy (as before, the 'training' score here is computed on the full dataset)
train_predictions3 = model3.predict(features)
valid_predictions3 = model3.predict(features_valid)
print('Accuracy')
print('Training set:', accuracy_score(target, train_predictions3))
print('Validation set:', accuracy_score(target_valid, valid_predictions3))
Accuracy
Training set: 0.7417548226509023
Validation set: 0.7107309486780715
This logistic regression model is less accurate than the decision tree on both the training and validation sets, although the gap between its training and validation accuracy is much smaller. We will try one more model to see if we can get better scores.
# Random Forest
best_model = None
best_result = 0
best_est = 0
best_depth1 = 0
for est in range(10, 101, 10):
    for depth in range(1, 101):
        model1 = RandomForestClassifier(random_state=19, n_estimators=est, max_depth=depth)
        model1.fit(features_train, target_train) # train model on training set
        predictions_valid1 = model1.predict(features_valid) # get model predictions on validation set
        result1 = accuracy_score(target_valid, predictions_valid1) ** 0.5 # square root of the validation accuracy (monotonic, so the best hyperparameters are unaffected, but the printed value is higher than the raw accuracy)
        if result1 > best_result:
            best_model = model1
            best_result = result1
            best_est = est
            best_depth1 = depth
print("Accuracy of the best model on the validation set:", best_result, "n_estimators:", best_est, "best_depth:", best_depth1)
final_model = RandomForestClassifier(random_state=19, n_estimators=best_est, max_depth=best_depth1) # retrain the best hyperparameter combination on the training set
final_model.fit(features_train, target_train)
Accuracy of the best model on the validation set: 0.8984174785618215 n_estimators: 40 best_depth: 13
RandomForestClassifier(max_depth=13, n_estimators=40, random_state=19)
# Model Parameters
final_model.get_params()
{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 13, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 40, 'n_jobs': None, 'oob_score': False, 'random_state': 19, 'verbose': 0, 'warm_start': False}
# Accuracy (again, the 'training' score is computed on the full dataset)
train_predictions1 = final_model.predict(features)
valid_predictions1 = final_model.predict(features_valid)
print('Accuracy')
print('Training set:', accuracy_score(target, train_predictions1))
print('Validation set:', accuracy_score(target_valid, valid_predictions1))
Accuracy
Training set: 0.8820784069695085
Validation set: 0.807153965785381
This model achieves the highest validation accuracy of the three while remaining strong on the training data. We will move forward with this model and calculate more metrics.
# Overall Accuracy
test_predictions1 = final_model.predict(features_test)
print('Accuracy')
print('Test set:', accuracy_score(target_test, test_predictions1))
Accuracy
Test set: 0.80248833592535
The accuracy of the model on the test set is 80%, which is fair. This is a general indication of the model's overall performance. Since it meets our accuracy threshold of 75%, we consider the model appropriate for use.
# Null Accuracy
max(target_test.mean(), 1 - target_test.mean())
0.687402799377916
The null accuracy is 68.7%. This is the accuracy obtained by always predicting the majority class (not Ultra). Since our test set accuracy is well above the null accuracy, we conclude our model is better than always assuming not Ultra. The null accuracy is a useful baseline to compare against our classifier's accuracy.
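As a cross-check, scikit-learn's DummyClassifier reproduces this majority-class baseline; a small sketch:
# Sketch: majority-class baseline for comparison with the model's test accuracy
from sklearn.dummy import DummyClassifier

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(features_train, target_train)
print('Baseline accuracy on the test set:', baseline.score(features_test, target_test))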
# Confusion Matrix
conf_matrix = metrics.confusion_matrix(target_test, test_predictions1)
conf_matrix
array([[399,  43],
       [ 84, 117]], dtype=int64)
# Confusion Matrix Figure
fig = px.imshow(conf_matrix, text_auto=True, labels=dict(y="Actual", x="Predicted"),
x=['Not Ultra', 'Is Ultra'],
y=['Not Ultra', 'Is Ultra'], title='Confusion Matrix')
fig.show()
fig = go.Figure(data=go.Heatmap(z=[[399, 43], [84, 117]],
                                text=[['True Negatives', 'False Positives'], ['False Negatives', 'True Positives']],
                                texttemplate="%{text}", textfont={"size":20},
                                x=['Not Ultra', 'Is Ultra'],
                                y=['Not Ultra', 'Is Ultra']))
fig.update_layout(xaxis_title='Predicted', yaxis_title='Actual')
fig.show()
The confusion matrix breaks down the test predictions by outcome and summarizes the performance of a classification algorithm. It shows us key counts such as true negatives and true positives. True positives are observations that were predicted to be Ultra and were actually Ultra. True negatives are observations that were predicted to not be Ultra and were indeed not Ultra. We see this model is better at predicting true negatives than true positives, which demonstrates that the model can more easily determine when a plan should not be Ultra than when it should be Ultra. We also see that false positives are relatively rare, while false negatives are more common, consistent with the weaker performance on the Ultra class.
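For reference, scikit-learn lays the binary confusion matrix out as [[TN, FP], [FN, TP]], so the four counts can be unpacked directly; a small sketch:
# Sketch: unpack the confusion matrix (sklearn order: [[TN, FP], [FN, TP]])
tn, fp, fn, tp = conf_matrix.ravel()
print('True Negatives:', tn)   # predicted Not Ultra, actually Not Ultra
print('False Positives:', fp)  # predicted Ultra, actually Not Ultra
print('False Negatives:', fn)  # predicted Not Ultra, actually Ultra
print('True Positives:', tp)   # predicted Ultra, actually Ultra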
# Sensitivity
# TP / TP + FN
print('The sensitivity is:', metrics.recall_score(target_test, test_predictions1) * 100, '%')
The sensitivity is: 58.2089552238806 %
The sensitivity evaluates how well the classifier detects positive instances, i.e. when the plan is actually Ultra. The value is 58.2%, which is fair.
# Specificity
# TN / (TN + FP)
print('The specificity is:', 399 / float(399 + 43)* 100, '%')
The specificity is: 90.27149321266968 %
The specificity evaluates how well the classifier detects when the plan is not Ultra, when the plan is indeed not Ultra. The value is high, at 90.27%.
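Rather than hardcoding the counts, specificity can also be computed as the recall of the negative (not Ultra) class; a small sketch (the false positive rate in the next cell is then simply 1 - specificity):
# Sketch: specificity without hardcoded counts
# Specificity is the recall of the negative (Not Ultra) class
specificity = metrics.recall_score(target_test, test_predictions1, pos_label=0)
print('Specificity:', specificity * 100, '%')
print('False positive rate:', (1 - specificity) * 100, '%')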
# False Positive Rate
# FP / (TN + FP)
print('The false positives are:', 43 / float(399 + 43)* 100, '%')
The false positives are: 9.728506787330318 %
The false positive rate measures how often the model predicts Ultra when an observation is actually not Ultra. This measure is relatively low, at 9.7%.
# Precision
# TP / (TP + FP)
print('The precision is:', metrics.precision_score(target_test, test_predictions1)*100, '%')
The precision is: 73.125 %
Precision evaluates how often an Ultra prediction correctly identifies an Ultra observation. This measure is 73.1%, which is fair.
# Classification report
print(classification_report(target_test, test_predictions1))
              precision    recall  f1-score   support

           0       0.83      0.90      0.86       442
           1       0.73      0.58      0.65       201

    accuracy                           0.80       643
   macro avg       0.78      0.74      0.76       643
weighted avg       0.80      0.80      0.80       643
Overview of some prediction metrics. Recall measures the ability of the classifier to correctly find all Ultra instances. The f1-score is the harmonic mean of precision and recall. Support is the number of actual occurrences of each class in the test set. We see significantly more plans that are not Ultra; this class imbalance helps explain the lower recall and f1-score for the Ultra class.
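As a quick sanity check on the report, the f1-score for the Ultra class can be recomputed as the harmonic mean of precision and recall; a small sketch:
# Sketch: the f1-score is the harmonic mean of precision and recall
precision = metrics.precision_score(target_test, test_predictions1)
recall = metrics.recall_score(target_test, test_predictions1)
print('f1 (manual): ', 2 * precision * recall / (precision + recall))
print('f1 (sklearn):', metrics.f1_score(target_test, test_predictions1))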
# Separating prediction probabilities for both plans from the final model
is_ultra = (final_model.predict_proba(features_test)*100)[:, 1]
not_ultra = (final_model.predict_proba(features_test)*100)[:, 0]
# Displaying histogram for prediction probabilities of both plans
fig = go.Figure()
fig.add_trace(go.Histogram(x=not_ultra, name='Not Ultra'))
fig.add_trace(go.Histogram(x=is_ultra, name='Is Ultra'))
# Overlay both histograms
fig.update_layout(barmode='overlay')
# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
fig.update_layout(
title_text='Distribution of Ultra vs Not Ultra',
xaxis_title_text='Prediction Percent Confidence',
yaxis_title_text='Count',
)
fig.show()
The figure shows the inner workings of the model when predicting whether a plan should be Ultra. For most observations, the model assigns only a 3-7 percent probability to the Ultra class, and very few predictions are made with more than 50% confidence in Ultra. Correspondingly, the model is usually 93-97 percent certain when predicting that a plan is not Ultra. This happens because the model assigns a likelihood to each class, Ultra and not Ultra, and the class with the higher likelihood becomes the prediction, with a threshold of 0.5. Therefore, if the model assigns Ultra a probability of 0.51, it will predict Ultra, since the probability of not Ultra is 0.49. Consequently, prediction probabilities in the middle range, roughly 0.4 to 0.6, are the least reliable at identifying the correct classification.
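To make the 0.5 threshold explicit, we can threshold the Ultra probabilities ourselves and check how often this reproduces the model's own class predictions; a small sketch:
# Sketch: applying the default 0.5 threshold to the Ultra probabilities
# and comparing against the classifier's own predictions
proba_ultra = final_model.predict_proba(features_test)[:, 1]
manual_predictions = (proba_ultra > 0.5).astype(int)
agreement = (manual_predictions == final_model.predict(features_test)).mean()
print('Agreement with final_model.predict():', agreement)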
We successfully created a model to predict which plan a legacy customer should upgrade to; the choices were the Ultra plan or not the Ultra plan. Our model accurately predicted the correct class 80% of the time on the test set. Accuracy was chosen as the key evaluation metric because it reflects the overall proportion of correct recommendations across both classes, which matters for customer satisfaction: we want to recommend the plan that actually fits each customer's needs. The model worked best on not Ultra classifications, as it had high specificity. The precision of the model was fair, with Ultra predictions being correct 73.1% of the time.

The model was tuned over the number of estimators and the max depth within ranges that kept training time reasonable, and other hyperparameters were experimented with to determine what the best model should use. Overall, the model could be improved by further widening the ranges for estimators and max depth. We also believe the model would work better with a more balanced dataset, meaning more data on customers who migrated to the Ultra plan. Collecting that data could prove difficult if customers prefer not to migrate to Ultra, so this model may be close to the final product.
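If we do extend the search, scikit-learn's GridSearchCV is one way to widen the ranges for n_estimators and max_depth, and to try class_weight='balanced' for the imbalance, without hand-rolled loops. The sketch below uses illustrative, untuned grid values; a grid this size trains many models and can take a while.
# Sketch: a wider, cross-validated hyperparameter search (illustrative values)
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': list(range(10, 201, 20)),
    'max_depth': list(range(2, 31, 2)),
    'class_weight': [None, 'balanced'],  # 'balanced' partially offsets the class imbalance
}
search = GridSearchCV(
    RandomForestClassifier(random_state=19),
    param_grid,
    scoring='accuracy',
    cv=5,
)
search.fit(features_train, target_train)
print(search.best_params_, search.best_score_)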