Table of contents

  • The Sure Tomorrow Insurance Company
  • Data Preprocessing & Exploration
    • Initialization
    • Load Data
      • Conclusions
    • EDA
  • Task 1. Similar Customers
  • The Sure Tomorrow Insurance Company
  • Task 2. Is Customer Likely to Receive Insurance Benefit?
    • Unscaled
      • Conclusions
    • Scaled
      • Conclusions
    • Dummy Model
      • Conclusion
  • Task 3. Regression (with Linear Regression)
    • Original data
    • Scaled Data
      • Conclusions
  • Task 4. Obfuscating Data
    • Test Linear Regression With Data Obfuscation
    • Original
    • Obfuscated
      • Conclusion
  • Final Conclusions

The Sure Tomorrow Insurance Company¶

The Sure Tomorrow insurance company wants to solve several tasks with the help of machine learning, and we are asked to evaluate that possibility.

  • Task 1: Find customers who are similar to a given customer. This will help the company's agents with marketing.
  • Task 2: Predict whether a new customer is likely to receive an insurance benefit. Can a prediction model do better than a dummy model?
  • Task 3: Predict the number of insurance benefits a new customer is likely to receive using a linear regression model.
  • Task 4: Protect clients' personal data without breaking the model from the previous task. It's necessary to develop a data transformation algorithm that would make it hard to recover personal information if the data fell into the wrong hands. This is called data masking, or data obfuscation. But the data should be protected in such a way that the quality of machine learning models doesn't suffer. You don't need to pick the best model, just prove that the algorithm works correctly.

Data Preprocessing & Exploration¶

Initialization¶

In [ ]:
# pip install scikit-learn --upgrade
In [ ]:
# import libraries
import numpy as np
import pandas as pd
import math
import seaborn as sns
import sklearn.linear_model
import sklearn.metrics
from sklearn.metrics import f1_score
import sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier
import sklearn.preprocessing
from sklearn.preprocessing import Binarizer
from sklearn.model_selection import train_test_split, cross_val_score
from IPython.display import display
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve, auc, r2_score, mean_squared_error as mse
from sklearn.utils import shuffle
import plotly.express as px
import plotly.graph_objects as go 
from sklearn.metrics import classification_report
from sklearn.neighbors import NearestNeighbors

Load Data¶

Load data and conduct a basic check that it's free from obvious issues.

In [ ]:
# read dataframe
df = pd.read_csv('datasets/insurance_us.csv')

We rename the columns to lowercase snake_case names so that the code stays stylistically consistent. Note that 'Salary' becomes 'income'.

In [ ]:
# change column names
df = df.rename(columns={'Gender': 'gender', 'Age': 'age', 'Salary': 'income', 'Family members': 'family_members', 'Insurance benefits': 'insurance_benefits'})
In [ ]:
# look at the data
df.sample(10)
Out[ ]:
gender age income family_members insurance_benefits
3213 0 22.0 35400.0 1 0
4851 0 31.0 39200.0 0 0
2587 1 26.0 59200.0 0 0
4581 1 41.0 38400.0 1 0
3940 0 25.0 46800.0 2 0
4394 0 38.0 36800.0 1 0
2641 0 32.0 28300.0 1 0
46 0 26.0 34500.0 1 0
3731 1 31.0 57100.0 2 0
1178 1 18.0 58400.0 2 0

We have four features (gender, age, income, and family members) and one target, insurance benefits.

In [ ]:
# info on columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   gender              5000 non-null   int64  
 1   age                 5000 non-null   float64
 2   income              5000 non-null   float64
 3   family_members      5000 non-null   int64  
 4   insurance_benefits  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB
In [ ]:
# convert age and income from float to int (not critical, but both columns hold whole numbers)
df.age = df.age.astype('int')
df.income = df.income.astype('int')
In [ ]:
# check to see that the conversion was successful
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   gender              5000 non-null   int64
 1   age                 5000 non-null   int32
 2   income              5000 non-null   int32
 3   family_members      5000 non-null   int64
 4   insurance_benefits  5000 non-null   int64
dtypes: int32(2), int64(3)
memory usage: 156.4 KB
In [ ]:
# looking for missing values
df.isna().sum()
Out[ ]:
gender                0
age                   0
income                0
family_members        0
insurance_benefits    0
dtype: int64
In [ ]:
# looking for duplicates
df[['age','gender', 'income', 'family_members', 'insurance_benefits']].duplicated().sum()
Out[ ]:
153
In [ ]:
# now have a look at the data's descriptive statistics. 
# Does everything look okay?
df.describe()
Out[ ]:
gender age income family_members insurance_benefits
count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000
mean 0.499000 30.952800 39916.359400 1.194200 0.148000
std 0.500049 8.440807 9900.082063 1.091387 0.463183
min 0.000000 18.000000 5300.000000 0.000000 0.000000
25% 0.000000 24.000000 33300.000000 0.000000 0.000000
50% 0.000000 30.000000 40200.000000 1.000000 0.000000
75% 1.000000 37.000000 46600.000000 2.000000 0.000000
max 1.000000 65.000000 79000.000000 6.000000 5.000000

Conclusions¶

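The data is clean, with no missing values. The 153 duplicated rows may not be true duplicates: without a unique customer identifier, two different customers can share the same gender, age, income, family size, and benefit count. We therefore treat every row as a distinct customer and keep all of them.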
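
To make the meaning of that count concrete, here is a minimal sketch on made-up rows (the values are hypothetical and not taken from the dataset). It suggests that `duplicated()` counts only the repeats after the first occurrence, while `keep=False` flags every row of a duplicated group, which is handier for inspection.

```python
import pandas as pd

# Hypothetical toy rows, not taken from insurance_us.csv
toy = pd.DataFrame({
    'gender': [0, 0, 1, 0],
    'age': [30, 30, 25, 30],
    'income': [40000, 40000, 52000, 40000],
    'family_members': [1, 1, 0, 1],
    'insurance_benefits': [0, 0, 0, 0],
})

# default keep='first': the first occurrence is not counted as a duplicate
n_repeats = int(toy.duplicated().sum())

# keep=False marks every member of a duplicated group
n_in_groups = int(toy.duplicated(keep=False).sum())

print(n_repeats, n_in_groups)
```

Rows 0, 1, and 3 are identical here, so two of them count as repeats and three rows belong to a duplicated group.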

EDA¶

In [ ]:
# correlations 
df.corr()
Out[ ]:
gender age income family_members insurance_benefits
gender 1.000000 0.002074 0.014910 -0.008991 0.010140
age 0.002074 1.000000 -0.019093 -0.006692 0.651030
income 0.014910 -0.019093 1.000000 -0.030296 -0.014963
family_members -0.008991 -0.006692 -0.030296 1.000000 -0.036290
insurance_benefits 0.010140 0.651030 -0.014963 -0.036290 1.000000
In [ ]:
# data skew
df.skew()
Out[ ]:
gender                0.004001
age                   0.515148
income               -0.036724
family_members        0.898297
insurance_benefits    3.845707
dtype: float64
In [ ]:
# correlation matrix
px.imshow(df.corr(), title='Correlation Matrix', text_auto=True, height=900, 
    template='ggplot2')

Let's quickly check whether there are certain groups of customers by looking at the pair plot.

In [ ]:
# scatter matrix
fig = px.scatter_matrix(df, height=800, title='Scatter Matrix')
fig.update_traces(showupperhalf=False, diagonal_visible=False)

It is hard to spot obvious groups (clusters) in the pair plot, because it shows only two variables at a time and cannot capture multivariate distributions. That's where linear algebra and machine learning can be quite handy. The correlation matrix shows a positive relationship between age and insurance benefits (0.65). All other pairs of variables are only weakly correlated (absolute values below 0.04).

In [ ]:
# distribution of age
px.histogram(df.age, title=' Distribution of Age', template='ggplot2', height=800, labels={'value': 'Age'})

Age has a right-skewed distribution, with the mean and median at around 30. The minimum age is 18 and the maximum is 65.

In [ ]:
# distribution of income
px.histogram(df.income, title='Distribution of Income', template='plotly_dark', height=800, labels={'value': 'Salary'})

Income is approximately normally distributed around a mean of about $40,000. It ranges from $5,300 to $79,000.

In [ ]:
# distribution of family members
px.histogram(df.family_members, title='Distribution of Family Members', template='seaborn', height=800, labels={'value': 'Number of Family Members'})

The distribution of family members is skewed to the right. The range of values is 0 to 6 with a mean of 1.19, and a median of 1.0.

In [ ]:
# distribution of gender
px.bar(df.gender.value_counts(), color_discrete_sequence=[['pink', 'blue']], labels={'index': 'Gender', 'value': 'Count'}, title='Distribution of Gender', height=800)

The two genders are almost evenly represented in the dataset (49.9% of records have gender 1). We assume 0 is female and 1 is male.

In [ ]:
# distribution of insurance benefits
px.histogram(df.insurance_benefits,  labels={'value': 'Number of Benefits'}, title='Distribution of Insurance Benefits', height=800, template='none')

The distribution of insurance benefits is heavily skewed to the right. The mean is 0.15, and the median is 0. The maximum value is 5.

Task 1. Similar Customers¶

In [ ]:
# euclidean
feature_names = ['gender', 'age', 'income', 'family_members']
nbrs = NearestNeighbors(metric='euclidean')
nbrs.fit(df[feature_names])
# query with a one-row DataFrame so the feature names match those seen during fit
nbrs_distances, nbrs_indices = nbrs.kneighbors(df.iloc[[1]][feature_names], n_neighbors=5, return_distance=True)
df_res = pd.concat([df.iloc[nbrs_indices[0]], pd.DataFrame(nbrs_distances.T, index=nbrs_indices[0], columns=['distance'])], axis=1)
In [ ]:
# euclidean
df_res.head()
Out[ ]:
gender age income family_members insurance_benefits distance
1 0 46 38000 1 1 0.000000
3920 0 40 38000 0 0 6.082763
4948 1 37 38000 1 0 9.055385
2528 1 36 38000 0 0 10.099505
3593 0 33 38000 0 0 13.038405
In [ ]:
# Manhattan 
feature_names = ['gender', 'age', 'income', 'family_members']
nbrs = NearestNeighbors(metric='manhattan')
nbrs.fit(df[feature_names])
# query with a one-row DataFrame so the feature names match those seen during fit
nbrs_distances, nbrs_indices = nbrs.kneighbors(df.iloc[[1]][feature_names], n_neighbors=5, return_distance=True)
df_res = pd.concat([df.iloc[nbrs_indices[0]], pd.DataFrame(nbrs_distances.T, index=nbrs_indices[0], columns=['distance'])], axis=1)
In [ ]:
# manhattan
df_res.head()
Out[ ]:
gender age income family_members insurance_benefits distance
1 0 46 38000 1 1 0.0
3920 0 40 38000 0 0 7.0
4948 1 37 38000 1 0 10.0
2528 1 36 38000 0 0 12.0
3593 0 33 38000 0 0 14.0

The Sure Tomorrow Insurance Company¶

Scaling the data.

In [ ]:
# scaling numerical columns
feature_names = ['gender', 'age', 'income', 'family_members']

transformer_mas = sklearn.preprocessing.MaxAbsScaler().fit(df[feature_names].to_numpy())

df_scaled = df.copy()
df_scaled.loc[:, feature_names] = transformer_mas.transform(df[feature_names].to_numpy())
In [ ]:
# look at scaled data
df_scaled.sample(5)
Out[ ]:
gender age income family_members insurance_benefits
1161 0.0 0.292308 0.494937 0.5 0
1921 1.0 0.492308 0.507595 0.0 0
2074 1.0 0.307692 0.369620 0.0 0
1181 0.0 0.692308 0.358228 0.0 1
3271 1.0 0.615385 0.736709 0.0 0

Now, let's get similar records for the same customer on the scaled data, for each distance metric.

In [ ]:
# euclidean
feature_names = ['gender', 'age', 'income', 'family_members']
nbrs2 = NearestNeighbors(metric='euclidean')
nbrs2.fit(df_scaled[feature_names])
# query with a one-row DataFrame so the feature names match those seen during fit
nbrs_distances2, nbrs_indices2 = nbrs2.kneighbors(df_scaled.iloc[[1]][feature_names], n_neighbors=5, return_distance=True)
df_res2 = pd.concat([df_scaled.iloc[nbrs_indices2[0]], pd.DataFrame(nbrs_distances2.T, index=nbrs_indices2[0], columns=['distance'])], axis=1)
In [ ]:
# euclidean
df_res2.head()
Out[ ]:
gender age income family_members insurance_benefits distance
1 0.0 0.707692 0.481013 0.166667 1 0.000000
4162 0.0 0.707692 0.477215 0.166667 1 0.003797
1863 0.0 0.707692 0.492405 0.166667 1 0.011392
4986 0.0 0.723077 0.491139 0.166667 1 0.018418
4477 0.0 0.692308 0.459494 0.166667 1 0.026453
In [ ]:
# manhattan
feature_names = ['gender', 'age', 'income', 'family_members']
nbrs2 = NearestNeighbors(metric='manhattan')
nbrs2.fit(df_scaled[feature_names])
# query with a one-row DataFrame so the feature names match those seen during fit
nbrs_distances2, nbrs_indices2 = nbrs2.kneighbors(df_scaled.iloc[[1]][feature_names], n_neighbors=5, return_distance=True)
df_res2 = pd.concat([df_scaled.iloc[nbrs_indices2[0]], pd.DataFrame(nbrs_distances2.T, index=nbrs_indices2[0], columns=['distance'])], axis=1)
In [ ]:
# manhattan
df_res2.head()
Out[ ]:
gender age income family_members insurance_benefits distance
1 0.0 0.707692 0.481013 0.166667 1 0.000000
4162 0.0 0.707692 0.477215 0.166667 1 0.003797
1863 0.0 0.707692 0.492405 0.166667 1 0.011392
4986 0.0 0.723077 0.491139 0.166667 1 0.025511
2434 0.0 0.676923 0.482278 0.166667 1 0.032035

Does the data being not scaled affect the kNN algorithm? If so, how does that appear?

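Yes. On the unscaled data, income dominates the distance because its values are orders of magnitude larger than those of the other features: all five neighbors returned for customer 1 have an income of exactly 38,000, while their ages differ from the customer's by up to 13 years and their gender and family size vary. After scaling, every feature contributes on a comparable scale, and the returned neighbors match customer 1 on gender and family size, are close in both age and income, and all have the same insurance_benefits value.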

How similar are the results using the Manhattan distance metric (regardless of the scaling)?

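The two metrics return very similar neighbors. On the unscaled data, both return the same customers in the same order; only the distance values differ, with the Manhattan distance always at least as large as the Euclidean one (for example, 7.0 versus 6.08 for customer 3920). On the scaled data, the first four neighbors are identical and only the fifth differs (customer 4477 for Euclidean, 2434 for Manhattan). When only one feature differs, as for customers 4162 and 1863, the two distances coincide.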
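
As a quick check of the distances in the tables above, the sketch below recomputes them by hand with NumPy for customer 1 against customer 3920 (unscaled) and against customer 4986 (scaled). The variable names are illustrative, and the scaled row for customer 4986 is copied from the output above, so it carries that table's six-decimal rounding.

```python
import numpy as np

# feature order: gender, age, income, family_members (values from the tables above)
customer_1 = np.array([0, 46, 38000, 1], dtype=float)
customer_3920 = np.array([0, 40, 38000, 0], dtype=float)
customer_4986_scaled = np.array([0.0, 0.723077, 0.491139, 0.166667])

diff = customer_1 - customer_3920
euclidean = np.sqrt(np.sum(diff ** 2))   # sqrt(6^2 + 1^2)
manhattan = np.sum(np.abs(diff))         # 6 + 1

# MaxAbsScaler divides each column by its maximum absolute value; the maxima
# come from df.describe(): gender 1, age 65, income 79000, family_members 6
col_max = np.array([1, 65, 79000, 6], dtype=float)
customer_1_scaled = customer_1 / col_max

diff_scaled = customer_1_scaled - customer_4986_scaled
euclidean_scaled = np.sqrt(np.sum(diff_scaled ** 2))
manhattan_scaled = np.sum(np.abs(diff_scaled))

print(round(euclidean, 6), manhattan)
print(round(euclidean_scaled, 4), round(manhattan_scaled, 4))
```

In the unscaled case the income difference is zero for this pair, which is exactly why it ranks first: any customer with a different income would be hundreds of units away regardless of age or family size.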

Task 2. Is Customer Likely to Receive Insurance Benefit?¶

Unscaled¶

In [ ]:
# look at data
df.head()
Out[ ]:
gender age income family_members insurance_benefits
0 1 41 49600 1 0
1 0 46 38000 1 1
2 0 29 21000 0 0
3 0 21 41700 2 0
4 1 28 26100 0 0
In [ ]:
# binarize target with threshold of 0.5 (the original target column is kept)
binarizer = Binarizer(threshold=0.5)
df['insurance_benefits_received'] = binarizer.fit_transform(df[['insurance_benefits']])
In [ ]:
# probability of insurance benefit received
df['insurance_benefits_received'].sum() / len(df)
Out[ ]:
0.1128
In [ ]:
# check for the class imbalance with value_counts()
df.insurance_benefits_received.value_counts()
Out[ ]:
0    4436
1     564
Name: insurance_benefits_received, dtype: int64
In [ ]:
# unscaled data
X = df[['age', 'gender', 'income', 'family_members']]
y = df['insurance_benefits_received']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=19)
In [ ]:
# unscaled knn model
neighbors = np.arange(1,11)
test_accuracies = {}

for neighbor in neighbors:
    knn = KNeighborsClassifier(n_neighbors=neighbor)
    knn.fit(X_train, y_train)
    predicted_test = knn.predict(X_test)
    test_accuracies[neighbor] = f1_score(y_test, predicted_test)
    
print(neighbors, '\n', test_accuracies)
[ 1  2  3  4  5  6  7  8  9 10] 
 {1: 0.6222222222222223, 2: 0.4184100418410042, 3: 0.43243243243243246, 4: 0.22935779816513763, 5: 0.23529411764705882, 6: 0.10256410256410257, 7: 0.10050251256281409, 8: 0.0, 9: 0.010752688172043012, 10: 0.0}
In [ ]:
# creating a dataframe of test F1 scores, transposed
accuracies = pd.DataFrame([test_accuracies.values()], columns=test_accuracies.keys()).T
In [ ]:
# graph of test F1 scores by number of neighbors
px.line(accuracies, title='KNN: Varying Number of Neighbors', labels={'index': 'Number of Neighbors', 'value': 'F1_score'})

Conclusions¶

The best result on the unscaled data was an F1 score of 0.62 with 1 nearest neighbor. As the number of neighbors grows, the test score trends down, reaching 0 at 8 neighbors.

Scaled¶

In [ ]:
# binarize target with threshold of 0.5 (the original target column is kept)
binarizer = Binarizer(threshold=0.5)
df_scaled['insurance_benefits_received'] = binarizer.fit_transform(df_scaled[['insurance_benefits']])
In [ ]:
# scaled data
X_scaled = df_scaled[['age', 'gender', 'income', 'family_members']]
y_scaled = df_scaled['insurance_benefits_received']

X_scaled_train, X_scaled_test, y_scaled_train, y_scaled_test = train_test_split(X_scaled, y_scaled, test_size=0.3, random_state=19)
In [ ]:
# scaled knn model
neighbors = np.arange(1,11)

test_accuracies = {}

for neighbor in neighbors:
    knn_scaled = KNeighborsClassifier(n_neighbors=neighbor)
    knn_scaled.fit(X_scaled_train, y_scaled_train)
    predicted_test_scaled = knn_scaled.predict(X_scaled_test)
    test_accuracies[neighbor] = f1_score(y_scaled_test, predicted_test_scaled)
    
print(neighbors, '\n', test_accuracies)
[ 1  2  3  4  5  6  7  8  9 10] 
 {1: 0.9497206703910615, 2: 0.8961424332344213, 3: 0.9178470254957507, 4: 0.880952380952381, 5: 0.9273743016759777, 6: 0.8802395209580838, 7: 0.8979591836734695, 8: 0.8734939759036144, 9: 0.9005847953216374, 10: 0.876876876876877}
In [ ]:
# creating a dataframe of test F1 scores, transposed
accuracies = pd.DataFrame([test_accuracies.values()], columns=test_accuracies.keys()).T
In [ ]:
# graph of test F1 scores by number of neighbors
px.line(accuracies, title='KNN: Varying Number of Neighbors', labels={'index': 'Number of Neighbors', 'value': 'F1 Score'})

Conclusions¶

The best result on the scaled data was an F1 score of 0.95 with 1 nearest neighbor, followed by 0.93 with 5 neighbors. For the other values, the test scores fluctuate between 0.87 and 0.92 with a slight downward trend. For every number of neighbors, these results are far better than those on the unscaled data.
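
One way to read the best setting off the two printed dictionaries is sketched below. The scores are copied from the outputs above and rounded to four decimals; the variable names are illustrative.

```python
# F1 scores copied from the printed outputs above, rounded to four decimals
f1_unscaled = {1: 0.6222, 2: 0.4184, 3: 0.4324, 4: 0.2294, 5: 0.2353,
               6: 0.1026, 7: 0.1005, 8: 0.0, 9: 0.0108, 10: 0.0}
f1_scaled = {1: 0.9497, 2: 0.8961, 3: 0.9178, 4: 0.8810, 5: 0.9274,
             6: 0.8802, 7: 0.8980, 8: 0.8735, 9: 0.9006, 10: 0.8769}

# number of neighbors with the highest F1 score in each setting
best_k_unscaled = max(f1_unscaled, key=f1_unscaled.get)
best_k_scaled = max(f1_scaled, key=f1_scaled.get)

# scaling helps at every k if each scaled score beats its unscaled counterpart
scaled_always_better = all(f1_scaled[k] > f1_unscaled[k] for k in f1_unscaled)

print(best_k_unscaled, best_k_scaled, scaled_always_better)
```

Since these scores come from a single train/test split, the ranking of nearby values such as 1 and 5 neighbors may not be stable; cross-validation would give a more reliable choice.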

Dummy Model¶

In [ ]:
# function for classifier evaluation
def eval_classifier(y_true, y_pred):
    
    # use a local name that does not shadow the imported f1_score function
    f1 = sklearn.metrics.f1_score(y_true, y_pred)
    print(f'F1: {f1:.2f}')
    
    # if you have an issue with the following line, restart the kernel and run the notebook again
    cm = sklearn.metrics.confusion_matrix(y_true, y_pred, normalize='all')
    print('Confusion Matrix')
    print(cm)
In [ ]:
# generating output of a random model

def rnd_model_predict(P, size, seed=42):

    rng = np.random.default_rng(seed=seed)
    return rng.binomial(n=1, p=P, size=size)
In [ ]:
# probabilities
for P in [0, df['insurance_benefits_received'].sum() / len(df), 0.5, 1]:

    print(f'The probability: {P:.2f}')
    y_pred_rnd = rnd_model_predict(P, 5000)
        
    eval_classifier(df['insurance_benefits_received'], y_pred_rnd)
    
    print()
The probability: 0.00
F1: 0.00
Confusion Matrix
[[0.8872 0.    ]
 [0.1128 0.    ]]

The probability: 0.11
F1: 0.12
Confusion Matrix
[[0.7914 0.0958]
 [0.0994 0.0134]]

The probability: 0.50
F1: 0.20
Confusion Matrix
[[0.456  0.4312]
 [0.053  0.0598]]

The probability: 1.00
F1: 0.20
Confusion Matrix
[[0.     0.8872]
 [0.     0.1128]]

Conclusion¶

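The dummy model predicts 1 at random with a fixed probability P, and its F1 score never exceeds 0.20, whichever P is used. The kNN classifier on scaled data scores between 0.87 and 0.95 for every number of neighbors tested, far above this baseline. On unscaled data, kNN beats the dummy model only for 1 to 5 neighbors (F1 of 0.23 to 0.62); for 6 or more neighbors its F1 falls to about 0.10 or lower, below the dummy model. A trained model can therefore classify insurance benefits received much better than a random one, provided the features are scaled.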
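
The F1 values of the dummy model can be re-derived from the normalized confusion matrices printed above, using F1 = 2·TP / (2·TP + FP + FN). The helper below is a sketch with an illustrative name; it assumes scikit-learn's layout for labels 0 and 1, with true classes in rows and predicted classes in columns.

```python
import numpy as np

def f1_from_confusion(cm):
    """F1 for the positive class from a 2x2 confusion matrix laid out as
    [[TN, FP], [FN, TP]] (rows are true labels, columns are predictions)."""
    fp, fn, tp = cm[0, 1], cm[1, 0], cm[1, 1]
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom

# normalized confusion matrices copied from the output above
cm_by_p = {
    0.00: np.array([[0.8872, 0.0], [0.1128, 0.0]]),
    0.11: np.array([[0.7914, 0.0958], [0.0994, 0.0134]]),
    0.50: np.array([[0.456, 0.4312], [0.053, 0.0598]]),
    1.00: np.array([[0.0, 0.8872], [0.0, 0.1128]]),
}

f1_by_p = {p: f1_from_confusion(cm) for p, cm in cm_by_p.items()}
for p, f1 in f1_by_p.items():
    print(f'P = {p:.2f}: F1 = {f1:.2f}')
```

Because F1 is a ratio, it does not matter that the matrices are normalized to fractions rather than counts. The case P = 1 shows the ceiling for a constant predictor on this data: recall is 1 but precision equals the positive rate of 0.1128.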

Task 3. Regression (with Linear Regression)¶

In [ ]:
# creating linear regression algorithm 
class MyLinearRegression:
    
    def __init__(self):
        
        self.weights = None
    
    def fit(self, X, y):
        
        # prepend a column of ones so the first weight acts as the intercept
        X2 = np.append(np.ones([len(X), 1]), X, axis=1)
        self.weights = np.linalg.inv(X2.T @ X2) @ X2.T @ y

    def predict(self, X):
        
        # prepend a column of ones so the first weight acts as the intercept
        X2 = np.append(np.ones([len(X), 1]), X, axis=1)
        y_pred = X2 @ self.weights
        
        return y_pred
In [ ]:
# evaluation for regressor algorithm
def eval_regressor(y_true, y_pred):
    
    rmse = math.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
    print(f'RMSE: {rmse:.2f}')
    
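    # NOTE: this takes the square root of r2_score, so the value printed as "R2"
    # is sqrt(R2); square it to get the coefficient of determination itself
    r2_score = math.sqrt(sklearn.metrics.r2_score(y_true, y_pred))
    print(f'R2: {r2_score:.2f}')    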

Original data¶

In [ ]:
# Running regression model on original data
X = df[['age', 'gender', 'income', 'family_members']].to_numpy()
y = df['insurance_benefits'].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12345)

lr = MyLinearRegression()

lr.fit(X_train, y_train)
print(lr.weights)

y_test_pred = lr.predict(X_test)
eval_regressor(y_test, y_test_pred)
[-9.43538930e-01  3.57495491e-02  1.64272730e-02 -2.60745684e-07
 -1.16902138e-02]
RMSE: 0.34
R2: 0.66

Scaled Data¶

In [ ]:
# Running regression model on scaled data
X_scaled = df_scaled[['age', 'gender', 'income', 'family_members']].to_numpy()
y_scaled = df_scaled['insurance_benefits'].to_numpy()

X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled = train_test_split(X_scaled, y_scaled, test_size=0.3, random_state=12345)

lr_scaled = MyLinearRegression()

lr_scaled.fit(X_train_scaled, y_train_scaled)
print(lr_scaled.weights)

y_test_pred_scaled = lr_scaled.predict(X_test_scaled)
eval_regressor(y_test_scaled, y_test_pred_scaled)
[-0.94353893  2.32372069  0.01642727 -0.02059891 -0.07014128]
RMSE: 0.34
R2: 0.66

Conclusions¶

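Running a linear regression on both the original and the scaled data gives identical evaluation metrics: the RMSE is 0.34 and the reported R2 is 0.66 in both cases. Scaling only rescales the weights (for example, the age weight 0.0357 becomes 0.0357 × 65 ≈ 2.32, where 65 is the maximum age), so the predictions and the quality of the model do not change. Note that eval_regressor prints the square root of the R2 score, so the coefficient of determination itself is about 0.66² ≈ 0.44.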
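
The invariance to scaling can be illustrated on synthetic data. The sketch below is not the notebook's model: it uses hypothetical stand-in features, helper names of our own (`fit_ols`, `predict_ols`), and `np.linalg.lstsq` in place of the explicit inverse. It checks that dividing each column by its maximum leaves the predictions unchanged and multiplies each feature weight by that maximum.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares weights with an intercept column (lstsq instead of inv)."""
    X1 = np.hstack([np.ones((len(X), 1)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

def predict_ols(X, w):
    return np.hstack([np.ones((len(X), 1)), X]) @ w

# synthetic stand-in data (hypothetical, loosely shaped like age and family size)
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(18, 66, 200), rng.integers(0, 7, 200)]).astype(float)
y = 0.03 * X[:, 0] - 0.01 * X[:, 1] + rng.normal(0, 0.3, 200)

# MaxAbs-style scaling: divide every column by its largest absolute value
col_max = np.abs(X).max(axis=0)
X_scaled = X / col_max

w = fit_ols(X, y)
w_scaled = fit_ols(X_scaled, y)

same_predictions = np.allclose(predict_ols(X, w), predict_ols(X_scaled, w_scaled))
# each feature weight is multiplied by its column's scale factor; intercept unchanged
weights_rescaled = np.allclose(w_scaled[1:], w[1:] * col_max)
print(same_predictions, weights_rescaled)
```

This is the same pattern as in the printed weights above, where the intercept is identical and each feature weight differs by a constant factor.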

Task 4. Obfuscating Data¶

In [ ]:
# data to be obfuscated
personal_info_column_list = ['gender', 'age', 'income', 'family_members']
df_pn = df[personal_info_column_list]
In [ ]:
# convert data to numpy array
X = df_pn.to_numpy()
In [ ]:
# look at the array
X
Out[ ]:
array([[    1,    41, 49600,     1],
       [    0,    46, 38000,     1],
       [    0,    29, 21000,     0],
       ...,
       [    0,    20, 33900,     2],
       [    1,    22, 32700,     3],
       [    1,    28, 40600,     1]], dtype=int64)

Generating a random matrix $P$.

In [ ]:
# random factor P
rng = np.random.default_rng(seed=42)
P = rng.random(size=(X.shape[1], X.shape[1]))

Checking that the matrix $P$ is invertible.

In [ ]:
# visual of P
P
Out[ ]:
array([[0.77395605, 0.43887844, 0.85859792, 0.69736803],
       [0.09417735, 0.97562235, 0.7611397 , 0.78606431],
       [0.12811363, 0.45038594, 0.37079802, 0.92676499],
       [0.64386512, 0.82276161, 0.4434142 , 0.22723872]])
In [ ]:
# inverse of P
P_inv = np.linalg.inv(P)
P_inv
Out[ ]:
array([[ 0.41467992, -1.43783972,  0.62798546,  1.14001268],
       [-1.06101789,  0.44219337,  0.1329549 ,  1.18425933],
       [ 1.42362442,  1.60461607, -2.0553823 , -1.53699695],
       [-0.11128575, -0.65813802,  1.74995517, -0.11816316]])
In [ ]:
# new obfuscated dataset
X_new = X @ P
X_new
Out[ ]:
array([[ 6359.71527314, 22380.40467609, 18424.09074184, 46000.69669016],
       [ 4873.29406479, 17160.36702982, 14125.78076133, 35253.45577301],
       [ 2693.11742928,  9486.397744  ,  7808.83156024, 19484.86063067],
       ...,
       [ 4346.2234249 , 15289.24126492, 12586.16264392, 31433.50888552],
       [ 4194.09324155, 14751.9910242 , 12144.02930637, 30323.88763426],
       [ 5205.46827354, 18314.24814446, 15077.01370762, 37649.59295455]])

Can you guess the customers' ages or income after the transformation?

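No. Each column of $X'$ is a weighted sum of all four original features, so no single column corresponds to age or income, and the original values cannot be read off without knowing $P$.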

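Can you recover the original data from $X'$ if you know $P$?

Yes. Since $P$ is invertible, $X' P^{-1} = X P P^{-1} = X$, so multiplying $X'$ by $P^{-1}$ on the right recovers the original data up to rounding error.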

In [ ]:
# product of P and the transpose of X (this does not recover the data)
P @ X.T
Out[ ]:
array([[42605.92216771, 32647.60673289, 18043.28379289, ...,
        29116.64178985, 28088.67336691, 34872.83546879],
       [37793.40997679, 28968.97336811, 16012.22678999, ...,
        25823.72047312, 24913.18431708, 30930.46956831],
       [18411.10270401, 14111.96943897,  7799.81970108, ...,
        12580.91427022, 12137.91229164, 15068.06546873],
       [22027.94859182, 16887.81382837,  9335.55826216, ...,
        15048.65104996, 14519.07063843, 18026.5249014 ]])
In [ ]:
# product of inverse of P and X_new
X_new @ P_inv
Out[ ]:
array([[ 1.00000000e+00,  4.10000000e+01,  4.96000000e+04,
         1.00000000e+00],
       [-3.63797881e-12,  4.60000000e+01,  3.80000000e+04,
         1.00000000e+00],
       [ 1.81898940e-12,  2.90000000e+01,  2.10000000e+04,
         0.00000000e+00],
       ...,
       [ 0.00000000e+00,  2.00000000e+01,  3.39000000e+04,
         2.00000000e+00],
       [ 1.00000000e+00,  2.20000000e+01,  3.27000000e+04,
         3.00000000e+00],
       [ 1.00000000e+00,  2.80000000e+01,  4.06000000e+04,
         1.00000000e+00]])

Print all three cases for a few customers

  • The original data
  • The transformed one
  • The reversed (recovered) one
In [ ]:
# original 
X
Out[ ]:
array([[    1,    41, 49600,     1],
       [    0,    46, 38000,     1],
       [    0,    29, 21000,     0],
       ...,
       [    0,    20, 33900,     2],
       [    1,    22, 32700,     3],
       [    1,    28, 40600,     1]], dtype=int64)
In [ ]:
# transformed
X @ P
Out[ ]:
array([[ 6359.71527314, 22380.40467609, 18424.09074184, 46000.69669016],
       [ 4873.29406479, 17160.36702982, 14125.78076133, 35253.45577301],
       [ 2693.11742928,  9486.397744  ,  7808.83156024, 19484.86063067],
       ...,
       [ 4346.2234249 , 15289.24126492, 12586.16264392, 31433.50888552],
       [ 4194.09324155, 14751.9910242 , 12144.02930637, 30323.88763426],
       [ 5205.46827354, 18314.24814446, 15077.01370762, 37649.59295455]])
In [ ]:
# recovered: multiply the obfuscated data by the inverse of P on the right
X_recovered = X_new @ P_inv
X_recovered
Out[ ]:
array([[ 1.00000000e+00,  4.10000000e+01,  4.96000000e+04,
         1.00000000e+00],
       [-3.63797881e-12,  4.60000000e+01,  3.80000000e+04,
         1.00000000e+00],
       [ 1.81898940e-12,  2.90000000e+01,  2.10000000e+04,
         0.00000000e+00],
       ...,
       [ 0.00000000e+00,  2.00000000e+01,  3.39000000e+04,
         2.00000000e+00],
       [ 1.00000000e+00,  2.20000000e+01,  3.27000000e+04,
         3.00000000e+00],
       [ 1.00000000e+00,  2.80000000e+01,  4.06000000e+04,
         1.00000000e+00]])

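You can probably see that some values are not exactly the same as they are in the original data. What might be the reason for that?

The reason is floating-point rounding. $P$, $P^{-1}$, and the matrix products are all computed with finite precision, so $X P P^{-1}$ differs from $X$ by tiny errors; for example, -3.63797881e-12 appears where the original value is 0. Rounding the recovered values to the nearest integer restores the original data.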
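
A minimal sketch of the round trip on the first three customers shown above. It rebuilds $P$ with the same seeded generator call as the notebook; the other variable names are illustrative.

```python
import numpy as np

# first three customers from the array above: gender, age, income, family_members
X = np.array([[1, 41, 49600, 1],
              [0, 46, 38000, 1],
              [0, 29, 21000, 0]])

# same construction as above: a random 4x4 matrix from a seeded generator
rng = np.random.default_rng(seed=42)
P = rng.random(size=(X.shape[1], X.shape[1]))

X_obfuscated = X @ P
X_recovered = X_obfuscated @ np.linalg.inv(P)

# the recovered values match the originals only up to floating-point error
max_abs_error = np.abs(X_recovered - X).max()
close_enough = np.allclose(X_recovered, X)

# rounding to the nearest integer snaps the values back to the originals
X_rounded = np.rint(X_recovered).astype(np.int64)

print(max_abs_error, close_enough)
print(X_rounded)
```

Comparing with `np.allclose` rather than `==` is the usual way to test equality after floating-point arithmetic.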

Test Linear Regression With Data Obfuscation¶

Original¶

In [ ]:
# original data
X = df[['age', 'gender', 'income', 'family_members']].to_numpy()
y = df['insurance_benefits'].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=19)

linreg = MyLinearRegression()

linreg.fit(X_train, y_train)
print(linreg.weights)

y_test_pred = linreg.predict(X_test)
eval_regressor(y_test, y_test_pred)
[-9.21298248e-01  3.48958441e-02 -9.02018203e-03  1.92322314e-07
 -1.30438947e-02]
RMSE: 0.37
R2: 0.66

Obfuscated¶

In [ ]:
# Obfuscated data
# X_new
# y = df['insurance_benefits'].to_numpy()

X_train2, X_test2, y_train2, y_test2 = train_test_split(X_new, y, test_size=0.3, random_state=19)

linreg2 = MyLinearRegression()

linreg2.fit(X_train2, y_train2)
print(linreg2.weights)

y_test_pred2 = linreg2.predict(X_test2)
eval_regressor(y_test2, y_test_pred2)
[-0.92129821 -0.0687852   0.00955393  0.06320114 -0.02042082]
RMSE: 0.37
R2: 0.66

Conclusion¶

Running a linear regression on both the original and the obfuscated data gives identical evaluation metrics: the RMSE is 0.37 and the reported R2 is 0.66 in both cases. Only the feature weights change, while the intercept (-0.9213) stays the same. Obfuscating the data therefore did not break the model: its predictions and quality are unchanged.
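
Why this works: if $w$ are the feature weights fitted on $X$, then fitting on $X' = XP$ gives $w_P = P^{-1} w$, so the predictions $X' w_P = X P P^{-1} w = X w$ are identical. The sketch below checks this numerically on synthetic data. The helper names are our own, the data is hypothetical, and, unlike the notebook, $4I$ is added to the random matrix so that it is strictly diagonally dominant and therefore guaranteed to be invertible.

```python
import numpy as np

def fit_ols(X, y):
    """Normal-equation weights with an intercept column, as in MyLinearRegression."""
    X1 = np.hstack([np.ones((len(X), 1)), X])
    return np.linalg.solve(X1.T @ X1, X1.T @ y)

def predict_ols(X, w):
    return np.hstack([np.ones((len(X), 1)), X]) @ w

# synthetic stand-in features and target (hypothetical values)
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X @ np.array([0.5, -0.2, 0.0, 0.1]) + rng.normal(0, 0.1, 300)

# obfuscation matrix; adding 4*I is an assumption made here to guarantee invertibility
P = rng.random(size=(4, 4)) + 4 * np.eye(4)
X_obf = X @ P

w = fit_ols(X, y)
w_obf = fit_ols(X_obf, y)

same_intercept = np.isclose(w[0], w_obf[0])
weights_related = np.allclose(w_obf[1:], np.linalg.inv(P) @ w[1:])
same_predictions = np.allclose(predict_ols(X, w), predict_ols(X_obf, w_obf))
print(same_intercept, weights_related, same_predictions)
```

The intercept is unaffected because the column of ones is added after the obfuscation, so $P$ mixes only the four features.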

Final Conclusions¶

We built a nearest-neighbors search that returns the customers most similar to a given one, and ran it on unscaled and scaled data with both Euclidean and Manhattan distances. Scaling matters: without it, income dominates the distance. For classification, a kNN model on scaled data reached F1 scores of 0.87 to 0.95, far above the best dummy-model score of 0.20; on unscaled data, kNN beat the dummy model only for small numbers of neighbors. We then built a linear regression model with matrix operations and measured its RMSE and R2 on the original, scaled, and obfuscated data. Neither scaling nor obfuscation changed the metrics, so obfuscation protects the personal data without degrading the model. Overall, the results suggest that whether a customer will receive an insurance benefit can be predicted very accurately, and more accurately than the number of insurance benefits a customer will receive. Consequently, we suggest that Sure Tomorrow use the classification model rather than the regression model.