Table of contents

  • OilyGiant Model
    • Purpose
    • Read Data
    • Prepare Data for Models
    • Model Training
      • First Region
      • Second Region
      • Third Region
      • Comparing Regions
        • Region 0
        • Region 1
        • Region 2
      • Break Even Point
      • Profit of Selected Region
      • Profit per Top 200 Wells per Region based on Budget
    • Risk and Profit for each Region
      • Confidence Intervals
      • Profit Risk
        • Region 0
        • Region 1
        • Region 2
      • Results
    • Conclusion

OilyGiant Model¶

Purpose¶

The purpose of this project is to find the best location for our client, OilyGiant, to drill a new oil well. We are given oil well parameters for three distinct regions, which we will use to train a linear regression model. The model will predict the volume of reserves in new wells, and the region with the highest total profit will be chosen for the new well.

Read Data¶

In [ ]:
# Import libraries
import pandas as pd
import numpy as np
from scipy import stats as st
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error as mse, r2_score
from sklearn.linear_model import LinearRegression
In [ ]:
# Read datasets 
df0 = pd.read_csv('datasets/geo_data_0.csv')
df1 = pd.read_csv('datasets/geo_data_1.csv')
df2 = pd.read_csv('datasets/geo_data_2.csv')
In [ ]:
# Shape of datasets 
print(df0.shape)
print(df1.shape)
print(df2.shape)
(100000, 5)
(100000, 5)
(100000, 5)
In [ ]:
# Check for missing values
print(df0.isnull().sum())
print()
print(df1.isnull().sum())
print()
print(df2.isnull().sum())
id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

id         0
f0         0
f1         0
f2         0
product    0
dtype: int64

id         0
f0         0
f1         0
f2         0
product    0
dtype: int64
In [ ]:
# Check for duplicates
print(df0.duplicated().sum())
print()
print(df1.duplicated().sum())
print()
print(df2.duplicated().sum())
0

0

0
In [ ]:
# Correlation between features and target for each dataset
print('First dataset:')
print(df0.corr()['product'])
print()

print('Second dataset:')
print(df1.corr()['product'])
print()

print('Third dataset:')
print(df2.corr()['product'])
First dataset:
f0         0.143536
f1        -0.192356
f2         0.483663
product    1.000000
Name: product, dtype: float64

Second dataset:
f0        -0.030491
f1        -0.010155
f2         0.999397
product    1.000000
Name: product, dtype: float64

Third dataset:
f0        -0.001987
f1        -0.001012
f2         0.445871
product    1.000000
Name: product, dtype: float64

The data did not need to be cleaned, as there were neither missing values nor duplicate rows. This matters because our models cannot be trained on data with missing values, and duplicates would bias them. Note also that in the second dataset, f2 correlates almost perfectly with product (0.999), which foreshadows how well a linear model will fit Region 1.
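
Had either issue turned up, a minimal cleanup along the following lines would have been applied before modeling (a hypothetical sketch, since these datasets are already clean):

In [ ]:
# Hypothetical cleanup step, only needed if the checks above had found problems
for df in (df0, df1, df2):
    df.dropna(inplace=True)            # drop rows with missing values
    df.drop_duplicates(inplace=True)   # drop exact duplicate rows
    df.reset_index(drop=True, inplace=True)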

Prepare Data for Models¶

In [ ]:
print(df0.id.nunique())
print(df1.id.nunique())
print(df2.id.nunique())
99990
99996
99996

Most IDs are unique; only a handful repeat (10 in the first dataset, 4 in each of the others), so we leave them in place.
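
For completeness, the few repeated ids can be inspected directly (a quick sketch that just counts the affected rows):

In [ ]:
# Count rows whose id appears more than once in each dataset
for name, df in [('df0', df0), ('df1', df1), ('df2', df2)]:
    repeated = df[df['id'].duplicated(keep=False)]
    print(name, 'rows sharing an id:', len(repeated))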

In [ ]:
# First dataset
target_0 = df0['product']
features_0 = df0.drop(['product', 'id'], axis=1)

# Second dataset
target_1 = df1['product']
features_1 = df1.drop(['product', 'id'], axis=1)

# Third dataset
target_2 = df2['product']
features_2 = df2.drop(['product', 'id'], axis=1)

We dropped the categorical id column, since it carries no predictive information and linear regression requires numeric features.

In [ ]:
# Random state variable
random_state = 19
In [ ]:
# Splitting data into train and valid sets, 75/25 

# First dataset
features_train_0, features_valid_0, target_train_0, target_valid_0 = train_test_split(features_0, target_0, test_size=0.25, random_state=random_state)

# Second dataset
features_train_1, features_valid_1, target_train_1, target_valid_1 = train_test_split(features_1, target_1, test_size=0.25, random_state=random_state)

# Third dataset
features_train_2, features_valid_2, target_train_2, target_valid_2 = train_test_split(features_2, target_2, test_size=0.25, random_state=random_state)
In [ ]:
# Shape of results
print(features_train_0.shape)
print(target_train_0.shape)
print(features_valid_0.shape)
print(target_valid_0.shape)
(75000, 3)
(75000,)
(25000, 3)
(25000,)
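
The same three splits could also be produced in a loop, which keeps the split parameters in one place (an equivalent sketch, not run here):

In [ ]:
# Loop-based equivalent of the three split cells above
splits = []
for features, target in [(features_0, target_0), (features_1, target_1), (features_2, target_2)]:
    splits.append(train_test_split(features, target, test_size=0.25, random_state=random_state))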

Model Training¶

In [ ]:
# Model function
def linear_model(ft, tt, fv, tv, n):
    """Train a linear regression on (ft, tt); report validation RMSE and R2 for dataset n."""
    model = LinearRegression()
    model.fit(ft, tt)  # train model on training set
    predictions_valid = model.predict(fv)  # get model predictions on validation set
    result = mse(tv, predictions_valid) ** 0.5  # calculate RMSE on validation set

    print('RMSE of the linear regression model on the validation set:', result)
    print('Dataset number :', n)
    print('R2 score: ', r2_score(tv, predictions_valid))
In [ ]:
# Dataset 0
linear_model(features_train_0, target_train_0, features_valid_0, target_valid_0, 0)
RMSE of the linear regression model on the validation set: 37.489176404931385
Dataset number : 0
R2 score:  0.2800427586322962
In [ ]:
# Dataset 1
linear_model(features_train_1, target_train_1, features_valid_1, target_valid_1, 1)
RMSE of the linear regression model on the validation set: 0.8949747975866225
Dataset number : 1
R2 score:  0.9996221715647515
In [ ]:
# Dataset 2
linear_model(features_train_2, target_train_2, features_valid_2, target_valid_2, 2)
RMSE of the linear regression model on the validation set: 40.03911711877307
Dataset number : 2
R2 score:  0.19763128804301888

The linear model with the highest quality was trained on dataset 1: its coefficient of determination, R², is 0.9996, meaning the model predicts the targets almost perfectly (consistent with the near-perfect f2-product correlation seen earlier). It also has by far the lowest RMSE. The models for datasets 0 and 2 performed much worse, with R² below 0.3.
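
As a sanity check that Region 1's near-perfect score is not an artifact of this particular split, the cross_val_score utility imported earlier could be applied to the full feature sets (a sketch, not run here):

In [ ]:
# 5-fold cross-validated R2 for each dataset
for i, (features, target) in enumerate([(features_0, target_0), (features_1, target_1), (features_2, target_2)]):
    scores = cross_val_score(LinearRegression(), features, target, cv=5, scoring='r2')
    print('Dataset', i, 'mean cross-validated R2:', scores.mean())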

First Region¶

In [ ]:
# Model 0
model_0 = LinearRegression()
model_0.fit(features_train_0, target_train_0) # train model on training set

# Predictions for model 0
predictions_0 = model_0.predict(features_valid_0)
predictions_0 = pd.Series(predictions_0)

# Merged dataframe of target and predictions 
merged_0 = pd.concat([target_valid_0.reset_index(drop=True), predictions_0], axis=1)
merged_0.columns = ['target', 'predictions']

Second Region¶

In [ ]:
# Model 1
model_1 = LinearRegression()
model_1.fit(features_train_1, target_train_1) # train model on training set

# Predictions for model 1
predictions_1 = model_1.predict(features_valid_1)
predictions_1 = pd.Series(predictions_1)

# Merged dataframe of target and predictions 
merged_1 = pd.concat([target_valid_1.reset_index(drop=True), predictions_1], axis=1)
merged_1.columns = ['target', 'predictions']

Third Region¶

In [ ]:
# Model 2
model_2 = LinearRegression()
model_2.fit(features_train_2, target_train_2) # train model on training set

predictions_2 = model_2.predict(features_valid_2)
predictions_2 = pd.Series(predictions_2)

merged_2 = pd.concat([target_valid_2.reset_index(drop=True), predictions_2], axis=1)
merged_2.columns = ['target', 'predictions']

Here, we trained a model on each region's dataset to generate predictions of well volume, then merged each set of predictions with its corresponding validation targets into a single dataframe.
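
The three region cells above follow an identical pattern, so they could be condensed into one helper (a sketch that reproduces the same steps; e.g. model_0, merged_0 = train_and_merge(features_train_0, target_train_0, features_valid_0, target_valid_0)):

In [ ]:
# Reusable train/predict/merge helper equivalent to the three cells above
def train_and_merge(ft, tt, fv, tv):
    model = LinearRegression()
    model.fit(ft, tt)
    predictions = pd.Series(model.predict(fv))
    merged = pd.concat([tv.reset_index(drop=True), predictions], axis=1)
    merged.columns = ['target', 'predictions']
    return model, merged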

Comparing Regions¶

In [ ]:
print('Mean Predictions 0:', predictions_0.mean(), '        ', 'Stdev Predictions 0:', predictions_0.std())
print('Mean Predictions 1:', predictions_1.mean(),'        ', 'Stdev Predictions 1:', predictions_1.std())
print('Mean Predictions 2:', predictions_2.mean(), '        ', 'Stdev Predictions 2:', predictions_2.std())
Mean Predictions 0: 92.42951694121892          Stdev Predictions 0: 23.25461037847274
Mean Predictions 1: 68.68102185200009          Stdev Predictions 1: 46.0429779067258
Mean Predictions 2: 94.91135196705469          Stdev Predictions 2: 19.79760332480081

The budget for developing 200 oil wells is 100 million USD. One barrel of raw materials brings in 4.5 USD of revenue, so one unit of product yields 4,500 USD (volume of reserves is in thousands of barrels). Region 2 has the highest average predicted well reserve volume.
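
As a quick back-of-the-envelope check: the budget per well is 100,000,000 / 200 = 500,000 USD, so a well must hold 500,000 / 4,500 ≈ 111.1 units (thousand barrels) on average to break even. This threshold is computed formally in the Break Even Point section below.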

Region 0¶

In [ ]:
# Comparing target and predictions of dataset 0 
merged_0.sum()
Out[ ]:
target         2.311242e+06
predictions    2.310738e+06
dtype: float64
In [ ]:
# Comparing target and predictions of dataset 0 
merged_0.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
target 25000.0 92.449669 44.183614 0.000000 56.610227 91.61242 128.522832 185.355615
predictions 25000.0 92.429517 23.254610 -9.764724 76.659319 92.45290 108.449070 180.227277

Region 1¶

In [ ]:
# Comparing target and predictions of dataset 1
merged_1.sum()
Out[ ]:
target         1.716792e+06
predictions    1.717026e+06
dtype: float64
In [ ]:
# Comparing target and predictions of dataset 1
merged_1.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
target 25000.0 68.671662 46.043907 0.000000 26.953261 57.085625 107.813044 137.945408
predictions 25000.0 68.681022 46.042978 -2.061922 28.391403 58.319042 109.280975 139.984148

Region 2¶

In [ ]:
# Comparing target and predictions of dataset 2
merged_2.sum()
Out[ ]:
target         2.376934e+06
predictions    2.372784e+06
dtype: float64
In [ ]:
# Comparing target and predictions of dataset 2
merged_2.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
target 25000.0 95.077342 44.699862 0.009204 59.634772 94.882039 130.685889 190.029838
predictions 25000.0 94.911352 19.797603 16.341091 81.549013 94.925208 108.410820 173.959528

Comparing targets and predictions across all datasets, the aggregate statistics suggest the predictions are unbiased: the mean and sum of the target and predictions columns are nearly identical in every region. The standard deviation, min, and max do differ, however, because the Region 0 and Region 2 models compress the true spread of volumes. Again, Region 2 has the highest mean and total reserve volume.
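
A compact way to summarize the three tables above is the per-region gap between mean prediction and mean target (a small sketch over the merged frames, not run here):

In [ ]:
# Difference between mean prediction and mean target for each region
for i, merged in enumerate([merged_0, merged_1, merged_2]):
    print('Region', i, 'mean gap:', merged.predictions.mean() - merged.target.mean())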

Break Even Point¶

In [ ]:
# Revenue per barrel, USD
cpb = 4.5

# Budget for 200 wells, USD
budget = 100000000

# Number of wells
wells = 200

# Barrels per unit of product
factor = 1000

# Revenue per unit of product (1,000 barrels), USD
unit_cost = 4500

# Budget per well, USD
bpw = budget / wells

# Volume (in units) a single well must produce to cover its share of the budget
bpb = bpw / unit_cost

# Total volume (in units) needed to break even on the full budget
break_even = budget / (cpb * factor)

# Minimum mean volume per well to break even
min_product = break_even / wells
In [ ]:
# Comparing mean volume with break-even point
print('Volume in region 0 after break even: ', f'{merged_0.target.mean() - min_product:,}', 'thousand barrels')
print('Volume in region 1 after break even: ', f'{merged_1.target.mean() - min_product:,}', 'thousand barrels')
print('Volume in region 2 after break even: ', f'{merged_2.target.mean() - min_product:,}', 'thousand barrels')
Volume in region 0 after break even:  -18.66144208004802 thousand barrels
Volume in region 1 after break even:  -42.43944890421423 thousand barrels
Volume in region 2 after break even:  -16.033768694089275 thousand barrels

The break-even metric is compared with the mean target volume of each region's validation set. All three means fall below the ~111.1-unit minimum volume needed to break even. This comparison is misleading, however, because some wells are far more profitable than others, and it is in the company's interest to pick only the wells with the highest expected output and avoid those unlikely to break even.
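
To see how much the regional mean understates the best wells, the 200 largest wells in each validation set can be compared against the break-even threshold (a sketch selecting by actual target rather than by prediction, not run here):

In [ ]:
# Mean volume of the 200 largest wells per region versus the break-even threshold
for i, merged in enumerate([merged_0, merged_1, merged_2]):
    print('Region', i, 'top-200 mean volume:', merged.target.nlargest(200).mean(), '| break even:', min_product)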

Profit of Selected Region¶

In [ ]:
# Function to estimate profit from the wells with the largest predicted volumes
def profit(target, predictions, count):
    """Pick the `count` wells with the largest predictions and return the revenue
    from their actual targets minus the budget. `target` and `predictions` must
    share a common index (e.g. the columns of one of the merged_* frames)."""
    predictions_sorted = predictions.sort_values(ascending=False)
    selected = target[predictions_sorted.index][:count]
    return (unit_cost * selected.sum()) - budget

Profit takes the 100 million USD budget for 200 wells into account: revenue of 4,500 USD per unit of selected reserves, minus the budget.
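
For example, if the 200 selected wells averaged 120 units each, the profit would be 4,500 × 200 × 120 - 100,000,000 = 8,000,000 USD.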

Profit per Top 200 Wells per Region based on Budget¶

In [ ]:
print('Total profit from Region 0 :', '$', f'{profit(target_0, predictions_0, 200 ):,}')
print('Total profit from Region 1 :', '$', f'{profit(target_1, predictions_1, 200 ):,}')
print('Total profit from Region 2 :', '$', f'{profit(target_2, predictions_2, 200 ):,}')
Total profit from Region 0 : $ -16,617,291.92250064
Total profit from Region 1 : $ -37,881,648.63171914
Total profit from Region 2 : $ -16,543,855.132537901

Based on the data for the top 200 wells of each region, Region 2 shows the best (least negative) result here. These figures are based on the target values of the wells with the 200 largest model predictions.

Risk and Profit for each Region¶

Confidence Intervals¶

In [ ]:
# Function for confidence interval
def confidence(sample): 
    print('Mean:', sample.mean())
    confidence_interval = st.t.interval(0.95, len(sample)-1, sample.mean(), sample.sem())
    print('95% confidence interval:', confidence_interval)
In [ ]:
confidence(predictions_0)
Mean: 92.42951694121892
95% confidence interval: (92.141241144103, 92.71779273833484)
In [ ]:
confidence(predictions_1)
Mean: 68.68102185200009
95% confidence interval: (68.11025003765316, 69.25179366634701)
In [ ]:
confidence(predictions_2)
Mean: 94.91135196705469
95% confidence interval: (94.66593096078726, 95.1567729733221)

This confirms the earlier observation about the distribution of well volumes among the regions: Region 2's mean predicted volume is slightly higher than Region 0's, while Region 1's predictions are distributed around much smaller reserve volumes.

Profit Risk¶

In [ ]:
# Function for estimating profit distribution and risk of loss via bootstrapping
def profit_risk(target, predictions):
    state = np.random.RandomState(19)
    values = []
    for _ in range(1000):
        # Draw 500 wells with replacement; profit() then picks the top 200 by prediction
        target_subsample = target.sample(n=500, replace=True, random_state=state)
        pred_subsample = predictions[target_subsample.index]
        values.append(profit(target_subsample, pred_subsample, 200))

    values = pd.Series(values)
    losses = (values < 0).mean() * 100  # share of bootstrap samples with negative profit, %

    print('95% confidence interval:', (f'{values.quantile(0.025):,}', f'{values.quantile(0.975):,}'))
    print('Mean Profit: ', f'{values.mean():,}')
    print('Losses : ', round(losses, 1))
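
This is a standard bootstrapping procedure: 1,000 times, 500 wells are drawn with replacement from the validation set, the 200 with the highest predicted volumes are selected, and the resulting profit is recorded. The 2.5% and 97.5% quantiles of those 1,000 profits form a 95% confidence interval, and the share of negative outcomes estimates the risk of loss.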

Region 0¶

In [ ]:
# Region 0
profit_risk(merged_0.target, merged_0.predictions)
95% confidence interval: ('-682,314.259189359', '9,774,510.185401097')
Mean Profit:  4,533,870.619212874
Losses :  4.8

Region 1¶

In [ ]:
# Region 1
profit_risk(merged_1.target, merged_1.predictions)
95% confidence interval: ('681,283.8236853313', '8,630,951.198799688')
Mean Profit:  4,945,452.4751050165
Losses :  1.1

Region 2¶

In [ ]:
# Region 2
profit_risk(merged_2.target, merged_2.predictions)
95% confidence interval: ('-1,538,054.1441032141', '9,219,736.453033209')
Mean Profit:  3,756,892.0224074163
Losses :  8.3

Results¶

Based on mean profit, Region 1 has the highest profit potential. Region 1 also has the lowest chance of loss, at about 1%, and its 95% confidence interval is the only one with a positive lower bound. Regions 0 and 2 carry 4.8% and 8.3% chances of loss, respectively, as reflected in the negative lower bounds of their 95% confidence intervals.

Conclusion¶

Considering the three datasets, we suggest that OilyGiant develop its new site in Region 1. Region 1 has the greatest profit potential: its range of expected profit is better than those of the other regions because of its higher lower bound, and its chance of loss is only about 1%, which is extremely low. The chance of picking 200 highly profitable sites is therefore greatest in Region 1. Because we select the 200 most promising wells out of each 500-well sample rather than picking wells at random, we further limit the chance of loss. Overall, our model should perform well when predicting new sites and identifying the most profitable wells, given the same features used here.