The purpose of this project is to prepare a prototype machine learning model for Zyfra, a company that develops efficiency solutions for heavy industry. The model aims to predict the amount of gold recovered from gold ore, using data on the gold extraction and purification process as features. The goal is for the model to help optimize production and eliminate unprofitable parameters.
# !pip install plotly_express
# import useful libraries
import pandas as pd
import numpy as np
from scipy import stats as st
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error as mse, r2_score, mean_absolute_percentage_error as mape, mean_absolute_error as mae , make_scorer
import plotly_express as px
import plotly.graph_objects as go
# read the dataframes
full = pd.read_csv('datasets/gold_recovery_full.csv')
test = pd.read_csv('datasets/gold_recovery_test.csv')
train = pd.read_csv('datasets/gold_recovery_train.csv')
# shape of the data
print(full.shape)
print(train.shape)
print(test.shape)
(22716, 87)
(16860, 87)
(5856, 53)
The test data has fewer columns than the other datasets.
# looking for missing values
print(full.isna().sum())
print()
print(train.isna().sum())
print()
print(test.isna().sum())
date                                            0
final.output.concentrate_ag                    89
final.output.concentrate_pb                    87
final.output.concentrate_sol                  385
final.output.concentrate_au                    86
                                             ...
secondary_cleaner.state.floatbank5_a_level    101
secondary_cleaner.state.floatbank5_b_air      101
secondary_cleaner.state.floatbank5_b_level    100
secondary_cleaner.state.floatbank6_a_air      119
secondary_cleaner.state.floatbank6_a_level    101
Length: 87, dtype: int64

date                                            0
final.output.concentrate_ag                    72
final.output.concentrate_pb                    72
final.output.concentrate_sol                  370
final.output.concentrate_au                    71
                                             ...
secondary_cleaner.state.floatbank5_a_level     85
secondary_cleaner.state.floatbank5_b_air       85
secondary_cleaner.state.floatbank5_b_level     84
secondary_cleaner.state.floatbank6_a_air      103
secondary_cleaner.state.floatbank6_a_level     85
Length: 87, dtype: int64

date                                            0
primary_cleaner.input.sulfate                 302
primary_cleaner.input.depressant              284
primary_cleaner.input.feed_size                 0
primary_cleaner.input.xanthate                166
primary_cleaner.state.floatbank8_a_air         16
primary_cleaner.state.floatbank8_a_level       16
primary_cleaner.state.floatbank8_b_air         16
primary_cleaner.state.floatbank8_b_level       16
primary_cleaner.state.floatbank8_c_air         16
primary_cleaner.state.floatbank8_c_level       16
primary_cleaner.state.floatbank8_d_air         16
primary_cleaner.state.floatbank8_d_level       16
rougher.input.feed_ag                          16
rougher.input.feed_pb                          16
rougher.input.feed_rate                        40
rougher.input.feed_size                        22
rougher.input.feed_sol                         67
rougher.input.feed_au                          16
rougher.input.floatbank10_sulfate             257
rougher.input.floatbank10_xanthate            123
rougher.input.floatbank11_sulfate              55
rougher.input.floatbank11_xanthate            353
rougher.state.floatbank10_a_air                17
rougher.state.floatbank10_a_level              16
rougher.state.floatbank10_b_air                17
rougher.state.floatbank10_b_level              16
rougher.state.floatbank10_c_air                17
rougher.state.floatbank10_c_level              16
rougher.state.floatbank10_d_air                17
rougher.state.floatbank10_d_level              16
rougher.state.floatbank10_e_air                17
rougher.state.floatbank10_e_level              16
rougher.state.floatbank10_f_air                17
rougher.state.floatbank10_f_level              16
secondary_cleaner.state.floatbank2_a_air       20
secondary_cleaner.state.floatbank2_a_level     16
secondary_cleaner.state.floatbank2_b_air       23
secondary_cleaner.state.floatbank2_b_level     16
secondary_cleaner.state.floatbank3_a_air       34
secondary_cleaner.state.floatbank3_a_level     16
secondary_cleaner.state.floatbank3_b_air       16
secondary_cleaner.state.floatbank3_b_level     16
secondary_cleaner.state.floatbank4_a_air       16
secondary_cleaner.state.floatbank4_a_level     16
secondary_cleaner.state.floatbank4_b_air       16
secondary_cleaner.state.floatbank4_b_level     16
secondary_cleaner.state.floatbank5_a_air       16
secondary_cleaner.state.floatbank5_a_level     16
secondary_cleaner.state.floatbank5_b_air       16
secondary_cleaner.state.floatbank5_b_level     16
secondary_cleaner.state.floatbank6_a_air       16
secondary_cleaner.state.floatbank6_a_level     16
dtype: int64
All three datasets contain missing values.
# creating train recovery series
train = train.dropna()
train_recovery = train['rougher.output.recovery']
# missing values in train recovery
train_recovery.isna().sum()
0
# shape of dataframe
train.shape
(11017, 87)
# recovery calculation
data = train.dropna()
c = data['rougher.output.concentrate_au'] # share of gold in concentrate after flotation
f = data['rougher.input.feed_au'] # share of gold in feed before flotation
t = data['rougher.output.tail_au'] # share of gold in the rougher tails after flotation
calc_recovery = (c * (f-t)) / (f * (c-t)) * 100
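For reference, the recovery formula implemented above, with C the share of gold in the concentrate after flotation, F the share in the feed, and T the share in the rougher tails:

$$Recovery = \frac{C \times (F - T)}{F \times (C - T)} \times 100\%$$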
# missing values in calculated recovery
calc_recovery.isna().sum()
0
# shape of dataframe to see if it matches
calc_recovery.shape
(11017,)
# calculating recovery difference between data and calculation
recovery_difference = train_recovery - calc_recovery
# recovery difference metrics
recovery_difference.describe()
count    1.101700e+04
mean     2.444365e-16
std      1.420577e-14
min     -7.105427e-14
25%     -1.421085e-14
50%      0.000000e+00
75%      1.421085e-14
max      7.105427e-14
dtype: float64
# Creating recovery merged dataframe
recovery_merged = pd.concat([train_recovery.reset_index(drop=True), calc_recovery.reset_index(drop=True)], axis=1)
recovery_merged.columns = ['train', 'calc']
# Checking for missing values
recovery_merged[recovery_merged.isna().any(axis=1)]
| train | calc |
|---|---|
# total of missing values
recovery_merged.isna().sum()
train    0
calc     0
dtype: int64
# drop missing values
recovery_merged.dropna(how='any', inplace=True)
# MAE score between train and calculated
print(mae(recovery_merged.train, recovery_merged.calc))
9.460144184559453e-15
The mean absolute error between the recovery value in the dataset and the calculated recovery is 9.46 × 10⁻¹⁵. On average, the difference between the two values is negligible: no more than floating-point rounding error.
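As a quick sanity check, the two series can also be compared elementwise with numpy; a minimal sketch using the np import above:

# sketch: confirm the stored and calculated recoveries agree to floating-point precision
print(np.allclose(recovery_merged['train'], recovery_merged['calc']))  # expect True given the ~1e-14 max difference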
# looking at column names among the different datasets
test_cols = test.columns
full_cols = full.columns
train_cols = train.columns
full_not_test = full_cols.difference(test_cols)
train_not_test = train_cols.difference(test_cols)
test_not_train = test_cols.difference(train_cols)
train_and_test = train_cols.intersection(test_cols)
# dropping missing values in the datasets
full.dropna(how='any', inplace=True, axis=0)
test.dropna(how='any', inplace=True, axis=0)
train.dropna(how='any', inplace=True, axis=0)
# converting date to datetime
full.date = pd.to_datetime(full.date)
train.date = pd.to_datetime(train.date)
test.date = pd.to_datetime(test.date)
We read the data and inspected it. We addressed the missing values. Since recovery is an important target, we checked whether it was calculated correctly by comparing our calculated recovery with the recovery value in the data. The difference between the two values was no more than a rounding error, with an MAE of 9.46 × 10⁻¹⁵.
# making a filter for desired columns
au_cols = ['rougher.input.feed_au', 'rougher.output.tail_au', 'rougher.output.concentrate_au', 'primary_cleaner.output.tail_au',
'primary_cleaner.output.concentrate_au', 'secondary_cleaner.output.tail_au', 'final.output.concentrate_au', 'final.output.tail_au']
ag_cols = ['rougher.input.feed_ag', 'rougher.output.tail_ag', 'rougher.output.concentrate_ag', 'primary_cleaner.output.tail_ag',
'primary_cleaner.output.concentrate_ag', 'secondary_cleaner.output.tail_ag','final.output.concentrate_ag', 'final.output.tail_ag']
pb_cols = ['rougher.input.feed_pb', 'rougher.output.tail_pb', 'rougher.output.concentrate_pb', 'primary_cleaner.output.tail_pb',
'primary_cleaner.output.concentrate_pb', 'secondary_cleaner.output.tail_pb','final.output.concentrate_pb', 'final.output.tail_pb']
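Since these three lists differ only in the metal suffix, a small helper could generate them; a minimal sketch (the metal_cols name is ours):

# sketch: build a per-metal column list from the shared stage prefixes
def metal_cols(metal):
    stages = ['rougher.input.feed', 'rougher.output.tail', 'rougher.output.concentrate',
              'primary_cleaner.output.tail', 'primary_cleaner.output.concentrate',
              'secondary_cleaner.output.tail', 'final.output.concentrate', 'final.output.tail']
    return [f'{stage}_{metal}' for stage in stages]
# e.g. au_cols = metal_cols('au'), ag_cols = metal_cols('ag'), pb_cols = metal_cols('pb')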
#filtering for different metals
au = full[au_cols]
ag = full[ag_cols]
pb = full[pb_cols]
# sum of gold columns
au.sum()
rougher.input.feed_au                    137072.441231
rougher.output.tail_au                    30228.615372
rougher.output.concentrate_au            322719.333964
primary_cleaner.output.tail_au            63578.983163
primary_cleaner.output.concentrate_au    516063.108326
secondary_cleaner.output.tail_au          70161.881890
final.output.concentrate_au              713083.897490
final.output.tail_au                      50331.499493
dtype: float64
# function for rougher-stage gold recovery
def au_recovery(data):
    c = data['rougher.output.concentrate_au']  # share of gold in concentrate after flotation
    f = data['rougher.input.feed_au']          # share of gold in feed before flotation
    t = data['rougher.output.tail_au']         # share of gold in rougher tails after flotation
    flot_recovery = (c * (f - t)) / (f * (c - t)) * 100
    return flot_recovery.sum()
# gold recovery
print(f'{au_recovery(au):,}')
1,344,794.762862905
# creating au recovery variable
au_recovery_val = au_recovery(au)
# bar plot of gold concentration
px.bar(au.sum(), title='Concentration of Gold',
       color=['feed', 'tail', 'concentrate', 'tail', 'concentrate', 'tail', 'concentrate', 'tail'],
       log_y=True, height=900)
The concentration of gold increases throughout the purification process, from the rougher input feed to the final output concentrate. Among the tails, the concentration is highest after the primary and secondary cleaning phases, and a small amount of gold remains in the final output tail. This is intuitive: the company's goal is to extract and purify gold from ore, so the gold concentration should be highest at the final output after multiple rounds of purification, and most of the losses to the tails should occur during cleaning.
# sum of silver columns
ag.sum()
rougher.input.feed_ag                    144609.108070
rougher.output.tail_ag                    91828.041196
rougher.output.concentrate_ag            194495.102551
primary_cleaner.output.tail_ag           254725.639489
primary_cleaner.output.concentrate_ag    139613.083984
secondary_cleaner.output.tail_ag         235207.400091
final.output.concentrate_ag               83543.562649
final.output.tail_ag                     157588.966652
dtype: float64
# function for rougher-stage silver recovery
def ag_recovery(data):
    c = data['rougher.output.concentrate_ag']
    f = data['rougher.input.feed_ag']
    t = data['rougher.output.tail_ag']
    flot_recovery = (c * (f - t)) / (f * (c - t)) * 100
    return flot_recovery.sum()
# silver recovery
print(f'{ag_recovery(ag):,}')
1,008,724.5136078214
# creating silver recovery variable
ag_recovery_val = ag_recovery(ag)
# bar plot of silver concentration
px.bar(ag.sum(), title='Concentration of Silver',
       color=['feed', 'tail', 'concentrate', 'tail', 'concentrate', 'tail', 'concentrate', 'tail'],
       log_y=True, height=900)
Silver is heavily extracted in the rougher output, and its amount decreases throughout the process; the final output holds less silver than the rougher input. This is because most of the silver is lost to the tails: the primary and secondary cleaning steps remove the largest amounts, and roughly half of the silver from those steps remains at the final tail output. This fits the logic of the process: silver is a byproduct, and since our target is gold, it makes sense that most of the silver ends up in the tails. Also note the scale difference in concentration between gold and silver.
# sum of lead columns
pb.sum()
rougher.input.feed_pb                     58527.783485
rougher.output.tail_pb                    10488.824907
rougher.output.concentrate_pb            121559.514010
primary_cleaner.output.tail_pb            53522.527963
primary_cleaner.output.concentrate_pb    159073.388967
secondary_cleaner.output.tail_pb          88816.101157
final.output.concentrate_pb              160901.269276
final.output.tail_pb                      44600.346604
dtype: float64
# function for rougher-stage lead recovery
def pb_recovery(data):
    c = data['rougher.output.concentrate_pb']
    f = data['rougher.input.feed_pb']
    t = data['rougher.output.tail_pb']
    flot_recovery = (c * (f - t)) / (f * (c - t)) * 100
    return flot_recovery.sum()
# lead recovery
print(f'{pb_recovery(pb):,}')
1,389,343.5783014493
# lead recovery variable
pb_recovery_val = pb_recovery(pb)
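The three recovery functions above differ only in the column suffix, so a single parameterized version would remove the duplication. A minimal sketch (the flotation_recovery name is ours):

# sketch: one recovery function parameterized by the metal suffix
def flotation_recovery(data, metal):
    c = data[f'rougher.output.concentrate_{metal}']
    f = data[f'rougher.input.feed_{metal}']
    t = data[f'rougher.output.tail_{metal}']
    return ((c * (f - t)) / (f * (c - t)) * 100).sum()
# e.g. pb_recovery_val = flotation_recovery(pb, 'pb')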
# bar plot of lead concentration
px.bar(pb.sum(), title='Concentration of Lead',
       color=['feed', 'tail', 'concentrate', 'tail', 'concentrate', 'tail', 'concentrate', 'tail'],
       log_y=True, height=900)
Lead enters and moves through the process at a much lower concentration than the target gold. While gold and silver are precious metals with similar chemical properties, lead is markedly different. Relative to the amount in the rougher input, roughly triple the initial amount appears in the final concentrate. The cleaning processes remove the most lead, leaving a concentration in the final tail output similar to the rougher input feed.
# figure comparing metal concentrations
fig = go.Figure()
fig.add_trace(go.Bar(x=au.columns, y=au.sum(), name='gold', marker_color = 'gold'))
fig.add_trace(go.Bar(x=ag.columns, y=ag.sum(), name='silver', marker_color='silver'))
fig.add_trace(go.Bar(x=pb.columns, y=pb.sum(), name='lead', marker_color='black'))
fig.update_layout(barmode='group', height=900, title='Change in Metal Concentration')
fig.show()
Since gold is the desired product, it is intuitive that the most gold appears in the final output concentrate and very little in the final output tail. More gold is present in the concentrate than either silver or lead.
# creating a recovery dataframe with metal recovery values
recovery_df = pd.DataFrame({'metal': ['au', 'ag', 'pb'], 'values': [au_recovery_val, ag_recovery_val, pb_recovery_val]})
# bar plot of recovery dataframe
px.bar(recovery_df, y='values', x='metal', color='metal', title='Metal Recovery')
Here we compare the total recovery of each metal. Since more gold ends up in the final concentrate than in the tail, its recovery is high; the same applies to lead. Conversely, more silver is found in the tail than in the concentrate, resulting in lower recovery. This is desirable: we would not want too much silver in the concentrate, because its chemical properties are similar to gold's. One key property is the melting point, which would make separating the two elements more difficult; lead's melting point is far from gold's, which makes separation easier.
# creating train and test sets
feed_train = train[['primary_cleaner.input.feed_size', 'rougher.input.feed_size']]
feed_test = test[['primary_cleaner.input.feed_size', 'rougher.input.feed_size']]
# comparing train and test set average particle size
fig = go.Figure()
fig.add_trace(go.Bar(x=feed_train.columns, y=feed_train.mean(), name='train', marker_color = 'black'))
fig.add_trace(go.Bar(x=feed_test.columns, y=feed_test.mean(), name='test', marker_color='blue'))
fig.update_layout(barmode='group', height=900, title='Feed Particle Size')
fig.show()
# distribution of feed train
px.histogram(feed_train, title='Distribution of Feed Train')
# distribution of feed test
px.histogram(feed_test, title='Distribution of Feed Test')
# distributions of feed particle sizes
fig = go.Figure()
fig.add_trace(go.Histogram(x=feed_train['primary_cleaner.input.feed_size'], name='primary train', marker_color = 'black'))
fig.add_trace(go.Histogram(x=feed_train['rougher.input.feed_size'], name='rougher train', marker_color = 'blue'))
fig.add_trace(go.Histogram(x=feed_test['primary_cleaner.input.feed_size'], name='primary test', marker_color='green'))
fig.add_trace(go.Histogram(x=feed_test['rougher.input.feed_size'], name='rougher test', marker_color='yellow'))
fig.update_layout(height=900, title='Feed Particle Size')
fig.update_traces(opacity=0.75)
fig.show()
This graph illustrates the particle size of the feed decreasing through the process. This is crucial, as particle size strongly influences the recovery of gold from ore: gold dissolution increases as particle size decreases. Consequently, the distribution of particle sizes in the training and test sets needs to be similar for the model to evaluate fairly, and we see that the train and test samples do have similar distributions.
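To go beyond eyeballing the histograms, the similarity of the two distributions could be quantified with a two-sample Kolmogorov-Smirnov test; a minimal sketch using the scipy.stats import above:

# sketch: KS test comparing train and test rougher feed particle sizes
ks = st.ks_2samp(feed_train['rougher.input.feed_size'], feed_test['rougher.input.feed_size'])
print(ks.statistic, ks.pvalue)  # a small statistic suggests similar distributions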
# making filters for concentration
au_conc = ['rougher.output.concentrate_au', 'primary_cleaner.output.concentrate_au', 'final.output.concentrate_au']
ag_conc = ['rougher.output.concentrate_ag', 'primary_cleaner.output.concentrate_ag', 'final.output.concentrate_ag']
pb_conc = ['rougher.output.concentrate_pb', 'primary_cleaner.output.concentrate_pb', 'final.output.concentrate_pb']
# filtering full data set for gold concentrations
full[au_conc].value_counts(ascending=True)
rougher.output.concentrate_au  primary_cleaner.output.concentrate_au  final.output.concentrate_au
0.000000                       0.000000                               0.000000                       1
21.350223                      38.096258                              47.853374                      1
21.350486                      30.132889                              46.256522                      1
21.350534                      27.592474                              44.099880                      1
21.351287                      33.250093                              42.516409                      1
                                                                                                    ..
28.824507                      43.427388                              50.897966                      1
20.386141                      32.168625                              44.427296                      2
17.099216                      27.765463                              42.527939                      4
0.010000                       36.306431                              45.270618                      4
18.652372                      30.350809                              46.105030                      6
Length: 16082, dtype: int64
# values of rougher output concentrations for gold
full['rougher.output.concentrate_au'].value_counts()
0.000000     301
18.652372      6
0.010000       5
17.099216      4
20.341888      2
            ...
23.054626      1
23.225268      1
23.470466      1
23.439251      1
17.804134      1
Name: rougher.output.concentrate_au, Length: 15780, dtype: int64
# summary stats on gold rougher output concentrate
full['rougher.output.concentrate_au'].describe()
count    16094.000000
mean        20.052152
std          3.620905
min          0.000000
25%         19.142941
50%         20.507430
75%         21.916971
max         28.824507
Name: rougher.output.concentrate_au, dtype: float64
# distribution of gold concentrate
px.histogram(full['rougher.output.concentrate_au'], title='Gold Concentrate')
# gold concentration
px.bar(full[au_conc].sum(), color=['Flotation', 'Primary Cleaner', 'Secondary Cleaner'], title='Concentration of Gold', log_y=True, height=900)
The rougher output of gold appears normally distributed around its mean of about 20. We see outliers in the data: 301 samples have an output concentrate of 0. Overall, the trend is an increase in gold concentration across the successive processing stages.
# values of silver concentrate
full['rougher.output.concentrate_ag'].value_counts()
0.000000     301
9.252737       6
0.010000       5
12.098115      4
9.975555       2
            ...
9.657590       1
9.956605       1
9.934467       1
9.949574       1
11.959486      1
Name: rougher.output.concentrate_ag, Length: 15780, dtype: int64
# summary stats on silver concentrate
full['rougher.output.concentrate_ag'].describe()
count    16094.000000
mean        12.084945
std          2.697948
min          0.000000
25%         10.664288
50%         12.232367
75%         13.835104
max         21.725695
Name: rougher.output.concentrate_ag, dtype: float64
# distribution of silver concentrate
px.histogram(full['rougher.output.concentrate_ag'], title='Silver Concentrate')
# concentration of silver
px.bar(full[ag_conc].sum(), color=['Flotation', 'Primary Cleaner', 'Secondary Cleaner'], title='Concentration of Silver', log_y=True, height=900)
The silver concentration appears normally distributed around its mean of about 12.1, again with 301 rougher values of 0. The concentration of silver decreases throughout the process.
# values of lead concentration
full['rougher.output.concentrate_pb'].value_counts()
0.000000     301
8.452148       7
7.944824       6
9.764648       6
8.577148       5
            ...
8.258346       1
7.492099       1
7.468866       1
7.760724       1
10.702148      1
Name: rougher.output.concentrate_pb, Length: 15734, dtype: int64
# summary stats on lead concentration
full['rougher.output.concentrate_pb'].describe()
count    16094.000000
mean         7.553095
std          1.688073
min          0.000000
25%          6.696978
50%          7.698308
75%          8.510786
max         12.702148
Name: rougher.output.concentrate_pb, dtype: float64
# distribution of lead concentration
px.histogram(full['rougher.output.concentrate_pb'], title='Lead Concentrate')
# lead concentration
px.bar(full[pb_conc].sum(), color=['Flotation', 'Primary Cleaner', 'Secondary Cleaner'], title='Concentration of Lead', log_y=True, height=900)
The concentration of lead appears roughly normally distributed around its mean of about 7.6. There are again 301 rows with rougher output concentrations of 0. The concentration of lead increases throughout the process.
# sum of metals at different stages
rougher_input = full[['rougher.input.feed_au','rougher.input.feed_ag', 'rougher.input.feed_pb']].sum(axis=1)
rougher_output = full[['rougher.output.tail_au','rougher.output.tail_ag', 'rougher.output.tail_pb']].sum(axis=1)
rougher_concentrate = full[['rougher.output.concentrate_au','rougher.output.concentrate_ag',
'rougher.output.concentrate_pb']].sum(axis=1)
cleaner_output = full[['primary_cleaner.output.tail_au', 'primary_cleaner.output.tail_ag',
'primary_cleaner.output.tail_pb' ]].sum(axis=1)
cleaner_concentrate = full[['primary_cleaner.output.concentrate_au', 'primary_cleaner.output.concentrate_ag',
'primary_cleaner.output.concentrate_pb']].sum(axis=1)
secondary_output = full[['secondary_cleaner.output.tail_au', 'secondary_cleaner.output.tail_ag',
'secondary_cleaner.output.tail_pb' ]].sum(axis=1)
final_tail = full[['final.output.tail_au', 'final.output.tail_ag', 'final.output.tail_pb']].sum(axis=1)
final_concentrate = full[['final.output.concentrate_au', 'final.output.concentrate_ag',
'final.output.concentrate_pb']].sum(axis=1)
# distribution of metals at various stages
fig = go.Figure()
fig.add_trace(go.Histogram(x=rougher_input, name='Rougher Input Feed', marker_color = 'black'))
fig.add_trace(go.Histogram(x=rougher_output, name='Rougher Output Tail', marker_color = 'blue'))
fig.add_trace(go.Histogram(x=rougher_concentrate, name='Rougher Output Concentrate', marker_color='green'))
fig.add_trace(go.Histogram(x=cleaner_output, name='Primary Cleaner Output Tail', marker_color='pink'))
fig.add_trace(go.Histogram(x=cleaner_concentrate, name='Primary Cleaner Output Concentrate', marker_color = 'red'))
fig.add_trace(go.Histogram(x=secondary_output, name='Secondary Cleaner Output Tail', marker_color='orange'))
fig.add_trace(go.Histogram(x=final_tail, name='Final Output Tail', marker_color='purple'))
fig.add_trace(go.Histogram(x=final_concentrate, name='Final Output Concentrate', marker_color='yellow'))
fig.update_layout(height=900, title='Purification Stages')
fig.update_traces(opacity=0.75)
fig.show()
The data will need to be cleaned of the zero and near-zero concentration sums visible at the lower left of the distribution. Input feeds of near zero carry no information, and the other values in that region should be removed as well. These values are anomalies that defy the law of conservation of mass: if the input of the process is not zero, then the outputs should not all be zero, since whatever is not in the concentrate should appear in the tail and vice versa. The zero values represent ore that has disappeared from the system. The data also shows gold concentration increasing throughout the extraction and purification processes. Lead also increases in concentration, but at a much smaller scale than gold. Silver concentration decreases throughout the process, as most of the silver is removed to the tails.
# looking at the row count
full.shape
(16094, 87)
# removing zero and near-zero concentration values (threshold 0.25) from the datasets
full = full[full['rougher.output.concentrate_ag'] > 0.25]
train = train[train['rougher.output.concentrate_ag'] > 0.25]
full = full[full['rougher.output.concentrate_au'] > 0.25]
train = train[train['rougher.output.concentrate_au'] > 0.25]
full = full[full['rougher.output.concentrate_pb'] > 0.25]
train = train[train['rougher.output.concentrate_pb'] > 0.25]
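The six filter lines above could equivalently be written as a loop; a minimal sketch:

# sketch: apply the same near-zero threshold for each metal in one loop
for metal in ['ag', 'au', 'pb']:
    col = f'rougher.output.concentrate_{metal}'
    full = full[full[col] > 0.25]
    train = train[train[col] > 0.25]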
# ensuring the number of rows has changed to account for the removal of zero concentration values
full.shape
(15787, 87)
# looking at the shape of train dataset
train.shape
(10806, 87)
# looking at the shape of test dataset
test.shape
(5383, 53)
# Take dates from data
date = full['date']
# Check number of rows matches full shape
date.shape
(15787,)
# combining date with the columns absent from test; date will serve as the merge key
full_date = pd.concat([date, full[full_not_test]], axis=1)
# shape of the dataset
full_date.shape
(15787, 35)
# merging test with the full-only columns, to incorporate the missing features
full_test = test.merge(full_date, on='date')
# check columns
full_test.columns
Index(['date', 'primary_cleaner.input.sulfate', 'primary_cleaner.input.depressant', 'primary_cleaner.input.feed_size', 'primary_cleaner.input.xanthate', 'primary_cleaner.state.floatbank8_a_air', 'primary_cleaner.state.floatbank8_a_level', 'primary_cleaner.state.floatbank8_b_air', 'primary_cleaner.state.floatbank8_b_level', 'primary_cleaner.state.floatbank8_c_air', 'primary_cleaner.state.floatbank8_c_level', 'primary_cleaner.state.floatbank8_d_air', 'primary_cleaner.state.floatbank8_d_level', 'rougher.input.feed_ag', 'rougher.input.feed_pb', 'rougher.input.feed_rate', 'rougher.input.feed_size', 'rougher.input.feed_sol', 'rougher.input.feed_au', 'rougher.input.floatbank10_sulfate', 'rougher.input.floatbank10_xanthate', 'rougher.input.floatbank11_sulfate', 'rougher.input.floatbank11_xanthate', 'rougher.state.floatbank10_a_air', 'rougher.state.floatbank10_a_level', 'rougher.state.floatbank10_b_air', 'rougher.state.floatbank10_b_level', 'rougher.state.floatbank10_c_air', 'rougher.state.floatbank10_c_level', 'rougher.state.floatbank10_d_air', 'rougher.state.floatbank10_d_level', 'rougher.state.floatbank10_e_air', 'rougher.state.floatbank10_e_level', 'rougher.state.floatbank10_f_air', 'rougher.state.floatbank10_f_level', 'secondary_cleaner.state.floatbank2_a_air', 'secondary_cleaner.state.floatbank2_a_level', 'secondary_cleaner.state.floatbank2_b_air', 'secondary_cleaner.state.floatbank2_b_level', 'secondary_cleaner.state.floatbank3_a_air', 'secondary_cleaner.state.floatbank3_a_level', 'secondary_cleaner.state.floatbank3_b_air', 'secondary_cleaner.state.floatbank3_b_level', 'secondary_cleaner.state.floatbank4_a_air', 'secondary_cleaner.state.floatbank4_a_level', 'secondary_cleaner.state.floatbank4_b_air', 'secondary_cleaner.state.floatbank4_b_level', 'secondary_cleaner.state.floatbank5_a_air', 'secondary_cleaner.state.floatbank5_a_level', 'secondary_cleaner.state.floatbank5_b_air', 'secondary_cleaner.state.floatbank5_b_level', 'secondary_cleaner.state.floatbank6_a_air', 'secondary_cleaner.state.floatbank6_a_level', 'final.output.concentrate_ag', 'final.output.concentrate_au', 'final.output.concentrate_pb', 'final.output.concentrate_sol', 'final.output.recovery', 'final.output.tail_ag', 'final.output.tail_au', 'final.output.tail_pb', 'final.output.tail_sol', 'primary_cleaner.output.concentrate_ag', 'primary_cleaner.output.concentrate_au', 'primary_cleaner.output.concentrate_pb', 'primary_cleaner.output.concentrate_sol', 'primary_cleaner.output.tail_ag', 'primary_cleaner.output.tail_au', 'primary_cleaner.output.tail_pb', 'primary_cleaner.output.tail_sol', 'rougher.calculation.au_pb_ratio', 'rougher.calculation.floatbank10_sulfate_to_au_feed', 'rougher.calculation.floatbank11_sulfate_to_au_feed', 'rougher.calculation.sulfate_to_au_concentrate', 'rougher.output.concentrate_ag', 'rougher.output.concentrate_au', 'rougher.output.concentrate_pb', 'rougher.output.concentrate_sol', 'rougher.output.recovery', 'rougher.output.tail_ag', 'rougher.output.tail_au', 'rougher.output.tail_pb', 'rougher.output.tail_sol', 'secondary_cleaner.output.tail_ag', 'secondary_cleaner.output.tail_au', 'secondary_cleaner.output.tail_pb', 'secondary_cleaner.output.tail_sol'], dtype='object')
# ensuring we have a total of 87 columns
full_test.shape
(4981, 87)
# looking at the differences and similarities in the columns of the datasets
full_not_test = full_cols.difference(test_cols)
train_not_test = train_cols.difference(test_cols)
test_not_train = test_cols.difference(train_cols)
train_and_test = train_cols.intersection(test_cols)
# these columns are missing from the test dataset, but are in the training data
train_not_test
Index(['final.output.concentrate_ag', 'final.output.concentrate_au', 'final.output.concentrate_pb', 'final.output.concentrate_sol', 'final.output.recovery', 'final.output.tail_ag', 'final.output.tail_au', 'final.output.tail_pb', 'final.output.tail_sol', 'primary_cleaner.output.concentrate_ag', 'primary_cleaner.output.concentrate_au', 'primary_cleaner.output.concentrate_pb', 'primary_cleaner.output.concentrate_sol', 'primary_cleaner.output.tail_ag', 'primary_cleaner.output.tail_au', 'primary_cleaner.output.tail_pb', 'primary_cleaner.output.tail_sol', 'rougher.calculation.au_pb_ratio', 'rougher.calculation.floatbank10_sulfate_to_au_feed', 'rougher.calculation.floatbank11_sulfate_to_au_feed', 'rougher.calculation.sulfate_to_au_concentrate', 'rougher.output.concentrate_ag', 'rougher.output.concentrate_au', 'rougher.output.concentrate_pb', 'rougher.output.concentrate_sol', 'rougher.output.recovery', 'rougher.output.tail_ag', 'rougher.output.tail_au', 'rougher.output.tail_pb', 'rougher.output.tail_sol', 'secondary_cleaner.output.tail_ag', 'secondary_cleaner.output.tail_au', 'secondary_cleaner.output.tail_pb', 'secondary_cleaner.output.tail_sol'], dtype='object')
When making features, we have to limit the training set to the features it shares with the test dataset.
# making the features and training samples from the datasets
features_train = train[train_and_test].drop(['date'], axis=1)
target_train = train[['final.output.recovery' , 'rougher.output.recovery']]
features_test = full_test[train_and_test].drop(['date'], axis=1)
target_test = full_test[['final.output.recovery', 'rougher.output.recovery']]
#Create function to calculate sMAPE.
def smape(y_true, y_pred):
smape = 100/len(y_true) * np.sum(2 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred)))
return smape
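The function above implements the symmetric mean absolute percentage error:

$$sMAPE = \frac{100\%}{N} \sum_{i=1}^{N} \frac{|y_i - \hat{y}_i|}{(|y_i| + |\hat{y}_i|)\,/\,2}$$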
#Create function to calculate final sMAPE.
def f_smape(y_true, y_pred):
predicted_rough, predicted_final = y_pred[:, 1], y_pred[:, 0]
true_rough, true_final = y_true.iloc[:, 1], y_true.iloc[:, 0]
f_smape = (.25 * (smape(true_rough, predicted_rough))) + (.75 * (smape(true_final, predicted_final)))
return f_smape
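The weights mirror the project's evaluation metric, which emphasizes the final stage:

$$final\ sMAPE = 25\% \times sMAPE(rougher) + 75\% \times sMAPE(final)$$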
#Create function to calculate the negated final sMAPE, for use as a maximized scorer.
def f_smape2(y_true, y_pred):
    predicted_rough, predicted_final = y_pred[:, 1], y_pred[:, 0]
    true_rough, true_final = y_true.iloc[:, 1], y_true.iloc[:, 0]
    # negate the entire weighted score so that maximizing it minimizes the sMAPE
    return -1 * ((.25 * smape(true_rough, predicted_rough)) + (.75 * smape(true_final, predicted_final)))
# turning our smape function into a scorer for cross validation
f_smape_score = make_scorer(f_smape2, greater_is_better=True)
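An equivalent setup lets scikit-learn handle the sign flip itself: passing the un-negated f_smape with greater_is_better=False makes make_scorer negate the scores automatically. A minimal sketch:

# sketch: have make_scorer negate the loss instead of negating it by hand
f_smape_score_alt = make_scorer(f_smape, greater_is_better=False)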
# Decision Tree
model1 = DecisionTreeRegressor(random_state=19)
model1.fit(features_train, target_train) # train model on training set
DecisionTreeRegressor(random_state=19)
# Cross validation using final smape as scoring
scores1 = cross_val_score(model1, features_train, target_train, scoring=f_smape_score, cv=5)
final_score1 = sum(scores1) / len(scores1)
print('Average model evaluation score:', final_score1)
Average model evaluation score: 12.40650357907823
# Random forest
model2 = RandomForestRegressor(random_state=19)
model2.fit(features_train, target_train) # train model on training set
RandomForestRegressor(random_state=19)
# Cross validation using final smape as scoring
scores2 = cross_val_score(model2, features_train, target_train, scoring=f_smape_score, cv=5)
final_score2 = sum(scores2) / len(scores2)
print('Average model evaluation score:', final_score2)
Average model evaluation score: 6.615428939418484
# Linear regression
model3 = LinearRegression() # initialize model constructor
model3.fit(features_train, target_train) # train model on training set
LinearRegression()
# Cross validation using final smape as scoring
scores3 = cross_val_score(model3, features_train, target_train, scoring=f_smape_score, cv=5)
final_score3 = sum(scores3) / len(scores3)
print('Average model evaluation score:', final_score3)
Average model evaluation score: 5.781117854051062
# final model
final_model = RandomForestRegressor(random_state=19)
final_model.fit(features_train, target_train)
final_predictions = final_model.predict(features_test)
result = f_smape(target_test, final_predictions)
print('Final sMAPE score of test data: ', result)
Final sMAPE score of test data: 8.16844495035761
# creating a dummy regressor to mimic a constant model that always predicts mean of the train set targets
dummy_regr = DummyRegressor(strategy='mean')
dummy_regr.fit(features_train, target_train)
dummy_predictions = dummy_regr.predict(features_test)
f_smape(target_test, dummy_predictions)
7.819848574876006
Because our scoring function negates the final sMAPE before maximizing it, the model with the highest cross-validation score is the one with the lowest error; by that criterion the decision tree regressor scored best. For the final model we trained a random forest and obtained a sMAPE of 8.17 on the test data. A dummy model that always predicts the mean of the training-set targets scores 7.82, so the final model offers little lift over a constant baseline.
Overall, we were able to work with the data we received to complete the project. We verified that the recovery column was calculated correctly by comparing it with our own calculations. We examined the distributions of concentrations for the various metals and removed the anomalies we found. The data illustrated the increase in gold concentration in the final product and the small amount of gold lost to the tails. We also compared the recovery of gold with that of the other metals. Finally, we trained models to predict gold recovery, with the decision tree performing best in cross-validation. Zyfra can use this model to help optimize its gold ore refining process.