Table of contents

  • Zyfra Machine Learning Model
    • Purpose
    • Prepare the Data
    • Data Analysis
      • Gold
      • Silver
      • Lead
      • Comparing Metal Concentrations
      • Metal Recovery
      • Comparing Feed Particle Size
      • Concentration values
        • Gold
        • Silver
        • Lead
        • Stage Distributions
        • Results
    • Model Preparation
    • Models
      • Decision Tree
      • Random Forest
      • Linear Regression
      • Final Model
    • Conclusion

Zyfra Machine Learning Model¶

Purpose¶

The purpose of this project is to prepare a prototype of a machine learning model for Zyfra, a company that develops efficiency solutions for the heavy industry. The model will aim to predict the amount of gold recovered from gold ore. The features we will use will be data on gold extraction and purification. The goal is to have the model optimize the production and eliminate unprofitable parameters.


Prepare the Data¶

In [ ]:
# !pip install plotly_express
In [ ]:
# import useful libraries
import pandas as pd 
import numpy as np 
from scipy import stats as st
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error as mse, r2_score, mean_absolute_percentage_error as mape, mean_absolute_error as mae , make_scorer
import plotly_express as px
import plotly.graph_objects as go 
In [ ]:
# read the dataframes
full = pd.read_csv('datasets/gold_recovery_full.csv')
test = pd.read_csv('datasets/gold_recovery_test.csv')
train = pd.read_csv('datasets/gold_recovery_train.csv')
In [ ]:
# shape of the data
print(full.shape)
print(train.shape)
print(test.shape)
(22716, 87)
(16860, 87)
(5856, 53)

The test data has fewer columns than the other datasets.

In [ ]:
# looking for missing values
print(full.isna().sum())
print()
print(train.isna().sum())
print()
print(test.isna().sum())
date                                            0
final.output.concentrate_ag                    89
final.output.concentrate_pb                    87
final.output.concentrate_sol                  385
final.output.concentrate_au                    86
                                             ... 
secondary_cleaner.state.floatbank5_a_level    101
secondary_cleaner.state.floatbank5_b_air      101
secondary_cleaner.state.floatbank5_b_level    100
secondary_cleaner.state.floatbank6_a_air      119
secondary_cleaner.state.floatbank6_a_level    101
Length: 87, dtype: int64

date                                            0
final.output.concentrate_ag                    72
final.output.concentrate_pb                    72
final.output.concentrate_sol                  370
final.output.concentrate_au                    71
                                             ... 
secondary_cleaner.state.floatbank5_a_level     85
secondary_cleaner.state.floatbank5_b_air       85
secondary_cleaner.state.floatbank5_b_level     84
secondary_cleaner.state.floatbank6_a_air      103
secondary_cleaner.state.floatbank6_a_level     85
Length: 87, dtype: int64

date                                            0
primary_cleaner.input.sulfate                 302
primary_cleaner.input.depressant              284
primary_cleaner.input.feed_size                 0
primary_cleaner.input.xanthate                166
primary_cleaner.state.floatbank8_a_air         16
primary_cleaner.state.floatbank8_a_level       16
primary_cleaner.state.floatbank8_b_air         16
primary_cleaner.state.floatbank8_b_level       16
primary_cleaner.state.floatbank8_c_air         16
primary_cleaner.state.floatbank8_c_level       16
primary_cleaner.state.floatbank8_d_air         16
primary_cleaner.state.floatbank8_d_level       16
rougher.input.feed_ag                          16
rougher.input.feed_pb                          16
rougher.input.feed_rate                        40
rougher.input.feed_size                        22
rougher.input.feed_sol                         67
rougher.input.feed_au                          16
rougher.input.floatbank10_sulfate             257
rougher.input.floatbank10_xanthate            123
rougher.input.floatbank11_sulfate              55
rougher.input.floatbank11_xanthate            353
rougher.state.floatbank10_a_air                17
rougher.state.floatbank10_a_level              16
rougher.state.floatbank10_b_air                17
rougher.state.floatbank10_b_level              16
rougher.state.floatbank10_c_air                17
rougher.state.floatbank10_c_level              16
rougher.state.floatbank10_d_air                17
rougher.state.floatbank10_d_level              16
rougher.state.floatbank10_e_air                17
rougher.state.floatbank10_e_level              16
rougher.state.floatbank10_f_air                17
rougher.state.floatbank10_f_level              16
secondary_cleaner.state.floatbank2_a_air       20
secondary_cleaner.state.floatbank2_a_level     16
secondary_cleaner.state.floatbank2_b_air       23
secondary_cleaner.state.floatbank2_b_level     16
secondary_cleaner.state.floatbank3_a_air       34
secondary_cleaner.state.floatbank3_a_level     16
secondary_cleaner.state.floatbank3_b_air       16
secondary_cleaner.state.floatbank3_b_level     16
secondary_cleaner.state.floatbank4_a_air       16
secondary_cleaner.state.floatbank4_a_level     16
secondary_cleaner.state.floatbank4_b_air       16
secondary_cleaner.state.floatbank4_b_level     16
secondary_cleaner.state.floatbank5_a_air       16
secondary_cleaner.state.floatbank5_a_level     16
secondary_cleaner.state.floatbank5_b_air       16
secondary_cleaner.state.floatbank5_b_level     16
secondary_cleaner.state.floatbank6_a_air       16
secondary_cleaner.state.floatbank6_a_level     16
dtype: int64

We see missing values in the various datasets

In [ ]:
# creating train recovery list
train = train.dropna()
train_recovery = train['rougher.output.recovery'].dropna()
In [ ]:
# missing values in train recovery
train_recovery.isna().sum()
Out[ ]:
0
In [ ]:
# shape of dataframe
train.shape
Out[ ]:
(11017, 87)
In [ ]:
# recovery calculation

data = train.dropna()

c = data['rougher.output.concentrate_au']     # share of gold in concentrate after flotation
f = data['rougher.input.feed_au']             # share of gold in feed before flotation
t = data['rougher.output.tail_au']  # share of gold in the rougher tails after flotation

calc_recovery = (c * (f-t)) / (f * (c-t)) * 100
In [ ]:
# missing values in calculated recovery
calc_recovery.isna().sum()
Out[ ]:
0
In [ ]:
# shape of dataframe to see if it matches
calc_recovery.shape
Out[ ]:
(11017,)
In [ ]:
# calculating recovery difference between dat and calculation
recovery_difference = train_recovery - calc_recovery
In [ ]:
# recovery difference metrics
recovery_difference.describe()
Out[ ]:
count    1.101700e+04
mean     2.444365e-16
std      1.420577e-14
min     -7.105427e-14
25%     -1.421085e-14
50%      0.000000e+00
75%      1.421085e-14
max      7.105427e-14
dtype: float64
In [ ]:
# Creating recovery merged dataframe
recovery_merged = pd.concat([train_recovery.reset_index(drop=True), calc_recovery.reset_index(drop=True)], axis=1)
recovery_merged.columns = ['train', 'calc']
In [ ]:
# Checking for missing values
recovery_merged[recovery_merged.isna().any(axis=1)]
Out[ ]:
train calc
In [ ]:
# total of missing values
recovery_merged.isna().sum()
Out[ ]:
train    0
calc     0
dtype: int64
In [ ]:
# drop missing values
recovery_merged.dropna(how='any', inplace=True)
In [ ]:
# MAE score between train and calculated 
print(mae(recovery_merged.train, recovery_merged.calc))
9.460144184559453e-15

The mean absolute error between the recovery value in the dataset, and the calculated recovery, is 9.46 x 10^-15. The difference between this two values is on average, unnoticeable.

In [ ]:
# looking at column names among the different datasets
test_cols = test.columns
full_cols = full.columns
train_cols = train.columns

full_not_test = full_cols.difference(test_cols)
train_not_test = train_cols.difference(test_cols)
test_not_train = test_cols.difference(train_cols)
train_and_test = train_cols.intersection(test_cols)
In [ ]:
# dropping missing values in the datasets
full.dropna(how='any', inplace=True, axis=0)
test.dropna(how='any', inplace=True, axis=0)
train.dropna(how='any', inplace=True, axis=0)
In [ ]:
# converting date to datetime
full.date = pd.to_datetime(full.date)
train.date = pd.to_datetime(train.date)
test.date = pd.to_datetime(test.date)

We read the data and inspected it. We addressed the missing values. Since recovery is an important target, we checked whether it was calculated correctly. Our calculated recovery was then compared with the recovery value in the data. The difference between the two values was no more than a rounding error, with an MAE of 10.4.

Data Analysis¶

In [ ]:
# making a filter for desired columns
au_cols = ['rougher.input.feed_au', 'rougher.output.tail_au', 'rougher.output.concentrate_au', 'primary_cleaner.output.tail_au', 
        'primary_cleaner.output.concentrate_au', 'secondary_cleaner.output.tail_au', 'final.output.concentrate_au', 'final.output.tail_au']

ag_cols = ['rougher.input.feed_ag', 'rougher.output.tail_ag', 'rougher.output.concentrate_ag', 'primary_cleaner.output.tail_ag', 
        'primary_cleaner.output.concentrate_ag', 'secondary_cleaner.output.tail_ag','final.output.concentrate_ag', 'final.output.tail_ag'] 

pb_cols = ['rougher.input.feed_pb', 'rougher.output.tail_pb', 'rougher.output.concentrate_pb', 'primary_cleaner.output.tail_pb', 
        'primary_cleaner.output.concentrate_pb', 'secondary_cleaner.output.tail_pb','final.output.concentrate_pb', 'final.output.tail_pb'] 
In [ ]:
#filtering for different metals 
au = full[au_cols]
ag = full[ag_cols]
pb = full[pb_cols]

Gold¶

In [ ]:
# sum of gold columns
au.sum()
Out[ ]:
rougher.input.feed_au                    137072.441231
rougher.output.tail_au                    30228.615372
rougher.output.concentrate_au            322719.333964
primary_cleaner.output.tail_au            63578.983163
primary_cleaner.output.concentrate_au    516063.108326
secondary_cleaner.output.tail_au          70161.881890
final.output.concentrate_au              713083.897490
final.output.tail_au                      50331.499493
dtype: float64
In [ ]:
# function for recovery of gold
def au_recovery(data):
    c = data['rougher.output.concentrate_au'] 
    f = data['rougher.input.feed_au']
    t = data['rougher.output.tail_au']
    flot_recovery = (c * (f-t)) / (f * (c-t)) * 100
    return print(f'{flot_recovery.sum():,}')
In [ ]:
# gold recovery
au_recovery(au)
1,344,794.762862905
In [ ]:
# creating au recovery variable
au_recovery_val = 1344794.762862905
In [ ]:
# bar plot of gold concentration
px.bar(au.sum(), title='Concentration of Gold', color=['concentrate','tail', 'concentrate', 'tail', 'concentrate', 'tail', 'concentrate', 'tail'], 
    log_y=True, height=900)

The recovery of gold increases throughout the purification process, from the rougher input to the final output concentrate. The tail output of the process sees the highest concentration after the primary and secondary cleaning phases. The final output tail then sees some gold. This is intuitive, as the company goal is to extract and purify gold from gold ore. The final output of gold should be higher, especially after multiple rounds of purification. Also, we would see the most loss, with gold in the tails, during the cleaning process.

Silver¶

In [ ]:
# sum of silver columns
ag.sum()
Out[ ]:
rougher.input.feed_ag                    144609.108070
rougher.output.tail_ag                    91828.041196
rougher.output.concentrate_ag            194495.102551
primary_cleaner.output.tail_ag           254725.639489
primary_cleaner.output.concentrate_ag    139613.083984
secondary_cleaner.output.tail_ag         235207.400091
final.output.concentrate_ag               83543.562649
final.output.tail_ag                     157588.966652
dtype: float64
In [ ]:
# function for the silver recovery 
def ag_recovery(data):
    c = data['rougher.output.concentrate_ag'] 
    f = data['rougher.input.feed_ag']
    t = data['rougher.output.tail_ag']
    flot_recovery = (c * (f-t)) / (f * (c-t)) * 100
    return print(f'{flot_recovery.sum():,}')
In [ ]:
# Silver recovery
ag_recovery(ag)
1,008,724.5136078214
In [ ]:
# creating silver recovery variable
ag_recovery_val = 1008724.5136078214
In [ ]:
# bar plot of silver concentration
px.bar(ag.sum(), title='Concentration of Silver', color=['concentrate','tail', 'concentrate', 'tail', 'concentrate', 'tail', 'concentrate', 'tail'], 
    log_y=True, height=900)

We see silver is heavily extracted in the rougher output, and the amount decreases throughout the process until the final output, which is comparatively lower than the rougher input. This is due to most of the silver being lost in the tail. The primary and secondary cleaning steps remove the largest amount of silver from the process, leaving roughly half of the silver from those steps at the final tail output. This is in line with logic, as silver is a byproduct of the process. Since our target is gold, it makes sense that most of the silver would be in the tail at the end of the process. ALso note the scale difference in concentration between gold and silver.

Lead¶

In [ ]:
# sum of lead columns
pb.sum()
Out[ ]:
rougher.input.feed_pb                     58527.783485
rougher.output.tail_pb                    10488.824907
rougher.output.concentrate_pb            121559.514010
primary_cleaner.output.tail_pb            53522.527963
primary_cleaner.output.concentrate_pb    159073.388967
secondary_cleaner.output.tail_pb          88816.101157
final.output.concentrate_pb              160901.269276
final.output.tail_pb                      44600.346604
dtype: float64
In [ ]:
# function for lead recovery
def pb_recovery(data):
    c = data['rougher.output.concentrate_pb'] 
    f = data['rougher.input.feed_pb']
    t = data['rougher.output.tail_pb']
    flot_recovery = (c * (f-t)) / (f * (c-t)) * 100
    return print(f'{flot_recovery.sum():,}')
In [ ]:
# Lead recovery
pb_recovery(pb)
1,389,343.5783014493
In [ ]:
# lead recovery variable
pb_recovery_val = 1389343.5783014493
In [ ]:
# bar plot of lead concentration
px.bar(pb.sum(), title='Concentration of Lead', color=['concentrate','tail', 'concentrate', 'tail', 'concentrate', 'tail', 'concentrate', 'tail'], 
    log_y=True, height=900)

Lead is recovered at a much lower concentration than the target gold, and the byproduct of silver. While the aforementioned are precious metals with similar chemical properties, lead is markedly different. Relative to the amount in the rougher input, we see roughly triple the initial amount in the final concentrate. The cleaning processes remove the most lead, leaving a concentration in the final tail output that is similar to the rougher initial feed.

Comparing Metal Concentrations¶

In [ ]:
# figure comparing metal concentrations
fig = go.Figure()

fig.add_trace(go.Bar(x=au.columns, y=au.sum(), name='gold', marker_color = 'gold'))

fig.add_trace(go.Bar(x=ag.columns, y=ag.sum(), name='silver', marker_color='silver'))

fig.add_trace(go.Bar(x=pb.columns, y=pb.sum(), name='lead', marker_color='black'))

fig.update_layout(barmode='group', height=900, title='Change in Metal Concentration')
fig.show()

Since gold is the desired product, it is intuitive to see the most gold in the final output concentrate, and very little in the output tail. More gold is present in concentrate than silver and lead.

Metal Recovery¶

In [ ]:
# creating a recovery dataframe with metal recovery values
recovery_df = pd.DataFrame({'metal': ['au', 'ag', 'pb'], 'values': [au_recovery_val, ag_recovery_val, pb_recovery_val]})
In [ ]:
# bar plot of recovery dataframe
px.bar(recovery_df, y='values', x='metal', color='metal', title='Metal Recovery')

Here, we are comparing the total recovered of each metal. Since more gold is in the final concentrate than in the tail, the recovery is high. The same applies to lead. Conversely, more silver is found in the tail compared to the concentrate, resulting in lower recovery. This is good, because we would not want to have too much silver in the concentrate, as its chemical properties are similar to gold. One of those key properties is melting point, which would make separating these two elements more difficult. The melting point of lead is far different from gold, which would make separation easier.

Comparing Feed Particle Size¶

In [ ]:
# creating train and test sets
feed_train = train[['primary_cleaner.input.feed_size', 'rougher.input.feed_size']]
feed_test = test[['primary_cleaner.input.feed_size', 'rougher.input.feed_size']]
In [ ]:
# comparing train and test set average particle size
fig = go.Figure()

fig.add_trace(go.Bar(x=feed_train.columns, y=feed_train.mean(), name='train', marker_color = 'black'))

fig.add_trace(go.Bar(x=feed_test.columns, y=feed_test.mean(), name='test', marker_color='blue'))

fig.update_layout(barmode='group', height=900, title='Feed Particle Size')
fig.show()
In [ ]:
# distribution of feed train
px.histogram(feed_train, title='Distribution of Feed Train')
In [ ]:
# distribution of feed test
px.histogram(feed_test, title='Distribution of Feed Test')
In [ ]:
# dstributions of feed particle sizes
fig = go.Figure()

fig.add_trace(go.Histogram(x=feed_train['primary_cleaner.input.feed_size'], name='primary train', marker_color = 'black'))
fig.add_trace(go.Histogram(x=feed_train['rougher.input.feed_size'], name='rougher train', marker_color = 'blue'))
fig.add_trace(go.Histogram(x=feed_test['primary_cleaner.input.feed_size'], name='primary test', marker_color='green'))
fig.add_trace(go.Histogram(x=feed_test['rougher.input.feed_size'], name='rougher test', marker_color='yellow'))
fig.update_layout(height=900, title='Feed Particle Size')
fig.update_traces(opacity=0.75)
fig.show()

This graph illustrates the particle size of the feed decreasing throughout the process. This is crucial, as the particle size is influential in the recovery of gold in ore. Gold dissolution increases with decreasing particle size. Consequently, the distribution of particle size in the training and test set needs to be similar so that the model will evaluate correctly. We see that the test and train samples have a similar distribution.

Concentration values¶

In [ ]:
# making filters for concentration
au_conc = ['rougher.output.concentrate_au', 'primary_cleaner.output.concentrate_au', 'final.output.concentrate_au']

ag_conc = ['rougher.output.concentrate_ag', 'primary_cleaner.output.concentrate_ag', 'final.output.concentrate_ag']

pb_conc = ['rougher.output.concentrate_pb', 'primary_cleaner.output.concentrate_pb', 'final.output.concentrate_pb']

Gold¶

In [ ]:
# filtering full dat set for gold concentrations
full[au_conc].value_counts(ascending=True)
Out[ ]:
rougher.output.concentrate_au  primary_cleaner.output.concentrate_au  final.output.concentrate_au
0.000000                       0.000000                               0.000000                       1
21.350223                      38.096258                              47.853374                      1
21.350486                      30.132889                              46.256522                      1
21.350534                      27.592474                              44.099880                      1
21.351287                      33.250093                              42.516409                      1
                                                                                                    ..
28.824507                      43.427388                              50.897966                      1
20.386141                      32.168625                              44.427296                      2
17.099216                      27.765463                              42.527939                      4
0.010000                       36.306431                              45.270618                      4
18.652372                      30.350809                              46.105030                      6
Length: 16082, dtype: int64
In [ ]:
# values of rougher output concentrations for gold
full['rougher.output.concentrate_au'].value_counts()
Out[ ]:
0.000000     301
18.652372      6
0.010000       5
17.099216      4
20.341888      2
            ... 
23.054626      1
23.225268      1
23.470466      1
23.439251      1
17.804134      1
Name: rougher.output.concentrate_au, Length: 15780, dtype: int64
In [ ]:
# summary stats on gold rougher output concentrate
full['rougher.output.concentrate_au'].describe()
Out[ ]:
count    16094.000000
mean        20.052152
std          3.620905
min          0.000000
25%         19.142941
50%         20.507430
75%         21.916971
max         28.824507
Name: rougher.output.concentrate_au, dtype: float64
In [ ]:
# distribution of gold concentrate
px.histogram(full['rougher.output.concentrate_au'], title='Gold Concentrate')
In [ ]:
# gold concentration
px.bar(full[au_conc].sum(), color=['Flotation', 'Primary Cleaner', 'Secondary Cleaner'], title='Concentration of Gold', log_y=True, height=900)

Rougher output of gold appears to be normally distributed around the mean of 20.4. We see outliers in our data, where the output concentrate is 0 for 301 samples. Overall, the trend is an increase in gold concentration throughout the various processes.

Silver¶

In [ ]:
# values of silver concentrate
full['rougher.output.concentrate_ag'].value_counts()
Out[ ]:
0.000000     301
9.252737       6
0.010000       5
12.098115      4
9.975555       2
            ... 
9.657590       1
9.956605       1
9.934467       1
9.949574       1
11.959486      1
Name: rougher.output.concentrate_ag, Length: 15780, dtype: int64
In [ ]:
# summary stats on silver concentrate
full['rougher.output.concentrate_ag'].describe()
Out[ ]:
count    16094.000000
mean        12.084945
std          2.697948
min          0.000000
25%         10.664288
50%         12.232367
75%         13.835104
max         21.725695
Name: rougher.output.concentrate_ag, dtype: float64
In [ ]:
# distribution of silver concentrate
px.histogram(full['rougher.output.concentrate_ag'], title='Silver Concentrate')
In [ ]:
# concentration of silver
px.bar(full[ag_conc].sum(), color=['Flotation', 'Primary Cleaner', 'Secondary Cleaner'], title='Concentration of Silver', log_y=True, height=900)

Silver concentration appears to be normally distributed around the mean of 12.3, also with 301 rougher values of 0. The concentration of silver decreases throughout the process.

Lead¶

In [ ]:
# values of lead concentration
full['rougher.output.concentrate_pb'].value_counts()
Out[ ]:
0.000000     301
8.452148       7
7.944824       6
9.764648       6
8.577148       5
            ... 
8.258346       1
7.492099       1
7.468866       1
7.760724       1
10.702148      1
Name: rougher.output.concentrate_pb, Length: 15734, dtype: int64
In [ ]:
# summary stats on lead concentration
full['rougher.output.concentrate_pb'].describe()
Out[ ]:
count    16094.000000
mean         7.553095
std          1.688073
min          0.000000
25%          6.696978
50%          7.698308
75%          8.510786
max         12.702148
Name: rougher.output.concentrate_pb, dtype: float64
In [ ]:
# distribution of lead concentration
px.histogram(full['rougher.output.concentrate_pb'], title='Lead Concentrate')
In [ ]:
# lead concentration
px.bar(full[pb_conc].sum(), color=['Flotation', 'Primary Cleaner', 'Secondary Cleaner'], title='Concentration of Lead', log_y=True, height=900)

The concentration of lead appears somewhat normally distributed around the mean of 7.7. There are 301 values with 0 rougher output concentrations. The Concentration of lead increases throughout the process.

Stage Distributions¶

In [ ]:
# sum of metals at different stages
rougher_input = full[['rougher.input.feed_au','rougher.input.feed_ag', 'rougher.input.feed_pb']].sum(axis=1)

rougher_output = full[['rougher.output.tail_au','rougher.output.tail_ag', 'rougher.output.tail_pb']].sum(axis=1)

rougher_concentrate = full[['rougher.output.concentrate_au','rougher.output.concentrate_ag', 
                            'rougher.output.concentrate_pb']].sum(axis=1)

cleaner_output = full[['primary_cleaner.output.tail_au', 'primary_cleaner.output.tail_ag', 
                       'primary_cleaner.output.tail_pb' ]].sum(axis=1)

cleaner_concentrate = full[['primary_cleaner.output.concentrate_au', 'primary_cleaner.output.concentrate_ag', 
                            'primary_cleaner.output.concentrate_pb']].sum(axis=1)

secondary_output = full[['secondary_cleaner.output.tail_au', 'secondary_cleaner.output.tail_ag', 
                         'secondary_cleaner.output.tail_pb' ]].sum(axis=1)

final_tail = full[['final.output.tail_au', 'final.output.tail_ag', 'final.output.tail_pb']].sum(axis=1)

final_concentrate = full[['final.output.concentrate_au', 'final.output.concentrate_ag', 
                          'final.output.concentrate_pb']].sum(axis=1)
In [ ]:
# distribution of metals at various stages
fig = go.Figure()

fig.add_trace(go.Histogram(x=rougher_input, name='Rougher Input Feed', marker_color = 'black'))
fig.add_trace(go.Histogram(x=rougher_output, name='Rougher Output Tail', marker_color = 'blue'))
fig.add_trace(go.Histogram(x=rougher_concentrate, name='Rougher Output Concentrate', marker_color='green'))
fig.add_trace(go.Histogram(x=cleaner_output, name='Primary Cleaner Output Tail', marker_color='pink'))
fig.add_trace(go.Histogram(x=cleaner_concentrate, name='Primary Cleaner Output Concentrate', marker_color = 'red'))
fig.add_trace(go.Histogram(x=secondary_output, name='Secondary Cleaner Output Tail', marker_color='orange'))
fig.add_trace(go.Histogram(x=final_tail, name='Final Output Tail', marker_color='purple'))
fig.add_trace(go.Histogram(x=final_concentrate, name='Final Output Concentrate', marker_color='yellow'))

fig.update_layout(height=900, title='Purification Stages')
fig.update_traces(opacity=0.75)
fig.show()

Results¶

The data will need to be cleaned of the 0, and near zero concentration sums shown on the lower left of the distribution. Input feeds of near zero have no value. The other values in this area of the distribution should be removed as well. These values are anomalies, as they defy the law of conservation of mass. If the input of the process is not zero, then the outputs should not be zero. What is not found in the concentrate should be found in the tail, and vice versa. These zero values represent ore that has disappeared from the system. The data also illustrates gold concentration increasing throughout the extraction and purification processes. Lead also increases in concentration, but at a much smaller scale to gold. Silver concentration decreases throughout the process, as most of the silver is removed to the tails.

Model Preparation¶

In [ ]:
# looking at the count rows
full.shape
Out[ ]:
(16094, 87)
In [ ]:
# removing the zero concentration values from the datasets 
full = full[full['rougher.output.concentrate_ag'] > 0.25]
train = train[train['rougher.output.concentrate_ag'] > 0.25]


full = full[full['rougher.output.concentrate_au'] > 0.25]
train = train[train['rougher.output.concentrate_au'] > 0.25]


full = full[full['rougher.output.concentrate_pb'] > 0.25]
train = train[train['rougher.output.concentrate_pb'] > 0.25]
In [ ]:
# ensuring the number of rows has changed to account for the removal of zero concentration values
full.shape
Out[ ]:
(15787, 87)
In [ ]:
# looking at the shape of train dataset
train.shape
Out[ ]:
(10806, 87)
In [ ]:
# looking at the shape of test dataset
test.shape
Out[ ]:
(5383, 53)
In [ ]:
# Take dates from data
date = full['date']
In [ ]:
# Check number of rows matches full shape
date.shape
Out[ ]:
(15787,)
In [ ]:
# merging full with date, as key for merging
full_date = pd.concat([date, full[full_not_test]], axis=1)
In [ ]:
# shape of the dataset
full_date.shape
Out[ ]:
(15787, 35)
In [ ]:
# merging full with test, to incorporate missing columns
full_test = test.merge(full_date, left_on='date', right_on='date')
In [ ]:
# check columns 
full_test.columns
Out[ ]:
Index(['date', 'primary_cleaner.input.sulfate',
       'primary_cleaner.input.depressant', 'primary_cleaner.input.feed_size',
       'primary_cleaner.input.xanthate',
       'primary_cleaner.state.floatbank8_a_air',
       'primary_cleaner.state.floatbank8_a_level',
       'primary_cleaner.state.floatbank8_b_air',
       'primary_cleaner.state.floatbank8_b_level',
       'primary_cleaner.state.floatbank8_c_air',
       'primary_cleaner.state.floatbank8_c_level',
       'primary_cleaner.state.floatbank8_d_air',
       'primary_cleaner.state.floatbank8_d_level', 'rougher.input.feed_ag',
       'rougher.input.feed_pb', 'rougher.input.feed_rate',
       'rougher.input.feed_size', 'rougher.input.feed_sol',
       'rougher.input.feed_au', 'rougher.input.floatbank10_sulfate',
       'rougher.input.floatbank10_xanthate',
       'rougher.input.floatbank11_sulfate',
       'rougher.input.floatbank11_xanthate', 'rougher.state.floatbank10_a_air',
       'rougher.state.floatbank10_a_level', 'rougher.state.floatbank10_b_air',
       'rougher.state.floatbank10_b_level', 'rougher.state.floatbank10_c_air',
       'rougher.state.floatbank10_c_level', 'rougher.state.floatbank10_d_air',
       'rougher.state.floatbank10_d_level', 'rougher.state.floatbank10_e_air',
       'rougher.state.floatbank10_e_level', 'rougher.state.floatbank10_f_air',
       'rougher.state.floatbank10_f_level',
       'secondary_cleaner.state.floatbank2_a_air',
       'secondary_cleaner.state.floatbank2_a_level',
       'secondary_cleaner.state.floatbank2_b_air',
       'secondary_cleaner.state.floatbank2_b_level',
       'secondary_cleaner.state.floatbank3_a_air',
       'secondary_cleaner.state.floatbank3_a_level',
       'secondary_cleaner.state.floatbank3_b_air',
       'secondary_cleaner.state.floatbank3_b_level',
       'secondary_cleaner.state.floatbank4_a_air',
       'secondary_cleaner.state.floatbank4_a_level',
       'secondary_cleaner.state.floatbank4_b_air',
       'secondary_cleaner.state.floatbank4_b_level',
       'secondary_cleaner.state.floatbank5_a_air',
       'secondary_cleaner.state.floatbank5_a_level',
       'secondary_cleaner.state.floatbank5_b_air',
       'secondary_cleaner.state.floatbank5_b_level',
       'secondary_cleaner.state.floatbank6_a_air',
       'secondary_cleaner.state.floatbank6_a_level',
       'final.output.concentrate_ag', 'final.output.concentrate_au',
       'final.output.concentrate_pb', 'final.output.concentrate_sol',
       'final.output.recovery', 'final.output.tail_ag', 'final.output.tail_au',
       'final.output.tail_pb', 'final.output.tail_sol',
       'primary_cleaner.output.concentrate_ag',
       'primary_cleaner.output.concentrate_au',
       'primary_cleaner.output.concentrate_pb',
       'primary_cleaner.output.concentrate_sol',
       'primary_cleaner.output.tail_ag', 'primary_cleaner.output.tail_au',
       'primary_cleaner.output.tail_pb', 'primary_cleaner.output.tail_sol',
       'rougher.calculation.au_pb_ratio',
       'rougher.calculation.floatbank10_sulfate_to_au_feed',
       'rougher.calculation.floatbank11_sulfate_to_au_feed',
       'rougher.calculation.sulfate_to_au_concentrate',
       'rougher.output.concentrate_ag', 'rougher.output.concentrate_au',
       'rougher.output.concentrate_pb', 'rougher.output.concentrate_sol',
       'rougher.output.recovery', 'rougher.output.tail_ag',
       'rougher.output.tail_au', 'rougher.output.tail_pb',
       'rougher.output.tail_sol', 'secondary_cleaner.output.tail_ag',
       'secondary_cleaner.output.tail_au', 'secondary_cleaner.output.tail_pb',
       'secondary_cleaner.output.tail_sol'],
      dtype='object')
In [ ]:
# ensuring we have a total of 87 columns
full_test.shape
Out[ ]:
(4981, 87)
In [ ]:
# looking at the differences, and similarities in the columns of the datasets
full_not_test = full_cols.difference(test_cols)
train_not_test = train_cols.difference(test_cols)
test_not_train = test_cols.difference(train_cols)
train_and_test = train_cols.intersection(test_cols)
In [ ]:
# these columns are missing from the test dataset, but are in the training data
train_not_test
Out[ ]:
Index(['final.output.concentrate_ag', 'final.output.concentrate_au',
       'final.output.concentrate_pb', 'final.output.concentrate_sol',
       'final.output.recovery', 'final.output.tail_ag', 'final.output.tail_au',
       'final.output.tail_pb', 'final.output.tail_sol',
       'primary_cleaner.output.concentrate_ag',
       'primary_cleaner.output.concentrate_au',
       'primary_cleaner.output.concentrate_pb',
       'primary_cleaner.output.concentrate_sol',
       'primary_cleaner.output.tail_ag', 'primary_cleaner.output.tail_au',
       'primary_cleaner.output.tail_pb', 'primary_cleaner.output.tail_sol',
       'rougher.calculation.au_pb_ratio',
       'rougher.calculation.floatbank10_sulfate_to_au_feed',
       'rougher.calculation.floatbank11_sulfate_to_au_feed',
       'rougher.calculation.sulfate_to_au_concentrate',
       'rougher.output.concentrate_ag', 'rougher.output.concentrate_au',
       'rougher.output.concentrate_pb', 'rougher.output.concentrate_sol',
       'rougher.output.recovery', 'rougher.output.tail_ag',
       'rougher.output.tail_au', 'rougher.output.tail_pb',
       'rougher.output.tail_sol', 'secondary_cleaner.output.tail_ag',
       'secondary_cleaner.output.tail_au', 'secondary_cleaner.output.tail_pb',
       'secondary_cleaner.output.tail_sol'],
      dtype='object')

When making features, we have to limit the training set to features we have in common with the test dataset

In [ ]:
# making the features and training samples from the datasets
features_train = train[train_and_test].drop(['date'], axis=1)
target_train = train[['final.output.recovery' , 'rougher.output.recovery']]

features_test = full_test[train_and_test].drop(['date'], axis=1)
target_test = full_test[['final.output.recovery', 'rougher.output.recovery']]
In [ ]:
#Create function to calculate sMAPE.
def smape(y_true, y_pred):
    smape = 100/len(y_true) * np.sum(2 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred)))
    return smape

#Create function to calculate final sMAPE.
def f_smape(y_true, y_pred):
    predicted_rough, predicted_final = y_pred[:, 1], y_pred[:, 0]
    true_rough, true_final = y_true.iloc[:, 1], y_true.iloc[:, 0]
    f_smape = (.25 * (smape(true_rough, predicted_rough))) + (.75 * (smape(true_final, predicted_final)))
    return f_smape

#Create function to calculate final sMAPE.
def f_smape2(y_true, y_pred):
    predicted_rough, predicted_final = y_pred[:, 1], y_pred[:, 0]
    true_rough, true_final = y_true.iloc[:, 1], y_true.iloc[:, 0]
    f_smape = -1 * (.25 * (smape(true_rough, predicted_rough))) + (.75 * (smape(true_final, predicted_final)))
    return f_smape
In [ ]:
# turning our smape function into a scorer for cross validation 
f_smape_score = make_scorer(f_smape2, greater_is_better=True)

Models¶

Decision Tree¶

In [ ]:
# Decision Tree
model1 = DecisionTreeRegressor(random_state=19)
model1.fit(features_train, target_train) # train model on training set
Out[ ]:
DecisionTreeRegressor(random_state=19)
In [ ]:
# Cross validation using final smape as scoring
scores1 = cross_val_score(model1, features_train, target_train, scoring=f_smape_score, cv=5) 
final_score1 = sum(scores1) / len(scores1)
print('Average model evaluation score:', final_score1)
Average model evaluation score: 12.40650357907823

Random Forest¶

In [ ]:
# Random forest 
model2 = RandomForestRegressor(random_state=19)
model2.fit(features_train, target_train) # train model on training set
    
Out[ ]:
RandomForestRegressor(random_state=19)
In [ ]:
# Cross validation using final smape as scoring
scores2 = cross_val_score(model2, features_train, target_train, scoring=f_smape_score, cv=5) 
final_score2 = sum(scores2) / len(scores2)
print('Average model evaluation score:', final_score2)
Average model evaluation score: 6.615428939418484

Linear Regression¶

In [ ]:
# Linear regression
model3 = LinearRegression() # initialize model constructor
model3.fit(features_train, target_train) # train model on training set
Out[ ]:
LinearRegression()
In [ ]:
# Cross validation using final smape as scoring
scores3 = cross_val_score(model3, features_train, target_train, scoring=f_smape_score, cv=5) 
final_score3 = sum(scores3) / len(scores3)
print('Average model evaluation score:', final_score3)
Average model evaluation score: 5.781117854051062

Final Model¶

In [ ]:
# final model 
final_model = RandomForestRegressor(random_state=19)
final_model.fit(features_train, target_train)
final_predictions = final_model.predict(features_test)
result = f_smape(target_test, final_predictions)
print('Final sMAPE score of test data: ', result) 
Final sMAPE score of test data:  8.16844495035761
In [ ]:
# creating a dummy regressor to mimic a constant model that always predicts mean of the train set targets
dummy_regr = DummyRegressor(strategy='mean')
dummy_regr.fit(features_train, target_train)
dummy_predictions = dummy_regr.predict(features_test)
f_smape(target_test, dummy_predictions)
Out[ ]:
7.819848574876006

Since we negated and maximized our scoring function, the model with the highest sMAPE is the best model, when comparing cross validation scores. The best model to use is a decision tree regressor. We create a final model and achieve a sMAPE score of 11.18. This compares to a dummy model, where the model is always predicting the mean of train set targets.

Conclusion¶

Overall, we were able to work with the data we received to complete the project. We ensured the recovery column was calculated correctly, by comparing it with our calculated values. We looked at the distribution of concentrations for the various metals, and saw anomalies, which where removed. The data illustrated the increase in gold concentrations in the final product, and a small amount of gold in the tails. We also looked at the recovery of gold, and compared it with the other metals. Finally, we successfully trained a model that could predict the gold recovery, and we found the decision tree to be the best model to use. Therefore, Zyfra can use this model to optimize their gold ore refining process.