Rusty Bargain is developing an app to attract new customers by letting them quickly find out the market value of their car. We are given access to historical data: technical specifications, trim versions, and prices. Rusty Bargain wants a model that predicts a car's value while balancing three criteria: the quality of the prediction, the speed of the prediction, and the time required for training.
# !pip install --user plotly_express
# import libraries
import pandas as pd
import numpy as np
import time
import plotly_express as px
import plotly.graph_objects as go
from sklearn.metrics import mean_squared_error as mse
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer, KNNImputer
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from sklearn.pipeline import Pipeline
import lightgbm as lgb
from catboost import CatBoostRegressor, Pool
import xgboost as xgb
# show graphs in html
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook"
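Because the project will be judged on prediction quality, prediction speed, and training time, we will compare models on all three criteria. The helper below is a minimal sketch of how such a comparison could be run later, reusing the time and mean_squared_error imports above; the function name and report format are our own.
# sketch: fit a model, then report RMSE, training time, and prediction time
def evaluate_model(model, features_train, target_train, features_test, target_test):
    start = time.time()
    model.fit(features_train, target_train)
    train_time = time.time() - start
    start = time.time()
    predictions = model.predict(features_test)
    predict_time = time.time() - start
    rmse = mse(target_test, predictions) ** 0.5
    return {'rmse': rmse, 'train_time_s': train_time, 'predict_time_s': predict_time}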
# read dataset
df = pd.read_csv('datasets/car_data.csv', parse_dates=True)
# look at data
df.head()
| | DateCrawled | Price | VehicleType | RegistrationYear | Gearbox | Power | Model | Mileage | RegistrationMonth | FuelType | Brand | NotRepaired | DateCreated | NumberOfPictures | PostalCode | LastSeen |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 24/03/2016 11:52 | 480 | NaN | 1993 | manual | 0 | golf | 150000 | 0 | petrol | volkswagen | NaN | 24/03/2016 00:00 | 0 | 70435 | 07/04/2016 03:16 |
1 | 24/03/2016 10:58 | 18300 | coupe | 2011 | manual | 190 | NaN | 125000 | 5 | gasoline | audi | yes | 24/03/2016 00:00 | 0 | 66954 | 07/04/2016 01:46 |
2 | 14/03/2016 12:52 | 9800 | suv | 2004 | auto | 163 | grand | 125000 | 8 | gasoline | jeep | NaN | 14/03/2016 00:00 | 0 | 90480 | 05/04/2016 12:47 |
3 | 17/03/2016 16:54 | 1500 | small | 2001 | manual | 75 | golf | 150000 | 6 | petrol | volkswagen | no | 17/03/2016 00:00 | 0 | 91074 | 17/03/2016 17:40 |
4 | 31/03/2016 17:25 | 3600 | small | 2008 | manual | 69 | fabia | 90000 | 7 | gasoline | skoda | no | 31/03/2016 00:00 | 0 | 60437 | 06/04/2016 10:17 |
# summary of columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64
 8   RegistrationMonth  354369 non-null  int64
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64
 14  PostalCode         354369 non-null  int64
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(9)
memory usage: 43.3+ MB
# looking for missing values
df.isna().sum()
DateCrawled              0
Price                    0
VehicleType          37490
RegistrationYear         0
Gearbox              19833
Power                    0
Model                19705
Mileage                  0
RegistrationMonth        0
FuelType             32895
Brand                    0
NotRepaired          71154
DateCreated              0
NumberOfPictures         0
PostalCode               0
LastSeen                 0
dtype: int64
# rename columns to snake_case
df.columns = ['date_crawled', 'price', 'vehicle_type', 'registration_year', 'gearbox', 'power', 'model', 'mileage', 'registration_month', 'fuel_type', 'brand', 'not_repaired', 'date_created', 'number_of_pictures', 'postal_code', 'last_seen']
# percentage of missing not_repaired values
df.not_repaired.isna().sum() / len(df) * 100
20.079070121822166
# percentage of missing vehicle_type values
df.vehicle_type.isna().sum() / len(df) * 100
10.57936783409384
# percentage of missing gearbox values
df.gearbox.isna().sum() / len(df) * 100
5.596708515699737
# percentage of missing fuel_type values
df.fuel_type.isna().sum() / len(df) * 100
9.282696849893755
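The same check can also be done for every column in one pass; a compact alternative:
# percentage of missing values per column (only columns that have gaps)
missing_share = df.isna().mean() * 100
missing_share[missing_share > 0].sort_values(ascending=False)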
# convert date_crawled to datetime (dates are day-first, e.g. 24/03/2016)
df.date_crawled = pd.to_datetime(df.date_crawled, dayfirst=True)
# check proper implementation
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column              Non-Null Count   Dtype
---  ------              --------------   -----
 0   date_crawled        354369 non-null  datetime64[ns]
 1   price               354369 non-null  int64
 2   vehicle_type        316879 non-null  object
 3   registration_year   354369 non-null  int64
 4   gearbox             334536 non-null  object
 5   power               354369 non-null  int64
 6   model               334664 non-null  object
 7   mileage             354369 non-null  int64
 8   registration_month  354369 non-null  int64
 9   fuel_type           321474 non-null  object
 10  brand               354369 non-null  object
 11  not_repaired        283215 non-null  object
 12  date_created        354369 non-null  object
 13  number_of_pictures  354369 non-null  int64
 14  postal_code         354369 non-null  int64
 15  last_seen           354369 non-null  object
dtypes: datetime64[ns](1), int64(7), object(8)
memory usage: 43.3+ MB
# extract date and time features from date_crawled
df['year_crawled'] = df.date_crawled.dt.year
df['month_crawled'] = df.date_crawled.dt.month
df['day_crawled'] = df.date_crawled.dt.day
df['time_crawled'] = df.date_crawled.dt.hour
# drop unnecessary columns
df.drop(['date_crawled', 'date_created', 'last_seen'], axis=1, inplace=True)
# preview of the data
df.head()
| | price | vehicle_type | registration_year | gearbox | power | model | mileage | registration_month | fuel_type | brand | not_repaired | number_of_pictures | postal_code | year_crawled | month_crawled | day_crawled | time_crawled |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 480 | NaN | 1993 | manual | 0 | golf | 150000 | 0 | petrol | volkswagen | NaN | 0 | 70435 | 2016 | 3 | 24 | 11 |
1 | 18300 | coupe | 2011 | manual | 190 | NaN | 125000 | 5 | gasoline | audi | yes | 0 | 66954 | 2016 | 3 | 24 | 10 |
2 | 9800 | suv | 2004 | auto | 163 | grand | 125000 | 8 | gasoline | jeep | NaN | 0 | 90480 | 2016 | 3 | 14 | 12 |
3 | 1500 | small | 2001 | manual | 75 | golf | 150000 | 6 | petrol | volkswagen | no | 0 | 91074 | 2016 | 3 | 17 | 16 |
4 | 3600 | small | 2008 | manual | 69 | fabia | 90000 | 7 | gasoline | skoda | no | 0 | 60437 | 2016 | 3 | 31 | 17 |
# look at the column names
df.columns
Index(['price', 'vehicle_type', 'registration_year', 'gearbox', 'power', 'model', 'mileage', 'registration_month', 'fuel_type', 'brand', 'not_repaired', 'number_of_pictures', 'postal_code', 'year_crawled', 'month_crawled', 'day_crawled', 'time_crawled'], dtype='object')
We see many missing values in the categorical columns, which need to be filled for some of our machine learning models. Vehicle type can be imputed from the most common vehicle type for each brand and model of car. The gearbox, model, fuel type, and not repaired columns will be filled with the mode of the respective column. Rows with missing values in several of these columns carry too little information to impute reliably, so they will be dropped, as sketched below.
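The sketch below outlines this imputation plan; the function name is our own, and the details may change when the models are built.
# sketch of the imputation plan: vehicle_type from the brand/model mode, other categoricals from the column mode
def impute_categoricals(data):
    data = data.copy()
    # most common vehicle_type within each brand/model group (NaN if a group has no known type)
    group_mode = (
        data.groupby(['brand', 'model'])['vehicle_type']
            .transform(lambda s: s.mode()[0] if not s.mode().empty else np.nan)
    )
    data['vehicle_type'] = data['vehicle_type'].fillna(group_mode)
    # remaining categorical columns: fill with the overall column mode
    for column in ['gearbox', 'model', 'fuel_type', 'not_repaired']:
        data[column] = data[column].fillna(data[column].mode()[0])
    return data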
# correlation matrix of the numeric columns
df.corr()
| | price | registration_year | power | mileage | registration_month | number_of_pictures | postal_code | year_crawled | month_crawled | day_crawled | time_crawled |
---|---|---|---|---|---|---|---|---|---|---|---|
price | 1.000000 | 0.026916 | 0.158872 | -0.333199 | 0.110581 | NaN | 0.076055 | NaN | -0.007529 | -0.003374 | 0.017560 |
registration_year | 0.026916 | 1.000000 | -0.000828 | -0.053447 | -0.011619 | NaN | -0.003459 | NaN | -0.001598 | 0.001513 | -0.001617 |
power | 0.158872 | -0.000828 | 1.000000 | 0.024002 | 0.043380 | NaN | 0.021665 | NaN | 0.000750 | -0.001815 | -0.000617 |
mileage | -0.333199 | -0.053447 | 0.024002 | 1.000000 | 0.009571 | NaN | -0.007698 | NaN | 0.000245 | 0.001013 | -0.011518 |
registration_month | 0.110581 | -0.011619 | 0.043380 | 0.009571 | 1.000000 | NaN | 0.013995 | NaN | 0.003955 | -0.002674 | 0.012123 |
number_of_pictures | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
postal_code | 0.076055 | -0.003459 | 0.021665 | -0.007698 | 0.013995 | NaN | 1.000000 | NaN | -0.010531 | 0.004484 | 0.016590 |
year_crawled | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
month_crawled | -0.007529 | -0.001598 | 0.000750 | 0.000245 | 0.003955 | NaN | -0.010531 | NaN | 1.000000 | -0.572049 | -0.000910 |
day_crawled | -0.003374 | 0.001513 | -0.001815 | 0.001013 | -0.002674 | NaN | 0.004484 | NaN | -0.572049 | 1.000000 | 0.006911 |
time_crawled | 0.017560 | -0.001617 | -0.000617 | -0.011518 | 0.012123 | NaN | 0.016590 | NaN | -0.000910 | 0.006911 | 1.000000 |
# summary statistics of data
df.describe()
| | price | registration_year | power | mileage | registration_month | number_of_pictures | postal_code | year_crawled | month_crawled | day_crawled | time_crawled |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 354369.000000 | 354369.000000 | 354369.000000 | 354369.000000 | 354369.000000 | 354369.0 | 354369.000000 | 354369.0 | 354369.000000 | 354369.000000 | 354369.000000 |
mean | 4416.656776 | 2004.234448 | 110.094337 | 128211.172535 | 5.714645 | 0.0 | 50508.689087 | 2016.0 | 4.406150 | 14.609021 | 14.909665 |
std | 4514.158514 | 90.227958 | 189.850405 | 37905.341530 | 3.726421 | 0.0 | 25783.096248 | 0.0 | 2.804298 | 10.190826 | 5.422938 |
min | 0.000000 | 1000.000000 | 0.000000 | 5000.000000 | 0.000000 | 0.0 | 1067.000000 | 2016.0 | 1.000000 | 3.000000 | 0.000000 |
25% | 1050.000000 | 1999.000000 | 69.000000 | 125000.000000 | 3.000000 | 0.0 | 30165.000000 | 2016.0 | 3.000000 | 4.000000 | 12.000000 |
50% | 2700.000000 | 2003.000000 | 105.000000 | 150000.000000 | 6.000000 | 0.0 | 49413.000000 | 2016.0 | 3.000000 | 16.000000 | 16.000000 |
75% | 6400.000000 | 2008.000000 | 143.000000 | 150000.000000 | 9.000000 | 0.0 | 71083.000000 | 2016.0 | 5.000000 | 24.000000 | 19.000000 |
max | 20000.000000 | 9999.000000 | 20000.000000 | 150000.000000 | 12.000000 | 0.0 | 99998.000000 | 2016.0 | 12.000000 | 31.000000 | 23.000000 |
# distribution of number_of_pictures values
df.number_of_pictures.value_counts(dropna=False)
0    354369
Name: number_of_pictures, dtype: int64
# drop unnecessary column
df.drop('number_of_pictures', axis=1, inplace=True)
Every value in number_of_pictures is zero (which is also why its correlations above are all NaN), so the column carries no information and we drop it.
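As an extra sanity check (our own addition), we can confirm that no other column is constant:
# number of unique values per column; a count of 1 would indicate another constant column
df.nunique().sort_values().head()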
# drop rows where both vehicle_type and model are missing
mask = (df['vehicle_type'].isnull() & df['model'].isnull())
df.drop(df[mask].index, inplace=True)
# Check for proper implementation
(df['vehicle_type'].isnull() & df['model'].isnull()).sum()
0
We dropped the rows where both vehicle type and model were missing, since such rows carry too little information to impute reliably.
# petrol and gasoline are the same
df.fuel_type = df.fuel_type.replace('gasoline', 'petrol')
Gasoline and petrol refer to the same fuel (gasoline is simply the American term), so we merge the two categories.
# sample of listings with registration years before 1960
df.query("registration_year < 1960").head()
| | price | vehicle_type | registration_year | gearbox | power | model | mileage | registration_month | fuel_type | brand | not_repaired | postal_code | year_crawled | month_crawled | day_crawled | time_crawled |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15 | 450 | small | 1910 | NaN | 0 | ka | 5000 | 0 | petrol | ford | NaN | 24148 | 2016 | 11 | 3 | 21 |
1928 | 7000 | suv | 1945 | manual | 48 | other | 150000 | 2 | petrol | volkswagen | no | 58135 | 2016 | 3 | 25 | 15 |
2273 | 1800 | convertible | 1925 | NaN | 0 | NaN | 5000 | 1 | NaN | sonstige_autos | no | 79288 | 2016 | 3 | 15 | 21 |
3333 | 10500 | sedan | 1955 | manual | 30 | other | 60000 | 0 | petrol | ford | NaN | 53498 | 2016 | 3 | 15 | 21 |
6629 | 0 | small | 1910 | NaN | 0 | NaN | 5000 | 1 | other | sonstige_autos | NaN | 93105 | 2016 | 2 | 4 | 13 |
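Before filtering, it is worth confirming how little data lies outside the registration-year range we are about to keep; a quick check:
# rows with registration years outside 1961-2019 (the range kept below)
outside_range = ~df['registration_year'].between(1961, 2019)
print(outside_range.sum(), 'rows,', round(outside_range.mean() * 100, 2), '% of the data')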
# filter out registration-year outliers
df = df.query("(registration_year > 1960) and (registration_year <= 2019)")
We remove outliers based on registration year. Automobiles have existed in Europe since the late 1800s, but it is reasonable to assume that cars from that era would not be listed on the Rusty Bargain app, and the dataset contains very few cars registered before 1960, so little data is lost. Registration years above 2019, on the other hand, would place the registration in the future (the maximum in the data is 9999), which can only be a data-entry error. Removing these outliers should help our models perform better.
# registration-year distribution after filtering
df.registration_year.value_counts()
2000    23177
1999    22717
2005    21566
2001    20115
2006    19830
2003    19754
2004    19598
2002    19087
1998    17880
2007    17142
2008    15358
1997    14649
2009    14460
1996    10823
2010    10723
2011     9884
1995     9170
2017     9015
2016     8318
2012     7166
1994     4840
2013     4114
1993     3434
2018     3386
1992     3001
1991     2761
2014     2519
1990     2501
1989     1317
2015     1192
1988      941
1985      684
1987      666
1986      537
1980      482
1983      430
1984      399
1982      316
1978      285
1979      283
1970      272
1981      263
1972      213
1976      175
1977      173
1973      173
1974      168
1971      167
1975      143
1969      118
1966      111
1968      108
1967       99
1965       82
1964       64
1963       57
1962       46
1961       41
2019       20
Name: registration_year, dtype: int64
# boxplot of registration_year
fig1 = px.box(df.registration_year)
fig1