Yachay is an open-source machine learning community with decades' worth of natural language data from media, the dark web, legal proceedings, and government publications. The community has cleaned and annotated the data and built a geolocation detection tool, and is looking for developers interested in contributing to and improving the project. We are given a dataset of tweets and a dataset of cluster coordinates, from which we will build a neural network that predicts coordinates from text.
# import libraries
import pandas as pd
import plotly.express as px
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder, StandardScaler
from sklearn.compose import make_column_transformer
import seaborn as sns
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input, concatenate
from tensorflow.keras.optimizers import Adam
import tensorflow as tf
from numpy import genfromtxt
import torch
import transformers
from tqdm.auto import tqdm
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
import math
from sklearn.metrics.pairwise import haversine_distances
from math import radians
from transformers import XLMRobertaTokenizerFast
import joblib
from math import ceil
# show graphs in html
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook"
# read dataset
df_main = pd.read_csv('data/Main_Dataset.csv', parse_dates=['timestamp'], index_col=['timestamp'])
# sort by timestamp
df_main.sort_index(inplace=True)
# look at dataset
df_main.head()
| timestamp | id | text | user_id | cluster_id |
|---|---|---|---|---|
| 2021-02-01 12:14:04 | 262304 | he was accused of being a thief when entering ... | 8.301517e+08 | 345 |
| 2021-02-01 12:19:19 | 480231 | can you blame him they are delicious | 1.817364e+08 | 1775 |
| 2021-02-01 12:21:52 | 241532 | damn sholl is a new month ain’t it | 2.383825e+08 | 288 |
| 2021-02-01 12:28:10 | 324986 | ain’t felt this way inna min | 4.598618e+08 | 603 |
| 2021-02-01 12:35:05 | 541682 | golf central is practically unwatchable this m... | 4.041955e+09 | 2318 |
# check whether the index is sorted
df_main.index.is_monotonic_increasing
False
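The check returns False even though we just sorted the index. The reason is the NaT (missing) timestamps: `sort_index` places them at the end, and pandas treats any index containing missing values as non-monotonic. A minimal sketch of the behavior:

```python
import pandas as pd

# A sorted DatetimeIndex that contains NaT is not considered monotonic.
idx = pd.DatetimeIndex(["2021-02-01", "2021-02-02", pd.NaT])
print(idx.is_monotonic_increasing)           # False, because of the NaT entry
print(idx.dropna().is_monotonic_increasing)  # True once NaT is removed
```

This is confirmed below: the rows with missing index values are exactly the ones that break monotonicity.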
# look at column information
df_main.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 604206 entries, 2021-02-01 12:14:04 to NaT
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype
---  ------      --------------   -----
 0   id          604206 non-null  int64
 1   text        604206 non-null  object
 2   user_id     604206 non-null  float64
 3   cluster_id  604206 non-null  int64
dtypes: float64(1), int64(2), object(1)
memory usage: 23.0+ MB
# looking for missing values
df_main.isna().sum()
id            0
text          0
user_id       0
cluster_id    0
dtype: int64
# looking for duplicates
df_main.duplicated().sum()
0
# data with missing index
df_main.index.isna().sum()
12794
# percentage of data with missing index
df_main.index.isna().sum() / len(df_main) * 100
2.1174897303237636
# looking at missing data with missing index
df_main[df_main.index.isna()].head()
| timestamp | id | text | user_id | cluster_id |
|---|---|---|---|---|
| NaT | 80822 | josh in to close it out.“filthy”in the 6th. ba... | 2.968974e+09 | 26 |
| NaT | 80828 | it passed but it’s beautiful | 1.549382e+07 | 26 |
| NaT | 80830 | pure sunshine flowerreport toronto | 1.292381e+07 | 26 |
| NaT | 80832 | you beat me to it. so tired of the population ... | 1.649766e+07 | 26 |
| NaT | 80838 | i’m on a boat toronto island time | 1.924758e+09 | 26 |
Overall, the main dataset is fairly clean. We loaded the data as a time series and parsed the dates. This dataframe contains most of the features we need to train our model. The missing data is limited to timestamps; every other column is fully populated. Since the missing timestamps make up only about 2% of the dataset and cannot be imputed, we will drop these rows.
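The drop described above amounts to filtering out rows whose index is NaT (the notebook does this via `dropna` after the merge below). A minimal sketch on a toy frame mirroring the situation:

```python
import pandas as pd

# Toy frame mirroring the situation: one timestamp failed to parse (NaT).
df_toy = pd.DataFrame(
    {"text": ["a", "b", "c"]},
    index=pd.DatetimeIndex(["2021-02-01 12:14:04", None, "2021-02-01 12:19:19"]),
)

# Keep only the rows whose index is a valid timestamp.
df_toy = df_toy[df_toy.index.notna()]
print(len(df_toy))  # 2
```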
# load cluster data
df_cl = pd.read_csv('data/Clusters_Coordinates.csv')
# look at dataset
df_cl.head()
|   | cluster_id | lat | lng |
|---|---|---|---|
| 0 | 2 | 34.020789 | -118.411907 |
| 1 | 3 | 31.168893 | -100.076888 |
| 2 | 8 | 29.838495 | -95.446487 |
| 3 | 9 | 40.780709 | -73.968542 |
| 4 | 16 | 40.004866 | -75.117998 |
# looking at column info
df_cl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 850 entries, 0 to 849
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   cluster_id  850 non-null    int64
 1   lat         850 non-null    float64
 2   lng         850 non-null    float64
dtypes: float64(2), int64(1)
memory usage: 20.0 KB
# looking for missing values
df_cl.isna().sum()
cluster_id    0
lat           0
lng           0
dtype: int64
The cluster coordinates dataframe contains the cluster id along with the latitude and longitude of each cluster. It is clean, with no missing values. We will merge the two dataframes before conducting EDA.
# visual of data before feature engineering
df_main.head()
| timestamp | id | text | user_id | cluster_id |
|---|---|---|---|---|
| 2021-02-01 12:14:04 | 262304 | he was accused of being a thief when entering ... | 8.301517e+08 | 345 |
| 2021-02-01 12:19:19 | 480231 | can you blame him they are delicious | 1.817364e+08 | 1775 |
| 2021-02-01 12:21:52 | 241532 | damn sholl is a new month ain’t it | 2.383825e+08 | 288 |
| 2021-02-01 12:28:10 | 324986 | ain’t felt this way inna min | 4.598618e+08 | 603 |
| 2021-02-01 12:35:05 | 541682 | golf central is practically unwatchable this m... | 4.041955e+09 | 2318 |
# Making timestamp features
def make_features(data):
    data['year'] = data.index.year
    data['month'] = data.index.month
    data['week'] = data.index.isocalendar().week
    data['day'] = data.index.day
    data['day_of_week'] = data.index.day_of_week
    data['day_of_year'] = data.index.day_of_year
    data['hour'] = data.index.hour
    data['minute'] = data.index.minute
    data['second'] = data.index.second
make_features(df_main)
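As a quick sanity check, a self-contained restatement of `make_features` applied to a one-row frame with the first timestamp from the table above confirms the derived values (2021-02-01 is a Monday, the 32nd day of the year, in ISO week 5):

```python
import pandas as pd

# Self-contained restatement of make_features, checked on one known date.
def make_features(data):
    data['year'] = data.index.year
    data['month'] = data.index.month
    data['week'] = data.index.isocalendar().week
    data['day'] = data.index.day
    data['day_of_week'] = data.index.day_of_week
    data['day_of_year'] = data.index.day_of_year
    data['hour'] = data.index.hour
    data['minute'] = data.index.minute
    data['second'] = data.index.second

toy = pd.DataFrame({'x': [1]}, index=pd.DatetimeIndex(['2021-02-01 12:14:04']))
make_features(toy)
print(toy[['year', 'month', 'week', 'day_of_week', 'day_of_year']].iloc[0].tolist())
# [2021, 2, 5, 0, 32] -- matching the first row of the table below
```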
# new features added
df_main.head()
| timestamp | id | text | user_id | cluster_id | year | month | week | day | day_of_week | day_of_year | hour | minute | second |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2021-02-01 12:14:04 | 262304 | he was accused of being a thief when entering ... | 8.301517e+08 | 345 | 2021.0 | 2.0 | 5 | 1.0 | 0.0 | 32.0 | 12.0 | 14.0 | 4.0 |
| 2021-02-01 12:19:19 | 480231 | can you blame him they are delicious | 1.817364e+08 | 1775 | 2021.0 | 2.0 | 5 | 1.0 | 0.0 | 32.0 | 12.0 | 19.0 | 19.0 |
| 2021-02-01 12:21:52 | 241532 | damn sholl is a new month ain’t it | 2.383825e+08 | 288 | 2021.0 | 2.0 | 5 | 1.0 | 0.0 | 32.0 | 12.0 | 21.0 | 52.0 |
| 2021-02-01 12:28:10 | 324986 | ain’t felt this way inna min | 4.598618e+08 | 603 | 2021.0 | 2.0 | 5 | 1.0 | 0.0 | 32.0 | 12.0 | 28.0 | 10.0 |
| 2021-02-01 12:35:05 | 541682 | golf central is practically unwatchable this m... | 4.041955e+09 | 2318 | 2021.0 | 2.0 | 5 | 1.0 | 0.0 | 32.0 | 12.0 | 35.0 | 5.0 |
# merge main and cluster coordinates
df = df_main.merge(df_cl, on='cluster_id', sort=True)
# new merged dataset
df.head()
|   | id | text | user_id | cluster_id | year | month | week | day | day_of_week | day_of_year | hour | minute | second | lat | lng |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4080 | i moved all day yesterday slept for 2 hours la... | 2.688113e+07 | 2 | 2021.0 | 2.0 | 5 | 2.0 | 1.0 | 33.0 | 8.0 | 23.0 | 23.0 | 34.020789 | -118.411907 |
| 1 | 10213 | in the vortex 16x20 acrylicpainting | 8.080801e+07 | 2 | 2021.0 | 2.0 | 5 | 2.0 | 1.0 | 33.0 | 8.0 | 28.0 | 8.0 | 34.020789 | -118.411907 |
| 2 | 12514 | pes21 is free lmfao | 1.286578e+18 | 2 | 2021.0 | 2.0 | 5 | 2.0 | 1.0 | 33.0 | 8.0 | 28.0 | 19.0 | 34.020789 | -118.411907 |
| 3 | 10843 | ha yeah there’s no way of ever really knowing ... | 1.759526e+07 | 2 | 2021.0 | 2.0 | 5 | 2.0 | 1.0 | 33.0 | 8.0 | 31.0 | 49.0 | 34.020789 | -118.411907 |
| 4 | 16316 | . shut the fuck up you fake ass nerd. | 1.081937e+08 | 2 | 2021.0 | 2.0 | 5 | 2.0 | 1.0 | 33.0 | 8.0 | 33.0 | 42.0 | 34.020789 | -118.411907 |
# drop missing values
df.dropna(inplace=True)
# missing values
df.isna().sum()
id             0
text           0
user_id        0
cluster_id     0
year           0
month          0
week           0
day            0
day_of_week    0
day_of_year    0
hour           0
minute         0
second         0
lat            0
lng            0
dtype: int64
# shape of dataset
df.shape
(591412, 15)
We merged the datasets on cluster id and then dropped all rows with missing timestamps. We are left with 591,412 rows, just under 98% of the original dataset.
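One detail of the merge worth noting: `DataFrame.merge` defaults to an inner join, so any tweet whose `cluster_id` has no entry in the coordinates table is silently dropped. A toy sketch of this behavior:

```python
import pandas as pd

# Toy versions of the two frames: one tweet's cluster has no coordinates.
tweets = pd.DataFrame({"text": ["a", "b", "c"], "cluster_id": [2, 3, 99]})
coords = pd.DataFrame({"cluster_id": [2, 3],
                       "lat": [34.0, 31.2], "lng": [-118.4, -100.1]})

# Default how='inner': rows without a matching cluster_id are dropped.
merged = tweets.merge(coords, on="cluster_id", sort=True)
print(len(merged))  # 2 -- the cluster_id=99 tweet has no coordinates
```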
# summary statistics
df.describe()
|   | id | user_id | cluster_id | year | month | week | day | day_of_week | day_of_year | hour | minute | second | lat | lng |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 591412.000000 | 5.914120e+05 | 591412.000000 | 591412.0 | 591412.000000 | 591412.000000 | 591412.000000 | 591412.000000 | 591412.000000 | 591412.000000 | 591412.000000 | 591412.000000 | 591412.000000 | 591412.000000 |
| mean | 303001.854719 | 3.368510e+17 | 886.351569 | 2021.0 | 6.766528 | 27.205567 | 15.575266 | 2.928047 | 190.367018 | 11.887419 | 29.533308 | 29.478575 | 34.433504 | -92.672279 |
| std | 174405.411743 | 5.209262e+17 | 895.607222 | 0.0 | 2.434100 | 10.535272 | 8.408276 | 2.018068 | 73.657062 | 7.792265 | 17.330192 | 17.359607 | 7.609369 | 16.107334 |
| min | 0.000000 | 6.070000e+02 | 2.000000 | 2021.0 | 2.000000 | 5.000000 | 1.000000 | 0.000000 | 32.000000 | 0.000000 | 0.000000 | 0.000000 | 13.189300 | -158.069430 |
| 25% | 152142.750000 | 1.448996e+08 | 105.000000 | 2021.0 | 8.000000 | 31.000000 | 8.000000 | 1.000000 | 214.000000 | 4.000000 | 15.000000 | 14.000000 | 30.215828 | -100.076888 |
| 50% | 304025.500000 | 8.397085e+08 | 502.000000 | 2021.0 | 8.000000 | 32.000000 | 16.000000 | 3.000000 | 224.000000 | 14.000000 | 30.000000 | 30.000000 | 34.182160 | -90.079239 |
| 75% | 453573.250000 | 8.556889e+17 | 1567.000000 | 2021.0 | 8.000000 | 33.000000 | 23.000000 | 5.000000 | 233.000000 | 19.000000 | 45.000000 | 44.000000 | 40.624274 | -79.850739 |
| max | 604205.000000 | 1.431375e+18 | 2996.000000 | 2021.0 | 9.000000 | 35.000000 | 31.000000 | 6.000000 | 244.000000 | 23.000000 | 59.000000 | 59.000000 | 61.235042 | -52.829425 |
# number of unique users
df.user_id.nunique()
41143
# number of unique clusters
df.cluster_id.nunique()
850
# number of unique latitudes
df.lat.nunique()
811
# number of unique longitudes
df.lng.nunique()
828
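Since the goal is to predict coordinates, prediction error is naturally measured as a great-circle distance, which is why `haversine_distances` appears in the imports. It expects `[lat, lng]` pairs in radians and returns distances in radians, which must be scaled by the Earth's radius. A sketch using two of the cluster centroids from the coordinates table above (the Los Angeles and New York clusters):

```python
import numpy as np
from sklearn.metrics.pairwise import haversine_distances

EARTH_RADIUS_KM = 6371.0

# Two cluster centroids from the coordinates table: Los Angeles and New York.
la = np.radians([34.020789, -118.411907])
ny = np.radians([40.780709, -73.968542])

# haversine_distances takes [lat, lng] in radians; the result is in radians.
dist_km = haversine_distances([la, ny])[0, 1] * EARTH_RADIUS_KM
print(round(dist_km))  # roughly 3950 km between the two centroids
```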
# skew of numeric columns
df.skew(numeric_only=True)
id            -0.013441
user_id        0.998650
cluster_id     0.794588
year           0.000000
month         -1.439251
week          -1.408017
day           -0.000922
day_of_week    0.061273
day_of_year   -1.408617
hour          -0.182448
minute        -0.005914
second        -0.006647
lat           -0.597998
lng           -0.664069
dtype: float64
# correlation of data
px.imshow(df.corr(numeric_only=True), text_auto=True, aspect='auto')
# distributions of columns
columns = ['month', 'week', 'day', 'day_of_week', 'day_of_year', 'hour', 'minute', 'second']
for column in columns:
    px.histogram(df[column], title='Distribution of ' + column.upper().replace('_', ' '), labels={'value': column.replace('_', ' ')}).show()