My First Kaggle Kernel
Costa Rican Household Poverty Prediction
Objective
Many social programs have a hard time making sure the right people are given enough aid. It’s especially tricky when a program focuses on the poorest segment of the population, because the world’s poorest typically can’t provide the income and expense records needed to prove that they qualify. The objective is to build an algorithm that correctly identifies the economic condition of a household so that proper resources can be made available to it. The dataset used for this problem contains characteristics of Costa Rican households.
Problem Description
This is a supervised multi-class classification machine learning problem:
- Supervised: labels are provided for the training data
- Multi-class classification: the label is a discrete value with 4 classes
The dataset can be downloaded from here. For more details, check the Kaggle description here.
%matplotlib inline
import os
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from matplotlib import pyplot as plt
# To remove the limit on the number of rows displayed by pandas
pd.set_option("display.max_rows", None)
# Read csv files in pandas dataframe
testDf = pd.read_csv('../input/test.csv')
trainDf = pd.read_csv('../input/train.csv')
print("Training dataset basic information")
print("Rows: {}".format(len(trainDf)))
print("Columns: {}".format(len(trainDf.columns)))
trainDf.head()
Training dataset basic information
Rows: 9557
Columns: 143
 | Id | v2a1 | hacdor | rooms | hacapo | v14a | refrig | v18q | v18q1 | r4h1 | ... | SQBescolari | SQBage | SQBhogar_total | SQBedjefe | SQBhogar_nin | SQBovercrowding | SQBdependency | SQBmeaned | agesq | Target
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | ID_279628684 | 190000.0 | 0 | 3 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 100 | 1849 | 1 | 100 | 0 | 1.000000 | 0.0 | 100.0 | 1849 | 4 |
1 | ID_f29eb3ddd | 135000.0 | 0 | 4 | 0 | 1 | 1 | 1 | 1.0 | 0 | ... | 144 | 4489 | 1 | 144 | 0 | 1.000000 | 64.0 | 144.0 | 4489 | 4 |
2 | ID_68de51c94 | NaN | 0 | 8 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 121 | 8464 | 1 | 0 | 0 | 0.250000 | 64.0 | 121.0 | 8464 | 4 |
3 | ID_d671db89c | 180000.0 | 0 | 5 | 0 | 1 | 1 | 1 | 1.0 | 0 | ... | 81 | 289 | 16 | 121 | 4 | 1.777778 | 1.0 | 121.0 | 289 | 4 |
4 | ID_d56d6f5f5 | 180000.0 | 0 | 5 | 0 | 1 | 1 | 1 | 1.0 | 0 | ... | 121 | 1369 | 16 | 121 | 4 | 1.777778 | 1.0 | 121.0 | 1369 | 4 |
5 rows × 143 columns
print("Test dataset basic information")
print("Rows: {}".format(len(testDf)))
print("Columns: {}".format(len(testDf.columns)))
testDf.head()
Test dataset basic information
Rows: 23856
Columns: 142
 | Id | v2a1 | hacdor | rooms | hacapo | v14a | refrig | v18q | v18q1 | r4h1 | ... | age | SQBescolari | SQBage | SQBhogar_total | SQBedjefe | SQBhogar_nin | SQBovercrowding | SQBdependency | SQBmeaned | agesq
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | ID_2f6873615 | NaN | 0 | 5 | 0 | 1 | 1 | 0 | NaN | 1 | ... | 4 | 0 | 16 | 9 | 0 | 1 | 2.25 | 0.25 | 272.25 | 16 |
1 | ID_1c78846d2 | NaN | 0 | 5 | 0 | 1 | 1 | 0 | NaN | 1 | ... | 41 | 256 | 1681 | 9 | 0 | 1 | 2.25 | 0.25 | 272.25 | 1681 |
2 | ID_e5442cf6a | NaN | 0 | 5 | 0 | 1 | 1 | 0 | NaN | 1 | ... | 41 | 289 | 1681 | 9 | 0 | 1 | 2.25 | 0.25 | 272.25 | 1681 |
3 | ID_a8db26a79 | NaN | 0 | 14 | 0 | 1 | 1 | 1 | 1.0 | 0 | ... | 59 | 256 | 3481 | 1 | 256 | 0 | 1.00 | 0.00 | 256.00 | 3481 |
4 | ID_a62966799 | 175000.0 | 0 | 4 | 0 | 1 | 1 | 1 | 1.0 | 0 | ... | 18 | 121 | 324 | 1 | 0 | 1 | 0.25 | 64.00 | NaN | 324 |
5 rows × 142 columns
# Add a null Target column to test, then combine train and test for joint preprocessing
testDf['Target'] = np.nan
data = pd.concat([trainDf, testDf], ignore_index=True)
1. Find and fix missing feature values
# 1. Find missing values in training and test dataset
def findColumnsWithNan(df):
    cols = df.columns[df.isna().any()]
    print("Number of columns with Nan: {}".format(len(cols)))
    print("Column names: {}".format(cols))
    print("-" * 80)
    for col in cols:
        print("Column: [{}] missing {} values.".format(col, df[col].isna().sum()))
print("Analysis of training dataset...")
findColumnsWithNan(trainDf)
Analysis of training dataset...
Number of columns with Nan: 5
Column names: Index(['v2a1', 'v18q1', 'rez_esc', 'meaneduc', 'SQBmeaned'], dtype='object')
--------------------------------------------------------------------------------
Column: [v2a1] missing 6860 values.
Column: [v18q1] missing 7342 values.
Column: [rez_esc] missing 7928 values.
Column: [meaneduc] missing 5 values.
Column: [SQBmeaned] missing 5 values.
print()
print("Analysis of test dataset...")
findColumnsWithNan(testDf)
Analysis of test dataset...
Number of columns with Nan: 6
Column names: Index(['v2a1', 'v18q1', 'rez_esc', 'meaneduc', 'SQBmeaned', 'Target'], dtype='object')
--------------------------------------------------------------------------------
Column: [v2a1] missing 17403 values.
Column: [v18q1] missing 18126 values.
Column: [rez_esc] missing 19653 values.
Column: [meaneduc] missing 31 values.
Column: [SQBmeaned] missing 31 values.
Column: [Target] missing 23856 values.
1.1. Fix missing values of v2a1
This feature is the monthly rent payment. To find out what NaN means, v2a1 is compared with the ‘tipovivi’ features, which indicate whether the house is rented, completely paid off, etc.:
tipovivi1 =1 own and fully paid house
tipovivi2 =1 own, paying in installments
tipovivi3 =1 rented
tipovivi4 =1 precarious
tipovivi5 =1 other (assigned, borrowed)
Wherever tipovivi1 = 1, ‘v2a1’ is replaced with 0: if the household fully owns the house, it pays no rent. All other missing values are left to be imputed later.
data.loc[(data['tipovivi1'] == 1) & (data['v2a1'].isna()), 'v2a1'] = 0
print("Missing values after replacing: {}".format(len(data.loc[data['v2a1'].isna()])))
Missing values after replacing: 3419
1.2. Fix missing values of v18q1
It means the number of tablets the household owns. After careful analysis of household members, it can be concluded that NaN means the household does not own a tablet. We replace NaN with 0.
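As a quick sanity check (a minimal sketch reusing the combined data frame; ‘v18q’ is the owns-a-tablet indicator), one can verify that rows with a missing ‘v18q1’ indeed report v18q == 0:
# Sketch: rows with missing v18q1 should all have v18q == 0 (no tablet owned)
print(data.loc[data['v18q1'].isna(), 'v18q'].value_counts(dropna=False))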
data.loc[data['v18q1'].isna(), 'v18q1'] = 0
print("Missing values after replacing: {}".format(len(data.loc[data['v18q1'].isna()])))
Missing values after replacing: 0
1.3 Fix missing values of rez_esc
It means the number of years an individual is behind in school. From the discussions on Kaggle, it can be concluded that this value is defined only for people whose age is between 7 and 19, so the missing values outside this range can be set to 0. The age of an individual is in the column appropriately named ‘age’.
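As a quick check (a sketch using the combined data frame), the ages of the rows where ‘rez_esc’ is actually populated can be inspected; they should fall roughly in the 7–19 range:
# Sketch: inspect the ages for which rez_esc is populated
print(data.loc[data['rez_esc'].notna(), 'age'].describe())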
data.loc[(data['age'] < 7) & (data['rez_esc'].isna()), 'rez_esc'] = 0
data.loc[(data['age'] > 19) & (data['rez_esc'].isna()), 'rez_esc'] = 0
print("Missing values after replacing: {}".format(len(data.loc[data['rez_esc'].isna()])))
Missing values after replacing: 1161
1.4 Fix missing values of meaneduc
It means the average years of education for adults (18+). This implies that if the age of an individual is less than 18 and the value is NaN, then we can replace it with 0. Other NaNs are left to be imputed.
data.loc[(data['age'] < 18) & (data['meaneduc'].isna()), 'meaneduc'] = 0
print("Missing values after replacing: {}".format(len(data.loc[data['meaneduc'].isna()])))
Missing values after replacing: 35
1.5 Fix missing values of SQBmeaned
It means the square of the mean years of education of adults (>=18) in the household. It is highly correlated with the ‘meaneduc’ feature and adds no new information, so this feature is dropped from the dataset.
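A quick correlation check (a sketch) supports this: since SQBmeaned is just the square of ‘meaneduc’, the two are almost perfectly correlated.
# Sketch: SQBmeaned is the square of meaneduc, so the correlation is close to 1
print(data[['meaneduc', 'SQBmeaned']].corr())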
data.drop('SQBmeaned', inplace=True, axis=1)
print("Total number of columns left: {}".format(len(data.columns)))
Total number of columns left: 142
2. Find outliers
The inter-quartile range (IQR) is used to identify outliers in the dataset. The IQR is the difference between the 75th and 25th percentiles of the data; like the standard deviation, it is a measure of dispersion. During this analysis, some features were found to have inconsistent values, with integers and boolean strings (yes/no) mixed together. These need to be fixed and are the main focus of the next step.
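As a toy illustration of the 1.5 × IQR rule (made-up numbers, not taken from the dataset):
# Toy example: compute IQR-based outlier bounds for a small made-up sample
x = np.array([1, 2, 3, 4, 100])
q25, q75 = np.percentile(x, 25), np.percentile(x, 75)
iqr = q75 - q25
print(q25 - 1.5 * iqr, q75 + 1.5 * iqr)  # values outside these bounds are flagged as outliers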
for cols in data.columns[1:]:
    if cols in ['idhogar', 'dependency', 'edjefe', 'edjefa']:
        continue
    percentile75 = np.percentile(data[cols].fillna(0), 75)
    percentile25 = np.percentile(data[cols].fillna(0), 25)
    threshold = (percentile75 - percentile25) * 1.5
    lower, upper = (percentile25 - threshold), (percentile75 + threshold)
    # identify outliers: values below the lower bound or above the upper bound
    outliers = data.loc[(data[cols] < lower) | (data[cols] > upper)]
    if len(outliers) > 0:
        print('Feature: {}. Identified outliers: {}'.format(cols, len(outliers)))
3. Find inconsistent values
When looking for outliers, the following three features (‘dependency’, ‘edjefe’, ‘edjefa’) were found to have inconsistent values. To take care of this, ‘yes’ is replaced with 1 and ‘no’ with 0. To make sure every value in these features has the same data type, they are converted to float.
for col in ['dependency', 'edjefe', 'edjefa']:
    data.loc[data[col] == 'yes', col] = 1.0
    data.loc[data[col] == 'no', col] = 0.0
    data[col] = pd.to_numeric(data[col])
4. Remove correlated features
Highly correlated feature pairs are redundant, so one feature from each such pair is selected to be removed.
corrMat = data.corr()
plt.figure(figsize=(30, 10))
sns.heatmap(corrMat.iloc[:10, :10])
def featuresToDrop(corrMatrix):
    """
    To remove correlated features, used this gem of a code from here:
    https://chrisalbon.com/machine_learning/feature_selection/drop_highly_correlated_features
    """
    # Select upper triangle of correlation matrix
    upper = corrMatrix.where(np.triu(np.ones(corrMatrix.shape), k=1).astype(bool))
    # Find index of feature columns with correlation greater than 0.95
    return [column for column in upper.columns if any(upper[column] > 0.95)]
toDrop = featuresToDrop(corrMat)
data.drop(toDrop, inplace=True, axis=1)
print("Correlated features which are dropped: {}".format(toDrop))
Correlated features which are dropped: ['tamhog', 'hhsize', 'hogar_total', 'SQBage', 'SQBhogar_total', 'agesq']
5. Feature engineering
Definition: “Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.”
5.1 Add aggregated features (min, max, std, sum)
features = list(data.drop(columns = ['Id', 'idhogar', 'Target']).columns)
aggDf = data.drop(columns='Target').groupby('idhogar').agg(['min', 'max', 'sum', 'count', 'std'])
# Rename the aggregated columns to <feature>-<stat>
new_col = []
for c in aggDf.columns.levels[0]:
    for stat in aggDf.columns.levels[1]:
        new_col.append('{}-{}'.format(c, stat))
aggDf.columns = new_col
toDrop = featuresToDrop(aggDf.corr())
aggDf.drop(toDrop, inplace=True, axis=1)
data = data.merge(aggDf, on='idhogar', how ='left')
print('Training feature shape: ', data.shape)
Training feature shape: (33413, 525)
5.2 Add per-capita features per household
data['phones-per-capita'] = data['qmobilephone'] / data['tamviv']
data['tablets-per-capita'] = data['v18q1'] / data['tamviv']
data['rooms-per-capita'] = data['rooms'] / data['tamviv']
data['rent-per-capita'] = data['v2a1'] / data['tamviv']
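A quick inspection of the new columns (a sketch) helps verify they look reasonable:
# Sketch: summary statistics of the newly engineered per-capita features
print(data[['phones-per-capita', 'tablets-per-capita', 'rooms-per-capita', 'rent-per-capita']].describe())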
6. Feature imputing and scaling
Scaling transforms the data so that the values fit within a fixed range, say 0-100 or 0-1. We will use a min-max scaler to map each feature to the range 0-1, and a median imputer to fill the remaining missing values.
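For intuition, MinMaxScaler maps each feature x to (x - min) / (max - min); a tiny sketch with made-up numbers (reusing the imports from the top of the notebook):
# Toy example: a single feature with values 1, 3, 5 is mapped to 0.0, 0.5, 1.0
demo = np.array([[1.0], [3.0], [5.0]])
print(MinMaxScaler().fit_transform(demo))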
# Labels for training
trainTarget = np.array(list(data[data['Target'].notnull()]['Target'].astype(np.uint8)))
submission = data.loc[data['Target'].isnull(), 'Id'].to_frame()
# Extract the training data
trainData = data[data['Target'].notnull()].drop(columns = ['Id', 'idhogar', 'Target'])
testData = data[data['Target'].isnull()].drop(columns = ['Id', 'idhogar', 'Target'])
# Impute training and test data
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
trainData = imputer.fit_transform(trainData)
testData = imputer.transform(testData)
# Scale training and test data
scaler = MinMaxScaler()
trainData = scaler.fit_transform(trainData)
testData = scaler.transform(testData)
7. Machine Learning model
7.1 Random Forest Classifier
import lightgbm
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold
model = RandomForestClassifier(n_estimators=100, random_state=10, n_jobs=-1)
scorer = make_scorer(f1_score, greater_is_better=True, average = 'macro')
# 10 fold cross validation
cv_score = cross_val_score(model, trainData, trainTarget, cv=10, scoring=scorer)
print('10 Fold Cross Validation F1 Score = {} with std = {}'.format(round(cv_score.mean(), 4), round(cv_score.std(), 4)))
10 Fold Cross Validation F1 Score = 0.335 with std = 0.0356
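For reference, the competition is scored with the macro-averaged F1, i.e. the unweighted mean of the per-class F1 scores, which is why the scorer above uses average='macro'. A tiny sketch with made-up labels:
# Toy example: macro F1 weighs every class equally, regardless of its frequency
y_true = [1, 1, 2, 3, 4, 4]
y_pred = [1, 2, 2, 3, 4, 1]
print(f1_score(y_true, y_pred, average='macro'))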
7.2 Random Forest Classifier with Stratified K-Fold
This extends the random forest classifier used above by using Stratified K-Fold cross-validation, since the target labels in this problem are imbalanced.
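A quick look at the label distribution (a sketch using the trainTarget array built in section 6) shows this imbalance:
# Sketch: relative frequency of each poverty class in the training labels
print(pd.Series(trainTarget).value_counts(normalize=True))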
kfold = 5
kf = StratifiedKFold(n_splits=kfold, shuffle=True)
model = RandomForestClassifier(n_estimators=100, random_state=10, n_jobs=-1)
predicts_result = []
for idx, (train_index, test_index) in enumerate(kf.split(trainData, trainTarget)):
    print("Fold: {}".format(idx + 1))
    X_train, X_val = trainData[train_index], trainData[test_index]
    y_train, y_val = trainTarget[train_index], trainTarget[test_index]
    model.fit(X_train, y_train)
    predicts_result.append(model.predict(testData))
submission['Target'] = np.array(predicts_result).mean(axis=0).round().astype(int)
submission.to_csv('submission.csv', index=False)
Fold: 1
Fold: 2
Fold: 3
Fold: 4
Fold: 5
7.3 LightGBM Classifier
Finally, the same stratified K-fold scheme is used to train a gradient-boosted tree model from LightGBM.
model = lightgbm.LGBMClassifier(max_depth=-1, learning_rate=0.1, objective='multiclass',
                                random_state=None, silent=True, metric='None',
                                n_jobs=4, n_estimators=5000, class_weight='balanced',
                                colsample_bytree=0.93, min_child_samples=95, num_leaves=14, subsample=0.96)
kfold = 5
kf = StratifiedKFold(n_splits=kfold, shuffle=True)
predicts_result = []
for idx, (train_index, test_index) in enumerate(kf.split(trainData, trainTarget)):
    print("Fold: {}".format(idx + 1))
    X_train, X_val = trainData[train_index], trainData[test_index]
    y_train, y_val = trainTarget[train_index], trainTarget[test_index]
    model.fit(X_train, y_train, verbose=100)
    predicts_result.append(model.predict(testData))
submission['Target'] = np.array(predicts_result).mean(axis=0).round().astype(int)
submission.to_csv('submission.csv', index=False)
print("Completed!")
print("Completed!")
Fold: 1
Fold: 2
Fold: 3
Fold: 4
Fold: 5
Completed!
This kernel produces a score of 0.411 (macro F1), which would have placed around rank 214-215 on the Kaggle leaderboard for this competition. The Kaggle notebook can be found here: https://www.kaggle.com/akverma/costa-rican-poverty-analysis