Missing Data Imputation


Why are there missing values? Consider a survey, for example a depression survey:

  1. Respondents hesitate to put down the information
  2. Survey responses are not always valid
  3. Men tend to hide their salary
  4. Women tend to hide their age
  5. People may have died, leaving NaN values

What are the different types of Missing Data?

  1. Missing Completely at Random (MCAR):
A variable is missing completely at random (MCAR) if the probability of being missing is the same for all observations. When data is MCAR, there is no relationship between the missingness and any other values, observed or missing, within the dataset. In other words, the missing data points are a random subset of the data; nothing systematic makes some values more likely to be missing than others.

e.g. (assuming the full Titanic dataset is loaded as df):
df[df['Embarked'].isnull()]
     PassengerId  Survived  Pclass                                        Name     Sex   Age  SibSp  Parch  Ticket  Fare Cabin Embarked
61            62         1       1                         Icard, Miss. Amelie  female  38.0      0      0  113572  80.0   B28      NaN
829          830         1       1  Stone, Mrs. George Nelson (Martha Evelyn)  female  62.0      0      0  113572  80.0   B28      NaN
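
As a quick sanity check (a minimal sketch, assuming the full Titanic dataframe is loaded as df), the fraction of missing Embarked values is tiny, consistent with MCAR:

# only 2 of the 891 rows have Embarked missing (~0.2%)
df['Embarked'].isnull().mean()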

2. Missing Not At Random (MNAR):
Systematic missing values: there is a relationship between the missingness and other values, observed or missing, within the dataset. In the Titanic data, for example, the Cabin value is much more likely to be missing for passengers who did not survive.

#replacing the null values with 1 and remaining values with 0
import numpy as np
df['cabin_null']=np.where(df['Cabin'].isnull(),1,0)

##finding the percentage of null values
df['cabin_null'].mean()
0.7710437710437711


Grouping by Survived shows the relationship: the missing-cabin rate is much higher among passengers who did not survive, i.e. people with cabin details recorded survived more often.
df.groupby(['Survived'])['cabin_null'].mean()

Survived
0    0.876138
1    0.602339
Name: cabin_null, dtype: float64

3. Missing At Random (MAR)

Here the probability of a value being missing depends on other, observed variables. For example, men tend to hide their salary and women tend to hide their age, so the missingness of salary and age depends on the observed sex variable. A quick check for this kind of pattern is sketched below.
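
A minimal sketch of how one might probe for MAR in the Titanic data, assuming the full dataframe (including the 'Sex' and 'Age' columns) is loaded as df. If the Age missingness rate differs clearly between the sexes, Age is unlikely to be MCAR and may be MAR:

# fraction of missing Age values within each Sex group;
# a clear gap between the groups points towards MAR rather than MCAR
df.groupby('Sex')['Age'].apply(lambda s: s.isnull().mean())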

Methods to handle the missing values:

1. Mean/Median/Mode replacement
2. Random Sample Imputation
3. Capturing NAN values with a new feature
4. End of Distribution imputation
5. Arbitrary imputation
6. Frequent categories imputation

1. Mean/Median/Mode Imputation

When should we apply it?

Mean/median imputation assumes that the data are missing completely at random (MCAR). We replace the NaNs with the mean or median of the variable (or, for a categorical variable, with the mode, i.e. its most frequent value).

e.g.:

import pandas as pd
df=pd.read_csv('titanic.csv',usecols=['Age','Fare','Survived'])
df.head()
   Survived   Age     Fare
0         0  22.0   7.2500
1         1  38.0  71.2833
2         1  26.0   7.9250
3         1  35.0  53.1000
4         0  35.0   8.0500

##the percentage of missing values
df.isnull().mean()
Survived    0.000000
Age         0.198653
Fare        0.000000
dtype: float64

# create a function that imputes missing values with the median in a new column
def impute_nan(df,variable,median):
    df[variable+"_median"]=df[variable].fillna(median)

median= df.Age.median()
impute_nan(df,'Age',median)
df.head()
   Survived   Age     Fare  Age_median
0         0  22.0   7.2500        22.0
1         1  38.0  71.2833        38.0
2         1  26.0   7.9250        26.0
3         1  35.0  53.1000        35.0
4         0  35.0   8.0500        35.0

# compare the standard deviation before and after median imputation
print(df['Age'].std())
print(df['Age_median'].std())
14.526497332334042
13.019696550973201

# plot the KDE of Age before and after median imputation
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
df['Age'].plot(kind='kde', ax=ax)
df.Age_median.plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')

Advantages And Disadvantages of Mean/Median Imputation

Advantages

  1. Easy to implement (and median imputation is robust to outliers)
  2. Faster way to obtain the complete dataset

Disadvantages
  1. It changes or distorts the original variance of the variable
  2. It distorts the covariance/correlation with the remaining variables (see the sketch below)
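
A minimal sketch of how these distortions could be measured on the dataframe built above (Age with NaNs, Age_median imputed):

# the variance shrinks after median imputation, because every NaN gets the same value
print(df['Age'].var(), df['Age_median'].var())

# the covariance with other variables is dampened too:
# the imputed rows contribute no co-variation with Fare
print(df[['Age', 'Fare']].cov())
print(df[['Age_median', 'Fare']].cov())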

2. Random Sample Imputation

Aim: 

Random sample imputation consists of taking random observations from the non-missing values of the variable and using them to replace the NaN values.

When should it be used? 

It assumes that the data are missing completely at random(MCAR)


def impute_nan(df,variable,median):
    df[variable+"_median"]=df[variable].fillna(median)
    df[variable+"_random"]=df[variable]
    ##It will have the random sample to fill the na
    random_sample=df[variable].dropna().sample(df[variable].isnull().sum(),random_state=0)
    ## pandas needs matching indices in order to assign the sampled values to the missing rows
    random_sample.index=df[df[variable].isnull()].index
    df.loc[df[variable].isnull(),variable+'_random']=random_sample
median = df.Age.median()
impute_nan(df,'Age',median)
df.head()
   Survived   Age     Fare  Age_median  Age_random
0         0  22.0   7.2500        22.0        22.0
1         1  38.0  71.2833        38.0        38.0
2         1  26.0   7.9250        26.0        26.0
3         1  35.0  53.1000        35.0        35.0
4         0  35.0   8.0500        35.0        35.0


import matplotlib.pyplot as plt
%matplotlib inline
fig = plt.figure()
ax = fig.add_subplot(111)
df['Age'].plot(kind='kde', ax=ax)
df.Age_median.plot(kind='kde', ax=ax, color='red')
df.Age_random.plot(kind='kde', ax=ax, color='green')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')


Advantages
  1. Easy to implement
  2. There is less distortion in variance
Disadvantages
  1. Randomness will not work well in every situation (see the sketch below)
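
One concrete issue is reproducibility, sketched below: without a fixed random_state, two runs of the same sampling step fill the NaNs with different values (the impute_nan function above avoids this by passing random_state=0).

# two draws without a fixed seed will usually disagree,
# so the imputed dataset changes from run to run
s1 = df['Age'].dropna().sample(5)
s2 = df['Age'].dropna().sample(5)
print(s1.values)
print(s2.values)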

3. Capturing NaN values with a new feature

This works well when the data are not missing completely at random, i.e. when the fact that a value is missing is itself informative.

import numpy as np
df['Age_NAN']=np.where(df['Age'].isnull(),1,0)
df.head()

   Survived   Age     Fare  Age_NAN
0         0  22.0   7.2500        0
1         1  38.0  71.2833        0
2         1  26.0   7.9250        0
3         1  35.0  53.1000        0
4         0  35.0   8.0500        0

df['Age'].fillna(df.Age.median(),inplace=True)

Advantages

  1. Easy to implement
  2. Captures the importance of missing values
Disadvantages
  1. Creates additional features (curse of dimensionality)


4. End of Distribution Imputation

Here the NaNs are replaced with a value at the far end of the variable's distribution, commonly the mean plus three standard deviations:

extreme=df.Age.mean()+3*df.Age.std()
import seaborn as sns
sns.boxplot(x='Age',data=df)
def impute_nan(df,variable,median,extreme):
    df[variable+"_end_distribution"]=df[variable].fillna(extreme)
    df[variable].fillna(median,inplace=True)

impute_nan(df,'Age',df.Age.median(),extreme)
df['Age'].hist(bins=50)

df['Age_end_distribution'].hist(bins=50)

sns.boxplot(x='Age_end_distribution',data=df)


5. Arbitrary Value Imputation

This technique became popular through Kaggle competitions. It consists of replacing NaN with an arbitrary value, typically one at the far end of or outside the variable's usual range, e.g. 0 or 100 as in the function below.


def impute_nan(df,variable):
    df[variable+'_zero']=df[variable].fillna(0)
    df[variable+'_hundred']=df[variable].fillna(100)

df['Age'].hist(bins=50)
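
A minimal usage sketch for the function above (a hypothetical call, assuming a dataframe in which Age still contains NaNs):

# fill the Age NaNs with two different arbitrary values and compare the distributions
impute_nan(df,'Age')
df['Age_zero'].hist(bins=50)
df['Age_hundred'].hist(bins=50)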

Advantages

  • Easy to implement
  • Captures the importance of missingness, if there is one

Disadvantages

  • Distorts the original distribution of the variable
  • If missingness is not important, it may mask the predictive power of the original variable by distorting its distribution
  • Hard to decide which arbitrary value to use

How To Handle Categorical Missing Values:

1. Frequent Category Imputation

df=pd.read_csv('loan.csv', usecols=['BsmtQual','FireplaceQu','GarageType','SalePrice'])
df.shape
(1460, 4)

df.isnull().sum()

BsmtQual        37
FireplaceQu    690
GarageType      81
SalePrice        0
dtype: int64

df.isnull().mean().sort_values(ascending=True)
SalePrice      0.000000
BsmtQual       0.025342
GarageType     0.055479
FireplaceQu    0.472603
dtype: float64

Compute the frequency of each category for every feature:


df['BsmtQual'].value_counts().plot.bar()

df.groupby(['BsmtQual'])['BsmtQual'].count().sort_values(ascending=False).plot.bar()


df['GarageType'].value_counts().plot.bar()

df['FireplaceQu'].value_counts().plot.bar()

df['GarageType'].value_counts().index[0]
'Attchd'

df['GarageType'].mode()[0]
'Attchd'

def impute_nan(df,variable):
    most_frequent_category=df[variable].mode()[0]
    df[variable].fillna(most_frequent_category,inplace=True)

for feature in ['BsmtQual','FireplaceQu','GarageType']:
    impute_nan(df,feature)

df.isnull().mean()
BsmtQual       0.0
FireplaceQu    0.0
GarageType     0.0
SalePrice      0.0
dtype: float64

Advantages

  1. Easy to implement
  2. Faster way to implement
Disadvantages
  1. Since we are using the most frequent label, it may become over-represented if there are many NaNs
  2. It distorts the relationship of the most frequent label with the other variables (see the sketch below)
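
A minimal sketch of this over-representation effect, assuming a fresh copy of loan.csv is re-read so that the NaNs are still present:

import pandas as pd

raw = pd.read_csv('loan.csv', usecols=['FireplaceQu'])
before = raw['FireplaceQu'].value_counts(normalize=True)
after = raw['FireplaceQu'].fillna(raw['FireplaceQu'].mode()[0]).value_counts(normalize=True)

# the modal category's share jumps, because ~47% of the rows were NaN
print(before.head(3))
print(after.head(3))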

2. Adding a variable to capture NaN
import numpy as np
df['BsmtQual_Var']=np.where(df['BsmtQual'].isnull(),1,0)
df.head()

  BsmtQual FireplaceQu GarageType  SalePrice  BsmtQual_Var
0       Gd         NaN     Attchd     208500             0
1       Gd          TA     Attchd     181500             0
2       Gd          TA     Attchd     223500             0
3       TA          Gd     Detchd     140000             0
4       Gd          TA     Attchd     250000             0

frequent=df['BsmtQual'].mode()[0]
frequent
'TA'

df['BsmtQual'].fillna(frequent,inplace=True)
df.head()
  BsmtQual FireplaceQu GarageType  SalePrice  BsmtQual_Var
0       Gd         NaN     Attchd     208500             0
1       Gd          TA     Attchd     181500             0
2       Gd          TA     Attchd     223500             0
3       TA          Gd     Detchd     140000             0
4       Gd          TA     Attchd     250000             0

df['FireplaceQu_Var']=np.where(df['FireplaceQu'].isnull(),1,0)
frequent=df['FireplaceQu'].mode()[0]
df['FireplaceQu'].fillna(frequent,inplace=True)
df.head()
  BsmtQual FireplaceQu GarageType  SalePrice  BsmtQual_Var  FireplaceQu_Var
0       Gd          Gd     Attchd     208500             0                1
1       Gd          TA     Attchd     181500             0                0
2       Gd          TA     Attchd     223500             0                0
3       TA          Gd     Detchd     140000             0                0
4       Gd          TA     Attchd     250000             0                0

If the variable has many different frequent categories (so imputing with a single mode is not appropriate), we can simply replace the NaN values with a new category such as "Missing":


def impute_nan(df,variable):
    df[variable+"newvar"]=np.where(df[variable].isnull(),"Missing",df[variable])

for feature in ['BsmtQual','FireplaceQu','GarageType']:
    impute_nan(df,feature)

df.head()
   SalePrice BsmtQualnewvar FireplaceQunewvar GarageTypenewvar
0     208500             Gd           Missing           Attchd
1     181500             Gd                TA           Attchd
2     223500             Gd                TA           Attchd
3     140000             TA                Gd           Detchd
4     250000             Gd                TA           Attchd
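
A quick check (on the new columns created above) that the NaNs have become their own explicit category:

# 'Missing' now appears as a label alongside the original categories
df['FireplaceQunewvar'].value_counts()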
