Missing Data Imputation


Why are there missing values? Consider a survey, for example a depression survey:

  1. Respondents hesitate to put down the information
  2. Survey responses are not always valid
  3. Men tend to hide their salary
  4. Women tend to hide their age
  5. People may have died, leaving NaN values

What are the different types of Missing Data?

  1. Missing Completely at Random (MCAR):
A variable is missing completely at random (MCAR) if the probability of being missing is the same for all observations. When data is MCAR, there is no relationship between the missingness and any other values, observed or missing, within the dataset. In other words, the missing data points are a random subset of the data; nothing systematic makes some values more likely to be missing than others.

e.g. (assuming the full Titanic dataset is loaded as df):
df[df['Embarked'].isnull()]
     PassengerId  Survived  Pclass                                        Name     Sex   Age  SibSp  Parch  Ticket  Fare Cabin Embarked
61            62         1       1                         Icard, Miss. Amelie  female  38.0      0      0  113572  80.0   B28      NaN
829          830         1       1  Stone, Mrs. George Nelson (Martha Evelyn)  female  62.0      0      0  113572  80.0   B28      NaN
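
As a quick sanity check (a minimal sketch, assuming the full Titanic dataframe is loaded as df), the fraction of missing Embarked values is tiny, consistent with MCAR:

# only 2 of the 891 rows have Embarked missing (~0.2%)
df['Embarked'].isnull().mean()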

2. Missing Not At Random (MNAR):
Systematic missing values: there is a relationship between the missingness and other values, observed or missing, within the dataset. In the Titanic data, for example, the Cabin value is much more likely to be missing for passengers who did not survive.

#replacing the null values with 1 and remaining values with 0
import numpy as np
df['cabin_null']=np.where(df['Cabin'].isnull(),1,0)

##finding the percentage of null values
df['cabin_null'].mean()
0.7710437710437711


Grouping by Survived shows the relationship: the missing-cabin rate is much higher among passengers who did not survive, i.e. people with cabin details recorded survived more often.
df.groupby(['Survived'])['cabin_null'].mean()

Survived
0    0.876138
1    0.602339
Name: cabin_null, dtype: float64

3. Missing At Random (MAR)

Here the probability of a value being missing depends on other, observed variables. For example, men tend to hide their salary and women tend to hide their age, so the missingness of salary and age depends on the observed sex variable. A quick check for this kind of pattern is sketched below.
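
A minimal sketch of how one might probe for MAR in the Titanic data, assuming the full dataframe (including the 'Sex' and 'Age' columns) is loaded as df. If the Age missingness rate differs clearly between the sexes, Age is unlikely to be MCAR and may be MAR:

# fraction of missing Age values within each Sex group;
# a clear gap between the groups points towards MAR rather than MCAR
df.groupby('Sex')['Age'].apply(lambda s: s.isnull().mean())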

Methods to handle the missing values:

1. Mean/Median/Mode replacement
2. Random Sample Imputation
3. Capturing NAN values with a new feature
4. End of Distribution imputation
5. Arbitrary imputation
6. Frequent categories imputation

1. Mean/Median/Mode Imputation

When should we apply it?

Mean/median imputation assumes that the data are missing completely at random (MCAR). We replace the NaNs with the mean or median of the variable (or, for a categorical variable, with the mode, i.e. its most frequent value).

e.g.:

import pandas as pd
df=pd.read_csv('titanic.csv',usecols=['Age','Fare','Survived'])
df.head()
   Survived   Age     Fare
0         0  22.0   7.2500
1         1  38.0  71.2833
2         1  26.0   7.9250
3         1  35.0  53.1000
4         0  35.0   8.0500

##the percentage of missing values
df.isnull().mean()
Survived    0.000000
Age         0.198653
Fare        0.000000
dtype: float64

# create a function that imputes missing values with the median in a new column
def impute_nan(df,variable,median):
    df[variable+"_median"]=df[variable].fillna(median)

median= df.Age.median()
impute_nan(df,'Age',median)
df.head()
   Survived   Age     Fare  Age_median
0         0  22.0   7.2500        22.0
1         1  38.0  71.2833        38.0
2         1  26.0   7.9250        26.0
3         1  35.0  53.1000        35.0
4         0  35.0   8.0500        35.0

# compare the standard deviation before and after median imputation
print(df['Age'].std())
print(df['Age_median'].std())
14.526497332334042
13.019696550973201

# plot the KDE of Age before and after median imputation
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
df['Age'].plot(kind='kde', ax=ax)
df.Age_median.plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')

Advantages And Disadvantages of Mean/Median Imputation

Advantages

  1. Easy to implement (and median imputation is robust to outliers)
  2. Faster way to obtain the complete dataset

Disadvantages
  1. It changes or distorts the original variance of the variable
  2. It distorts the covariance/correlation with the remaining variables (see the sketch below)
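
A minimal sketch of how these distortions could be measured on the dataframe built above (Age with NaNs, Age_median imputed):

# the variance shrinks after median imputation, because every NaN gets the same value
print(df['Age'].var(), df['Age_median'].var())

# the covariance with other variables is dampened too:
# the imputed rows contribute no co-variation with Fare
print(df[['Age', 'Fare']].cov())
print(df[['Age_median', 'Fare']].cov())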

2. Random Sample Imputation

Aim: 

Random sample imputation consists of taking random observations from the non-missing values of the variable and using them to replace the NaN values.

When should it be used? 

It assumes that the data are missing completely at random(MCAR)


def impute_nan(df,variable,median):
    df[variable+"_median"]=df[variable].fillna(median)
    df[variable+"_random"]=df[variable]
    ##It will have the random sample to fill the na
    random_sample=df[variable].dropna().sample(df[variable].isnull().sum(),random_state=0)
    ## pandas needs matching indices in order to assign the sampled values to the missing rows
    random_sample.index=df[df[variable].isnull()].index
    df.loc[df[variable].isnull(),variable+'_random']=random_sample
median = df.Age.median()
impute_nan(df,'Age',median)
df.head()
   Survived   Age     Fare  Age_median  Age_random
0         0  22.0   7.2500        22.0        22.0
1         1  38.0  71.2833        38.0        38.0
2         1  26.0   7.9250        26.0        26.0
3         1  35.0  53.1000        35.0        35.0
4         0  35.0   8.0500        35.0        35.0


import matplotlib.pyplot as plt
%matplotlib inline
fig = plt.figure()
ax = fig.add_subplot(111)
df['Age'].plot(kind='kde', ax=ax)
df.Age_median.plot(kind='kde', ax=ax, color='red')
df.Age_random.plot(kind='kde', ax=ax, color='green')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')


Advantages
  1. Easy to implement
  2. There is less distortion in variance
Disadvantages
  1. Randomness will not work well in every situation (see the sketch below)
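
One concrete issue is reproducibility, sketched below: without a fixed random_state, two runs of the same sampling step fill the NaNs with different values (the impute_nan function above avoids this by passing random_state=0).

# two draws without a fixed seed will usually disagree,
# so the imputed dataset changes from run to run
s1 = df['Age'].dropna().sample(5)
s2 = df['Age'].dropna().sample(5)
print(s1.values)
print(s2.values)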

3. Capturing NaN values with a new feature

This works well when the data are not missing completely at random, i.e. when the fact that a value is missing is itself informative.

import numpy as np
df['Age_NAN']=np.where(df['Age'].isnull(),1,0)
df.head()

   Survived   Age     Fare  Age_NAN
0         0  22.0   7.2500        0
1         1  38.0  71.2833        0
2         1  26.0   7.9250        0
3         1  35.0  53.1000        0
4         0  35.0   8.0500        0

df['Age'].fillna(df.Age.median(),inplace=True)

Advantages

  1. Easy to implement
  2. Captures the importance of missing values
Disadvantages
  1. Creates additional features (curse of dimensionality)


4. End of Distribution Imputation

Here the NaNs are replaced with a value at the far end of the variable's distribution, commonly the mean plus three standard deviations:

extreme=df.Age.mean()+3*df.Age.std()
import seaborn as sns
sns.boxplot(x='Age',data=df)
def impute_nan(df,variable,median,extreme):
    df[variable+"_end_distribution"]=df[variable].fillna(extreme)
    df[variable].fillna(median,inplace=True)

impute_nan(df,'Age',df.Age.median(),extreme)
df['Age'].hist(bins=50)

df['Age_end_distribution'].hist(bins=50)

sns.boxplot(x='Age_end_distribution',data=df)


5. Arbitrary Value Imputation

This technique became popular through Kaggle competitions. It consists of replacing NaN with an arbitrary value, typically one at the far end of or outside the variable's usual range, e.g. 0 or 100 as in the function below.


def impute_nan(df,variable):
    df[variable+'_zero']=df[variable].fillna(0)
    df[variable+'_hundred']=df[variable].fillna(100)

df['Age'].hist(bins=50)
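
A minimal usage sketch for the function above (a hypothetical call, assuming a dataframe in which Age still contains NaNs):

# fill the Age NaNs with two different arbitrary values and compare the distributions
impute_nan(df,'Age')
df['Age_zero'].hist(bins=50)
df['Age_hundred'].hist(bins=50)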

Advantages

  • Easy to implement
  • Captures the importance of missingness, if there is one

Disadvantages

  • Distorts the original distribution of the variable
  • If missingness is not important, it may mask the predictive power of the original variable by distorting its distribution
  • Hard to decide which arbitrary value to use

How To Handle Categorical Missing Values:

1. Frequent Category Imputation

df=pd.read_csv('loan.csv', usecols=['BsmtQual','FireplaceQu','GarageType','SalePrice'])
df.shape
(1460, 4)

df.isnull().sum()

BsmtQual        37
FireplaceQu    690
GarageType      81
SalePrice        0
dtype: int64

df.isnull().mean().sort_values(ascending=True)
SalePrice      0.000000
BsmtQual       0.025342
GarageType     0.055479
FireplaceQu    0.472603
dtype: float64

Compute the frequency of each category for every feature:


df['BsmtQual'].value_counts().plot.bar()

df.groupby(['BsmtQual'])['BsmtQual'].count().sort_values(ascending=False).plot.bar()


df['GarageType'].value_counts().plot.bar()

df['FireplaceQu'].value_counts().plot.bar()

df['GarageType'].value_counts().index[0]
'Attchd'

df['GarageType'].mode()[0]
'Attchd'

def impute_nan(df,variable):
    most_frequent_category=df[variable].mode()[0]
    df[variable].fillna(most_frequent_category,inplace=True)

for feature in ['BsmtQual','FireplaceQu','GarageType']:
    impute_nan(df,feature)

df.isnull().mean()
BsmtQual       0.0
FireplaceQu    0.0
GarageType     0.0
SalePrice      0.0
dtype: float64

Advantages

  1. Easy to implement
  2. Faster way to implement
Disadvantages
  1. Since we are using the most frequent label, it may become over-represented if there are many NaNs
  2. It distorts the relationship of the most frequent label with the other variables (see the sketch below)
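
A minimal sketch of this over-representation effect, assuming a fresh copy of loan.csv is re-read so that the NaNs are still present:

import pandas as pd

raw = pd.read_csv('loan.csv', usecols=['FireplaceQu'])
before = raw['FireplaceQu'].value_counts(normalize=True)
after = raw['FireplaceQu'].fillna(raw['FireplaceQu'].mode()[0]).value_counts(normalize=True)

# the modal category's share jumps, because ~47% of the rows were NaN
print(before.head(3))
print(after.head(3))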

2. Adding a variable to capture NaN
import numpy as np
df['BsmtQual_Var']=np.where(df['BsmtQual'].isnull(),1,0)
df.head()

  BsmtQual FireplaceQu GarageType  SalePrice  BsmtQual_Var
0       Gd         NaN     Attchd     208500             0
1       Gd          TA     Attchd     181500             0
2       Gd          TA     Attchd     223500             0
3       TA          Gd     Detchd     140000             0
4       Gd          TA     Attchd     250000             0

frequent=df['BsmtQual'].mode()[0]
frequent
'TA'

df['BsmtQual'].fillna(frequent,inplace=True)
df.head()
  BsmtQual FireplaceQu GarageType  SalePrice  BsmtQual_Var
0       Gd         NaN     Attchd     208500             0
1       Gd          TA     Attchd     181500             0
2       Gd          TA     Attchd     223500             0
3       TA          Gd     Detchd     140000             0
4       Gd          TA     Attchd     250000             0

df['FireplaceQu_Var']=np.where(df['FireplaceQu'].isnull(),1,0)
frequent=df['FireplaceQu'].mode()[0]
df['FireplaceQu'].fillna(frequent,inplace=True)
df.head()
  BsmtQual FireplaceQu GarageType  SalePrice  BsmtQual_Var  FireplaceQu_Var
0       Gd          Gd     Attchd     208500             0                1
1       Gd          TA     Attchd     181500             0                0
2       Gd          TA     Attchd     223500             0                0
3       TA          Gd     Detchd     140000             0                0
4       Gd          TA     Attchd     250000             0                0

If the variable has many different frequent categories (so imputing with a single mode is not appropriate), we can simply replace the NaN values with a new category such as "Missing":


def impute_nan(df,variable):
    df[variable+"newvar"]=np.where(df[variable].isnull(),"Missing",df[variable])

for feature in ['BsmtQual','FireplaceQu','GarageType']:
    impute_nan(df,feature)

df.head()
   SalePrice BsmtQualnewvar FireplaceQunewvar GarageTypenewvar
0     208500             Gd           Missing           Attchd
1     181500             Gd                TA           Attchd
2     223500             Gd                TA           Attchd
3     140000             TA                Gd           Detchd
4     250000             Gd                TA           Attchd
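
A quick check (on the new columns created above) that the NaNs have become their own explicit category:

# 'Missing' now appears as a label alongside the original categories
df['FireplaceQunewvar'].value_counts()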
