Missing Data Imputation
Data Preprocessing
Why are there missing values? Consider a survey, e.g. a depression survey:
- Respondents may hesitate to put down the information
- Survey responses may not be reliable
- Men may not want to reveal their salary
- Women may not want to reveal their age
- People may have died, leaving NaN entries
What are the different types of Missing Data?
1. Missing Completely At Random (MCAR): the fact that a value is missing has no relationship with any other values, observed or missing, in the dataset.
e.g.

```python
import pandas as pd

df = pd.read_csv('titanic.csv')

# the two passengers with a missing Embarked value
df[df['Embarked'].isnull()]
```

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61 | 62 | 1 | 1 | Icard, Miss. Amelie | female | 38.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN |
| 829 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN |
2. Missing Data Not At Random (MNAR):
Systematic missing values: there is some relationship between the fact that the data are missing and other values, observed or missing, within the dataset.
```python
import numpy as np

# replace null values in Cabin with 1 and non-null values with 0
df['cabin_null'] = np.where(df['Cabin'].isnull(), 1, 0)

# find the percentage of null values
df['cabin_null'].mean()
```
0.7710437710437711
Grouping the flag by the target shows that cabin details are missing far more often for passengers who did not survive, i.e. passengers with recorded cabin details were more likely to have survived:
df.groupby(['Survived'])['cabin_null'].mean()
Survived
0    0.876138
1    0.602339
Name: cabin_null, dtype: float64
3. Missing At Random (MAR)
Here the missingness depends on another observed variable. For example, men may hide their salary and women may hide their age, so whether a value is missing depends on gender. A quick check of this idea is sketched below.
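One way to spot this pattern is to compare the fraction of missing values per group. This is a minimal sketch on a small, made-up survey DataFrame (the 'Sex' and 'Salary' columns here are purely illustrative and not part of the Titanic data):

```python
import pandas as pd

# toy survey data: 'Sex' and 'Salary' are hypothetical, illustrative columns
survey = pd.DataFrame({
    'Sex': ['male', 'male', 'female', 'female', 'male'],
    'Salary': [None, None, 50000, 62000, 48000],
})

# fraction of missing salaries per group; a large gap between groups suggests MAR
print(survey.groupby('Sex')['Salary'].apply(lambda s: s.isnull().mean()))
```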
Methods to handle the missing values:
1. Mean/Median/Mode replacement
2. Random Sample Imputation
3. Capturing NaN values with a new feature
4. End of Distribution imputation
5. Arbitrary value imputation
6. Frequent categories imputation
1. Mean/Median/Mode Imputation
When should we apply it?
Mean/median imputation assumes that the data are missing completely at random (MCAR). We replace the NaN with the mean or median of the variable (or, for categorical variables, with the most frequent category, i.e. mode imputation).
e.g:
```python
df = pd.read_csv('titanic.csv', usecols=['Age', 'Fare', 'Survived'])
df.head()
```
| | Survived | Age | Fare |
|---|---|---|---|
| 0 | 0 | 22.0 | 7.2500 |
| 1 | 1 | 38.0 | 71.2833 |
| 2 | 1 | 26.0 | 7.9250 |
| 3 | 1 | 35.0 | 53.1000 |
| 4 | 0 | 35.0 | 8.0500 |
```python
# the percentage of missing values
df.isnull().mean()
```
Survived 0.000000
Age 0.198653
Fare 0.000000
dtype: float64
```python
# function to impute missing values with the median in a new column
def impute_nan(df, variable, median):
    df[variable + "_median"] = df[variable].fillna(median)

median = df.Age.median()
impute_nan(df, 'Age', median)
df.head()
```
| | Survived | Age | Fare | Age_median |
|---|---|---|---|---|
| 0 | 0 | 22.0 | 7.2500 | 22.0 |
| 1 | 1 | 38.0 | 71.2833 | 38.0 |
| 2 | 1 | 26.0 | 7.9250 | 26.0 |
| 3 | 1 | 35.0 | 53.1000 | 35.0 |
| 4 | 0 | 35.0 | 8.0500 | 35.0 |
```python
# check the standard deviation before and after imputation
print(df['Age'].std())
print(df['Age_median'].std())
```

14.526497332334042
13.019696550973201
```python
import matplotlib.pyplot as plt
%matplotlib inline

# plot the density of the original Age against the median-imputed version
fig = plt.figure()
ax = fig.add_subplot(111)
df['Age'].plot(kind='kde', ax=ax)
df.Age_median.plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
```
Advantages And Disadvantages of Mean/Median Imputation
Advantages
- Easy to implement (the median is robust to outliers)
- Fast way to obtain a complete dataset
Disadvantages
- Changes or distorts the original variance
- Impacts the correlation with other variables (a quick check is sketched after this list)
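Both effects can be checked directly on the columns created above. A minimal sketch (the exact numbers depend on the dataset):

```python
# the variance shrinks because many identical median values were injected
print(df['Age'].var(), df['Age_median'].var())

# the correlation with another variable (here Fare) also shifts after imputation
print(df['Age'].corr(df['Fare']), df['Age_median'].corr(df['Fare']))
```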
2. Random Sample Imputation
Aim:
Random sample imputation consists of taking random observations from the dataset and using these observations to replace the NaN values.
When should it be used?
It assumes that the data are missing completely at random (MCAR).
```python
def impute_nan(df, variable, median):
    df[variable + "_median"] = df[variable].fillna(median)
    df[variable + "_random"] = df[variable]
    # draw a random sample, of the same size as the number of NaNs, to fill the NaNs
    random_sample = df[variable].dropna().sample(df[variable].isnull().sum(), random_state=0)
    # pandas needs matching indexes in order to assign the values
    random_sample.index = df[df[variable].isnull()].index
    df.loc[df[variable].isnull(), variable + '_random'] = random_sample

median = df.Age.median()
impute_nan(df, 'Age', median)
df.head()
```
| | Survived | Age | Fare | Age_median | Age_random |
|---|---|---|---|---|---|
| 0 | 0 | 22.0 | 7.2500 | 22.0 | 22.0 |
| 1 | 1 | 38.0 | 71.2833 | 38.0 | 38.0 |
| 2 | 1 | 26.0 | 7.9250 | 26.0 | 26.0 |
| 3 | 1 | 35.0 | 53.1000 | 35.0 | 35.0 |
| 4 | 0 | 35.0 | 8.0500 | 35.0 | 35.0 |
```python
import matplotlib.pyplot as plt
%matplotlib inline

# compare the original, median-imputed and random-sample-imputed distributions
fig = plt.figure()
ax = fig.add_subplot(111)
df['Age'].plot(kind='kde', ax=ax)
df.Age_median.plot(kind='kde', ax=ax, color='red')
df.Age_random.plot(kind='kde', ax=ax, color='green')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
```
Advantages
- Easy to implement
- There is less distortion in variance (a comparison is sketched after this list)
Disadvantages
- Randomness will not work well in every situation
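To verify the variance claim, compare the spread of the original column with the two imputed versions created above. A minimal sketch:

```python
print(df['Age'].std())         # original (NaNs ignored)
print(df['Age_median'].std())  # median imputation shrinks the spread
print(df['Age_random'].std())  # random sample imputation preserves it more closely
```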
3. Capturing NaN Values with a New Feature
It works well if the data are not missing completely at random
```python
import numpy as np

# add a binary flag that is 1 where Age is missing and 0 otherwise
df['Age_NAN'] = np.where(df['Age'].isnull(), 1, 0)
df.head()
```
| | Survived | Age | Fare | Age_NAN |
|---|---|---|---|---|
| 0 | 0 | 22.0 | 7.2500 | 0 |
| 1 | 1 | 38.0 | 71.2833 | 0 |
| 2 | 1 | 26.0 | 7.9250 | 0 |
| 3 | 1 | 35.0 | 53.1000 | 0 |
| 4 | 0 | 35.0 | 8.0500 | 0 |
df['Age'].fillna(df.Age.median(),inplace=True)
Advantages
- Easy to implement
- Captures the importance of missing values (a quick check is sketched after this list)
Disadvantages
- Creates additional features (curse of dimensionality if done for many variables)
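One way to see what "importance" means here, using the Age_NAN flag created above: compare the survival rate of rows with and without a recorded age. A minimal sketch:

```python
# mean survival rate for rows where Age was present (0) vs missing (1)
print(df.groupby('Age_NAN')['Survived'].mean())
```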
4. End of Distribution Imputation
```python
import seaborn as sns

# pick a value at the far end of the distribution: mean + 3 standard deviations
extreme = df.Age.mean() + 3 * df.Age.std()

sns.boxplot(x='Age', data=df)
```
```python
def impute_nan(df, variable, median, extreme):
    # new column filled with the extreme value; original column filled with the median
    df[variable + "_end_distribution"] = df[variable].fillna(extreme)
    df[variable].fillna(median, inplace=True)

impute_nan(df, 'Age', df.Age.median(), extreme)
```
df['Age'].hist(bins=50)
df['Age_end_distribution'].hist(bins=50)
sns.boxplot(x='Age_end_distribution', data=df)
5. Arbitrary Value Imputation
This technique comes from Kaggle competitions. It consists of replacing NaN with an arbitrary value.
```python
def impute_nan(df, variable):
    # fill with two different arbitrary values to compare their effect
    df[variable + '_zero'] = df[variable].fillna(0)
    df[variable + '_hundred'] = df[variable].fillna(100)
```
df['Age'].hist(bins=50)
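The function above is only defined, never called. A minimal usage sketch, assuming we re-read the raw file so that 'Age' still contains NaNs (the earlier steps have already filled the existing column):

```python
# re-read the raw data so Age still has missing values
df = pd.read_csv('titanic.csv', usecols=['Age', 'Fare', 'Survived'])

impute_nan(df, 'Age')

# compare the distributions produced by the two arbitrary values
df['Age_zero'].hist(bins=50)
df['Age_hundred'].hist(bins=50)
```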
Advantages
- Easy to implement
- Captures the importance of missingness, if there is one
Disadvantages
- Distorts the original distribution of the variable
- If missingness is not important, it may mask the predictive power of the original variable by distorting its distribution
- Hard to decide which value to use
How To Handle Categorical Missing Values:
1. Frequent Category Imputation
df=pd.read_csv('loan.csv', usecols=['BsmtQual','FireplaceQu','GarageType','SalePrice'])
df.shape
(1460, 4)
df.isnull().sum()
BsmtQual 37
FireplaceQu 690
GarageType 81
SalePrice 0
dtype: int64
df.isnull().mean().sort_values(ascending=True)
SalePrice 0.000000
BsmtQual 0.025342
GarageType 0.055479
FireplaceQu 0.472603
dtype: float64
Compute the frequency of each category for every feature:
df['BsmtQual'].value_counts().plot.bar()
df.groupby(['BsmtQual'])['BsmtQual'].count().sort_values(ascending=False).plot.bar()
df['GarageType'].value_counts().plot.bar()
df['FireplaceQu'].value_counts().plot.bar()
df['GarageType'].mode()[0]
'Attchd'
```python
def impute_nan(df, variable):
    # replace NaN with the most frequent category of the variable
    most_frequent_category = df[variable].mode()[0]
    df[variable].fillna(most_frequent_category, inplace=True)

for feature in ['BsmtQual', 'FireplaceQu', 'GarageType']:
    impute_nan(df, feature)
```
df.isnull().mean()
BsmtQual       0.0
FireplaceQu    0.0
GarageType     0.0
SalePrice      0.0
dtype: float64
Advantages
- Easy to implement
- Fast way to obtain a complete dataset
Disadvantages
- Since we are using the most frequent label, it may over-represent that label if there are many NaNs (a quick check is sketched after this list)
- It distorts the relation of the most frequent label with the other variables
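To see the over-representation concretely, compare the category shares before and after imputation for FireplaceQu, which is about 47% missing. A minimal sketch, assuming it is run on a freshly loaded copy of loan.csv (the column in df has already been filled above):

```python
# fresh copy so FireplaceQu still contains its NaNs
tmp = pd.read_csv('loan.csv', usecols=['FireplaceQu'])

print(tmp['FireplaceQu'].value_counts(normalize=True))  # shares before imputation
tmp['FireplaceQu'].fillna(tmp['FireplaceQu'].mode()[0], inplace=True)
print(tmp['FireplaceQu'].value_counts(normalize=True))  # the mode now absorbs the ~47% of missing rows
```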
2. Adding a Variable to Capture NaN
```python
import numpy as np

# flag rows where BsmtQual is missing
df['BsmtQual_Var'] = np.where(df['BsmtQual'].isnull(), 1, 0)
df.head()
```
| | BsmtQual | FireplaceQu | GarageType | SalePrice | BsmtQual_Var |
|---|---|---|---|---|---|
| 0 | Gd | NaN | Attchd | 208500 | 0 |
| 1 | Gd | TA | Attchd | 181500 | 0 |
| 2 | Gd | TA | Attchd | 223500 | 0 |
| 3 | TA | Gd | Detchd | 140000 | 0 |
| 4 | Gd | TA | Attchd | 250000 | 0 |
df['BsmtQual'].mode()[0]
'TA'
```python
# store the most frequent category and use it to fill the NaNs
frequent = df['BsmtQual'].mode()[0]
df['BsmtQual'].fillna(frequent, inplace=True)
df.head()
```
| | BsmtQual | FireplaceQu | GarageType | SalePrice | BsmtQual_Var |
|---|---|---|---|---|---|
| 0 | Gd | NaN | Attchd | 208500 | 0 |
| 1 | Gd | TA | Attchd | 181500 | 0 |
| 2 | Gd | TA | Attchd | 223500 | 0 |
| 3 | TA | Gd | Detchd | 140000 | 0 |
| 4 | Gd | TA | Attchd | 250000 | 0 |
```python
# repeat for FireplaceQu: capture the missingness flag, then fill with the mode
df['FireplaceQu_Var'] = np.where(df['FireplaceQu'].isnull(), 1, 0)
frequent = df['FireplaceQu'].mode()[0]
df['FireplaceQu'].fillna(frequent, inplace=True)
```
df.head()
| | BsmtQual | FireplaceQu | GarageType | SalePrice | BsmtQual_Var | FireplaceQu_Var |
|---|---|---|---|---|---|---|
| 0 | Gd | Gd | Attchd | 208500 | 0 | 1 |
| 1 | Gd | TA | Attchd | 181500 | 0 | 0 |
| 2 | Gd | TA | Attchd | 223500 | 0 | 0 |
| 3 | TA | Gd | Detchd | 140000 | 0 | 0 |
| 4 | Gd | TA | Attchd | 250000 | 0 | 0 |
Alternatively, when a variable has many missing values (or no single dominant category), we can simply replace NaN with a brand-new category such as "Missing":
def impute_nan(df,variable): df[variable+"newvar"]=np.where(df[variable].isnull(),"Missing",df[variable])
for feature in ['BsmtQual','FireplaceQu','GarageType']: impute_nan(df,feature)
df.head()
| | SalePrice | BsmtQualnewvar | FireplaceQunewvar | GarageTypenewvar |
|---|---|---|---|---|
| 0 | 208500 | Gd | Missing | Attchd |
| 1 | 181500 | Gd | TA | Attchd |
| 2 | 223500 | Gd | TA | Attchd |
| 3 | 140000 | TA | Gd | Detchd |
| 4 | 250000 | Gd | TA | Attchd |