Mock Interview sessions

For the 15th November session


1. Data Profiling

When we need to do some EDA on a given dataset, data profiling is used. Data profiling helps us calculate the following:

Essentials: type, unique values, missing values

Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range

Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness

Most frequent values

Histogram 

Correlation highlighting of highly correlated variables (Spearman & Pearson matrices)

import pandas as pd
import pandas_profiling

# Load the dataset and generate an automated EDA report covering the
# statistics listed above.
df = pd.read_csv('train.csv')
pandas_profiling.ProfileReport(df)

2. Data Visualization


Heatmap

Presents data in tabular form with color coding, which makes it easy to cross-examine multivariate relationships at a glance.
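A minimal sketch of a correlation heatmap with seaborn, assuming the same hypothetical train.csv used above with numeric columns:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('train.csv')

# Pearson correlation matrix rendered as a color-coded heatmap, which
# makes highly correlated variable pairs easy to spot.
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()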


3. Standardization/Normalization

Transformation of Features

Why Is Transformation of Features Required?

  1. Linear Regression --- Gradient Descent --- Global Minima
  2. Algorithms like KNN, K-Means, Hierarchical Clustering --- Euclidean Distance

Every point has a magnitude and a direction (it is a vector).

Deep learning techniques also need standardization/scaling:

  1. ANN ---> Global Minima, Gradient Descent
  2. CNN
  3. RNN

Image pixel values range from 0-255 and are scaled before training.

Types Of Transformation

  1. Normalization And Standardization
  2. Scaling to Minimum And Maximum values
  3. Scaling To Median And Quantiles
  4. Gaussian Transformation: Logarithmic Transformation, Reciprocal Transformation, Square Root Transformation, Exponential Transformation, Box-Cox Transformation (see the sketch below)
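A minimal sketch of the Gaussian transformations from item 4, applied to a hypothetical right-skewed feature (the +1 shift and the 1/1.2 exponent are arbitrary illustrative choices):

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical right-skewed feature for illustration.
x = pd.Series(np.random.exponential(scale=2.0, size=1000))

log_t = np.log(x + 1)                  # logarithmic (shifted to avoid log(0))
reciprocal_t = 1 / (x + 1)             # reciprocal
sqrt_t = np.sqrt(x)                    # square root
exp_t = x ** (1 / 1.2)                 # "exponential" (fractional power) transformation
boxcox_t, lam = stats.boxcox(x + 1)    # Box-Cox; input must be strictly positive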

Different scalers:

  1. Standard Scaler (the most used scaler in ML; standardizes a feature to zero mean and unit variance)
  2. Min-Max Scaler (used in deep learning, e.g. to transform pixel values)
  3. Robust Scaler

The Robust Scaler scales the feature using the median and quantiles: it consists of subtracting the median from all the observations and then dividing by the interquartile range, which is the difference between the 75th and 25th quantiles:

IQR = 75th quantile - 25th quantile

X_scaled = (X - median(X)) / IQR
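A minimal sketch of the three scalers with scikit-learn, on a hypothetical single-column frame; the formulas in the comments match the definitions above:

import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Hypothetical feature with one outlier (90) for illustration.
df = pd.DataFrame({'age': [20, 22, 25, 30, 35, 40, 90]})

std = StandardScaler().fit_transform(df)       # (x - mean) / std
minmax = MinMaxScaler().fit_transform(df)      # (x - min) / (max - min), into [0, 1]
robust = RobustScaler().fit_transform(df)      # (x - median) / IQR, outlier-resistant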

4. Handling Null Values

Why are there missing values? Consider a survey, e.g. a depression survey:

  1. People hesitate to put down the information
  2. Survey responses are not always valid
  3. Men may not disclose their salary
  4. Women may not disclose their age
  5. People may have died ---- NaN

What are the different types of Missing Data?


Missing Completely at Random (MCAR):
A variable is missing completely at random (MCAR) if the probability of being missing is the same for all the observations. When data is MCAR, there is absolutely no relationship between the missingness and any other values, observed or missing, within the dataset. In other words, the missing data points are a random subset of the data: there is nothing systematic going on that makes some data more likely to be missing than others.

Missing Data Not At Random (MNAR):
Systematic missing values. There is a relationship between the missingness and other values, observed or missing, within the dataset.

Missing At Random (MAR):
The probability of being missing depends only on other observed variables, not on the unobserved value itself (e.g. if men are less likely to disclose their salary, salary is MAR given gender).


Handling Missing Values:

1. Mean/Median/Mode Imputation

When should we apply it?
Mean/median imputation assumes that the data are missing completely at random (MCAR). We replace the NaN values with the mean or median of the variable (or, for mode imputation, with its most frequent value); a short sketch follows the lists below.

Advantages

  1. Easy to implement (median imputation is robust to outliers)
  2. Faster way to obtain the complete dataset

Disadvantages
  1. Change or Distortion in the original variance
  2. Impacts Correlation
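A minimal sketch of mean/median imputation with pandas, on a hypothetical column containing NaNs and an outlier:

import numpy as np
import pandas as pd

# Hypothetical column with missing values and an outlier (120).
df = pd.DataFrame({'age': [25, np.nan, 30, 35, np.nan, 120]})

# Median is usually preferred over mean when outliers are present.
df['age_median'] = df['age'].fillna(df['age'].median())
df['age_mean'] = df['age'].fillna(df['age'].mean())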


2. Random Sample Imputation

Aim: Random sample imputation consists of taking random observations from the dataset and using them to replace the NaN values (see the sketch after the lists below).

When should it be used? 

It assumes that the data are missing completely at random (MCAR).

Advantages
  1. Easy to implement
  2. There is less distortion in variance
Disadvantage
  1. Randomness won't work in every situation
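A minimal sketch of random sample imputation; impute_random_sample is a hypothetical helper written for illustration:

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 30, 35, np.nan, 40]})

def impute_random_sample(series, random_state=0):
    # Replace each NaN with a value drawn at random from the observed values.
    out = series.copy()
    n_missing = out.isna().sum()
    sample = series.dropna().sample(n_missing, replace=True, random_state=random_state)
    sample.index = out[out.isna()].index  # align sampled values to the NaN rows
    out[out.isna()] = sample
    return out

df['age_imputed'] = impute_random_sample(df['age'])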

3. Capturing NaN Values with a New Feature

It works well if the data are not missing completely at random, i.e. the missingness itself carries information (see the sketch after the lists below).


Advantages

  1. Easy to implement
  2. Captures the importance of missing values
Disadvantages
  1. Creates additional features (curse of dimensionality)
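A minimal sketch of capturing missingness in a new binary feature, on a hypothetical column:

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 30, np.nan, 40]})

# The new binary feature flags which rows were missing, so a model can
# learn from the missingness itself; the original column is then imputed.
df['age_was_missing'] = df['age'].isna().astype(int)
df['age'] = df['age'].fillna(df['age'].median())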

4. Arbitrary Value Imputation

This technique was popularized in Kaggle competitions. It consists of replacing NaN with an arbitrary value, typically one that does not occur naturally in the variable (a sketch follows the lists below).


Advantages

  • Easy to implement
  • Captures the importance of missingness, if there is one

Disadvantages

  • Distorts the original distribution of the variable
  • If missingness is not important, it may mask the predictive power of the original variable by distorting its distribution
  • Hard to decide which value to use
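A minimal sketch of arbitrary value imputation; -999 is an assumed arbitrary choice, picked only because it lies far outside the observed range:

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 30, np.nan, 40]})

# Imputed rows are clearly separated from real observations.
df['age_arbitrary'] = df['age'].fillna(-999)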


How To Handle Categorical Missing Values?

Frequent Category Imputation

If one category is clearly the most frequent, we replace the NaN values with that most frequent category (see the sketch after the lists below).

Advantages

  1. Easy to implement
  2. Faster way to implement

Disadvantages
  1. Since we are using the most frequent label, it may become over-represented if there are many NaNs
  2. It distorts the relation of the most frequent label with the other variables
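A minimal sketch of frequent category imputation on a hypothetical categorical column:

import numpy as np
import pandas as pd

df = pd.DataFrame({'city': ['Delhi', 'Mumbai', 'Delhi', np.nan, 'Delhi', np.nan]})

# mode()[0] is the most frequent category; every NaN is replaced with it.
most_frequent = df['city'].mode()[0]
df['city'] = df['city'].fillna(most_frequent)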

5. Handling Imbalanced Dataset


6. Correlation and Multicollinearity


Case Study:

Predicting online channel acceptance with social media:

What are the factors that contribute to the scope of information propagation, and how accurately can we predict its variation?

Generative Approach:

Characterizing and modelling the process of content becoming popular in a social network

Excellent interpretability

Poor prediction of variation in real-world cascade size


Feature-based approach:

Solving a regression or classification task with a learning algorithm and a set of predictive features

Provides a framework for both prediction and explanation

No consensus has been established on which features are the most essential

Features of a tweet's author are used along with features of the tweet itself

Non-linear algorithms outperform linear ones

The highest proportion of variance in the outcome explained by the model is less than 50%


Why is social media so interesting?

Very short (140 characters)

Lexically divergent

Abbreviated

Multilingual

Why is language in social media so challenging?

High-dimensional latent space

Data drift (trends change)

User activeness --- affects generalization

Topical sparsity --- relationships, politics, etc. are not discussed all the time

Dynamic streaming nature


NLP, ML for computational social science 

Correlating user properties and emotions on Twitter

ML methods

Online streaming predictions

Dynamic learning and inference from mixed streams


Research Method

1. Surveys to be conducted

Which website helps the customer buy more products?

More relevant, less relevant, useless

2. Respondent characteristics

different age groups

different buying capacity

3. Products showcased in the survey

Second-hand cars, books, pubs, cameras, computers, fitness centres


Social media classification:

The whole of social media is divided into different groups or product categories.

Relational classifiers

Whether a user buys a product depends on their neighbours (connected users).


Eigenvalues, Eigenvectors

PCA

ICA

Autoencoder

Multifactor dimensionality reduction
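A minimal sketch tying eigenvalues and eigenvectors to PCA, using a hypothetical random data matrix; this is a from-scratch NumPy illustration, not a production implementation:

import numpy as np

# Hypothetical data matrix: 100 samples, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Eigenvectors of the covariance matrix are the principal directions;
# eigenvalues give the variance explained along each direction.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: covariance matrix is symmetric
order = np.argsort(eigvals)[::-1]         # sort by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

X_reduced = X_centered @ eigvecs[:, :2]   # project onto the top 2 components
explained = eigvals / eigvals.sum()       # proportion of variance explained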

