Mock Interview sessions

For the 15th November session


1. Data Profiling

When we need to do some EDA on a given dataset, data profiling is used. Data profiling helps us calculate the following:

Essentials: type, unique values, missing values

Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range

Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness

Most frequent values

Histogram 

Correlation highlighting of highly correlated variables (Spearman & Pearson matrices)

import pandas as pd
import pandas_profiling

# Load the dataset and generate an automated EDA report covering the
# statistics listed above.
df = pd.read_csv('train.csv')
pandas_profiling.ProfileReport(df)

2. Data Visualization


Heatmap

Presents data in tabular form with color coding, which makes it easy to cross-examine multivariate relationships at a glance.
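A minimal sketch of a correlation heatmap with seaborn, assuming the same hypothetical train.csv used above with numeric columns:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('train.csv')

# Pearson correlation matrix rendered as a color-coded heatmap, which
# makes highly correlated variable pairs easy to spot.
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()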


3. Standardization/Normalization

Transformation of Features

Why Is Transformation of Features Required?

  1. Linear Regression --- Gradient Descent --- Global Minima
  2. Algorithms like KNN, K-Means, Hierarchical Clustering --- Euclidean Distance

Every point has a magnitude and a direction (it is a vector).

Deep learning techniques also need standardization/scaling:

  1. ANN ---> Global Minima, Gradient Descent
  2. CNN
  3. RNN

Image pixel values range from 0-255 and are scaled before training.

Types Of Transformation

  1. Normalization And Standardization
  2. Scaling to Minimum And Maximum values
  3. Scaling To Median And Quantiles
  4. Gaussian Transformation: Logarithmic Transformation, Reciprocal Transformation, Square Root Transformation, Exponential Transformation, Box-Cox Transformation (see the sketch below)
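A minimal sketch of the Gaussian transformations from item 4, applied to a hypothetical right-skewed feature (the +1 shift and the 1/1.2 exponent are arbitrary illustrative choices):

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical right-skewed feature for illustration.
x = pd.Series(np.random.exponential(scale=2.0, size=1000))

log_t = np.log(x + 1)                  # logarithmic (shifted to avoid log(0))
reciprocal_t = 1 / (x + 1)             # reciprocal
sqrt_t = np.sqrt(x)                    # square root
exp_t = x ** (1 / 1.2)                 # "exponential" (fractional power) transformation
boxcox_t, lam = stats.boxcox(x + 1)    # Box-Cox; input must be strictly positive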

Different scalers:

  1. Standard Scaler (the most used scaler in ML; standardizes a feature to zero mean and unit variance)
  2. Min-Max Scaler (used in deep learning, e.g. to transform pixel values)
  3. Robust Scaler

The Robust Scaler scales the feature using the median and quantiles: it consists of subtracting the median from all the observations and then dividing by the interquartile range, which is the difference between the 75th and 25th quantiles:

IQR = 75th quantile - 25th quantile

X_scaled = (X - median(X)) / IQR
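A minimal sketch of the three scalers with scikit-learn, on a hypothetical single-column frame; the formulas in the comments match the definitions above:

import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Hypothetical feature with one outlier (90) for illustration.
df = pd.DataFrame({'age': [20, 22, 25, 30, 35, 40, 90]})

std = StandardScaler().fit_transform(df)       # (x - mean) / std
minmax = MinMaxScaler().fit_transform(df)      # (x - min) / (max - min), into [0, 1]
robust = RobustScaler().fit_transform(df)      # (x - median) / IQR, outlier-resistant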

4. Handling Null Values

Why are there missing values? Consider a survey, e.g. a depression survey:

  1. People hesitate to put down the information
  2. Survey responses are not always valid
  3. Men may not disclose their salary
  4. Women may not disclose their age
  5. People may have died ---- NaN

What are the different types of Missing Data?


Missing Completely at Random (MCAR):
A variable is missing completely at random (MCAR) if the probability of being missing is the same for all the observations. When data is MCAR, there is absolutely no relationship between the missingness and any other values, observed or missing, within the dataset. In other words, the missing data points are a random subset of the data: there is nothing systematic going on that makes some data more likely to be missing than others.

Missing Data Not At Random (MNAR):
Systematic missing values. There is a relationship between the missingness and other values, observed or missing, within the dataset.

Missing At Random (MAR):
The probability of being missing depends only on other observed variables, not on the unobserved value itself (e.g. if men are less likely to disclose their salary, salary is MAR given gender).


Handling Missing Values:

1. Mean/Median/Mode Imputation

When should we apply it?
Mean/median imputation assumes that the data are missing completely at random (MCAR). We replace the NaN values with the mean or median of the variable (or, for mode imputation, with its most frequent value); a short sketch follows the lists below.

Advantages

  1. Easy to implement (median imputation is robust to outliers)
  2. Faster way to obtain the complete dataset

Disadvantages
  1. Change or Distortion in the original variance
  2. Impacts Correlation
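A minimal sketch of mean/median imputation with pandas, on a hypothetical column containing NaNs and an outlier:

import numpy as np
import pandas as pd

# Hypothetical column with missing values and an outlier (120).
df = pd.DataFrame({'age': [25, np.nan, 30, 35, np.nan, 120]})

# Median is usually preferred over mean when outliers are present.
df['age_median'] = df['age'].fillna(df['age'].median())
df['age_mean'] = df['age'].fillna(df['age'].mean())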


2. Random Sample Imputation

Aim: Random sample imputation consists of taking random observations from the dataset and using them to replace the NaN values (see the sketch after the lists below).

When should it be used? 

It assumes that the data are missing completely at random (MCAR).

Advantages
  1. Easy to implement
  2. There is less distortion in variance
Disadvantage
  1. Randomness won't work in every situation
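A minimal sketch of random sample imputation; impute_random_sample is a hypothetical helper written for illustration:

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 30, 35, np.nan, 40]})

def impute_random_sample(series, random_state=0):
    # Replace each NaN with a value drawn at random from the observed values.
    out = series.copy()
    n_missing = out.isna().sum()
    sample = series.dropna().sample(n_missing, replace=True, random_state=random_state)
    sample.index = out[out.isna()].index  # align sampled values to the NaN rows
    out[out.isna()] = sample
    return out

df['age_imputed'] = impute_random_sample(df['age'])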

3. Capturing NaN Values with a New Feature

It works well if the data are not missing completely at random, i.e. the missingness itself carries information (see the sketch after the lists below).


Advantages

  1. Easy to implement
  2. Captures the importance of missing values
Disadvantages
  1. Creates additional features (curse of dimensionality)
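A minimal sketch of capturing missingness in a new binary feature, on a hypothetical column:

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 30, np.nan, 40]})

# The new binary feature flags which rows were missing, so a model can
# learn from the missingness itself; the original column is then imputed.
df['age_was_missing'] = df['age'].isna().astype(int)
df['age'] = df['age'].fillna(df['age'].median())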

4. Arbitrary Value Imputation

This technique was popularized in Kaggle competitions. It consists of replacing NaN with an arbitrary value, typically one that does not occur naturally in the variable (a sketch follows the lists below).


Advantages

  • Easy to implement
  • Captures the importance of missingness, if there is one

Disadvantages

  • Distorts the original distribution of the variable
  • If missingness is not important, it may mask the predictive power of the original variable by distorting its distribution
  • Hard to decide which value to use
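A minimal sketch of arbitrary value imputation; -999 is an assumed arbitrary choice, picked only because it lies far outside the observed range:

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 30, np.nan, 40]})

# Imputed rows are clearly separated from real observations.
df['age_arbitrary'] = df['age'].fillna(-999)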


How To Handle Categorical Missing Values?

Frequent Category Imputation

If one category is clearly the most frequent, we replace the NaN values with that most frequent category (see the sketch after the lists below).

Advantages

  1. Easy to implement
  2. Faster way to implement

Disadvantages
  1. Since we are using the most frequent label, it may become over-represented if there are many NaNs
  2. It distorts the relation of the most frequent label with the other variables
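A minimal sketch of frequent category imputation on a hypothetical categorical column:

import numpy as np
import pandas as pd

df = pd.DataFrame({'city': ['Delhi', 'Mumbai', 'Delhi', np.nan, 'Delhi', np.nan]})

# mode()[0] is the most frequent category; every NaN is replaced with it.
most_frequent = df['city'].mode()[0]
df['city'] = df['city'].fillna(most_frequent)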

5. Handling Imbalanced Dataset


6. Correlation and Multicollinearity


Case Study:

Predicting online channel acceptance with social media:

What are the factors that contribute to the scope of information propagation, and how accurately can we predict its variation?

Generative Approach:

Characterizing and modelling the process of content becoming popular in a social network

Excellent interpretability

Poor prediction of variation in real-world cascade size


Feature-based approach:

Solving a regression or classification task with a learning algorithm and a set of predictive features

Provides a framework for both prediction and explanation

No consensus has been established on which features are the most essential

Features of a tweet's author are used along with features of the tweet itself

Non-linear algorithms outperform linear ones

The highest proportion of variance in the outcome explained by the model is less than 50%


Why is social media so interesting?

Very short (140 characters)

Lexically divergent

Abbreviated

Multilingual

Why is language in social media so challenging?

High-dimensional latent space

Data drift (trends change)

User activeness --- affects generalization

Topical sparsity --- relationships, politics, etc. are not discussed all the time

Dynamic streaming nature


NLP, ML for computational social science 

Correlating user properties and emotions on Twitter

ML methods

Online streaming predictions

Dynamic learning and inference from mixed streams


Research Method

1. Surveys to be conducted

Which website helps the customer buy more products?

More relevant, less relevant, useless

2. Respondent characteristics

different age groups

different buying capacity

3. Products showcased in the survey

Second-hand cars, books, pubs, cameras, computers, fitness centres


Social media classification:

The whole of social media is divided into different groups or product categories.

Relational classifiers

Whether a user buys a product depends on their neighbours (connected users).


Eigenvalues, Eigenvectors

PCA

ICA

Autoencoder

Multifactor dimensionality reduction
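A minimal sketch tying eigenvalues and eigenvectors to PCA, using a hypothetical random data matrix; this is a from-scratch NumPy illustration, not a production implementation:

import numpy as np

# Hypothetical data matrix: 100 samples, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Eigenvectors of the covariance matrix are the principal directions;
# eigenvalues give the variance explained along each direction.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: covariance matrix is symmetric
order = np.argsort(eigvals)[::-1]         # sort by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

X_reduced = X_centered @ eigvecs[:, :2]   # project onto the top 2 components
explained = eigvals / eigvals.sum()       # proportion of variance explained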

