Mock Interview sessions
For 15th November session
1. Data Profiling
When we need to do some EDA on a given dataset, data profiling is used. Data profiling helps us calculate the following things:
Essentials: type, unique values, missing values
Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
Most frequent values
Histogram
Correlation highlighting highly correlated variables (Spearman & Pearson matrices)
import pandas as pd
import pandas_profiling

df = pd.read_csv('train.csv')
pandas_profiling.ProfileReport(df)  # renders the full profiling report in a notebook
2. Data Visualization
3. Standardization/Normalization
Transformation of Features
Why Is Transformation of Features Required?
- Linear Regression ---> Gradient Descent ---> Global Minima
- Algorithms like KNN, K-Means, Hierarchical Clustering ---> Euclidean Distance
Every point has some vector and direction
Deep Learning Techniques (Standardization, Scaling): 1. ANN ---> Global Minima, Gradient Descent 2. CNN 3. RNN
Image pixel values range from 0-255
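As a quick illustration of the Euclidean-distance point above, here is a minimal sketch (the age/salary numbers are made up) showing how an unscaled feature with a large range dominates the distance:

import numpy as np

# Hypothetical customers described by (age in years, salary in rupees)
a = np.array([25, 50000])
b = np.array([45, 52000])

# Without scaling, the salary column dominates the Euclidean distance
print(np.linalg.norm(a - b))                # ~2000.1; the age gap of 20 barely registers

# After scaling both features to comparable ranges, age also contributes
a_scaled = np.array([25 / 100, 50000 / 100000])
b_scaled = np.array([45 / 100, 52000 / 100000])
print(np.linalg.norm(a_scaled - b_scaled))  # ~0.20; both features now matter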
Types Of Transformation
- Normalization And Standardization
- Scaling to Minimum And Maximum values
- Scaling To Median And Quantiles
- Gaussian Transformation: Logarithmic Transformation, Reciprocal Transformation, Square Root Transformation, Exponential Transformation, Box-Cox Transformation
IQR = 75th quantile - 25th quantile
X_scaled = (X - X.median()) / IQR
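A minimal scikit-learn sketch of the scaling options listed above (the 'Age' and 'Fare' columns are assumed for illustration); RobustScaler implements the median/IQR formula shown above:

import pandas as pd
import numpy as np
from scipy import stats
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

df = pd.read_csv('train.csv')
X = df[['Age', 'Fare']].dropna()                   # assumed numeric columns

# Standardization: (X - mean) / std
X_standard = StandardScaler().fit_transform(X)

# Scaling to minimum and maximum values: (X - min) / (max - min)
X_minmax = MinMaxScaler().fit_transform(X)

# Scaling to median and quantiles: (X - median) / IQR
X_robust = RobustScaler().fit_transform(X)

# Gaussian transformations to reduce skewness
X['Fare_log'] = np.log1p(X['Fare'])                # logarithmic
X['Fare_boxcox'], _ = stats.boxcox(X['Fare'] + 1)  # Box-Cox (requires positive values)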
4. Handling Null Values
Why are there missing values?? Example: a depression survey
- People hesitate to put down the information
- Survey responses are not always valid
- Men may not reveal their salary
- Women may not reveal their age
- People may have died ---> NaN
What are the different types of Missing Data?
- Missing Completely At Random (MCAR)
- Missing At Random (MAR)
- Missing Not At Random (MNAR)
1. Mean/Median/Mode Imputation
Advantages
- Easy to implement (robust to outliers)
- Faster way to obtain the complete dataset
Disadvantages
- Changes or distorts the original variance
- Impacts correlation
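A minimal sketch of median imputation with pandas, reusing the train.csv file from the profiling example (the 'Age' column is an assumption):

import pandas as pd

df = pd.read_csv('train.csv')

# Replace NaN in a numeric column with its median
# (median is preferred over the mean because it is robust to outliers)
median_value = df['Age'].median()
df['Age_median'] = df['Age'].fillna(median_value)

# Imputing with a single constant shrinks the variance of the variable
print(df['Age'].std(), df['Age_median'].std())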
2. Random Sample Imputation
Aim: Random sample imputation consists of taking a random observation from the dataset and using that observation to replace the NaN values
When should it be used?
It assumes that the data are missing completely at random (MCAR)
Advantages
- Easy To implement
- There is less distortion in variance
Disadvantage
- Randomness will not work in every situation
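A possible sketch of random sample imputation with pandas (the 'Age' column and the helper name are illustrative):

import pandas as pd

df = pd.read_csv('train.csv')

def impute_random_sample(df, column):
    # Copy the column, then draw as many observed values as there are NaN's
    df[column + '_random'] = df[column]
    random_sample = df[column].dropna().sample(df[column].isnull().sum(),
                                               random_state=0)
    # Re-index the sample so it lines up with the rows that contain NaN
    random_sample.index = df[df[column].isnull()].index
    df.loc[df[column].isnull(), column + '_random'] = random_sample
    return df

df = impute_random_sample(df, 'Age')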
3. Capturing NaN values with a new feature
Advantages
- Easy to implement
- Captures the importance of missing values
Disadvantages
- Creates additional features (curse of dimensionality)
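A minimal sketch of capturing NaN values with a new indicator feature (the 'Age' column is assumed):

import numpy as np
import pandas as pd

df = pd.read_csv('train.csv')

# New binary feature: 1 where the value was originally missing, 0 otherwise
df['Age_NaN'] = np.where(df['Age'].isnull(), 1, 0)

# The original column can then be imputed, e.g. with the median
df['Age'] = df['Age'].fillna(df['Age'].median())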
4. Arbitrary Value Imputation
This technique was derived from Kaggle competitions. It consists of replacing NaN with an arbitrary value.
Advantages
- Easy to implement
- Captures the importance of missingness if there is one
Disadvantages
- Distorts the original distribution of the variable
- If missingness is not important, it may mask the predictive power of the original variable by distorting its distribution
- Hard to decide which value to use
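A minimal sketch of arbitrary value imputation (the values 0 and 100 and the 'Age' column are illustrative choices):

import pandas as pd

df = pd.read_csv('train.csv')

# Replace NaN with arbitrary values that lie outside the normal range of the variable,
# so that imputed rows stand apart from genuine observations
df['Age_zero'] = df['Age'].fillna(0)
df['Age_hundred'] = df['Age'].fillna(100)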
How To Handle Categorical Missing Values?
Frequent Category Imputation: replace NaN with the most frequent label
Advantages
- Easy to implement
- Faster way to implement
Disadvantages
- Since we are using the most frequent label, it may be over-represented if there are many NaN's
- It distorts the relation of the most frequent label
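A minimal sketch of frequent-category imputation for categorical variables (the 'Cabin' column is assumed):

import numpy as np
import pandas as pd

df = pd.read_csv('train.csv')

# Option 1: replace NaN with the most frequent category
most_frequent = df['Cabin'].mode()[0]
df['Cabin_frequent'] = df['Cabin'].fillna(most_frequent)

# Option 2: treat missingness as its own category
df['Cabin_missing'] = np.where(df['Cabin'].isnull(), 'Missing', df['Cabin'])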
5. Handling Imbalanced Dataset
6. Correlation and Multicollinearity
Case Study:
Predicting online channel acceptance with social media:
What are the factors that contribute to the scope of information propagation, and how accurately can we predict its variation?
Generative Approach:
characterizing and modelling the process of content becoming popular in a social network
excellent interpretability
poor prediction of variation in real-world cascade size
Feature-based approach:
Solving a regression or classification task with a learning algorithm and a set of predictive features
provides framework for both prediction and explanation
No consensus has been established on which features are the most essential
Features of the author of a tweet are used along with features of the tweet content
Non-linear algorithms outperform linear ones
The highest proportion of variance in the outcome explained by the models is less than 50%
Why is social media so interesting?
very short (140 characters)
lexically divergent
abbreviated
multilingual
Why is language in social media so challenging?
High-dimensional latent space
Data drift (trends change)
User activeness generalization
Topical sparsity (e.g., relationships and politics are not discussed all the time)
Dynamic streaming nature
NLP, ML for computational social science
correlating user properties and emotions on Twitter
ML methods
Online Streaming predictions
Dynamic learning and inference from mixed streams
Research Method
1. Surveys to be conducted
Which website helps the customer buy more products?
more relevant, less relevant, use less
2. Respondent characteristics
different age groups
different buying capacity
3. Products showcased in the survey
second-hand cars, books, pubs and cameras, computers, fitness centres
Social media classification:
The whole of social media will be divided into different groups or products
Relational classifiers
Buying a product depends on the neighbours
Eigenvalues, Eigenvectors
PCA
ICA
Autoencoder
Multifactor dimensionality reduction
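A minimal PCA sketch with scikit-learn, tying eigenvalues/eigenvectors to dimensionality reduction (the data here is random and purely illustrative):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)            # 100 samples, 10 features (illustrative)

pca = PCA(n_components=2)              # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)

# The principal components are the eigenvectors of the covariance matrix
print(pca.components_.shape)           # (2, 10)
# The normalized eigenvalues give the proportion of variance explained per component
print(pca.explained_variance_ratio_)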