Feature selection where data is skewed in shape (classification model) - machine-learning

I have data in which a few features are skewed (right-tailed). Should I transform these features to bring them closer to a normal distribution and then check the predictions, or should I drop them and train the model without them?
What should I do? We usually apply such a transformation when we use a regression model, but do we also need it for a classification model?

This transformation is meant to help your optimizer converge, and that does not depend on whether you're performing classification or regression. So you might want to try a log transformation to bring the skewed features closer to a normal distribution.
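As a rough illustration of the log transformation suggested above, here is a minimal sketch using only the standard library. The feature is synthetic (log-normal), and the choice of `log1p` and the `skewness` helper are my assumptions, not something from the question.

```python
import math
import random
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


random.seed(0)
# Synthetic right-tailed feature (log-normal), standing in for a real column.
feature = [random.lognormvariate(0.0, 1.0) for _ in range(5000)]

# log1p(x) = log(1 + x) is defined at x = 0, so it tolerates zero values.
transformed = [math.log1p(v) for v in feature]

skew_before = skewness(feature)
skew_after = skewness(transformed)
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The skewness should drop noticeably after the transform. Whether that actually helps the classifier is something to check on held-out data rather than assume.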

Related

Which SMOTE algorithm should I use for Augmentation of Time Series dataset?

I am working on a time series dataset where I want to do both forecasting and prediction. If you have any suggestions, please share. Thank you!
T-Smote
This allows one to both impute fully missing observations to allow uniform time series classification across the entire data and, in special cases, to impute individually missing features. To do so, we slightly generalize the well-known class imbalance algorithm SMOTE to allow component wise nearest neighbor interpolation that preserves correlations when there are no missing features. We visualize the method in the simplified setting of 2-dimensional uncoupled harmonic oscillators. Next, we use tSMOTE to train an Encoder/Decoder long-short term memory (LSTM) model with Logistic Regression for predicting and classifying distinct trajectories of different 2D oscillators.

LDA as the dimension reduction before or after partitioning

I am doing classification and I have a question about using LDA purely for dimension reduction:
Should LDA be applied to the whole feature matrix, including both train and test data, with the partitioning into train and test sets done afterwards on the reduced data? Is that correct?
Alternatively, suppose we need to partition the data before applying LDA. How can we then classify the test data using Matlab's built-in classifiers such as kNN and SVM?
You should fit the LDA on the training set and afterwards apply it to the test set as well.
The reason is that you want to check how your entire processing chain performs on unseen data. If you fit the LDA model on train and test together, the projection has already seen the test data, so the test score no longer tells you how the chain behaves on unseen data.
Actually, if you also need to determine the number of dimensions, you should go for a train/test/validation split: determine the optimal number of dimensions on train/test, then build LDA + model on train and test merged, and evaluate on validation.
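To make the "fit on train, apply to test" order concrete, here is a minimal NumPy sketch. It is not Matlab and not a library LDA. It assumes a two-class problem and uses the classic Fisher discriminant direction; the data is synthetic and the threshold classifier is only a placeholder for kNN or SVM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data with 5 features (stand-in for a real feature matrix).
n_per_class = 200
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(n_per_class, 5)),
    rng.normal(loc=1.5, scale=1.0, size=(n_per_class, 5)),
])
y = np.repeat([0, 1], n_per_class)

# Partition first, so the test rows never influence the projection.
order = rng.permutation(len(y))
train_idx, test_idx = order[:300], order[300:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Fit the Fisher direction on the training set only.
mean0 = X_train[y_train == 0].mean(axis=0)
mean1 = X_train[y_train == 1].mean(axis=0)
centered = np.vstack([
    X_train[y_train == 0] - mean0,
    X_train[y_train == 1] - mean1,
])
within_scatter = centered.T @ centered
w = np.linalg.solve(within_scatter, mean1 - mean0)

# Apply the same direction to both sets: 5 features become 1.
z_train = X_train @ w
z_test = X_test @ w

# Placeholder classifier on the reduced feature: threshold halfway between
# the projected class means, also learned from the training set only.
threshold = 0.5 * (z_train[y_train == 0].mean() + z_train[y_train == 1].mean())
test_accuracy = np.mean((z_test > threshold).astype(int) == y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

The point is only the order of operations: `w` and `threshold` are computed from `X_train`, and the test set is merely projected and scored. Any classifier could be trained on `z_train` in place of the threshold.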

Data Science Scaling/Normalization real case

When doing data pre-processing, it is suggested to do either scaling or normalization. That is easy when you have all the data on hand and can do it right away. But after the model is built and running, does new incoming data need to be scaled or normalized? If so, and it is only a single row, how do we scale or normalize it? How do we know the min/max/mean/stdev of each feature? And how does the incoming data relate to the min/max/mean of each feature?
Please advise
First of all, you should know when to use scaling and when to use normalization.
Scaling - Scaling simply transforms your features to comparable magnitudes. Say you have features such as a person's income, and you notice that some features have values of order 10^3 and others of order 10^6. If you model your problem with these features, algorithms like KNN and Ridge Regression will give more weight to the attributes with higher magnitude. To prevent this, you need to scale your features first. The min-max scaler is one of the most commonly used scalers.
Mean Normalisation - If, after examining the distribution of a feature, you find that it is not centered around zero, then algorithms like SVM, whose objective function assumes zero mean and variance of the same order, may have problems in modeling. In that case you should do mean normalisation.
Standardization - For algorithms like SVM, neural networks, and logistic regression, the variances of the features need to be of the same order, so why not make them all one? In standardization, we transform each feature to zero mean and unit variance.
Now let's try to answer your question in terms of training and testing sets.
Let's say you are training your model on a dataset of 50k rows and testing on a dataset of 10k rows.
For the above three transformations, the standard approach is to fit any normalizer or scaler on the training dataset only, and to use only transform on the testing dataset.
In our case, if we want to use standardization, we first fit our standardizer on the 50k training dataset and then use it to transform both the 50k training dataset and the testing dataset.
Note - We shouldn't fit our standardizer on the test dataset. Instead, we use the already fitted standardizer to transform the testing dataset.
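The fit-on-train, transform-both pattern described above can be sketched in plain NumPy as follows. The 50k/10k arrays are synthetic stand-ins, and the feature locations and scales are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins for the 50k training rows and 10k test rows, 3 features.
loc = [10.0, 1000.0, -5.0]
scale = [2.0, 300.0, 0.5]
X_train = rng.normal(loc=loc, scale=scale, size=(50_000, 3))
X_test = rng.normal(loc=loc, scale=scale, size=(10_000, 3))

# "Fit": estimate the statistics from the training set only.
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)

# "Transform": reuse the training statistics for both sets.
X_train_scaled = (X_train - train_mean) / train_std
X_test_scaled = (X_test - train_mean) / train_std

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled.mean(axis=0), X_test_scaled.std(axis=0))
```

The training set ends up with zero mean and unit variance by construction. The test set is only approximately standardized, which is expected, because its own statistics were never used.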
Yes, you need to apply normalization to the input data, else the model will predict nonsense.
You also have to save the normalization coefficients that were used during training, or from training data. Then you have to apply the same coefficients to incoming data.
For example, if you use min-max normalization:
f_n = (f - min(f)) / (max(f) - min(f))
Then you need to save the min(f) and max(f) in order to perform normalization for new data.
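Here is one way that saving and reusing min(f) and max(f) might look for a single incoming row. This is a sketch under my own assumptions: the helper names `fit_min_max` and `transform_min_max`, the JSON storage, and the income/age values are all hypothetical.

```python
import json


def fit_min_max(rows):
    """Record the per-feature minimum and maximum of the training rows."""
    columns = list(zip(*rows))
    return {"min": [min(c) for c in columns], "max": [max(c) for c in columns]}


def transform_min_max(row, params):
    """Scale one row with previously saved training minima and maxima."""
    return [
        (value - lo) / (hi - lo)
        for value, lo, hi in zip(row, params["min"], params["max"])
    ]


# Hypothetical training data: [income, age].
train_rows = [[1_000.0, 20.0], [50_000.0, 40.0], [1_000_000.0, 60.0]]
params = fit_min_max(train_rows)

# Persist the coefficients next to the model (here as a JSON string).
saved = json.dumps(params)

# Later, at prediction time: reload and scale a single incoming row.
loaded = json.loads(saved)
incoming = [50_000.0, 30.0]
scaled = transform_min_max(incoming, loaded)
print(scaled)
```

Note that an incoming value outside the training range will land outside [0, 1], and a feature that was constant in training would cause a division by zero. Both cases need a decision of your own (clipping, a small epsilon, or dropping the feature).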

Using probabilities from a model as features to another

I'd like to use the probability output from a model as features to another model.
For instance, I want to determine what kind of bird is in a picture. I want to train a CNN and then feed its probability output, together with other data such as the size and weight of the bird, to an SVM.
Do I need to use training and testing sets for extracting these probabilities using the CNN? Should I divide my dataset into folds and then extract the probabilities for each different testing fold, or can I just train and test on all my data and save the probabilities?
A test set is intended to validate that your classifier reaches its goals, or alternatively to set hyper-parameters. In this case, you're not interested in the output of the CNN itself, as it's just an intermediate layer in the bigger picture.
Having said that, you're apparently not back-propagating SVM errors through its inputs. That's the consequence of a two-stage model. If you did, you'd be optimizing the CNN for use as input to that particular SVM.

Applying PCA before sending data to SVM

Before applying SVM on my data I want to reduce its dimension by PCA. Should I separate the Train data and Test data then apply PCA on each of them separately or apply PCA on both sets combined then separate them?
Actually both provided answers are only partially right. The crucial part here is what is the exact problem you are trying to solve. There are two basic possible settings which can be considered, and both are valid under some assumptions.
Case 1
You have some data (which you split into train and test) and in the future you will get more data coming from the same distribution.
If this is the case, you should fit PCA on the train data, then fit the SVM on its projection. For testing you just apply the already fitted PCA followed by the already fitted SVM, and you do exactly the same for new data that will come. This way your test error (under some "size assumptions") should approximate your expected error.
Case 2
You have some data (which you split into train and test) and in the future you will obtain a big chunk of unlabeled data, and you will be able to refit your model then.
In such a case, you fit PCA on the whole data provided, learn the SVM on the labeled part (train set) and evaluate on the test set. This way, once new data arrives you can fit PCA using both your data and the new data, and then train the SVM on your old data (as this is the only part that has labels). Under the assumption that, again, the data comes from the same distribution, everything is correct here. You use more data to fit PCA only to have a better estimator (maybe your data is really high dimensional and PCA fails with a small sample?).
You should do them separately. If you run PCA on both sets combined, you are going to introduce a bias in your SVM. The goal of the test set is to see how your algorithm will perform without prior knowledge of the data.
Learn the projection matrix of PCA on the train set and use it to reduce the dimensions of the test data.
One benefit is that this way you don't have to rely on collecting sufficient data in the test set if you are applying your classifier at actual run time, where test data comes one sample at a time.
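A minimal NumPy sketch of learning the projection matrix on the train set and reusing it, including for one sample at a time, could look like this. The data is synthetic, and the choice of 3 components and the SVD route are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 10 correlated features built from 3 latent factors.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 10))
X_train, X_test = X[:400], X[400:]

# Fit PCA on the training set only: centre, then take the top right
# singular vectors as the projection matrix.
train_mean = X_train.mean(axis=0)
_, _, vt = np.linalg.svd(X_train - train_mean, full_matrices=False)
n_components = 3
projection = vt[:n_components].T  # shape (10, 3)

# Reuse the training mean and projection for the test set.
Z_train = (X_train - train_mean) @ projection
Z_test = (X_test - train_mean) @ projection

# A single incoming sample can be projected the same way.
z_new = (X_test[0] - train_mean) @ projection
print(Z_train.shape, Z_test.shape, z_new.shape)
```

The SVM would then be trained on `Z_train` and evaluated on `Z_test`. Note that both the training mean and the projection matrix have to be stored, not just the matrix.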
Also, I think fitting separate PCAs on train and test will fail. Why?
Think of PCA as giving you features, and then you learn a classifier over these features. If your data shifts over time, the test features you get using PCA would be different, and you don't have a classifier trained on these features. Even if the set of directions/features of the PCA remains the same but their order varies, your classifier still fails.
