Data Science Scaling/Normalization real case - machine-learning

When doing data pre-processing, it is suggested to apply either scaling or normalization. That is easy when you have the data in hand: you have all of it and can transform it right away. But after the model is built and running, does the first data point that comes in also need to be scaled or normalized? If so, it is only a single row, so how do you scale or normalize it? How do we know the min/max/mean/stdev of each feature? And how do we relate the incoming data to each feature's min/max/mean?
Please advise

First of all, you should know when to use scaling and normalization.
Scaling - Scaling simply transforms your features to comparable magnitudes. Say you have a feature like a person's income and you notice that some values are of order 10^3 while others are of order 10^6. If you model your problem with such features, algorithms like KNN and Ridge Regression will give higher weight to the attributes with higher magnitude. To prevent this, you need to scale your features first. The min-max scaler is one of the most commonly used.
Mean Normalisation - If, after examining the distribution of a feature, you find that it is not centered around zero, then algorithms like SVM, whose objective functions already assume zero mean and same-order variance, can have trouble during modeling. In that case you should apply mean normalisation.
Standardization - For algorithms like SVM, neural networks, and logistic regression, the features' variances need to be of the same order, so why not make them equal to one? In standardization, we transform each feature's distribution to zero mean and unit variance.
Now let's try to answer your question in terms of training and testing sets.
Say you are training your model on a 50k-row dataset and testing on a 10k-row dataset.
For the three transformations above, the standard approach is to fit any normalizer or scaler on the training dataset only, and to use only transform on the testing dataset.
In our case, if we want to use standardization, we first fit our standardizer on the 50k training dataset and then use it to transform both the 50k training dataset and the testing dataset.
Note - We should not fit our standardizer on the test dataset; instead, we use the already fitted standardizer to transform the testing dataset.
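A minimal sketch of this workflow with scikit-learn (the random stand-in data, feature values, and file name below are placeholders for illustration):

```python
# Fit the scaler on the training data only, reuse it for the test set and
# for single incoming rows at prediction time.
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(50_000, 3) * 1e6   # stand-in for the 50k training rows
X_test = np.random.rand(10_000, 3) * 1e6    # stand-in for the 10k test rows

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on training data only
X_test_scaled = scaler.transform(X_test)         # reuse the fitted mean/stdev

# Persist the fitted scaler so the same statistics can be applied later.
joblib.dump(scaler, "scaler.joblib")

# At prediction time, a single incoming row is scaled with the mean/stdev
# that were learned from the training data.
scaler = joblib.load("scaler.joblib")
new_row = np.array([[120_000.0, 3.5, 42.0]])     # shape (1, n_features)
new_row_scaled = scaler.transform(new_row)
```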

Yes, you need to apply normalization to the input data, or else the model will predict nonsense.
You also have to save the normalization coefficients that were used during training (i.e., computed from the training data), and then apply those same coefficients to incoming data.
For example if you use min-max normalization:
f_n = (f - min(f)) / (max(f) - min(f))
Then you need to save min(f) and max(f) in order to normalize new data.
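For example, with plain NumPy (a small sketch; the numbers are made up):

```python
# Compute min/max per feature on the training data, keep them, and apply
# the same transformation to a single new row later.
import numpy as np

X_train = np.array([[1.0, 200.0],
                    [2.0, 800.0],
                    [4.0, 400.0]])
f_min = X_train.min(axis=0)   # save these alongside the model
f_max = X_train.max(axis=0)

def min_max_normalize(x):
    # Uses the training-set min/max, not statistics of the new data itself.
    return (x - f_min) / (f_max - f_min)

new_row = np.array([3.0, 600.0])
print(min_max_normalize(new_row))   # values relative to the training range
```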

Related

How to deal with dataset of different features?

I am working on an MLP model for a CEA Classification Dataset (binary classification). Each sample contains 4 different features, such as resistance and other values, each in its own range (resistance in the hundreds, another in the micros, etc.). I am still new to machine learning and this is my first real model to build. How can I deal with such data? I have tried feeding each sample to the neural network with a sigmoid activation function, but I am not getting accurate results. My assumption is that I should scale the data? If so, what are some useful resources to look at, since I do not quite understand when scaling is required.
Scaling your data can be an important step in building a machine-learning model, especially when working with neural networks. Scaling can help to ensure that all of the features in your dataset are on a similar scale, which can make it easier for the model to learn.
There are a few different ways to scale your data, such as normalization and standardization. Normalization is the process of scaling the data so that it has a minimum value of 0 and a maximum value of 1. Standardization is the process of scaling the data so that it has a mean of 0 and a standard deviation of 1.
When working with your CEA Classification dataset, it might be helpful to try both normalization and standardization to see which one works better for your specific dataset. You can use scikit-learn library's preprocessing functions like MinMaxScaler() and StandardScaler() for normalization and standardization respectively.
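As a small sketch (the feature values below are invented for illustration):

```python
# Compare MinMaxScaler (normalization) and StandardScaler (standardization)
# on features that live on very different scales.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data: one feature in the hundreds, one in the micro range.
X = np.array([[150.0, 2e-6],
              [300.0, 5e-6],
              [450.0, 9e-6]])

X_minmax = MinMaxScaler().fit_transform(X)      # each column mapped to [0, 1]
X_standard = StandardScaler().fit_transform(X)  # each column to mean 0, std 1

print(X_minmax)
print(X_standard)
```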
Additionally, it might be helpful to try different activation functions, such as ReLU or LeakyReLU, to see if they lead to more accurate results. Also, you can try adding more layers and neurons in your neural network to see if it improves the performance.
It's also important to remember that feature engineering, which includes the process of selecting the most important features, can be more important than scaling.

Imbalanced classification: order of oversampling vs. scaling features?

When performing classification (for example, logistic regression) with an imbalanced dataset (e.g., fraud detection), is it best to scale/zscore/standardize the features before over-sampling the minority class, or to balance the classes before scaling features?
Secondly, does the order of these steps affect how features will eventually be interpreted (when using all data, scaled+balanced, to train a final model)?
Here's an example:
Scale first:
Split data into train/test folds
Calculate mean/std using all training (imbalanced) data; scale the training data using these calculations
Oversample minority class in the training data (e.g., using SMOTE)
Fit logistic regression model to training data
Use mean/std calculations to scale the test data
Predict class with imbalanced test data; assess acc/recall/precision/auc
Oversample first:
Split data into train/test folds
Oversample minority class in the training data (e.g., using SMOTE)
Calculate mean/std using balanced training data; scale the training data using these calculations
Fit logistic regression model to training data
Use mean/std calculations to scale the test data
Predict class with imbalanced test data; assess acc/recall/precision/auc
Since under- and over-sampling techniques often depend on k-NN or k-means-like algorithms, which use the Euclidean distance between data points, it is safer to scale before resampling. However, in real life, the order hardly matters.
You may have meant it implicitly, but you need to apply the mean/std to scale the training data as well, and that needs to happen before you fit the model.
Barring that point, there isn't a definitive answer on this. The best thing would be to simply try both and see which works best for your data.
For your own understanding of the model on the resulting data, you may want to instead play with computing the mean and standard deviation of the minority and majority classes. If they have similar statistics, then we wouldn't expect much of a difference between scaling first and oversampling first.
If the means and standard deviations are very different, the results may differ significantly. But that may also mean the problem has greater separation, and you may expect a higher classification accuracy.
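As a hedged sketch of the "scale first" ordering, assuming the imbalanced-learn library is available for SMOTE (the data here is a random stand-in):

```python
# Scale with training-set statistics, then oversample only the training fold.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 5)
y = (np.random.rand(1000) < 0.1).astype(int)    # roughly 10% minority class

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

scaler = StandardScaler().fit(X_train)          # mean/std from the (imbalanced) training data
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)             # same statistics reused on the test fold

X_res, y_res = SMOTE(random_state=0).fit_resample(X_train_s, y_train)  # balance after scaling

clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
print(classification_report(y_test, clf.predict(X_test_s)))
```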

Applying PCA before sending data to SVM

Before applying SVM to my data I want to reduce its dimension with PCA. Should I separate the train data and test data and then apply PCA to each of them separately, or apply PCA to both sets combined and then separate them?
Actually, both of the provided answers are only partially right. The crucial part here is what exact problem you are trying to solve. There are two basic settings that can be considered, and both are valid under some assumptions.
Case 1
You have some data (which you split into train and test), and in the future you will get more data coming from the same distribution.
If this is the case, you should fit PCA on the train data, then fit the SVM on its projection, and for testing you just apply the already fitted PCA followed by the already fitted SVM; you do exactly the same for new data as it arrives. This way your test error (under some "size assumptions") should approximate your expected error.
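A minimal sketch of Case 1 with scikit-learn (dimensions and data are placeholders):

```python
# Case 1: fit PCA on the training set only, then reuse it for test/new data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

X_train = np.random.rand(500, 100)            # high-dimensional training data
y_train = np.random.randint(0, 2, 500)
X_test = np.random.rand(100, 100)

pca = PCA(n_components=10).fit(X_train)       # projection learned from train only
svm = SVC().fit(pca.transform(X_train), y_train)

y_pred = svm.predict(pca.transform(X_test))   # same fitted PCA applied to test / new data
```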
Case 2
You have some data (which you split into train and test), and in the future you will obtain a big chunk of unlabeled data and will be able to fit your model then.
In such a case, you fit PCA on all of the data provided, learn the SVM on the labeled part (the train set), and evaluate on the test set. This way, once new data arrives, you can fit PCA using both your data and the new data, and then train the SVM on your old data (as this is the only part that has labels). Under the assumption that, again, the data comes from the same distribution, everything here is correct. You use more data to fit PCA only to get a better estimator (maybe your data is really high dimensional and PCA fails with a small sample?).
You should do them separately. If you run PCA on both sets combined, then you are going to introduce a bias into your SVM. The goal of the test set is to see how your algorithm will perform without prior knowledge of the data.
Learn the projection matrix of PCA on the train set and use it to reduce the dimensions of the test data.
One benefit of this approach is that you don't have to rely on collecting sufficient data in the test set if you are applying your classifier at actual run time, where test data comes one sample at a time.
Also, I think fitting separate train and test PCAs will fail. Why?
Think of PCA as giving you features, over which you then learn a classifier. If your data shifts over time, the test features you get from PCA would be different, and you don't have a classifier trained on those features. Even if the set of directions/features of the PCA remains the same but their order varies, your classifier still fails.

Input matches no features in training set; how much more training data do I need?

I am new to text mining and I am working on a spam filter. I did text cleaning and removed stop words; n-grams are my features. So I built a frequency matrix and built a model using Naive Bayes. I have a very limited set of training data, so I am facing the following problem.
When a sentence comes in for classification and none of its features match the existing features from training, my frequency vector has only zeros.
When I send this vector for classification, I obviously get a useless result.
What would be an ideal size of training data to expect better results?
Generally, the more data you have, the better, although you will get diminishing returns at some point. It is often a good idea to check whether your training set size is a problem by plotting the cross-validation performance while varying the size of the training set. scikit-learn has an example of this type of "learning curve":
Scikit-learn Learning Curve Example
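A hedged sketch of such a learning curve (the frequency matrix and labels below are random stand-ins for a real spam corpus):

```python
# Plot cross-validated accuracy as a function of training set size.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(400, 200))       # stand-in for an n-gram frequency matrix
y = rng.integers(0, 2, size=400)            # stand-in spam/ham labels

train_sizes, train_scores, val_scores = learning_curve(
    MultinomialNB(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

plt.plot(train_sizes, train_scores.mean(axis=1), label="training accuracy")
plt.plot(train_sizes, val_scores.mean(axis=1), label="cross-validation accuracy")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```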
You may consider bringing in outside sample posts to increase the size of your training set.
As you grow your training set, you may want to try reducing the bias of your classifier. This could be done by adding n-gram features, or switching to a logistic regression or SVM model.
When a sentence comes in for classification and none of its features match the existing features from training, my frequency vector has only zeros.
You should normalize your input so that it forms some kind of rough distribution around 0. A common method is this transformation:
input_signal = (feature - feature_mean) / feature_stddev
Then all zeroes would only happen if all features were exactly at the mean.

What is the best/preferred approach to implement Maximum Likelihood Estimation for large data sets in GBs

I have a dataset of several gigabytes (GB) and want to estimate the parameters for the missing values in it.
There is an algorithm called MLE (maximum likelihood estimation) in machine learning that can be used for this.
Since R might not work on such a large dataset, which library would be best to use for it?
From the Wikipedia article on MLE:
In statistics, maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model. When applied to a data set and given a statistical model, maximum-likelihood estimation provides estimates for the model's parameters.
Generally you need two steps before you can apply MLE:
obtain a dataset
identify a statistical model
At this point, if you can obtain an analytic-form solution for the MLE estimate, just stream your data into the MLE calculation. For example, for a Gaussian distribution, to estimate the mean you just accumulate the running sum and keep the count; the sample mean will be your MLE estimate.
However, when the model involves many parameters and its pdf is highly non-linear, the MLE estimate must be sought numerically using nonlinear optimization algorithms. If your data size is huge, try stochastic gradient descent, where the true gradient is approximated by the gradient at a single example. As the algorithm sweeps through the training set, it performs the update for each training example in turn, so you can still stream your data one example at a time to your update program, over multiple sweeps. This way, memory constraints should not be a problem at all.
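A minimal sketch of the streaming Gaussian case (data.txt is a placeholder name for a file with one numeric value per line):

```python
# Streaming MLE for a Gaussian: accumulate count, sum, and sum of squares,
# so memory use stays constant no matter how many GB the file is.
count = 0
total = 0.0
total_sq = 0.0

with open("data.txt") as f:
    for line in f:
        x = float(line)
        count += 1
        total += x
        total_sq += x * x

mean_mle = total / count                      # MLE of the Gaussian mean
var_mle = total_sq / count - mean_mle ** 2    # MLE of the variance (divides by n)
```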
