Do you apply min max scaling separately on training and test data? - machine-learning

While applying min max scaling to normalize your features, do you apply min max scaling on the entire dataset before splitting it into training, validation and test data?
Or do you split first and then apply min max on each set, using the min and max values from that specific set?
Lastly, when making a prediction on a new input, should the features of that input be normalized using the min and max values from the training data before being fed into the network?

Split it, then scale. Imagine it this way: you have no idea what real-world data looks like, so you couldn't scale the training data to it. Your test data is the surrogate for real-world data, so you should treat it the same way.
To reiterate: Split, scale your training data, then use the scaling from your training data on the testing data.
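A minimal sketch of that workflow with scikit-learn (the toy feature matrix and variable names here are illustrative, not from the question):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix: 10 samples, 2 features.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit ONLY on training data
X_test_scaled = scaler.transform(X_test)        # reuse the training min/max

# A new input at prediction time is scaled the same way,
# with the min/max learned from the training set:
x_new = np.array([[3.0, 8.0]])
x_new_scaled = scaler.transform(x_new)
```

Note that `X_test_scaled` can legitimately fall outside [0, 1]: the test set stands in for unseen real-world data, which has no say in the scaling.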

Related

Performance metrics after downsampling

I am working on a binary classification problem with an imbalanced dataset. I have decided to downsample the majority class and I’m wondering what the best approach is when calculating performance metrics on a model that has been trained on a downsampled dataset.
I noticed that the sklearn.metrics.precision_score and sklearn.metrics.recall_score functions have a sample_weight attribute. Is the purpose of this attribute to supply a weight for the downsampled class relative to the ratio in which I downsampled?
For example, if I had 1,000,000 samples for the negative class and I decided to downsample to 100,000, would I set the sample_weight attribute to be equal to 1,000,000 / 100,000 = 10 for the negative class?
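One way to see what sample_weight does in practice is a small sketch (the data here is made up; whether a weight of 10 correctly undoes your particular downsampling depends on your setup):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Tiny downsampled evaluation set: 1 = positive, 0 = negative (majority class).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 0])

# Weight each negative sample by 10 to mimic the original 10:1 class ratio.
weights = np.where(y_true == 0, 10.0, 1.0)

p_plain = precision_score(y_true, y_pred)
p_weighted = precision_score(y_true, y_pred, sample_weight=weights)
r_weighted = recall_score(y_true, y_pred, sample_weight=weights)
```

Here the single false positive counts 10 times as heavily in the weighted precision (weighted TP / (weighted TP + weighted FP)), so precision drops from 2/3 to 2/12, while recall is unchanged because only positives enter it and their weight is 1.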

Scaling data with large range in Machine learning preprocessing

I am very much new to Machine Learning.
And I am trying to apply ML on data containing nearly 50 features. Some features range from 0 to 1,000,000 and some range from 0 to 100 or even less than that. Now when I use feature scaling with MinMaxScaler for the range (0, 1), I think the features with a large range get scaled down to very small values, and this might keep me from getting good predictions.
I would like to know if there is some efficient way to do scaling so that all the features are scaled appropriately.
I also tried the standard scaler but accuracy did not improve.
Also, can I use one scaling function for some features and another for the remaining features?
Thanks in advance!
Feature scaling, or data normalization, is an important part of training a machine learning model. It is generally recommended that the same scaling approach is used for all features. If the scales for different features are wildly different, this can have a knock-on effect on your ability to learn (depending on what methods you're using to do it). By ensuring standardized feature values, all features are implicitly weighted equally in their representation.
Two common methods of normalization are:
Rescaling (also known as min-max normalization):

x' = (x - min(x)) / (max(x) - min(x))

where x is an original value, and x' is the normalized value. For example, suppose that we have the students' weight data, and the students' weights span [160 pounds, 200 pounds]. To rescale this data, we first subtract 160 from each student's weight and divide the result by 40 (the difference between the maximum and minimum weights).

Mean normalization:

x' = (x - mean(x)) / (max(x) - min(x))

where x is an original value, and x' is the normalized value.
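Both formulas are one-liners in NumPy; the weights array below is a made-up instance of the students' weight example:

```python
import numpy as np

def min_max_normalize(x):
    """Rescale to [0, 1]: x' = (x - min) / (max - min)."""
    return (x - x.min()) / (x.max() - x.min())

def mean_normalize(x):
    """Center on the mean, scaled by the range: x' = (x - mean) / (max - min)."""
    return (x - x.mean()) / (x.max() - x.min())

# Student weights spanning [160, 200] pounds.
weights = np.array([160.0, 170.0, 180.0, 200.0])
rescaled = min_max_normalize(weights)   # 160 -> 0.0, 200 -> 1.0
centered = mean_normalize(weights)      # values sum to zero
```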

Future-proofing feature scaling in machine learning?

I have a question about how feature scaling works after training a model.
Let's say a neural network model predicts the height of a tree by training on outside temperature.
The lowest outside temperature in my training data is 60F and the max is 100F. I scale the temperature between 0 and 1 and train the model. I save the model for future predictions. Two months later, I want to predict on some new data. But this time the min and max temperatures in my test data are -20F and 50F, respectively.
How does the trained model deal with this? The range I imposed the scaling on in the training set to generate my trained model does not match the test data range.
What would prevent me from hard-coding a range to scale to that I know the data will always be within, say from -50F to 130F? The problem I see here is if I have a model with many features. If I impose a different hard scale to each feature, using feature scaling is essentially pointless, is it not?
Different scales won't work. Your model is trained on one scale and learns that scale; if you change the scale at prediction time, the model will still assume the old one and make badly shifted predictions.
Training again would simply overwrite what was learned before.
So, yes, hard-code your scaling (preferably applied directly to your data, not inside the model).
And for a quality result, train with all the data you can gather.
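With a fixed range, the scaler becomes a pure function of the input, so training-time and prediction-time inputs are always mapped the same way. A sketch using the -50F to 130F range proposed in the question:

```python
def scale_fixed(temp_f, lo=-50.0, hi=130.0):
    """Map a temperature to [0, 1] using a fixed, hard-coded range."""
    return (temp_f - lo) / (hi - lo)

# 60F maps to the same value whether it appears during training
# or two months later at prediction time, regardless of the batch min/max.
train_val = scale_fixed(60.0)
new_val = scale_fixed(-20.0)
```

Each feature can have its own hard-coded (lo, hi) pair; as long as the same pair is used at training and prediction time, the features remain comparably scaled.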

Using PCA trained on a large data set for a smaller data set

Can I use a PCA subspace trained on, say, eight features and one thousand time points to evaluate a single reading? That is, if I keep, say, the top six components, my transformation matrix will be 8x6, and using it to transform test data the same size as the training data would give me a 6x1000 matrix.
But what if I want to look for anomalies at each time point independently? That is, rather than transform an 8x1000 test set at once, can I apply 1000 separate transformations to 8x1 test vectors and get the same result? Each vector gets transformed into the exact same spot as if it were a column of a much larger data matrix, but the distance of that one vector from the principal axis doesn't appear to be meaningful. When I perform this same procedure on the truncated reference data, this distance isn't zero either; only the sum of all distances over the entire reference data set is zero. So if I can't show that the reference data is not "anomalous", how can I use this on test data?
Is it the case that the size of the data "object" used to train pca is the size of object that can be evaluated with it?
Thanks for any help you can give.
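The first part of the question can be checked directly: a fitted PCA projects a single 8-dimensional reading to exactly the same point whether it is transformed alone or as one row of a larger matrix. A sketch with synthetic data (the shapes mirror the question; everything else is made up):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 8))  # 1000 time points, 8 features

pca = PCA(n_components=6).fit(X_train)

X_test = rng.normal(size=(1000, 8))
batch = pca.transform(X_test)       # all 1000 time points at once -> (1000, 6)
single = pca.transform(X_test[:1])  # the first time point on its own -> (1, 6)

same = np.allclose(batch[0], single[0])  # the projections agree
```

So evaluating one reading at a time is fine; the separate issue of whether a single point's reconstruction distance is meaningful for anomaly detection is a statistical question, not a shape question.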

How can I train a naivebayes classifier incrementally?

Using Accord.NET I've created a NaiveBayes classifier. It will classify a pixel based on 6 or so sets of image processing results. My images are 5MP, so a training set of 50 images creates a very large set of training data.
6 int array per pixel * 5 million pixels * 50 images.
Instead of trying to store all that data in memory, is there a way to incrementally train the NaiveBayes classifier? Calling Learn() multiple times overwrites the old data each time rather than adding to it.
Right now it is not possible to train a Naive Bayes model incrementally using Accord.NET.
However, since all Naive Bayes does is fit some distributions to your data, and since your data has very few dimensions, maybe you could learn your model on a subsample of your data rather than all of it at once.
When loading images to build your training set, you can randomly discard x% of the pixels in each image. You can also plot the classifier's accuracy for different values of x to find the best balance between memory and accuracy for your model (hint: for such a small model and this large amount of training data, I expect it won't make much of a difference even if you drop 50% of your data).
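The subsampling idea is language-agnostic; here is a sketch in Python/NumPy (the question uses Accord.NET in C#, so treat this as pseudocode for the equivalent step in your loader, with made-up shapes standing in for the 6-value-per-pixel data):

```python
import numpy as np

def subsample_pixels(features, labels, keep_fraction=0.5, seed=0):
    """Randomly keep only a fraction of per-pixel training rows.

    features: (n_pixels, 6) array of image-processing results
    labels:   (n_pixels,) array of per-pixel classes
    """
    rng = np.random.default_rng(seed)
    mask = rng.random(len(features)) < keep_fraction
    return features[mask], labels[mask]

# Toy stand-in for one image's per-pixel data (real images have ~5M pixels).
feats = np.zeros((100_000, 6), dtype=np.int32)
labs = np.zeros(100_000, dtype=np.int8)
small_feats, small_labs = subsample_pixels(feats, labs, keep_fraction=0.1)
```

Applying this per image as you load it keeps peak memory proportional to keep_fraction, and you can sweep keep_fraction to plot the accuracy/memory trade-off described above.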
