I'm starting to learn about neural networks and I came across data normalisation. I understand the need for it but I don't quite know what to do with my data once my model is trained and in the field.
Let's say I take my input data, subtract its mean, and divide by its standard deviation. I then use the result as input to train my neural network.
Once in the field, what do I do with my input sample on which I want a prediction?
Do I need to keep my training data mean and standard deviation and use that to normalise?
Correct. The mean and standard deviation that you use to normalize the training data are the same ones you use to normalize the test data (i.e., don't compute a separate mean and standard deviation for the test data).
Hopefully this link will give you more helpful info: http://cs231n.github.io/neural-networks-2/
An important point to make about the preprocessing is that any preprocessing statistics (e.g. the data mean) must only be computed on the training data, and then applied to the validation / test data. E.g. computing the mean and subtracting it from every image across the entire dataset and then splitting the data into train/val/test splits would be a mistake. Instead, the mean must be computed only over the training data and then subtracted equally from all splits (train/val/test).
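As a rough illustration of the point above, here is a minimal sketch in plain Python. The function names (`fit_standardizer`, `standardize`) and the toy numbers are made up for the example; the only thing that matters is that the statistics come from the training split and are reused unchanged for every other split.

```python
import statistics

def fit_standardizer(train_values):
    """Compute the mean and standard deviation from the training split only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return mean, std

def standardize(values, mean, std):
    """Apply previously computed statistics to any split."""
    return [(v - mean) / std for v in values]

# Toy one-feature data, already split (hypothetical values).
train = [2.0, 4.0, 6.0, 8.0]
val = [5.0, 7.0]
test = [1.0, 9.0]

# Statistics are computed once, on the training split.
mean, std = fit_standardizer(train)

# The same mean and std are then applied to all three splits.
train_z = standardize(train, mean, std)
val_z = standardize(val, mean, std)
test_z = standardize(test, mean, std)
```

Note that `test_z` contains values outside the range of `train_z`; that is expected, because the test split was never used to compute the statistics.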
I am learning to use autoencoders for anomaly detection. I am using the model provided on the keras website, with data that I have created. My ‘correct data’ used for training the model is a single frequency sine wave with random amplitudes and my ‘anomaly data’ used for testing the model is the original data, with a sine wave of a different frequency superimposed.
In the time domain the model works very well. However, in the frequency domain (I applied an FFT), because the data is almost all zeros with only 1 or 2 nonzero values, the sensitivity of the loss function to these data points is extremely low. The model is ultimately able to detect the difference between the correct data and the anomalies with 100% precision, but the loss is near 0 in all cases. If we had a scenario where the non-dominant frequencies were a small number instead of 0, the ML model would surely have no chance of making the correct prediction. Please see the following graphs:
Example of correct data + prediction
Example of anomaly data + prediction
Histogram of predictions, showing clear separation into 2 groups
What can I do to make my model loss more sensitive to the desired data? One idea I have is to remove the zero data. However, this means that my input data length is not fixed, and I don’t even know if the CNN can handle this. I don’t want to switch to an LSTM, as the convolution is important to me so that the network can handle various frequencies without being explicitly trained on those frequencies.
What other options do I have? Perhaps a loss function which can give higher importance to these values? I do understand that if I have noisy data I can clean it, but I am still concerned that this will not be sufficient for less 'black and white' scenarios.
When doing data pre-processing, it is suggested to apply either scaling or normalization. This is easy when you have all the data on hand and can do it right away. But after the model is built and running, does new incoming data need to be scaled or normalized as well? If so, and it is only a single row, how do I scale or normalize it? How do we know the min/max/mean/stdev of each feature? And what if the incoming value is itself the min/max/mean of a feature?
Please advise
First of all you should know when to use scaling and normalization.
Scaling - Scaling simply transforms your features to comparable magnitudes. Say you have features such as a person's income, and you notice that some have values of order 10^3 and some of order 10^6. If you model your problem with these features, algorithms like KNN and Ridge Regression will give more weight to the attributes with higher magnitude. To prevent this, you first need to scale your features. The min-max scaler is one of the most commonly used scalers.
Mean Normalisation -
If, after examining the distribution of a feature, you find that it is not centered around zero, then algorithms like SVM, whose objective function already assumes zero mean and variance of the same order, can run into modeling problems. In that case you should apply mean normalisation.
Standardization - For algorithms like SVM, neural networks, and logistic regression, the variances of the features need to be of the same order, so why not make them all equal to one? In standardization, we transform the distribution of each feature to zero mean and unit variance.
Now let's try to answer your question in terms of training and testing set.
So let's say you are training your model on 50k dataset and testing on 10k dataset.
For all three transformations above, the standard approach is to fit the normalizer or scaler on the training dataset only and to apply only the transform to the testing dataset.
In our case, if we want to use standardization, we first fit our standardizer on the 50k training dataset and then use it to transform both the 50k training dataset and the testing dataset.
Note - We shouldn't fit our standardizer on the test dataset; instead, we use the already fitted standardizer to transform the testing dataset.
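To make the fit/transform split concrete, and to cover the case from the question where a single row arrives after deployment, here is a small sketch. The `Standardizer` class, its method names, and the toy rows are hypothetical and written for this example only; a library scaler with the same fit-then-transform pattern would normally be used instead.

```python
import json
import statistics

class Standardizer:
    """Illustrative per-feature standardizer with explicit fit and transform steps."""

    def fit(self, rows):
        # Compute one mean and one standard deviation per feature column.
        columns = list(zip(*rows))
        self.means = [statistics.fmean(c) for c in columns]
        # Fall back to 1.0 for constant columns to avoid dividing by zero.
        self.stds = [statistics.pstdev(c) or 1.0 for c in columns]
        return self

    def transform(self, rows):
        # Uses the stored statistics; never recomputes them.
        return [
            [(x - m) / s for x, m, s in zip(row, self.means, self.stds)]
            for row in rows
        ]

    def to_json(self):
        return json.dumps({"means": self.means, "stds": self.stds})

    @classmethod
    def from_json(cls, text):
        params = json.loads(text)
        obj = cls()
        obj.means, obj.stds = params["means"], params["stds"]
        return obj

# Training time: fit on the training rows only, then save the parameters.
train_rows = [[1000.0, 20.0], [3000.0, 40.0], [5000.0, 60.0]]
scaler = Standardizer().fit(train_rows)
saved = scaler.to_json()

# Later, in production: restore the parameters and transform one incoming row.
restored = Standardizer.from_json(saved)
new_row = [3000.0, 50.0]
scaled_row = restored.transform([new_row])[0]
```

The single incoming row is never used to compute statistics; it is only passed through `transform` with the parameters saved at training time.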
Yes, you need to apply normalization to the input data, else the model will predict nonsense.
You also have to save the normalization coefficients that were used during training, or from training data. Then you have to apply the same coefficients to incoming data.
For example if you use min-max normalization:
f_n = (f - min(f)) / (max(f) - min(f))
Then you need to save the min(f) and max(f) in order to perform normalization for new data.
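A minimal sketch of that idea, assuming a single feature stored as a Python list (the helper names `fit_min_max` and `min_max_scale` are invented for the example):

```python
def fit_min_max(train_values):
    """Return the min and max of the feature, computed on training data only."""
    return min(train_values), max(train_values)

def min_max_scale(value, f_min, f_max):
    """Apply the saved coefficients to a single raw value."""
    return (value - f_min) / (f_max - f_min)

# Training time: compute and save the coefficients (hypothetical values).
train_f = [10.0, 20.0, 30.0, 50.0]
f_min, f_max = fit_min_max(train_f)

# Prediction time: reuse the saved min and max for a new value.
scaled_new = min_max_scale(40.0, f_min, f_max)
```

In a real system `f_min` and `f_max` would be stored alongside the model so that they are available wherever predictions are made.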
Before applying SVM on my data I want to reduce its dimension by PCA. Should I separate the Train data and Test data then apply PCA on each of them separately or apply PCA on both sets combined then separate them?
Actually both provided answers are only partially right. The crucial part here is what is the exact problem you are trying to solve. There are two basic possible settings which can be considered, and both are valid under some assumptions.
Case 1
You have some data (which you split into train and test), and in the future you will get more data coming from the same distribution.
If this is the case, you should fit PCA on the train data, then fit the SVM on its projection. For testing, you just apply the already fitted PCA followed by the already fitted SVM, and you do exactly the same for any new data that arrives. This way your test error (under some "size assumptions") should approximate your expected error.
Case 2
You have some data (which you split into train and test), and in the future you will obtain a big chunk of unlabeled data, at which point you will be able to fit your model again.
In such a case, you fit PCA on all the data provided, train the SVM on the labeled part (train set), and evaluate on the test set. This way, once new data arrives, you can fit PCA using both your data and the new data, and then train the SVM on your old data (as it is the only data that has labels). Again, under the assumption that the data comes from the same distribution, everything is correct here. You use more data to fit PCA only to get a better estimator (maybe your data is really high dimensional and PCA fails with a small sample?).
You should do them separately. If you run PCA on both sets combined, you are going to introduce a bias into your SVM. The goal of the test set is to see how your algorithm will perform without prior knowledge of the data.
Learn the Projection Matrix of PCA on the train set and use this to reduce the dimensions of the test data.
One benefit is that this way you don't have to rely on collecting sufficient data in the test set if you are applying your classifier at run time, where test data comes in one sample at a time.
Also, I think running separate PCAs on train and test will fail. Why?
Think of PCA as giving you features, and then you learn a classifier over these features. If your data shifts over time, then the test features you get using PCA will be different, and you don't have a classifier trained on those features. Even if the set of directions/features of the PCA remains the same but their order varies, your classifier still fails.
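The "learn the projection on the train set, reuse it on the test set" approach can be sketched in plain Python as follows. This is only an illustration under simplifying assumptions: it keeps a single principal component and finds it by power iteration, and the function names (`fit_pca_1d`, `project`) and the toy data are made up. In practice a library PCA would be fitted on the training data and then used to transform the test data.

```python
import math

def fit_pca_1d(train_rows, iterations=200):
    """Estimate the training mean and first principal component (power iteration)."""
    n = len(train_rows)
    dim = len(train_rows[0])
    mean = [sum(r[j] for r in train_rows) / n for j in range(dim)]
    centered = [[r[j] - mean[j] for j in range(dim)] for r in train_rows]
    # Sample covariance matrix of the training data.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]
    # Repeatedly multiply by the covariance matrix and renormalize.
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def project(rows, mean, component):
    """Project rows using the mean and component learned on the training set."""
    return [
        sum((r[j] - mean[j]) * component[j] for j in range(len(mean)))
        for r in rows
    ]

# Toy 2-D data lying roughly along the direction (1, 2).
train_rows = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
test_rows = [[2.5, 5.0]]

# Fit on the training set only.
mean, component = fit_pca_1d(train_rows)

# Reuse the same mean and component for both sets.
train_proj = project(train_rows, mean, component)
test_proj = project(test_rows, mean, component)
```

Because the test rows go through the same `mean` and `component`, a classifier trained on `train_proj` sees test features expressed in the same coordinates, even when test samples arrive one at a time.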
I have a backpropagation neural network that I created and coded in Q with a Kdb+ database.
I am pre-processing the data fed into the network by normalizing it into the range [0,1]. The network is trained on, and predicts, future moving averages from a large data set split 60:20:20.
Normalization formula:
processed data: (0.8*(VALn - MINn)/(MAXn - MINn))+0.1
VALn = unprocessed data value
MAXn = max of data set
MINn = min of data set
How do I go about normalizing new data into the final trained network?
Would I run new inputs through the above formula keeping the MIN and MAX values from the training set?
Thanks
You should keep the same MAXn and MINn, since changing it at test time would mean that you are changing how raw data is mapped to processed data. For a quick check, try the preprocessing with different MAXn and MINn, then try to predict the training cases. You will get lower performance since the normalized data do not look as they did before.
Note that if you happen to have data in the test set that is higher than MAXn or lower than MINn, then those data will fall outside the range seen during training after normalization (with the formula above, outside [0.1, 0.9]). This is generally okay if there are not too many such cases. It simply means the neural net is seeing data that is a little outside the previously seen range.
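Here is a small sketch of the formula from the question with MINn and MAXn frozen at their training values. It is written in Python rather than Q purely for illustration, and the function names and numbers are invented for the example.

```python
def fit_range(train_values):
    """MINn and MAXn, taken from the training set only."""
    return min(train_values), max(train_values)

def normalize(val, min_n, max_n):
    """The formula from the question: 0.8 * (VALn - MINn) / (MAXn - MINn) + 0.1."""
    return 0.8 * (val - min_n) / (max_n - min_n) + 0.1

# Training time (hypothetical values): keep these two numbers with the model.
train_values = [100.0, 150.0, 200.0]
min_n, max_n = fit_range(train_values)

# Prediction time: new inputs go through the same formula and the same MINn/MAXn.
inside = normalize(150.0, min_n, max_n)  # within the training range
above = normalize(225.0, min_n, max_n)   # larger than anything seen in training
```

Here `above` comes out larger than 0.9, which is the out-of-range case described above: the mapping is still well defined, the input is just outside what the network saw during training.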
Yes, for prediction you should use the same normalization formula that you used to normalize the training data.
I have this 5-5-2 backpropagation neural network I'm training, and after reading this awesome article by LeCun I started to put in practice some of the ideas he suggests.
Currently I'm evaluating it with a 10-fold cross-validation algorithm I made myself, which goes basically like this:
    for each epoch
        for each possible split (training, validation)
            train and validate
        end
        compute mean MSE between all k splits
    end
My inputs and outputs are standardized (0-mean, variance 1) and I'm using a tanh activation function. All network algorithms seem to work properly: I used the same implementation to approximate the sin function and it does so pretty well.
Now, the question is as the title implies: should I standardize each train/validation set separately or do I simply need to standardize the whole dataset once?
Note that if I do the latter, the network doesn't produce meaningful predictions, but I prefer having a more "theoretical" answer than just looking at the outputs.
By the way, I implemented it in C, but I'm also comfortable with C++.
You will most likely be better off standardizing each training set individually. The purpose of cross-validation is to get a sense for how well your algorithm generalizes. When you apply your network to new inputs, the inputs will not be ones that were used to compute your standardization parameters. If you standardize the entire data set at once, you are ignoring the possibility that a new input will fall outside the range of values over which you standardized.
So unless you plan to re-standardize every time you process a new input (which I'm guessing is unlikely), you should only compute the standardization parameters for the training set of the partition being evaluated. Furthermore, you should compute those parameters only on the training set of the partition, not the validation set (i.e., each of the 10-fold partitions will use 90% of the data to calculate standardization parameters).
So you assume the inputs are normally distributed and are subtracting the mean and dividing by the standard deviation to get N(0,1)-distributed inputs?
Yes, I agree with @bogatron that you should standardize each training set separately, but I would more strongly say it's a "must" not to use the validation set data as well. The problem is not values outside the range of the training set; that is fine, since the transformation is still defined for any value. You can't compute the mean / standard deviation over all the data because you can't use the validation data in training in any way, even if only via this statistic.
It should further be emphasized that you use the mean from the training set with the validation set, not the mean from the validation set. It has to be the same transformation of features that was used during training. It would not be valid to transform the validation set differently.
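A rough sketch of how this fits into a k-fold loop is below. It is in Python rather than C for brevity, and the splitting helper `k_fold_indices`, the toy data, and the omitted training step are all placeholders for the example.

```python
import statistics

def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(n_samples) if i not in held]
        yield train_idx, held_out

# Toy one-feature data set (hypothetical values).
data = [float(x) for x in range(1, 21)]

fold_stats = []
for train_idx, val_idx in k_fold_indices(len(data), k=10):
    train = [data[i] for i in train_idx]
    val = [data[i] for i in val_idx]

    # Standardization parameters come from this fold's training part only.
    mean = statistics.fmean(train)
    std = statistics.pstdev(train)

    # The same parameters are applied to both parts of the fold.
    train_z = [(x - mean) / std for x in train]
    val_z = [(x - mean) / std for x in val]

    fold_stats.append((mean, std, len(train), len(val)))
    # Train on train_z and compute the validation MSE on val_z here.
```

Each fold ends up with slightly different parameters, since each one computes them from a different 90% of the data.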