I'm normalizing and rescaling my training set with:
# zero mean
feat = (feat - feat.mean()) / feat.std()
# scale between -1, 1
feat = ((feat - feat.min()) / (feat.max() - feat.min())) * 2 - 1
This works great. I transform the test set in the exact same way, using the mean, std, min, and max from the training set. This works fine if the range (min and max) of the test set matches the training set. However, if the range of the untransformed feature in the test set is different, I will end up with values beyond [-1, 1] after rescaling. How can this be addressed?
If a large proportion of your test inputs are coming in with values higher or lower than the extremes that you used to train the model, then you should ideally retrain your model, since your train and test distributions are different.
For unusual (outlier-like) test instances, you could clip the values to the training min/max before applying min-max scaling.
In the case of standardization (z-scoring), test values can fall anywhere; you would simply get a large z-score for the extremes.
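A minimal sketch of the clipping idea, assuming NumPy and using made-up train_feat/test_feat arrays purely for illustration:
import numpy as np

# hypothetical train/test feature columns for illustration
train_feat = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
test_feat = np.array([0.5, 2.5, 7.0])   # 0.5 and 7.0 fall outside the training range

# statistics computed on the training set only
feat_min, feat_max = train_feat.min(), train_feat.max()

# clip outlier-like test values to the training range, then rescale to [-1, 1]
test_clipped = np.clip(test_feat, feat_min, feat_max)
test_scaled = (test_clipped - feat_min) / (feat_max - feat_min) * 2 - 1
# test_scaled is now guaranteed to stay within [-1, 1]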
I think the only way is to normalize your data with the min and max of all the data (training and test sets together).
Related
I have a dataset with some outliers, which are 10 or 100 times greater than the normal values. I cannot throw out these rows, and I want to normalize this data into the interval [0, 1].
First of all, here's what I thought to do:
Simply rank the dataset's rows and use the rank positions as the variable to normalize. Since ranks are uniformly distributed, this is easy. The problem is that the differences between values are not preserved, so values with a large gap can end up with similar normalized values if there are no intermediate values in the dataset (see the sketch after this list).
Use the sklearn.preprocessing.RobustScaler method. But I got values between -0.4 and 300, which is still not a useful scale to normalize to.
Linearly map all values at or below the 0.8 quantile onto [0, 0.8], and distribute the remaining values over (0.8, 1.0] in a way similar to the ranking strategy mentioned above.
Run a 1D k-means to group nearby values and obtain a cluster of non-outlier values. For those values, distribute normalized values between 0 and the quantile that cluster represents, simply by doing (value - mean) / (max - min); for the remaining outlier values, distribute the range between that quantile and 1 using the ranking strategy.
Create a filter function, such as a sigmoid, and multiply the values by it. Smaller values remain roughly unchanged, while outlier values are pulled toward the non-outlier range; then I normalize. But how do I choose the sigmoid's parameters?
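A minimal sketch of the rank-based idea (strategy 1), assuming SciPy is available; the data array is made up for illustration:
import numpy as np
from scipy.stats import rankdata

# hypothetical data with two extreme outliers
data = np.array([1.0, 2.0, 3.0, 4.0, 500.0, 5000.0])

# ranks run from 1..n; rescale them to [0, 1]
ranks = rankdata(data, method='average')
normalized = (ranks - 1) / (len(data) - 1)
# normalized == [0.0, 0.2, 0.4, 0.6, 0.8, 1.0] -- the outliers no longer dominate,
# but the size of the gaps between values is lost, as noted above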
First of all, I would like to get some feedback on these strategies: what do you think of them?
Also, how is this problem normally solved? Are there any references you can recommend?
Thank you =)
I have a set of inputs with roughly 5,000 features whose values vary from 0.005 to 9,000,000. Within a single feature the values are similar (a feature with values around 10 will not also have values around 0.1).
I am trying to apply linear regression to this data set, however, the wide range of input values is inhibiting effective gradient descent.
What is the best way to handle this variance? If normalization is best, please include details on the best way to implement this normalization.
Thanks!
Simply perform standardization as a pre-processing step. You can do it as follows:
1) Calculate the mean of each feature in the training set and store it. Be careful not to mix up the feature mean with the sample mean; you should end up with a vector of size [number_of_features] (~5,000).
2) Calculate the standard deviation of each feature in the training set and store it, also of size [number_of_features].
3) Update each training and testing entry as:
updated = (original_vector - mean_vector) / std_vector
That's it!
The code will look like:
import numpy as np

# train_data shape: [train_length, 5000]
# test_data shape:  [test_length, 5000]
mean = np.mean(train_data, axis=0)   # per-feature mean, shape (5000,)
std = np.std(train_data, axis=0)     # per-feature std, shape (5000,)
normalized_train_data = (train_data - mean) / std
normalized_test_data = (test_data - mean) / std
Here is the setup:
test_observations: 6,767
train_observations: 73,268
train/test batch_size: 50
How should I set the batch_size, test_iter, test_interval, max_iter?
Thank you!
So your validation size is 6,767 and your validation batch size is 50.
Your test_iter = validation set size / validation batch size = 6,767 / 50 ≈ 135, so that one test pass almost covers the validation set. For test_interval you can choose any value; it is the number of training iterations after which the network evaluates its performance on the validation set. For larger networks people use values like 5k for test_interval; for your network a test_interval of 1,000 seems fine.
To find max_iter, choose the number of epochs you want to run, i.e., the number of times you want to pass over the training set (say 2 here; choose this number carefully so you do not overfit the network). Note that Caffe currently has no explicit notion of an epoch, but you get the same effect with this formula.
max_iter = #epochs * (training set size / training_batch_size) = 2 * (73,268 / 50) ≈ 2,931, so the solver passes over your training set twice, and after every test_interval (1,000) training iterations it validates on your 6,767 images.
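A small sketch of the arithmetic above, just to make the formulas explicit; the variable names are illustrative only:
# assumed numbers from the question
train_size, val_size = 73268, 6767
train_batch, val_batch = 50, 50
epochs = 2

test_iter = round(val_size / val_batch)               # ~135 validation batches per test pass
test_interval = 1000                                   # validate every 1,000 training iterations
max_iter = round(epochs * train_size / train_batch)    # ~2,931 iterations = 2 passes over the training set

print(test_iter, test_interval, max_iter)              # 135 1000 2931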
I've been playing with some SVM implementations and I am wondering - what is the best way to normalize feature values to fit into one range? (from 0 to 1)
Let's suppose I have 3 features with values in ranges of:
3 to 5
0.02 to 0.05
10 to 15
How do I convert all of those values into range of [0,1]?
What if, during training, the highest value of feature number 1 that I encounter is 5, and after I start using my model on much bigger datasets I stumble upon values as high as 7? Then in the converted range it would exceed 1...
How do I normalize values during training to account for the possibility of "values in the wild" exceeding the highest (or lowest) values the model has seen during training? How will the model react to that, and how do I make it work properly when that happens?
Besides the scaling-to-unit-length method described by Tim, standardization is the most commonly used approach in machine learning. Please note that when your test data comes in, it makes more sense to use the mean and standard deviation from your training samples for this scaling. If you have a very large amount of training data, it is reasonably safe to assume it follows a normal distribution, so the probability that new test data falls far out of range is not that high. Refer to this post for more details.
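A minimal sketch with scikit-learn's StandardScaler, fitting on the training data only and reusing those statistics on the test data; the arrays are placeholders:
import numpy as np
from sklearn.preprocessing import StandardScaler

# placeholder data: 3 features with very different ranges, as in the question
X_train = np.array([[3.0, 0.02, 10.0],
                    [4.0, 0.03, 12.0],
                    [5.0, 0.05, 15.0]])
X_test = np.array([[7.0, 0.04, 11.0]])   # feature 1 exceeds the training max

scaler = StandardScaler().fit(X_train)    # learns per-feature mean and std from training data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # out-of-range values simply get larger z-scores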
You normalise a vector by converting it to a unit vector. This trains the SVM on the relative values of the features, not the magnitudes. The normalisation algorithm will work on vectors with any values.
To convert to a unit vector, divide each value by the length of the vector. For example, a vector of [4 0.02 12] has a length of 12.6491. The normalised vector is then [4/12.6491 0.02/12.6491 12/12.6491] = [0.316 0.0016 0.949].
If "in the wild" we encounter a vector of [400 2 1200] it will normalise to the same unit vector as above. The magnitudes of the features is "cancelled out" by the normalisation and we are left with relative values between 0 and 1.
[predicted_label, accuracy, decision_values/prob_estimates] = svmpredict(testing_label_vector, testing_instance_matrix, model [, 'libsvm_options']);
1. I am using libsvm for image classification in MATLAB. What do testing_label_vector, testing_instance_matrix, decision_values/prob_estimates, and, most importantly, accuracy mean in svmpredict?
2. If I am using it for testing to obtain an accuracy value, do I have to know the values for testing_label_vector?
(1)
testing_label_vector: the true labels of the data on which you want to test.
testing_instance_matrix: the data on which you want to test, one instance per row. The label of each data point is in testing_label_vector.
decision_values/prob_estimates: the decision values (or probability estimates, if the model was trained with probability output).
accuracy: the percentage of predicted labels that agree with the true labels.
(2)
Yes. You certainly need ground truth to compute accuracy.
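Just to illustrate what the accuracy output means (this is only the definition, not libsvm code; the labels are made up):
import numpy as np

# hypothetical true and predicted labels
true_labels = np.array([1, 1, 2, 3, 2])
predicted_labels = np.array([1, 2, 2, 3, 2])

accuracy = np.mean(predicted_labels == true_labels) * 100   # 80.0%
# without the true labels (ground truth), this comparison is impossible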