Intuition behind standard deviation as a threshold and why - machine-learning

I have a set of input/output training data; a few samples are:
Input         Output
[1 0 0 0 0] [1 0 1 0 0]
[1 1 0 0 1] [1 1 0 0 0]
[1 0 1 1 0] [1 1 0 1 0]
and so on. I need to apply the standard deviation of the entire output as a threshold, so I calculate the mean standard deviation over the outputs. The application is that the model, when presented with this data, should be able to learn and predict the output. There is a condition in my objective function design: the distance, defined as the sum of the square roots of the Euclidean distances between the model outputs and the desired targets corresponding to each input, should be less than a threshold.
My question is: how should I justify the use of this threshold? Is it justified? I read an article which says that it is common to take the standard deviation as the threshold.
For my case, what does taking the standard deviation of the output of the training data mean?
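To make the question concrete, here is a minimal sketch, assuming NumPy and assuming the per-dimension interpretation of "mean standard deviation"; the target array is just the toy samples above, and the model outputs are made up purely to illustrate the distance condition:

import numpy as np

# Toy training outputs from the question (rows = samples, columns = output dimensions)
Y_target = np.array([[1, 0, 1, 0, 0],
                     [1, 1, 0, 0, 0],
                     [1, 1, 0, 1, 0]], dtype=float)

# One reading of "mean standard deviation of the output": per-dimension std, averaged to a scalar
threshold = Y_target.std(axis=0).mean()

# Made-up model outputs, only to illustrate the condition from the question:
# sum over samples of sqrt(Euclidean distance between model output and target) < threshold
Y_model = np.array([[1, 0, 1, 0, 0],
                    [1, 1, 0, 0, 0],
                    [1, 1, 0, 0, 1]], dtype=float)
distance = np.sqrt(np.linalg.norm(Y_model - Y_target, axis=1)).sum()
condition_met = distance < threshold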

There is no deep intuition/philosophy behind the standard deviation (or variance); statisticians like these measures largely because they are mathematically easy to work with thanks to various nice properties. See https://math.stackexchange.com/questions/875034/does-expected-absolute-deviation-or-expected-absolute-deviation-range-exist
There are quite a few other ways to perform various forms of outlier detection, belief revision, etc., but they can be more mathematically challenging to work with.

I am not sure this idea applies. You are looking at the definition of standard deviation for a univariate value, but your output is multivariate. There are multivariate analogs, but it's not clear why you need to apply one here.
It sounds like you are minimizing the squared error, or Euclidean distance, between the model output and the known correct output. That's fine, and it makes me think you're predicting the multivariate output shown here. What is the threshold doing then? What quantity is supposed to be less than what measure, of what, from what?

Related

Implementing a neural network classifier for my data, but is it solvable this way?

I will try to explain what the problem is.
I have 5 materials, each composed of 3 different minerals out of a set of 10 different minerals. For each material I have measured the intensity vs wavelength, and each intensity-vs-wavelength vector can be mapped to a binary vector of ones and zeros corresponding to the minerals the material is composed of.
So material 1 has an intensity of [0.51 0.53 0.57 0.68...... ] measured at different wavelengths [470 480 490 500 510 ......] and a binary vector
[1 0 0 0 1 0 0 1 0 0]
and so on for each material.
For each material I have 5000 examples, so 25000 examples in total. Each example will have a 'similar' intensity vs wavelength behaviour but will give the 'same' binary vector.
I want to design a NN classifier so that if I give it as an input the intensity vs wavelength, it gives me the corresponding binary vector.
The intensity vs wavelength has a length of 450, so I will have 450 units in the input layer.
The binary vector has a length of 10, so 10 output neurons.
The hidden layer(s) will have, as a starting point, 200 neurons.
Can I simply design a NN classifier this way, and would it solve the problem, or do I need something else?
You can do that; however, be careful to use the right cost function and output-layer activation. In your case, you should use sigmoid units for your output layer and binary cross-entropy as the cost function.
Another way to go about this would be to use one-hot encoding so that you can use normal multi-class classification (this will probably not make sense here, since your output is probably sparse).
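For what it's worth, a rough sketch of such a network, assuming Keras; the layer sizes come from the question, while the optimizer, epoch count and the random placeholder data are only illustrative:

import numpy as np
from tensorflow import keras

# 450 intensity values in, 10 mineral indicators out (multi-label, so sigmoid outputs + binary cross-entropy)
model = keras.Sequential([
    keras.layers.Input(shape=(450,)),
    keras.layers.Dense(200, activation='relu'),
    keras.layers.Dense(10, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['binary_accuracy'])

# Placeholders standing in for the 25000 intensity vectors and their binary mineral vectors
X = np.random.rand(25000, 450)
Y = (np.random.rand(25000, 10) < 0.3).astype('float32')
model.fit(X, Y, epochs=10, batch_size=128, validation_split=0.1)

# At prediction time, threshold the sigmoid outputs at 0.5 to recover the binary vector
binary_pred = (model.predict(X[:5]) > 0.5).astype(int)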

Normalization of data before activation function

I am adapting this tutorial to MATLAB, where I am trying to classify into a 1/0 class. Each of my data points x has dimension 30, that is, it has 30 features. This is my first NN.
My problem is that when I try to calculate a1=np.tanh(z1), or in MATLAB a1 = tanh(z1);, I get values of either 1 or -1, since |z1|>2.
Should I normalize the values?
Are there any steps I missed in the tutorial that should keep z1 within the -2 < z1 < 2 range?
Am I correct in assuming that z1 stepping outside these boundaries is the problem?
Input values should always be normalized, usually to the [0, 1] range; otherwise the network might not train.
Another thing worth noting is that you are using tanh as the activation, and this function saturates at the extremes, which means a near-zero gradient. Other activation functions such as the ReLU (max(0, x)) don't have this problem, so it is worth a try.
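As a small illustration, NumPy only, with random matrices standing in for the real 30-feature data and the tutorial's weights: min-max scaling the inputs keeps z1 from landing deep in the saturated region of tanh.

import numpy as np

# Placeholder training matrix: 100 samples x 30 features, deliberately on a large scale
X_train = 50 * np.random.rand(100, 30)

# Min-max scale each feature to [0, 1] using training-set statistics
x_min, x_max = X_train.min(axis=0), X_train.max(axis=0)
X_scaled = (X_train - x_min) / (x_max - x_min)

# With scaled inputs and small initial weights, |z1| rarely exceeds the saturating range of tanh
W1 = 0.1 * np.random.randn(30, 5)
z1 = X_scaled @ W1
a1_tanh = np.tanh(z1)          # stays away from the flat +/-1 plateaus
a1_relu = np.maximum(0, z1)    # ReLU alternative: no saturation for positive inputs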

How to predict a continuous dependent variable that expresses target class probabilities?

My samples can belong either to class 0 or to class 1, but for some of my samples I only have a probability of them belonging to class 1. So far I've discretized my target variable by applying a threshold, i.e., all samples with y >= t I assigned to class 1, and I've discarded the remaining samples that still have a non-zero probability of belonging to class 1. Then I fitted a linear SVM to the data using scikit-learn.
Of course, this way I throw away quite a bit of the training data. One idea I had was to omit the discretization and use regression instead, but usually it's not a good idea to approach classification by regression since, for example, it doesn't guarantee that predicted values lie in the interval [0,1].
By the way, the nature of my features x is similar, as for some of them I also only have a probability that the respective feature is present. In terms of error, it didn't make a big difference whether I discretized my features in the same way I discretized the dependent variable.
You might be able to approximate this using sample weighting: assign each sample to the class with the highest probability, but weight that sample by the probability of it actually belonging to that class. Many of the scikit-learn estimators allow for this.
Example:
X = [1, 2, 3, 4] with class 0 at probability .7 would become X = [1, 2, 3, 4], y = [0], with a sample weight of .7. You might also rescale the sample weights so they span 0 to 1 (since in this scheme your probabilities, and hence your sample weights, will only range from .5 to 1). You could also incorporate non-linear penalties to "strengthen" the influence of high-probability samples.
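A minimal sketch of that weighting scheme with scikit-learn; the feature matrix and probabilities below are made up, and SVC.fit is used because it accepts a sample_weight argument:

import numpy as np
from sklearn.svm import SVC

# Made-up feature matrix and class-1 probabilities for six samples
X = np.array([[1, 2, 3, 4], [2, 1, 0, 3], [0, 1, 1, 5],
              [4, 4, 2, 1], [3, 0, 2, 2], [1, 1, 4, 0]], dtype=float)
p1 = np.array([0.3, 0.1, 0.95, 0.5, 0.2, 0.85])   # P(class 1) per sample

# Assign each sample to its most probable class and weight it by that probability
y = (p1 >= 0.5).astype(int)
w = np.where(y == 1, p1, 1 - p1)

clf = SVC(kernel='linear')
clf.fit(X, y, sample_weight=w)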

How to interpret SVM-light results

I'm using SVM-light, as described in its tutorial, to classify data into 2 classes:
Train file:
+1 6357:1 8984:1 11814:1 15465:1 16031:1
+1 6357:1 7629:0.727 7630:42 7631:0.025
-1 6357:1 11814:1 11960:1 13973:1
...
And test file:
0 6357:1 8984:1 11814:1 15465:1
0 6357:1 7629:1.08 7630:33 7631:0.049 7632:0.03
0 6357:1 7629:0.069 7630:6 7631:0.016
...
By executing svm_learn.exe train_file model followed by svm_classify.exe test_file model output, I get some unexpected values in the output:
-1.0016219
-1.0016328
-1.0016218
-0.99985838
-0.99985853
Shouldn't it be exactly +1 or -1, like the classes in the train file? Or at least some float between -1 and +1, so that I could manually choose 0 as the decision boundary? To me it's a pretty unexpected situation when all of the numbers are just close to -1, and some of them are even lower.
UPD1: It's said that if the result is negative then the class is -1, and if it's positive, +1. I'm still wondering what the value beyond the sign means. I've just started exploring SVMs, so it may be an easy or stupid question :) And if I get pretty bad predictions, what steps should I take: other kernels? Or maybe some other options to make SVM-light more suitable for my data?
Short answer: just take the sign of the result
Longer answer:
An SVM takes an input and returns a real-valued output (which is what you are seeing).
On the training data, the learning algorithm tries to set the output to be >= +1 for all positive examples and <= -1 for all negative examples. Such points have no error. This gap between -1 and +1 is the "margin." Points in "no-man's land" between -1 and +1 and points on the completely wrong side (like a negative point with an output of >+1) are errors (which the learning algorithm is trying to minimize over the training data).
So, when testing, if the result is less than -1, you can be reasonably certain it is a negative example. If it is greater than +1, you can be reasonably certain it is a positive example. If it is in between, then the SVM is pretty uncertain about it. Usually, you must make a decision (and cannot say "I don't know") and so people use 0 as the cut-off between positive and negative labels.
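So, concretely, converting the classifier output to labels is just a sign check; a tiny sketch, assuming the decision values from the question have been read into a NumPy array:

import numpy as np

# Decision values as printed by svm_classify (taken from the question)
scores = np.array([-1.0016219, -1.0016328, -1.0016218, -0.99985838, -0.99985853])

# The predicted label is the sign; the magnitude indicates how far the point is from
# the decision boundary (|score| >= 1 means it lies outside the margin)
labels = np.where(scores >= 0, 1, -1)    # here: all five points are predicted as class -1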

Normalizing feature values for SVM

I've been playing with some SVM implementations and I am wondering: what is the best way to normalize feature values to fit into one range (from 0 to 1)?
Let's suppose I have 3 features with values in ranges of:
3 to 5
0.02 to 0.05
10 to 15
How do I convert all of those values into range of [0,1]?
What if, during training, the highest value of feature number 1 that I encounter is 5, and after I begin to use my model on much bigger datasets, I stumble upon values as high as 7? Then in the converted range it would exceed 1...
How do I normalize values during training to account for the possibility of "values in the wild" exceeding the highest (or lowest) values the model has "seen" during training? How will the model react to that, and how do I make it work properly when that happens?
Besides the scaling-to-unit-length method provided by Tim, standardization is the approach most often used in machine learning. Please note that when your test data comes in, it makes more sense to use the mean value and standard deviation from your training samples to do this scaling. If you have a very large amount of training data, it is reasonable to assume the features are approximately normally distributed, so the chance that new test data falls far out of range won't be that high. Refer to this post for more details.
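A small sketch of that standardization workflow with scikit-learn; the tiny train/test arrays below are made up to mimic the three feature ranges in the question:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test data with the three feature ranges from the question
X_train = np.array([[3.0, 0.02, 10.0], [4.0, 0.03, 12.0], [5.0, 0.05, 15.0]])
X_test = np.array([[7.0, 0.04, 13.0]])   # feature 1 exceeds the training range

scaler = StandardScaler().fit(X_train)   # learns the training mean and std
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)    # reuses training statistics; out-of-range values simply map further from 0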
You normalise a vector by converting it to a unit vector. This trains the SVM on the relative values of the features, not the magnitudes. The normalisation algorithm will work on vectors with any values.
To convert to a unit vector, divide each value by the length of the vector. For example, a vector of [4 0.02 12] has a length of 12.6491. The normalised vector is then [4/12.6491 0.02/12.6491 12/12.6491] = [0.316 0.0016 0.949].
If "in the wild" we encounter a vector of [400 2 1200], it will normalise to the same unit vector as above. The magnitudes of the features are "cancelled out" by the normalisation and we are left with relative values between 0 and 1.
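And a quick check of the unit-vector normalisation with NumPy/scikit-learn, reusing the worked numbers above:

import numpy as np
from sklearn.preprocessing import normalize

v = np.array([[4.0, 0.02, 12.0]])
v_unit = v / np.linalg.norm(v)                        # ~[0.316, 0.0016, 0.949], as in the worked example
v_wild = normalize(np.array([[400.0, 2.0, 1200.0]]))  # the "in the wild" vector scales to the same unit vector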

Resources