How to work out z-score normalisation? - machine-learning

I am confused about how to do z-score normalisation. I have found the equation for it, which requires the mean and standard deviation, but I'm not sure how to work these out in my situation.
I have 2 classifiers in my system. To use the scores together, I know that I need to normalise them because they will differ in scales, etc. I wish to use z-score normalisation for this. My question is, given the 2 scores from the two classifiers, what do I need to do with the scores to z-score normalise them? I want to be able to combine/compare them.
My (probably flawed!) understanding is that for a classifier score set we use the mean and the standard deviation. But we can't always assume we will already have a score set to get the mean and standard deviation from, can we?

To compute the z-scores of a given set of numbers you need the sample mean and the sample standard deviation. From each score, subtract the mean and divide by the standard deviation.
Consider the set of numbers below, where each observation is a test score ranging from 0 to 100.
{40, 50, 60, 55, 70, 80, 90}
Suppose you wanted to compare them to another set of test scores where the scores range from 0 to 250, such as:
{100, 115, 214, 50, 200, 80, 90}
You couldn't directly compare them. I.e. a score of 80 in the second set is clearly worse than a score of 80 in the first set (80/250 vs 80/100). One way to make them comparable is to use z-scores. They are computed as follows:
Find the mean
mean of the first set is: 63.57143
mean of the second set is: 121.2857
Subtract the sample mean from each score. This will give you a set of numbers that are centered on zero.
{-23.571429, -13.571429, -3.571429, -8.571429, 6.428571, 16.428571, 26.428571}
{-21.285714, -6.285714, 92.714286, -71.285714, 78.714286, -41.285714, -31.285714}
Compute the standard deviation from the original set and divide the "centered" scores by that number:
Set 1 sigma = 17.49149
Set 2 sigma = 61.98041
This is computed to be:
{-1.3475937, -0.7758873, -0.2041809, -0.4900341, 0.3675256, 0.9392320, 1.5109384}
{-0.3434265, -0.1014145, 1.4958643, -1.1501330, 1.2699865, -0.6661091, -0.5047678}
Now you have two sets of numbers that are directly comparable. A value of zero means the score equals the set's average. A value of 1 means it is one standard deviation above the set's average. A value of -1 means it is one standard deviation below the average, and so on.
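As a quick sketch of the steps above (NumPy is my assumption here; the same arithmetic works anywhere):
import numpy as np

set1 = np.array([40, 50, 60, 55, 70, 80, 90], dtype=float)
set2 = np.array([100, 115, 214, 50, 200, 80, 90], dtype=float)

def z_scores(x):
    # Subtract the sample mean, then divide by the sample standard deviation
    # (ddof=1 gives the n-1 denominator used in the worked example above).
    return (x - x.mean()) / x.std(ddof=1)

print(z_scores(set1))  # roughly [-1.348, -0.776, -0.204, -0.490, 0.368, 0.939, 1.511]
print(z_scores(set2))  # roughly [-0.343, -0.101, 1.496, -1.150, 1.270, -0.666, -0.505]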

Related

Do I need to add ReLU function before last layer to predict a positive value?

I am developing a model using linear regression to predict the age. I know that the age is from 0 to 100 and it is a possible value. I used conv 1 x 1 in the last layer to predict the real value. Do I need to add a ReLU function after the output of convolution 1x1 to guarantee the predicted value is a positive value? Currently, I did not add ReLU and some predicted value becomes negative value like -0.02 -0.4…
There's no compelling reason to use an activation function for the output layer; typically you just want to use a reasonable/suitable loss function directly with the penultimate layer's output. Specifically, a ReLU doesn't solve your problem (or at most solves only 'half' of it), since it can still predict values above 100. In this case (predicting a continuous outcome) there are a few standard loss functions, like squared error or the L1 norm.
If you really want to use an activation function for this final layer and are concerned about always predicting within a bounded interval, you could try scaling up the sigmoid function (to between 0 and 100). However, there's nothing special about sigmoid here: any bounded function, e.g. the CDF of any continuous real-valued random variable, could be used similarly. For optimization, though, something easily differentiable is important.
Why not start with something simple like squared-error loss? It's always possible to 'clamp' out-of-range predictions to within [0, 100] (we can give this a fancy name like 'doubly ReLU') when you need to actually make predictions (as opposed to during training/testing), but if you're getting lots of such errors, the model might have more fundamental problems.
Even for a regression problem, it can be good (for optimisation) to use a sigmoid layer before the output (giving a prediction in the [0, 1] range) followed by a denormalization step (here, if you think the maximum age is 100, just multiply by 100).
This tip is explained in this fast.ai course.
I personally think these lessons are excellent.
You should use a sigmoid activation function, and then normalize the target outputs to the [0, 1] range. This solves both issues: the output is positive and has an upper limit.
You can easily then denormalize the neural network outputs to get an output in the [0, 100] range.
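As an illustration of the two suggestions above (a sketch only, assuming PyTorch; the layer and function names are mine), a bounded output can be obtained by scaling a sigmoid, while clamping can be applied purely at prediction time:
import torch
import torch.nn as nn

class BoundedAgeHead(nn.Module):
    # Final layer whose output is squashed into (0, max_age).
    def __init__(self, in_features, max_age=100.0):
        super().__init__()
        self.linear = nn.Linear(in_features, 1)
        self.max_age = max_age

    def forward(self, x):
        # sigmoid maps to (0, 1); multiplying rescales to (0, max_age)
        return self.max_age * torch.sigmoid(self.linear(x))

# Alternative: train with plain squared error and only clamp when predicting
# (the "doubly ReLU" idea from the first answer).
def clamp_predictions(raw, max_age=100.0):
    return raw.clamp(min=0.0, max=max_age)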

How should zero standard deviation in one of the features be handled in multi-variate gaussian distribution

I am using a multivariate Gaussian distribution to analyze abnormality.
This is how the training set looks:
19-04-16 05:30:31 1 0 0 377816 305172 5567044 0 0 0 14 62 75 0 0 100 0 0
<Date> <time> <--------------------------- ------- Features --------------------------->
Let's say one of the above features does not change; it remains zero.
Calculating the mean (mu) as:
mu = mean(X)'                                % column vector of per-feature means
Calculating sigma2 as:
sigma2 = ((1/m) * (sum((X - mu') .^ 2)))'    % per-feature variances
The probability of an individual feature in each example is then calculated with the standard Gaussian density, p(x) = (1 / sqrt(2*pi*sigma2)) * exp(-(x - mu)^2 / (2*sigma2)).
For a particular feature, if all values come out to be zero, then the mean (mu) is also zero, and consequently sigma2 will also be zero.
Thereby, when I calculate the probability through the Gaussian distribution, I get a divide-by-zero problem.
However, in test sets this feature value can fluctuate, and I would like to flag that as an abnormality. How should this be handled? I don't want to ignore such a feature.
So, the problem occurs every time you have a variable which is constant. But then approximating it by a normal distribution makes no sense: all the information about such a variable is contained in a single value, and that is an intuition for why this division-by-zero phenomenon occurs.
If you know that there are fluctuations in your variable that are simply not observed in the training set, you could prevent its variance from being smaller than a certain value. Apply max(variance(X), eps) instead of the classic variance definition; then you can be sure that no division by zero occurs.
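A minimal NumPy rendering of that suggestion (NumPy and the eps value are my assumptions; the question itself uses Octave-style notation):
import numpy as np

def gaussian_params(X, eps=1e-8):
    # X is the (m, n) training matrix; returns per-feature mean and floored variance.
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)            # (1/m) * sum((X - mu)^2), as in the question
    sigma2 = np.maximum(sigma2, eps)  # a constant feature can no longer cause division by zero
    return mu, sigma2

def feature_probabilities(x, mu, sigma2):
    # Standard Gaussian density, evaluated per feature.
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)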

Normalizing feature values for SVM

I've been playing with some SVM implementations and I am wondering - what is the best way to normalize feature values to fit into one range? (from 0 to 1)
Let's suppose I have 3 features with values in ranges of:
3 - 5
0.02 - 0.05
10 - 15
How do I convert all of those values into range of [0,1]?
What if, during training, the highest value of feature number 1 that I encounter is 5, and after I begin to use my model on much bigger datasets, I stumble upon values as high as 7? Then in the converted range it would exceed 1...
How do I normalize values during training to account for the possibility of "values in the wild" exceeding the highest (or lowest) values the model has "seen" during training? How will the model react to that, and how do I make it work properly when that happens?
Besides the scaling-to-unit-length method provided by Tim, standardization is most often used in the machine learning field. Please note that when your test data comes in, it makes more sense to use the mean value and standard deviation from your training samples to do this scaling. If you have a very large amount of training data, it is reasonably safe to assume it roughly follows a normal distribution, so the chance that new test data falls far out of range won't be that high. Refer to this post for more details.
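A sketch of that idea (NumPy is my assumption; the numbers are made up): fit the mean and standard deviation on the training set only, then reuse them for test data:
import numpy as np

X_train = np.array([[3.0, 0.02, 10.0],
                    [5.0, 0.05, 15.0],
                    [4.0, 0.03, 12.0]])

mu = X_train.mean(axis=0)   # per-feature training mean
sd = X_train.std(axis=0)    # per-feature training standard deviation

def standardize(X):
    # Test data reuses the training statistics, so "values in the wild" may fall
    # outside the training range but still land on a comparable scale.
    return (X - mu) / sd

X_test = np.array([[7.0, 0.04, 13.0]])  # feature 1 exceeds the training maximum of 5
print(standardize(X_test))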
You normalise a vector by converting it to a unit vector. This trains the SVM on the relative values of the features, not the magnitudes. The normalisation algorithm will work on vectors with any values.
To convert to a unit vector, divide each value by the length of the vector. For example, a vector of [4 0.02 12] has a length of 12.6491. The normalised vector is then [4/12.6491 0.02/12.6491 12/12.6491] = [0.316 0.0016 0.949].
If "in the wild" we encounter a vector of [400 2 1200] it will normalise to the same unit vector as above. The magnitudes of the features is "cancelled out" by the normalisation and we are left with relative values between 0 and 1.

How are binary classifiers generalised to classify data into arbitrarily large sets?

How can algorithms which partition a space into two halves, such as Support Vector Machines, be generalised to label data with labels from sets such as the integers?
For example, a support vector machine operates by constructing a hyperplane and then things 'above' the hyperplane take one label, and things below it take the other label.
How does this get generalised so that the labels are, for example, integers, or some other arbitrarily large set?
One option is the 'one-vs-all' approach, in which you create one classifier for each set you want to partition into, and select the set with the highest probability.
For example, say you want to classify objects with a label from {1,2,3}. Then you can create three binary classifiers:
C1 = 1 or (not 1)
C2 = 2 or (not 2)
C3 = 3 or (not 3)
If you run these classifiers on a new piece of data X, then they might return:
C1(X) = 31.6% chance of being in 1
C2(X) = 63.3% chance of being in 2
C3(X) = 89.3% chance of being in 3
Based on these outputs, you could classify X as most likely being from class 3. (The probabilities don't add up to 1 - that's because the classifiers don't know about each other).
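A minimal sketch of the one-vs-rest idea (assuming scikit-learn is available; the dataset here is synthetic):
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# One binary SVM per class ("c or not c"); prediction picks the class whose
# classifier returns the largest decision score.
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)
print(ovr.predict(X[:5]))
print(ovr.decision_function(X[:5]))  # one column of scores per class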
Another option applies if your output labels are ordered (with some kind of meaningful, rather than arbitrary, ordering). For example, in finance you might want to classify stocks into {BUY, SELL, HOLD}. Although you can't legitimately perform a regression on these (the data is ordinal rather than ratio data), you can assign the values -1, 0 and 1 to SELL, HOLD and BUY and then pretend that you have ratio data. Sometimes this can give good results even though it's not theoretically justified.
Another approach is the Crammer-Singer method ("On the algorithmic implementation of multiclass kernel-based vector machines").
Svmlight implements it here: http://svmlight.joachims.org/svm_multiclass.html.
Classification into an ordered, possibly infinite set (such as the set of integers) is called ordinal regression. Usually this is done by mapping a range of continuous values onto an element of the set. (see http://mlg.eng.cam.ac.uk/zoubin/papers/chu05a.pdf, Figure 1a)

Kohonen SOM Maps: Normalizing the input with unknown range

According to "Introduction to Neural Networks with Java By Jeff Heaton", the input to the Kohonen neural network must be the values between -1 and 1.
It is possible to normalize inputs where the range is known beforehand:
For instance RGB (125, 125, 125), where the range is known to be values between 0 and 255:
1. Divide by 255: (125/255) = 0.5 >> (0.5,0.5,0.5)
2. Multiply by two and subtract one: ((0.5*2)-1)=0 >> (0,0,0)
The question is how we can normalize the input when the range is unknown, like our height or weight.
Also, some other papers mention that the input must be normalized to the values between 0 and 1. Which is the proper way, "-1 and 1" or "0 and 1"?
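Just to make the known-range recipe above concrete (NumPy is my assumption), here is the RGB example as a short sketch:
import numpy as np

rgb = np.array([125, 125, 125], dtype=float)
scaled = (rgb / 255.0) * 2.0 - 1.0  # divide by the known range, then map [0, 1] to [-1, 1]
print(scaled)                        # roughly [-0.02, -0.02, -0.02], i.e. about 0 as in the worked example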
You can always use a squashing function to map an infinite interval to a finite interval. E.g. you can use tanh.
You might want to use tanh(x * l) with a manually chosen l, though, in order not to put too many objects in the same region. So if you have a good guess that the maximal values of your data are +/- 500, you might want to use tanh(x / 1000) as a mapping, where x is the value of your object. It might even make sense to subtract your guess of the mean from x, yielding tanh((x - mean) / max).
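As a sketch of that suggestion (NumPy, the example data and the guessed scale are all assumptions on my part):
import numpy as np

def squash(x, mean_guess=0.0, scale_guess=1000.0):
    # tanh maps any real value into (-1, 1); centring and scaling first keeps
    # typical values away from the saturated tails.
    return np.tanh((x - mean_guess) / scale_guess)

heights_cm = np.array([150.0, 175.0, 210.0])
print(squash(heights_cm, mean_guess=170.0, scale_guess=50.0))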
From what I know about Kohonen SOMs, the specific normalization does not really matter.
Well, it might matter through specific choices for the values of the learning algorithm's parameters, but the most important thing is that the different dimensions of your input points have to be of the same magnitude.
Imagine that each data point is not a pixel with the three RGB components but a vector with statistical data for a country, e.g. area, population, ....
It is important for the convergence of the learning part that all these numbers are of the same magnitude.
Therefore, it does not really matter if you don't know the exact range; you just have to know approximately the characteristic amplitude of your data.
For weight and height, I'm sure that if you divide them respectively by 200 kg and 3 meters, all your data points will fall in the ]0, 1] interval. You could even use 50 kg and 1 meter; the important thing is that all coordinates end up of order 1.
Finally, you could consider running some linear analysis tool like POD (proper orthogonal decomposition) on the data, which would automatically give you a way to normalize your data and a subspace for the initialization of your map.
Hope this helps.
