Correct way to do Min-Max normalization - opencv

I am implementing alphabet classification using the OpenCV SVM.
I have a doubt about normalizing the feature vector.
I have two ways of normalizing the feature vector, and I need to find out which is the logically correct normalization method.
Method 1
Suppose I have 3 feature vectors as follows:
[2, 3, 8, 5 ] -> image 1
[3, 5, 2, 5 ] -> image 2
[9, 3, 8, 5 ] -> image 3
Each value in the feature vector is obtained by convolving the image with a kernel.
Currently I am finding the maximum and minimum value of each column and doing the normalization based on that.
In the above case first column is [2, 3, 9]
min = 2
max = 9
and the first column is normalized based on these values. Likewise, all other columns are normalized.
Method 2
If the kernel is as follows
[-1 0 1]
[-1 0 1]
[-1 0 1]
then the maximum and minimum values that can be obtained by convolving with the above kernel are as follows (8-bit image, intensity range 0-255):
max val = 765
min val = -765
Should I instead normalize every value with this theoretical max and min?
Which is the logically correct way to do the normalization (method 1 or method 2)?

The standard way to do it is method 1 (see the answer to this question). I also recommend you read this paper for a good reference on SVM training.
However, in your case the range of all features computed with the same kernel will be similar, and method 1 may hurt more than it helps (for example by amplifying the noise of almost constant features).
So my advice would be: test both methods and evaluate performance to see what works best in your case.
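For illustration, here is a minimal NumPy sketch of both options; the feature matrix and the theoretical range are just placeholders, and for method 1 the per-column min/range must be saved and reused on test images:

import numpy as np

X = np.array([[2, 3, 8, 5],
              [3, 5, 2, 5],
              [9, 3, 8, 5]], dtype=np.float64)  # rows = images, columns = features

# Method 1: per-column min-max over the training data.
col_min, col_max = X.min(axis=0), X.max(axis=0)
col_range = np.where(col_max > col_min, col_max - col_min, 1.0)  # avoid division by zero
X_method1 = (X - col_min) / col_range

# Method 2: fixed theoretical range of the kernel response, e.g. [-765, 765].
X_method2 = (X - (-765.0)) / (765.0 - (-765.0))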

Related

SVM's support vectors decision function representation

I am currently using an SVM with the 'rbf' kernel for my project.
What I understand from the theory is that the decision function value for the support vectors must be either +1 or -1 (if I use clf.decision_function(x)).
But for some support vectors the decision function value is 0.76, -0.88, 0.93 and so on (not even close to +1 or -1, like 0.99 or -0.99).
What is wrong in this scenario? Or is my understanding wrong?
I guess there is no range limitation on the decision function output of an SVM.
The value of the decision function for a point will be a large positive number for high-confidence positive decisions and will have a small absolute value (near 0) for low-confidence decisions.
Source here
Code Example:
import numpy as np
from sklearn.svm import SVC
X = np.array([[-1, -1], [-2, -1], [0, 0], [0, 0], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2, 3, 3])
clf = SVC()
clf.fit(X, y)
print(clf.decision_function(X))
print(clf.predict(X))
Output:
# clf.decision_function(X)
array([[ 2.21034835,  0.96227609, -0.20427163],
       [ 2.22222707,  0.84702504, -0.17843569],
       [-0.16668475,  2.22222222,  0.83335142],
       [-0.16668475,  2.22222222,  0.83335142],
       [-0.20428472,  0.96227609,  2.21036024],
       [-0.17841683,  0.84702504,  2.22221737]])
# clf.predict(X)
array([1, 1, 2, 2, 3, 3])
What the SVM is interested in is the sign of the decision value. E.g., if the sign is negative, the point lies (say) to the left of the hyperplane; if the sign is positive, it lies to the right. The magnitude tells you how far the point is from the hyperplane, so -0.88 means the point is to the left of the hyperplane at a distance of 0.88. The closer a point is to the hyperplane, the higher the chance of misclassification.
Have a look here
To quote from scikit-learn:
the function values are proportional to the distance of the samples X
to the separating hyperplane.
sklearn uses soft margin SVMs. From the User Guide:
In general, when the problem isn’t linearly separable, the support vectors are the samples within the margin boundaries.
See also this sklearn example, where depending on C the margin changes and more or fewer points act as support vectors.
So support vectors can have a decision_function score of anything from -1 to +1. (In fact, misclassified points outside the margin will still be support vectors, and will have a score outside even that range. See https://stats.stackexchange.com/a/585832/232706)
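To see this concretely, here is a small sketch on made-up data (not the data from the question) that prints the decision_function values of the support vectors of a binary soft-margin SVC; since the blobs overlap, many of them fall strictly inside (-1, +1):

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two overlapping blobs, so the problem is not linearly separable.
X = np.vstack([rng.randn(50, 2) - 1, rng.randn(50, 2) + 1])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="rbf", C=1.0).fit(X, y)

# Decision values of the support vectors: points within the margin give values
# in (-1, +1), while badly misclassified support vectors can fall outside that range.
print(clf.decision_function(X[clf.support_]))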

Is there a way to implement a Neural Network able to work with a vector target?

I'm trying to implement a neural network model using Keras, where the output is a vector of five elements.
Basically the target contains elements from 0 to 4 and nan, so I can have some targets like
[0, 3, 2, 1, 4] and others like [nan, 0, nan, 1, 2]. The important thing is that the elements in the vector are not repeated; only nan can be.
One solution I tried was to use something like a one-hot encoder for the target: I transformed each target into a 25-component vector of all zeros, with a 1 in correspondence with the number to map (i.e. [nan, 0, nan, 1, 2] -> [(0,0,0,0,0), (1,0,0,0,0), (0,0,0,0,0), (0,1,0,0,0), (0,0,1,0,0)] - I'm using the round brackets only to highlight groups of five elements).
Any ideas please?
As far as I have understood, what you're trying to predict is a list of 5 elements, each of which takes a discrete value from the set {nan, 0, 1, 2, 3, 4}.
What you'll need to do is train 5 neural networks (one for each position of the list), each predicting a value from that set; thus, you need to one-hot encode the outputs, apply a softmax, and select the highest-probability class for each network.
When predicting the output list of a new sample, you predict every position, put the predictions in a list, and voilà!
import numpy as np

def predict_sample(sample):
    # nn0..nn4 are the five per-position classifiers; take the most probable class for each
    pos_0 = np.argmax(nn0.predict(sample), axis=-1)[0]
    pos_1 = np.argmax(nn1.predict(sample), axis=-1)[0]
    pos_2 = np.argmax(nn2.predict(sample), axis=-1)[0]
    pos_3 = np.argmax(nn3.predict(sample), axis=-1)[0]
    pos_4 = np.argmax(nn4.predict(sample), axis=-1)[0]
    outp = np.array([pos_0, pos_1, pos_2, pos_3, pos_4], dtype=float)
    # if nan is encoded as class 5 then:
    outp[outp == 5] = np.nan
    return outp
You cannot guarantee that the no-repetition constraint will hold at prediction time, since the five networks predict independently; only the data will encourage that. What you can do, for example, is take the second-highest probability at a position when its top prediction has already been used at another position of the list.
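A minimal sketch of the training side, assuming Keras/TensorFlow, with nan encoded as class 5; the layer sizes and data are placeholders:

import numpy as np
import tensorflow as tf

NUM_CLASSES = 6          # 0..4 plus 5 standing in for nan
X = np.random.rand(100, 20).astype("float32")          # placeholder inputs
Y = np.random.randint(0, NUM_CLASSES, size=(100, 5))   # placeholder 5-position targets

def make_classifier():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

# One independent classifier per position of the target list.
nns = [make_classifier() for _ in range(5)]
for pos, nn in enumerate(nns):
    nn.fit(X, Y[:, pos], epochs=5, verbose=0)

nn0, nn1, nn2, nn3, nn4 = nns  # the classifiers used by predict_sample above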

How to understand SpatialDropout1D and when to use it?

Occasionally I see some models using SpatialDropout1D instead of Dropout. For example, in a part-of-speech tagging neural network, they use:
model = Sequential()
model.add(Embedding(s_vocabsize, EMBED_SIZE,
input_length=MAX_SEQLEN))
model.add(SpatialDropout1D(0.2)) ##This
model.add(GRU(HIDDEN_SIZE, dropout=0.2, recurrent_dropout=0.2))
model.add(RepeatVector(MAX_SEQLEN))
model.add(GRU(HIDDEN_SIZE, return_sequences=True))
model.add(TimeDistributed(Dense(t_vocabsize)))
model.add(Activation("softmax"))
According to Keras' documentation, it says:
This version performs the same function as Dropout, however it drops
entire 1D feature maps instead of individual elements.
However, I am unable to understand the meaning of an entire 1D feature map. More specifically, I am unable to visualize SpatialDropout1D in the same model explained on Quora.
Can someone explain this concept using the same model as on Quora?
Also, in what situations should we use SpatialDropout1D instead of Dropout?
To make it simple, I would first note that the so-called feature maps (1D, 2D, etc.) are just our regular channels. Let's look at examples:
Dropout(): Let's define a 2D input (rows = timesteps, columns = channels): [[1, 1, 1], [2, 2, 2]]. Dropout considers every element independently, and may result in something like [[1, 0, 1], [0, 2, 2]].
SpatialDropout1D(): In this case the result will look like [[1, 0, 1], [2, 0, 2]]. Notice that an entire channel (the second column) was zeroed for all timesteps, rather than individual elements.
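A quick way to see the difference is a minimal sketch (assuming TensorFlow 2 / tf.keras; note that kept values are rescaled by 1/(1 - rate)):

import tensorflow as tf

x = tf.ones((1, 4, 3))  # (batch, timesteps, channels); all ones so dropped entries stand out

# Plain Dropout: each element is kept or dropped independently.
print(tf.keras.layers.Dropout(0.5)(x, training=True))

# SpatialDropout1D: whole channels (columns) are kept or dropped across all timesteps.
print(tf.keras.layers.SpatialDropout1D(0.5)(x, training=True))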
The noise shape
In order to understand SpatialDropout1D, you should get used to the notion of the noise shape. In plain vanilla dropout, each element is kept or dropped independently. For example, if the tensor has shape [2, 2, 2], each of the 8 elements can be zeroed out depending on a random coin flip (with a certain "heads" probability); in total there are 8 independent coin flips, and any number of values may become zero, from 0 to 8.
Sometimes there is a need to do more than that. For example, one may need to drop whole slices along the 0 axis. With noise_shape = [1, 2, 2], the dropout involves only 4 independent coin flips: the two elements that share a position along the 0 axis are either kept together or dropped together. The number of zeroed elements can be 0, 2, 4, 6 or 8; it cannot be 1 or 5.
Another way to view this is to imagine that the input tensor is in fact of shape [2, 2], but each value is double-precision (or multi-precision). Instead of dropping individual bytes in the middle, the layer drops the full multi-byte value.
Why is it useful?
The example above is just for illustration and isn't common in real applications. A more realistic example is this: shape(x) = [k, l, m, n] and noise_shape = [k, 1, 1, n]. In this case, each batch and channel component is treated independently, but all rows and columns within it are kept or dropped together. In other words, the whole [l, m] feature map is either kept or dropped.
You may want to do this to account for the correlation of adjacent pixels, especially in the early convolutional layers. Effectively, you want to prevent the co-adaptation of pixels with their neighbors across feature maps, and make them learn as if no other feature maps existed. This is exactly what SpatialDropout2D does: it promotes independence between feature maps.
The SpatialDropout1D is very similar: given shape(x) = [k, l, m] it uses noise_shape = [k, 1, m] and drops entire 1-D feature maps.
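As a sketch (assuming tf.keras), SpatialDropout1D should therefore behave like Dropout with that noise shape:

import tensorflow as tf

x = tf.ones((2, 4, 3))  # shape(x) = [k, l, m] = [2, 4, 3]

spatial = tf.keras.layers.SpatialDropout1D(0.5)
manual = tf.keras.layers.Dropout(0.5, noise_shape=(2, 1, 3))  # noise_shape = [k, 1, m]

# Both zero out entire 1-D feature maps; the actual masks differ because they are random.
print(spatial(x, training=True))
print(manual(x, training=True))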
Reference: Efficient Object Localization Using Convolutional Networks
by Jonathan Tompson et al.

Intuition behind standard deviation as a threshold and why

I have a set of input-output training data; a few samples are:
Input         Output
[1 0 0 0 0] [1 0 1 0 0]
[1 1 0 0 1] [1 1 0 0 0]
[1 0 1 1 0] [1 1 0 1 0]
and so on. I need to apply the standard deviation of the entire output as a threshold. So, I calculate the mean standard deviation of the output. The application is that the model, when presented with this data, should be able to learn and predict the output. There is a condition in my objective function design: the distance (the sum of the square roots of the Euclidean distances between the model output and the desired target, for each input) should be less than a threshold.
My question is, how should I justify the use of this threshold? Is it justified? I read this article, which says that it is common to take the standard deviation as the threshold.
In my case, what does taking the standard deviation of the output of the training data mean?
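For concreteness, here is a minimal NumPy sketch of the kind of check being described; the exact way the distances and the standard deviation are aggregated is my assumption:

import numpy as np

targets = np.array([[1, 0, 1, 0, 0],
                    [1, 1, 0, 0, 0],
                    [1, 1, 0, 1, 0]], dtype=float)
model_outputs = np.array([[0.9, 0.1, 0.8, 0.0, 0.1],
                          [0.8, 0.9, 0.2, 0.1, 0.0],
                          [0.7, 0.8, 0.1, 0.9, 0.2]])  # placeholder predictions

# Threshold: mean of the per-column standard deviations of the training targets.
threshold = targets.std(axis=0).mean()

# Distance: sum over samples of the square roots of the per-sample Euclidean distances.
per_sample = np.linalg.norm(model_outputs - targets, axis=1)
distance = np.sqrt(per_sample).sum()

print(distance, threshold, distance < threshold)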
There is no deep intuition/philosophy behind the standard deviation (or variance); statisticians like these measures largely because they are mathematically easy to work with, thanks to various nice properties. See https://math.stackexchange.com/questions/875034/does-expected-absolute-deviation-or-expected-absolute-deviation-range-exist
There are quite a few other ways to perform various forms of outlier detection, belief revision, etc., but they can be more mathematically challenging to work with.
I am not sure this idea applies. You are looking at the definition of the standard deviation of a univariate value, but your output is multivariate. There are multivariate analogs, but it's not clear why you need to apply one here.
It sounds like you are minimizing the squared error, or Euclidean distance, between the output and the known correct output. That's fine, and makes me think you're predicting the multivariate output shown here. What is the threshold doing then? What exactly is supposed to be less than what measure of what?

multivariate random forest with opencv

Let's say we are trying to classify a pencil as healthy or not, and we have two variables for this purpose: the height and weight of the pencil. Now, what should I give to the training method of the random forest implemented in OpenCV? I am really confused by this because I have two kinds of data; both are numeric, but their units are different. The example below will give a better sense:
Height (cm) Weight (gr) Healthy? (bool)
----------- ----------- ---------------
10 34 0
4 6 0
12 14 1
8 20 1
5 18 0
If I train a univariate random forest with only height, the vectors {10, 4, 12, 8, 5} and {0, 0, 1, 1, 0} will be the parameters. However, if I want to use both variables, what will the parameters be?
In Python, the training data can be fed in as a list of tuples (one tuple of feature values per sample) if you have multiple variables.
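A minimal sketch with the OpenCV Python bindings (assuming the cv2.ml module from OpenCV 3+), where each row of the sample matrix holds both features for one pencil:

import numpy as np
import cv2

# Each row is one sample: [height, weight]; the units of the columns can differ.
samples = np.array([[10, 34], [4, 6], [12, 14], [8, 20], [5, 18]], dtype=np.float32)
responses = np.array([0, 0, 1, 1, 0], dtype=np.int32)  # healthy? labels

rtrees = cv2.ml.RTrees_create()
rtrees.train(samples, cv2.ml.ROW_SAMPLE, responses)

_, prediction = rtrees.predict(np.array([[9, 25]], dtype=np.float32))
print(prediction)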

Resources