I am currently using an SVM with the 'rbf' kernel for my project.
What I understand from the theory is that the decision function value for the support vectors must be either +1 or -1 (if I use clf.decision_function(x)).
But for some support vectors I find decision function values like 0.76, -0.88, 0.93 and so on (not even close to +1 or -1, like 0.99 or -0.99).
What is wrong in this scenario? Or is my understanding wrong?
I guess there is no range limitation for the decision function value output in SVM.
The value of the decision function for those points will be a large positive number for high-confidence positive decisions and will have a small absolute value (near 0) for low-confidence decisions.
Source here
Code Example:
import numpy as np
from sklearn.svm import SVC
X = np.array([[-1, -1], [-2, -1], [0, 0], [0, 0], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2, 3, 3])
clf = SVC()
clf.fit(X, y)
print(clf.decision_function(X))
print(clf.predict(X))
Output:
# clf.decision_function(X)
array([[ 2.21034835,  0.96227609, -0.20427163],
       [ 2.22222707,  0.84702504, -0.17843569],
       [-0.16668475,  2.22222222,  0.83335142],
       [-0.16668475,  2.22222222,  0.83335142],
       [-0.20428472,  0.96227609,  2.21036024],
       [-0.17841683,  0.84702504,  2.22221737]])
# clf.predict(X)
array([1, 1, 2, 2, 3, 3])
What the SVM cares about is the sign of the decision function: if the sign is negative, the point lies (say) to the left of the hyperplane; if the sign is positive, it lies to the right. The magnitude tells you how far the point is from the hyperplane, so -0.88 means the point is to the left of the hyperplane at a distance of 0.88. The closer a point is to the hyperplane, the higher the chance of misclassification.
Have a look here
To quote from scikit-learn:
the function values are proportional to the distance of the samples X
to the separating hyperplane.
sklearn uses soft margin SVMs. From the User Guide:
In general, when the problem isn’t linearly separable, the support vectors are the samples within the margin boundaries.
See also this sklearn example, where depending on C the margin changes and more or fewer points act as support vectors.
So support vectors can have a decision_function score of anything from -1 to +1. (In fact, misclassified points outside the margin will still be support vectors, and will have a score outside even that range. See https://stats.stackexchange.com/a/585832/232706)
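Below is a minimal sketch of this (with made-up, overlapping toy data, not the poster's data) showing that with a soft-margin RBF SVM the decision_function values at the support vectors are generally not pinned to exactly +1/-1:

import numpy as np
from sklearn.svm import SVC

# Made-up, overlapping two-class data (not linearly separable).
rng = np.random.RandomState(0)
X = np.r_[rng.randn(20, 2) - 1, rng.randn(20, 2) + 1]
y = np.r_[np.zeros(20), np.ones(20)]

clf = SVC(kernel='rbf', C=1.0).fit(X, y)

# Decision function values at the support vectors: with a soft margin,
# support vectors inside the margin give values between -1 and +1, and
# misclassified support vectors can even fall outside that range.
print(np.round(clf.decision_function(X[clf.support_]), 3))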
Related
I've been looking for a solution to create a recommendation system based on vector similarity.
Basically, I have a few vectors per user, for example:
User1: [0,3,7,8,5] , [3,5,8,2,4] , [1,5,3,9,4]
User2: [3,1,6,7,9] , [2,4,1,3,8] , [7,8,3,3,1]
For every vector I need to calculate a coefficient and, based on that coefficient, differentiate one vector from another. I've found formulas that calculate a coefficient based on the similarity of 2 vectors, which is not really what I want. I need a formula that calculates a coefficient per vector, and then I do some other calculations with those coefficients. Are there any good formulas for this?
Thanks
So going based off your response to my comment: I don't think there's a similarity coefficient measure that will do what you want. Let me explain why...
Similarity coefficients are functions f(x, y) -> c where x and y are vectors and c is a scalar. Note that f takes two parameters: f(x, y) = f(y, x), but f(x) is meaningless - it's asking for the similarity of x relative to... nothing.
So what? We could just use a function g(x) = f(x, V) where V is a fixed vector. E.g. let V = [1, 1, ..., 1]. Now we have a monadic function that gives us a similarity value for every individual vector. But...
Knowing f(x,y) = c and f(x,z) = c' doesn't tell you a whole lot about f(y,z). Take vectors in 2-space: x = [1, 1], y = [0, 1], z = [1, 0]. A similarity function symmetric in the two dimensions would say f(x,y) = f(x,z), but hopefully not = f(y,z). So our g function above isn't very useful, because knowing how similar two vectors are to V doesn't tell us much about how similar they are to each other.
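Here is a tiny numeric check of that point - a minimal sketch using cosine similarity as the f above, on the same x, y, z from the example:

import numpy as np

def cosine_sim(a, b):
    # plain cosine similarity between two vectors
    a, b = np.asarray(a, float), np.asarray(b, float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

x, y, z = [1, 1], [0, 1], [1, 0]

# y and z are equally "similar" to the reference x ...
print(cosine_sim(x, y), cosine_sim(x, z))   # both ~0.707
# ... yet they are not similar to each other at all.
print(cosine_sim(y, z))                     # 0.0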
So what can you do? I think a simple solution to your problem would be a variation of the k nearest neighbors algorithm. It allows you to find vectors close to a given vector (or, if you prefer to find clusters of vectors without specifying a given vector, look up clustering)
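For instance, a minimal scikit-learn sketch of the nearest-neighbour idea (the stored vectors are the ones from the question; the query vector and the cosine metric are just assumptions for illustration):

import numpy as np
from sklearn.neighbors import NearestNeighbors

# One row per stored vector (User1's and User2's vectors stacked).
vectors = np.array([[0, 3, 7, 8, 5],
                    [3, 5, 8, 2, 4],
                    [1, 5, 3, 9, 4],
                    [3, 1, 6, 7, 9],
                    [2, 4, 1, 3, 8],
                    [7, 8, 3, 3, 1]])

nn = NearestNeighbors(n_neighbors=3, metric='cosine').fit(vectors)

query = np.array([[1, 4, 6, 8, 5]])       # made-up query vector
distances, indices = nn.kneighbors(query)
print(indices, distances)                 # the 3 most similar stored vectors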
EDIT: inspiration from Yahya's answer: if your vectors are super huge and knn or clustering is too difficult, consider principal component analysis or some other method of cutting them down to size (reducing the number of dimensions) - just keep in mind that whatever you do will likely be lossy.
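If you go that route, a hedged sketch of the PCA step could look like this (random data purely to show the shapes; explained_variance_ratio_ tells you how lossy the reduction is):

import numpy as np
from sklearn.decomposition import PCA

# Made-up high-dimensional vectors: 100 vectors of dimension 500.
rng = np.random.RandomState(0)
vectors = rng.rand(100, 500)

# Reduce to 20 dimensions before running knn or clustering.
pca = PCA(n_components=20).fit(vectors)
reduced = pca.transform(vectors)
print(reduced.shape, pca.explained_variance_ratio_.sum())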
Occasionally I see some models using SpatialDropout1D instead of Dropout. For example, in the part-of-speech tagging neural network, they use:
model = Sequential()
model.add(Embedding(s_vocabsize, EMBED_SIZE,
                    input_length=MAX_SEQLEN))
model.add(SpatialDropout1D(0.2)) ##This
model.add(GRU(HIDDEN_SIZE, dropout=0.2, recurrent_dropout=0.2))
model.add(RepeatVector(MAX_SEQLEN))
model.add(GRU(HIDDEN_SIZE, return_sequences=True))
model.add(TimeDistributed(Dense(t_vocabsize)))
model.add(Activation("softmax"))
According to the Keras documentation:
This version performs the same function as Dropout, however it drops
entire 1D feature maps instead of individual elements.
However, I am unable to understand the meaning of an entire 1D feature map. More specifically, I am unable to visualize SpatialDropout1D in the same model explained on Quora.
Can someone explain this concept using the same model as on Quora?
Also, in what situations should we use SpatialDropout1D instead of Dropout?
To make it simple, I would first note that the so-called feature maps (1D, 2D, etc.) are just our regular channels. Let's look at two examples:
Dropout(): Let's define a 2D input: [[1, 1, 1], [2, 2, 2]], i.e. one sample with 2 timesteps (rows) and 3 channels (columns). Dropout considers every element independently, and may result in something like [[1, 0, 1], [0, 2, 2]].
SpatialDropout1D(): In this case the result will look like [[1, 0, 1], [2, 0, 2]]. Notice that the 2nd channel (column) was zeroed across all timesteps (rows).
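A quick way to see the difference is to run both layers on such a tensor yourself. This is a minimal sketch assuming TensorFlow's Keras layers (note that kept values are rescaled by 1/(1 - rate) at training time, so the surviving entries appear doubled):

import tensorflow as tf

# One sample with 2 timesteps and 3 channels: shape (1, 2, 3).
x = tf.constant([[[1., 1., 1.],
                  [2., 2., 2.]]])

# Plain Dropout: every element is dropped (or kept) independently.
print(tf.keras.layers.Dropout(0.5)(x, training=True))

# SpatialDropout1D: a dropped channel is zeroed for all timesteps at once.
print(tf.keras.layers.SpatialDropout1D(0.5)(x, training=True))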
The noise shape
In order to understand SpatialDropout1D, you should get used to the notion of the noise shape. In plain vanilla dropout, each element is kept or dropped independently. For example, if the tensor has shape [2, 2, 2], each of its 8 elements can be zeroed out depending on a random coin flip (with a certain "heads" probability); in total, there will be 8 independent coin flips and any number of values may become zero, from 0 to 8.
Sometimes there is a need to do more than that. For example, one may need to drop the whole slice along the 0 axis. The noise_shape in this case is [1, 2, 2] and the dropout involves only 4 independent random coin flips. Each pair of elements along the first axis is either kept together or dropped together. The number of zeroed elements can be 0, 2, 4, 6 or 8; it can never be an odd number like 1 or 5.
Another way to view this is to imagine that the input tensor is in fact of shape [2, 2], but each value is double-precision (or multi-precision). Instead of dropping some of the bytes in the middle, the layer drops the full multi-byte value.
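As an illustration of the noise-shape idea, here is a sketch using TensorFlow's tf.nn.dropout, which exposes noise_shape directly (kept values are again rescaled by 1/(1 - rate)):

import tensorflow as tf

# A tensor of shape [2, 2, 2]: 8 elements in total.
x = tf.ones([2, 2, 2])

# Plain dropout: 8 independent coin flips, one per element.
print(tf.nn.dropout(x, rate=0.5))

# noise_shape=[1, 2, 2]: only 4 coin flips; each flip is shared by the
# pair of elements along axis 0, so elements are zeroed two at a time.
print(tf.nn.dropout(x, rate=0.5, noise_shape=[1, 2, 2]))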
Why is it useful?
The example above is just for illustration and isn't common in real applications. A more realistic example is this: shape(x) = [k, l, m, n] and noise_shape = [k, 1, 1, n]. In this case, each batch and channel component is kept or dropped independently, but each row and column is kept or dropped together. In other words, the whole [l, m] feature map is either kept or dropped.
You may want to do this to account for the correlation of adjacent pixels, especially in the early convolutional layers. Effectively, you want to prevent co-adaptation of pixels with their neighbors across the feature maps, and make them learn as if no other feature maps existed. This is exactly what SpatialDropout2D does: it promotes independence between feature maps.
The SpatialDropout1D is very similar: given shape(x) = [k, l, m] it uses noise_shape = [k, 1, m] and drops entire 1-D feature maps.
Reference: Efficient Object Localization Using Convolutional Networks
by Jonathan Tompson et al.
I have tried using the OneVsRest with Logistic Regression from sklearn, but it gives empty labels for some samples (i.e. it doesn't predict any label), even though I do not have any unlabelled training data.
Any idea what might be causing this or how to fix this?
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

clf = OneVsRestClassifier(LogisticRegression(multi_class='ovr', max_iter=1000, solver='lbfgs'))
clf.fit(X, Y)
self.classifier = clf
self.classifier.predict(test_data)
Whenever you are performing multilabel classification, according to the OneVsRestClassifier documentation the targets need to be "a sequence of sequences of labels".
Moreover, depending on how you encode these labels you may get the following warning: "DeprecationWarning: Direct support for sequence of sequences multilabel representation will be unavailable from version 0.17. Use sklearn.preprocessing.MultiLabelBinarizer to convert to a label indicator representation."
So, a neat way to encode your labels:
from sklearn import preprocessing
mlb = preprocessing.MultiLabelBinarizer()
Y = mlb.fit_transform([(1, 2), (1,2), (1,2),(4,)])
# this means sample one belongs to classes {1,2} and so on.
# Take into account the format if only one class is needed, (4,) not (4)
so Y turns out to be:
array([[1, 1, 0],
       [1, 1, 0],
       [1, 1, 0],
       [0, 0, 1]])
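For completeness, a minimal end-to-end sketch (the toy X below is made up) showing how this binarized Y plugs into OneVsRestClassifier and how to map predictions back to label tuples:

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# Made-up features: 4 samples, 3 features each.
X = np.array([[1., 0., 2.],
              [0., 1., 1.],
              [1., 1., 0.],
              [2., 0., 0.]])

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform([(1, 2), (1, 2), (1, 2), (4,)])

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)

pred = clf.predict(X)                  # label indicator matrix
print(mlb.inverse_transform(pred))     # back to tuples of labels

Note that OneVsRestClassifier can still return an all-zero row (an empty label set) for a sample when none of the per-class classifiers predicts positive for it.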
I am implementing alphabet classification using OpenCV SVM.
I have a doubt about normalizing the feature vectors.
I have two ways of normalizing the feature vectors,
and I need to find out which is the logically correct normalization method.
Method 1
Suppose I have 3 feature vector as follows
[2, 3, 8, 5 ] -> image 1
[3, 5, 2, 5 ] -> image 2
[9, 3, 8, 5 ] -> image 3
And each value in the feature vector is obtained by convolving the pixel with a kernel.
Currently I am finding the maximum and minimum value of each column and doing the normalization based on that.
In the above case first column is [2, 3, 9]
min = 2
max = 9
and the normalization of the first column is done based on those values. Likewise, all other columns are normalized.
Method 2
If the kernel is as follows
[-1 0 1]
[-1 0 1]
[-1 0 1]
then the maximum and minimum values that can be obtained by convolving with the above kernel are as follows (8-bit image, intensity range: 0-255)
max val = 765
min val = -765
And every value is normalized with the above max/min.
Which is the logically correct way to do the normalization (method 1 or method 2)?
The standard way to do it is method 1 (see the answer to this question). I also recommend you read this paper for a good reference about SVM training.
However, in your case the range of all features computed with the same kernel will be similar, and method 1 may hurt more than it helps (for example by amplifying the noise of almost-constant features).
So my advice would be: test both methods, and evaluate performance to see what works best in your case.
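To make the comparison concrete, here is a small numpy sketch of both methods on the three example vectors (the -765/765 range is the theoretical response of the 3x3 kernel from the question on an 8-bit image):

import numpy as np

# Feature vectors from the question (one row per image).
F = np.array([[2., 3., 8., 5.],
              [3., 5., 2., 5.],
              [9., 3., 8., 5.]])

# Method 1: per-column min-max scaling over the training set
# (guarding against constant columns to avoid division by zero).
col_min, col_max = F.min(axis=0), F.max(axis=0)
norm1 = (F - col_min) / np.where(col_max > col_min, col_max - col_min, 1.0)

# Method 2: scale by the theoretical response range of the kernel.
theo_min, theo_max = -765.0, 765.0
norm2 = (F - theo_min) / (theo_max - theo_min)

print(norm1)
print(norm2)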
My samples can belong either to class 0 or to class 1, but for some of my samples I only have a probability that they belong to class 1. So far I've discretized my target variable by applying a threshold, i.e. I assigned all samples with y >= t to class 1, and I've discarded all the samples whose probability of belonging to class 1 is non-zero but below the threshold. Then I fitted a linear SVM to the data using scikit-learn.
Of course, this way I throw away quite a bit of the training data. One idea I had was to skip the discretization and use regression instead, but it's usually not a good idea to approach classification with regression, as, for example, it doesn't guarantee the predicted values to be in the interval [0, 1].
By the way, the nature of my features x is similar: for some of them I also only have probabilities of the respective feature being present. In terms of the error, it didn't make a big difference whether I discretized my features in the same way I discretized the dependent variable.
You might be able to approximate this using sample weighting - assign each sample to the class with the highest probability, but weight that sample by the probability of it actually belonging to that class. Many of the scikit-learn estimators allow for this.
Example:
X = [1, 2, 3, 4] with class 0 probability .7 would become X = [1, 2, 3, 4], y = [0] with a sample weight of .7. You might also rescale so the sample weights span 0 to 1 (since your probabilities, and hence sample weights, will only range from .5 to 1 in this scheme). You could also incorporate non-linear penalties to "strengthen" the influence of high-probability samples.
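A minimal sketch of that weighting scheme with scikit-learn (the data and probabilities are made up; LinearSVC is just one estimator that accepts sample_weight):

import numpy as np
from sklearn.svm import LinearSVC

# Made-up features and class-1 probabilities for four samples.
X = np.array([[1., 2., 3., 4.],
              [2., 1., 0., 3.],
              [0., 1., 1., 5.],
              [4., 0., 2., 1.]])
p1 = np.array([0.7, 0.2, 0.95, 0.55])

# Assign each sample to its most probable class ...
y = (p1 >= 0.5).astype(int)
# ... and weight it by the probability of that assignment.
w = np.where(y == 1, p1, 1 - p1)

clf = LinearSVC().fit(X, y, sample_weight=w)
print(clf.predict(X))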