Values greater than 1 in svm prediction file - machine-learning

I am using SVMlight to train a model for binary classification. Using the model, I tested some examples. I was surprised to see the output of the prediction file: it contains values greater than 1 as well as values less than -1. I thought the range was [-1, 1]. Am I doing something wrong?

It makes sense that the values are not bounded by the interval [-1, 1] once you understand how an SVM works. An SVM tries to draw the hyperplane that separates the negative and positive data points while maximizing their distance from it.
The values in the prediction file are the signed distances of the data points from the SVM's optimal hyperplane: positive values lie on the positive-class side of the hyperplane and negative values on the negative-class side. These distances can be arbitrarily large or small and are not bounded.
I've seen some SVM implementations, such as Weka's implementation of Platt's SMO, normalize these values into confidence scores for the positive class bounded by the interval [0, 1]. Both conventions work just fine for judging how confident an SVM is in a classification, since a data point lying further from the hyperplane is classified with more certainty than one lying close to it.
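As a quick illustration (my own sketch, using scikit-learn's SVC rather than SVMlight, which I am assuming behaves comparably with a linear kernel), the raw decision values of a linear SVM routinely fall outside [-1, 1]:
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 3, rng.randn(50, 2) + 3])  # two well-separated clusters
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear").fit(X, y)
scores = clf.decision_function(X)  # signed distances (scaled by ||w||) from the hyperplane

# Points far from the hyperplane get scores well outside [-1, 1].
print(scores.min(), scores.max())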

Related

I couldn't understand 3 things in the normalized inputs and initial weights video

This video https://www.udacity.com/course/viewer#!/c-ud730/l-6370362152/m-7119160655 talks about zero mean and equal variance of the inputs to our cross-entropy function, but I cannot understand where the zero mean and equal variance come in. Could someone give me an example to explain it? It also talks about initializing the weights from a normal distribution; could someone explain how? And at the end it talks about taking derivatives with respect to the weights and biases, subtracting them from the current weights and biases, and repeating this in a loop. Could you explain this to me? I'm very confused!!
These are basic statistics questions; you can find many resources on them.
A quick summary:
Zero mean: calculate the mean of the data points,
Mean = Sum(data points) / Count(data points)
and subtract it from every data point so the data are centred at zero.
Equal variance: calculate the standard deviation of each dataset, then normalize its data points:
Each Data Point <- (Data Point Value - Mean(data points for that dataset)) / (Standard Deviation for that dataset)
The standard deviation / variance of the two datasets will in general be different. By dividing each data point in the respective dataset by its standard deviation you get results that are much easier to compare.
E.g. if the variance of dataset A is 25 and the variance of dataset B is 100, each data point in A is divided by 5 and each data point in B is divided by 10. That allows the cross-entropy calculation to compare values of similar amplitude.
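As a small illustrative sketch (mine, not from the answer), the same zero-mean, unit-variance normalization with NumPy:
import numpy as np

a = np.array([1.0, 6.0, 11.0, 16.0, 21.0])    # hypothetical dataset A
b = np.array([10.0, 30.0, 50.0, 70.0, 90.0])  # hypothetical dataset B

def normalize(x):
    # Zero mean: subtract the mean; equal variance: divide by the standard deviation.
    return (x - x.mean()) / x.std()

print(normalize(a))  # mean ~ 0, standard deviation ~ 1
print(normalize(b))  # mean ~ 0, standard deviation ~ 1, now on a scale comparable to A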

receiver operating characteristic (ROC) on a test set

The following image definitely makes sense to me.
Say you have a few trained binary classifiers A, B (B not much better than random guessing etc. ...) and a test set composed of n test samples to go with all those classifiers. Since Precision and Recall are computed for all n samples, those dots corresponding to classifiers make sense.
Now sometimes people talk about ROC curves, and I understand that precision is expressed as a function of recall, or simply plotted as Precision(Recall).
I don't understand where this variability comes from, since you have a fixed number of test samples. Do you just pick some subsets of the test set and compute precision and recall on each in order to plot them, and hence get many discrete values (or an interpolated line)?
The ROC curve is well-defined for a binary classifier that expresses its output as a "score." The score can be, for example, the probability of being in the positive class, or it could also be the probability difference (or even the log-odds ratio) between probability distributions over each of the two possible outcomes.
The curve is obtained by setting the decision threshold for this score at different levels and measuring the true-positive and false-positive rates, given that threshold.
There's a good example of this process in Wikipedia's "Receiver Operating Characteristic" page:
For example, imagine that the blood protein levels in diseased people and healthy people are normally distributed with means of 2 g/dL and 1 g/dL respectively. A medical test might measure the level of a certain protein in a blood sample and classify any number above a certain threshold as indicating disease. The experimenter can adjust the threshold (black vertical line in the figure), which will in turn change the false positive rate. Increasing the threshold would result in fewer false positives (and more false negatives), corresponding to a leftward movement on the curve. The actual shape of the curve is determined by how much overlap the two distributions have.
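To make the quoted example concrete, here is a small simulation of my own (the means of 2 g/dL and 1 g/dL come from the quote; the standard deviation of 1 g/dL is an assumption):
import numpy as np

rng = np.random.RandomState(0)
diseased = rng.normal(loc=2.0, scale=1.0, size=1000)  # protein levels of diseased people
healthy = rng.normal(loc=1.0, scale=1.0, size=1000)   # protein levels of healthy people

for threshold in [0.5, 1.0, 1.5, 2.0, 2.5]:
    tpr = np.mean(diseased > threshold)  # true positive rate at this threshold
    fpr = np.mean(healthy > threshold)   # false positive rate at this threshold
    print(f"threshold={threshold:.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")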
If code speaks more clearly to you, here's the code in scikit-learn that computes an ROC curve given a set of predictions for each item in a dataset. The fundamental operation seems to be (direct link):
# sort the scores (and the corresponding true labels) in descending order
desc_score_indices = np.argsort(y_score, kind="mergesort")[::-1]
y_score = y_score[desc_score_indices]
y_true = y_true[desc_score_indices]
# accumulate the true positives with decreasing threshold
tps = y_true.cumsum()
# everything ranked at or above this point, minus the true positives, is a false positive
fps = 1 + np.arange(len(y_true)) - tps
return fps, tps, y_score
(I've omitted a bunch of code in there that deals with the (common) cases of having weighted samples and of the classifier giving near-identical scores to multiple samples.) Basically, the true labels are sorted in descending order of the score assigned to them by the classifier, and then their cumulative sum is computed, giving the true positive count (and, after normalizing, the true positive rate) as a function of the score threshold.
And here's an example showing how this gets used: http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
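For completeness, a minimal sketch of my own calling the public scikit-learn API directly on toy data:
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # classifier scores, e.g. predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
print(fpr, tpr, thresholds)
print("AUC:", auc(fpr, tpr))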
The ROC curve just shows how much sensitivity (TPR) you gain if you allow the FPR to increase by some amount; it is the trade-off between TPR and FPR. The variability comes from varying some parameter of the classifier (for the logistic regression case below, it is the threshold value).
For example, logistic regression gives you the probability that an object belongs to the positive class (a value in [0, 1]), but that is just a probability, not a class label. So in the general case you have to specify a threshold on the probability above which you classify an object as positive. You can train a logistic regression, obtain from it the positive-class probability for each object in your set, and then vary this threshold parameter in small steps from 0 to 1; thresholding the probabilities (computed in the previous step) at each value gives you class labels for every object, from which you compute TPR and FPR. Thus you get a (TPR, FPR) pair for every threshold. You can mark them on a plot and, once you have computed the pairs for all thresholds, draw a line through them.
Also, for linear binary classifiers, you can think of this varying process as choosing the distance between the decision line and the positive (or negative, if you prefer) class cluster. If you move the decision line away from the positive class you will classify more objects as positive (because you have enlarged the positive-class region), and at the same time you will increase the FPR by some amount (because the negative-class region has shrunk).
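Here is a minimal sketch of that threshold-sweeping procedure (my own illustration; the labels and probabilities are made up):
import numpy as np

y_true = np.array([0, 0, 0, 1, 1, 1, 1])                # hypothetical ground-truth labels
p_pos = np.array([0.1, 0.4, 0.6, 0.3, 0.7, 0.8, 0.9])   # hypothetical positive-class probabilities

points = []
for threshold in np.arange(0.0, 1.01, 0.05):
    y_pred = (p_pos >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    points.append((fp / (fp + tn), tp / (tp + fn)))      # (FPR, TPR) at this threshold

print(points)  # plotting these pairs traces out the ROC curve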

how to interpret the "soft" and "max" in the SoftMax regression?

I know the form of the softmax regression, but I am curious about why it has such a name? Or just for some historical reasons?
The maximum of two numbers max(x, y) can have sharp corners / steep edges, which is sometimes an unwanted property (e.g. if you want to compute gradients).
To soften the edges of max(x, y), one can use a variant with softer edges: the softmax function. It's still a max function at its core (well, to be precise, an approximation of one), but smoothed out.
If it's still unclear, here's a good read.
Let's say you have a set of scalars xi and you want to calculate a weighted sum of them, giving a weight wi to each xi such that the weights sum to 1 (like a discrete probability distribution). One way to do it is to set wi = exp(a*xi) for some positive constant a, and then normalize the weights so they sum to one. If a = 0 you get just a regular sample average. On the other hand, for a very large value of a you get the max operator: the weighted sum is essentially just the largest xi. Therefore, varying the value of a gives you a "soft", or continuous, way to go from regular averaging to selecting the max. The functional form of this weighted average should look familiar to you if you already know what a SoftMax regression is.
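A small sketch (mine, not the answerer's) of how the constant a moves this weighted sum from the plain average toward the max:
import numpy as np

x = np.array([1.0, 2.0, 5.0])

def soft_weighted_sum(x, a):
    w = np.exp(a * x)
    w = w / w.sum()      # normalize the weights (this is the softmax of a*x)
    return np.dot(w, x)  # weighted sum of the x_i

print(soft_weighted_sum(x, 0.0))    # a = 0: plain average, (1+2+5)/3 ≈ 2.67
print(soft_weighted_sum(x, 1.0))    # intermediate: leans toward the larger values
print(soft_weighted_sum(x, 50.0))   # very large a: essentially max(x) = 5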

How to visualize SVM weights in HOG

In the original HOG (Histogram of Oriented Gradients) paper http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf there are images showing the HOG representation of an image (Figure 6). In that figure, parts (f) and (g) are described as the "HOG descriptor weighted by respectively the positive and the negative SVM weights".
I don't understand what this means. I understand that when I train an SVM I get a weight vector, and that to classify I have to feed the features (HOG descriptors) into the decision function. So what do they mean by positive and negative weights? And how would I plot them like the paper does?
Thanks in advance.
The weights tell you how significant a specific element of the feature vector is for a given class. That means that if you see a high value in your feature vector, you can look up the corresponding weight:
If the weight is a large positive number, it is more likely that your object belongs to the class.
If the weight is a large negative number, it is more likely that your object does NOT belong to the class.
If the weight is close to zero, this position is mostly irrelevant for the classification.
Now you use those weights to scale the feature vector you have, where the lengths of the gradients are mapped to color intensity. Because you can't display negative color intensities, they decided to split the visualization into a positive and a negative part. In the visualizations you can then see which parts of the input image contribute evidence for the class (positive) and which contribute evidence against it (negative).
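Here is a rough sketch of the idea (my own, not code from the paper; the array names and layout are assumptions): split the linear SVM weight vector into its positive and negative parts and use each to scale the per-cell orientation histograms before rendering them.
import numpy as np

# hog_descriptor and svm_weights are assumed to be flat arrays of the same length
# (n_cells * n_orientation_bins), as produced by a linear SVM trained on HOG features.
n_cells, n_bins = 8 * 16, 9                      # hypothetical detector layout
hog_descriptor = np.random.rand(n_cells * n_bins)
svm_weights = np.random.randn(n_cells * n_bins)

w_pos = np.maximum(svm_weights, 0)               # keep only the positive weights
w_neg = np.maximum(-svm_weights, 0)              # keep only the magnitudes of the negative weights

pos_visualization = (hog_descriptor * w_pos).reshape(n_cells, n_bins)
neg_visualization = (hog_descriptor * w_neg).reshape(n_cells, n_bins)
# Each row is one cell's orientation histogram; drawing its bins as oriented strokes whose
# intensity is the weighted bin value reproduces figures like 6(f) and 6(g).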

Precision/recall for multiclass-multilabel classification

I'm wondering how to calculate precision and recall measures for multiclass multilabel classification, i.e. classification where there are more than two labels, and where each instance can have multiple labels?
For multi-label classification you have two ways to go
First consider the following.
$n$ is the number of examples.
$Y_i$ is the ground-truth label assignment of the $i^{th}$ example.
$x_i$ is the $i^{th}$ example.
$h(x_i)$ is the set of labels predicted for the $i^{th}$ example.
Example based
The metrics are computed in a per-datapoint manner: for each data point, a score is computed by comparing its predicted label set with the ground truth, and these scores are then averaged over all the data points.
Precision $= \frac{1}{n}\sum_{i=1}^{n}\frac{|Y_i \cap h(x_i)|}{|h(x_i)|}$, the ratio of how much of the predicted label set is correct. The numerator counts how many labels in the predicted vector are in common with the ground truth, and the ratio tells you how many of the predicted labels are actually in the ground truth.
Recall $= \frac{1}{n}\sum_{i=1}^{n}\frac{|Y_i \cap h(x_i)|}{|Y_i|}$, the ratio of how many of the actual labels were predicted. The numerator again counts how many labels in the predicted vector are in common with the ground truth, and dividing by the number of actual labels gives the fraction of the actual labels that were predicted.
There are other metrics as well.
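As an illustration (mine, not part of the answer), the example-based precision and recall computed on made-up label sets in plain Python:
# Example-based precision and recall over hypothetical predicted / true label sets.
y_true = [{"a", "b"}, {"b"}, {"a", "c"}]        # ground-truth label sets, one per example
y_pred = [{"a"}, {"b", "c"}, {"a", "b", "c"}]   # predicted label sets, one per example

n = len(y_true)
precision = sum(len(t & p) / len(p) for t, p in zip(y_true, y_pred)) / n
recall = sum(len(t & p) / len(t) for t, p in zip(y_true, y_pred)) / n

print(precision)  # (1/1 + 1/2 + 2/3) / 3 ≈ 0.72
print(recall)     # (1/2 + 1/1 + 2/2) / 3 ≈ 0.83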
Label based
Here things are done label-wise. For each label the metrics (e.g. precision, recall) are computed, and these label-wise metrics are then aggregated. Hence, in this case you end up computing the precision/recall for each label over the entire dataset, as you would for a binary classification (since each label has a binary assignment), and then aggregating them.
The easiest way is to present the general form.
This is just an extension of the standard multi-class equivalent.
Macro averaged: $\frac{1}{q}\sum_{j=1}^{q} B(TP_j, FP_j, TN_j, FN_j)$
Micro averaged: $B\left(\sum_{j=1}^{q} TP_j, \sum_{j=1}^{q} FP_j, \sum_{j=1}^{q} TN_j, \sum_{j=1}^{q} FN_j\right)$
Here $TP_j$, $FP_j$, $TN_j$ and $FN_j$ are the true positive, false positive, true negative and false negative counts, respectively, for the $j^{th}$ label only, and $q$ is the number of labels.
$B$ stands for any confusion-matrix-based metric; in your case you would plug in the standard precision and recall formulas. For the macro average you apply the metric to each label's counts and then average over the labels; for the micro average you sum the counts over all labels first and then apply the metric once.
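A small sketch of the difference (my illustration, with made-up per-label counts), using precision as the metric B:
tp = [10, 2, 30]   # hypothetical true positives for labels 1..3
fp = [5, 8, 10]    # hypothetical false positives for labels 1..3

def precision(tp_count, fp_count):
    return tp_count / (tp_count + fp_count)

# Macro: compute the metric per label, then average the per-label values.
macro_precision = sum(precision(t, f) for t, f in zip(tp, fp)) / len(tp)

# Micro: sum the counts over all labels first, then apply the metric once.
micro_precision = precision(sum(tp), sum(fp))

print(macro_precision)  # (10/15 + 2/10 + 30/40) / 3 ≈ 0.54
print(micro_precision)  # 42 / 65 ≈ 0.65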
You might be interested in having a look at the code for the multi-label metrics here, which is part of the mldr package in R. You might also want to look at the Java multi-label library MULAN.
This is a nice paper to get into the different metrics: A Review on Multi-Label Learning Algorithms
The answer is that you have to compute precision and recall for each class, then average them together. E.g. if you have classes A, B, and C, then your precision is:
(precision(A) + precision(B) + precision(C)) / 3
Same for recall.
I'm no expert, but this is what I have determined based on the following sources:
https://list.scms.waikato.ac.nz/pipermail/wekalist/2011-March/051575.html
http://stats.stackexchange.com/questions/21551/how-to-compute-precision-recall-for-multiclass-multilabel-classification
Let us assume that we have a 3-class classification problem with labels A, B, and C.
The first thing to do is to generate a confusion matrix. Note that the values in the diagonal are always the true positives (TP).
Now, to compute recall for label A you can read off the values from the confusion matrix and compute:
Recall_A = TP_A / (TP_A + FN_A)
         = TP_A / (Total gold labels for A)
Now, let us compute precision for label A, you can read off the values from the confusion matrix and compute:
Precision_A = TP_A / (TP_A + FP_A)
            = TP_A / (Total predicted as A)
You just need to do the same for the remaining labels B and C. This applies to any multi-class classification problem.
Here is the full article that talks about how to compute precision and recall for any multi-class classification problem, including examples.
In python using sklearn and numpy:
from sklearn.metrics import confusion_matrix
import numpy as np

labels = ...
predictions = ...

cm = confusion_matrix(labels, predictions)
# Per-class recall: diagonal (TP) divided by the row sums (actual count of each class).
recall = np.diag(cm) / np.sum(cm, axis=1)
# Per-class precision: diagonal (TP) divided by the column sums (predicted count of each class).
precision = np.diag(cm) / np.sum(cm, axis=0)
Simple averaging will do if the classes are balanced.
Otherwise, recall for each real class needs to be weighted by the prevalence of that class, and precision for each predicted label needs to be weighted by the bias (probability) of that label. Either way you get Rand accuracy.
A more direct way is to make a normalized contingency table (divide by N so the whole table sums to 1 over all combinations of label and class) and add up the diagonal to get Rand accuracy.
But if the classes aren't balanced, the bias remains and a chance-corrected method such as kappa is more appropriate, or better still ROC analysis or a chance-corrected measure such as informedness (the height above the chance line in ROC).
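As a small numeric check of my own: weighting each class's recall by its prevalence collapses to the overall (Rand) accuracy, which is why simple averaging is only safe for balanced classes.
import numpy as np

# Hypothetical 3-class confusion matrix: rows = true classes, columns = predictions.
cm = np.array([[50, 5, 5],
               [10, 20, 0],
               [ 0, 5, 5]])

n = cm.sum()
recall_per_class = np.diag(cm) / cm.sum(axis=1)   # TP_c / (number of true members of class c)
prevalence = cm.sum(axis=1) / n                   # fraction of samples in each true class

weighted_recall = np.sum(prevalence * recall_per_class)
accuracy = np.trace(cm) / n                       # sum of the diagonal of the normalized table

print(weighted_recall, accuracy)                  # both equal 0.75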
