In the original HOG (Histogram of Oriented Gradients) paper http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf there are some images that show the HOG representation of an image (Figure 6). In this figure, parts f and g are captioned "HOG descriptor weighted by respectively the positive and the negative SVM weights".
I don't understand what this means. I understand that when I train an SVM, I get a weight vector, and to classify, I have to use the features (HOG descriptors) as the input of the decision function. So what do they mean by positive and negative weights? And how would I plot them like the paper does?
Thanks in Advance.
The weights tell you how significant a specific element of the feature vector is for a given class. That means that if you see a high value in your feature vector, you can look up the corresponding weight:
If the weight is a large positive number, it is more likely that your object belongs to the class.
If the weight is a large negative number, it is more likely that your object does NOT belong to the class.
If the weight is close to zero, this position is mostly irrelevant for the classification.
Now you use those weights to scale the feature vector, where the lengths of the gradients are mapped to color intensity. Because you can't display negative color intensities, they decided to split the visualization into a positive and a negative part. In the visualizations you can now see which parts of the input image contribute evidence for the class (positive) and which contribute evidence against it (negative).
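To make that concrete, here is a minimal sketch of how such a visualization could be produced. It assumes scikit-image's hog for the descriptor and a linear SVM weight vector w trained on descriptors with exactly this layout; the stand-in image and random w are placeholders, and none of this is taken from the paper itself, it's just one way to reproduce the idea:

```python
import numpy as np
from skimage import color, data
from skimage.feature import hog

# Stand-in 128x64 detection window and stand-in "SVM weight vector" w;
# in practice `image` is your test window and `w` the weights of a linear
# SVM trained on HOG descriptors with this exact layout (assumption).
image = color.rgb2gray(data.astronaut())[:128, :64]
feat = hog(image, orientations=9, pixels_per_cell=(8, 8),
           cells_per_block=(1, 1), feature_vector=False)
w = np.random.randn(feat.size)

w = w.reshape(feat.shape)          # align SVM weights with the HOG cell layout
w_pos = np.maximum(w, 0)           # weights voting *for* the class
w_neg = np.maximum(-w, 0)          # weights voting *against* the class

# Per-cell evidence maps: weighted gradient-histogram mass in each cell.
pos_map = (feat * w_pos).sum(axis=(2, 3, 4))
neg_map = (feat * w_neg).sum(axis=(2, 3, 4))
# pos_map / neg_map can be shown with matplotlib's imshow: bright cells are
# the ones pushing the SVM score up / down, as in Figure 6 (f) and (g).
```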
I'd like to train a neural net to return a normalized color distance between two RGB pixels, trained (at first) on a simple Euclidean distance with each color component in the range 0-255. There are 6 inputs (2 pixels x R, G, B components) and 1 output (a normalized 0-1 color distance). Or should there be 48 binary inputs (2 pixels x 24 bits per pixel)?
The training set will probably consist of 1k-10k randomly generated pixel pairs and their computed distances. I'm open to creating/chaining several simple nets rather than training one perfect one, e.g. splitting by a hue1/hue2 combination that can be cheaply determined in advance.
I'm new to ML and was hoping to get some advice, since my intuition is basically zero for what type of net I need and what a ballpark topology and set of parameters should be.
Most supervised neural nets seem to do classification into buckets, but I don't quite understand how to get a net to predict a value that is not just one out of several categories. Can I still use a simple feed-forward network for this task, or do I need something different?
Thanks for any advice!
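For what it's worth, here is a tiny sketch of the kind of setup described above: a plain feed-forward net trained for regression with an MSE loss. PyTorch is used purely for illustration, and the layer sizes, learning rate, and training length are arbitrary assumptions, not recommendations:

```python
import numpy as np
import torch
import torch.nn as nn

# Synthetic training data matching the question: two random RGB pixels per
# sample (6 inputs) and their Euclidean distance normalized to [0, 1].
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(5000, 6)).astype(np.float32)
dist = np.linalg.norm(pixels[:, :3] - pixels[:, 3:], axis=1, keepdims=True)
X = torch.from_numpy(pixels / 255.0)
y = torch.from_numpy((dist / np.sqrt(3 * 255.0 ** 2)).astype(np.float32))

model = nn.Sequential(nn.Linear(6, 32), nn.ReLU(),
                      nn.Linear(32, 32), nn.ReLU(),
                      nn.Linear(32, 1), nn.Sigmoid())   # output stays in [0, 1]
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                                  # regression, not classification

for epoch in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
```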
My FCN is trained to detect 10 different classes and produces an output of 500x500x10, where the final dimension holds the prediction probabilities for each class.
Usually, I've seen a uniform threshold, for instance 0.5, used to binarize the probability maps. However, in my case this doesn't quite cut it, because the IoU for some classes is better when the threshold is 0.3, while for other classes it peaks at 0.8.
Hence, I don't want to arbitrarily pick a threshold for each class, but rather use a more principled, probabilistic approach to finalizing the threshold values. I thought of using CRFs, but this also seems to require the thresholding to have already been done. Any ideas on how to proceed?
Example: consider an image of a forest with 5 different birds. I'm trying to output an image that segments the forest and the five birds, i.e. 6 classes, each with a separate label. The network outputs 6 confidence maps indicating, per pixel, the confidence that the pixel belongs to a particular class. Now, the correct answer for a pixel isn't always the class with the highest confidence value, so a one-size-fits-all threshold or a simple argmax won't work.
CRF Postprocessing Approach
You don't need to set thresholds to use a CRF. I'm not familiar with any Python libraries for CRFs, but in principle, what you need to define is:
Unary potentials: a probability distribution over the 10 classes for each of the nodes (pixels), which is simply the output of your network.
Pairwise potentials: a 10x10 matrix, where element A_ij denotes the "strength" of the configuration in which one pixel is of class i and a neighboring pixel is of class j. If you set the potentials to a value alpha (alpha >> 1) on the diagonal and 1 elsewhere, then alpha acts as a regularization force that enforces consistency of the predictions (if pixel X is of class Y, then the neighboring pixels of X are more likely to be of the same class).
This is just one example of how you can define your CRF.
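As a concrete illustration (my own addition, not something from the answer above), the pydensecrf library implements a fully connected CRF of roughly this form. A minimal sketch, assuming the network's softmax output has shape (10, 500, 500) and a matching uint8 RGB input image; random stand-ins are used so the snippet runs on its own:

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

# Stand-ins: `probs` would be your network's softmax output (10, 500, 500)
# and `image` the corresponding uint8 RGB input image.
n_classes, H, W = 10, 500, 500
probs = np.random.dirichlet(np.ones(n_classes), size=H * W).T.reshape(n_classes, H, W)
image = np.random.randint(0, 256, size=(H, W, 3), dtype=np.uint8)

d = dcrf.DenseCRF2D(W, H, n_classes)
d.setUnaryEnergy(unary_from_softmax(probs.astype(np.float32)))  # unary = -log(probs)

# Smoothness term: nearby pixels prefer the same label; `compat` plays the
# role of the regularization strength (alpha above).
d.addPairwiseGaussian(sxy=3, compat=3)
# Appearance term: nearby pixels with similar color prefer the same label.
d.addPairwiseBilateral(sxy=60, srgb=10, rgbim=np.ascontiguousarray(image), compat=5)

Q = d.inference(5)                                  # mean-field iterations
labels = np.argmax(Q, axis=0).reshape(H, W)         # final per-pixel class labels
```

Note that no per-class thresholds appear anywhere: the CRF takes the raw probabilities and returns a single label per pixel.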
End-to-End NN Approach
Add a loss term to your network that penalizes pixels whose neighbors are assigned a different class, as in the sketch below. Note that you will still end up with a tunable parameter for the weight of the new regularization loss.
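One possible form of such a regularization term (a sketch only; the exact penalty and its weight are my assumptions) is a total-variation-style penalty on the softmax output, e.g. in PyTorch:

```python
import torch

def neighbor_consistency_loss(probs: torch.Tensor, weight: float = 0.1) -> torch.Tensor:
    """Penalize class-probability differences between horizontally and
    vertically adjacent pixels. `probs` is the softmax output (N, C, H, W)."""
    dh = (probs[:, :, 1:, :] - probs[:, :, :-1, :]).abs().mean()
    dw = (probs[:, :, :, 1:] - probs[:, :, :, :-1]).abs().mean()
    return weight * (dh + dw)

# Usage (hypothetical): total_loss = segmentation_loss + neighbor_consistency_loss(probs)
```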
I have built an FCN for image segmentation. The object to be segmented covers only very few pixels relative to the image size (1024x1024). As a result, the accuracy is very high even if I only train with 10 images instead of 18000 (my full training set).
My approach to solving this is to use some kind of weighted accuracy, so that the metric actually says something about the performance at identifying the small object (right now the accuracy is high simply because most pixels are not the object, so predicting nothing still yields a high accuracy).
How do I decide the weight? Anybody with some experience?
As you wrote, use a custom weight function which penalizes misclassification of underrepresented pixels more. You can derive the weight from the ratio of object pixels to all pixels in the image, or you can tune it by hand - just make sure you track a metric that tells you the accuracy on the object pixels. Hope it helps.
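For illustration, here is a minimal sketch of turning that pixel ratio into class weights for a weighted cross-entropy loss. PyTorch is assumed, and the tiny synthetic masks are placeholders for your real training masks:

```python
import numpy as np
import torch
import torch.nn as nn

# Stand-in training masks: (N, H, W) arrays with 1 for object pixels, 0 otherwise.
masks = np.zeros((10, 1024, 1024), dtype=np.float32)
masks[:, 500:520, 500:520] = 1.0                     # tiny object, as in the question

object_frac = masks.mean()                           # fraction of object pixels
# Inverse-frequency weights: the rare object class gets a much larger weight.
class_weights = torch.tensor([1.0 / (1.0 - object_frac),
                              1.0 / object_frac], dtype=torch.float32)

criterion = nn.CrossEntropyLoss(weight=class_weights)
# logits: (N, 2, H, W) network output, targets: (N, H, W) long tensor with {0, 1}
# loss = criterion(logits, targets)
```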
You can use the infogain loss layer for a "weighted" loss.
The infogain loss is a generalization of the commonly used cross-entropy loss. It is defined using a weight matrix H (of size L-by-L, where L is the number of classes):
L(p) = -sum_l H[y, l] * log(p[l])
where p is the vector of predicted class probabilities and y is the ground-truth class label. With H set to the identity matrix this reduces to the standard cross-entropy loss.
You can find more details on this loss here.
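To make the connection to the weighting question explicit, here is a small illustrative sketch (my own, with made-up class frequencies) of how the matrix H generalizes cross entropy:

```python
import numpy as np

L = 2                                          # e.g. background vs. object
H_identity = np.eye(L)                         # H = I recovers plain cross entropy

# A diagonal H with inverse class frequencies penalizes mistakes on the rare
# class more heavily (the frequencies below are made up for the example).
freq = np.array([0.995, 0.005])
H_weighted = np.diag(1.0 / freq)

def infogain_loss(p, y, H):
    """L(p) = -sum_l H[y, l] * log(p[l]) for one sample with true label y."""
    return -np.sum(H[y] * np.log(np.clip(p, 1e-12, 1.0)))

p = np.array([0.7, 0.3])                       # predicted probabilities
print(infogain_loss(p, 1, H_identity))         # standard cross-entropy term
print(infogain_loss(p, 1, H_weighted))         # same mistake, weighted 200x more
```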
I am using SVMlight to train a model for binary classification. Using the model, I tested some examples. I was surprised to see that the prediction file contains values greater than 1 as well as less than -1. I thought the range was [-1, 1]. Am I doing something wrong?
It makes sense that the values are not bounded by the interval [-1, 1] once you understand how an SVM works. An SVM tries to find the hyperplane that separates the negative and positive data points while maximizing their distance from it.
The values in the prediction file represent the signed distances of the data points from the SVM's optimal hyperplane: positive values lie on the positive-class side of the hyperplane and negative values on the negative-class side. These distances can be arbitrarily large in magnitude and are not bounded.
Some SVM implementations, such as Weka's implementation of Platt's SMO, normalize the values so that they become confidence values for the positive class bounded by the interval [0, 1]. Both representations work fine for judging how confident the SVM is in a classification, since a data point far from the hyperplane is classified with more certainty than one lying close to it.
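For comparison, the two views (raw signed decision values versus Platt-scaled probabilities) can be seen side by side with scikit-learn, which is used here purely as an illustration of the same idea:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = SVC(kernel="linear", probability=True, random_state=0).fit(X, y)

d = clf.decision_function(X[:5])    # signed distances to the hyperplane, unbounded
p = clf.predict_proba(X[:5])[:, 1]  # Platt-scaled confidence for the positive class
print(d)                            # values outside [-1, 1] are perfectly normal
print(p)                            # always within [0, 1]
```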
I have trained an SVM and a logistic regression classifier on my dataset. Both classifiers provide a weight vector of the same size as the number of features. I can use this weight vector to select the 10 most important features by simply picking the 10 features with the highest weights.
Should I use the absolute values of the weights, i.e. select the 10 features with the highest absolute values?
Second, I have read that this only works for an SVM with a linear kernel but not with an RBF kernel, because for a non-linear kernel the weights no longer correspond directly to the input features. What is the exact reason that the weight vector cannot be used to determine feature importance in the case of a non-linear kernel SVM?
As I answered in a similar question, the weight vector of any linear classifier indicates feature importance: the final score is a linear combination of the feature values with the weights as coefficients, so the bigger the weight (in absolute value), the more impact the corresponding summand has on the final value.
Thus, for a linear classifier you can take the features with the biggest absolute weights (not the ones with the biggest feature values, nor the biggest product of weight and feature value).
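For example, with scikit-learn (used here only for illustration; the synthetic data is a stand-in for your dataset) the selection boils down to sorting the absolute weights of a linear model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=50, random_state=0)
clf = LinearSVC(dual=False).fit(X, y)

w = clf.coef_.ravel()                      # one weight per input feature (binary case)
top10 = np.argsort(np.abs(w))[::-1][:10]   # indices of the 10 most influential features
print(top10)
```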
This also explains why SVMs with non-linear kernels like RBF don't have such a property: the feature values are implicitly mapped into another (possibly infinite-dimensional) space, and the weight vector lives in that space rather than in the original feature space, so you can't say that a bigger weight means a bigger impact of a particular input feature (see the Wikipedia article on the kernel trick).
If you need to select the most important features for a non-linear SVM, use dedicated feature selection methods, for example wrapper methods.