Bag of Words with HOG descriptors - machine-learning

I'm not quite sure how to implement the "Bag of Words" approach with HOG descriptors.
I've checked several sources which usually provide several steps to follow:
Compute the HOGs for the set of valid training images.
Apply a clustering algorithm to retrieve n centroids from the descriptors.
Perform some magic to create histograms of how often each centroid is the nearest one to the computed HOGs, or use OpenCV's implementation to do this.
Train a linear SVM with the histograms.
The step which involves magic (3) is not really clear. If I don't use OpenCV, how would I implement it?
The HOGs are vectors which are calculated cell-wise, so I have a vector for each cell. I could iterate over those vectors, find the closest centroid for each one, and build the histogram accordingly. Would this be a proper way to do it? But if so, I would still have vectors of different sizes and gain nothing from it.

The main steps can be expressed as follows:
1- Extract features from your entire training set (HOG features, for your aim).
2- Cluster those features into a vocabulary V; you get K distinct cluster centers (K-Means, K-Medoids; your hyperparameter will be K).
3- Encode each training image as a histogram of the number of times each vocabulary element shows up in the image. Each image is then represented by a length-K vector.
For example, the first vocabulary element may occur 5 times in your image and the second may occur 10 times. It doesn't matter; in the end you will have a vector with K elements:
K[0] = 5
K[1] = 10
....
....
K[n] = 3
4- Train the classifier (e.g. a linear SVM) using these vectors.
When given a test image, extract its features, then represent the test image as a histogram of the number of times each cluster center from V was closest to a feature in the test image. This is again a length-K vector.
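
A minimal sketch of how steps 1-4 (including the "magic" histogram step) could be implemented without OpenCV, assuming grayscale training images of equal size, scikit-image for HOG and scikit-learn for the clustering and the SVM; the function names and the vocabulary size K = 100 below are illustrative, not from any particular source:

import numpy as np
from skimage.feature import hog
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

K = 100  # vocabulary size (the hyperparameter mentioned above)

def local_hog_descriptors(image):
    # feature_vector=False keeps the block structure: one small HOG descriptor per block
    blocks = hog(image, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2), feature_vector=False)
    return blocks.reshape(-1, blocks.shape[-3] * blocks.shape[-2] * blocks.shape[-1])

def encode(image, kmeans):
    # step 3: count how often each centroid is the nearest one to a local descriptor
    words = kmeans.predict(local_hog_descriptors(image))
    hist = np.bincount(words, minlength=K).astype(float)
    return hist / (hist.sum() + 1e-9)  # normalise so the image size does not matter

def train(train_images, labels):
    all_descriptors = np.vstack([local_hog_descriptors(im) for im in train_images])  # step 1
    kmeans = KMeans(n_clusters=K, n_init=10).fit(all_descriptors)                     # step 2
    X = np.array([encode(im, kmeans) for im in train_images])                         # step 3
    svm = LinearSVC().fit(X, labels)                                                  # step 4
    return kmeans, svm

This also addresses the size concern from the question: however many local HOG descriptors an image produces, the encoded histogram always has exactly K elements.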

Related

Weights in eigenface approach

1) In the eigenface approach the eigenfaces are a combination of elements from different faces. What are these elements?
2) The output face is an image composed of different eigenfaces with different weights. What do the weights of the eigenfaces mean exactly? I know that the weight is the percentage of an eigenface in the image, but what does that mean exactly; does it mean the number of selected pixels?
Please study PCA to understand the physical meaning of eigenfaces when PCA is applied to an image. The answer lies in understanding the eigenvectors and eigenvalues associated with PCA.
EigenFaces is based on Principal Component Analysis (PCA).
PCA performs dimensionality reduction: it finds the distinctive features in the training images and removes the features that the face images have in common.
By keeping only the distinctive features, the recognition task becomes simpler.
Using PCA, you calculate the eigenvectors of your face image data.
From these eigenvectors you calculate the EigenFace of every training subject, i.e. an EigenFace for every class in your data.
So if you have 9 classes, then the number of EigenFaces will be 9.
A weight usually means how important something is.
In EigenFaces, the weight of a particular EigenFace is a vector which tells you how important that particular EigenFace is in contributing to the MeanFace.
Now, if you have 9 EigenFaces, then for every EigenFace you will get exactly one weight vector of dimension N, where N is the number of eigenvectors.
So each of the N elements in a weight vector tells you how important that particular eigenvector is for the corresponding EigenFace.
Facial recognition in EigenFaces is done by comparing the weights of training images and testing images with some kind of distance function.
You can refer to this GitHub notebook: https://github.com/jayshah19949596/Computer-Vision-Course-Assignments/blob/master/EigenFaces/EigenFaces.ipynb
The code at the above link is well documented, so if you know the basics you will understand it.
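
As a rough illustration of where the weights come from and how they are compared, here is a minimal NumPy sketch (the names are illustrative and this is not the code from the linked notebook; each face image is assumed to be flattened into one row of the matrix faces):

import numpy as np

def eigenfaces(faces, n_components):
    # faces: one flattened face image per row
    mean_face = faces.mean(axis=0)
    centered = faces - mean_face
    # the right singular vectors of the centered data are the eigenvectors (EigenFaces)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean_face, vt[:n_components]   # each row of the result is one EigenFace

def weights(face, mean_face, components):
    # weight vector: how much each EigenFace contributes to this face, relative to the mean face
    return components @ (face - mean_face)

def recognise(test_face, train_weights, train_labels, mean_face, components):
    # train_weights holds one weight vector per training face, computed with weights()
    # comparison is done with a distance function (Euclidean here)
    w = weights(test_face, mean_face, components)
    distances = np.linalg.norm(train_weights - w, axis=1)
    return train_labels[np.argmin(distances)]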

How to use principal component analysis (PCA) to speed up detection?

I am not sure whether I am applying PCA correctly or not! I have p features and n observations (instances). I put these in an n x p matrix X. I perform mean normalization and get the normalized matrix B. I calculate the eigenvalues and eigenvectors of the p x p covariance matrix C = (1/(n-1)) B*.B, where * denotes the conjugate transpose.
The eigenvectors corresponding to the descendingly ordered eigenvalues are in a p x p matrix E. Let's say I want to reduce the number of attributes from p to k. I use the equation X_new = B.E_reduced, where E_reduced consists of the first k columns of E. Here are my questions:
1) Should it be X_new = B.E_reduced or X_new = X.E_reduced?
2) Should I repeat the above calculations in the testing phase? If the testing phase is similar to the training phase, then no speed-up is gained, because I have to calculate all p features for each instance in the testing phase, and PCA makes the algorithm slower because of the eigenvector-calculation overhead.
3) After applying PCA, I noticed that the accuracy decreased. Is this related to the number k (I set k = p/2) or to the fact that I am using linear PCA instead of kernel PCA? What is the best way to choose k? I read that I can compute the ratio of the sum of the first k eigenvalues to the sum of all eigenvalues and make a decision based on this ratio.
You usually apply the multiplication to the centered data, so your projected data is also centered.
Never re-run PCA during testing. Only use it on the training data, and keep the shift vector and projection matrix. You need to apply exactly the same projection as during training, not recompute a new projection.
Decreased performance can have many reasons. E.g. did you also apply scaling using the roots of the eigenvalues? And what method did you use in the first place?
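
A minimal sketch of that workflow under the notation from the question (B = centered data, E_reduced = first k eigenvector columns), with placeholder random data just to make it runnable:

import numpy as np

def fit_pca(X_train, k):
    mean = X_train.mean(axis=0)
    B = X_train - mean                                    # mean normalisation
    C = (B.conj().T @ B) / (len(X_train) - 1)             # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]                     # descending eigenvalues
    E_reduced = eigvecs[:, order[:k]]
    explained = eigvals[order[:k]].sum() / eigvals.sum()  # ratio mentioned in question 3
    return mean, E_reduced, explained

def transform(X, mean, E_reduced):
    return (X - mean) @ E_reduced                         # X_new = B.E_reduced (question 1)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 50))                      # placeholder data: n = 200, p = 50
X_test = rng.normal(size=(50, 50))

mean, E_reduced, ratio = fit_pca(X_train, k=25)           # training phase only (question 2)
X_train_new = transform(X_train, mean, E_reduced)
X_test_new = transform(X_test, mean, E_reduced)           # same shift and projection, no new PCA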

Why is image segmentation needed for object detection?

What is the point of image segmentation algorithms like SLIC? Most object detection algorithms work over the entire set of (square) sub-images anyway.
The only conceivable benefit to segmenting the image is that now, the classifier has shape information available to it. Is that right?
Most classifiers I know of take rectangular input images. What classifiers allow you to pass variable sized image segments to them?
First, SLIC and the kind of algorithms I'm guessing you refer to are not segmentation algorithms, they are oversegmentation algorithms. There is a difference between those two terms: segmentation methods split the image into objects, while oversegmentation methods split the image into small clusters (spatially adjacent groups of pixels with similar characteristics); these clusters are usually called superpixels. See the example image of superpixels referenced below.**
Now, answering parts of your original question:
Why use superpixels?
They reduce the dimensionality of your data / the complexity of the problem: from N = w x h pixels to M superpixels, with M << N. That is, for an image composed of N = 320x480 = 153600 pixels, M = 400 or M = 800 superpixels seem to be enough to oversegment it. Leaving for later how to classify them, just consider how much easier your problem becomes when you go from roughly N = 100K pixels to M = 800 training examples to train/classify. The superpixels still represent your data properly, as they adhere to image boundaries.
They allow you to compute more complex features. With pixels, you can only compute some simple statistics on them, or use a filter bank / feature extractor to extract some features in their vicinity. This, however, represents a pixel's neighbourhood very locally, without taking the context into account. With superpixels, you can compute a superpixel descriptor from all the pixels that belong to it. That is, features are usually computed at the pixel level as before, but are then merged into a superpixel descriptor by different methods: a simple mean over all pixels inside the superpixel, histograms, bag-of-words, correlation. As a simple example, imagine you only consider grayscale as a feature for your image/classifier. If you use plain pixels, all you have is each pixel's intensity, which is very local and noisy. If you use superpixels, you can compute a histogram of the intensities of all the pixels inside, which describes the region much better than a single local intensity value.
They allow you to compute new features. Over superpixels you can compute regional statistics (first order, such as the mean or variance, or second order, such as the covariance). You can now extract other information that was not available before (e.g. shape, length, diameter, area, ...).
How to use them?
Compute pixel features
Merge pixel features into superpixel descriptors
Classify/Optimize superpixel descriptors
In step 2, whether by averaging, using a histogram, or using a bag-of-words model, the superpixel descriptor has a fixed size (e.g. 100 bins for a histogram). Therefore, at the end you have reduced the X = N x K training data (N = 100K pixels times K features) to X = M x D (with M = 1K superpixels and D the length of the superpixel descriptor). Usually D > K but M << N, so you end up with regional, more robust features that represent your data better with lower data dimensionality, which is great and reduces the complexity of your problem (classification, optimization) by 2-3 orders of magnitude on average!
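
A minimal sketch of steps 1-2 under simple assumptions (scikit-image's SLIC, grayscale intensity as the only per-pixel feature, and a fixed-length histogram as the superpixel descriptor; the sample image and all parameter values are just illustrative):

import numpy as np
from skimage.data import astronaut
from skimage.color import rgb2gray
from skimage.segmentation import slic

image = astronaut()
gray = rgb2gray(image)                                 # step 1: per-pixel feature (intensity)
labels = slic(image, n_segments=400, compactness=10)  # oversegment into M ~ 400 superpixels

D = 32                                                 # fixed descriptor length
descriptors = []
for sp in np.unique(labels):
    values = gray[labels == sp]                        # all pixels belonging to this superpixel
    hist, _ = np.histogram(values, bins=D, range=(0.0, 1.0), density=True)
    descriptors.append(hist)                           # step 2: merge into a fixed-size descriptor

X = np.vstack(descriptors)                             # M x D matrix, ready for step 3 (classify/optimize)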
Conclusions
You can compute more complex (robust) features, but you have to be careful about how, when, and for what you use superpixels as your data representation. You might lose some information (e.g. you lose your 2D grid lattice), or, if you don't have enough training examples, you might make the problem harder, since the features are more complex and you could turn a linearly separable problem into a non-linear one.
** Image from SLIC superpixels: http://ivrl.epfl.ch/research/superpixels

PCA + shape descriptors - OpenCV

I am working on hand gesture recognition, where I want to recognize numbers (n fingers up). I am comparing different descriptors (convex hull, chain code, Fourier and moments) and different classifiers (Bayesian, kNN and SVM). I wanted to know if I can use PCA as an intermediate step to reduce the feature set. I am unable to figure out how to pass, say, a chain code to PCA as input.
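
One possible way to pass a variable-length descriptor such as a chain code to PCA is to summarise each one into a fixed-length vector first, e.g. a histogram of the 8 Freeman directions; the sketch below assumes that simplification, and the chain codes are made up purely for illustration:

import numpy as np
from sklearn.decomposition import PCA

def chain_code_histogram(chain_code, n_directions=8):
    # variable-length chain code -> fixed-length vector of direction frequencies
    hist = np.bincount(np.asarray(chain_code), minlength=n_directions).astype(float)
    return hist / max(hist.sum(), 1.0)

codes = [[0, 1, 1, 2, 3, 4, 4, 5],                 # chain codes of different lengths
         [7, 7, 6, 5, 4, 3],
         [0, 0, 1, 2, 2, 3, 4, 5, 6, 7]]
X = np.vstack([chain_code_histogram(c) for c in codes])   # one fixed-length row per gesture

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)                   # reduced feature set for the Bayesian/kNN/SVM classifiers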

Feeding HOG into SVM: the HOG has 9 bins, but the SVM takes in a 1D matrix

In OpenCV, there is a CvSVM class which takes in a matrix of samples to train the SVM. The matrix is 2D, with the samples in the rows.
I created my own method to generate a histogram of oriented gradients (HOG) from a video feed. To do this, I created a 9-channel matrix to store the HOG, where each channel corresponds to an orientation bin. So in the end I have a 40x30 matrix of type CV_32FC(9).
I also made a visualisation of the HOG, and it's working.
I don't see how I'm supposed to feed this matrix into the OpenCV SVM, because if I flatten it, I don't see how the SVM is supposed to learn a 9D hyperplane from 1D input data.
The SVM always takes in a single row of data per feature vector. The dimensionality of the feature vector is thus the length of the row. If you're dealing with 2D data, then there are 2 items per feature vector. Example of 2D data is on this webpage:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
There is code for an equivalent demo in OpenCV at http://sites.google.com/site/btabibian/labbook/svmusingopencv
The point is that even though you're thinking of the histogram as 2D with 9-bin cells, the feature vector is in fact the flattened version of this. So it's correct to flatten it out into one long feature vector. The result for me was a feature vector of length 2304 (16x16x9), and I get 100% prediction accuracy on a small test set (i.e. it's probably slightly less than 100%, but it's working exceptionally well).
The reason this works is that the SVM operates on a system of weights, one per item of the feature vector. So it doesn't depend on the problem's spatial layout; the hyperplane always lives in the same dimension as the feature vector. Another way of looking at it is to forget about the hyperplane and just view it as a bunch of weights, one for each item in the feature vector: the SVM multiplies each item by its weight and outputs the result.
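
A minimal sketch of that flattening, using the current Python bindings (cv2.ml.SVM) rather than the old CvSVM class; the random HOG matrices and labels are placeholders matching the 40x30 CV_32FC(9) layout from the question:

import numpy as np
import cv2

def to_feature_row(hog_matrix):
    # 40 x 30 x 9 HOG matrix -> one row of length 40*30*9 = 10800
    return hog_matrix.reshape(1, -1).astype(np.float32)

hogs = [np.random.rand(40, 30, 9).astype(np.float32) for _ in range(20)]  # placeholder HOGs
samples = np.vstack([to_feature_row(h) for h in hogs])                    # one row per sample
labels = np.array([0, 1] * 10, dtype=np.int32)

svm = cv2.ml.SVM_create()
svm.setType(cv2.ml.SVM_C_SVC)
svm.setKernel(cv2.ml.SVM_LINEAR)
svm.train(samples, cv2.ml.ROW_SAMPLE, labels)            # the SVM sees a 10800-dimensional feature vector

_, prediction = svm.predict(to_feature_row(hogs[0]))     # the same flattening is applied at test time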
