I am working on hand gesture recognition, where I want to recognize numbers (the number of fingers held up). I am comparing different descriptors (convex hull, chain code, Fourier and moments) and different classifiers (Bayesian, kNN and SVM). I wanted to know if I can use PCA as an intermediate step to reduce the feature set. I am unable to figure out how to pass, say, a chain code to PCA as input.
In the paper CountNet: Estimating the Number of Concurrent Speakers Using Supervised Learning, which I recently read, it is specified that the 3D volume output from a CNN layer must be reduced to a 2-dimensional sequence before entering the LSTM layer. Why is that? What's wrong with using the 3-dimensional format?
The standard LSTM neural network assumes input of the following size:
[batch size] × [sequence length] × [feature dim]
The LSTM first multiplies each vector of size [feature dim] by a matrix, and then combines the results in a fancy way. What's important here is that there is one vector for each example (the batch dimension) and each timestep (the sequence length dimension). In a sense, this vector is first transformed by one or more matrix multiplications (possibly involving some pointwise non-linearities, which don't change the shape, so I don't mention them) into a hidden state update, which is also a vector, and the updated hidden state vector is then used to produce the output (also a vector).
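To make the shape requirement concrete, here is a minimal NumPy sketch of the reshaping step. The axis order (batch, time, frequency, channels) and the sizes are assumptions chosen for illustration; they are not taken from the CountNet paper or its code.

```python
import numpy as np

# Hypothetical CNN output with axes (batch, time, freq, channels).
# The axis order is an assumption; frameworks may order axes differently.
batch, time, freq, channels = 2, 5, 4, 3
cnn_out = np.arange(batch * time * freq * channels, dtype=np.float32).reshape(
    batch, time, freq, channels
)

# Merge the non-time axes so that each timestep carries one feature vector,
# giving the [batch size] x [sequence length] x [feature dim] layout.
lstm_in = cnn_out.reshape(batch, time, freq * channels)

print(lstm_in.shape)  # (2, 5, 12)
```

Per example, the 3D volume (time, freq, channels) becomes a 2D sequence (time, features), which is the form the LSTM expects.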
As you can see, the LSTM is designed to operate on vectors. You could design a Matrix-LSTM, an LSTM counterpart that assumes any or all of the following are matrices: the input, the hidden state, the output. That would require you to replace the matrix-vector multiplications that process the input (or the state) by a generalized linear operation that is able to turn any matrix into any other, which would be given by a rank-4 tensor, I believe. However, it'd be equivalent to just reshaping the input matrix into a vector, reshaping the rank-4 tensor into a matrix, doing a matrix-vector product and then reshaping the output back into a matrix, so it makes little sense to devise such Matrix-LSTMs instead of just reshaping your inputs.
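The equivalence claimed above can be checked numerically. This is a small sketch with arbitrary sizes and random values (all names here are illustrative): a rank-4 tensor applied to a matrix gives the same result as a reshaped matrix applied to a reshaped vector.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p, q = 3, 4, 2, 5                  # input is m x n, output is p x q
X = rng.standard_normal((m, n))
T = rng.standard_normal((p, q, m, n))    # rank-4 tensor: a general linear map

# Direct application: Y[p, q] = sum over m, n of T[p, q, m, n] * X[m, n]
Y_tensor = np.einsum("pqmn,mn->pq", T, X)

# Same map after reshaping: a (p*q) x (m*n) matrix times a length m*n vector,
# with the result reshaped back into a p x q matrix.
W = T.reshape(p * q, m * n)
Y_matrix = (W @ X.reshape(m * n)).reshape(p, q)

print(np.allclose(Y_tensor, Y_matrix))  # True
```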
That said, it might still make sense to design a generalized LSTM that takes something other than a vector as input if you know something about the input structure that suggests a more specific linear operator than a general rank-4 tensor. For example, images are known to have local structure (nearby pixels are more related than those far apart), hence using convolutions is more "reasonable" than reshaping images to vectors and then performing a general matrix multiplication. In a similar fashion you could replace all the matrix-vector multiplications in the LSTM with convolutions, which would allow for image-like input, states and outputs.
I have some 2D data which looks like it could be well classified by an area described by two intersecting straight lines. These lines won't necessarily be at right angles to each other. Here is a simple example where the two lines would be more or less at right angles:
Is there a suitable classifier for this? Logistic regression will give me one straight line, but I am not sure what will give me two as a decision boundary. A decision tree will give me two that are axis-parallel, which isn't really what I want.
You can give a Support Vector Machine (SVM) a try. There are multiple kernels that can be used with an SVM, such as:
Linear
Polynomial
RBF (Radial Basis Function)
Sigmoid
You can even specify custom kernels as mentioned in this list.
Here is an image of decision boundaries over the Iris dataset, taken from this example:
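One reason a degree-2 polynomial kernel is a plausible candidate here: the product of two line equations is a quadratic in x and y, so its sign is a linear function of degree-2 features. The sketch below checks this with NumPy. The line coefficients are made up for illustration, and it assumes the class occupies the two opposite wedges where the product is positive; a single wedge would need a different boundary.

```python
import numpy as np

def quad_features(pts):
    """Degree-2 monomials [x^2, xy, y^2, x, y, 1] for each 2D point."""
    x, y = pts[:, 0], pts[:, 1]
    return np.column_stack([x * x, x * y, y * y, x, y, np.ones_like(x)])

def product_weights(l1, l2):
    """Coefficients of (a1*x + b1*y + c1) * (a2*x + b2*y + c2) over the monomials."""
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    return np.array([a1 * a2, a1 * b2 + a2 * b1, b1 * b2,
                     a1 * c2 + a2 * c1, b1 * c2 + b2 * c1, c1 * c2])

# Two hypothetical, non-perpendicular lines given as (a, b, c) in a*x + b*y + c = 0
line1 = (1.0, -0.5, 0.2)
line2 = (0.3, 1.0, -0.1)

rng = np.random.default_rng(0)
pts = rng.uniform(-2, 2, size=(200, 2))

# Product of the two line equations, evaluated directly
direct = (pts @ np.array(line1[:2]) + line1[2]) * (pts @ np.array(line2[:2]) + line2[2])

# The same quantity as a linear function of the quadratic features
linear_in_features = quad_features(pts) @ product_weights(line1, line2)

labels = direct > 0   # class = the pair of opposite wedges
print(np.allclose(direct, linear_in_features))  # True
```

A linear classifier on these features (or an SVM with a degree-2 polynomial kernel) can therefore represent this kind of boundary without the lines being axis-parallel.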
References
Difference between various SVM kernels
Selecting Kernels for SVM
Custom kernel SVM
1) In the eigenface approach, each eigenface is a combination of elements from different faces. What are these elements?
2) The output face is an image composed of different eigenfaces with different weights. What exactly do the weights of the eigenfaces mean? I know that a weight is the percentage of an eigenface in the image, but what does that mean exactly? Does it mean the number of selected pixels?
Please read up on PCA to understand the physical meaning of eigenfaces when PCA is applied to an image. The answer lies in understanding the eigenvectors and eigenvalues associated with PCA.
EigenFaces is based on Principal Component Analysis
Principal Component Analysis does dimensionality reduction and finds unique features in the training images and removes the similar features from the face images
By getting unique features our recognition task gets simpler
By using PCA you calculate the eigenvectors for your face image data
From these eigenvectors you calculate the EigenFace of every training subject, or in other words, an EigenFace for every class in your data
So if you have 9 classes then the number of EigenFaces will be 9
The weight usually means how important something is
In EigenFaces, the weight of a particular EigenFace is a vector which tells you how important that particular EigenFace is in contributing to the MeanFace
Now if you have 9 EigenFaces, then for every EigenFace you will get exactly one weight vector of dimension N, where N is the number of eigenvectors
So each of the N elements in one weight vector tells you how important the corresponding eigenvector is for that EigenFace
The facial Recognition in EigenFaces is done by comparing the weights of training images and testing images with some kind of distance function
You can refer to this GitHub link: https://github.com/jayshah19949596/Computer-Vision-Course-Assignments/blob/master/EigenFaces/EigenFaces.ipynb
The code at the above link is well documented, so if you know the basics you will understand it
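As a rough illustration of the weights, here is a minimal NumPy sketch of the common PCA formulation. It uses random stand-in data rather than real faces, and it is not the code from the linked notebook. Note that in this formulation the eigenfaces are the leading eigenvectors themselves, so their number k is a choice (at most the number of training images minus one), which differs in detail from the per-class counting described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, h, w = 12, 8, 8
faces = rng.random((n_images, h * w))    # stand-in data: one flattened image per row

mean_face = faces.mean(axis=0)
centered = faces - mean_face

# Rows of Vt are orthonormal principal directions in pixel space ("eigenfaces")
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
k = 5
eigenfaces = Vt[:k]                      # shape (k, h*w)

# Weights: projection of each mean-centered image onto each eigenface
weights = centered @ eigenfaces.T        # shape (n_images, k)

# Approximate reconstruction: mean face plus a weighted sum of eigenfaces
recon = mean_face + weights @ eigenfaces

# Recognition: compare weight vectors with a distance function
probe = faces[3] + 0.01 * rng.standard_normal(h * w)   # noisy copy of image 3
probe_w = (probe - mean_face) @ eigenfaces.T
nearest = int(np.argmin(np.linalg.norm(weights - probe_w, axis=1)))
print(nearest)
```

So a weight is not a count of pixels: it is the coefficient of one eigenface in the linear combination that approximates the image after subtracting the mean face.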
I'm not quite sure how to implement the "Bag of Words" approach with HOG descriptors.
I've checked several sources which usually provide several steps to follow:
1. Compute the HOGs for the set of valid training images.
2. Apply a clustering algorithm to retrieve n centroids from the descriptors.
3. Perform some magic to create histograms with the frequency of the nearest centroids of the computed HOGs, or use OpenCV's implementation to do this.
4. Train a linear SVM with the histograms.
The step which involves magic (3) is not really clear. If I don't use OpenCV, how would I implement it?
The HOGs are vectors which are calculated cell-wise. So I have a vector for each cell. I could iterate over the vector and calculate the closest centroid for each element of the vector and create the histogram accordingly. Would this be a proper way to do it? But if so, I still have vectors of different sizes and no benefit from it.
The main steps are:
1- Extract features from your entire training set (HOG features in your case).
2- Cluster those features into a vocabulary V; you get K distinct cluster centers (e.g. with K-Means or K-Medoids; K is your hyperparameter).
3- Encode each training image as a histogram of the number of times each vocabulary element shows up in the image. Each image is then represented by a length-K vector.
For example, the first vocabulary element may occur 5 times in your image and the second 10 times. The exact counts don't matter; in the end you will have a histogram vector h with K elements:
h[0] = 5
h[1] = 10
....
h[K-1] = 3
4- Train the classifier using this vector. (Linear SVM)
When given a test image, extract the features. Now represent the test image as a histogram of the number of times each cluster center from V was closest to a feature in the test image. This is a length K vector again.
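The encoding step (step 3) can be sketched without OpenCV as follows. This is a minimal NumPy version; the helper name `bow_histogram`, the toy 2-D descriptors and the hand-picked centroids are all illustrative assumptions. In practice the centroids would come from the clustering in step 2 and each descriptor would be one per-cell (or per-block) HOG vector.

```python
import numpy as np

def bow_histogram(descriptors, centroids):
    """Count how many descriptors fall nearest to each centroid."""
    # Pairwise squared distances, shape (n_descriptors, K)
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    # One bin per vocabulary element, so the output always has length K
    return np.bincount(nearest, minlength=len(centroids))

# Toy vocabulary with K = 3 (hand-picked here; normally the k-means centers)
centroids = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])

# Two "images" with different numbers of descriptors
image_a = np.array([[0.1, 0.2], [9.5, 10.1], [0.3, -0.1], [9.9, 9.8]])
image_b = np.array([[0.2, 9.7]])

print(bow_histogram(image_a, centroids))  # [2 2 0]
print(bow_histogram(image_b, centroids))  # [0 0 1]
```

Note that the nearest centroid is found for each whole descriptor vector, not for each element of it, and that images with different numbers of descriptors still produce histograms of the same length K.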
In OpenCV, there is a CvSVM class which takes in a matrix of samples to train the SVM. The matrix is 2D, with the samples in the rows.
I created my own method to generate a histogram of oriented gradients (HOG) from a video feed. To do this, I created a 9-channel matrix to store the HOG, where each channel corresponds to an orientation bin. So in the end I have a 40x30 matrix of type CV_32FC(9).
I also made a visualisation for the HOG, and it's working.
I don't see how I'm supposed to feed this matrix into the OpenCV SVM, because if I flatten it, I don't see how the SVM is supposed to learn a 9D hyperplane from 1D input data.
The SVM always takes in a single row of data per feature vector. The dimensionality of the feature vector is thus the length of the row. If you're dealing with 2D data, then there are 2 items per feature vector. Example of 2D data is on this webpage:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Code for an equivalent demo in OpenCV: http://sites.google.com/site/btabibian/labbook/svmusingopencv
The point is that even though you're thinking of the histogram as 2D with 9-bin cells, the feature vector is in fact the flattened version of this. So it's correct to flatten it out into a long feature vector. The result for me was a feature vector of length 2304 (16x16x9) and I get 100% prediction accuracy on a small test set (i.e. it's probably slightly less than 100% but it's working exceptionally well).
The reason this works is that the SVM works with one weight per item of the feature vector. So it doesn't have anything to do with the problem's dimension; the hyperplane is always in the same dimension as the feature vector. Another way of looking at it is to forget about the hyperplane and just view it as a bunch of weights, one for each item in the feature vector. In the linear case, it multiplies each item by its weight, sums the products (plus a bias term) and outputs the result.
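A minimal NumPy sketch of this view, assuming a 30x40 grid of cells with 9 bins each (to match the 40x30 matrix in the question) and a linear kernel. The weights `w` and bias `b` are random placeholders standing in for a trained model, not values produced by OpenCV.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in HOG: 30 rows x 40 columns of cells, 9 orientation bins per cell
hog = rng.random((30, 40, 9)).astype(np.float32)

# Flatten into one sample per row, shape (1, 10800)
feature = hog.reshape(1, -1)

# Hypothetical linear-SVM parameters: one weight per feature item, plus a bias
w = rng.standard_normal(feature.shape[1])
b = 0.5

# Decision value: weighted sum of all items plus the bias
score = float(feature[0].astype(np.float64) @ w + b)
label = 1 if score >= 0 else -1

print(feature.shape)  # (1, 10800)
```

The hyperplane here lives in a 10800-dimensional space, the same dimension as the flattened feature vector, regardless of the fact that the HOG was originally stored as a 2D grid with 9 channels.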