Hand motion recognition using Hidden Markov Model - OpenCV

I'm doing a hand motion recognition project for my final assignment. The core of my code is a Hidden Markov Model: the papers I've read say to first detect the object, then perform feature extraction, and finally use an HMM to recognize the motion.
I'm using OpenCV. I've done the hand detection using a Haar classifier and I've prepared the HMM code in C++, but I'm missing some pieces:
1) I don't know how to integrate the Haar classifier with the HMM.
2) How do I perform feature extraction from the detected hand (the Haar classifier's output)?
3) I know we should first train the HMM for motion recognition, but I don't know how to train on motion data. What kind of data should I use, how do I prepare it, and where can I find or collect it?
4) Searching on Google, some people say that HMM motion recognition is similar to HMM speech recognition, but I'm confused about which part is similar.
Please tell me if I'm doing something wrong and suggest what I should do.

To my understanding:
1) Haar cascades are used to detect static objects, which means they work within a single image frame.
2) HMMs are used to recognize temporal patterns, which means they work across frames.
So what you want to do is first track the hand, extract features from the hand in each frame, and then train the HMM on the resulting gesture sequences.
As for the features, the most naive one is the "pixel by pixel" feature: you just stack all the pixels' intensities together into one vector. After this, a dimensionality reduction step is needed, say, PCA.
After that, one way of using an HMM is to discretize the features into clusters, train the model on the discretized state sequences, and then predict the probability that a given sequence of features belongs to each of the gesture classes.
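A minimal sketch of that pipeline, assuming the hand ROIs come from your Haar detector one per frame (the ROI collection, the 32x32 size, the 20 PCA components, and the 8 clusters are all illustrative choices, not requirements):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Sketch: turn each detected hand ROI into a raw pixel feature vector,
// reduce dimensionality with PCA, then quantize with k-means so the
// cluster indices can serve as the HMM's discrete observation symbols.
int main() {
    std::vector<cv::Mat> handRois;   // assumed filled by your Haar detector, one ROI per frame

    cv::Mat samples;                 // one row = one frame's raw pixel feature
    for (const cv::Mat& roi : handRois) {
        cv::Mat gray, small;
        cv::cvtColor(roi, gray, cv::COLOR_BGR2GRAY);
        cv::resize(gray, small, cv::Size(32, 32));   // fixed size so vectors are comparable
        samples.push_back(small.reshape(1, 1));      // flatten to a 1x1024 row
    }
    if (samples.empty()) return 0;
    samples.convertTo(samples, CV_32F);

    // PCA: keep e.g. 20 principal components.
    cv::PCA pca(samples, cv::Mat(), cv::PCA::DATA_AS_ROW, 20);
    cv::Mat reduced = pca.project(samples);

    // k-means: quantize the reduced features into, say, 8 observation symbols.
    cv::Mat labels, centers;
    cv::kmeans(reduced, 8, labels,
               cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 100, 1e-4),
               5, cv::KMEANS_PP_CENTERS, centers);

    // labels.at<int>(t) is now the discrete observation for frame t,
    // ready to be fed to your HMM training code.
    return 0;
}
```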
Note
This is not a standard gesture recognition procedure, but it is a simple, naive baseline that should be manageable for a final project.

Related

OpenCV: Training a soft cascade classifier

I've built an algorithm for pedestrian detection using OpenCV tools. To perform classification I use a boosted classifier trained with the CvBoost class.
The problem with this implementation is that I need to feed my classifier the whole set of features I used for training. This makes the algorithm extremely slow, so much so that each image takes around 20 seconds to be fully analysed.
I need a different detection structure, and OpenCV has a Soft Cascade class that seems like exactly what I need. Its basic principle is that there is no need to examine all the features of a test sample, since a detector can reject most negative samples using a small number of features.
I can find no information about this online, so I am looking for any tips you can give me on how to train and use a soft cascade for classification.
Best regards

Feature Extraction Methods for Hand Gesture/Posture Recognition

I am currently working on a Sign Language Recognition application in which I would like to use a Hidden Markov Model as the classification stage, meaning that I will classify a gesture/posture to obtain the relevant letter or word.
I have completed the first stage, detecting the hand. I can currently obtain a number of parameters (features) to use in my machine learning stage, such as:
convex hull of hand
convexity defects
centroid of the hand
bounding rotated ellipses/rectangles (e.g. obtain any angle needed in terms of rotation)
contour of the hand
moments (I am not sure what these are exactly)
These are all possible to compute with OpenCV.
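For illustration, a hedged sketch of computing several of these with OpenCV (hand_mask.png is a placeholder; it assumes you already have a binary mask of the segmented hand):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    // Assumed input: a binary mask of the segmented hand.
    cv::Mat handMask = cv::imread("hand_mask.png", cv::IMREAD_GRAYSCALE);

    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(handMask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    if (contours.empty()) return 0;
    const std::vector<cv::Point>& hand = contours[0];   // assume the first contour is the hand

    // Convex hull and convexity defects (defects require the hull as indices).
    std::vector<int> hullIdx;
    cv::convexHull(hand, hullIdx, false, false);
    std::vector<cv::Vec4i> defects;
    cv::convexityDefects(hand, hullIdx, defects);

    // Moments give area and centroid.
    cv::Moments m = cv::moments(hand);
    cv::Point2f centroid(m.m10 / m.m00, m.m01 / m.m00);

    // Rotated bounding rectangle gives an orientation angle.
    cv::RotatedRect box = cv::minAreaRect(hand);

    // centroid, box.angle, defects.size(), ... can then be assembled into
    // the per-frame feature vector for the machine learning stage.
    return 0;
}
```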
My question: once I have all these features, how do I carry out the 'feature extraction' stage? That is, if a machine learning algorithm, in this case the HMM, requires a set of probabilities or observations as input, how can I use the above information?
One idea I have is to create a special data structure holding this information which uniquely identifies each gesture, but how do I feed it to the machine learning technique (in this case the Hidden Markov Model)?
Can anyone guide me as to what I should search for at this particular stage, or point out what the real difficulty is here?
Once you have your set of observations ready, you can feed it to the Viterbi algorithm to find the best state sequence that may have produced those observations. You can also train your HMM on a data set of samples using the Baum-Welch algorithm. You could have a look at my blog post, which gives a simple explanation of recognizing dynamic hand gestures with an HMM (although I am NOT using OpenCV or scanning the contour of the hand). I hope this helps you get a general idea of the processing and learning phases.
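To make that concrete, here is a minimal Viterbi sketch over discrete observations. The sizes and probabilities are toy values filled in by hand; in practice they would come from Baum-Welch training:

```cpp
#include <cstdio>

// Minimal Viterbi over discrete observations.
int main() {
    const int N = 2;                     // hidden states
    const int T = 4;                     // observation sequence length
    double pi[N] = {0.6, 0.4};           // initial state probabilities
    double A[N][N] = {{0.7, 0.3},        // A[i][j] = P(state j | state i)
                      {0.4, 0.6}};
    double B[N][3] = {{0.5, 0.4, 0.1},   // B[i][k] = P(symbol k | state i)
                      {0.1, 0.3, 0.6}};
    int obs[T] = {0, 1, 2, 2};           // e.g. cluster indices, one per frame

    double delta[T][N];                  // best path probability ending in state j at time t
    int psi[T][N];                       // backpointers

    for (int j = 0; j < N; ++j) { delta[0][j] = pi[j] * B[j][obs[0]]; psi[0][j] = 0; }
    for (int t = 1; t < T; ++t)
        for (int j = 0; j < N; ++j) {
            double best = -1.0; int arg = 0;
            for (int i = 0; i < N; ++i) {
                double p = delta[t - 1][i] * A[i][j];
                if (p > best) { best = p; arg = i; }
            }
            delta[t][j] = best * B[j][obs[t]];
            psi[t][j] = arg;
        }

    // Backtrack the most likely state sequence.
    int path[T]; double best = -1.0;
    for (int j = 0; j < N; ++j)
        if (delta[T - 1][j] > best) { best = delta[T - 1][j]; path[T - 1] = j; }
    for (int t = T - 2; t >= 0; --t) path[t] = psi[t + 1][path[t + 1]];

    for (int t = 0; t < T; ++t) std::printf("t=%d state=%d\n", t, path[t]);
    return 0;
}
```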

Methods of face detection?

I want to know the best method of face detection, because I'm working on a facial emotion prediction application.
Before analyzing the expression of a face, whether static or moving, the face must be detected or tracked in order to extract the relevant information. Several detection methods exist, but which is the best in my case?
A fast and easy way to get started with face detection is to use OpenCV's Haar detection methods (a slightly modified version of the Viola-Jones face detection algorithm, IIRC). They have pre-trained Haar cascade classifiers for entire faces and individual face components, e.g. eyes, nose, etc. You can also train your own if you feel so inclined. Haar features also have the advantage of being very fast, so they are quite usable with video (which it sounds like you'll be using). Also, having the individual face components classified may simplify your emotion detection/prediction algorithm.
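A minimal sketch of that workflow (the cascade filename and image paths are assumptions; adjust them to your installation):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Sketch: face detection with one of OpenCV's pre-trained Haar cascades.
int main() {
    cv::CascadeClassifier face;
    if (!face.load("haarcascade_frontalface_alt.xml")) return 1;   // assumed path

    cv::Mat img = cv::imread("input.jpg"), gray;
    cv::cvtColor(img, gray, cv::COLOR_BGR2GRAY);
    cv::equalizeHist(gray, gray);                 // helps detection robustness

    std::vector<cv::Rect> faces;
    face.detectMultiScale(gray, faces, 1.1, 3, 0, cv::Size(30, 30));

    for (const cv::Rect& r : faces)
        cv::rectangle(img, r, cv::Scalar(0, 255, 0), 2);
    cv::imwrite("detected.jpg", img);
    return 0;
}
```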
You can find the OpenCV documentation detailing Haar feature-based object recognition at http://docs.opencv.org/modules/objdetect/doc/cascade_classification.html#viola01
and an example of performing face detection at http://code.opencv.org/projects/opencv/repository/revisions/master/entry/samples/cpp/dbt_face_detection.cpp
As for the emotion detection, that's an open research question, so anything you try will likely be fairly involved. If you're into that sort of thing, some good papers to look over might be http://www.utdallas.edu/dept/eecs/research/researchlabs/msp-lab/publications/Busso_2004.pdf and http://humansensing.cs.cmu.edu/papers/Automated.pdf

Difference between feature detection and object detection

I know that the most common object detection involves Haar cascades, and that there are many techniques for feature detection such as SIFT, SURF, STAR, ORB, etc., but if my end goal is to recognize objects, don't both approaches end up giving me the same result? I understand using feature techniques on simple shapes and patterns, but these feature algorithms seem to work for complex objects as well.
I don't need to know the difference in how they function, but whether having one of them is enough to exclude the other. If I use Haar cascades, do I need to bother with SIFT? Why bother?
Thanks.
EDIT: for my purposes I want to implement object recognition on a broad class of things, meaning that anything shaped like a cup will be picked up as part of the class 'cup'. But I also want to recognize specific instances, meaning an NYC cup will be picked up as the instance 'NYC cup'.
Object detection usually consists of two steps: feature detection and classification. In the feature detection step, the relevant features of the object to be detected are gathered. These features are the input to the second step, classification. (Even Haar cascades can be used for feature detection, to my knowledge.) Classification involves algorithms such as neural networks, k-nearest neighbors, and so on. The goal of classification is to find out whether the detected features correspond to features that the object to be detected would have. Classification generally belongs to the realm of machine learning. Face detection is one example of object detection.
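To make the two steps concrete, here is a hedged sketch of just the first step, feature detection, using ORB (the image path is a placeholder, and the classifier that would consume the descriptors is deliberately left out):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Sketch of the feature-detection step in isolation: ORB keypoints and
// descriptors. A classifier (k-NN, SVM, ...) would then consume the
// descriptors to decide whether the object is present.
int main() {
    cv::Mat img = cv::imread("scene.jpg", cv::IMREAD_GRAYSCALE);

    cv::Ptr<cv::ORB> orb = cv::ORB::create(500);   // up to 500 keypoints
    std::vector<cv::KeyPoint> keypoints;
    cv::Mat descriptors;                           // one 32-byte row per keypoint
    orb->detectAndCompute(img, cv::noArray(), keypoints, descriptors);

    // 'keypoints' are the low-level features; 'descriptors' are their numeric
    // encodings. This is feature detection, not yet object detection.
    return 0;
}
```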
EDIT (Jul. 9, 2018):
With the advent of deep learning, neural networks with multiple hidden layers have come into wide use, making it relatively easy to see the difference between feature detection and object detection. A deep learning neural network consists of two or more hidden layers, each of which is specialized for a specific part of the task at hand. For neural networks that detect objects from an image, the earlier layers arrange low-level features into a many-dimensional space (feature detection), and the later layers classify objects according to where those features are found in that many-dimensional space (object detection). A nice introduction to neural networks of this kind is found in the Wolfram Blog article "Launching the Wolfram Neural Net Repository".
Normally objects are collections of features. A feature tends to be a very low-level primitive, whereas an object moves the understanding of the scene to the next level up.
A feature might be something like a corner or an edge, whereas an object might be something like a book, a box, or a desk. These objects are all composed of multiple features, some of which may be visible in any given scene.
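For instance, detecting corners, one of those low-level primitives, is a single call in OpenCV; a minimal sketch (image path assumed):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Sketch: corners as an example of low-level features. Each corner is just
// a point; it takes many such features together to imply an object.
int main() {
    cv::Mat img = cv::imread("scene.jpg", cv::IMREAD_GRAYSCALE);

    std::vector<cv::Point2f> corners;
    cv::goodFeaturesToTrack(img, corners,
                            100,      // at most 100 corners
                            0.01,     // quality level relative to the best corner
                            10.0);    // minimum distance between corners
    return 0;
}
```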
Invariance, speed, storage: a few reasons I can think of off the top of my head. The alternative would be to keep the complete image and then check whether a given image is similar to the glass images you have in your database. But if you have a compressed representation of the glass, it needs less computation (and is thus faster), needs less storage, and the features give you invariance across images.
Both the methods you mentioned are essentially the same, with slight differences. In the case of Haar, you detect the Haar features and then boost them to increase the confidence. Boosting is nothing but a meta-classifier which smartly chooses which Haar features to include in your final meta-classification, so that it can give a better estimate. The other method more or less does the same, except that you have more "sophisticated" features. The main difference is that you don't use boosting directly; you tend to use some sort of classification or clustering, like MoG (Mixture of Gaussians) or k-means or some other heuristic, to cluster your data. Your clustering largely depends on your features and application.
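As a rough illustration of that clustering step, here is a sketch that quantizes descriptors with k-means into a small visual vocabulary (the descriptor matrix is random stand-in data; in practice it would come from an extractor such as ORB or SIFT run over your training images):

```cpp
#include <opencv2/opencv.hpp>

// Sketch: cluster local feature descriptors with k-means to build a small
// "vocabulary" of visual words.
int main() {
    cv::Mat descriptors(1000, 32, CV_32F);                      // hypothetical: 1000 descriptors, 32-dim
    cv::randu(descriptors, cv::Scalar(0), cv::Scalar(1));       // stand-in for real descriptors

    cv::Mat labels, vocabulary;
    cv::kmeans(descriptors, 50, labels,                         // 50 visual "words"
               cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 100, 1e-4),
               3, cv::KMEANS_PP_CENTERS, vocabulary);

    // Each descriptor is now assigned to a cluster (labels); per-image
    // histograms of these assignments are what a classifier would consume.
    return 0;
}
```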
What will work in your case is a tough question. If I were you, I would play around with Haar, and if it doesn't work, try the other method. Be aware that you might want to segment the image and provide some sort of boundary around the object for it to detect the glasses.

Machine learning - SVM feature fusion technique

For my final thesis I am trying to build a 3D face recognition system by combining color and depth information. The first step I did was to realign the data head to a given model head using the iterative closest point algorithm. For the detection step I was thinking about using libsvm, but I don't understand how to combine the depth and the color information into one feature vector. They are dependent pieces of information (each point consists of color (RGB), depth information, and also scan quality). What do you suggest doing? Something like weighting?
Edit:
Last night I read an article about SURF/SIFT features and I would like to use them! Could it work? The concept would be the following: extract these features from the color image and the depth image (range image), and use each feature as a single feature vector for the SVM?
Concatenation is indeed a possibility. However, as you are working on 3d face recognition you should have some strategy as to how you go about it. Rotation and translation of faces will be hard to recognize using a "straightforward" approach.
You should decide whether you attempt to perform a detection of the face as a whole, or of sub-features. You could attempt to detect rotation by finding some core features (eyes, nose, etc).
Also, remember that SVMs are inherently binary (i.e. they separate between two classes). Depending on your exact application you will very likely have to employ some multi-class strategy (one-against-all or one-against-one).
I would recommend doing some literature research to see how others have attacked the problem (a Google search will be a good start).
It sounds simple, but you can simply concatenate the two vectors into one. Many researchers do this.
What you have arrived at is an important open problem. Yes, there are some ways to handle it, as mentioned here by Eamorr. For example, you can concatenate the features and then apply PCA (or some nonlinear dimensionality reduction method). But it is hard to defend the practicality of doing so, considering that PCA takes O(n^3) time in the number of features. That alone might be unreasonable for vision data, which may have thousands of features.
As mentioned by others, the easiest approach is to simply combine the two sets of features into one.
An SVM is characterized by the normal to the maximum-margin hyperplane, whose components specify the weights/importance of the features, such that higher absolute values have a larger impact on the decision function. Thus the SVM assigns a weight to each feature all on its own.
In order for this to work, you would obviously have to normalize all the attributes to have the same scale (say, transform all features to be in the range [-1,1] or [0,1]).
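A hedged sketch of that recipe (the feature matrices are random stand-in data, and it uses OpenCV's ml module rather than libsvm, but the fusion idea is the same):

```cpp
#include <opencv2/opencv.hpp>
#include <opencv2/ml.hpp>

// Sketch: fuse two feature sets by min-max normalizing each block to [0,1],
// concatenating per sample, and training an SVM on the result.
int main() {
    // Hypothetical inputs: one row per sample.
    cv::Mat colorFeat(100, 64, CV_32F), depthFeat(100, 16, CV_32F);
    cv::Mat labels(100, 1, CV_32S);
    cv::randu(colorFeat, cv::Scalar(0), cv::Scalar(255));
    cv::randu(depthFeat, cv::Scalar(0), cv::Scalar(10));
    cv::randu(labels, cv::Scalar(0), cv::Scalar(2));   // two classes: 0 and 1

    // Normalize each feature block to [0,1] so neither dominates by scale.
    cv::normalize(colorFeat, colorFeat, 0, 1, cv::NORM_MINMAX);
    cv::normalize(depthFeat, depthFeat, 0, 1, cv::NORM_MINMAX);

    // Concatenate horizontally: each sample becomes one fused vector.
    cv::Mat fused;
    cv::hconcat(colorFeat, depthFeat, fused);

    cv::Ptr<cv::ml::SVM> svm = cv::ml::SVM::create();
    svm->setKernel(cv::ml::SVM::LINEAR);
    svm->train(fused, cv::ml::ROW_SAMPLE, labels);
    return 0;
}
```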
