Finger/Hand Gesture Recognition using Kinect - opencv

Let me explain my need before I explain the problem.
I am aiming to build a hand-controlled application:
navigation using the palm and clicks using a grab/fist gesture.
Currently, I am working with OpenNI, which sounds promising and has a few examples that turned out to be useful in my case, as it has an inbuilt hand tracker in its samples, which serves my purpose for the time being.
What I want to ask is,
1) What would be the best approach to build a fist/grab detector?
I trained and used AdaBoost fist classifiers on extracted RGB data, which worked reasonably well, but produced too many false detections to move forward.
So, here I frame two more questions:
2) Is there any other good library that can meet my needs using depth data?
3) Can we train our own hand gestures, especially ones involving fingers? Some papers refer to HMMs; if so, how do we proceed with a library like OpenNI?
Yes, I tried the middleware libraries in OpenNI, such as the grab detector, but they won't serve my purpose, as they are neither open source nor a match for my needs.
Apart from what I asked, any suggestion you think could help me is welcome.

You don't need a trained fist classifier as your first approach, since training will only complicate things.
Don't use color either, since it's unreliable (it mixes with the background and changes unpredictably depending on lighting and viewpoint).
Assuming that your hand is the closest object, you can simply
segment it out with a depth threshold. You can set the threshold manually, use the closest region of the depth histogram, or first run connected components on the depth map to break it into meaningful parts (and then select your object based not only on its depth but also on its dimensions, motion, user input, etc.).
Apply convexity defects from the OpenCV library to find the fingers.
Track fingers rather than rediscovering them in 3D; this will increase stability. I successfully implemented such finger detection about 3 years ago.
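A minimal sketch of the first two steps (depth thresholding plus convexity defects) with OpenCV in Python; the depth band and the defect-depth threshold are placeholders that depend on your setup:
import cv2
import numpy as np

def count_fingers(depth_mm, near=400, far=700):
    # Keep only the depth band where the hand is expected (values in millimetres).
    mask = ((depth_mm > near) & (depth_mm < far)).astype(np.uint8) * 255
    mask = cv2.medianBlur(mask, 5)

    # Take the largest blob in the mask as the hand (OpenCV 4.x return signature).
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return 0
    hand = max(contours, key=cv2.contourArea)

    # Convexity defects are the deep valleys between extended fingers.
    hull = cv2.convexHull(hand, returnPoints=False)
    defects = cv2.convexityDefects(hand, hull)
    if defects is None:
        return 0

    deep = 0
    for start, end, farthest, fixpt_depth in defects[:, 0]:
        # fixpt_depth is the defect depth in 1/256-pixel units;
        # only deep defects correspond to gaps between fingers.
        if fixpt_depth / 256.0 > 20:
            deep += 1
    return deep + 1 if deep else 0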

Read my paper :) http://robau.files.wordpress.com/2010/06/final_report_00012.pdf
I have done research on gesture recognition for hands and evaluated several approaches that are robust to scale, rotation, etc. You have depth information, which is very valuable, as the hardest problem for me was actually segmenting the hand out of the image.
My most successful approach is to trace the contour of the hand and, for each point on the contour, take the distance to the centroid of the hand. This gives a set of points that can be used as input for many training algorithms.
I use the image moments of the segmented hand to determine its rotation, so there is a consistent starting point on the hand's contour. It is then very easy to distinguish a fist from a stretched-out hand and to count the number of extended fingers.
Note that while it works fine, your arm tends to get tired from pointing into the air.
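A minimal sketch of that contour-to-centroid distance signature with OpenCV in Python, assuming you already have a binary mask of the segmented hand; the resampling length is a placeholder, and the rotation normalisation via image moments is left out for brevity:
import cv2
import numpy as np

def hand_signature(hand_mask, n_samples=64):
    # Outer contour of the segmented hand (OpenCV 4.x return signature).
    contours, _ = cv2.findContours(hand_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    cnt = max(contours, key=cv2.contourArea)

    # Centroid of the contour from its moments.
    m = cv2.moments(cnt)
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]

    # Distance of every contour point to the centroid.
    pts = cnt.reshape(-1, 2).astype(np.float32)
    dist = np.hypot(pts[:, 0] - cx, pts[:, 1] - cy)

    # Resample to a fixed length and normalise, so the signature is
    # comparable across hands of different size and contour length.
    idx = np.linspace(0, len(dist) - 1, n_samples).astype(int)
    sig = dist[idx]
    return sig / (sig.max() + 1e-6)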

It seems that you are unaware of the Point Cloud Library (PCL). It is an open-source library dedicated to the processing of point clouds and RGB-D data, which is based on OpenNI for the low-level operations and which provides a lot of high-level algorithms, for instance to perform registration, segmentation and also recognition.
A very interesting algorithm for shape/object recognition in general is called the implicit shape model. In order to detect a global object (such as a car, or an open hand), the idea is first to detect possible parts of it (e.g. wheels, trunk, etc., or fingers, palm, wrist, etc.) using a local feature detector, and then to infer the position of the global object by considering the density and the relative position of its parts. For instance, if I can detect five fingers, a palm and a wrist in a given neighborhood, there's a good chance that I am in fact looking at a hand; if, however, I only detect one finger and a wrist somewhere, it could be a pair of false detections. The academic research article on this implicit shape model algorithm can be found here.
In PCL, there are a couple of tutorials dedicated to the topic of shape recognition, and luckily one of them covers the implicit shape model, which has been implemented in PCL. I have never tested this implementation, but from what I could read in the tutorial, you can specify your own point clouds for the training of the classifier.
That being said, you did not mention it explicitly in your question, but since your goal is to program a hand-controlled application, you might in fact be interested in a real-time shape detection algorithm. You would have to test the speed of the implicit shape model provided in PCL, but I think this approach is better suited to offline shape recognition.
If you do need real-time shape recognition, I think you should first use a hand/arm tracking algorithm (these are usually faster than full detection) in order to know where to look in the images, instead of trying to perform a full shape detection at each frame of your RGB-D stream. You could, for instance, track the hand location by segmenting the depth map (e.g. using an appropriate threshold on the depth) and then detecting the extremities.
Then, once you approximately know where the hand is, it should be easier to decide whether the hand is making a gesture relevant to your application. I am not sure what exactly you mean by fist/grab gestures, but I suggest that you define and use some app-controlling gestures which are easy and quick to distinguish from one another.
Hope this helps.

The quick answer is: yes, you can train your own gesture detector using depth data. It is really easy, but it depends on the type of gesture.
Suppose you want to detect a hand movement:
Detect the hand position (x, y, z). Using OpenNI this is straightforward, as you have one node for the hand.
Execute the gesture and collect ALL the positions of the hand during the gesture.
With the list of positions, train an HMM. For example, you can use Matlab, C, or Python.
For your own gestures, you can then test the model and detect the gestures.
Here you can find a nice tutorial and code (in Matlab). The code (test.m) is pretty easy to follow. Here is a snippet:
%Load collected data
training = get_xyz_data('data/train',train_gesture);
testing = get_xyz_data('data/test',test_gesture);
%Get clusters
[centroids N] = get_point_centroids(training,N,D);
ATrainBinned = get_point_clusters(training,centroids,D);
ATestBinned = get_point_clusters(testing,centroids,D);
% Set priors:
pP = prior_transition_matrix(M,LR);
% Train the model:
cyc = 50;
[E,P,Pi,LL] = dhmm_numeric(ATrainBinned,pP,[1:N]',M,cyc,.00001);
Dealing with fingers is pretty much the same, but instead of detecting the hand you need to detect the fingers. As the Kinect doesn't provide finger joints, you need specific code to detect them (using segmentation or contour tracking). Some examples using OpenCV can be found here and here, but the most promising one is the ROS library that has a finger node (see example here).
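For reference, the same train-and-score workflow can be sketched in Python with the hmmlearn package (this library choice is not part of the tutorial above; the number of states and the gesture names are placeholders):
import numpy as np
from hmmlearn import hmm  # pip install hmmlearn

def train_gesture_model(trajectories, n_states=8):
    # trajectories: list of (T_i, 3) arrays of hand positions (x, y, z),
    # e.g. collected from the OpenNI hand node while performing one gesture.
    X = np.concatenate(trajectories)
    lengths = [len(t) for t in trajectories]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=100)
    model.fit(X, lengths)
    return model

def classify_gesture(trajectory, models):
    # models: dict mapping a gesture name to its trained GaussianHMM.
    # The gesture whose model assigns the highest log-likelihood wins.
    scores = {name: m.score(trajectory) for name, m in models.items()}
    return max(scores, key=scores.get)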

If you only need detection of a fist/grab state, you should give Microsoft a chance. Microsoft.Kinect.Toolkit.Interaction contains methods and events that detect the grip / grip-release state of a hand. Take a look at the HandEventType of InteractionHandPointer. That works quite well for fist/grab detection, but it does not detect or report the position of individual fingers.
The next Kinect (Kinect One) detects 3 joints per hand (Wrist, Hand, Thumb) and has 3 hand-based gestures: open, closed (grip/fist) and lasso (pointer). If that is enough for you, you should consider the Microsoft libraries.

1) If there are a lot of false detections, you could try to extend the negative sample set of the classifier and train it again. The extended negative image set should contain images in which the fist was falsely detected. Maybe this will help to create a better classifier.

I've had quite a bit of success with the middleware library provided by http://www.threegear.com/. It provides several gestures (including grabbing, pinching and pointing) and 6-DOF hand tracking.

You might be interested in this paper & open-source code:
Robust Articulated-ICP for Real-Time Hand Tracking
Code: https://github.com/OpenGP/htrack
Screenshot: http://lgg.epfl.ch/img/codedata/htrack_icp.png
YouTube Video: https://youtu.be/rm3YnClSmIQ
Paper PDF: http://infoscience.epfl.ch/record/206951/files/htrack.pdf

Related

Structure from Motion with Optical Flow

Let's say I have a video from a drive recorder. I want to construct a point cloud of the recorded scene using the structure-from-motion technique. First I need to track some points.
Which algorithm would yield a better result: sparse optical flow (the Kanade-Lucas-Tomasi tracker) or dense optical flow (Farneback)? I have experimented a bit but cannot really decide. Each of them has its own strengths and weaknesses.
The ultimate target is to get a point cloud of the recorded cars in the scene. By using sparse optical flow, I can track the interesting points of the cars, but it would be quite unpredictable. One solution is to lay some kind of grid over the image and force the tracker to track one interesting point in each grid cell, but I think this would be quite hard.
By using dense flow, I can get the movement of every pixel, but the problem is that it cannot really detect the motion of cars that move only slightly. I also doubt that the flow yielded for every pixel would be that accurate. Plus, with this approach I believe I can only get pixel movement between two consecutive frames (unlike with sparse optical flow, where I can get multiple coordinates of the same interesting point over time t).
Your title indicates SfM, which includes pose estimation;
tracking is only the first step (matching). If you want a point cloud from video (a very hard task), the first thing I would think of is bundle adjustment, which also works for MVE.
Nevertheless, for video we can do more: since consecutive frames are very close to each other, we can use a faster algorithm (optical flow) instead of SIFT matching, extract the fundamental matrix F from the correspondences, and then obtain the essential matrix as:
E = K^T * F * K    (with K the camera intrinsics matrix)
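A minimal sketch of that pipeline with OpenCV in Python: track sparse points with pyramidal Lucas-Kanade, then estimate the essential matrix robustly and decompose it into a relative pose. The intrinsics matrix K is assumed to be known from calibration, and the feature/RANSAC parameters are placeholders.
import cv2
import numpy as np

def relative_pose(prev_gray, curr_gray, K):
    # Sparse KLT tracking between two consecutive frames.
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=2000, qualityLevel=0.01, minDistance=7)
    p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, p0, None)
    good0 = p0[status.flatten() == 1].reshape(-1, 2)
    good1 = p1[status.flatten() == 1].reshape(-1, 2)

    # Essential matrix with RANSAC, then decompose into R, t (translation up to scale).
    E, inliers = cv2.findEssentialMat(good0, good1, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, good0, good1, K, mask=inliers)
    return R, t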
Back to your original question, which is better:
1) Dense optical flow, or
2) Sparse optical flow?
Apparently you are working offline, so speed is not critical, but I would recommend the sparse one.
Update
For 3D reconstruction the dense flow may seem more attractive, but as you said it is rarely robust, so you can use sparse flow and add as many points as you want to make it semi-dense.
I can name only a few methods that do this, such as MonoSLAM or ORB-SLAM.
Final Update
Use semi-dense flow as I wrote earlier, but note that SfM always assumes static objects (no independent movement), or it will never work.
In practice, using all the pixels in the image was long avoided for 3D reconstruction (outside of direct methods), and SIFT was the praised way of detecting and matching features. Recently, however, all the pixels have been used in different kinds of approaches, for example in methods like Direct Sparse Odometry and LSD, known as direct methods.

PARABOLIC (not panoramic) video stitching?

I want to do something like this but in reverse, so that the cameras are outside and pointing inward. Let's start with the abstract and get specific:
1) Are there any TOOLS that will do this for me? How close can I get using existing software?
2) Say the nearest tool is a graphics library like OpenCV. I've taken linear algebra and have an undergraduate degree in CS but without any special training in graphics. Where should I go from there?
3) If I really am undergoing a decade-long spiritual quest of a self-teaching+programming exercise to make this happen, are there any papers or other resources that you are aware of that might aid me?
I think the demo you linked uses a 360° camera (see the black circle on the bottom) and does not involve stitching in any way.
About your question, are you aware of this work? They don't do stitching either, just blending between different views.
If you use inward views, then the objects you will observe will probably be quite close to the cameras, while standard stitching assumes that objects are far away. Close 3D objects mean high distortion when you change the viewpoint (i.e. parallax & occlusions), which makes it difficult to interpolate between two views. Hence, if you want stitching, then your main problem is to correctly handle parallax effects & occlusions between the views.
In my opinion, the most promising approach would be to do live stereo matching (i.e. dense 3D reconstruction) between the two camera images closest to your current viewpoint, and then interpolate the estimated disparities to generate the expected image. However, it is not likely to run in real time like the demo you linked, and the result could be quite ugly...
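As a very rough illustration of the dense-matching step (assuming the two neighbouring views have been rectified into a stereo pair, which is itself non-trivial here), OpenCV's semi-global matcher could be used; all parameter values below are placeholders:
import cv2

def disparity_map(rect_left_gray, rect_right_gray):
    # Semi-global block matching on a rectified pair; tune the ranges for your rig.
    matcher = cv2.StereoSGBM_create(
        minDisparity=0,
        numDisparities=128,      # must be a multiple of 16
        blockSize=5,
        P1=8 * 5 * 5,
        P2=32 * 5 * 5,
        uniquenessRatio=10,
        speckleWindowSize=100,
        speckleRange=2,
    )
    # Disparities are returned as fixed-point values scaled by 16.
    return matcher.compute(rect_left_gray, rect_right_gray).astype("float32") / 16.0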
EDIT
You can also have a look at this paper, which uses a different but interesting approach; however, it is probably not directly useful in your case, since it requires the new viewpoint to be visible in the available images.

Real time tracking of hand

I am trying to detect and track a hand in real time using OpenCV. I thought Haar cascade classifiers would yield a fair result. After training with 10k positive and 20k negative images respectively, I obtained a classifier XML file. Unfortunately, it detects the hand only in certain positions, suggesting that it works best only for rigid objects. So I am now thinking of adopting another algorithm that can track the hand once it has been detected by the Haar classifier.
My question is: if I make sure that the Haar classifier detects the hand in a certain frame and position, what method would yield robust tracking of the hand from then on?
I searched the web a bit and understood that I can go for optical flow of the detected hand, a Kalman filter, or a particle filter, but I have also come across their respective disadvantages.
Also, if I incorporate stereo vision, would it help, as I could possibly reconstruct the hand in 3D?
You concluded rightly about Haar features - they aren't that useful when it comes to non-rigid objects.
Take a look at the following papers which use skin colour to detect hands.
Interaction between hands and wearable cameras
Markerless inspection of augmented reality objects
and this paper that uses KLT features to track the hand after the first detection:
Fast 2D hand tracking with flocks of features and multi-cue integration
I would say that a stereo camera will not help your cause much, as 3D reconstruction of non-rigid objects isn't straightforward and would require a whole lot of innovation and development. However, you can take a look at the papers in the hand pose estimation section of this page if you wish to pursue 3D tracking.
EDIT: Also take a look at this recent paper, which seems to get good results.
Zhang et al.'s Real-time Compressive Tracking does a reasonable job of tracking an object, once it has been detected by some other method, provided that the motion is not too fast. They have an OpenCV implementation (but it would need a bit of work to reuse).
This research paper describes a method to track hands without using gloves, using a stereo camera setup.
There have been similar questions on Stack Overflow...
Have a look at my answer and those of others: https://stackoverflow.com/a/17375647/1463143
You can certainly get better results by avoiding Haar training and detection for deformable entities.
The CamShift algorithm is generally fast and accurate if you want to track the hand as a single entity. The OpenCV documentation contains a good, easy-to-understand demo program that you can easily modify.
If you need to track fingers etc., however, further modeling will be needed.
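As a rough illustration, here is a minimal CamShift loop in Python in the spirit of the OpenCV demo mentioned above; it assumes the hand has already been detected once and that init_box is its bounding box in the first frame (the histogram setup and termination criteria are ordinary defaults, not tuned values).
import cv2
import numpy as np

def track_with_camshift(cap, init_box):
    # init_box = (x, y, w, h) of the detected hand in the first frame.
    ok, frame = cap.read()
    x, y, w, h = init_box
    roi = frame[y:y + h, x:x + w]

    # Hue histogram of the hand region, used for back-projection.
    hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

    term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    window = init_box
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        # CamShift adapts the window size and orientation every frame.
        rot_rect, window = cv2.CamShift(backproj, window, term)
        box = cv2.boxPoints(rot_rect).astype(np.int32)
        cv2.polylines(frame, [box], True, (0, 255, 0), 2)
        cv2.imshow("hand", frame)
        if cv2.waitKey(1) == 27:   # Esc to quit
            break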

single person tracking from video sequence

As a part of my thesis work, I need to build a program for human tracking from a video or image sequence like the KTH or IXMAS dataset, under the assumptions:
Illumination remains unchanged
Only one person appear in the scene
The program need to perform in real-time
I have searched a lot but still cannot find a good solution.
Please suggest a good method or an existing program that is suitable.
Case 1 - Camera static
If the camera is static, it is really simple to track one person.
You can apply a method called background subtraction.
Here, for better results, you need a bare image from the camera with no person in it; this is the background. (It can also be done even if you don't have this background image, but it is better if you have one. I will explain at the end what to do if you don't.)
Now start capturing from the camera. Take the first frame, convert both the frame and the background image to grayscale, and smooth both images to reduce noise.
Subtract the background image from the frame.
If the frame has no change with respect to the background image (i.e. no person), you get a black image (of course there will be some noise, which we can remove). If there is a change, i.e. a person has walked into the frame, you will get an image with the person visible and the background black.
Now threshold the image with a suitable value.
Apply some erosion to remove small granular noise, then apply dilation.
Now find contours. Most probably there will be one contour: the person.
Find the centroid, or whatever you want, of this person to track.
Now suppose you don't have a background image: you can estimate one using the cvRunningAvg function. It computes a running average of the frames of the video you use for tracking. But as you can understand, the first method is better if you can get a background image.
Here is the implementation of the above method using cvRunningAvg.
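A minimal Python sketch of the pipeline described above; cv2.accumulateWeighted plays the role of cvRunningAvg from the old C API, and the blur size, threshold value and update rate are placeholder values to tune.
import cv2
import numpy as np

def track_person(video_path):
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    gray = cv2.GaussianBlur(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), (5, 5), 0)
    background = gray.astype("float32")          # running-average background
    kernel = np.ones((5, 5), np.uint8)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        gray = cv2.GaussianBlur(gray, (5, 5), 0)

        # Update the running average and subtract it from the current frame.
        cv2.accumulateWeighted(gray, background, 0.01)
        diff = cv2.absdiff(gray, cv2.convertScaleAbs(background))

        # Threshold, erode to remove granular noise, dilate to restore the blob.
        _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
        mask = cv2.dilate(cv2.erode(mask, kernel), kernel)

        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        if contours:
            person = max(contours, key=cv2.contourArea)
            m = cv2.moments(person)
            if m["m00"] > 0:
                cx, cy = int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"])
                cv2.circle(frame, (cx, cy), 5, (0, 0, 255), -1)   # centroid
        cv2.imshow("tracking", frame)
        if cv2.waitKey(1) == 27:
            break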
Case 2 - Camera not static
Here background subtraction won't give good results, since you can't get a fixed background.
OpenCV comes with a sample for people detection. Use it.
This is the file: peopledetect.cpp
I also recommend visiting this SO question, which deals with almost the same problem: How can I detect and track people using OpenCV?
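The peopledetect.cpp sample is essentially the HOG + linear SVM pedestrian detector, so a minimal Python equivalent looks roughly like this (the winStride/padding/scale values are assumptions to be tuned for speed versus recall):
import cv2

# Default HOG person detector shipped with OpenCV (same model as peopledetect.cpp).
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def detect_people(frame):
    # Returns a list of (x, y, w, h) boxes around detected people.
    boxes, weights = hog.detectMultiScale(frame, winStride=(8, 8), padding=(8, 8), scale=1.05)
    return boxes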
One possible solution is to use a feature-point tracking algorithm.
Look at this book:
Laganiere Robert - OpenCV 2 Computer Vision Application Programming Cookbook - 2011
p. 266
The full algorithm is already implemented in this book, using OpenCV.
The above method (simple frame differencing followed by erosion and dilation) would work in the case of a simple, clean scene with just the motion of the person walking and absolutely no other motion or illumination changes. Note also that you are doing a detection in every frame, as opposed to tracking; in this specific scenario it might not be much more difficult to track either. For movement direction and speed, you can just run Lucas-Kanade on the difference images.
At the core of it, what you need is a person detector followed by a tracker. The tracker can be point based (Lucas-Kanade or Horn-Schunck), use a Kalman filter, or be any of those kinds of trackers for bounding boxes or blobs.
A lot of vision problems are ill-posed; some amount of structure/constraints helps to solve them considerably faster. A few questions to ask would be:
Is the camera moving? No: quite easy. Yes: much harder; exactly what works depends on other conditions.
Is the scene constant except for the person?
Is the person front-facing / side-facing most of the time? Detect using Viola-Jones, or train a detector (AdaBoost with Haar or similar features) for side-facing faces.
How spatially accurate do you need it to be: will a bounding box do, or do you need a contour? Bounding box: just search (intelligently :)) in the neighbourhood with SAD (sum of absolute differences). Contour: tougher; use contour-based trackers.
Do you need the "tracklet" or just the position of the person at each frame, i.e. what temporal accuracy?
What resolution are we speaking about here, since you need real time?
Is the scene sparse like those sequences, or would it be cluttered?
Is there any other motion in the sequence?
Offline or online?
If you develop in .NET, you can use the AForge.NET framework.
http://www.aforgenet.com/
I was a regular visitor of the forums and I seem to remember there are plenty of people using it for tracking people.
I've also used the framework for other non-related purposes and can say I highly recommend it for its ease of use and powerful features.

What is the best method for object detection in low-resolution moving video?

I'm looking for the fastest and most efficient method of detecting an object in a moving video. Things to note about this video: it is very grainy and low resolution, and both the background and foreground are moving simultaneously.
Note: I'm trying to detect a moving truck on a road in a moving video.
Methods I've tried:
Training a Haar cascade - I've attempted to train classifiers to identify the object by cropping multiple images of the desired object. This produced either many false detections or no detections at all (the desired object was never detected). I used about 100 positive images and 4000 negatives.
SIFT and SURF keypoints - When attempting to use either of these feature-based methods, I discovered that the object I wanted to detect was too low-resolution, so there were not enough features to match for an accurate detection. (The desired object was never detected.)
Template matching - This is probably the best method I've tried. It's the most accurate, although also the most hacky of them all. I can detect the object for one specific video using a template cropped from that video. However, there is no guaranteed accuracy, because all that is known is the best match in each frame; no analysis is done on how well the template actually matches the frame. Basically, it only works if the object is always in the video; otherwise it will create a false detection.
So those are the big 3 methods I've tried, and all have failed. What would work best is something like template matching but with scale and rotation invariance (which led me to try SIFT/SURF), but I have no idea how to modify the template matching function.
Does anyone have any suggestions how to best accomplish this task?
Apply optical flow to the image and then segment it based on the flow field. The background flow is very different from the "object" flow (which mainly diverges or converges depending on whether the object is moving towards or away from you, with some lateral component as well).
Here's an oldish project which worked this way:
http://users.fmrib.ox.ac.uk/~steve/asset/index.html
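As a rough sketch of the flow-based segmentation idea (not the method used in the linked project), one could compute dense Farneback flow and keep the pixels whose flow deviates from the dominant motion; the threshold is a placeholder:
import cv2
import numpy as np

def moving_object_mask(prev_gray, curr_gray, motion_thresh=2.0):
    # Dense optical flow between consecutive frames (Farneback).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray,
                                        None, 0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])

    # Treat the median flow magnitude as a crude estimate of the background
    # (camera) motion and keep pixels whose flow magnitude differs from it.
    residual = np.abs(mag - np.median(mag))
    mask = (residual > motion_thresh).astype(np.uint8) * 255
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))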
This vehicle detection paper uses a Gabor filter bank for low-level detection and then uses the responses to create the feature space in which it trains an SVM classifier.
The technique seems to work well and is at least scale invariant. I am not sure about rotation, though.
Not knowing your application, my initial impression is normalized cross-correlation, especially since I remember seeing a purely optical cross-correlator that had vehicle tracking as the example application (tracking a vehicle as it passes using only optical components and an image of the side of the vehicle; I wish I could find the link). This is similar (if not identical) to "template matching", which you say kind of works, but it won't work if the images are rotated, as you know.
However, there's a related method based on log-polar coordinates that will work regardless of rotation, scale, shear, and translation.
I imagine this would also make it possible to detect that the object has left the scene, since the maximum correlation will decrease.
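A minimal sketch of this idea with OpenCV's normalized template matching, adding the explicit score check that plain "best match per frame" lacks; the acceptance threshold of 0.6 is an arbitrary placeholder:
import cv2

def find_truck(frame_gray, template_gray, min_score=0.6):
    # Normalized correlation coefficient (a mean-subtracted variant of
    # normalized cross-correlation): scores lie in [-1, 1].
    result = cv2.matchTemplate(frame_gray, template_gray, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)

    # Only report a detection when the correlation peak is strong enough;
    # when the object leaves the scene, the peak value drops below the threshold.
    if max_val < min_score:
        return None
    h, w = template_gray.shape
    return (max_loc[0], max_loc[1], w, h), max_val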
How low resolution are we talking? Could you also elaborate on the object? Is it a specific color? Does it have a pattern? The answers affect what you should be using.
Also, I might be reading your template matching statement wrong, but it sounds like you are overtraining it (by testing on the same video you extracted the object from??).
A Haar cascade is going to require significant training data on your part and will cope poorly with any change in orientation.
Your best bet might be to combine template matching with an algorithm similar to CamShift in OpenCV (5.7 MB PDF), along with a probabilistic model (you'll have to figure this one out) of whether the truck is still in the image.
