Let me explain my need before I explain the problem.
I am looking to build a hand-controlled application.
Navigation using the palm, and clicks using a grab/fist gesture.
Currently, I am working with OpenNI, which sounds promising and has a few examples that turned out to be useful in my case, since it has a built-in hand tracker in its samples. That serves my purpose for the time being.
What I want to ask is,
1) What would be the best approach to build a fist/grab detector?
I trained and used AdaBoost fist classifiers on extracted RGB data, which worked reasonably well, but it produces too many false detections to move forward with.
So, here I frame two more questions:
2) Is there any other good library capable of meeting my needs using depth data?
3) Can we train our own hand gestures, especially ones involving fingers? Some papers refer to HMMs; if so, how do we proceed with a library like OpenNI?
Yes, I tried the middleware libraries in OpenNI, like the grab detector, but they won't serve my purpose: it is neither open source nor does it match my needs.
Apart from what I asked, anything else you think could help me will be accepted as a good suggestion.
You don't need to train a classifier for the fist; that only complicates things.
Don't use color either, since it's unreliable (it mixes with the background and changes unpredictably depending on lighting and viewpoint).
Assuming that your hand is the closest object, you can simply segment it out with a depth threshold. You can set the threshold manually, use the closest region of the depth histogram, or first run connected components on the depth map to break it into meaningful parts (and then select your object based not only on its depth but also on its dimensions, motion, user input, etc.).
Apply convexity defects from the OpenCV library to find the fingers.
Track the fingers in 3D rather than rediscovering them every frame; this will increase stability. I successfully implemented such finger detection about 3 years ago.
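A rough sketch of these steps in OpenCV (my own illustration, assuming a CV_16UC1 depth map in millimetres from OpenNI; the 200 mm depth band and the defect-depth threshold are guesses you would tune):

#include <opencv2/opencv.hpp>
#include <vector>

int countFingers(const cv::Mat& depth)
{
    // Keep only pixels within ~20 cm of the closest valid point (assumed to be the hand).
    double minVal, maxVal;
    cv::minMaxLoc(depth, &minVal, &maxVal, 0, 0, depth > 0);
    cv::Mat mask = (depth > 0) & (depth < minVal + 200);

    // Clean up the mask and keep the largest blob.
    cv::morphologyEx(mask, mask, cv::MORPH_OPEN, cv::Mat());
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    if (contours.empty()) return 0;
    std::size_t largest = 0;
    for (std::size_t i = 1; i < contours.size(); ++i)
        if (cv::contourArea(contours[i]) > cv::contourArea(contours[largest]))
            largest = i;

    // Convex hull (as indices) plus convexity defects; deep defects sit between fingers.
    std::vector<int> hull;
    cv::convexHull(contours[largest], hull, false, false);
    std::vector<cv::Vec4i> defects;
    cv::convexityDefects(contours[largest], hull, defects);

    int deepDefects = 0;
    for (std::size_t i = 0; i < defects.size(); ++i)
        if (defects[i][3] / 256.0 > 20.0)   // defect depth in pixels
            ++deepDefects;
    return deepDefects;                     // extended fingers is roughly deepDefects + 1
}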
Read my paper :) http://robau.files.wordpress.com/2010/06/final_report_00012.pdf
I have done research on gesture recognition for hands and evaluated several approaches that are robust to scale, rotation, etc. You have depth information, which is very valuable, as the hardest problem for me was actually segmenting the hand out of the image.
My most successful approach is to trace the contour of the hand and, for each point on the contour, take its distance to the centroid of the hand. This gives a set of values that can be used as input for many training algorithms.
I use the image moments of the segmented hand to determine its rotation, so there is a consistent starting point on the hand's contour. It is then very easy to tell apart a fist, a stretched-out hand, and the number of extended fingers.
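A minimal sketch of that contour-to-centroid signature (my own illustration; it assumes the hand has already been segmented into a binary CV_8UC1 mask called handMask, and the normalisation step is an addition for scale invariance):

#include <opencv2/opencv.hpp>
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<double> contourSignature(const cv::Mat& handMask)
{
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(handMask.clone(), contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_NONE);
    std::vector<double> signature;
    if (contours.empty()) return signature;

    // Centroid from the image moments of the hand contour.
    cv::Moments m = cv::moments(contours[0]);
    double cx = m.m10 / m.m00, cy = m.m01 / m.m00;

    // Distance of every contour point to the centroid, normalised by the maximum.
    double maxDist = 1e-9;
    for (std::size_t i = 0; i < contours[0].size(); ++i)
    {
        double d = std::hypot(contours[0][i].x - cx, contours[0][i].y - cy);
        signature.push_back(d);
        maxDist = std::max(maxDist, d);
    }
    for (std::size_t i = 0; i < signature.size(); ++i) signature[i] /= maxDist;

    // Peaks in the signature correspond to fingertips; a fist gives a flat,
    // almost circular profile. The vector can also be fed to a classifier.
    return signature;
}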
Note that while it works fine, your arm tends to get tired from pointing into the air.
It seems that you are unaware of the Point Cloud Library (PCL). It is an open-source library dedicated to the processing of point clouds and RGB-D data, which is based on OpenNI for the low-level operations and which provides a lot of high-level algorithms, for instance for registration, segmentation and also recognition.
A very interesting algorithm for shape/object recognition in general is called the implicit shape model. In order to detect a global object (such as a car, or an open hand), the idea is first to detect possible parts of it (e.g. the wheels and trunk, or the fingers, palm and wrist) using a local feature detector, and then to infer the position of the global object from the density and relative position of its parts. For instance, if I can detect five fingers, a palm and a wrist in a given neighborhood, there's a good chance that I am in fact looking at a hand; however, if I only detect one finger and a wrist somewhere, it could be a pair of false detections. The academic research article on this implicit shape model algorithm can be found here.
In PCL, there are a couple of tutorials dedicated to the topic of shape recognition, and luckily one of them covers the implicit shape model, which has been implemented in PCL. I never tested this implementation, but from what I could read in the tutorial, you can specify your own point clouds for the training of the classifier.
That being said, you did not mention it explicitly in your question, but since your goal is to program a hand-controlled application, you might in fact be interested in a real-time shape detection algorithm. You would have to test the speed of the implicit shape model provided in PCL, but I think this approach is better suited to offline shape recognition.
If you do need real-time shape recognition, I think you should first use a hand/arm tracking algorithm (which is usually faster than full detection) in order to know where to look in the images, instead of trying to perform a full shape detection at each frame of your RGB-D stream. You could, for instance, track the hand location by segmenting the depth map (e.g. using an appropriate threshold on the depth) and then detecting its extremities.
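As a rough illustration of that tracking-by-segmentation idea (my own sketch, assuming a CV_16UC1 depth image in millimetres and that the hand/arm is the closest object; the 150 mm band is arbitrary):

#include <opencv2/opencv.hpp>
#include <vector>

bool trackHand(const cv::Mat& depth, cv::Point& handTip)
{
    // Keep a thin slice of the scene closest to the camera.
    double minVal, maxVal;
    cv::minMaxLoc(depth, &minVal, &maxVal, 0, 0, depth > 0);
    cv::Mat mask;
    cv::inRange(depth, cv::Scalar(minVal), cv::Scalar(minVal + 150), mask);

    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    if (contours.empty()) return false;

    // The largest blob is assumed to be the hand/arm; its topmost point is a
    // crude but fast extremity estimate that can seed a proper gesture classifier.
    std::size_t largest = 0;
    for (std::size_t i = 1; i < contours.size(); ++i)
        if (cv::contourArea(contours[i]) > cv::contourArea(contours[largest]))
            largest = i;
    handTip = contours[largest][0];
    for (std::size_t i = 1; i < contours[largest].size(); ++i)
        if (contours[largest][i].y < handTip.y) handTip = contours[largest][i];
    return true;
}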
Then, once you know approximately where the hand is, it should be easier to decide whether it is making a gesture relevant to your application. I am not sure what exactly you mean by fist/grab gestures, but I suggest that you define and use some app-controlling gestures that are easy and quick to distinguish from one another.
Hope this helps.
The quick answer is: yes, you can train your own gesture detector using depth data. It is really easy, but it depends on the type of gesture.
Suppose you want to detect a hand movement:
Detect the hand position (x,y,z). Using OpenNI this is straightforward, as you have a node for the hand.
Perform the gesture and collect ALL the positions of the hand during the gesture.
With the list of positions, train an HMM. For example, you can use Matlab, C, or Python.
You can then test the model on your own gestures and use it to detect them.
Here you can find a nice tutorial and code (in Matlab). The code (test.m) is pretty easy to follow. Here is a snippet:
%Load collected data (train_gesture / test_gesture are the recorded hand trajectories)
training = get_xyz_data('data/train',train_gesture);
testing = get_xyz_data('data/test',test_gesture);
%Cluster the 3D positions into N discrete symbols (D is the dimensionality, here 3)
[centroids N] = get_point_centroids(training,N,D);
ATrainBinned = get_point_clusters(training,centroids,D);
ATestBinned = get_point_clusters(testing,centroids,D);
% Set priors for an M-state left-right (LR) model:
pP = prior_transition_matrix(M,LR);
% Train the model (cyc is the maximum number of training iterations):
cyc = 50;
[E,P,Pi,LL] = dhmm_numeric(ATrainBinned,pP,[1:N]',M,cyc,.00001);
Dealing with fingers is pretty much the same, but instead of detecting the hand you need to detect the fingers. As the Kinect doesn't provide finger joints, you need specific code to detect them (using segmentation or contour tracking). Some examples using OpenCV can be found here and here, but the most promising one is the ROS library that has a finger node (see the example here).
If you only need detection of a fist/grab state, you should give Microsoft a chance. Microsoft.Kinect.Toolkit.Interaction contains methods and events that detect the grip / grip-release state of a hand. Take a look at the HandEventType of InteractionHandPointer. That works quite well for fist/grab detection, but it does not detect or report the position of individual fingers.
The next Kinect (Kinect One) detects 3 joints per hand (Wrist, Hand, Thumb) and has 3 hand-based gestures: open, closed (grip/fist) and lasso (pointer). If that is enough for you, you should consider the Microsoft libraries.
1) If there are a lot of false detections, you could try to extend the negative sample set of the classifier and train it again. The extended negative image set should contain the images where the fist was falsely detected. Maybe this will produce a better classifier.
I've had quite a bit of success with the middleware library provided by http://www.threegear.com/. It provides several gestures (including grabbing, pinching and pointing) and 6-DOF hand tracking.
You might be interested in this paper & open-source code:
Robust Articulated-ICP for Real-Time Hand Tracking
Code: https://github.com/OpenGP/htrack
Screenshot: http://lgg.epfl.ch/img/codedata/htrack_icp.png
YouTube Video: https://youtu.be/rm3YnClSmIQ
Paper PDF: http://infoscience.epfl.ch/record/206951/files/htrack.pdf
I have been trying to detect multiple people in a small space and then track them.
Input: CCTV feed from a camera mounted in a small room.
Expected Output: Track and hence store the path that people take while moving from one end of the room to the other.
I tried to implement some basic methods like background subtraction and pedestrian detection, but the results are not as desired.
In the results obtained with background subtraction, occlusion means the blob is not one single entity (the blob of one person is broken into multiple small blobs), so detecting it as a single person is very difficult.
Now consider the case where many people stand close to each other. In this case, detecting individual people using simple background subtraction is a complete disaster.
Is there a better way to detect multiple people?
Or is there maybe a way to improve the result of background subtraction?
And could you please suggest a good way to track multiple people?
That's quite a hard problem and there is no out-of-the-box solution, so you might have to try different methods.
In the beginning you will want to make some assumptions, like a static camera position and that everything that's not background is a person or part of a person (possibly multiple persons). Persons can't simply appear within the image; they have to 'enter' it (and are detected on entering and tracked after detection).
Detection and tracking can both be difficult problems, so you might want to focus on one of them first. I would start with tracking and choose a probabilistic tracking method, since simple methods like tracking-by-detection probably can't handle overlap and multiple targets very well.
Tracking:
I would try a particle filter, like http://www.irisa.fr/vista/Papers/2002/perez_hue_eccv02.pdf, which is capable of tracking multiple targets.
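For orientation, here is a very condensed, generic particle-filter skeleton (my own sketch, not code from the paper; the likelihood function is a placeholder you would replace with, e.g., the colour-histogram score of the image patch at (x, y) used in the paper):

#include <algorithm>
#include <functional>
#include <random>
#include <vector>

struct Particle { double x, y, vx, vy, weight; };

// One predict / update / resample iteration over the particle set.
void particleFilterStep(std::vector<Particle>& particles,
                        const std::function<double(double, double)>& likelihood,
                        std::mt19937& rng)
{
    std::normal_distribution<double> noise(0.0, 2.0); // motion noise in pixels, tune it

    // 1. Predict: constant-velocity motion model plus random diffusion.
    for (std::size_t i = 0; i < particles.size(); ++i) {
        particles[i].x += particles[i].vx + noise(rng);
        particles[i].y += particles[i].vy + noise(rng);
    }

    // 2. Update: weight each particle by the image evidence, then normalise.
    double sum = 0.0;
    for (std::size_t i = 0; i < particles.size(); ++i) {
        particles[i].weight = likelihood(particles[i].x, particles[i].y);
        sum += particles[i].weight;
    }
    for (std::size_t i = 0; i < particles.size(); ++i)
        particles[i].weight /= (sum > 0.0 ? sum : 1.0);

    // 3. Resample: draw a new set proportionally to the weights so that
    //    high-likelihood hypotheses survive and poor ones die out.
    std::vector<double> cdf;
    double acc = 0.0;
    for (std::size_t i = 0; i < particles.size(); ++i) { acc += particles[i].weight; cdf.push_back(acc); }
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::vector<Particle> resampled;
    for (std::size_t i = 0; i < particles.size(); ++i) {
        std::size_t j = std::lower_bound(cdf.begin(), cdf.end(), uni(rng)) - cdf.begin();
        if (j >= particles.size()) j = particles.size() - 1;
        resampled.push_back(particles[j]);
    }
    particles.swap(resampled);
    // The tracked position is typically the (weighted) mean of x and y.
}

For multiple people you would either run one such filter per detected person or use a joint formulation as in the paper.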
Detection: There is a HoG person detector in OpenCV which works quite well for upright persons:
HOGDescriptor hog;
hog.setSVMDetector(HOGDescriptor::getDefaultPeopleDetector());
but it's good to know the approximate size of a person in the image and scale the image accordingly. You can do this after background subtraction by scaling the blobs or combinations of blobs, or you can use a calibration of your camera and scale image parts of 1.6 m to 2.0 m height to your HoG detector size. Otherwise you might get many misses and many false alarms.
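Typical usage looks something like this (the winStride, padding and scale values below are just common starting points, not tuned settings):

#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::Rect> detectPeople(const cv::Mat& frame)
{
    static cv::HOGDescriptor hog;
    static bool initialised = false;
    if (!initialised) {
        hog.setSVMDetector(cv::HOGDescriptor::getDefaultPeopleDetector());
        initialised = true;
    }
    // The default detector uses a 64x128 window, i.e. it expects people roughly
    // 128 px tall, so resize the frame so your persons fall in that range.
    std::vector<cv::Rect> found;
    hog.detectMultiScale(frame, found, 0 /*hitThreshold*/, cv::Size(8, 8),
                         cv::Size(32, 32), 1.05 /*scale*/, 2 /*finalThreshold*/);
    return found;
}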
In the end you will have to work and research for some time to get things running, but don't expect early success or 100% hit rates ;)
I would create a sample video and work on that: manually mask entering people as detections and implement the tracker with those detections.
I am developing a feature tracking application and so far, after trying almost all the feature detectors/descriptors, I've got the most satisfactory overall results with ORB.
Both my feature descriptor and detector are ORB.
I am selecting a specific area for detecting features on my source image (by masking), and then matching it against the features detected on subsequent frames.
Then I filter my matches by performing the ratio test on the 'matches1' obtained from the following code:
std::vector<std::vector<DMatch>> matches1;
m_matcher.knnMatch( m_descriptorsSrcScene, m_descriptorsCurScene, matches1,2 );
I also tried the two-way ratio test (filtering matches from the source to the current scene and vice versa, then keeping the common matches), but it didn't help much, so I went ahead with the one-way ratio test.
I also add a minimum-distance check to my ratio test, which, it appears, gives better results:
if (distanceRatio < m_fThreshRatio && bestMatch.distance < 5*min_dist)
{
refinedMatches.push_back(bestMatch);
}
And in the end, I estimate the homography:
Mat H = findHomography(points1,points2);
I've tried using the RANSAC method for estimating inliers and then using those to recalculate my homography, but that gives more instability and also consumes more time.
Then, at the end, I draw a rectangle around the specific region that is to be tracked. I get the plane coordinates with:
perspectiveTransform( obj_corners, scene_corners, H);
where 'obj_corners' are the coordinates of my masked (or unmasked) region.
The rectangle I draw using 'scene_corners' seems to be vibrating. Increasing the number of features has reduced it quite a bit, but I can't increase them too much because of the time constraint.
How can i improve the stability?
Any suggestions would be appreciated.
Thanks.
If it is the vibration that really bothers you, then you could try taking a moving average of the homography matrices over time:
cv::Mat homoG = cv::findHomography(obj, scene, CV_RANSAC);
// 'homography' is the global running average; initialise it from the first estimate
if (homography.empty()) {
homoG.copyTo(homography);
}
// Blend the new estimate into the running average (alpha = 0.1)
cv::accumulateWeighted(homoG, homography, 0.1);
Make the 'homography' variable global, and keep calling this every time you get a new frame.
The alpha parameter of accumulateWeighted is roughly the reciprocal of the period of the moving average: it computes an exponential moving average, so 0.1 roughly averages over the last 10 frames, 0.2 over the last 5, and so on. The larger the alpha, the faster the average reacts but the less smoothing you get.
A suggestion that comes to mind from experience with feature detection/matching is that sometimes you just have to accept that the matched feature points will not be perfect. Even subtle changes in the scene you are looking at can cause rather annoying problems, for example changes in light or unwanted objects coming into view.
It appears to me from what you say that you have decently working feature matching in place; you may want to work on a way of keeping the region of interest stable. If you know the typical speed or any other movement pattern unique to the object you are trying to track between frames, or any constraints on the position of your camera, it may help you avoid recalculating the region of interest unnecessarily, which causes the vibration. It may in fact also help you build a more efficient search, allowing you to increase the number of feature points you can detect and use.
Another (small) hack you can use is to avoid redrawing the region window if the previous window was of similar size and position.
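For example, something along these lines (a hypothetical helper; the 2-pixel threshold is arbitrary):

#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

// Only redraw if the corners moved more than a few pixels on average.
bool shouldRedraw(const std::vector<cv::Point2f>& prevCorners,
                  const std::vector<cv::Point2f>& newCorners,
                  double minAvgShift = 2.0)
{
    if (prevCorners.size() != newCorners.size() || prevCorners.empty())
        return true;
    double shift = 0.0;
    for (std::size_t i = 0; i < prevCorners.size(); ++i) {
        cv::Point2f d = newCorners[i] - prevCorners[i];
        shift += std::sqrt(d.x * d.x + d.y * d.y);
    }
    return (shift / prevCorners.size()) > minAvgShift;
}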
I am trying to implement a people counting system using computer vision for a uni project. Currently, my method is:
Background subtraction using MOG2
Morphological filter to remove noise
Track blobs
Count blobs passing a specified region (a line); a rough sketch of this pipeline follows
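For reference, here is a compressed sketch of this pipeline (assuming OpenCV 3's createBackgroundSubtractorMOG2; the thresholds, kernel size and counting line are illustrative, not my exact values):

#include <opencv2/opencv.hpp>
#include <cstdlib>
#include <iostream>
#include <vector>

void countPeople(cv::VideoCapture& cap)
{
    cv::Ptr<cv::BackgroundSubtractorMOG2> mog2 = cv::createBackgroundSubtractorMOG2();
    cv::Mat frame, fgMask;
    const int countingLineY = 240; // the "specified region" (a horizontal line)
    int count = 0;

    while (cap.read(frame)) {
        // 1. Background subtraction (then drop the grey shadow pixels)
        mog2->apply(frame, fgMask);
        cv::threshold(fgMask, fgMask, 200, 255, cv::THRESH_BINARY);
        // 2. Morphological filter to remove noise
        cv::morphologyEx(fgMask, fgMask, cv::MORPH_OPEN,
                         cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(5, 5)));
        // 3. Blobs = external contours above a minimum area
        std::vector<std::vector<cv::Point>> contours;
        cv::findContours(fgMask.clone(), contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
        for (std::size_t i = 0; i < contours.size(); ++i) {
            if (cv::contourArea(contours[i]) < 500) continue; // minimum blob area
            cv::Rect box = cv::boundingRect(contours[i]);
            // 4. Naive counting: count a blob when its centre sits on the line
            //    (association across frames is needed to avoid double counting).
            if (std::abs(box.y + box.height / 2 - countingLineY) < 2)
                ++count;
        }
    }
    std::cout << "count: " << count << std::endl;
}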
The problem is that if people come as a group, my method only counts one person. From my reading, I believe this is what is called occlusion. Another problem is when people look similar to the background (wearing dark clothing and passing a black pillar/wall): the blob gets split up while it is actually one person.
From what I read, I should implement a detector + tracker (e.g. detect humans using HOG). But my detection results are poor (e.g. 50% false positives with a 50% hit rate, using both the OpenCV human detector and my own trained detector), so I am not convinced to use the detector as the basis for tracking. Thanks for your answers and for taking the time to read this post!
Tracking people in video surveillance sequences is still an open problem in the research community. However, particle filters (PF), aka sequential Monte Carlo, give good results with respect to occlusion and complex scenes. You should read this. There are also extra links to example source code after the bibliography.
An advantage of using a PF is the gain in computational time compared to tracking by detection alone.
If you go this way, feel free to ask if you want a better understanding of the maths behind the PF.
There is no single "good" answer to this, as handling occlusion (and background subtraction) are still open problems! There are several pointers that might help you along with your project.
You want to detect whether a "blob" is one person or a group of people. There are several things you could do to handle this:
Use multiple cameras (it's unlikely that a group of people is detected as a single blob from all angles)
Try to detect parts of the human body. If you detect two heads on a single blob, there are multiple people. The same can be said for 3 legs, 5 shoulders, etc.
For tracking a "lost" person (one walking behind another object), you can extrapolate their position. You know that a person can only move so much between frames. Taking this into account, you know it is impossible for a person to be detected in the middle of your image and then suddenly disappear. After several frames of not seeing that person, you can discard the observation, as the person may have had enough time to move away.
As part of my thesis work, I need to build a program for human tracking from video or image sequences like the KTH or IXMAS datasets, with the following assumptions:
Illumination remains unchanged
Only one person appears in the scene
The program needs to perform in real time
I have searched a lot but still cannot find a good solution.
Please suggest a good method or an existing program that is suitable.
Case 1 - If camera static
If the camera is static, it is really simple to track one person.
You can apply a method called background subtraction.
Here, for better results, you need a bare image from the camera with no persons in it. That is the background. (It can also be done even if you don't have this background image, but it is better if you have one. I will explain at the end what to do if you don't.)
Now start capturing from the camera. Take the first frame, convert both images to grayscale, and smooth both to reduce noise.
Subtract the background image from the frame.
If the frame has no change with respect to the background image (i.e. no person), you get a black image (of course there will be some noise, which we can remove). If there is a change, i.e. a person walked into the frame, you will get an image with the person, and the background in black.
Now threshold the image at a suitable value.
Apply some erosion to remove small granular noise, then apply dilation.
Now find contours. Most probably there will be one contour: the person.
Find the centroid, or whatever else you want, of this person to track.
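A minimal sketch of the above steps (static camera, background image available; the threshold value and the erosion/dilation iterations are placeholders to tune for your footage):

#include <opencv2/opencv.hpp>
#include <vector>

bool trackPerson(const cv::Mat& backgroundGray, const cv::Mat& frame, cv::Point2f& centroid)
{
    // Grayscale + smoothing, then subtract the (already smoothed, grayscale) background.
    cv::Mat gray, diff, mask;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
    cv::GaussianBlur(gray, gray, cv::Size(5, 5), 0);
    cv::absdiff(gray, backgroundGray, diff);
    cv::threshold(diff, mask, 30, 255, cv::THRESH_BINARY);

    // Erode to kill granular noise, then dilate to restore the silhouette.
    cv::erode(mask, mask, cv::Mat(), cv::Point(-1, -1), 2);
    cv::dilate(mask, mask, cv::Mat(), cv::Point(-1, -1), 2);

    // The largest contour is assumed to be the person; track its centroid.
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    if (contours.empty()) return false;
    std::size_t largest = 0;
    for (std::size_t i = 1; i < contours.size(); ++i)
        if (cv::contourArea(contours[i]) > cv::contourArea(contours[largest]))
            largest = i;
    cv::Moments m = cv::moments(contours[largest]);
    centroid = cv::Point2f(float(m.m10 / m.m00), float(m.m01 / m.m00));
    return true;
}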
Now suppose you don't have a background image. You can build one with the cvRunningAvg function, which computes a running average of the frames of the video you are tracking on. But as you can obviously understand, the first method is better if you can get a background image.
Here is an implementation of the above method using cvRunningAvg.
Case 2 - Camera not static
Here background subtraction won't give good results, since you can't get a fixed background.
OpenCV comes with a sample for people detection. Use it.
This is the file: peopledetect.cpp
I also recommend that you visit this SO question, which deals with almost the same problem: How can I detect and track people using OpenCV?
One possible solution is to use feature points tracking algorithm.
Look at this book:
Laganiere Robert - OpenCV 2 Computer Vision Application Programming Cookbook - 2011
p. 266
The full algorithm is already implemented in this book, using OpenCV.
The above method (simple frame differencing followed by dilation and erosion) would work in the case of a simple, clean scene with just the motion of the person walking and absolutely no other motion or illumination changes. Also, you are doing a detection every frame, as opposed to tracking. In this specific scenario, tracking might not be much more difficult either. For movement direction and speed, you can just run Lucas-Kanade on the difference images.
At the core of it, what you need is a person detector followed by a tracker. The tracker can be point-based (Lucas-Kanade or Horn-Schunck), or a Kalman filter or any similar kind of tracking for bounding boxes or blobs.
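As an example of the point-based option, a short pyramidal Lucas-Kanade sketch (the bounding box is assumed to come from your person detector; the parameter values are just starting points):

#include <opencv2/opencv.hpp>
#include <vector>

void trackPoints(const cv::Mat& prevGray, const cv::Mat& nextGray,
                 const cv::Rect& personBox, std::vector<cv::Point2f>& tracked)
{
    // Pick good corners inside the person's bounding box only.
    cv::Mat mask = cv::Mat::zeros(prevGray.size(), CV_8UC1);
    mask(personBox).setTo(255);
    std::vector<cv::Point2f> prevPts;
    cv::goodFeaturesToTrack(prevGray, prevPts, 100, 0.01, 5, mask);
    if (prevPts.empty()) return;

    // Pyramidal Lucas-Kanade optical flow from the previous frame to the next.
    std::vector<uchar> status;
    std::vector<float> err;
    cv::calcOpticalFlowPyrLK(prevGray, nextGray, prevPts, tracked, status, err);

    // Keep only successfully tracked points; their mean displacement gives the
    // person's movement direction and speed between the two frames.
    std::vector<cv::Point2f> kept;
    for (std::size_t i = 0; i < tracked.size(); ++i)
        if (status[i]) kept.push_back(tracked[i]);
    tracked.swap(kept);
}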
A lot of vision problems are ill-posed, so some amount of structure/constraints helps to solve them considerably faster. A few questions to ask would be:
Is the camera moving? No: quite easy. Yes: much harder; exactly what works depends on other conditions.
Is the scene constant except for the person?
Is the person front-facing / side-facing most of the time? Detect using Viola-Jones, or train a detector (AdaBoost with Haar or similar features) for side-facing faces.
How spatially accurate do you need it to be: will a bounding box do, or do you need a contour? Bounding box: just search (intelligently :)) in the neighbourhood with SAD (sum of absolute differences). Contour: tougher; use contour-based trackers.
Do you need the "tracklet" or only the position of the person at each frame? What temporal accuracy?
What resolution are we talking about, given that you need real time?
Is the scene sparse like those sequences, or would it be cluttered?
Is there any other motion in the sequence?
Offline or online?
If you develop in .NET you can use the AForge.NET framework.
http://www.aforgenet.com/
I was a regular visitor of its forums and I seem to remember there are plenty of people using it for tracking people.
I've also used the framework for other, unrelated purposes and can say I highly recommend it for its ease of use and powerful features.