I am trying to detect and track a hand in real time using OpenCV. I thought Haar cascade classifiers would yield a fair result. After training with 10k positive and 20k negative images, I obtained a classifier XML file. Unfortunately, it detects the hand only in certain positions, which suggests the approach works best only for rigid objects. So I am now thinking of adopting another algorithm that can track the hand once it has been detected by the Haar classifier.
My question is: if I make sure the Haar classifier detects the hand in a certain frame and position, what method would yield robust tracking of the hand from there?
I searched the web a bit and understand I can go for optical flow of the detected hand, a Kalman filter, or a particle filter, but each comes with its own disadvantages.
Also, if I incorporate stereo vision, would it help, since I could possibly reconstruct the hand in 3D?
You concluded rightly about Haar features - they aren't that useful when it comes to non-rigid objects.
Take a look at the following papers which use skin colour to detect hands.
Interaction between hands and wearable cameras
Markerless inspection of augmented reality objects
and this paper that uses KLT features to track the hand after the first detection:
Fast 2D hand tracking with flocks of features and multi-cue integration
I would say that a stereo camera will not help your cause much, as 3D reconstruction of non-rigid objects isn't straightforward and would require a whole lot of innovation and development. However, you can take a look at the papers in the hand pose estimation section of this page if you wish to pursue 3D tracking.
EDIT: Also take a look at this recent paper, which seems to get good results.
Zhang et al.'s Real-time Compressive Tracking does a reasonable job of tracking an object, once it has been detected by some other method, provided that the motion is not too fast. They have an OpenCV implementation (but it would need a bit of work to reuse).
This research paper describes a method to track hands, without using gloves by using a stereo camera setup.
There have been similar questions on Stack Overflow...
Have a look at my answer and those of others: https://stackoverflow.com/a/17375647/1463143
You can certainly get better results by avoiding Haar training and detection for deformable entities.
The CamShift algorithm is generally fast and accurate if you want to track the hand as a single entity. The OpenCV documentation contains a good, easy-to-understand demo program that you can easily modify.
If you need to track fingers etc., however, further modeling will be needed.
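For the single-entity case, here is a minimal sketch of that demo pattern in Python with OpenCV, assuming the initial search window comes from your Haar detection (the coordinates below are placeholders):

import cv2
import numpy as np

# Hypothetical initial window from your detector: (x, y, w, h)
track_window = (200, 150, 80, 80)

cap = cv2.VideoCapture(0)
ok, frame = cap.read()

# Build a hue histogram of the detected hand region (a crude skin-tone model)
x, y, w, h = track_window
roi = frame[y:y + h, x:x + w]
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv_roi, (0, 60, 32), (180, 255, 255))
roi_hist = cv2.calcHist([hsv_roi], [0], mask, [180], [0, 180])
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)

term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
    # CamShift adapts the window size and orientation as the hand moves
    rot_rect, track_window = cv2.CamShift(back_proj, track_window, term_crit)
    pts = cv2.boxPoints(rot_rect).astype(np.int32)
    cv2.polylines(frame, [pts], True, (0, 255, 0), 2)
    cv2.imshow('tracking', frame)
    if cv2.waitKey(30) & 0xFF == 27:
        break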
Related
Let me explain my need before I explain the problem.
I am working toward a hand-controlled application: navigation using the palm and clicks using a grab/fist.
Currently, I am working with OpenNI, which sounds promising and has a few examples that turned out to be useful in my case, as it has an in-built hand tracker in its samples. This serves my purpose for the time being.
What I want to ask is,
1) What would be the best approach to building a fist/grab detector?
I trained and used AdaBoost fist classifiers on extracted RGB data, which was pretty good, but it produces too many false detections to move forward with.
So, here I frame two more questions
2) Is there any other good library which is capable of achieving my needs using depth data ?
3) Can we train our own hand gestures, especially ones using fingers, as some papers refer to HMMs? If yes, how do we proceed with a library like OpenNI?
Yeah, I tried the middleware libraries in OpenNI, like the grab detector, but they won't serve my purpose, as the toolkit is neither open source nor matches my needs.
Apart from what I asked, anything you think could help me will be welcome as a suggestion.
You don't need to train a classifier for the fist, since that will only complicate things.
Don't use colour either, since it's unreliable (it mixes with the background and changes unpredictably depending on lighting and viewpoint).
Assuming that your hand is the closest object, you can simply segment it out by a depth threshold. You can set the threshold manually, use the closest region of the depth histogram, or perform connected-component analysis on the depth map to break it into meaningful parts first (and then select your object based not only on its depth but also on its dimensions, motion, user input, etc.). (An example output of the connected-components method was shown as an image in the original answer.)
Apply convexity defects from the OpenCV library to find fingers (see the sketch after this list);
Track fingers rather than rediscovering them in 3D; this will increase stability. I successfully implemented such finger detection about 3 years ago.
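A rough Python/OpenCV sketch of the depth-thresholding and convexity-defect steps above, assuming a 16-bit depth map in millimetres (e.g. from OpenNI/Kinect); the thresholds are placeholders you would tune for your setup:

import cv2
import numpy as np

def count_fingers(depth_mm):
    """depth_mm: 16-bit depth map in millimetres; returns a crude finger count."""
    # 1) Assume the hand is the closest object: keep everything within
    #    ~10 cm of the nearest valid depth value.
    valid = depth_mm[depth_mm > 0]
    if valid.size == 0:
        return 0
    near = valid.min()
    hand_mask = np.logical_and(depth_mm > 0, depth_mm < near + 100).astype(np.uint8) * 255

    # 2) Keep the largest connected blob as the hand (OpenCV 4.x return signature).
    contours, _ = cv2.findContours(hand_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return 0
    hand = max(contours, key=cv2.contourArea)

    # 3) Convexity defects: deep valleys between fingertips.
    hull = cv2.convexHull(hand, returnPoints=False)
    defects = cv2.convexityDefects(hand, hull)
    if defects is None:
        return 0
    # Defect depth is fixed-point (pixel depth * 256); 2000 ~= 8 px, tune as needed.
    deep = sum(1 for i in range(defects.shape[0]) if defects[i, 0, 3] > 2000)
    return deep + 1 if deep > 0 else 0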
Read my paper :) http://robau.files.wordpress.com/2010/06/final_report_00012.pdf
I have done research on gesture recognition for hands, and evaluated several approaches that are robust to scale, rotation etc. You have depth information which is very valuable, as the hardest problem for me was to actually segment the hand out of the image.
My most successful approach is to trace the contour of the hand and, for each point on the contour, take the distance to the centroid of the hand. This gives a set of points that can be used as input for many training algorithms.
I use the image moments of the segmented hand to determine its rotation, so there is a consistent starting point on the hand's contour. It then becomes very easy to distinguish a fist from a stretched-out hand and to count the number of extended fingers.
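A sketch of that contour-to-centroid signature in Python with OpenCV; the rotation normalisation via image moments follows the description above, while the sample length and the major-axis starting-point convention are my own assumptions:

import cv2
import numpy as np

def contour_signature(hand_mask, n_samples=64):
    """Distance-to-centroid profile of the hand contour, rotation-normalised via image moments."""
    contours, _ = cv2.findContours(hand_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea).reshape(-1, 2).astype(np.float64)

    # Centroid and orientation from image moments.
    m = cv2.moments(hand_mask, binaryImage=True)
    cx, cy = m['m10'] / m['m00'], m['m01'] / m['m00']
    angle = 0.5 * np.arctan2(2 * m['mu11'], m['mu20'] - m['mu02'])

    # Distance of each contour point to the centroid.
    d = np.hypot(contour[:, 0] - cx, contour[:, 1] - cy)

    # Start the profile at the contour point along the major axis, so the
    # signature has a consistent starting point regardless of rotation.
    axis = np.array([np.cos(angle), np.sin(angle)])
    proj = (contour - [cx, cy]) @ axis
    d = np.roll(d, -int(np.argmax(proj)))

    # Resample to a fixed length and normalise scale; peaks correspond to fingertips.
    idx = np.linspace(0, len(d) - 1, n_samples).astype(int)
    return d[idx] / (d.max() + 1e-9)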
Note that while it works fine, your arm tends to get tired from pointing into the air.
It seems that you are unaware of the Point Cloud Library (PCL). It is an open-source library dedicated to the processing of point clouds and RGB-D data, which is based on OpenNI for the low-level operations and which provides a lot of high-level algorithms, for instance to perform registration, segmentation and also recognition.
A very interesting algorithm for shape/object recognition in general is the implicit shape model. In order to detect a global object (such as a car or an open hand), the idea is first to detect possible parts of it (e.g. wheels and trunk, or fingers, palm and wrist) using a local feature detector, and then to infer the position of the global object by considering the density and the relative positions of its parts. For instance, if I can detect five fingers, a palm and a wrist in a given neighbourhood, there's a good chance that I am in fact looking at a hand; however, if I only detect one finger and a wrist somewhere, it could be a pair of false detections. The academic research article on this implicit shape model algorithm can be found here.
In PCL, there are a couple of tutorials dedicated to the topic of shape recognition, and luckily one of them covers the implicit shape model, which has been implemented in PCL. I never tested this implementation, but from what I could read in the tutorial, you can specify your own point clouds for the training of the classifier.
That being said, you did not mention it explicitly in your question, but since your goal is to program a hand-controlled application, you might in fact be interested in a real-time shape detection algorithm. You would have to test the speed of the implicit shape model provided in PCL, but I think this approach is better suited to offline shape recognition.
If you do need real-time shape recognition, I think you should first use a hand/arm tracking algorithm (which is usually faster than full detection) in order to know where to look in the images, instead of trying to perform a full shape detection at each frame of your RGB-D stream. You could, for instance, track the hand location by segmenting the depth map (e.g. using an appropriate threshold on the depth) and then detecting the extremities.
Then, once you approximately know where the hand is, it should be easier to decide whether the hand is making one gesture relevant to your application. I am not sure what you exactly mean by fist/grab gestures, but I suggest that you define and use some app-controlling gestures which are easy and quick to distinguish from one another.
Hope this helps.
The fast answer is: Yes, you can train your own gesture detector using depth data. It is really easy, but it depends on the type of the gesture.
Suppose you want to detect a hand movement:
Detect the hand position (x, y, z). Using OpenNI this is straightforward, as you have a node for the hand.
Execute the gesture and collect ALL the positions of the hand during the gesture.
With the list of positions, train an HMM. You can use Matlab, C, or Python, for example (see the Python sketch after this list).
For your own gestures, you can test the model and detect the gestures.
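If you prefer Python, here is a hedged sketch of the same idea using the third-party hmmlearn package; the recordings, the number of hidden states and the Gaussian emission model are all placeholder assumptions, not part of the tutorial below:

import numpy as np
from hmmlearn import hmm  # third-party package (pip install hmmlearn)

# Each recording is a (T_i, 3) array of hand positions (x, y, z) captured
# during one execution of the gesture, e.g. from the OpenNI hand node.
recordings = [np.random.randn(40, 3) for _ in range(20)]  # placeholder data

X = np.concatenate(recordings)          # stacked observations
lengths = [len(r) for r in recordings]  # sequence boundaries

# One HMM per gesture class; 5 hidden states is an arbitrary starting point.
model = hmm.GaussianHMM(n_components=5, covariance_type='diag', n_iter=100)
model.fit(X, lengths)

# At run time, score a newly captured trajectory against each gesture model
# and pick the gesture whose model gives the highest log-likelihood.
new_traj = np.random.randn(35, 3)
log_likelihood = model.score(new_traj)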
Here you can find a nice tutorial and code (in Matlab). The code (test.m) is pretty easy to follow. Here is a snippet:
%Load collected data
training = get_xyz_data('data/train',train_gesture);
testing = get_xyz_data('data/test',test_gesture);
%Get clusters
[centroids N] = get_point_centroids(training,N,D);
ATrainBinned = get_point_clusters(training,centroids,D);
ATestBinned = get_point_clusters(testing,centroids,D);
% Set priors:
pP = prior_transition_matrix(M,LR);
% Train the model:
cyc = 50;
[E,P,Pi,LL] = dhmm_numeric(ATrainBinned,pP,[1:N]',M,cyc,.00001);
Dealing with fingers is pretty much the same, but instead of detecting the hand you need to detect the fingers. As the Kinect doesn't provide finger points, you need specific code to detect them (using segmentation or contour tracking). Some examples using OpenCV can be found here and here, but the most promising one is the ROS library that has a finger node (see the example here).
If you only need detection of a fist/grab state, you should give Microsoft a chance. Microsoft.Kinect.Toolkit.Interaction contains methods and events that detect the grip / grip-release state of a hand. Take a look at the HandEventType of InteractionHandPointer. That works quite well for fist/grab detection, but it does not detect or report the position of individual fingers.
The next Kinect (Kinect One) detects 3 joints per hand (wrist, hand, thumb) and has 3 hand-based gestures: open, closed (grip/fist) and lasso (pointer). If that is enough for you, you should consider the Microsoft libraries.
1) If there are a lot of false detections, you could try to extend the negative sample set of the classifier and train it again. The extended negative image set should contain the kind of images in which the fist was falsely detected. Maybe this will help create a better classifier.
I've had quite a bit of success with the middleware library provided by http://www.threegear.com/. It provides several gestures (including grabbing, pinching and pointing) and 6-DOF hand tracking.
You might be interested in this paper & open-source code:
Robust Articulated-ICP for Real-Time Hand Tracking
Code: https://github.com/OpenGP/htrack
Screenshot: http://lgg.epfl.ch/img/codedata/htrack_icp.png
YouTube Video: https://youtu.be/rm3YnClSmIQ
Paper PDF: http://infoscience.epfl.ch/record/206951/files/htrack.pdf
Information:
I would like to use OpenCV's HOG detection to identify objects that can be seen in a variety of orientations. The only problem is, I can't seem to find a reasonable feature descriptor or classifier to detect objects in a rotation- and scale-invariant way (as is needed for objects such as forearms).
Prior Work:
Let's focus on forearms for this discussion. A forearm can have multiple orientations, its primary distinct features probably being its contour edges. It is possible to have images of forearms pointing in any direction, hence the complexity. So far I have done some in-depth research on using HOG descriptors to solve this problem, but I am finding that the variety of poses produced by forearms in my positive training set leads to very low detection scores in actual images. I suspect the issue is that the gradients produced by each positive image do not give very consistent results when accumulated into the histogram. I have reviewed many research papers on the topic trying to resolve or improve this, including the original from Dalal & Triggs: http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf. It also seems that the assumptions made for detecting whole humans do not necessarily apply to detecting individual parts (in particular, the assumption that all humans are standing upright suggests HOG is not a good route for rotation-invariant detection like that of forearms).
Note:
If possible, I would like to steer clear of any non-free solutions such as those pertaining to SIFT, SURF, or Haar.
Question:
What is a good solution to detecting rotation and scale invariant objects in an image? Particularly for this example, what would be a good solution to detecting all orientations of forearms in an image?
I use HOG to detect human heads and shoulders. To train on a particular part, you have to give its location. If you use OpenCV, you can clip samples containing only the part you want to train on, making sure all training samples share the same size. For example, I clip images to contain only the head and shoulders and resize them all to 64x64. Other open-source codebases may require you to pass the location as an input parameter, which amounts to the same thing.
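As an illustration only (not my exact code), a Python/OpenCV sketch of computing HOG descriptors on fixed 64x64 crops and feeding them to a linear SVM; pos_paths and neg_paths are hypothetical lists of your clipped sample images:

import cv2
import numpy as np

# HOG configured for 64x64 crops; every training sample must share this window size.
hog = cv2.HOGDescriptor((64, 64), (16, 16), (8, 8), (8, 8), 9)

def describe(path):
    """Resize a sample to the HOG window and return its descriptor vector."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, (64, 64))
    return hog.compute(img).flatten()

def train_hog_svm(pos_paths, neg_paths):
    """Train a linear SVM on HOG descriptors of fixed-size crops."""
    X = np.array([describe(p) for p in pos_paths + neg_paths], dtype=np.float32)
    y = np.array([1] * len(pos_paths) + [0] * len(neg_paths), dtype=np.int32)
    svm = cv2.ml.SVM_create()
    svm.setType(cv2.ml.SVM_C_SVC)
    svm.setKernel(cv2.ml.SVM_LINEAR)
    svm.train(X, cv2.ml.ROW_SAMPLE, y)
    return svm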
Have you tried the discriminatively trained deformable part model? http://www.cs.berkeley.edu/~rbg/latent/
You may find answers there.
I'm using optical flow as a real time obstacle detection and avoidance system for the visually impaired. I'm developing the application in c# and using Emgu Cv for image processing. I use the Lucas and Kanade method and I'm pretty satisfied with the speed of the algorithm. I am using monocular vision thus making it hard for me to compute the depth accurately to each of the features being tracked and to alert the user accordingly. I plan on using an ultrasonic sensor to help with the obstacle detection due to the fact that depth computation is hard with monocular camera. Any suggestions on how I could get an accurate estimation of depth using the camera alone?
You might want to check out this paper: A Robust Visual Odometry and Precipice Detection System Using Consumer-grade Monocular Vision. They use a nice trick for detecting both obstacles and holes in the field of view.
Hate to give such a generic answer, but you'd be best off starting with a standard text on structure-from-motion to get an overview of techniques. A good one is Richard Szeliski's recent book available online (Chapter 7), and its references. After that, for your application you may want to look at recent work in SLAM - Oxford's Active Vision group have published some great work and Andrew Davison's group too.
More a comment on RobAu's answer below: 'structure from motion' might give better search results than '3D from video'.
Depth from one camera will only work if the camera moves. You could look into some 3D-from-video approaches. It is a very hard problem, especially when the objects in the camera's field of view are moving as well.
I've built an imaging system with a webcam and feature matching such that, as I move the camera around, I can track the camera's motion. I am doing something similar to here, except with the webcam frames as the input.
It works really well for "good" images, but when taking images in really low light lots of noise appears (camera high gain), and that messes with the feature detection and matching. Basically, it doesn't detect any good features, and when it does, it cannot match them correctly between frames.
Does anyone know a good solution for this? What other methods are used for finding and matching features?
Here are two example images with very few features:
I think phase correlation is going to be your best bet here. It is designed to tell you the phase shift (i.e., translation) between two images. It is much more resilient (but not immune) to noise than feature detection because it operates in frequency space, whereas feature detectors operate spatially. Another benefit is that it is very fast compared with feature detection methods. I have an implementation available in the OpenCV trunk that is sub-pixel accurate, located here.
However, your images are pretty much "featureless" with the exception of the crease in the middle, so even phase correlation may have some trouble with it. Think of it like trying to detect translation in a snow storm. If all you can see is white, you can't tell that you have translated at all, thus the term whiteout. In your case, the algorithm might suffer from "greenout" :)
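A minimal sketch of phase correlation in Python with OpenCV (the file names are placeholders); the returned response value gives a rough confidence measure:

import cv2
import numpy as np

prev = cv2.imread('frame_prev.png', cv2.IMREAD_GRAYSCALE).astype(np.float32)
curr = cv2.imread('frame_curr.png', cv2.IMREAD_GRAYSCALE).astype(np.float32)

# A Hanning window reduces edge effects in the underlying FFT.
win = cv2.createHanningWindow(prev.shape[::-1], cv2.CV_32F)

# Returns the sub-pixel (dx, dy) shift and a peak response value
# (a low response suggests the estimate is unreliable).
(shift_x, shift_y), response = cv2.phaseCorrelate(prev, curr, win)
print(f"translation: ({shift_x:.2f}, {shift_y:.2f}), confidence: {response:.3f}")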
Can you adjust the camera settings to work better in low-light conditions? Have you fully opened the iris? Can you live with lower frame rates? Setting a longer exposure time will allow the camera to gather more light, thus giving you more features at the cost of added motion blur. Or, if low light is your default environment, you probably want something designed for it, like an IR camera, but those can be expensive. Other than that, a big lens and long exposures are your friend :)
Histogram equalization may be of interest in improving the image contrast. But, sometimes it can just enhance the noise. OpenCV has a global histogram equalization function called equalizeHist. For a more localized implementation, you'll want to look at Contrast Limited Adaptive Histogram Equalization or CLAHE for short. Here is a good article on it. This page has some nice examples, and some code.
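For illustration, both variants in Python with OpenCV (the file name and the CLAHE parameters are placeholder assumptions):

import cv2

img = cv2.imread('low_light_frame.png', cv2.IMREAD_GRAYSCALE)  # hypothetical file

# Global equalization: fast, but tends to amplify noise in flat regions.
global_eq = cv2.equalizeHist(img)

# CLAHE: localized equalization with a clip limit that caps noise amplification.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
local_eq = clahe.apply(img)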
Suppose I take a picture with a camera and know the distance from the camera to an object, such as a scale model of a house. I would like to turn this into a 3D model that I can maneuver around, so I can comment on different parts of the house.
If I sit down and think about taking more than one picture, labeling direction and distance, I should be able to figure out how to do this, but I thought I would ask whether someone knows of a paper that may help explain it further.
What language you explain in doesn't matter, as I am looking for the best approach.
Right now I am considering showing the house and then having the user provide some assistance with heights, such as the distance from the camera to the top of a given part of the model; given enough of this it should be possible to start calculating the remaining heights, especially if there is a top-down image plus pictures from angles on the four sides from which to calculate relative heights.
Then I expect that parts will also need to differ in color to help separate out the various parts of the model.
As mentioned, the problem is very hard and is often also referred to as multi-view object reconstruction. It is usually approached by solving the stereo-view reconstruction problem for each pair of consecutive images.
Performing stereo reconstruction requires that pairs of images are taken that have a good amount of visible overlap of physical points. You need to find corresponding points such that you can then use triangulation to find the 3D co-ordinates of the points.
Epipolar geometry
Stereo reconstruction is usually done by first calibrating your camera setup so you can rectify your images using the theory of epipolar geometry. This simplifies finding corresponding points as well as the final triangulation calculations.
If you have:
the intrinsic camera parameters (requiring camera calibration),
the camera's position and rotation (its extrinsic parameters), and
8 or more physical points with matching known positions in two photos (when using the eight-point algorithm)
you can calculate the fundamental and essential matrices using only matrix theory and use these to rectify your images. This requires some theory about co-ordinate projections with homogeneous co-ordinates and also knowledge of the pinhole camera model and camera matrix.
If you want a method that doesn't need the camera parameters and works for unknown camera set-ups you should probably look into methods for uncalibrated stereo reconstruction.
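As a rough illustration of this step in Python with OpenCV (not a complete pipeline), assuming you already have eight or more matched point pairs in pts1/pts2 from manual selection or a feature matcher:

import cv2
import numpy as np

def rectify_uncalibrated(img1, img2, pts1, pts2):
    """Estimate the fundamental matrix and rectify two views without intrinsics."""
    pts1 = np.asarray(pts1, dtype=np.float64)
    pts2 = np.asarray(pts2, dtype=np.float64)

    # Fundamental matrix via the 8-point algorithm wrapped in RANSAC.
    F, inliers = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)

    # Rectifying homographies without knowing the camera intrinsics.
    h, w = img1.shape[:2]
    ok, H1, H2 = cv2.stereoRectifyUncalibrated(pts1, pts2, F, (w, h))

    rect1 = cv2.warpPerspective(img1, H1, (w, h))
    rect2 = cv2.warpPerspective(img2, H2, (w, h))
    return F, rect1, rect2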
Correspondence problem
Finding corresponding points is the tricky part that requires you to look for points of the same brightness or colour, or to use texture patterns or some other features to identify the same points in pairs of images. Techniques for this either work locally by looking for a best match in a small region around each point, or globally by considering the image as a whole.
If you already have the fundamental matrix, it will allow you to rectify the images such that corresponding points in two images will be constrained to a line (in theory). This helps you to use faster local techniques.
There is currently still no ideal technique to solve the correspondence problem, but possible approaches could fall in these categories:
Manual selection: have a person hand-select matching points.
Custom markers: place markers or use specific patterns/colours that you can easily identify.
Sum of squared differences: take a region around a point and find the closest whole matching region in the other image (a related block-matching approach is sketched after this list).
Graph cuts: a global optimisation technique based on optimisation using graph theory.
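For a concrete starting point, here is a minimal Python/OpenCV sketch using semi-global block matching on already-rectified images, a common practical compromise between per-pixel SSD and full global optimisation; the file names and matcher parameters are placeholder assumptions:

import cv2

# left/right must be rectified grayscale images of the same size (hypothetical files).
left = cv2.imread('rect_left.png', cv2.IMREAD_GRAYSCALE)
right = cv2.imread('rect_right.png', cv2.IMREAD_GRAYSCALE)

sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=7)
# The matcher returns fixed-point disparities scaled by 16.
disparity = sgbm.compute(left, right).astype('float32') / 16.0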
For specific implementations you can use Google Scholar to search through the current literature. Here is one highly cited paper comparing various techniques:
A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms.
Multi-view reconstruction
Once you have the corresponding points, you can then use epipolar geometry theory for the triangulation calculations to find the 3D co-ordinates of the points.
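A minimal sketch of that triangulation step in Python with OpenCV, assuming the 3x4 projection matrices P1/P2 are already known from the earlier calibration and pose-estimation steps:

import cv2
import numpy as np

def triangulate(P1, P2, pts1, pts2):
    """P1, P2: 3x4 camera projection matrices; pts1, pts2: 2xN corresponding image points."""
    points_h = cv2.triangulatePoints(P1, P2, pts1, pts2)   # 4xN homogeneous coordinates
    points_3d = (points_h[:3] / points_h[3]).T             # Nx3 Euclidean coordinates
    return points_3d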
This whole stereo reconstruction would then be repeated for each pair of consecutive images (implying that you need an order to the images or at least knowledge of which images have many overlapping points). For each pair you would calculate a different fundamental matrix.
Of course, due to noise or inaccuracies at each of these steps you might want to consider how to solve the problem in a more global manner. For instance, if you have a series of images that are taken around an object and form a loop, this provides extra constraints that can be used to improve the accuracy of earlier steps using something like bundle adjustment.
As you can see, both stereo and multi-view reconstruction are far from solved problems and are still actively researched. The less you want to do in an automated manner the more well-defined the problem becomes, but even in these cases quite a bit of theory is required to get started.
Alternatives
If it's within the constraints of what you want to do, I would recommend considering dedicated hardware sensors (such as the XBox's Kinect) instead of only using normal cameras. These sensors use structured light, time-of-flight or some other range imaging technique to generate a depth image which they can also combine with colour data from their own cameras. They practically solve the single-view reconstruction problem for you and often include libraries and tools for stitching/combining multiple views.
Epipolar geometry references
My knowledge is actually quite thin on most of the theory, so the best I can do is to further provide you with some references that are hopefully useful (in order of relevance):
I found a PDF chapter on Multiple View Geometry that contains most of the critical theory. In fact the textbook Multiple View Geometry in Computer Vision should also be quite useful (sample chapters available here).
Here's a page describing a project on uncalibrated stereo reconstruction that seems to include some source code that could be useful. They find matching points in an automated manner using one of many feature detection techniques. If you want this part of the process to be automated as well, then SIFT feature detection is commonly considered to be an excellent non-real-time technique (since it's quite slow).
A paper about Scene Reconstruction from Multiple Uncalibrated Views.
A slideshow on Methods for 3D Reconstruction from Multiple Images (it has some more references below its slides towards the end).
A paper comparing different multi-view stereo reconstruction algorithms can be found here. It limits itself to algorithms that "reconstruct dense object models from calibrated views".
Here's a paper that goes into a lot of detail for the case where you have stereo cameras that take multiple images: Towards Robust Metric Reconstruction via a Dynamic Uncalibrated Stereo Head. They also present methods to self-calibrate the cameras.
I'm not sure how helpful all of this is, but hopefully it includes enough useful terminology and references to find further resources.
Research has made significant progress, and these days it is possible to obtain pretty good-looking 3D shapes from 2D images. For instance, in our recent research work titled "Synthesizing 3D Shapes via Modeling Multi-View Depth Maps and Silhouettes With Deep Generative Networks", we took a big step toward solving the problem of obtaining 3D shapes from 2D images. In our work, we show that you can not only go from 2D to 3D directly and get a good, approximate 3D reconstruction, but you can also learn a distribution of 3D shapes in an efficient manner and generate/synthesize 3D shapes. The figure in our paper shows that we are able to do 3D reconstruction even from a single silhouette or depth map (on the left), with the ground-truth 3D shapes shown on the right.
The approach we took has some contributions related to cognitive science or the way the brain works: the model we built shares parameters for all shape categories instead of being specific to only one category. Also, it obtains consistent representations and takes the uncertainty of the input view into account when producing a 3D shape as output. Therefore, it is able to naturally give meaningful results even for very ambiguous inputs. If you look at the citation to our paper you can see even more progress just in terms of going from 2D images to 3D shapes.
This problem is known as Photogrammetry.
Google will supply you with endless references, just be aware that if you want to roll your own, it's a very hard problem.
Check out The Deadalus Project; although that website does not contain a gallery with illustrative information about the solution, it posts several papers and info about the working method.
I watched a lecture by one of the main researchers on the project (Roger Hubbold), and the image results are quite amazing, although it is a complex and lengthy process with a lot of tricky details to take into account to get an approximation of the 3D data. Take, for example, the 3D information from wall surfaces, for which the heuristic works as follows: take a photo of the scene under normal illumination, then retake the picture from the same position with full flash active, subtract the two images, and divide the result by a pre-taken flash calibration image; apply a box filter to this new result and then post-process it to estimate depth values. The whole process is explained in detail in this paper (which is also posted/referenced on the project website).
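As a literal transcription of the heuristic described above (not the paper's actual implementation), here is a rough numpy/OpenCV sketch; the function and variable names are my own:

import cv2
import numpy as np

def flash_depth_cue(ambient, flash, flash_calib, ksize=15):
    """ambient/flash: the same scene without and with flash; flash_calib: a
    pre-captured image of the flash falloff, as described in the paper above."""
    ambient = ambient.astype(np.float32)
    flash = flash.astype(np.float32)
    calib = flash_calib.astype(np.float32) + 1e-6  # avoid division by zero

    # Flash-only contribution, normalised by the calibrated flash falloff.
    ratio = (flash - ambient) / calib

    # Box filter smooths the ratio image before it is mapped to depth estimates.
    smoothed = cv2.boxFilter(ratio, ddepth=-1, ksize=(ksize, ksize))
    return smoothed  # brighter regions received more flash light, i.e. are closer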
Google Sketchup (free) has a photo matching tool that allows you to take a photograph and match its perspective for easy modeling.
EDIT: It appears that you're interested in developing your own solution. I thought you were trying to obtain a 3D model of an image in a single instance. If this answer isn't helpful, I apologize.
Hope this helps if you are trying to construct a 3D volume from a 2D stack of images! You can use an open-source tool such as ImageJ Fiji, which comes with a 3D viewer plugin:
https://quppler.com/creating-a-classifier-using-image-j-fiji-for-3d-volume-data-preparation-from-stack-of-images/