I am developing an Augmented Reality SDK on OpenCV. I had trouble finding tutorials on the topic: which steps to follow, possible algorithms, fast and efficient coding for real-time performance, etc.
So far I have gathered the following information and useful links.
OpenCV installation
Download latest release version.
You can find installation guides here (platforms: linux, mac, windows, java, android, iOS).
Online documentation.
Augmented Reality
For beginners, here is a simple augmented reality code in OpenCV. It is a good start.
For anyone searching for a well-designed, state-of-the-art SDK, I found some general steps that every marker-tracking augmented reality application should have, considering OpenCV functions.
Main program: creates all classes, initialization, capture frames from video.
AR_Engine class: Controls the parts of an augmented reality application. There should be 2 main states:
detection: tries to detect the marker in the scene
tracking: once it is detected, uses computationally cheaper techniques for tracking the marker in upcoming frames.
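The two states above can be sketched as a small state machine. This is only an illustration of the switching logic, not part of any SDK: `AREngine`, `detect` and `track` are hypothetical names, and the two callables are assumed to return a pose, or None when the marker is not found.

```python
from enum import Enum


class State(Enum):
    DETECTION = "detection"
    TRACKING = "tracking"


class AREngine:
    """Switches between full detection and cheaper frame-to-frame tracking."""

    def __init__(self, detect, track):
        # detect(frame) and track(frame, last_pose) are hypothetical callables
        # that return a pose, or None when the marker is not found.
        self.detect = detect
        self.track = track
        self.state = State.DETECTION
        self.pose = None

    def process(self, frame):
        if self.state is State.DETECTION:
            self.pose = self.detect(frame)
        else:
            self.pose = self.track(frame, self.pose)
        # Fall back to detection whenever the marker is lost.
        self.state = State.TRACKING if self.pose is not None else State.DETECTION
        return self.pose
```

The main program would call `process` once per captured frame, so the expensive detector only runs while the marker is lost.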
Also there should be some algorithms for finding the position and orientation of the camera in every frame. This is achieved by estimating the homography between the marker detected in the scene and a 2D image of the marker that we have processed offline. This method is explained here (page 18). The main steps for pose estimation are:
Load camera Intrinsic Parameters. Previously extracted offline through calibration.
Load the pattern (marker) to track: It is an image of the planar marker we are going to track. It is necessary to extract features and generate descriptors (keypoints) for this pattern so later we can compare with features from the scene. Algorithms for this task:
SIFT
FAST
SURF
For every frame update, run a detection algorithm for extracting features from the scene and generate descriptors. Again we have several options.
SIFT
FAST
SURF
FREAK: A new method (2012) supposed to be the fastest.
ORB
Find matches between pattern and the scene descriptors.
FLANN matcher
Find Homography matrix from those matches. RANSAC can be used before to find inliers/outliers in the set of matches.
Extract Camera Pose from homography.
Sample code on Pose from Homography.
Sample code on Homography from Pose.
Complete examples:
aruco
Mastering OpenCV samples
Since AR applications often run on mobile devices, you could also consider other feature detectors/descriptors:
FREAK
ORB
Generally, if you can choose the markers, you first detect a square target using an edge detector and then either Hough or simply contours, then identify the particular marker from the internal design, rather than using a general point matcher.
Take a look at Aruco for well written example code.
I'm using the openALPR library to read license plates, but I'm having issues reading plates at different angles, as in the image below. My question is: what's the proper way to do that? Should I use image processing, such as a homography, to straighten the image before submitting it, tune openALPR's settings such as max_plate_angle_degrees, max_detection_input_width, max_detection_input_height, etc., or train tesseract-ocr with cropped images at different angles? I have no code to show because I'm looking for a direction on how to do that.
Yes, Homography is the general term for what you are looking for, but I think you might find better resources by searching for "geometric transformation" or "projective/perspective transformation".
Here are some starting resources that might help:
Understanding Homography (a.k.a Perspective Transformation) - Towards Data Science
Geometric Transformation of Image - OpenCV Tutorials
Applying perspective transformation and homography - Packt (soft paywall)
I don't know about OpenALPR, and can't find clear documentation anywhere, so I can't comment on it. But the settings you've listed (max_plate_angle_degrees, max_detection_input_width, max_detection_input_height) do not sound like what you are looking for.
And I personally don't recommend training Tesseract OCR directly on unprocessed images (uncropped, angled license plates). Typically, you would follow an image processing procedure before passing it to Tesseract OCR as the final step. However, I haven't touched Tesseract since version 3 back in 2018, so the latest version 5 might fare better...
Regardless, I would imagine a traditional, end-to-end image processing pipeline like this:
1. Image capture
2. Preprocessing of image brightness, contrast, and colors
3. Cascade classifiers to quickly detect & locate license plates (also filters out images without license plates)
4. Crop image to the detected license plate
5. Edge and corner detection
6. Identify control points with Hough transform (finding 4 corners of license plate through line intersections)
7. Estimate homography matrix with linear algebra (most image processing libraries should already have this function for you)
8. Projective transform on image from step (4) with the estimated homography matrix
9. Tesseract OCR on transformed image to extract license number
You would likely implement fewer steps than this, as most libraries provide higher-level functionality that handles all this automatically.
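To make the "estimate homography matrix with linear algebra" step concrete, here is a minimal NumPy sketch of the direct linear transform. The function names and the corner coordinates are made up for illustration. In OpenCV, cv2.getPerspectiveTransform (exactly 4 points) or cv2.findHomography would do the estimation, and cv2.warpPerspective would apply it to the whole image.

```python
import numpy as np


def estimate_homography(src, dst):
    """Direct linear transform: find H with dst ~ H @ src for >= 4 point pairs."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = Vt[-1].reshape(3, 3)
    # Fix the arbitrary scale (assumes H[2, 2] is not zero).
    return H / H[2, 2]


def apply_homography(H, points):
    """Map 2D points through H using homogeneous coordinates."""
    pts = np.column_stack([points, np.ones(len(points))])
    mapped = pts @ H.T
    return mapped[:, :2] / mapped[:, 2:3]


# Hypothetical plate corners found in the photo (clockwise from top-left),
# mapped onto an upright 200 x 50 rectangle.
corners = [(120, 80), (310, 110), (300, 190), (115, 150)]
target = [(0, 0), (200, 0), (200, 50), (0, 50)]
H = estimate_homography(corners, target)
```

With H in hand, every pixel of the cropped plate can be resampled into the upright rectangle before it goes to the OCR step.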
I have to implement contour detection of a full human body (from feet to head, in several poses such as raising hands etc.) using OpenCV. I managed to compile and run code I found here https://gist.github.com/yoggy/1470956, but it only draws a rectangle around the body, and not the exact contour. Can anyone help me with identifying and displaying the contour itself?
Thanks!!
I'm afraid the answer to this question is:
There's no algorithm that can do this perfectly.
Computer vision has not developed to that extent yet. Take a look at recent papers in CVPR, PAMI, and you will find that most algorithms are "rectangle", or more specifically, bounding-box based, in terms of human labeling and algorithmic detecting.
It is true that you can find the contours within the bounding-box. However the computer just doesn't know which contour belongs to the specified object.
I suggest you search for "human pose estimation" for further information.
One approach that might work is background subtraction:
http://docs.opencv.org/3.1.0/db/d5c/tutorial_py_bg_subtraction.html
This would work for video but perhaps also for single images in a controlled (fixed camera) environment where you had an image of the pose and also an image of the background, with no one present.
You can use the function findContours within the returned bounding box:
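For the single-image case described above, the idea can be sketched in a few lines of NumPy. This is an assumption-laden illustration (grayscale images, perfectly fixed camera, a hand-picked threshold of 30), not the method from the linked tutorial, which uses OpenCV's learned background models.

```python
import numpy as np


def foreground_mask(image, background, threshold=30):
    """Mark pixels that differ from an empty-scene background image.

    Both inputs are assumed to be grayscale uint8 arrays of the same shape,
    taken from the same fixed camera position.
    """
    # Widen the type first so the subtraction cannot wrap around.
    diff = np.abs(image.astype(np.int16) - background.astype(np.int16))
    return diff > threshold
```

The resulting mask is the silhouette of whatever changed in the scene. Its outline could then be traced with cv2.findContours after converting the mask to uint8.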
http://docs.opencv.org/doc/tutorials/imgproc/shapedescriptors/find_contours/find_contours.html
Information:
I would like to use OpenCV's HOG detection to identify objects that can be seen in a variety of orientations. The only problem is, I can't seem to find a reasonable feature detector or classifier to detect this in a rotation and scale invariant way (as is needed by objects such as forearms).
Prior Work:
Let's focus on forearms for this discussion. A forearm can have multiple orientations, the primary distinct features probably being its contour edges. It is possible to have images of forearms that are pointing in any direction in an image, thus the complexity. So far I have done some in-depth research on using HOG descriptors to solve this problem, but I am finding that the variety of poses produced by forearms in my positives training set is producing very low detection scores in actual images. I suspect the issue is that the gradients produced by each positive image do not produce very consistent results when saved into the histogram. I have reviewed many research papers on the topic trying to resolve or improve this, including the original from Dalal & Triggs (http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf). It also seems that the assumptions made for detecting whole humans do not necessarily apply to detecting individual features (particularly the assumption that all humans are standing up seems to suggest HOG is not a good route for rotation invariant detection like that of forearms).
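To illustrate the suspicion about inconsistent gradients, here is a small NumPy sketch of a single HOG-style cell histogram (9 unsigned orientation bins, as in Dalal & Triggs). It is a simplification: there is no block normalization or interpolation, and the function name is my own. It only shows that rotating a patch moves its energy into different bins.

```python
import numpy as np


def orientation_histogram(patch, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (0-180 deg)."""
    gy, gx = np.gradient(patch.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(angle, bins=bins, range=(0.0, 180.0), weights=magnitude)
    return hist


# A patch with one vertical edge, and the same patch rotated by 90 degrees.
patch = np.zeros((8, 8))
patch[:, 4:] = 1.0
upright = orientation_histogram(patch)
rotated = orientation_histogram(np.rot90(patch))
```

The two histograms peak in different bins even though the content is the same edge, which is why positives in many orientations average out to a weak template.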
Note:
If possible, I would like to steer clear of any non-free solutions such as those pertaining to Sift, Surf, or Haar.
Question:
What is a good solution to detecting rotation and scale invariant objects in an image? Particularly for this example, what would be a good solution to detecting all orientations of forearms in an image?
I use HOG to detect human heads and shoulders. To train a detector for a particular part, you have to give its location. If you use OpenCV, you can clip samples containing only the part you want to train on, and make sure all training samples share the same size. For example, I clip images to contain only the head and shoulders and resize them all to 64x64. Other open-source code may require you to pass the location as an input parameter, which is essentially the same.
Have you tried the discriminatively trained deformable part model? http://www.cs.berkeley.edu/~rbg/latent/
You may find answers there.
I have tried face recognition using OpenCV, following the documentation provided on their wiki. It's working fine and it can detect multiple faces. However, there is no data provided on the site regarding 3D object detection or head tracking. The links to the code and the wiki are provided below:
Face recognition
Cascade Classifier
While the wiki does provide sufficient information about face detection, as you might have found, 3D face recognition methods are not provided.
I wanted to know about projects related to 3D face recognition and tracking so that I can see the source code and try to make a project doing the same.
This might come late, but Willow Garage has another project running called the Point Cloud Library (PCL) that is entirely focused on 3D data processing tasks. Face recognition is one of the use cases they use to advertise the project. Of course all of this is free...
http://pointclouds.org
There are many methods. I can just point you in the right direction. Face recognition examples usually provide sub-detection of eyes, so you actually know the face and eye locations. By similar or other means you can also detect lips.
Now, when you have at least three points of an object (a face this time), you can calculate its 3D position in the room using triangulation. Part of this is shown in find_obj.cpp, which ships as an example with OpenCV. That example just uses a number of points from SURF and draws a rectangle based on this information. Also check out anything else that uses cvFindHomography.
Since OpenCV 2.4.2, there has been a header file for face detection and tracking: opencv2/contrib/detection_based_tracker.hpp
The header file defines a class called DetectionBasedTracker. The tracking mechanism it defines uses Haar cascades in the background to detect objects. The tracking is much faster than the OpenCV Haar implementation (however, some have found it to be less accurate).
I have personally found it to work very well on an Android device. Some sample code implementing the face detection and tracker is found here:
http://bytesandlogics.wordpress.com/2012/08/23/detectionbasedtracker-opencv-implementation/
You should have a look at Active Shape Models and Active Appearance Models, which are designed for the task you are describing.
OpenCV provides only 2D detection methods, while the methods referenced above (now very popular in the field) track a set of 3D points distributed on a face plus a texture to describe its appearance.
The Wikipedia pages will give you some links to implementations of these methods.
If you want to know the 3D parameters of the head in the world coordinates (for example for gaze detection), then you should google for the keywords "3D head tracking" and "head pose estimation".
If I take a picture with a camera, so I know the distance from the camera to the object, such as a scale model of a house, I would like to turn this into a 3D model that I can maneuver around so I can comment on different parts of the house.
If I sit down and think about taking more than one picture, labeling direction, and distance, I should be able to figure out how to do this, but, I thought I would ask if someone has some paper that may help explain more.
What language you explain in doesn't matter, as I am looking for the best approach.
Right now I am considering showing the house, then the user can put in some assistance for height, such as distance from the camera to the top of that part of the model, and given enough of this it would be possible to start calculating heights for the rest, especially if there is a top-down image, then pictures from angles on the four sides, to calculate relative heights.
Then I expect that parts will also need to differ in color to help separate out the various parts of the model.
As mentioned, the problem is very hard and is often also referred to as multi-view object reconstruction. It is usually approached by solving the stereo-view reconstruction problem for each pair of consecutive images.
Performing stereo reconstruction requires that pairs of images are taken that have a good amount of visible overlap of physical points. You need to find corresponding points such that you can then use triangulation to find the 3D co-ordinates of the points.
Epipolar geometry
Stereo reconstruction is usually done by first calibrating your camera setup so you can rectify your images using the theory of epipolar geometry. This simplifies finding corresponding points as well as the final triangulation calculations.
If you have:
the intrinsic camera parameters (requiring camera calibration),
the camera's position and rotation (its extrinsic parameters), and
8 or more physical points with matching known positions in two photos (when using the eight-point algorithm)
you can calculate the fundamental and essential matrices using only matrix theory and use these to rectify your images. This requires some theory about co-ordinate projections with homogeneous co-ordinates and also knowledge of the pinhole camera model and camera matrix.
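As a sketch of the "only matrix theory" part, this is roughly what the normalized eight-point algorithm looks like in NumPy. It assumes noise-free or lightly noisy matches given as N x 2 pixel arrays with N >= 8. The function names are mine, and a real pipeline would wrap this in RANSAC (for example via cv2.findFundamentalMat).

```python
import numpy as np


def normalize_points(pts):
    """Hartley normalization: zero mean, average distance sqrt(2)."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2) / np.linalg.norm(pts - mean, axis=1).mean()
    T = np.array([[scale, 0, -scale * mean[0]],
                  [0, scale, -scale * mean[1]],
                  [0, 0, 1]])
    homog = np.column_stack([pts, np.ones(len(pts))])
    return homog @ T.T, T


def fundamental_matrix(pts1, pts2):
    """Normalized eight-point estimate of F such that x2^T F x1 = 0."""
    x1, T1 = normalize_points(np.asarray(pts1, dtype=float))
    x2, T2 = normalize_points(np.asarray(pts2, dtype=float))
    # Each correspondence gives one linear equation in the 9 entries of F.
    A = np.column_stack([x2[:, [0]] * x1, x2[:, [1]] * x1, x1])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce rank 2 by zeroing the smallest singular value.
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    # Undo the normalization.
    F = T2.T @ F @ T1
    return F / np.linalg.norm(F)
```

If the intrinsic matrix K is known for both views, the essential matrix follows as E = K2^T F K1.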
If you want a method that doesn't need the camera parameters and works for unknown camera set-ups you should probably look into methods for uncalibrated stereo reconstruction.
Correspondence problem
Finding corresponding points is the tricky part that requires you to look for points of the same brightness or colour, or to use texture patterns or some other features to identify the same points in pairs of images. Techniques for this either work locally by looking for a best match in a small region around each point, or globally by considering the image as a whole.
If you already have the fundamental matrix, it will allow you to rectify the images such that corresponding points in two images will be constrained to a line (in theory). This helps you to use faster local techniques.
There is currently still no ideal technique to solve the correspondence problem, but possible approaches could fall in these categories:
Manual selection: have a person hand-select matching points.
Custom markers: place markers or use specific patterns/colours that you can easily identify.
Sum of squared differences: take a region around a point and find the closest whole matching region in the other image.
Graph cuts: a global optimisation technique based on optimisation using graph theory.
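Of these, the sum of squared differences approach is the simplest to write down. The sketch below assumes a rectified grayscale pair, so the search stays on one row, and a patch that lies fully inside both images. The window size and disparity range are arbitrary choices for illustration.

```python
import numpy as np


def ssd_match(left, right, row, col, half=2, max_disparity=16):
    """Find the column in `right` whose patch best matches left[row, col].

    Assumes rectified grayscale images, so the search stays on the same row,
    and that the patch around (row, col) lies fully inside both images.
    """
    patch = left[row - half:row + half + 1, col - half:col + half + 1].astype(float)
    best_col, best_cost = col, np.inf
    for d in range(max_disparity + 1):
        c = col - d  # in a rectified pair the match shifts left in the right image
        if c - half < 0:
            break
        candidate = right[row - half:row + half + 1, c - half:c + half + 1].astype(float)
        cost = np.sum((patch - candidate) ** 2)
        if cost < best_cost:
            best_col, best_cost = c, cost
    return best_col, best_cost
```

The difference between `col` and the returned column is the disparity, which is what the triangulation step turns into depth. Being purely local, this breaks down on textureless or repetitive regions, which is where the global methods come in.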
For specific implementations you can use Google Scholar to search through the current literature. Here is one highly cited paper comparing various techniques:
A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms.
Multi-view reconstruction
Once you have the corresponding points, you can then use epipolar geometry theory for the triangulation calculations to find the 3D co-ordinates of the points.
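For reference, the triangulation of a single point can be sketched with the standard linear (DLT) method. It assumes you already have the two 3x4 camera projection matrices P1 and P2, and it ignores noise, which real implementations handle by refining the result. OpenCV exposes a comparable routine as cv2.triangulatePoints.

```python
import numpy as np


def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two 3x4 camera matrices.

    x1 and x2 are the (u, v) pixel coordinates of the same physical point
    in the first and second image.
    """
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous 3D point is the null vector of A.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```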
This whole stereo reconstruction would then be repeated for each pair of consecutive images (implying that you need an order to the images or at least knowledge of which images have many overlapping points). For each pair you would calculate a different fundamental matrix.
Of course, due to noise or inaccuracies at each of these steps you might want to consider how to solve the problem in a more global manner. For instance, if you have a series of images that are taken around an object and form a loop, this provides extra constraints that can be used to improve the accuracy of earlier steps using something like bundle adjustment.
As you can see, both stereo and multi-view reconstruction are far from solved problems and are still actively researched. The less you want to do in an automated manner the more well-defined the problem becomes, but even in these cases quite a bit of theory is required to get started.
Alternatives
If it's within the constraints of what you want to do, I would recommend considering dedicated hardware sensors (such as the XBox's Kinect) instead of only using normal cameras. These sensors use structured light, time-of-flight or some other range imaging technique to generate a depth image which they can also combine with colour data from their own cameras. They practically solve the single-view reconstruction problem for you and often include libraries and tools for stitching/combining multiple views.
Epipolar geometry references
My knowledge is actually quite thin on most of the theory, so the best I can do is to further provide you with some references that are hopefully useful (in order of relevance):
I found a PDF chapter on Multiple View Geometry that contains most of the critical theory. In fact the textbook Multiple View Geometry in Computer Vision should also be quite useful (sample chapters available here).
Here's a page describing a project on uncalibrated stereo reconstruction that seems to include some source code that could be useful. They find matching points in an automated manner using one of many feature detection techniques. If you want this part of the process to be automated as well, then SIFT feature detection is commonly considered to be an excellent non-real-time technique (since it's quite slow).
A paper about Scene Reconstruction from Multiple Uncalibrated Views.
A slideshow on Methods for 3D Reconstruction from Multiple Images (it has some more references below its slides towards the end).
A paper comparing different multi-view stereo reconstruction algorithms can be found here. It limits itself to algorithms that "reconstruct dense object models from calibrated views".
Here's a paper that goes into lots of detail for the case that you have stereo cameras that take multiple images: Towards robust metric reconstruction via a dynamic uncalibrated stereo head. They then find methods to self-calibrate the cameras.
I'm not sure how helpful all of this is, but hopefully it includes enough useful terminology and references to find further resources.
Research has made significant progress and these days it is possible to obtain pretty good-looking 3D shapes from 2D images. For instance, our recent research work titled "Synthesizing 3D Shapes via Modeling Multi-View Depth Maps and Silhouettes With Deep Generative Networks" took a big step in solving the problem of obtaining 3D shapes from 2D images. In our work, we show that you can not only go from 2D to 3D directly and get a good, approximate 3D reconstruction, but you can also learn a distribution of 3D shapes in an efficient manner and generate/synthesize 3D shapes. Below is an image of our work showing that we are able to do 3D reconstruction even from a single silhouette or depth map (on the left). The ground-truth 3D shapes are shown on the right.
The approach we took has some contributions related to cognitive science or the way the brain works: the model we built shares parameters for all shape categories instead of being specific to only one category. Also, it obtains consistent representations and takes the uncertainty of the input view into account when producing a 3D shape as output. Therefore, it is able to naturally give meaningful results even for very ambiguous inputs. If you look at the citation to our paper you can see even more progress just in terms of going from 2D images to 3D shapes.
This problem is known as Photogrammetry.
Google will supply you with endless references, just be aware that if you want to roll your own, it's a very hard problem.
Check out The Daedalus Project. Although that website does not contain a gallery with illustrative information about the solution, it posts several papers and info about the working method.
I watched a lecture by one of the main researchers of the project (Roger Hubbold), and the image results are quite amazing, although it is a long and complex problem. There are a lot of tricky details to take into account to get an approximation of the 3D data. Take, for example, the 3D information from wall surfaces, for which the heuristic works as follows: take a photo of the scene with normal illumination, then retake the picture from the same position with full flash active; subtract the two images and divide the result by a pre-taken flash calibration image; apply a box filter to this new result; and then post-process to estimate depth values. The whole process is explained in detail in this paper (which is also posted/referenced on the project website).
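The image arithmetic in that heuristic (subtract, divide by the calibration image, box filter) can be sketched as below. This only covers the steps as I have described them, not the paper's actual implementation. The final depth post-processing is left out, and the function names, window size and epsilon are my own assumptions.

```python
import numpy as np


def box_filter(img, k=3):
    """Mean filter with a k x k window (edges use replicated borders)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)


def flash_ratio(flash, ambient, calibration, k=3, eps=1e-6):
    """Flash-only image divided by a flash calibration image, then smoothed."""
    # Removing the ambient photo leaves only the light contributed by the flash.
    flash_only = flash.astype(float) - ambient.astype(float)
    # eps guards against division by zero in dark calibration pixels.
    ratio = flash_only / np.maximum(calibration.astype(float), eps)
    return box_filter(ratio, k)
```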
Google Sketchup (free) has a photo matching tool that allows you to take a photograph and match its perspective for easy modeling.
EDIT: It appears that you're interested in developing your own solution. I thought you were trying to obtain a 3D model of an image in a single instance. If this answer isn't helpful, I apologize.
Hope this helps if you are trying to construct a 3D volume from a 2D stack of images! You can use an open-source tool such as ImageJ Fiji, which comes with a 3D viewer plugin.
https://quppler.com/creating-a-classifier-using-image-j-fiji-for-3d-volume-data-preparation-from-stack-of-images/