3D reconstruction -- How to create 3D model from 2D image? - image-processing

Suppose I take a picture with a camera and I know the distance from the camera to the object, such as a scale model of a house. I would like to turn this into a 3D model that I can maneuver around, so I can comment on different parts of the house.
If I sit down and think about taking more than one picture and labeling direction and distance, I could probably work out how to do this myself, but I thought I would ask whether someone knows of a paper that explains it in more detail.
Which language you explain it in doesn't matter, as I am looking for the best approach.
Right now I am considering showing the house and then letting the user provide some assistance for height, such as the distance from the camera to the top of a given part of the model. Given enough of these hints it should be possible to start calculating heights for the rest, especially if there is a top-down image plus pictures from angles on the four sides, so that relative heights can be calculated.
I also expect that parts will need to differ in color to help separate out the various parts of the model.

As mentioned, the problem is very hard and is often also referred to as multi-view object reconstruction. It is usually approached by solving the stereo-view reconstruction problem for each pair of consecutive images.
Performing stereo reconstruction requires that pairs of images are taken that have a good amount of visible overlap of physical points. You need to find corresponding points such that you can then use triangulation to find the 3D co-ordinates of the points.
Epipolar geometry
Stereo reconstruction is usually done by first calibrating your camera setup so you can rectify your images using the theory of epipolar geometry. This simplifies finding corresponding points as well as the final triangulation calculations.
If you have:
the intrinsic camera parameters (requiring camera calibration),
the camera's position and rotation (its extrinsic parameters), and
8 or more physical points with matching known positions in two photos (when using the eight-point algorithm)
you can calculate the fundamental and essential matrices using only matrix theory and use these to rectify your images. This requires some theory about co-ordinate projections with homogeneous co-ordinates and also knowledge of the pinhole camera model and camera matrix.
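For illustration, here is a minimal OpenCV sketch of this step, assuming you already have eight or more matched point pairs (pts1, pts2 as Nx2 float arrays) and a calibrated intrinsic matrix K; the file names and image size are placeholders.

```python
import cv2
import numpy as np

# Placeholder inputs: matched pixel coordinates in two photos (N >= 8)
# and the 3x3 intrinsic matrix K from a prior camera calibration.
pts1 = np.load("matches_view1.npy")
pts2 = np.load("matches_view2.npy")
K = np.load("intrinsics.npy")

# Fundamental matrix via a robust (RANSAC) variant of the eight-point algorithm.
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)

# Essential matrix from E = K^T F K (or directly with cv2.findEssentialMat),
# then the relative pose (rotation R, translation t known only up to scale).
E = K.T @ F @ K
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

# Rectify without known extrinsics: warp both images so that corresponding
# points end up on the same image row.
w, h = 640, 480  # placeholder image size
_, H1, H2 = cv2.stereoRectifyUncalibrated(pts1, pts2, F, (w, h))
```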
If you want a method that doesn't need the camera parameters and works for unknown camera set-ups you should probably look into methods for uncalibrated stereo reconstruction.
Correspondence problem
Finding corresponding points is the tricky part that requires you to look for points of the same brightness or colour, or to use texture patterns or some other features to identify the same points in pairs of images. Techniques for this either work locally by looking for a best match in a small region around each point, or globally by considering the image as a whole.
If you already have the fundamental matrix, it will allow you to rectify the images such that corresponding points in two images will be constrained to a line (in theory). This helps you to use faster local techniques.
There is currently still no ideal technique to solve the correspondence problem, but possible approaches could fall in these categories:
Manual selection: have a person hand-select matching points.
Custom markers: place markers or use specific patterns/colours that you can easily identify.
Sum of squared differences: take a region around a point and find the best-matching region in the other image (a minimal sketch follows this list).
Graph cuts: a global optimisation technique based on optimisation using graph theory.
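To make the sum-of-squared-differences idea concrete, here is a toy sketch for a rectified grayscale pair (so the search runs along a single row); real implementations aggregate costs over the whole image and handle borders and ambiguity far more carefully.

```python
import numpy as np

def ssd_disparity(left, right, x, y, patch=5, max_disp=64):
    """Disparity of pixel (x, y) in a rectified grayscale pair, found by
    minimising the sum of squared differences over a small patch.
    A toy illustration, not an efficient or robust implementation."""
    h = patch // 2
    ref = left[y - h:y + h + 1, x - h:x + h + 1].astype(np.float32)
    best_d, best_cost = 0, np.inf
    for d in range(max_disp):
        if x - d - h < 0:
            break
        cand = right[y - h:y + h + 1, x - d - h:x - d + h + 1].astype(np.float32)
        cost = np.sum((ref - cand) ** 2)
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d
```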
For specific implementations you can use Google Scholar to search through the current literature. Here is one highly cited paper comparing various techniques:
A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms.
Multi-view reconstruction
Once you have the corresponding points, you can then use epipolar geometry theory for the triangulation calculations to find the 3D co-ordinates of the points.
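Continuing the earlier sketch, the triangulation itself is short once you have a projection matrix for each view; here R and t are assumed to come from decomposing the essential matrix as above, and the resulting reconstruction is only defined up to an unknown global scale.

```python
import cv2
import numpy as np

# Assumed available from the earlier sketch: K, R, t, and the matched
# points pts1/pts2 as Nx2 float arrays.
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])  # first camera at the origin
P2 = K @ np.hstack([R, t])                         # second camera

points_4d = cv2.triangulatePoints(P1, P2, pts1.T.astype(float), pts2.T.astype(float))
points_3d = (points_4d[:3] / points_4d[3]).T       # de-homogenise to Nx3
```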
This whole stereo reconstruction would then be repeated for each pair of consecutive images (implying that you need an order to the images or at least knowledge of which images have many overlapping points). For each pair you would calculate a different fundamental matrix.
Of course, due to noise or inaccuracies at each of these steps you might want to consider how to solve the problem in a more global manner. For instance, if you have a series of images that are taken around an object and form a loop, this provides extra constraints that can be used to improve the accuracy of earlier steps using something like bundle adjustment.
As you can see, both stereo and multi-view reconstruction are far from solved problems and are still actively researched. The less you want to do in an automated manner the more well-defined the problem becomes, but even in these cases quite a bit of theory is required to get started.
Alternatives
If it's within the constraints of what you want to do, I would recommend considering dedicated hardware sensors (such as the XBox's Kinect) instead of only using normal cameras. These sensors use structured light, time-of-flight or some other range imaging technique to generate a depth image which they can also combine with colour data from their own cameras. They practically solve the single-view reconstruction problem for you and often include libraries and tools for stitching/combining multiple views.
Epipolar geometry references
My knowledge is actually quite thin on most of the theory, so the best I can do is to further provide you with some references that are hopefully useful (in order of relevance):
I found a PDF chapter on Multiple View Geometry that contains most of the critical theory. In fact the textbook Multiple View Geometry in Computer Vision should also be quite useful (sample chapters available here).
Here's a page describing a project on uncalibrated stereo reconstruction that seems to include some source code that could be useful. They find matching points in an automated manner using one of many feature detection techniques. If you want this part of the process to be automated as well, then SIFT feature detection is commonly considered to be an excellent non-real-time technique (since it's quite slow).
A paper about Scene Reconstruction from Multiple Uncalibrated Views.
A slideshow on Methods for 3D Reconstruction from Multiple Images (it has some more references below its slides towards the end).
A paper comparing different multi-view stereo reconstruction algorithms can be found here. It limits itself to algorithms that "reconstruct dense object models from calibrated views".
Here's a paper that goes into lots of detail for the case where you have stereo cameras that take multiple images: Towards robust metric reconstruction via a dynamic uncalibrated stereo head. They then find methods to self-calibrate the cameras.
I'm not sure how helpful all of this is, but hopefully it includes enough useful terminology and references to find further resources.

Research has made significant progress and these days it is possible to obtain pretty good-looking 3D shapes from 2D images. For instance, our recent research work titled "Synthesizing 3D Shapes via Modeling Multi-View Depth Maps and Silhouettes With Deep Generative Networks" took a big step towards solving the problem of obtaining 3D shapes from 2D images. In this work we show that you can not only go from 2D to 3D directly and get a good, approximate 3D reconstruction, but you can also learn a distribution of 3D shapes in an efficient manner and generate/synthesize new 3D shapes. The figure in our paper shows that we are able to do 3D reconstruction even from a single silhouette or depth map, with the ground-truth 3D shapes shown alongside for comparison.
The approach we took has some contributions related to cognitive science, or the way the brain works: the model we built shares parameters across all shape categories instead of being specific to only one category. It also obtains consistent representations and takes the uncertainty of the input view into account when producing a 3D shape as output, so it is able to give meaningful results even for very ambiguous inputs. If you look at the papers citing ours, you can see even more progress just in terms of going from 2D images to 3D shapes.

This problem is known as Photogrammetry.
Google will supply you with endless references; just be aware that if you want to roll your own, it's a very hard problem.

Check out The Deadalus Project. Although the website does not contain a gallery with illustrative information about the solution, it posts several papers and information about the working method.
I watched a lecture by one of the main researchers of the project (Roger Hubbold), and the image results are quite amazing, although it is a complex and lengthy problem with a lot of tricky details to take into account to get an approximation of the 3D data. Take, for example, the 3D information from wall surfaces, for which the heuristic works as follows: take a photo of the scene under normal illumination, then retake the picture from the same position with the flash at full power; subtract the two images and divide the result by a pre-taken flash calibration image; apply a box filter to this result and then post-process it to estimate depth values. The whole process is explained in detail in a paper that is also posted/referenced on the project website.
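A rough sketch of that flash/no-flash step, assuming three aligned grayscale images with placeholder file names; the final mapping from the filtered ratio to metric depth is an extra calibration step not shown here, and the inverse-square fall-off used below is only an approximation.

```python
import cv2
import numpy as np

ambient = cv2.imread("no_flash.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)
flash = cv2.imread("flash.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)
flash_calib = cv2.imread("flash_calibration.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Keep only the flash contribution, then normalise by the calibration image.
flash_only = np.clip(flash - ambient, 0, None)
ratio = flash_only / (flash_calib + 1e-6)

# Smooth with a box filter before estimating relative depth; the flash
# falls off roughly with the square of distance, so brighter ratios
# correspond to nearer surfaces.
smoothed = cv2.boxFilter(ratio, ddepth=-1, ksize=(9, 9))
relative_depth = 1.0 / np.sqrt(smoothed + 1e-6)
```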

Google Sketchup (free) has a photo matching tool that allows you to take a photograph and match its perspective for easy modeling.
EDIT: It appears that you're interested in developing your own solution. I thought you were trying to obtain a 3D model of an image in a single instance. If this answer isn't helpful, I apologize.

Hope this helps if you are trying to construct a 3D volume from a 2D stack of images. You can use an open-source tool such as ImageJ Fiji, which comes with a 3D viewer plugin.
https://quppler.com/creating-a-classifier-using-image-j-fiji-for-3d-volume-data-preparation-from-stack-of-images/

Related

Finding displacement between two camera frames

I'm currently working on a visual odometry project. So far I've implemented everything up to the essential matrix decomposition stage, but the resulting translation vector is normalized to unit length, so I cannot plot the actual movement.
How can I compute the displacement at a real scale? I have seen suggestions to use a planar homography to compute the absolute translation, but I don't quite see how, since the outdoor environment is not simply planar. At the very least, treating the ground as planar, how would I obtain its translation? I've also seen a suggestion here; is it possible to use that approach to get the displacement between two frames?
What you are referring to is called registration. This is a vast field. There are methods that estimate a single linear transformation across the entire image, and there are per-pixel methods (the two ends of the spectrum). Naturally, per-pixel methods are typically far slower and suffer from many local errors.
Typically two consecutive frames have very little transformation between them, and a simple homography will do to find the general scaling between them, especially if you are talking about aerial photos. If your scene is very far from planar, then you may want to use something closer to a pixel-wise method, for example spline-based registration: https://www.mathworks.com/matlabcentral/fileexchange/20057-b-spline-grid--image-and-point-based-registration
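As a minimal sketch of the homography route (assuming the scene is close to planar; the frame file names are placeholders), something like this estimates the frame-to-frame transformation robustly:

```python
import cv2
import numpy as np

frame1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
frame2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

# Detect and match ORB features between the two frames.
orb = cv2.ORB_create(2000)
kp1, des1 = orb.detectAndCompute(frame1, None)
kp2, des2 = orb.detectAndCompute(frame2, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)

src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

# Robust homography; for a (nearly) planar scene its entries describe the
# global rotation/translation/scaling between the two frames.
H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
```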
You cannot recover scale, generally speaking, unless you can recognize in the scene 1 or more objects of known physical size.

OpenCV - Feature Matching vs Optical Flow

I am interested in making a motion tracking app using OpenCV, and there has been a wealth of information available online. However, I am a tad confused between feature matching and tracking features using a sparse optical flow algorithm such as Lucas-Kanade. With that in mind, I have the following questions:
What is the main difference between the two (feature matching and optical flow) if I have specified a region of pixels to track? I'm not interested in tracking in real time, if that helps clear up any assumptions.
In addition, since I'm not doing real time tracking, is it a better idea to use dense optical flow (Farneback) to keep track of the pixels in my specified region of interest?
Thank you.
I would like to add a few thoughts about that theme since I found this a very interesting question too.
As said before, feature matching is a technique based on two steps:
A feature detection step, which returns a set of so-called feature points. These feature points are located at positions with salient image structures, e.g. edge-like structures if you are using FAST or blob-like structures if you are using SIFT or SURF.
The second step is the matching: the association of feature points extracted from two different images. The matching is based on local visual descriptors, e.g. histograms of gradients or binary patterns, that are extracted locally around the feature positions. A descriptor is a feature vector, and associated feature point pairs are those with minimal descriptor distances.
Most feature matching methods are scale and rotation invariant and are robust to changes in illumination (e.g. caused by shadows or different contrast). Thus these methods can be applied to image sequences, but are more often used to align image pairs captured from different viewpoints or with different devices. The disadvantage of feature matching methods is that you have little control over where the matches appear, and the feature pairs (which in an image sequence are motion vectors) are in general very sparse. In addition, the sub-pixel accuracy of matching approaches is very limited, as most detectors are quantized to integer positions.
From my experience, the main advantage of feature matching approaches is that they can handle very large motions/displacements.
OpenCV offers several feature matching methods, but there are many more recent, faster and more accurate approaches available online, e.g.:
DeepMatching, which relies on deep learning and is often used to initialize optical flow methods to help them deal with long-range motions.
StereoScan, which is a very fast approach originally proposed for visual odometry.
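For reference, here is a minimal OpenCV feature-matching sketch following the detect/describe/match pipeline described above (SIFT plus Lowe's ratio test; the image paths are placeholders and a recent OpenCV build is assumed):

```python
import cv2

img1 = cv2.imread("view1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.png", cv2.IMREAD_GRAYSCALE)

# 1) Detection + description: salient points with local descriptors.
sift = cv2.SIFT_create()  # in the main modules since OpenCV 4.4
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# 2) Matching: nearest neighbours in descriptor space, filtered with
#    Lowe's ratio test to discard ambiguous associations.
matcher = cv2.BFMatcher(cv2.NORM_L2)
knn = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in knn if m.distance < 0.75 * n.distance]

vis = cv2.drawMatches(img1, kp1, img2, kp2, good, None)
cv2.imwrite("matches.png", vis)
```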
Optical flow methods, in contrast, rely on minimizing a brightness constancy term plus additional constraints such as smoothness. They derive motion vectors from the spatial and temporal image gradients of a sequence of consecutive frames, so they are better suited to image sequences than to image pairs captured from very different viewpoints. The main challenges in estimating motion with optical flow are large motions, occlusion, strong illumination changes, changes in the appearance of objects, and achieving a low runtime. However, optical flow methods can be highly accurate and compute dense motion fields that respect the shared motion boundaries of the objects in a scene.
However, the accuracy varies considerably between optical flow methods. Local methods such as pyramidal Lucas-Kanade (PLK) are in general less accurate, but they allow motion vectors to be computed only at pre-selected points and can thus be very fast. (In recent years we have done some research to improve the accuracy of the local approach; see here for further information.)
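For completeness, a minimal sparse pyramidal Lucas-Kanade sketch for pre-selected points, with illustrative parameter values and placeholder frame files:

```python
import cv2

prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Pick points worth tracking (Shi-Tomasi corners), then track them into the
# next frame with pyramidal Lucas-Kanade.
p0 = cv2.goodFeaturesToTrack(prev, maxCorners=500, qualityLevel=0.01, minDistance=7)
p1, status, err = cv2.calcOpticalFlowPyrLK(
    prev, curr, p0, None,
    winSize=(21, 21), maxLevel=3,
    criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01),
)

# Keep only successfully tracked points; the differences are the motion vectors.
motion_vectors = (p1 - p0)[status.flatten() == 1]
```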
The main OpenCV trunk offers global approaches such as Farnebäck's, but this is a fairly dated approach. Try the OpenCV contrib repository, which has more recent methods. To get a good overview of the most recent methods, take a look at the public optical flow benchmarks, where you will also find code and implementations, e.g.:
MPI-Sintel optical flow benchmark
KITTI 2012 optical flow benchmark. Both offer links to repositories or source code for some newer methods, such as FlowFields.
From my point of view, I would not reject either approach (matching or optical flow) at an early stage. Try as many of the implementations available online as possible and see which works best for your application.
Feature matching uses the feature descriptors to match features with one another (usually) using a nearest neighbor search in the feature descriptor space. The basic idea is you have descriptor vectors, and the same feature in two images should be near each other in the descriptor space, so you just match that way.
Optical flow algorithms do not look at a descriptor space; instead, they look at pixel patches around features and try to match those patches. If you're familiar with dense optical flow, sparse optical flow just does dense optical flow on small patches of the image around feature points. Thus optical flow assumes brightness constancy, that is, that pixel brightness doesn't change between frames. Also, since you're looking at neighboring pixels, you need to assume that points neighboring your features move similarly to your features. Finally, since it's using a dense flow algorithm on small patches, the points cannot move very far in the image from the original feature location. If they do, then the pyramid-resolution approach is recommended, where you scale down the image first so that what once was a 16-pixel translation is now a 2-pixel translation, and then you scale back up using the found transformation as your prior.
So feature matching algorithms are all-in-all far better when it comes to using templates where the scale is not exactly the same, or where there's a perspective difference between the image and template, or where the transformations are large. However, your matches are only as good as your feature detector is exact. With optical flow algorithms, as long as they're looking in the right spot, the transformations can be really, really precise. Both are somewhat computationally expensive: optical flow algorithms are iterative, which makes them expensive (and although you'd think the pyramid approach would eat up more cost by running on more images, it can actually make it faster in some cases to reach the desired accuracy), and nearest neighbor searches are also expensive. Optical flow algorithms, on the other hand, can work really well when the transformations are small, but anything in your scene that messes with your lighting, or a few incorrect pixels (say, even minor occlusion), can really throw them off.
Which one to use definitely depends on the project. For a project I worked on with satellite imagery, I used dense optical flow because the images of desert terrain I was working with did not have precise enough features (in location) and different feature descriptors happen to look relatively similar so searching that feature space wasn't giving tons of great matches. In this case, optical flow was the better method. However, if you were doing image alignment on satellite imagery of a city where buildings can occlude parts of the scene, there are a lot of features that will stay matched and give a better result.
The OpenCV Lucas-Kanade tutorial doesn't give a whole lot of insight but should get your code moving in the right direction with the above in mind.
key-point matching = sparse optical flow
KLT tracking is a good example of sparse flow; see the demo LKDemo.cpp (there was a Python wrapper example too, but I can't remember it now).
For a dense example, see samples/python/opt_flow.py, which uses Farnebäck's method.
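A minimal dense-flow sketch along the same lines as that sample, with placeholder frame files and commonly used parameter values:

```python
import cv2

prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Dense optical flow: one (dx, dy) vector per pixel.
flow = cv2.calcOpticalFlowFarneback(
    prev, curr, None,
    pyr_scale=0.5, levels=3, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0,
)
magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
```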
You are right in being confused... The entire world is confused about this terribly simple topic. A lot of the reason is that people believe Lucas-Kanade to be sparse flow (due to a terribly badly named and commented example in OpenCV: LKdemo, which should be called KLTDemo).

Object Recognition by Outlines vs Features

Context:
I have the RGB-D video from a Kinect, which is aimed straight down at a table. There is a library of around 12 objects I need to identify, alone or several at a time. I have been working with SURF extraction and detection from the RGB image, preprocessing by downscaling to 320x240, grayscale, stretching the contrast and balancing the histogram before applying SURF. I built a lasso tool to choose among detected keypoints in a still of the video image. Then those keypoints are used to build object descriptors which are used to identify objects in the live video feed.
Problem:
SURF examples show successful identification of objects with a decent amount of text-like feature detail eg. logos and patterns. The objects I need to identify are relatively plain but have distinctive geometry. The SURF features found in my stills are sometimes consistent but mostly unimportant surface features. For instance, say I have a wooden cube. SURF detects a few bits of grain on one face, then fails on other faces. I need to detect (something like) that there are four corners at equal distances and right angles. None of my objects has much of a pattern but all have distinctive symmetric geometry and color. Think cellphone, lollipop, knife, bowling pin. My thought was that I could build object descriptors for each significantly different-looking orientation of the object, eg. two descriptors for a bowling pin: one standing up and one laying down. For a cellphone, one laying on the front and one on the back. My recognizer needs rotational invariance and some degree of scale invariance in case objects are stacked. Ability to deal with some occlusion is preferable (SURF behaves well enough) but not the most important characteristic. Skew invariance would be preferable and SURF does well with paper printouts of my objects held by hand at a skew.
Questions:
Am I using the wrong SURF parameters to find features at the wrong scale? Is there a better algorithm for this kind of object identification? Is there something as readily usable as SURF that uses the depth data from the Kinect along with or instead of the RGB data?
I was doing something similar for a project and ended up using a very simple method for object recognition: OpenCV blob detection, recognizing objects based on their areas. Obviously, there needs to be enough variance in area between objects for this method to work.
You can see my results here: http://portfolio.jackkalish.com/Secondhand-Stories
I know there are other methods out there; one possible solution for you could be approxPolyDP, which is described here:
How to detect simple geometric shapes using OpenCV
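As a hedged illustration of the blob-area and approxPolyDP ideas above (assuming a reasonably clean binary segmentation; the file name and thresholds are placeholders):

```python
import cv2

img = cv2.imread("object.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for c in contours:
    area = cv2.contourArea(c)
    if area < 500:  # ignore small blobs
        continue
    # Approximate the outline with a polygon; the vertex count and area give
    # a crude shape signature (e.g. four corners for a cube face seen top-down).
    peri = cv2.arcLength(c, True)
    poly = cv2.approxPolyDP(c, 0.02 * peri, True)
    print(len(poly), area)
```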
Would love to hear about your progress on this!

Type of graph cut algorithm for 3D reconstruction

I have read several papers on using graph cuts for 3D reconstruction and I have noticed that there seem to be two alternative approaches to posing this problem.
One approach is volumetric and describes a 3D region of voxels for which a graph cut is used to infer a binary labelling (contains object of interest or does not) for each voxel. Papers which take this approach include Multi-View Stereo via Volumetric Graph Cuts and Occlusion Robust Photo-Consistency and A Surface Reconstruction Using Global Graph Cut Optimization.
The second approach is 2D and seeks to label each pixel of a reference image with the depth of the 3D point that projects there. Papers which take this approach include Computing Visual Correspondence with Occlusions via Graph Cuts.
I want to understand the advantages/disadvantages of each method and which are the most significant when choosing which method to use. So far I understand that some advantages of the first approach are:
It is a binary problem, so is solvable exactly with Max-Flow algorithms.
Provides simple methods of modelling occlusion.
And some advantages of the second approach are:
Smaller neighbor set for each node of the graph.
Easier to model smoothness (but does it give better results?).
Additionally, I would be interested in which situations I would be better off choosing one representation or the other and why.
The most significant difference is the type of scenes the algorithms are typically used with, and the way they represent the 3D shape of the object.
Volumetric approaches perform best...
with a large number of images...
taken from different viewpoints, well distributed around the object,...
of a more or less compact "object" (e.g. an artifact, in contrast, for example, to an outdoor scene observed by a vehicle camera).
Volumetric approaches are popular for reconstructing "objects" (especially artifacts). Given sufficient views (i.e. images), the algorithms give a complete volumetric (i.e. voxel) representation of the object's shape. This can be converted to a surface representation using Marching Cubes or a similar method.
The second type of algorithm you identified is the stereo algorithm, and graph cuts are just one of many methods of solving such problems. Stereo is best...
if you have only two images...
with a fairly narrow baseline (i.e. distance between cameras)
Generalizations to more than two images (with narrow baselines) exist, but most of the literature deals with the binocular (i.e. two image) case. Some algorithms generalize more easily to more views than others.
Stereo algorithms only give you a depth map, i.e. an image with a depth value for each pixel. This does not allow you to look "around" the object. There are, however, 3D reconstruction systems that start with stereo on image pairs and combine the depth maps in order to get a representation of the complete object, which is a non-trivial problem in its own right. Interestingly, this is often approached using a volumetric representation as an intermediate step.
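To make the depth-map output concrete, here is a minimal OpenCV sketch using semi-global block matching rather than graph cuts (the output format, one disparity value per reference pixel, is the same); it assumes an already rectified, narrow-baseline pair with placeholder file names:

```python
import cv2

left = cv2.imread("rect_left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("rect_right.png", cv2.IMREAD_GRAYSCALE)

# One disparity value per pixel of the reference (left) image.
matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
disparity = matcher.compute(left, right).astype("float32") / 16.0  # fixed point -> pixels

# With a calibrated rig, depth = focal_length * baseline / disparity.
```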
Stereo algorithms can be, and often are, used for "scenes", e.g. the road observed by a pair of cameras in a vehicle, or people in a room for 3D video conferencing.
Some closing remarks
For both stereo and volumetric reconstruction, graph cuts are just one of several methods to solve the problem. Stereo, for example, can also be formulated as a continuous optimization problem, rather than a discrete one, which implies other optimization methods for its solution.
My answer contains a bunch of generalizations and simplifications. It is not meant to be a definitive treatment of the subject.
I don't necessarily agree that smoothness is easier in the stereo case. Why do you think so?

Algorithm for measuring the Euclidean distance between pixels in an image

I have a number of images where I know the focal length, pixel count, dimensions and position (from GPS). They are all high-oblique shots, taken on the ground with commercially available cameras.
(Image: an example high-oblique photograph; original link: http://desmond.yfrog.com/Himg411/scaled.php?tn=0&server=411&filename=mjbm.jpg&xsize=640&ysize=640)
What would be the best method for calculating the Euclidean distances between certain pixels within an image, if it is indeed possible?
Assuming you're not looking for full landscape modelling but a simple approximation, this shouldn't be too hard. Basically, a first approximation of your image reduces to a camera with known focal length looking along a plane, so we can create a model of the system in 3D very easily: it's not too far from the classic observer-looking-over-a-checkerboard demo.
Normally our graphics problem would be to project the 3D model into 2D so we could render the image. Although most programs nowadays use an API (such as OpenGL) to do this, the equations are not particularly complex or difficult to understand. I wrote my first code using the examples from 3D Graphics In Pascal, which is a nice clear treatise, but there will be lots of other similar sources (although probably fewer nowadays, as a hardware API is invariably used).
What's useful about this is that the projection equations can be run in reverse: if you have a point in the image and the model, you can run the data back through the projection to retrieve the original 3D coordinates, which is what you wish to do.
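As a minimal sketch of running the projection backwards under the flat-ground approximation (all numbers below are placeholder values you would replace with your own calibration and GPS/orientation data):

```python
import numpy as np

def pixel_to_ground(u, v, f, cx, cy, cam_height, pitch):
    """Back-project pixel (u, v) onto a flat ground plane.
    f: focal length in pixels; (cx, cy): principal point;
    cam_height: camera height above the ground in metres;
    pitch: downward tilt of the optical axis in radians.
    Returns the (x, z) ground position in metres relative to the camera."""
    # Ray through the pixel in camera coordinates (x right, y down, z forward).
    ray_cam = np.array([(u - cx) / f, (v - cy) / f, 1.0])
    # Rotate into a world frame whose y axis points straight down.
    c, s = np.cos(pitch), np.sin(pitch)
    R = np.array([[1.0, 0.0, 0.0],
                  [0.0,   c,   s],
                  [0.0,  -s,   c]])
    ray_w = R @ ray_cam
    if ray_w[1] <= 0:
        raise ValueError("pixel is above the horizon for this camera pitch")
    t = cam_height / ray_w[1]  # scale at which the ray hits the ground
    return np.array([t * ray_w[0], t * ray_w[2]])

# Hypothetical example: ground distance between two image points.
p1 = pixel_to_ground(400, 600, f=1500, cx=960, cy=540, cam_height=1.6, pitch=np.radians(10))
p2 = pixel_to_ground(1400, 650, f=1500, cx=960, cy=540, cam_height=1.6, pitch=np.radians(10))
print(np.linalg.norm(p1 - p2), "metres (flat-ground approximation)")
```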
So a couple of approaches suggest themselves: either write the code to do the above yourself directly, or, probably more simply, use OpenGL (I'd recommend the GLUT toolkit for this). If your math is good and manipulating matrices causes you no issue, I'd recommend the former, as the solution will be tighter and it's interesting stuff; otherwise take the OpenGL approach. You'd probably want to turn the camera/plane approximation into camera/sphere fairly early on, too.
If this isn't sufficient for your needs, then in theory going to actual landscape modelling would be feasible. The SRTM data is freely available (albeit not in the friendliest of forms), so combined with your GPS position it should be possible to create a mesh model to which you can apply the same algorithms as above.
