As stated in this question, it is necessary to have an organized pointcloud for most functions, I was wondering how to convert my uorganized pointcloud to organized? Something like "So I organized it using spherical projection" is mentioned in the original question.
My final pointcloud which needs to be organized is the result of concatenation after transformation of two individual pointclouds from two real sense D415 cameras.
Related
I have a non-planar object with 9 points with known dimensions in 3D i.e. length of all sides is known. Now given a 2D projection of this shape, I want to reconstruct the 3D model of it. I basically want to retrieve the shape of this object in the real world i.e. angles between different sides in 3D. For eg: given all the dimensions of every part of the table and a 2D image, I'm trying to reconstruct its 3D model.
I've read about homography, perspective transform, procrustes and fundamental/essential matrix so far but haven't found a solution that'll apply here. I'm new to this, so might have missed out something. Any direction on this will be really helpful.
In your question, you mention that you want to achieve this using only a single view of the object. In that case, homographies or Essential/Fundamental matrices wont help you, because these require at least two views of the scene to make sense. If you don't have any priors on the shape of the objects that you want to reconstruct, the key information that you'll be missing is (relative) depth, and in that case I think those are the two possible solutions:
Leverage a learning algorithm. There is a rich literature on 6dof object pose estimation with deep networks, see this paper for example. You wont have to deal with depth directly if you use those since those networks are trained end to end to estimate a pose in SO(3).
Add many more images and use a dense photometric SLAM/SFM pipeline, such as elastic fusion. However, in that case you will need to segment the resulting models since the estimation they produce is of the entire environment, which can be difficult depending on the scene.
However, as you mentioned in your comment, it is possible to reconstruct the model up to scale if you have very strong priors on its geometry. In the case of a planar object (a cuboid will just be an extension of that), you can use this simple algorithm (that is more or less what they do here, there are other methods but I find them a bit messy, equation-wise):
//let's note A,B,C,D the rectangle in 3d that we are after, such that
//AB is parellel with CD. Let's also note a,b,c,d their respective
//reprojections in the image, i.e. a=KA where K is the calibration matrix, and so on.
1) Compute the common vanishing point of AB and CD. This is just the intersection
of ab and cd in the image plane. Let's call it v_1.
2) Do the same for the two other edges, i.e bc and da. Let's call this
vanishing point v_2.
3) Now, you can compute the vanishing line, which will just be
crossproduct(v_1, v_2), i.e. the line going through both v_1 and v_2. This gives
you the orientation of your plane. Let's write its normal N.
5) All you need to find now is the boundaries of the rectangle. To do
that, just consider any plane with normal N that doesn't go through
the camera center. Now find the intersections of K^{-1}a, K^{-1}b,
K^{-1}c, K^{-1}d with that plane.
If you need a refresher on vanishing points and lines, I suggest you take a look at pages 213 and 216 of Hartley-Zisserman's book.
I am currently looking for a way to fit a simple shape (e.g. a T or an L shape) to a 2D point cloud. What I need as a result is the position and orientation of the shape.
I have been looking at a couple of approaches but most seem very complicated and involve building and learning a sample database first. As I am dealing with very simple shapes I was hoping that there might be a simpler approach.
By saying you don't want to do any training I am guessing that you mean you don't want to do any feature matching; feature matching is used to make good guesses about the pose (location and orientation) of the object in the image, and would be applicable along with RANSAC to your problem for guessing and verifying good hypotheses about object pose.
The simplest approach is template matching, but this may be too computationally complex (it depends on your use case). In template matching you simply loop over the possible locations of the object and its possible orientations and possible scales and check how well the template (a cloud that looks like an L or a T at that location and orientation and scale) matches (or you sample possible locations orientations and scales randomly). The checking of the template could be made fairly fast if your points are organised (or you organise them by e.g. converting them into pixels).
If this is too slow there are many methods for making template matching faster and I would recommend to you the Generalised Hough Transform.
Here, before starting the search for templates you loop over the boundary of the shape you are looking for (T or L) and for each point on its boundary you look at the gradient direction and then the angle at that point between the gradient direction and the origin of the object template, and the distance to the origin. You add that to a table (Let us call it Table A) for each boundary point and you end up with a table that maps from gradient direction to the set of possible locations of the origin of the object. Now you set up a 2D voting space, which is really just a 2D array (let us call it Table B) where each pixel contains a number representing the number of votes for the object in that location. Then for each point in the target image (point cloud) you check the gradient and find the set of possible object locations as found in Table A corresponding to that gradient, and then add one vote for all the corresponding object locations in Table B (the Hough space).
This is a very terse explanation but knowing to look for Template Matching and Generalised Hough transform you will be able to find better explanations on the web. E.g. Look at the Wikipedia pages for Template Matching and Hough Transform.
You may need to :
1- extract some features from the image inside which you are looking for the object.
2- extract another set of features in the image of the object
3- match the features (it is possible using methods like SIFT)
4- when you find a match apply RANSAC algorithm. it provides you with transformation matrix (including translation, rotation information).
for using SIFT start from here. it is actually one of the best source-codes written for SIFT. It includes RANSAC algorithm and you do not need to implement it by yourself.
you can read about RANSAC here.
Two common ways for detecting the shapes (L, T, ...) in your 2D pointcloud data would be using OpenCV or Point Cloud Library. I'll explain steps you may take for detecting those shapes in OpenCV. In order to do that, you can use the following 3 methods and the selection of the right method depends on the shape (Size, Area of the shape, ...):
Hough Line Transformation
Template Matching
Finding Contours
The first step would be converting your point to a grayscale Mat object, by doing that you basically make an image of your 2D pointcloud data and so you can use other OpenCV functions. Then you may smooth the image in order to reduce the noises and the result would be somehow a blurry image which contains real edges, if your application does not need real-time processing, you can use bilateralFilter. You can find more information about smoothing here.
The next step would be choosing the method. If the shape is just some sort of orthogonal lines (such as L or T) you can use Hough Line Transformation in order to detect the lines and after detection, you can loop over the lines and calculate the dot product of the lines (since they are orthogonal the result should be 0). You can find more information about Hough Line Transformation here.
Another way would be detecting your shape using Template Matching. Basically, you should make a template of your shape (L or T) and use it in matchTemplate function. You should consider that the size of the template you want to use should be in the order of your image, otherwise you may resize your image. More information about the algorithm can be found here.
If the shapes include areas you can find contours of the shape using findContours, it will give you the number of polygons which are around your shape you want to detect. For instance, if your shape is L, it would have polygon which has roughly 6 lines. Also, you can use some other filters along with findContours such as calculating the area of the shape.
Say I have an object, and I obtained RGB data and Depth data from a kinect on one angle. Then I moved the kinect slightly so that I can take a second picture of the object from a different angle. I'm trying to figure out the best way to determine how much the kinect camera has translated and rotated from its original position based on the correspondences from both images.
I'm using OpenCV for certain image processing tasks and I've looked into the SURF algorithm, which seems to find good correspondences from two 2D images. However it doesn't take into account the depth data, and I don't think it works very nice with multiple pictures. Additionally, I can't figure out how you can obtain the translation/rotation data from the correspondence.. Perhaps I'm looking in the wrong direction?
Note: My long-term goal is to "merge" the multiple images to form a 3D model from all the data. At the moment it somewhat works if I specify the angles that the kinect is located at, but I think it's much better to reduce the "errors" involved (i.e. points shifted slightly from where they should be) by finding correspondences instead of specifying parameters manually
I try to match two overlapping images captured with a camera. To do this, I'd like to use OpenCV. I already extracted the features with the SurfFeatureDetector. Now I try to to compute the rotation and translation vector between the two images.
As far as I know, I should use cvFindExtrinsicCameraParams2(). Unfortunately, this method require objectPoints as an argument. These objectPoints are the world coordinates of the extracted features. These are not known in the current context.
Can anybody give me a hint how to solve this problem?
The problem of simultaneously computing relative pose between two images and the unknown 3d world coordinates has been treated here:
Berthold K. P. Horn. Relative orientation revisited. Berthold K. P. Horn. Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 545 Technology ...
EDIT: here is a link to the paper:
http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.64.4700
Please see my answer to a related question where I propose a solution to this problem:
OpenCV extrinsic camera from feature points
EDIT: You may want to take a look at bundle adjustments too,
http://en.wikipedia.org/wiki/Bundle_adjustment
That assumes an initial estimate is available.
EDIT: I found some code resources you might want to take a look at:
Resource I:
http://www.maths.lth.se/vision/downloads/
Two View Geometry Estimation with Outliers
C++ code for finding the relative orientation of two calibrated
cameras in presence of outliers. The obtained solution is optimal in
the sense that the number of inliers is maximized.
Resource II:
http://lear.inrialpes.fr/people/triggs/src/ Relative orientation from
5 points: a somewhat more polished C routine implementing the minimal
solution for relative orientation of two calibrated cameras from
unknown 3D points. 5 points are required and there can be as many as
10 feasible solutions (but 2-5 is more common). Also requires a few
CLAPACK routines for linear algebra. There's also a short technical
report on this (included with the source).
Resource III:
http://www9.in.tum.de/praktika/ppbv.WS02/doc/html/reference/cpp/toc_tools_stereo.html
vector_to_rel_pose Compute the relative orientation between two
cameras given image point correspondences and known camera parameters
and reconstruct 3D space points.
There is a theoretical solution, however, the OpenCV implementation of camera pose estimation lacks the needed tools.
The theoretical approach:
Step 1: extract the homography (the matrix describing the geometrical transform between images). use findHomography()
Step 2. Decompose the result matrix into rotations and translations. Use cv::solvePnP();
Problem: findHomography() returns a 3x3 matrix, corresponding to a projection from a plane to another. solvePnP() needs a 3x4 matrix, representing the 3D rotation/translation of the objects. I think that with some approximations, you can modify the solvePnP to give you some results, but it requires a lot of math and a very good understanding of 3D geometry.
Read more about at http://en.wikipedia.org/wiki/Transformation_matrix
If I take a picture with a camera, so I know the distance from the camera to the object, such as a scale model of a house, I would like to turn this into a 3D model that I can maneuver around so I can comment on different parts of the house.
If I sit down and think about taking more than one picture, labeling direction, and distance, I should be able to figure out how to do this, but, I thought I would ask if someone has some paper that may help explain more.
What language you explain in doesn't matter, as I am looking for the best approach.
Right now I am considering showing the house, then the user can put in some assistance for height, such as distance from the camera to the top of that part of the model, and given enough of this it would be possible to start calculating heights for the rest, especially if there is a top-down image, then pictures from angles on the four sides, to calculate relative heights.
Then I expect that parts will also need to differ in color to help separate out the various parts of the model.
As mentioned, the problem is very hard and is often also referred to as multi-view object reconstruction. It is usually approached by solving the stereo-view reconstruction problem for each pair of consecutive images.
Performing stereo reconstruction requires that pairs of images are taken that have a good amount of visible overlap of physical points. You need to find corresponding points such that you can then use triangulation to find the 3D co-ordinates of the points.
Epipolar geometry
Stereo reconstruction is usually done by first calibrating your camera setup so you can rectify your images using the theory of epipolar geometry. This simplifies finding corresponding points as well as the final triangulation calculations.
If you have:
the intrinsic camera parameters (requiring camera calibration),
the camera's position and rotation (it's extrinsic parameters), and
8 or more physical points with matching known positions in two photos (when using the eight-point algorithm)
you can calculate the fundamental and essential matrices using only matrix theory and use these to rectify your images. This requires some theory about co-ordinate projections with homogeneous co-ordinates and also knowledge of the pinhole camera model and camera matrix.
If you want a method that doesn't need the camera parameters and works for unknown camera set-ups you should probably look into methods for uncalibrated stereo reconstruction.
Correspondence problem
Finding corresponding points is the tricky part that requires you to look for points of the same brightness or colour, or to use texture patterns or some other features to identify the same points in pairs of images. Techniques for this either work locally by looking for a best match in a small region around each point, or globally by considering the image as a whole.
If you already have the fundamental matrix, it will allow you to rectify the images such that corresponding points in two images will be constrained to a line (in theory). This helps you to use faster local techniques.
There is currently still no ideal technique to solve the correspondence problem, but possible approaches could fall in these categories:
Manual selection: have a person hand-select matching points.
Custom markers: place markers or use specific patterns/colours that you can easily identify.
Sum of squared differences: take a region around a point and find the closest whole matching region in the other image.
Graph cuts: a global optimisation technique based on optimisation using graph theory.
For specific implementations you can use Google Scholar to search through the current literature. Here is one highly cited paper comparing various techniques:
A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms.
Multi-view reconstruction
Once you have the corresponding points, you can then use epipolar geometry theory for the triangulation calculations to find the 3D co-ordinates of the points.
This whole stereo reconstruction would then be repeated for each pair of consecutive images (implying that you need an order to the images or at least knowledge of which images have many overlapping points). For each pair you would calculate a different fundamental matrix.
Of course, due to noise or inaccuracies at each of these steps you might want to consider how to solve the problem in a more global manner. For instance, if you have a series of images that are taken around an object and form a loop, this provides extra constraints that can be used to improve the accuracy of earlier steps using something like bundle adjustment.
As you can see, both stereo and multi-view reconstruction are far from solved problems and are still actively researched. The less you want to do in an automated manner the more well-defined the problem becomes, but even in these cases quite a bit of theory is required to get started.
Alternatives
If it's within the constraints of what you want to do, I would recommend considering dedicated hardware sensors (such as the XBox's Kinect) instead of only using normal cameras. These sensors use structured light, time-of-flight or some other range imaging technique to generate a depth image which they can also combine with colour data from their own cameras. They practically solve the single-view reconstruction problem for you and often include libraries and tools for stitching/combining multiple views.
Epipolar geometry references
My knowledge is actually quite thin on most of the theory, so the best I can do is to further provide you with some references that are hopefully useful (in order of relevance):
I found a PDF chapter on Multiple View Geometry that contains most of the critical theory. In fact the textbook Multiple View Geometry in Computer Vision should also be quite useful (sample chapters available here).
Here's a page describing a project on uncalibrated stereo reconstruction that seems to include some source code that could be useful. They find matching points in an automated manner using one of many feature detection techniques. If you want this part of the process to be automated as well, then SIFT feature detection is commonly considered to be an excellent non-real-time technique (since it's quite slow).
A paper about Scene Reconstruction from Multiple Uncalibrated Views.
A slideshow on Methods for 3D Reconstruction from Multiple Images (it has some more references below it's slides towards the end).
A paper comparing different multi-view stereo reconstruction algorithms can be found here. It limits itself to algorithms that "reconstruct dense object models from calibrated views".
Here's a paper that goes into lots of detail for the case that you have stereo cameras that take multiple images: Towards robust metric reconstruction
via a dynamic uncalibrated stereo head. They then find methods to self-calibrate the cameras.
I'm not sure how helpful all of this is, but hopefully it includes enough useful terminology and references to find further resources.
Research has made significant progress and these days it is possible to obtain pretty good-looking 3D shapes from 2D images. For instance, in our recent research work titled "Synthesizing 3D Shapes via Modeling Multi-View Depth Maps and Silhouettes With Deep Generative Networks" took a big step in solving the problem of obtaining 3D shapes from 2D images. In our work, we show that you can not only go from 2D to 3D directly and get a good, approximate 3D reconstruction but you can also learn a distribution of 3D shapes in an efficient manner and generate/synthesize 3D shapes. Below is an image of our work showing that we are able to do 3D reconstruction even from a single silhouette or depth map (on the left). The ground-truth 3D shapes are shown on the right.
The approach we took has some contributions related to cognitive science or the way the brain works: the model we built shares parameters for all shape categories instead of being specific to only one category. Also, it obtains consistent representations and takes the uncertainty of the input view into account when producing a 3D shape as output. Therefore, it is able to naturally give meaningful results even for very ambiguous inputs. If you look at the citation to our paper you can see even more progress just in terms of going from 2D images to 3D shapes.
This problem is known as Photogrammetry.
Google will supply you with endless references, just be aware that if you want to roll your own, it's a very hard problem.
Check out The Deadalus Project, althought that website does not contain a gallery with illustrative information about the solution, it post several papers and info about the working method.
I watched a lecture from one of the main researchers of the project (Roger Hubbold), and the image results are quite amazing! Althought is a complex and long problem. It has a lot of tricky details to take into account to get an approximation of the 3d data, take for example the 3d information from wall surfaces, for which the heuristic to work is as follows: Take a photo with normal illumination of the scene, and then retake the picture in same position with full flash active, then substract both images and divide the result by a pre-taken flash calibration image, apply a box filter to this new result and then post-process to estimate depth values, the whole process is explained in detail in this paper (which is also posted/referenced in the project website)
Google Sketchup (free) has a photo matching tool that allows you to take a photograph and match its perspective for easy modeling.
EDIT: It appears that you're interested in developing your own solution. I thought you were trying to obtain a 3D model of an image in a single instance. If this answer isn't helpful, I apologize.
Hope this helps if you are trying to construct 3d volume from 2d stack of images !! You can use open source tool such as ImageJ Fiji which comes with 3d viewer plugin..
https://quppler.com/creating-a-classifier-using-image-j-fiji-for-3d-volume-data-preparation-from-stack-of-images/