Computer Vision 3D transform from Images - image-processing

Here is a simplified (but gives the essence) of the problem. Suppose I have a box in space in some reference position/orientation and a calibrated camera C with known position/orientation. I take a picture of the box and can identify N feature points x_i on the projected image B.
Now suppose someone moves the box (rigid-body transform) a relatively small amount. I take a picture of the box and can again identify N feature points x*_i. I want to solve for the rigid-body transform T.
My strategy is to equivalently suppose the box did not move, and suppose I have another camera C* that is found by transforming camera C by the inverse of the transform T. So the N points x_i are the projected feature points on image B relative to camera C*.
So then I believe I can solve for the essential matrix E from the two sets of projected image points (provided I have enough--I think I need 8). (Since the cameras are calibrated I think I can just use essential matrix, not fundamental matrix?) From there I can use matrix decomposition to extract the rotation and translation transform that describes how the cameras differ. The inverse of that is the transform I want.
Does that sound like it will work? What happens if I can't find 8 feature points, but say only 3? Will I be able to get an estimate of the essential matrix or will it totally be wrong?

Yes, it will work and it is possible to solve even for some lost features as far as you are able to tell which is which. As far as I know you need at least 8 points like you said. What you described is how "Structure from motion" algorithms work. Please look it up on slide 3 in this lecture, and next at slide 20: The 8 point algorithm. It relates exactly to what you are talking about. If you realized all this without even knowing about structure form motion, then I am really impressed.
Here's the link to the lecture:
https://ags.cs.uni-kl.de/fileadmin/inf_ags/3dcv-ws14-15/3DCV_lec06_SFM1.pdf

Related

Reconstructing a non-planar polygon in 3D given a 2d projection and known polygon dimensions

I have a non-planar object with 9 points with known dimensions in 3D i.e. length of all sides is known. Now given a 2D projection of this shape, I want to reconstruct the 3D model of it. I basically want to retrieve the shape of this object in the real world i.e. angles between different sides in 3D. For eg: given all the dimensions of every part of the table and a 2D image, I'm trying to reconstruct its 3D model.
I've read about homography, perspective transform, procrustes and fundamental/essential matrix so far but haven't found a solution that'll apply here. I'm new to this, so might have missed out something. Any direction on this will be really helpful.
In your question, you mention that you want to achieve this using only a single view of the object. In that case, homographies or Essential/Fundamental matrices wont help you, because these require at least two views of the scene to make sense. If you don't have any priors on the shape of the objects that you want to reconstruct, the key information that you'll be missing is (relative) depth, and in that case I think those are the two possible solutions:
Leverage a learning algorithm. There is a rich literature on 6dof object pose estimation with deep networks, see this paper for example. You wont have to deal with depth directly if you use those since those networks are trained end to end to estimate a pose in SO(3).
Add many more images and use a dense photometric SLAM/SFM pipeline, such as elastic fusion. However, in that case you will need to segment the resulting models since the estimation they produce is of the entire environment, which can be difficult depending on the scene.
However, as you mentioned in your comment, it is possible to reconstruct the model up to scale if you have very strong priors on its geometry. In the case of a planar object (a cuboid will just be an extension of that), you can use this simple algorithm (that is more or less what they do here, there are other methods but I find them a bit messy, equation-wise):
//let's note A,B,C,D the rectangle in 3d that we are after, such that
//AB is parellel with CD. Let's also note a,b,c,d their respective
//reprojections in the image, i.e. a=KA where K is the calibration matrix, and so on.
1) Compute the common vanishing point of AB and CD. This is just the intersection
of ab and cd in the image plane. Let's call it v_1.
2) Do the same for the two other edges, i.e bc and da. Let's call this
vanishing point v_2.
3) Now, you can compute the vanishing line, which will just be
crossproduct(v_1, v_2), i.e. the line going through both v_1 and v_2. This gives
you the orientation of your plane. Let's write its normal N.
5) All you need to find now is the boundaries of the rectangle. To do
that, just consider any plane with normal N that doesn't go through
the camera center. Now find the intersections of K^{-1}a, K^{-1}b,
K^{-1}c, K^{-1}d with that plane.
If you need a refresher on vanishing points and lines, I suggest you take a look at pages 213 and 216 of Hartley-Zisserman's book.

Extrinsic Camera Calibration Using OpenCV's solvePnP Function

I'm currently working on an augmented reality application using a medical imaging program called 3DSlicer. My application runs as a module within the Slicer environment and is meant to provide the tools necessary to use an external tracking system to augment a camera feed displayed within Slicer.
Currently, everything is configured properly so that all that I have left to do is automate the calculation of the camera's extrinsic matrix, which I decided to do using OpenCV's solvePnP() function. Unfortunately this has been giving me some difficulty as I am not acquiring the correct results.
My tracking system is configured as follows:
The optical tracker is mounted in such a way that the entire scene can be viewed.
Tracked markers are rigidly attached to a pointer tool, the camera, and a model that we have acquired a virtual representation for.
The pointer tool's tip was registered using a pivot calibration. This means that any values recorded using the pointer indicate the position of the pointer's tip.
Both the model and the pointer have 3D virtual representations that augment a live video feed as seen below.
The pointer and camera (Referred to as C from hereon) markers each return a homogeneous transform that describes their position relative to the marker attached to the model (Referred to as M from hereon). The model's marker, being the origin, does not return any transformation.
I obtained two sets of points, one 2D and one 3D. The 2D points are the coordinates of a chessboard's corners in pixel coordinates while the 3D points are the corresponding world coordinates of those same corners relative to M. These were recorded using openCV's detectChessboardCorners() function for the 2 dimensional points and the pointer for the 3 dimensional. I then transformed the 3D points from M space to C space by multiplying them by C inverse. This was done as the solvePnP() function requires that 3D points be described relative to the world coordinate system of the camera, which in this case is C, not M.
Once all of this was done, I passed in the point sets into solvePnp(). The transformation I got was completely incorrect, though. I am honestly at a loss for what I did wrong. Adding to my confusion is the fact that OpenCV uses a different coordinate format from OpenGL, which is what 3DSlicer is based on. If anyone can provide some assistance in this matter I would be exceptionally grateful.
Also if anything is unclear, please don't hesitate to ask. This is a pretty big project so it was hard for me to distill everything to just the issue at hand. I'm wholly expecting that things might get a little confusing for anyone reading this.
Thank you!
UPDATE #1: It turns out I'm a giant idiot. I recorded colinear points only because I was too impatient to record the entire checkerboard. Of course this meant that there were nearly infinite solutions to the least squares regression as I only locked the solution to 2 dimensions! My values are much closer to my ground truth now, and in fact the rotational columns seem correct except that they're all completely out of order. I'm not sure what could cause that, but it seems that my rotation matrix was mirrored across the center column. In addition to that, my translation components are negative when they should be positive, although their magnitudes seem to be correct. So now I've basically got all the right values in all the wrong order.
Mirror/rotational ambiguity.
You basically need to reorient your coordinate frames by imposing the constraints that (1) the scene is in front of the camera and (2) the checkerboard axes are oriented as you expect them to be. This boils down to multiplying your calibrated transform for an appropriate ("hand-built") rotation and/or mirroring.
The basic problems is that the calibration target you are using - even when all the corners are seen, has at least a 180^ deg rotational ambiguity unless color information is used. If some corners are missed things can get even weirder.
You can often use prior info about the camera orientation w.r.t. the scene to resolve this kind of ambiguities, as I was suggesting above. However, in more dynamical situation, of if a further degree of automation is needed in situations in which the target may be only partially visible, you'd be much better off using a target in which each small chunk of corners can be individually identified. My favorite is Matsunaga and Kanatani's "2D barcode" one, which uses sequences of square lengths with unique crossratios. See the paper here.

How to determine distance of objects from camera using Epipolar Plane Image?

I am working on converting 2d images to 3d environment. The images were collected from a video made in a lateral motion. Then the images were placed one behind the other, so it would be easy to find the correspondences between the two images. This is called a spatial-temporal volume.
Next I take a slice from the spatiotemporal volume. That slice is called the Epipolar Plane Image.
Using the Epipolar Plane Image, I want to calculate the depth of the objects in the scene and make a 3D enviornment. I have listed the reference but I have not been able to figure out the math described in the paper. Can someone help me figure this out? Any help is appreciated.
Reference
Epipolar-Plane Image Analysis: An Approach to Determining Structure from Motion* !
The math in this situation is easy and straight forward.
First let's define two the coordinate systems for two overlapping images taken by the same camera with the focal length with the following schema:
Let us say that first camera position is defined as follows:
While it's orientation by using three Euler angles is:
By using this definition the corresponding rotation matrix is the identity matrix
The second camera position can be defined as follows:
And since the orientation is the same as the first camera, all Euler angles remain zero:
Which also means that the corresponding rotation matrix is the identity matrix.
If the images overlap and the orientation is the same, the situation in the image space looks like this:
Here the image coordinates and their measurement accuracy are defined as follows:
This geometrical situation can be described by using the Intercept Theorem:
As you see it's not complicated. But be aware that this solution is certainly not the best, since it's base assumption that all orientation angles are the same can't be fulfilled in reality.
If you need to be accurate then you have to perform an bundle adjustment. However, this equations are often used to determine the approximated solution for this geometric situation, where the values are used to linearize the collinearity equations.

Camera pose estimation

I'm currently working on a project that deals with the reconstruction based on a set of images, in a multi-view stereo approach. As such I need to know the several images pose in space. I find matching features using surf, and from the correspondences I find the essential matrix.
Now comes the problem: It is possible to decompose the essential matrix with SVD, but this can lead to 4 different results, as I read in a book. How can I obtain the correct one, assuming this is possible?
What other algorithms can I use for this?
Wikipedia says:
It turns out, however, that only one of the four classes of solutions
can be realized in practice. Given a pair of corresponding image
coordinates, three of the solutions will always produce a 3D point
which lies behind at least one of the two cameras and therefore cannot
be seen. Only one of the four classes will consistently produce 3D
points which are in front of both cameras. This must then be the
correct solution.
If you have the extrinsic calibration parameters for the camera in the first frame, or if you assume that it lies at a default calibration, say translation of (0,0,0) and rotation of (0,0,0), then you can determine which of the decompositions is the valid one.
Thanks to Zaphod answer I was able to solve my problem. Here's what I did:
First I calculated the Essential Matrix (E) from a set of point correspondences in both images.
Using SVD, decomposed it into 2 solutions. Using the negated Essential Matrix -E (which also satisfies the same constraints) I arrived at 2 more solutions for a total of 4 possible camera positions and orientations.
Then, for all solutions I triangulated the point correspondences and determined which intersected in front of both cameras, by taking the dot product of the point coordinate and each of the cameras viewing direction. I both are positive, then that intersection is in front of both cameras.
In the end the solution that delivers the most intersections in front of the cameras is the chosen one.

motion reconstruction from a single camera

I have a single calibrated camera (known intrinsic parameters, i.e. camera matrix K is known, as well as the distortion coefficients).
I would like to reconstruct the camera's 3d trajectory. There is no a-priori knowledge about the scene.
simplifying the problem by presenting two images that look on the same scene and extracting two set of corresponding matched feature points from them (SIFT, SURF, ORB, etc.)
My problem is how can I calculate the camera extrinsic parameters (i.e. the rotation matrix R and the translation vector t ) between the to viewpoints?
I have managed to calculate the fundamental matrix, and since K is know, the essential matrix as well. using David Nister's efficient solution to the Five-Point Relative Pose Problem I've managed to get 4 possible solution but:
the constraint on the essential matrix E ~ U * diag (s,s,0) * V' doesn't always apply - causing incorrect results.
[EDIT]: taking the average singular value seems to correct the results :) one down
how can I tell which one of the four is the correct one?
Thanks
Your solution to point 1 is correct: diag( (s1 + s2)/2, (s1 + s2)/2, 0).
As for telling which one of the four solutions is correct, only one will give positive depths for all points with respect to the camera frame. That's the one you want.
Code for checking which solution is correct can be found here: http://cs.gmu.edu/%7Ekosecka/examples-code/essentialDiscrete.m from http://cs.gmu.edu/%7Ekosecka/bookcode.html
They use the determinants of U and V to determine the solution with the correct orientation. Look for the comment "then four possibilities are". Since you're only estimating the essential matrix, it's susceptible to noise and does not behave well at all if all of the points are coplanar.
Also, the translation is only recovered to within a constant scaling factor, so the fact that you're seeing a normalized translation vector of unit magnitude is exactly correct. The reason is that the depth is unknown and estimated to be 1. You'll have to find some way to recover the depth as in the code for the eight-point algorithm + 3d reconstruction (Algorithm 5.1 in the bookcode link.)
The book the sample code above is taken from is also a very good reference. http://vision.ucla.edu/MASKS/ Chapter 5, the one you're interested in, is available on the Sample Chapters link.
Congrats on your hard work, sounds like you've tried hard to learn these techniques.
For actual production-strength code, I'd advise to download libmv and ceres, and re-code your solution using them.
Your two questions are really one: invalid solutions are rejected based on the data you have collected. In particular, Nister's (as well as Stewenius's) algorithm is normally used in the inner loop of a RANSAC-like solver, which selects for the solution with the best fit / max number of inliers.

Resources