When implementing monocular SLAM or Structure from Motion with a single camera, translation can only be estimated up to an unknown scale. It has been shown that this scale cannot be determined without external information. However, my question is:
How can this scale be made consistent across all the sub-translations? For example, suppose we have 3 frames (Frame0, Frame1 and Frame2) and we track as follows:
Frame0 -> Frame1 : R01, T01 (R and T can be extracted using the F matrix, the K matrix and Essential matrix decomposition)
Frame1 -> Frame2 : R12, T12
The problem is that T01 and T12 are normalized, so each has magnitude 1. In reality, however, the magnitude of T01 may be twice that of T12.
How can I recover the relative magnitude between T01 and T12?
P.S. I do not need to know the exact values of T01 or T12. I just want to know that |T01| = 2 * |T12|.
I think it is possible because monocular SLAM and SFM algorithms already exist and work well, so there should be some way to do this.
Compute R, t between frames 0 and 2 as well, and form a triangle whose three vertices are the three camera positions. The triangle can only close (up to a single global scale) when the relative lengths of the translations are correct, so the closure condition fixes the ratios between |T01|, |T12| and |T02|.
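A minimal sketch of the triangle-closure idea, assuming the three unit translation directions have already been rotated into one common frame (say Frame0) and each points from one camera centre to the next. The names `unit` and `relative_scales` and the least-squares formulation are illustrative choices, not taken from any particular SLAM library. Fixing |T01| = 1, the closure condition d01 + s12 * d12 = s02 * d02 is solved for s12 and s02:

```python
import math

def unit(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def relative_scales(d01, d12, d02):
    """Solve d01 + s12*d12 = s02*d02 for (s12, s02) in the least-squares sense.

    All directions are unit vectors in the same frame; |T01| is fixed to 1.
    """
    a = d12                      # coefficient of s12
    b = [-x for x in d02]        # coefficient of s02
    r = [-x for x in d01]        # right-hand side
    aa, ab, bb = dot(a, a), dot(a, b), dot(b, b)
    ar, br = dot(a, r), dot(b, r)
    det = aa * bb - ab * ab      # determinant of the 2x2 normal equations
    if abs(det) < 1e-12:
        raise ValueError("degenerate triangle: directions are (nearly) collinear")
    s12 = (ar * bb - ab * br) / det
    s02 = (aa * br - ab * ar) / det
    return s12, s02

# Hypothetical camera centres C0=(0,0,0), C1=(2,0,0), C2=(2,1,0):
d01 = unit([2.0, 0.0, 0.0])
d12 = unit([0.0, 1.0, 0.0])
d02 = unit([2.0, 1.0, 0.0])
s12, s02 = relative_scales(d01, d12, d02)   # s12 is |T12| / |T01|, about 0.5 here
```

When the motion is collinear (all three directions parallel) the triangle degenerates and the ratio cannot be recovered this way. A negative result suggests that the sign of one of the directions is flipped.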
I want to reopen a question similar to one that somebody posted a while ago, with one major difference.
The previous post is https://stackoverflow.com/questions/52536520/image-matching-using-intrinsic-and-extrinsic-camera-parameters
and my question is: can I do the matching if I also have the depth?
If it is possible, can someone describe the set of formulas I have to solve to get the desired matching?
Here there is also some correspondence on slide 16/43:
Depth from Stereo Lecture
In what units are all the variables here? Can someone clarify, please?
Will this formula help me calculate the desired point-to-point correspondence?
I know Z (in mm, cm, m, or whatever the unit is) and x_l (I guess this is the y coordinate of the pixel, so that both x_l and x_r lie on the same horizontal line; correct me if I'm wrong). I'm not sure whether T is in mm (or cm, m, i.e. a distance unit) and f is in pixels/mm (pixels per distance unit), or whether it is something else.
Thank you in advance.
EDIT:
So, as @fana said, the solution is indeed a projection.
As I understand it, the projection is P(v) = K (Rv + t), where R is the 3 x 3 rotation matrix (obtained, for example, from calibration), t is the 3 x 1 translation vector and K is the 3 x 3 intrinsics matrix.
From the following video:
It can be seen that there is translation in only one dimension (because the image planes are parallel, the translation takes place only along the X-axis), but in other situations, as far as I understand, if the cameras are not on the same parallel line there is also translation along the Y-axis. What is the translation along the Z-axis that I get from calibration? Is it some rescale factor, due to different image resolutions for example? Did I write the projection formula correctly for the general case?
I also want to ask about the whole idea.
Suppose I have 3 cameras: one with a large FOV that gives me color and depth for each pixel (a 3D tensor, color stacked with the corresponding depth), let's call it the first, and two with which I want to do stereo, let's call them the second and third.
Instead of calibrating the two stereo cameras against each other, my idea is to use the depth from the first camera to compute the xyz of pixel (u, v) of its corresponding color frame, which can be done easily, and then to project that point onto the second and third images using the R, t found by calibration between the first camera and each of the other two, together with the K intrinsics matrices. The projection matrix then seems to be fully known, am I right?
Assume that the FOV of the color camera is big enough to include everything that can be seen from the second and third cameras.
That way, by projecting each x, y, z from the first camera, I can find the corresponding pixels in the two other cameras. Is that correct?
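The idea described above can be sketched as follows. This is only an illustration under stated assumptions: the images are undistorted, the intrinsics have zero skew, the depth is the z-distance along the optical axis of the first camera, and R, t map first-camera coordinates into second-camera coordinates. The function names `backproject` and `project` are hypothetical. Under these assumptions t is an ordinary 3D displacement expressed in the same length unit as the depth, so its Z component is a physical offset along the optical axis rather than a rescale factor, while the entries of K are in pixels.

```python
def backproject(u, v, depth, K):
    """Pixel (u, v) with known depth -> 3D point in that camera's frame."""
    fx, fy = K[0][0], K[1][1]
    cx, cy = K[0][2], K[1][2]
    return [(u - cx) * depth / fx, (v - cy) * depth / fy, depth]

def project(point, K, R, t):
    """P(v) = K (R v + t), followed by division by the third component."""
    cam = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    if cam[2] <= 0:
        raise ValueError("point is behind the target camera")
    x = [sum(K[i][j] * cam[j] for j in range(3)) for i in range(3)]
    return x[0] / x[2], x[1] / x[2]

# Hypothetical numbers: both cameras share K, the second one is shifted along X.
K = [[500.0, 0.0, 320.0],
     [0.0, 500.0, 240.0],
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [-0.1, 0.0, 0.0]                          # same length unit as the depth

xyz = backproject(420.0, 240.0, 2.0, K)       # 3D point in the first camera
u2, v2 = project(xyz, K, R, t)                # matching pixel in the second camera
```

Points that are occluded from the second or third camera still project to some pixel, so a visibility check would be needed before trusting every correspondence.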
There are many posts about 3D reconstruction from stereo views of known internal calibration, some of which are excellent. I have read a lot of them, and based on what I have read I am trying to compute my own 3D scene reconstruction with the below pipeline / algorithm. I'll set out the method then ask specific questions at the bottom.
0. Calibrate your cameras:
This means retrieve the camera calibration matrices K1 and K2 for Camera 1 and Camera 2. These are 3x3 matrices encapsulating each camera's internal parameters: focal length, principal point offset / image centre. These don't change, you should only need to do this once, well, for each camera as long as you don't zoom or change the resolution you record in.
Do this offline. Do not argue.
I'm using OpenCV's calibrateCamera() and checkerboard routines, but this functionality is also included in the Matlab Camera Calibration toolbox. The OpenCV routines seem to work nicely.
1. Fundamental Matrix F:
With your cameras now set up as a stereo rig, determine the fundamental matrix (3x3) of that configuration using point correspondences between the two images/views.
How you obtain the correspondences is up to you and will depend a lot on the scene itself.
I am using OpenCV's findFundamentalMat() to get F, which provides a number of options method wise (8-point algorithm, RANSAC, LMEDS).
You can test the resulting matrix by plugging it into the defining equation of the Fundamental matrix: x'Fx = 0, where x' and x are the raw image point correspondences (x, y) in homogeneous coordinates (x, y, 1) and one of the three-vectors is transposed so that the multiplication makes sense. The nearer to zero the result is for each correspondence, the better F obeys its relation. This is equivalent to checking how well the derived F actually maps from one image plane to another. I get an average deflection of ~2px using the 8-point algorithm.
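A small sketch of this check, assuming the convention x'Fx = 0 with x in the first image and x' in the second. The function names are illustrative. Note that the raw residual x'Fx has no pixel units and changes with the overall scale of F, so the sketch also computes the distance from x' to the epipolar line Fx, which is a value in pixels:

```python
import math

def to_homogeneous(p):
    """(x, y) -> (x, y, 1)."""
    return [p[0], p[1], 1.0]

def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

def epipolar_residual(F, x, x_prime):
    """Algebraic residual x'^T F x; zero for a perfect correspondence."""
    line = mat_vec(F, to_homogeneous(x))
    return sum(a * b for a, b in zip(to_homogeneous(x_prime), line))

def epipolar_distance(F, x, x_prime):
    """Distance from x' to the epipolar line F x in the second image."""
    a, b, _ = mat_vec(F, to_homogeneous(x))
    return abs(epipolar_residual(F, x, x_prime)) / math.hypot(a, b)

# Hypothetical F for a pure sideways translation with identity intrinsics.
F = [[0.0, 0.0, 0.0],
     [0.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]
good = epipolar_residual(F, (0.2, 0.3), (0.5, 0.3))   # same row: satisfies the constraint
bad = epipolar_distance(F, (0.2, 0.3), (0.5, 0.4))    # off the epipolar line
```

Averaging `epipolar_distance` over all correspondences gives a figure that can be compared directly between the 8-point and RANSAC estimates.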
2. Essential Matrix E:
Compute the Essential matrix directly from F and the calibration matrices.
E = K2.t * F * K1
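As a quick illustration of this step, here is a dependency-free sketch, assuming the convention x'Fx = 0 with x taken from camera 1 (intrinsics K1) and x' from camera 2 (intrinsics K2). The helper names are made up for the example; with a matrix library this is a one-line product.

```python
def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def essential_from_fundamental(F, K1, K2):
    """E = K2^T F K1 (camera 2 on the left, camera 1 on the right)."""
    return matmul(matmul(transpose(K2), F), K1)

# Hypothetical values: identical cameras with focal length 2 and principal point (0, 0).
K = [[2.0, 0.0, 0.0],
     [0.0, 2.0, 0.0],
     [0.0, 0.0, 1.0]]
F = [[0.0, 0.0, 0.0],
     [0.0, 0.0, -0.5],
     [0.0, 0.5, 0.0]]
E = essential_from_fundamental(F, K, K)
```

Swapping K1 and K2 gives a different matrix whenever the two cameras differ, so the order is worth double-checking against the order used when estimating F.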
3. Internal Constraint upon E:
E should obey certain constraints. In particular, if it is decomposed by SVD into U * S * V.t, then its singular values should be (a, a, 0): the first two diagonal elements of S should be equal, and the third zero.
I was surprised to read here that if this is not true when you test for it, you might choose to fabricate a new Essential matrix from the prior decomposition like so: E_new = U * diag(1,1,0) * V.t which is of course guaranteed to obey the constraint. You have essentially set S = diag(1, 1, 0) artificially.
4. Full Camera Projection Matrices:
There are two camera projection matrices P1 and P2. These are 3x4 and obey the x = PX relation. Also, P = K[R|t] and therefore K_inv.P = [R|t] (where the camera calibration has been removed).
The first matrix P1 (excluding the calibration matrix K) can be set to [I|0]; then P2 (excluding K) is [R|t].
Compute the rotation and translation between the two cameras, R and t, from the decomposition of E. There are two possible ways to calculate R (U*W*V.t and U*W.t*V.t, where W = [0 -1 0; 1 0 0; 0 0 1]) and two ways to calculate t (±third column of U), which means that there are four combinations of R and t, only one of which is valid.
Compute all four combinations, and choose the one that geometrically corresponds to the situation where a reconstructed point is in front of both cameras. I actually do this by carrying through and calculating the resulting P2 = [R|t] and triangulating the 3d position of a few correspondences in normalised coordinates to ensure that they have a positive depth (z-coord).
5. Triangulate in 3D
Finally, combine the recovered 3x4 projection matrices with their respective calibration matrices: P'1 = K1 * P1 and P'2 = K2 * P2
And triangulate the 3-space coordinates of each 2d point correspondence accordingly, for which I am using the LinearLS method from here.
QUESTIONS:
Are there any howling omissions and/or errors in this method?
My F matrix is apparently accurate (0.22% deflection in the mapping compared to typical coordinate values), but when testing E against x'Ex = 0 using normalised image correspondences the typical error in that mapping is >100% of the normalised coordinates themselves. Is testing E against x'Ex = 0 valid, and if so where is that jump in error coming from?
The error in my fundamental matrix estimation is significantly worse when using RANSAC than the 8pt algorithm, ±50px in the mapping between x and x'. This deeply concerns me.
'Enforcing the internal constraint' still sits very weirdly with me - how can it be valid to just manufacture a new Essential matrix from part of the decomposition of the original?
Is there a more efficient way of determining which combo of R and t to use than calculating P and triangulating some of the normalised coordinates?
My final re-projection error is hundreds of pixels in 720p images. Am I likely looking at problems in the calibration, determination of P-matrices or the triangulation?
The error in my fundamental matrix estimation is significantly worse
when using RANSAC than the 8pt algorithm, ±50px in the mapping between
x and x'. This deeply concerns me.
Using the 8-point algorithm does not exclude using the RANSAC principle.
When using the 8-point algorithm directly, which points do you use? You have to choose 8 (good) points yourself.
In theory you can compute a fundamental matrix from any set of point correspondences, but you often get a degenerate fundamental matrix because the linear equations are not independent. Another point is that the 8-point algorithm solves an overdetermined system of linear equations in a least-squares sense, so a single outlier can ruin the fundamental matrix.
Have you tried to use the RANSAC result? I bet it represents one of the correct solutions for F.
My F matrix is apparently accurate (0.22% deflection in the mapping
compared to typical coordinate values), but when testing E against
x'Ex = 0 using normalised image correspondences the typical error in
that mapping is >100% of the normalised coordinates themselves. Is
testing E against x'Ex = 0 valid, and if so where is that jump in error
coming from?
Again, if F is degenerate, x'Fx = 0 can hold for every point correspondence.
Another reason for your incorrect E may be that the cameras are switched (K1^T * F * K2 instead of K2^T * F * K1). Remember to check: x'Ex = 0
'Enforcing the internal constraint' still sits very weirdly with me -
how can it be valid to just manufacture a new Essential matrix from
part of the decomposition of the original?
It is explained in 'Multiple View Geometry in Computer Vision' by Hartley and Zisserman. As far as I know, it has to do with minimizing the Frobenius norm of the difference between the original matrix and the corrected one.
You can Google it; PDF resources are available.
Is there a more efficient way of determining which combo of R and t to
use than calculating P and triangulating some of the normalised
coordinates?
No, as far as I know.
My final re-projection error is hundreds of pixels in 720p images. Am
I likely looking at problems in the calibration, determination of
P-matrices or the triangulation?
Your rigid body transformation P2 is incorrect because E is incorrect.
Assuming a static scene with a single camera moving exactly sideways by a small distance, there are two frames and the following computed optic flow (I use OpenCV's calcOpticalFlowFarneback):
Here scatter points are detected features, which are painted in pseudocolor with depth values (red is little depth, close to the camera, blue is more distant). Now, I obtain those depth values by simply inverting optic flow magnitude, like d = 1 / flow. Seems kinda intuitive, in a motion-parallax-way - the brighter the object, the closer it is to the observer. So there's a cube, exposing a frontal edge and a bit of a side edge to the camera.
But then I'm trying to project those feature points from the camera plane to real-life coordinates to make a kind of top-view map, where X = (x * d) / f and Y = d (d is depth, x is the pixel coordinate, f is the focal length, and X and Y are real-life coordinates). And here's what I get:
Well, doesn't look cubic to me. Looks like the picture is skewed to the right. I've spent some time thinking about why, and it seems that 1 / flow is not an accurate depth metric. Playing with different values, say, if I use 1 / power(flow, 1 / 3), I get a better picture:
But, of course, the power of 1 / 3 is just a magic number out of my head. The question is, what is the relationship between optic flow and depth in general, and how am I supposed to estimate it for a given scene? We're just considering camera translation here. I've stumbled upon some papers, but have had no luck finding a general equation yet. Some, like that one, propose a variation of 1 / flow, which isn't going to work, I guess.
Update
What bothers me a little is that simple geometry points me to the 1 / flow answer too. Like, optic flow is the same (in my case) as disparity, right? Then using this formula I get d = Bf / (x2 - x1), where B is the distance between the two camera positions, f is the focal length, and x2 - x1 is precisely the optic flow. The focal length is a constant, and B is constant for any two given frames, so that leaves me with 1 / flow again, multiplied by a constant. Do I misunderstand something about what optic flow is?
For a static scene, moving a camera precisely sideways by a known amount is exactly the same as a stereo camera setup. From this you can indeed estimate depth, if your system is calibrated.
Note that calibration in this sense is rather broad. To get truly accurate depth, you will in the end need to supply a scale parameter on top of the regular calibration that OpenCV provides, or else a single uniform scale ambiguity remains in the 3D reconstruction (this last step is often called going to the "metric" reconstruction from only the "Euclidean").
Another thing that is a part of calibration in the broad sense is lens distortion compensation. Before anything else, you probably want to force your cameras to behave like pinhole cameras (which real-world cameras usually don't).
With that said, optical flow is definitely very different from a metric depth map. Even if you properly calibrate and rectify your system first, optical flow is still not equivalent to disparity estimation. If your system is rectified, there is no point in doing a full optical flow estimation (such as Farnebäck), because the problem is then constrained to the horizontal lines of the image. Doing a full optical flow estimation (with 2 d.o.f.) after rectification will likely introduce more error.
A great reference for all this stuff is the classic "Multiple View Geometry in Computer Vision"
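For the rectified, purely sideways case discussed above, the relation d = Bf / (x2 - x1) from the question can be sketched as below. The function names and the numbers are illustrative assumptions. The focal length and the disparity are both in pixels, so the depth comes out in whatever length unit the baseline B uses. One thing worth checking in the top-view map is that x is measured from the principal point (`cx` in the sketch) rather than from the image corner, since an offset in x shifts X in proportion to depth.

```python
def depth_from_disparity(x_left, x_right, focal_px, baseline):
    """Z = f * B / (x_left - x_right); f and disparity in pixels, Z in units of B."""
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline / disparity

def top_view(x_px, depth, focal_px, cx):
    """Top-view map coordinates (X, Y): X = (x - cx) * d / f, Y = d."""
    return ((x_px - cx) * depth / focal_px, depth)

# Hypothetical values: f = 800 px, baseline 0.1 m, principal point at cx = 320 px.
z = depth_from_disparity(420.0, 400.0, 800.0, 0.1)   # 20 px of disparity
X, Y = top_view(420.0, z, 800.0, 320.0)
```

If the baseline is unknown, setting it to 1 still gives depths that are correct relative to each other, which is the single uniform scale ambiguity mentioned above.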
I am working on a project to detect the 3D location of the object. I have two cameras set up at two corners of the room and I have obtained the Fundamental matrix between them. These cameras are internally calibrated. My images are 2592 X 1944
K = [1228 0 3267
0 1221 538
0 0 1 ]
F = [-1.098e-7 3.50715e-7 -0.000313
2.312e-7 2.72256e-7 4.629e-5
0.000234 -0.00129250 1 ]
Now, how do I proceed so that, given a 3D point in space, I can get the points in the images which correspond to the same object in the room? If I can obtain the right projection matrices (with the correct scale), I can use them later as inputs to OpenCV's triangulatePoints function to obtain the location of the object.
I have been stuck on this for a long time. So, please help me.
Thanks.
From what I gather, you have obtained the Fundamental matrix through some means of calibration? Either way, with the fundamental matrix (or the calibration rig itself) you can obtain the pose difference via decomposition of the Essential matrix. Once you have that, you can use matched feature points (using a feature extractor and descriptor like SURF, BRISK, ...) to identify which points in one image belong to the same object point as another feature point in the other image.
With that information, you should be able to triangulate away.
Sorry, this does not fit within the size of a comment,
so @user2167617, this is a reply to your comment.
Pretty much. A few pointers, though: the singular values should be (s,s,0), so (1.3, 1.05, 0) is a pretty good guess. About R: technically this is right, ignoring signs, however. It might very well be that you get a rotation matrix which does not satisfy the constraint determinant(R) = 1 but has determinant -1 instead. You might want to multiply it by -1 in that case. Generally, if you run into problems with this approach, try to determine the Essential matrix using the 5-point algorithm (implemented in the very newest version of OpenCV; you will have to build it yourself). The scale is indeed impossible to obtain from this information. However, everything is consistent up to that scale. If you define, for example, the distance between the cameras to be 1 unit, then everything will be measured in that unit.
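The determinant check mentioned above can be sketched like this. It is only an illustration with hypothetical function names. It relies on the fact that negating a 3x3 matrix flips the sign of its determinant, and that E is only defined up to sign.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def ensure_proper_rotation(R):
    """Return R if det(R) > 0, otherwise -R (whose determinant is +1 when det(R) = -1)."""
    if det3(R) < 0:
        return [[-value for value in row] for row in R]
    return R

# Hypothetical candidate from an E decomposition: a reflection, det = -1.
candidate = [[1.0, 0.0, 0.0],
             [0.0, 1.0, 0.0],
             [0.0, 0.0, -1.0]]
R = ensure_proper_rotation(candidate)
```

The same sign check has to be applied to each of the candidate rotations before the in-front-of-both-cameras test is used to pick one.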
Maybe it would be simpler to use the cv::reprojectImageTo3D function? It will give you 3D coordinates.
I tried determining camera motion from the fundamental matrix using OpenCV. I'm currently using optical flow to track the movement of points in every other frame. The essential matrix is derived from the fundamental matrix and the camera matrix. My algorithm is as follows:
1 . Use the goodFeaturesToTrack function to detect feature points in a frame.
2 . Track the points over the next two or three frames (LK optical flow), and calculate the translation and rotation vectors from the corresponding points.
3 . Refresh the points after two or three frames (using goodFeaturesToTrack) and once again find the translation and rotation vectors.
I understand that I cannot simply add the translation vectors to find the total movement from the beginning, because the axes keep changing when I refresh the points and start tracking all over again. Can anyone please suggest how to calculate the accumulated movement from the origin?
What you are asking about is a typical visual odometry problem. Concatenate the transformation matrices (elements of the Lie group SE(3)).
You just multiply T_1, T_2, T_3 until you get T_1to3.
You can try with this code https://github.com/avisingh599/mono-vo/blob/master/src/visodo.cpp
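for (int numFrame = 2; numFrame < MAX_FRAME; numFrame++) {
    // ... track features and estimate R, t for the current frame pair ...
    // Accumulate only if the scale is meaningful and the motion is mostly forward (z).
    if ((scale > 0.1) && (t.at<double>(2) > t.at<double>(0)) && (t.at<double>(2) > t.at<double>(1))) {
        t_f = t_f + scale * (R_f * t);  // accumulated translation
        R_f = R * R_f;                  // accumulated rotation
    }
}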
It's a simple math concept. If you find it difficult, look at forward kinematics in robotics for an easier explanation. Only the concatenation part, not the DH algorithm.
https://en.wikipedia.org/wiki/Forward_kinematics
Write each of your relative camera poses as a 4x4 transformation matrix [R T; 0 0 0 1] and then multiply the matrices one after another. For example:
Frame 1 location with respect to origin coordinate system = [R1 T1]
Frame 2 location with respect to Frame 1 coordinate system = [R2 T2]
Frame 3 location with respect to Frame 2 coordinate system = [R3 T3]
Frame 3 location with respect to origin coordinate system =
[R1 T1] * [R2 T2] * [R3 T3]
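The chaining above can be sketched in a few lines. This is a minimal illustration with made-up helper names (`make_pose`, `chain`), assuming each [R T] is the pose of a frame expressed in the previous frame's coordinates and that the translations already share a consistent scale, which for a monocular camera has to come from elsewhere.

```python
def make_pose(R, t):
    """Build the 4x4 homogeneous transform [R t; 0 0 0 1]."""
    return [list(R[i]) + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def chain(*poses):
    """Multiply the relative poses left to right, starting from the identity."""
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for pose in poses:
        result = matmul(result, pose)
    return result

# Hypothetical motion: move 1 unit along x, turn 90 degrees about z, move 1 unit along x.
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
Rz90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
T1 = make_pose(I3, [1.0, 0.0, 0.0])
T2 = make_pose(Rz90, [0.0, 0.0, 0.0])
T3 = make_pose(I3, [1.0, 0.0, 0.0])

T_origin_to_3 = chain(T1, T2, T3)
position = [row[3] for row in T_origin_to_3[:3]]   # Frame 3 position in the origin frame
```

Because the second step rotates the local axes, the last unit step along the local x-axis ends up along the origin frame's y-axis, which is why the translations cannot simply be added.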