Calculating camera motion from corresponding 3D point sets - OpenCV

I am having a little problem. I wrote a program that extracts a set of three-dimensional points in each frame using a camera and depth information. The points are in the camera coordinate system, which means the origin is at the camera center, x is the horizontal distance, y the vertical distance and z the distance from the camera (along the optical axis). Everything is in meters. For example, the point (2,-1,5) would be two meters right, one meter below and five meters along the optical axis of the camera.
I calculate these points in every time frame and also know the correspondences, i.e. I know which point at time t-1 belongs to which 3D point at time t.
My goal now is to calculate the motion of the camera in each time frame in my world coordinate system (with z pointing up representing the height). I would like to calculate relative motion but also the absolute one starting from some start position to visualize the trajectory of the camera.
This is an example data set of one frame with the current (left) and the previous 3D location (right) of the points in camera coordinates:
-0.174004 0.242901 3.672510 | -0.089167 0.246231 3.646694
-0.265066 -0.079420 3.668801 | -0.182261 -0.075341 3.634996
0.092708 0.459499 3.673029 | 0.179553 0.459284 3.636645
0.593070 0.056592 3.542869 | 0.675082 0.051625 3.509424
0.676054 0.517077 3.585216 | 0.763378 0.511976 3.555986
0.555625 -0.350790 3.496224 | 0.633524 -0.354710 3.465260
1.189281 0.953641 3.556284 | 1.274754 0.938846 3.504309
0.489797 -0.933973 3.435228 | 0.561585 -0.935864 3.404614
Since I would like to work with OpenCV if possible, I found the estimateAffine3D() function in OpenCV 2.3, which takes two input vectors of 3D points and calculates the affine transformation between them using RANSAC.
As output I get a 3x4 transformation matrix.
I already tried to make the calculation more accurate by tuning the RANSAC parameters, but quite often the transformation matrix shows a fairly large translational movement. As you can see in the sample data, the actual movement is usually quite small.
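For reference, this is roughly how the call looks (a minimal sketch, not my full code; point values are taken from the sample data above, and in practice all correspondences are passed in):

#include <opencv2/core/core.hpp>
#include <opencv2/calib3d/calib3d.hpp>
#include <iostream>
#include <vector>

int main()
{
    // prev[i] corresponds to curr[i]; estimateAffine3D needs at least four
    // non-coplanar correspondences, and the full set would be used in practice.
    const float prevData[][3] = { {-0.089167f,  0.246231f, 3.646694f},
                                  {-0.182261f, -0.075341f, 3.634996f},
                                  { 0.179553f,  0.459284f, 3.636645f},
                                  { 0.675082f,  0.051625f, 3.509424f} };
    const float currData[][3] = { {-0.174004f,  0.242901f, 3.672510f},
                                  {-0.265066f, -0.079420f, 3.668801f},
                                  { 0.092708f,  0.459499f, 3.673029f},
                                  { 0.593070f,  0.056592f, 3.542869f} };

    std::vector<cv::Point3f> prev, curr;
    for (int i = 0; i < 4; ++i) {
        prev.push_back(cv::Point3f(prevData[i][0], prevData[i][1], prevData[i][2]));
        curr.push_back(cv::Point3f(currData[i][0], currData[i][1], currData[i][2]));
    }

    cv::Mat affine;                 // 3x4 output: [A | t], maps prev onto curr
    std::vector<uchar> inliers;     // per-point inlier mask from RANSAC
    // A tight ransacThreshold (here in meters) helps reject bad matches.
    int found = cv::estimateAffine3D(prev, curr, affine, inliers, 0.02, 0.99);
    if (found)
        std::cout << affine << std::endl;
    return 0;
}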
So I wanted to ask if anybody has another idea on what I could try? Does OpenCV offer other solutions for this?
Also, if I have the relative motion of the camera in each time frame, how would I convert it to world coordinates? And how would I then get the absolute position starting from (0,0,0), so that I have the camera position (and orientation) for each time frame?
Would be great if anybody could give me some advice!
Thank you!
UPDATE 1:
After @Michael Kupchick's nice answer I tried to check how well the estimateAffine3D() function in OpenCV works. So I created two small test sets of 6 point pairs that contain only a translation, no rotation, and looked at the resulting transformation matrices:
Test set 1:
1.5 2.1 6.7 | 0.5 1.1 5.7
6.7 4.5 12.4 | 5.7 3.5 11.4
3.5 3.2 1.2 | 2.5 2.2 0.2
-10.2 5.5 5.5 | -11.2 4.5 4.5
-7.2 -2.2 6.5 | -8.2 -3.2 5.5
-2.2 -7.3 19.2 | -3.2 -8.3 18.2
Transformation Matrix:
1 -1.0573e-16 -6.4096e-17 1
-1.3633e-16 1 2.59504e-16 1
3.20342e-09 1.14395e-09 1 1
Test set 2:
1.5 2.1 0 | 0.5 1.1 0
6.7 4.5 0 | 5.7 3.5 0
3.5 3.2 0 | 2.5 2.2 0
-10.2 5.5 0 | -11.2 4.5 0
-7.2 -2.2 0 | -8.2 -3.2 0
-2.2 -7.3 0 | -3.2 -8.3 0
Transformation Matrix:
1 4.4442e-17 0 1
-2.69695e-17 1 0 1
0 0 0 0
--> This gives me two transformation matrices that look right at first sight...
Assuming this is right, how would I reconstruct the trajectory when I have such a transformation matrix for each time step?
Does anybody have any tips or ideas why the results on my real data are that bad?

This problem has much more to do with 3D geometry than with image processing.
What you are trying to do is to register the known 3D points, and since the same 3D-points-to-camera relation holds for all frames, the transformations calculated from the registration will be the camera motion transformations.
In order to solve this you can use PCL. It is OpenCV's sister project for 3D-related tasks.
http://www.pointclouds.org/documentation/tutorials/template_alignment.php#template-alignment
This is a good tutorial on point cloud alignments.
Basically it goes like this:
For each pair of sequential frames the 3D point correspondences are known, so you can use the SVD method implemented in
http://docs.pointclouds.org/trunk/classpcl_1_1registration_1_1_transformation_estimation_s_v_d.html
You should have at least 3 corresponding points.
You can follow the tutorial or implement your own RANSAC algorithm.
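For illustration, a minimal PCL sketch of the SVD step (assuming two pcl::PointCloud<pcl::PointXYZ> clouds in which point i of the previous frame corresponds to point i of the current frame; the function name is made up):

#include <pcl/point_types.h>
#include <pcl/point_cloud.h>
#include <pcl/registration/transformation_estimation_svd.h>

Eigen::Matrix4f estimateRigidMotion(const pcl::PointCloud<pcl::PointXYZ> &prev,
                                    const pcl::PointCloud<pcl::PointXYZ> &curr)
{
    pcl::registration::TransformationEstimationSVD<pcl::PointXYZ, pcl::PointXYZ> svd;
    Eigen::Matrix4f transform = Eigen::Matrix4f::Identity();
    // Least-squares rigid transform that maps prev onto curr (index-matched clouds).
    svd.estimateRigidTransformation(prev, curr, transform);
    return transform;
}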
This will give you only a rough estimate of the transformation (which can be quite good if the noise is not too big). To get an accurate transformation you should apply the ICP algorithm, using the guess transformation calculated in the previous step.
ICP is described here:
http://www.pointclouds.org/documentation/tutorials/iterative_closest_point.php#iterative-closest-point
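A similarly hedged PCL sketch of the ICP refinement step (it assumes the full clouds of both frames and the rough SVD result as the initial guess; older PCL versions use setInputCloud instead of setInputSource):

#include <pcl/point_types.h>
#include <pcl/point_cloud.h>
#include <pcl/registration/icp.h>

Eigen::Matrix4f refineWithICP(pcl::PointCloud<pcl::PointXYZ>::ConstPtr prev,
                              pcl::PointCloud<pcl::PointXYZ>::ConstPtr curr,
                              const Eigen::Matrix4f &guess)
{
    pcl::IterativeClosestPoint<pcl::PointXYZ, pcl::PointXYZ> icp;
    icp.setInputSource(prev);   // cloud to be transformed
    icp.setInputTarget(curr);   // reference cloud
    pcl::PointCloud<pcl::PointXYZ> aligned;
    icp.align(aligned, guess);  // start from the rough SVD estimate
    return icp.getFinalTransformation();
}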
These two steps should give you an accurate estimation of the transformation between frames.
So you should do pairwise registration incrementally: register the first pair of frames to get the transformation from the first frame to the second (1->2), register the second with the third (2->3), then append the 1->2 transformation to the 2->3 one, and so on. This way you will get the transformations in a global coordinate system where the first frame is the origin.
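A small sketch of that accumulation, assuming the pairwise transforms are stored as Eigen::Matrix4f (as PCL returns them); whether you pre- or post-multiply depends on whether your pairwise transform maps frame k into frame k+1 or the other way around:

#include <vector>
#include <Eigen/Dense>
#include <Eigen/StdVector>

typedef std::vector<Eigen::Matrix4f, Eigen::aligned_allocator<Eigen::Matrix4f> > PoseVec;

// pairwise[k] is the registration result between frame k and frame k+1.
// Returns each frame's transform relative to frame 0 by chaining them.
PoseVec accumulatePoses(const PoseVec &pairwise)
{
    PoseVec global;
    Eigen::Matrix4f current = Eigen::Matrix4f::Identity();   // frame 0 is the origin
    global.push_back(current);
    for (size_t k = 0; k < pairwise.size(); ++k) {
        current = current * pairwise[k];    // append 1->2 to 0->1, then 2->3, ...
        global.push_back(current);
    }
    return global;
}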

Related

Image point to point matching using intrinsics, extrinsics and third-party depth

I want to reopen a question similar to one somebody posted a while ago, with one major difference.
The previous post is https://stackoverflow.com/questions/52536520/image-matching-using-intrinsic-and-extrinsic-camera-parameters
and my question is: can I do the matching if I do have the depth?
If it is possible, can someone describe the set of formulas which I have to solve to get the desired matching?
There is also a relevant correspondence formula on slide 16/43 here:
Depth from Stereo Lecture
What units are all the variables here in? Can someone clarify, please?
Will this formula help me calculate the desired point-to-point correspondence?
I know Z (in mm, cm, m, whatever distance unit it is) and x_l (I guess this is the y coordinate of the pixel, so both x_l and x_r are on the same horizontal line; correct me if I'm wrong). I'm not sure whether T is in mm (or cm, m, i.e. a distance unit) and whether f is in pixels/mm (per distance unit) or something else.
Thank you in advance.
EDIT:
So, as @fana said, the solution is indeed a projection.
To my understanding it is P(v) = K(Rv + t), where R is the 3x3 rotation matrix (obtained, for example, from calibration), t is the 3x1 translation vector and K is the 3x3 intrinsics matrix.
from the following video:
It can be seen that there is translation only in one dimension (because the situation is one where the images are parallel, so the translation happens only along the X-axis). But in other situations, as far as I understand, if the cameras are not on the same parallel line, there is also translation along the Y-axis. What is the translation along the Z-axis that I get through the calibration? Is it some rescale factor due to, for example, different image resolutions? Did I write the projection formula correctly for the general case?
I also want to ask about the whole idea.
Suppose I have 3 cameras: one with a large FOV which gives me color and depth for each pixel, let's call it the first (a 3D tensor, color stacked with the corresponding depth), and two with which I want to do stereo, let's call them the second and the third.
Instead of calibrating the two cameras, my idea is to use the depth from the first camera to calculate the xyz of pixel (u,v) in its corresponding color frame, which can be done easily, and then to project it onto the second and the third image using the R, t found by calibration between the first camera and the second and the third, and using the K intrinsics matrices, so the projection seems to be fully known. Am I right?
Assume for this case that the FOV of the color camera is big enough to include everything that can be seen from the second and the third cameras.
That way, by projecting each x,y,z of the first camera I can know where the corresponding pixels on the two other cameras are. Is that correct?
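For illustration, this is a small sketch of the projection chain I have in mind (K1, K2 are the intrinsics of the first and second camera, and R12, t12 the calibrated transform from the first camera to the second; all names are just placeholders):

#include <opencv2/core/core.hpp>

// Back-project pixel (u,v) of camera 1 using its depth, then project the
// resulting 3D point into camera 2. K1, K2: 3x3 intrinsics; R12, t12: rotation
// and translation taking camera-1 coordinates to camera-2 coordinates.
cv::Point2d projectToSecondCamera(double u, double v, double depth,
                                  const cv::Matx33d &K1, const cv::Matx33d &K2,
                                  const cv::Matx33d &R12, const cv::Vec3d &t12)
{
    // 1. Pixel -> 3D point in camera-1 coordinates (pinhole model, z = depth).
    double x = (u - K1(0, 2)) * depth / K1(0, 0);
    double y = (v - K1(1, 2)) * depth / K1(1, 1);
    cv::Vec3d P1(x, y, depth);

    // 2. Camera 1 -> camera 2: P2 = R12 * P1 + t12.
    cv::Vec3d P2 = R12 * P1 + t12;

    // 3. Project into camera 2's image plane: p = K2 * P2, then dehomogenize.
    cv::Vec3d p = K2 * P2;
    return cv::Point2d(p[0] / p[2], p[1] / p[2]);
}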

Stable Translation from essential decomposition

When implementing monocular SLAM or Structure from Motion using a single camera, translation can only be estimated up to an unknown scale. It is proven that without any other external information, this scale cannot be determined. However, my question is:
How can I unify this scale across all the sub-translations? For example, if we have 3 frames (Frame0, Frame1 & Frame2), we apply tracking as follows:
Frame0 -> Frame1: R01, T01 (R & T can be extracted using the F matrix, the K matrix and essential matrix decomposition)
Frame1 -> Frame2: R12, T12
The problem is that T01 & T12 are normalized, so their magnitudes are 1. However, in reality, the magnitude of T01 may be twice that of T12.
How can I recover the relative magnitude between T01 and T12?
P.S. I do not want to know what T01 or T12 exactly is. I just want to know that |T01| = 2 * |T12|.
I think it is possible, because monocular SLAM and SfM algorithms already exist and work well, so there should be some way to do this.
Calculate R, t between frames 2 & 0 as well and connect a triangle between the three vertices formed by the three frames. The only possible closed triangle (up to a single global scale) is formed when the relative translations are known up to scale.
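One way to turn that into numbers, as a sketch: under the convention p_j = R_ij * p_i + s_ij * t_ij (unit-norm t_ij from the essential matrix decompositions), composing 0 -> 1 -> 2 and comparing with the direct 0 -> 2 estimate gives s01 * (R12 * t01) + s12 * t12 - s02 * t02 = 0, a homogeneous 3x3 system whose null space holds the three scales up to one global factor. The helper below is hypothetical:

#include <opencv2/core/core.hpp>

// Recover the relative magnitudes of the unit translations t01, t12, t02 from
// the loop closure 0 -> 1 -> 2. Returns (s01, s12, s02) up to a common scale.
cv::Vec3d relativeScales(const cv::Matx33d &R12,
                         const cv::Vec3d &t01, const cv::Vec3d &t12, const cv::Vec3d &t02)
{
    cv::Vec3d a = R12 * t01;
    cv::Matx33d A(a[0], t12[0], -t02[0],
                  a[1], t12[1], -t02[1],
                  a[2], t12[2], -t02[2]);
    cv::Mat s;                          // unit-norm null-space vector of A
    cv::SVD::solveZ(cv::Mat(A), s);
    return cv::Vec3d(s.at<double>(0), s.at<double>(1), s.at<double>(2));
}

// Usage: cv::Vec3d s = relativeScales(R12, t01, t12, t02);
//        double ratio = s[0] / s[1];   // |T01| / |T12|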

Relative Camera Pose Estimation using OpenCV

I'm trying to estimate the relative camera pose using OpenCV. The cameras in my case are calibrated (I know the intrinsic parameters).
Given the images captured at two positions, I need to find the relative rotation and translation between the two cameras. The typical translation is about 5 to 15 meters and the yaw rotation between the cameras ranges between 0 and 20 degrees.
To achieve this, the following steps are adopted (a rough OpenCV sketch of these steps is given right after the list).
a. Finding point correspondences using SIFT/SURF
b. Fundamental matrix estimation
c. Estimation of the essential matrix via E = K'FK and enforcing the singularity constraint on E
d. Decomposition of the essential matrix to get the rotation, R = UWVt or R = UW'Vt (U and Vt are obtained from the SVD of E)
e. Obtaining the real rotation angles from the rotation matrix
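For reference, here is a rough sketch of how these steps can be written with OpenCV (a sketch, not my actual code; in OpenCV >= 3.0, findEssentialMat and recoverPose fold steps b-d together and resolve the decomposition ambiguity internally; pts1, pts2 and K stand for the matched points and the calibrated intrinsics):

#include <opencv2/core.hpp>
#include <opencv2/calib3d.hpp>
#include <vector>
#include <cmath>

// Returns the estimated yaw (degrees) between the two views; pts1/pts2 are the
// matched pixel coordinates (step a), K the calibrated 3x3 intrinsics.
double estimateYaw(const std::vector<cv::Point2f> &pts1,
                   const std::vector<cv::Point2f> &pts2,
                   const cv::Mat &K)
{
    // (b)+(c): essential matrix directly from the matches and K, with RANSAC.
    cv::Mat mask;
    cv::Mat E = cv::findEssentialMat(pts1, pts2, K, cv::RANSAC, 0.999, 1.0, mask);

    // (d): decompose E; recoverPose also resolves the four-fold ambiguity by
    // checking that triangulated points lie in front of both cameras.
    cv::Mat R, t;   // t comes back with unit norm (the scale is unobservable)
    cv::recoverPose(E, pts1, pts2, K, R, t, mask);

    // (e): for a pure rotation about the camera's y (vertical) axis this is the
    // yaw; a general Euler-angle extraction depends on the chosen convention.
    return std::atan2(R.at<double>(0, 2), R.at<double>(2, 2)) * 180.0 / CV_PI;
}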
Experiment 1: Real Data
For the real data experiment, I captured images by mounting a camera on a tripod. Images were captured at Position 1; then the camera was moved to another aligned position, the yaw angle was changed in steps of 5 degrees, and images were captured for Position 2.
Problems/Issues:
The sign of the estimated yaw angles does not always match the ground-truth yaw angles. Sometimes 5 deg is estimated as 5 deg, but 10 deg as -10 deg, and again 15 deg as 15 deg.
In the experiment only the yaw angle is changed; however, the estimated roll and pitch angles have nonzero values close to 180/-180 degrees.
Precision is very poor; in some cases the error between the estimated and ground-truth angles is around 2-5 degrees.
How to find out the scale factor to get the translation in real world measurement units?
The behavior is the same on simulated data as well.
Has anybody experienced similar problems? Do you have any clue how to resolve them?
Any help from anybody would be highly appreciated.
(I know there are already many posts on similar problems; going through all of them has not helped me. Hence posting one more time.)
In chapter 9.6 of Hartley and Zisserman, they point out that, for a particular essential matrix, if one camera is held in the canonical position/orientation, there are four possible solutions for the second camera matrix: [UWV' | u3], [UWV' | -u3], [UW'V' | u3], and [UW'V' | -u3].
The difference between the first and third (and second and fourth) solutions is that the orientation is rotated by 180 degrees about the line joining the two cameras, called a "twisted pair", which sounds like what you are describing.
The book says that in order to choose the correct combination of translation and orientation from the four options, you need to test a point in the scene and make sure that the point is in front of both cameras.
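For illustration, cv::recoverPose (OpenCV >= 3.0) performs this test internally; spelled out, a hedged sketch of the check could look like this (the helper name and the use of a single normalized point pair x1n, x2n are simplifications):

#include <opencv2/core.hpp>
#include <opencv2/calib3d.hpp>
#include <vector>

// Enumerate the four (R, t) candidates from E and keep the one that places a
// triangulated point in front of both cameras (the cheirality test).
// x1n, x2n are one matched pair in normalized (K^-1 * pixel) coordinates.
bool pickPoseByCheirality(const cv::Mat &E, const cv::Point2d &x1n, const cv::Point2d &x2n,
                          cv::Mat &Rout, cv::Mat &tout)
{
    cv::Mat R1, R2, t;
    cv::decomposeEssentialMat(E, R1, R2, t);      // four options: (R1, +-t), (R2, +-t)
    cv::Mat tneg = -t;

    cv::Mat P1 = cv::Mat::eye(3, 4, CV_64F);      // first camera: canonical [I | 0]
    const cv::Mat Rs[4] = { R1, R1, R2, R2 };
    const cv::Mat ts[4] = { t, tneg, t, tneg };

    std::vector<cv::Point2d> p1(1, x1n), p2(1, x2n);
    for (int i = 0; i < 4; ++i) {
        cv::Mat P2(3, 4, CV_64F);
        Rs[i].copyTo(P2(cv::Rect(0, 0, 3, 3)));
        ts[i].copyTo(P2(cv::Rect(3, 0, 1, 3)));

        cv::Mat X;                                // homogeneous 3D point, 4x1
        cv::triangulatePoints(P1, P2, p1, p2, X);
        double w = X.at<double>(3);
        cv::Mat Xe = X.rowRange(0, 3) / w;        // Euclidean point in camera-1 coords

        double z1 = Xe.at<double>(2);             // depth in camera 1
        cv::Mat Xc2 = Rs[i] * Xe + ts[i];         // same point in camera-2 coords
        double z2 = Xc2.at<double>(2);
        if (z1 > 0 && z2 > 0) { Rout = Rs[i]; tout = ts[i]; return true; }
    }
    return false;                                 // degenerate configuration
}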
For problems 1 and 2,
Look for "Euler angles" on Wikipedia or any good math site like Wolfram MathWorld. You will find the different possible Euler angle conventions. I am sure you can figure out, from the literature, why you are getting sign changes in your results.
For problem 3,
It should mostly have to do with the accuracy of your individual camera calibrations.
For problem 4,
Not sure. How about measuring the distance to a point from the camera with a tape measure and comparing it with the translation norm to get the scale factor?
Possible reasons for bad accuracy:
1) There is a difference between getting reasonable and precise accuracy in camera calibration. See this thread.
2) The accuracy with which you are moving the tripod. How are you ensuring that there is no rotation of the tripod around an axis perpendicular to the surface during the change in position?
I did not get your simulation concept, but I would suggest the test below.
Take images without moving the camera or the object. Now, if you calculate the relative camera pose, the rotation should be the identity matrix and the translation should be a null vector. Due to numerical inaccuracies and noise, you might see rotation deviations on the order of arc minutes.

Determining camera motion with fundamental matrix opencv

I tried determining camera motion from the fundamental matrix using OpenCV. I'm currently using optical flow to track the movement of points in every other frame. The essential matrix is derived from the fundamental matrix and the camera matrix. My algorithm is as follows:
1. Use the goodFeaturesToTrack function to detect feature points in the frame.
2. Track the points over the next two or three frames (LK optical flow), and calculate the translation and rotation vectors using the corresponding points.
3. Refresh the points after two or three frames (use goodFeaturesToTrack again). Once again find the translation and rotation vectors.
I understand that I cannot simply add the translation vectors to find the total movement from the beginning, as the axes keep changing when I refresh the points and start tracking all over again. Can anyone please suggest how to calculate the accumulated movement from the origin?
What you are asking about is a typical visual odometry problem: concatenate the SE(3) transformation matrices (elements of the Lie group of rigid motions).
You just multiply T_1 * T_2 * T_3 until you get T_1to3.
You can try with this code https://github.com/avisingh599/mono-vo/blob/master/src/visodo.cpp
for (int numFrame = 2; numFrame < MAX_FRAME; numFrame++)
    if ((scale > 0.1) && (t.at<double>(2) > t.at<double>(0)) && (t.at<double>(2) > t.at<double>(1))) {
        t_f = t_f + scale * (R_f * t);   // accumulate translation in the global frame
        R_f = R * R_f;                   // accumulate rotation
    }
It's a simple math concept. If you find it difficult, just look at robotics forward kinematics for an easier understanding - just the concatenation part, not the DH algorithm.
https://en.wikipedia.org/wiki/Forward_kinematics
Write each of your relative camera poses as a 4x4 transformation matrix and then multiply the matrices one after another. For example:
Frame 1 location with respect to origin coordinate system = [R1 T1]
Frame 2 location with respect to Frame 1 coordinate system = [R2 T2]
Frame 3 location with respect to Frame 2 coordinate system = [R3 T3]
Frame 3 location with respect to origin coordinate system =
[R1 T1] * [R2 T2] * [R3 T3]
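A minimal sketch of that composition with OpenCV types (assuming each R is a 3x3 CV_64F matrix and each T a 3x1 CV_64F vector; the helper name is made up):

#include <opencv2/core.hpp>

// Build the 4x4 homogeneous matrix [R T; 0 0 0 1] from a rotation and translation.
cv::Mat toHomogeneous(const cv::Mat &R, const cv::Mat &T)
{
    cv::Mat M = cv::Mat::eye(4, 4, CV_64F);
    R.copyTo(M(cv::Rect(0, 0, 3, 3)));
    T.copyTo(M(cv::Rect(3, 0, 1, 3)));
    return M;
}

// Frame 3 with respect to the origin coordinate system:
// cv::Mat M03 = toHomogeneous(R1, T1) * toHomogeneous(R2, T2) * toHomogeneous(R3, T3);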

To calculate world coordinates from screen coordinates with OpenCV

I have calculated the intrinsic and extrinsic parameters of the camera with OpenCV.
Now, I want to calculate world coordinates (x,y,z) from screen coordinates (u,v).
How do I do this?
N.B. as I use the kinect, I already know the z coordinate.
Any help is much appreciated. Thanks!
First, to understand how you calculate it, it would help if you read a bit about the pinhole camera model and simple perspective projection. For a quick glimpse, check this. I'll try to update with more.
So, let's start with the opposite, which describes how a camera works: project a 3D point in the world coordinate system to a 2D point in our image. According to the camera model:
P_screen = I * P_world
or (using homogeneous coordinates, with a scale factor w = z_world)
w * | x_screen |   =   I * | x_world |
    | y_screen |           | y_world |
    | 1        |           | z_world |
                           | 1       |
where
I = | f_x  0   c_x  0 |
    | 0   f_y  c_y  0 |
    | 0    0    1   0 |
is the 3x4 intrinsics matrix, f_x, f_y being the focal lengths (in pixels) and (c_x, c_y) the principal point.
If you solve the system above, you get:
x_screen = (x_world/z_world)*f_x + c_x
y_screen = (y_world/z_world)*f_y + c_y
But, you want to do the reverse, so your answer is:
x_world = (x_screen - c_x) * z_world / f_x
y_world = (y_screen - c_y) * z_world / f_y
z_world is the depth the Kinect returns to you, and you know f and c from your intrinsics calibration, so for every pixel you can apply the above to get the actual world coordinates.
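For illustration, a minimal sketch of applying this to every pixel (assuming the depth image is CV_32F and already in meters, and f_x, f_y, c_x, c_y come from the intrinsics calibration):

#include <opencv2/core/core.hpp>
#include <vector>

std::vector<cv::Point3f> backProject(const cv::Mat &depth,
                                     float fx, float fy, float cx, float cy)
{
    std::vector<cv::Point3f> points;
    for (int v = 0; v < depth.rows; ++v)
        for (int u = 0; u < depth.cols; ++u) {
            float z = depth.at<float>(v, u);
            if (z <= 0.0f) continue;            // no depth measured at this pixel
            float x = (u - cx) * z / fx;        // x_world = (x_screen - c_x) * z / f_x
            float y = (v - cy) * z / fy;        // y_world = (y_screen - c_y) * z / f_y
            points.push_back(cv::Point3f(x, y, z));
        }
    return points;
}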
Edit 1 (why the above correspond to world coordinates and what are the extrinsics we get during calibration):
First, check this one, it explains the various coordinates systems very well.
Your 3d coordinate systems are: Object ---> World ---> Camera. There is a transformation that takes you from object coordinate system to world and another one that takes you from world to camera (the extrinsics you refer to). Usually you assume that:
Either the Object system corresponds with the World system,
or, the Camera system corresponds with the World system
1. While capturing an object with the Kinect
When you use the Kinect to capture an object, what is returned to you from the sensor is the distance from the camera. That means that the z coordinate is already in camera coordinates. By converting x and y using the equations above, you get the point in camera coordinates.
Now, the world coordinate system is defined by you. One common approach is to assume that the camera is located at (0,0,0) of the world coordinate system. So, in that case, the extrinsics matrix actually corresponds to the identity matrix and the camera coordinates you found, correspond to world coordinates.
Sidenote: Because the Kinect returns the z in camera coordinates, there is also no need for a transformation from the object coordinate system to the world coordinate system. Let's say, for example, that you had a different camera that captured faces and for each point returned the distance from the nose (which you considered to be the center of the object coordinate system). In that case, since the returned values would be in the object coordinate system, we would indeed need a rotation and translation matrix to bring them into the camera coordinate system.
2. While calibrating the camera
I suppose you are calibrating the camera using OpenCV using a calibration board with various poses. The usual way is to assume that the board is actually stable and the camera is moving instead of the opposite (the transformation is the same in both cases). That means that now the world coordinate system corresponds to the object coordinate system. This way, for every frame, we find the checkerboard corners and assign them 3d coordinates, doing something like:
std::vector<cv::Point3f> objectCorners;
for (int i = 0; i < noOfCornersInHeight; i++)
{
    for (int j = 0; j < noOfCornersInWidth; j++)
    {
        objectCorners.push_back(cv::Point3f(float(i * squareSize), float(j * squareSize), 0.0f));
    }
}
where noOfCornersInWidth, noOfCornersInHeight and squareSize depend on your calibration board. If for example noOfCornersInWidth = 4, noOfCornersInHeight = 3 and squareSize = 100, we get the 3d points
(0 ,0,0) (0 ,100,0) (0 ,200,0) (0 ,300,0)
(100,0,0) (100,100,0) (100,200,0) (100,300,0)
(200,0,0) (200,100,0) (200,200,0) (200,300,0)
So, here our coordinates are actually in the object coordinate system. (We have arbitrarily assumed that the upper left corner of the board is (0,0,0) and that the coordinates of the rest of the corners follow from that one.) So here we do need the rotation and translation to take us from the object (world) system to the camera system. These are the extrinsics that OpenCV returns for each frame.
To sum up in the Kinect case:
The Camera and World coordinate systems are considered the same, so no need for extrinsics there.
No need for an Object to World (Camera) transformation, since the Kinect's return value is already in the Camera system.
Edit 2 (On the coordinate system used):
This is a convention and I think it depends also on which drivers you use and the kind of data you get back. Check for example that, that and that one.
Sidenote: It would help you a lot if you visualized a point cloud and played a little bit with it. You can save your points in a 3d object format (e.g. ply or obj) and then just import it into a program like Meshlab (very easy to use).
If you, for instance, use the Microsoft SDK: then Z is not the distance to the camera but the "planar" distance to the camera. This might change the appropriate formulas.
