I'm filming a scene with 6 RGB cameras that I want to reconstruct in 3D, kind of like in the following picture. I forgot to bring a calibration chessboard, so I used a blank rectangular board instead and filmed it as I would film a regular chessboard.
First step, calibration --> OK.
I obviously couldn't use cv2.findChessboardCorners, so I made a small program that would allow me to click and store the location of each of the 4 corners. I calibrated from these 4 points for about 10-15 frames as a test.
Tl;Dr: It seemed to work great.
Next step, triangulation. --> NOT OK
I use direct linear transform (DLT) to triangulate my points from all 6 cameras.
Tl;Dr: It's not working so well.
Image and world coordinates are connected by x = PX (P being each camera's 3x4 projection matrix), which can be written as a homogeneous system AQ = 0: each camera contributes the two rows u·p3 − p1 and v·p3 − p2, where (u, v) is the image point and p1, p2, p3 are the rows of that camera's P. A singular value decomposition (SVD) of A gives Q as the right singular vector associated with the smallest singular value.
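For reference, a minimal Python sketch of that DLT step (proj_mats and pts_2d are placeholder names for the six 3x4 projection matrices and the matching pixel coordinates of one point, not names from the actual code):

import numpy as np

def triangulate_dlt(proj_mats, pts_2d):
    # Build two rows of A per camera: u*p3 - p1 and v*p3 - p2.
    A = []
    for P, (u, v) in zip(proj_mats, pts_2d):
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    # The solution Q is the right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(np.asarray(A))
    Q = Vt[-1]
    return Q[:3] / Q[3]   # de-homogenize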
3 of the 4 points are correctly triangulated, but the blue one that should lie on the origin has a wrong x coordinate.
WHY?
Why only one point, and why only the x coordinate?
Does it have anything to do with the fact that I calibrate from a 4 points board?
If so, can you explain; and if not, what else could it be?
Update: I tried for another frame where the board is somewhere else, and the triangulation is fine.
So there is the mystery: some points are randomly triangulated wrong (or at least the one at the origin), while most of the others are fine. Again, why?
My guess is that it comes from the triangulation rather than from the calibration, and that there is no connection with my sloppy calibration process.
One common issue I came across is the ambiguity in the solutions found by DLT. Indeed, solving AQ = 0 or solving (AC)(C⁻¹Q) = 0 gives the same solution, so the result is only defined up to a transformation C. See page 46 here. But I don't know what to do about it.
I'm now fairly sure this is not a calibration issue but I don't want to delete this part of my post.
I used ret, K, D, R, T = cv2.calibrateCamera(objpoints, imgpoints, imSize, None, None). It worked seamlessly, and the points were perfectly reprojected onto my original image with cv2.projectPoints(objpoints, R, T, K, D).
I computed my projection matrix as P = K [R|T], with R, _ = cv2.Rodrigues(R) to turn the rotation vector into a rotation matrix.
How is it that I get a solution while I have only 4 points per image?
Wouldn't I need at least 6 of them? We have x = PX. We can solve for P by SVD under the form Ap = 0, where p stacks the 12 entries of P. That is 2 equations per point for 11 independent unknown parameters of P, so 4 points give 8 equations, which shouldn't be enough. And yet cv2.calibrateCamera still gives a solution. It must be using another method? I came across Perspective-n-Point (PnP): is that what OpenCV uses? In which case, is it directly optimizing K, R, and T and thus needs fewer points? I could artificially add a few points to get more than the 4 corner points of my board (for example, the centers of the edges, or the center of the rectangle). But is that really the issue?
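For reference, the two standard DLT equations contributed by each correspondence X_i ↔ (u_i, v_i) (X_i written as a homogeneous 4-vector, p as the 12-vector of P's entries) are:

\begin{pmatrix}
\mathbf{X}_i^\top & \mathbf{0}^\top & -u_i\,\mathbf{X}_i^\top \\
\mathbf{0}^\top & \mathbf{X}_i^\top & -v_i\,\mathbf{X}_i^\top
\end{pmatrix}\mathbf{p} = \mathbf{0}

With 4 points this is an 8x12 system for a matrix with 11 degrees of freedom, which is why a plain single-view DLT on 4 coplanar points is underdetermined.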
When calibrating, one needs to decompose the projection matrix into intrinsic and extrinsic matrices. But this decomposition is not unique and has 4 solutions. See the section 'I'm seeing double' there, and Chapter 21 of Hartley & Zisserman about cheirality, for more information. This is not my issue, since my camera points are correctly reprojected onto the image plane and my cameras are correctly set up in my 3D scene.
I did not quite understand what you are asking; it is rather vague. However, I think you are miscalculating your projection matrix.
If I'm not mistaken, you will surely define the 4 3D points representing your rectangle in real-world space, for example like this:
import numpy as np
import cv2

pt_3D = np.array([[0, 0, 0],
                  [0, 1, 0],
                  [1, 1, 0],
                  [1, 0, 0]], dtype=np.float32)
you will then retrieve the corresponding 2D points (in order) of each image, and generate two vectors as follows:
objpoints = [pt_3D, pt_3D, ....]                # N times
imgpoints = [pt_2D_img1, pt_2D_img2, ....]      # N times (N images)
You can then calibrate your camera and recover the camera poses as well as the projection matrices as follows:
ret, mtx, dist, rvecs, tvecs = cv2.calibrateCamera(objpoints, imgpoints, imSize, None, None)

for rvec, tvec in zip(rvecs, tvecs):
    imgpts, _ = cv2.projectPoints(pt_3D, rvec, tvec, mtx, dist)   # reprojection check for this view
    Rt, _ = cv2.Rodrigues(rvec)                                   # rotation, world -> camera
    R = Rt.T                                                      # rotation, camera -> world
    T = -Rt.T @ tvec                                              # camera centre in world coordinates
    pose_matrix = np.vstack((np.hstack((R, T)), [0, 0, 0, 1]))    # transformation matrix == camera pose
    projection_matrix = mtx @ np.linalg.inv(pose_matrix)[:3, :4]  # P = K [R|t]
You don't have to apply the DLT or the triangulation (all of this is done inside cv2.calibrateCamera()), and the 3D points remain what you defined yourself.
I've been working on a pose estimation project and one of the steps is finding the pose using the recoverPose function of OpenCV.
int cv::recoverPose(InputArray E,
InputArray points1,
InputArray points2,
InputArray cameraMatrix,
OutputArray R,
OutputArray t,
InputOutputArray mask = noArray()
)
I have all the required info: essential matrix E, key points in image 1 points1, corresponding key points in image 2 points2, and the cameraMatrix. However, the one thing that still confuses me a lot is the int value (i.e. the number of inliers) returned by the function. As per the documentation:
Recover relative camera rotation and translation from an estimated essential matrix and the corresponding points in two images, using cheirality check. Returns the number of inliers which pass the check.
However, I don't completely understand that yet. I'm concerned with this because, at some point, the yaw angle (calculated using the output rotation matrix R) suddenly jumps by more than 150 degrees. For that particular frame, the number of inliers is 0. So, as per the documentation, no points passed the cheirality check. But still, what does it mean exactly? Can that be the reason for the sudden jump in yaw angle? If yes, what are my options to avoid that? As the process is iterative, that one sudden jump affects all the further poses!
This function decomposes the Essential matrix E into R and t. However, you can get up to 4 solutions, i.e. pairs of R and t. Of these 4, only one is physically realizable, meaning that the other 3 project the 3D points behind one or both cameras.
The cheirality check is what you use to find that one physically realizable solution, and this is why you need to pass matching points into the function. It will use the matching 2D points to triangulate the corresponding 3D points using each of the 4 R and t pairs, and choose the one for which it gets the most 3D points in front of both cameras. This accounts for the possibility that some of the point matches can be wrong. The number of points that end up in front of both cameras is the number of inliers that the function returns.
So, if the number of inliers is 0, then something went very wrong. Either your E is wrong, or the point matches are wrong, or both. In this case you simply cannot estimate the camera motion from those two images.
There are several things you can check.
After you call findEssentialMat you get the inliers from the RANSAC used to find E. Make sure that you are passing only those inlier points into recoverPose. You don't want to pass in all the points that you passed into findEssentialMat.
Before you pass E into recoverPose, check whether it is of rank 2. If it is not, you can enforce the rank-2 constraint on E: take the SVD of E, set the smallest singular value to 0, and then reconstitute E.
After you get R and t from recoverPose, you can check that R is indeed a rotation matrix with determinant equal to 1. If the determinant is equal to -1, then R is a reflection, and things have gone wrong.
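A minimal Python sketch of those three checks (pts1, pts2 and K are placeholders for the matched Nx2 point arrays and the camera matrix, not taken from the question):

import cv2
import numpy as np

# assumed inputs: pts1, pts2 (Nx2 float32 matched points), K (3x3 camera matrix)
# 1. Estimate E and keep only the RANSAC inliers for recoverPose.
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
in1 = pts1[mask.ravel() == 1]
in2 = pts2[mask.ravel() == 1]

# 2. Enforce the rank-2 constraint on E if needed.
U, S, Vt = np.linalg.svd(E)
if S[2] > 1e-8:
    E = U @ np.diag([S[0], S[1], 0.0]) @ Vt

# 3. Recover the pose and check that R is a proper rotation (det = +1).
n_inliers, R, t, pose_mask = cv2.recoverPose(E, in1, in2, K)
assert np.isclose(np.linalg.det(R), 1.0), "R is a reflection, something went wrong"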
I want to reopen a similar question to one which somebody posted a while ago with some major difference.
The previous post is https://stackoverflow.com/questions/52536520/image-matching-using-intrinsic-and-extrinsic-camera-parameters
and my question is: can I do the matching if I do have the depth?
If it is possible, can someone describe a set of formulas which I would have to solve to get the desired matching?
There is also something about this correspondence on slide 16/43 here:
Depth from Stereo Lecture
In what units are all the variables here? Can someone clarify, please?
Will this formula help me to calculate the desired point-to-point correspondence?
I know Z (mm, cm, m, whatever unit it is) and x_l (I guess this is the y coordinate of the pixel, so both x_l and x_r are on the same horizontal line; correct me if I'm wrong). I'm not sure whether T is in mm (or cm, m, i.e. a distance unit) and whether f is in pixels/mm (pixels per distance unit), or something else?
Thank you in advance.
EDIT:
So, as #fana said, the solution is indeed a projection.
To my understanding it is P(v) = K (Rv + t), where R is the 3 x 3 rotation matrix (obtained for example from calibration), t is the 3 x 1 translation vector and K is the 3 x 3 intrinsics matrix.
From the following video, it can be seen that there is translation only in one dimension (because the images are parallel, so the translation takes place only along the X-axis). But in other situations, as far as I understand, if the cameras are not on the same parallel line, there is also translation along the Y-axis. What is the translation along the Z-axis that I get through the calibration? Is it some rescale factor due to, for example, different image resolutions? Did I write the projection formula correctly for the general case?
I also want to ask about the whole idea.
Suppose I have 3 cameras: one with a large FOV which gives me color and depth for each pixel (a 3D tensor, color stacked with the corresponding depth), let's call it the first; and two with which I want to do stereo, let's call them the second and the third.
Instead of calibrating the two cameras, my idea is to use the depth from the first camera to calculate the xyz of pixel (u, v) of its corresponding color frame, which can be done easily. I would then project it onto the second and the third images using the R, t found by calibration between the first camera and each of the second and the third, and using the K intrinsics matrices, so the projection matrices seem to be fully known. Am I right?
Assume for this case that the FOV of the color camera is big enough to include everything that can be seen from the second and the third cameras.
That way, by projecting each x, y, z from the first camera, I can know where the corresponding pixels are in the two other cameras. Is that correct?
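A minimal sketch of that idea, under the assumptions above (K1, K2 are intrinsics, R, t map camera-1 coordinates into camera-2 coordinates, and depth is in the same unit as t; the function names are placeholders):

import numpy as np

def backproject(u, v, depth, K1):
    # Pixel (u, v) of camera 1 with known depth -> 3D point in camera 1's frame.
    fx, fy, cx, cy = K1[0, 0], K1[1, 1], K1[0, 2], K1[1, 2]
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])

def project(X_cam1, K2, R, t):
    # 3D point in camera 1's frame -> pixel in camera 2, using X_cam2 = R @ X_cam1 + t.
    p = K2 @ (R @ X_cam1 + t)
    return p[:2] / p[2]

The same projection with the R, t of the third camera gives the pixel in the third image.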
I am currently using OpenCV for a pose estimation related work, in which I am triangulating points between pairs for reconstruction and scale factor estimation. I have encountered a strange issue while working on this, specifically with the OpenCV functions recoverPose() and triangulatePoints().
Say I have camera 1 and camera 2, spaced apart in X, with cam1 at (0,0,0) and cam2 to the right of it (positive X). I have two arrays points1 and points2 that are the matched features between the two images. According to the OpenCV documentation and code, I have noted two points:
recoverPose() assumes that points1 belong to the camera at (0,0,0).
triangulatePoints() is called twice: once from recoverPose() to tell us which of the four R/t combinations is valid, and then again from my code, and the documentation says:
cv::triangulatePoints(P1, P2, points1, points2, points3D) : points1 -> P1 and points2 -> P2.
Hence, as in the case of recoverPose(), it is safe to assume that P1 is [I|0] and P2 is [R|t].
What I actually found: It doesn't work that way. Although my camera1 is at 0,0,0 and camera2 is at 1,0,0 (1 being up to scale), the only correct configuration is obtained if I run
recoverPose(E, points2, points1...)
triangulatePoints([I|0], [R|t], points2, points1, pts3D)
which should be incorrect, because points2 is the set from R|t, not points1. I tested an image pair of a scene in my room where there are three noticeable objects after triangulation: a monitor and two posters on the wall behind it. Here are the point clouds resulting from the triangulation (excuse the MS Paint)
If I do it OpenCV's prescribed way: (poster points dispersed in space, weird-looking result)
If I do it my (wrong?) way:
Can anybody share their views about what's going on here? Technically, both solutions are valid because all points fall in front of both cameras: and I didn't know what to pick until I rendered it as a pointcloud. Am I doing something wrong, or is it an error in the documentation? I am not that knowledgeable about the theory of computer vision so it might be possible I am missing something fundamental here. Thanks for your time!
I ran into a similar issue. I believe OpenCV defines the translation vector in the opposite way to what one would expect. With your camera configuration, the translation vector would be [-1, 0, 0]. It's counterintuitive, but both recoverPose and stereoCalibrate give this translation vector.
I found that when I used the incorrect but intuitive translation vector (e.g. [1, 0, 0]), I couldn't get correct results unless I swapped points1 and points2, just as you did.
I suspect the translation vector actually translates the points into the other camera's coordinate system, rather than being the vector to translate the camera poses. The OpenCV documentation seems to imply that this is the case:
The joint rotation-translation matrix [R|t] is called a matrix of extrinsic parameters. It is used to describe the camera motion around a static scene, or vice versa, rigid motion of an object in front of a still camera. That is, [R|t] translates coordinates of a point (X,Y,Z) to a coordinate system, fixed with respect to the camera.
Wikipedia has a nice description on Translation of Axis:
In the new coordinate system, the point P will appear to have been translated in the opposite direction. For example, if the xy-system is translated a distance h to the right and a distance k upward, then P will appear to have been translated a distance h to the left and a distance k downward in the x'y'-system
Reason for this observation is discussed in this thread
Is the recoverPose() function in OpenCV is left-handed?
Translation t is the vector from cam2 to cam1 in cam2 frame.
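A small numeric sketch of that convention, with values matching the setup in the question (camera 2 at (1, 0, 0) relative to camera 1, up to scale):

import numpy as np

# R, t as recoverPose returns them: X_cam2 = R @ X_cam1 + t.
R = np.eye(3)
t = np.array([[-1.0], [0.0], [0.0]])   # what recoverPose reports for this rig (up to scale)

# The centre of camera 2 expressed in camera 1's frame:
C2 = -R.T @ t
print(C2.ravel())   # [1. 0. 0.]  -- the "intuitive" translation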
There are many posts about 3D reconstruction from stereo views of known internal calibration, some of which are excellent. I have read a lot of them, and based on what I have read I am trying to compute my own 3D scene reconstruction with the below pipeline / algorithm. I'll set out the method then ask specific questions at the bottom.
0. Calibrate your cameras:
This means retrieving the camera calibration matrices K1 and K2 for Camera 1 and Camera 2. These are 3x3 matrices encapsulating each camera's internal parameters: focal length and principal point offset / image centre. These don't change, so you should only need to do this once per camera, as long as you don't zoom or change the resolution you record in.
Do this offline. Do not argue.
I'm using OpenCV's CalibrateCamera() and checkerboard routines, but this functionality is also included in the Matlab Camera Calibration toolbox. The OpenCV routines seem to work nicely.
1. Fundamental Matrix F:
With your cameras now set up as a stereo rig, determine the fundamental matrix (3x3) of that configuration using point correspondences between the two images/views.
How you obtain the correspondences is up to you and will depend a lot on the scene itself.
I am using OpenCV's findFundamentalMat() to get F, which provides a number of options method wise (8-point algorithm, RANSAC, LMEDS).
You can test the resulting matrix by plugging it into the defining equation of the fundamental matrix: x'Fx = 0, where x' and x are the raw image point correspondences (x, y) in homogeneous coordinates (x, y, 1), and one of the three-vectors is transposed so that the multiplication makes sense. The nearer to zero for each correspondence, the better F is obeying its relation. This is equivalent to checking how well the derived F actually maps from one image plane to another. I get an average deflection of ~2px using the 8-point algorithm.
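A small sketch of that residual check, measured as point-to-epipolar-line distance in pixels (x1, x2 are placeholders for the Nx2 arrays of raw pixel correspondences):

import cv2
import numpy as np

# assumed inputs: x1, x2 are Nx2 float arrays of matched pixel coordinates
F, mask = cv2.findFundamentalMat(x1, x2, cv2.FM_8POINT)

x1h = np.hstack([x1, np.ones((len(x1), 1))])            # homogeneous (x, y, 1)
x2h = np.hstack([x2, np.ones((len(x2), 1))])
lines2 = (F @ x1h.T).T                                  # epipolar lines in image 2
num = np.abs(np.sum(x2h * lines2, axis=1))              # |x'^T F x| per correspondence
dist_px = num / np.linalg.norm(lines2[:, :2], axis=1)   # distance to the epipolar line, in pixels
print(dist_px.mean())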
2. Essential Matrix E:
Compute the Essential matrix directly from F and the calibration matrices.
E = K2ᵀ F K1
3. Internal Constraint upon E:
E should obey certain constraints. In particular, if decomposed by SVD into USV.t, then its singular values should be (a, a, 0): the first two diagonal elements of S should be equal, and the third zero.
I was surprised to read here that if this is not true when you test for it, you might choose to fabricate a new Essential matrix from the prior decomposition like so: E_new = U * diag(1,1,0) * V.t, which is of course guaranteed to obey the constraint. You have essentially set S = diag(1, 1, 0) artificially.
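A short sketch of steps 2 and 3 together (K1, K2 and F are assumed to be available from the previous steps):

import numpy as np

# Step 2: essential matrix from F and the two calibration matrices.
E = K2.T @ F @ K1

# Step 3: check the internal constraint and, if needed, enforce it.
U, S, Vt = np.linalg.svd(E)
print(S)                                    # should read roughly (a, a, 0)
E_new = U @ np.diag([1.0, 1.0, 0.0]) @ Vt   # the E_new = U * diag(1,1,0) * V.t fix from above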
4. Full Camera Projection Matrices:
There are two camera projection matrices P1 and P2. These are 3x4 and obey the x = PX relation. Also, P = K[R|t] and therefore K_inv.P = [R|t] (where the camera calibration has been removed).
The first matrix P1 (excluding the calibration matrix K) can be set to [I|0]; then P2 (excluding K) is [R|t].
Compute the rotation and translation between the two cameras, R and t, from the decomposition of E. There are two possible ways to calculate R (U*W*V.t and U*W.t*V.t) and two ways to calculate t (±the third column of U), which means that there are four combinations of R and t, only one of which is valid.
Compute all four combinations, and choose the one that geometrically corresponds to the situation where a reconstructed point is in front of both cameras. I actually do this by carrying through and calculating the resulting P2 = [R|t] and triangulating the 3D position of a few correspondences in normalised coordinates to ensure that they have a positive depth (z-coordinate).
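A sketch of that selection step (E and the normalised correspondences x1n, x2n, i.e. pixel points multiplied by the inverse calibration matrices, are assumed to be available; cv2.triangulatePoints is used here for the depth test, which need not be the same triangulation routine as in step 5):

import cv2
import numpy as np

def best_pose_from_E(E, x1n, x2n):
    # x1n, x2n: Nx2 float arrays of correspondences in normalised camera coordinates.
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0: U = -U            # keep proper rotations
    if np.linalg.det(Vt) < 0: Vt = -Vt
    W = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])           # [I|0]
    best = None
    for R in (U @ W @ Vt, U @ W.T @ Vt):
        for t in (U[:, 2:3], -U[:, 2:3]):
            P2 = np.hstack([R, t])                           # candidate [R|t]
            X = cv2.triangulatePoints(P1, P2, x1n.T, x2n.T)
            X = X[:3] / X[3]                                 # de-homogenise
            in_front = (X[2] > 0) & ((R @ X + t)[2] > 0)     # positive depth in both cameras
            if best is None or in_front.sum() > best[0]:
                best = (in_front.sum(), R, t)
    return best[1], best[2]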
5. Triangulate in 3D
Finally, combine the recovered 3x4 projection matrices with their respective calibration matrices: P'1 = K1P1 and P'2 = K2P2
And triangulate the 3-space coordinates of each 2d point correspondence accordingly, for which I am using the LinearLS method from here.
QUESTIONS:
Are there any howling omissions and/or errors in this method?
My F matrix is apparently accurate (0.22% deflection in the mapping compared to typical coordinate values), but when testing E against x'Ex = 0 using normalised image correspondences, the typical error in that mapping is >100% of the normalised coordinates themselves. Is testing E against x'Ex = 0 valid, and if so, where is that jump in error coming from?
The error in my fundamental matrix estimation is significantly worse when using RANSAC than the 8pt algorithm, ±50px in the mapping between x and x'. This deeply concerns me.
'Enforcing the internal constraint' still sits very weirdly with me - how can it be valid to just manufacture a new Essential matrix from part of the decomposition of the original?
Is there a more efficient way of determining which combo of R and t to use than calculating P and triangulating some of the normalised coordinates?
My final re-projection error is hundreds of pixels in 720p images. Am I likely looking at problems in the calibration, determination of P-matrices or the triangulation?
The error in my fundamental matrix estimation is significantly worse when using RANSAC than the 8pt algorithm, ±50px in the mapping between x and x'. This deeply concerns me.
Using the 8pt algorithm does not exclude using the RANSAC principle.
When using the 8pt algorithm directly, which points do you use? You have to choose 8 (good) points yourself.
In theory you can compute a fundamental matrix from any point correspondences, and you often get a degenerate fundamental matrix because the linear equations are not independent. Another point is that the 8pt algorithm uses an overdetermined system of linear equations, so a single outlier will destroy the fundamental matrix.
Have you tried to use the RANSAC result? I bet it represents one of the correct solutions for F.
My F matrix is apparently accurate (0.22% deflection in the mapping
compared to typical coordinate values), but when testing E against
x'Ex = 0 using normalised image correspondences the typical error in
that mapping is >100% of the normalised coordinates themselves. Is
testing E against x'Ex = 0 valid, and if so where is that jump in error
coming from?
Again, if F is degenerate, x'Fx = 0 can hold for every point correspondence.
Another reason for your incorrect E may be a switch of the cameras (K1ᵀ F K2 instead of K2ᵀ F K1). Remember to check: x'Ex = 0
'Enforcing the internal constraint' still sits very weirdly with me -
how can it be valid to just manufacture a new Essential matrix from
part of the decomposition of the original?
It is explained in 'Multiple View Geometry in Computer Vision' from Hartley and Zisserman. As far as I know it has to do with the minimization of the Frobenius norm of F.
You can Google it and there are pdf resources.
Is there a more efficient way of determining which combo of R and t to
use than calculating P and triangulating some of the normalised
coordinates?
No as far as I know.
My final re-projection error is hundreds of pixels in 720p images. Am
I likely looking at problems in the calibration, determination of
P-matrices or the triangulation?
Your rigid body transformation P2 is incorrect because E is incorrect.
I am trying to rectify two sequences of images for stereo matching. The usual approach of using stereoCalibrate() with a checkerboard pattern is not of use to me, since I am only working with the footage.
What I have is the correct calibration data of the individual cameras (camera matrix and distortion parameters) as well as measurements of their distance and angle between each other.
How can I construct the rotation matrix and translation vector needed for stereoRectify()?
The naive approach of using
Mat T = (Mat_<double>(3,1) << distance, 0, 0);
Mat R = (Mat_<double>(3,3) << cos(angle), 0, sin(angle), 0, 1, 0, -sin(angle), 0, cos(angle));
resulted in a heavily warped image. Do these matrices need to relate to a different origin point I am not aware of? Or do I need to convert the distance/angle values somehow so that they depend on the pixel size?
Any help would be appreciated.
It's not clear whether you have enough information about the camera poses to perform an accurate rectification.
Both T and R are measured in 3D, but in your case:
T is one-dimensional (along the x-axis only), which means that you are confident that the two cameras are perfectly aligned along the other axes (in particular, you have a less-than-1-pixel error on the y-axis, i.e. a few microns by today's standards);
R leaves the Y-coordinates untouched. Thus, all you have is a rotation around this axis; does that match your experimental setup?
Finally, you need to check that the units you are using for the translation and rotation are consistent with the units of the intrinsic data.
If it is feasible, you can check your results by finding some matching points between the two cameras and proceeding to a projective calibration: the accurate knowledge of the 3D position of the calibration points is required for metric reconstruction only. Other tasks rely on the essential or fundamental matrices, that can be computed from image-to-image point correspondences.
If intrinsics and extrinsics known, I recommend this method: http://link.springer.com/article/10.1007/s001380050120#page-1
It is easy to implement. Basically, you rotate the right camera until both cameras have the same orientation, meaning both share a common R. The epipoles are then mapped to infinity and you have epipolar lines parallel to the image x-axis.
The first row of the new R (x) is simply the baseline, i.e. the difference of the two camera centers. The second row (y) is the cross product of the baseline with the old left z-axis. The third row (z) equals the cross product of the first two rows.
Finally, you need to calculate the 3x3 homography described in the above link and use warpPerspective() to get a rectified version.
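A rough Python sketch of that construction (c1, c2 are the camera centres, R_old is the left camera's world-to-camera rotation with rows x, y, z, and K the intrinsics, all assumed known; depending on your axis conventions the sign of y may need flipping):

import cv2
import numpy as np

def rectifying_rotation(c1, c2, R_old):
    # New x-axis: the baseline, i.e. the (normalised) difference of the camera centres.
    x = (c2 - c1) / np.linalg.norm(c2 - c1)
    # New y-axis: cross product involving the old left z-axis and the baseline.
    y = np.cross(R_old[2, :], x)
    y /= np.linalg.norm(y)
    # New z-axis: cross product of the first two rows.
    z = np.cross(x, y)
    return np.vstack([x, y, z])   # common rotation shared by both rectified cameras

# The rectifying homography for the left image is then H = K @ R_rect @ R_old.T @ inv(K),
# applied with cv2.warpPerspective(img, H, (w, h)); same idea for the right image.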