Currently I have a dataset of images (a sequence of frames) and the intrinsic camera calibration matrix. Also, for each image I have the extrinsic parameters (rotation and translation).
I would like to know if it is possible to use those parameters to find the correct pixel correspondences between each pair of images.
I found the relationship between the translation (t), the rotation (R) and each pair of corresponding points in two different views.
I guess that, using the image above, it is only necessary to fix a point "x" (in homogeneous coordinates) and solve the system for "x'", but I do not know which operations are involved (the notation). If someone knows how to do it in MATLAB, I would appreciate the help.
Also, if there is another way to find the matching using the same information, I would appreciate any help.
Thanks
No, this information is not enough to find point correspondences between the frames. I will first explain what I think you can do with the given information, and then we'll see why it is impossible to get pixel-to-pixel matches from the essential matrix alone.
What you can do. For a point m, you can find the line on the other image where m' lies, by using the Fundamental matrix. Let's assume that the X and X' you give in your question are (respectively) projected to m and m', i.e.
//K denotes the intrinsics matrix
m=KX
m'=KX'
Starting with your equation, we have:
X^{T}EX' = 0  ==>  m^{T}K^{-T}EK^{-1}m' = 0
The matrix K^{-T}EK^{-1}, which we will denote F, is known as the fundamental matrix, and now you have a constraint between 2D points in the image plane:
m^{T}Fm' = 0
Note that m and m' are 3-vectors expressed in homogeneous coordinates. The interesting thing to notice here is that Fm' is the line in the first image on which m lies (since the constraint given above is nothing but the dot product between m and Fm'). Similarly, m^{T}F is the line in the other image on which m' is expected to lie. So, to find a match for m, you can search in a neighborhood of the line m^{T}F in the second image.
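For illustration, here is a minimal Python/NumPy sketch of that search criterion under the convention above (all numeric values, and the candidate pixel q, are invented placeholders):

import numpy as np

# Placeholder essential matrix and intrinsics; in practice these come from your data.
E = np.array([[ 0.0, -0.1,  0.2],
              [ 0.1,  0.0, -0.3],
              [-0.2,  0.3,  0.0]])
K = np.array([[700.0,   0.0, 320.0],
              [  0.0, 700.0, 240.0],
              [  0.0,   0.0,   1.0]])

K_inv = np.linalg.inv(K)
F = K_inv.T @ E @ K_inv            # F = K^{-T} E K^{-1}

m = np.array([350.0, 260.0, 1.0])  # a pixel in the first image, homogeneous

# Line in the second image on which the match m' must lie: the row vector m^T F,
# written here as the column l2 = F^T m, with l2 = (a, b, c) meaning a*x + b*y + c = 0.
l2 = F.T @ m
l2 /= np.hypot(l2[0], l2[1])       # normalize so that |a*x + b*y + c| is a pixel distance

q = np.array([400.0, 255.0, 1.0])  # some candidate match in the second image
distance_to_line = abs(l2 @ q)     # keep only candidates with a small distance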
Why you can't get pixel-to-pixel matching. Let's look at what the constraint x^{T}Ex' = 0 means from an intuitive point of view. Basically, what it says is that we expect x, x' and the baseline T to be coplanar. Assume that you fix x and look for points that satisfy x^{T}Ex' = 0. Then, while the x' in your figure satisfies this constraint, every point n (reprojected from y) such as the one in the figure below will also be a good candidate:
This indicates that the correct match depends on your estimate of the depth of x, which you don't have.
Related
On a very high level, my pose estimation pipeline looks somewhat like this (a rough code sketch follows the list):
Find features in image_1 and image_2 (let's say cv::ORB)
Match the features (let's say using the BruteForce-Hamming descriptor matcher)
Calculate Essential Matrix (using cv::findEssentialMat)
Decompose it to get the proper rotation matrix and translation unit vector (using cv::recoverPose)
Repeat
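For concreteness, a minimal sketch of such a pipeline using OpenCV's Python bindings might look like the following (file names, the intrinsics K and the detector/matcher settings are placeholders, not the asker's actual setup):

import cv2
import numpy as np

img1 = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)   # placeholder file names
img2 = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)
K = np.array([[700.0,   0.0, 320.0],                        # placeholder intrinsics
              [  0.0, 700.0, 240.0],
              [  0.0,   0.0,   1.0]])

# 1-2. Detect ORB features and match them with brute-force Hamming matching.
orb = cv2.ORB_create(2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)

pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# 3. Estimate the essential matrix with RANSAC.
E, inlier_mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)

# 4. Decompose E into R and the translation unit vector t (cheirality check inside).
n_inliers, R, t, pose_mask = cv2.recoverPose(E, pts1, pts2, K, mask=inlier_mask)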
I noticed that at some point, the yaw angle (calculated using the output rotation matrix R of cv::recoverPose) suddenly jumps by more than 150 degrees. For that particular frame, the number of inliers is 0 (the return value of cv::recoverPose). So, to understand what exactly that means and what's going on, I asked this question on SO.
As per the answer to my question:
So, if the number of inliers is 0, then something went very wrong. Either your E is wrong, or the point matches are wrong, or both. In this case you simply cannot estimate the camera motion from those two images.
For that particular image pair, based on the visualization and from my understanding, the matches look good:
The next step in the pipeline is finding the Essential Matrix. So, now, how can I check whether the calculated Essential Matrix is correct or not without decomposing it i.e. without calculating Roll Pitch Yaw angles (which can be done by finding the rotation matrix via cv::recoverPose)?
Basically, I want to double-check whether my Essential Matrix is correct or not before I move on to the next component (which is cv::recoverPose) in the pipeline!
The essential matrix maps each point p in image 1 to its epipolar line in image 2. The point q in image 2 that corresponds to p must be very close to the line. You can plot the epipolar lines and the matching points to see if they make sense. Remember that E is defined in normalized image coordinates, not in pixels. To get the epipolar lines in pixel coordinates, you would need to convert E to F (the fundamental matrix).
The epipolar lines must all intersect in one point, called the epipole. The epipole is the projection of the other camera's center in your image. You can find the epipole by computing the nullspace of F.
If you know something about the camera motion, then you can determine where the epipole should be. For example, if the camera is mounted on a vehicle that is moving directly forward, and you know the pitch and roll angles of the camera relative to the ground, then you can calculate exactly where the epipole will be. If the vehicle can turn in-plane, then you can find the horizontal line on which the epipole must lie.
You can get the epipole from F, and see if it is where it is supposed to be.
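A minimal NumPy sketch of that epipole check, assuming F has already been obtained from E and the intrinsics as F = K^{-T} E K^{-1} (the helper name below is mine):

import numpy as np

def epipoles_from_F(F):
    # F has rank 2; its two null vectors are the epipoles. The right null vector
    # (F @ e = 0) is the epipole in one image and the left null vector (e^T F = 0)
    # is the epipole in the other; which is which depends on the point-ordering
    # convention used when F was built.
    U, S, Vt = np.linalg.svd(F)
    e_right = Vt[-1]
    e_left = U[:, -1]
    # Convert from homogeneous to pixel coordinates; if the last coordinate is
    # close to zero, the epipole is at (or near) infinity.
    return e_left[:2] / e_left[2], e_right[:2] / e_right[2]

You would then compare the returned pixel location against where the known camera motion says the epipole should be (for example, for a camera translating straight along its optical axis, the epipole lies near the principal point).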
I've been working on a pose estimation project and one of the steps is finding the pose using the recoverPose function of OpenCV.
int cv::recoverPose(InputArray E,
                    InputArray points1,
                    InputArray points2,
                    InputArray cameraMatrix,
                    OutputArray R,
                    OutputArray t,
                    InputOutputArray mask = noArray()
                    )
I have all the required info: essential matrix E, key points in image 1 points1, corresponding key points in image 2 points2, and the cameraMatrix. However, the one thing that still confuses me a lot is the int value (i.e. the number of inliers) returned by the function. As per the documentation:
Recover relative camera rotation and translation from an estimated essential matrix and the corresponding points in two images, using cheirality check. Returns the number of inliers which pass the check.
However, I don't completely understand that yet. I'm concerned with this because, at some point, the yaw angle (calculated using the output rotation matrix R) suddenly jumps by more than 150 degrees. For that particular frame, the number of inliers is 0. So, as per the documentation, no points passed the cheirality check. But still, what does it mean exactly? Can that be the reason for the sudden jump in yaw angle? If yes, what are my options to avoid that? As the process is iterative, that one sudden jump affects all the further poses!
This function decomposes the essential matrix E into R and t. However, you can get up to 4 solutions, i.e. pairs of R and t. Of these 4, only one is physically realizable, meaning that the other 3 place the 3D points behind one or both cameras.
The cheirality check is what you use to find that one physically realizable solution, and this is why you need to pass matching points into the function. It will use the matching 2D points to triangulate the corresponding 3D points using each of the 4 (R, t) pairs, and choose the one for which it gets the most 3D points in front of both cameras. This accounts for the possibility that some of the point matches may be wrong. The number of points that end up in front of both cameras is the number of inliers that the function returns.
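To make that concrete, here is a rough Python illustration of what the cheirality test amounts to (my own sketch, not OpenCV's exact implementation; OpenCV additionally applies a distance threshold to discard points that triangulate too far away):

import cv2
import numpy as np

def count_cheirality_inliers(E, K, pts1, pts2):
    # pts1, pts2: Nx2 float arrays of matched pixel coordinates.
    R1, R2, t = cv2.decomposeEssentialMat(E)
    best = (0, None, None)
    for R, tt in [(R1, t), (R1, -t), (R2, t), (R2, -t)]:
        P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])    # first camera:  K [I | 0]
        P2 = K @ np.hstack([R, tt])                          # second camera: K [R | t]
        X = cv2.triangulatePoints(P1, P2,
                                  pts1.T.astype(float), pts2.T.astype(float))
        X = X[:3] / X[3]                                     # 3xN Euclidean points
        z1 = X[2]                                            # depths in camera 1
        z2 = (R @ X + tt)[2]                                 # depths in camera 2
        n_front = int(np.sum((z1 > 0) & (z2 > 0)))           # points in front of both
        if n_front > best[0]:
            best = (n_front, R, tt)
    return best  # (number of inliers, chosen R, chosen t)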
So, if the number of inliers is 0, then something went very wrong. Either your E is wrong, or the point matches are wrong, or both. In this case you simply cannot estimate the camera motion from those two images.
There are several things you can check.
After you call findEssentialMat you get the inliers from the RANSAC used to find E. Make sure that you are passing only those inlier points into recoverPose. You don't want to pass in all the points that you passed into findEssentialMat.
Before you pass E into recoverPose, check that it has rank 2. If it does not, you can enforce the rank-2 constraint on E: take the SVD of E, set the smallest singular value to 0, and then reconstitute E.
After you get R and t from recoverPose, you can check that R is indeed a rotation matrix with determinant equal to 1. If the determinant is equal to -1, then R is a reflection, and things have gone wrong.
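A short sketch of those three checks in Python, assuming pts1 and pts2 are the matched pixel coordinates (Nx2 float arrays) and K is the intrinsics matrix, as in the pipeline sketch earlier:

import cv2
import numpy as np

E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                               prob=0.999, threshold=1.0)

# 1. Pass only the RANSAC inliers of findEssentialMat into recoverPose.
inl1 = pts1[mask.ravel() == 1]
inl2 = pts2[mask.ravel() == 1]

# 2. Enforce rank 2 on E: zero out the smallest singular value and rebuild E.
U, S, Vt = np.linalg.svd(E)
S[2] = 0.0                      # a proper essential matrix also has S[0] == S[1]
E = U @ np.diag(S) @ Vt

# 3. Recover the pose and verify that R is a proper rotation (det = +1, not -1).
n_inliers, R, t, pose_mask = cv2.recoverPose(E, inl1, inl2, K)
assert abs(np.linalg.det(R) - 1.0) < 1e-6, "R is a reflection; something went wrong"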
I want to reopen a question similar to one that somebody posted a while ago, with a major difference.
The previous post is https://stackoverflow.com/questions/52536520/image-matching-using-intrinsic-and-extrinsic-camera-parameters
and my question is: can I do the matching if I do have the depth?
If it is possible, can someone describe the set of formulas I have to solve to get the desired matching?
There is also something about this correspondence on slide 16/43 here:
Depth from Stereo Lecture
In what units are all the variables here? Can someone clarify, please?
Will this formula help me calculate the desired point-to-point correspondence?
I know Z (in mm, cm, m, whatever unit it is) and x_l (I guess this is the y coordinate of the pixel, so both x_l and x_r are on the same horizontal line; correct me if I'm wrong). I'm not sure whether T is in mm (or cm, m, i.e. a distance unit) and whether f is in pixels/mm (a distance unit) or something else.
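For what it's worth, the usual convention in a rectified-stereo formula of that form, Z = f*T / (x_l - x_r), is: x_l and x_r are the horizontal pixel coordinates of the same point in the left and right images (corresponding points share the same row y), f is the focal length in pixels (e.g. fx from K), T is the baseline in a metric unit, and Z then comes out in the same unit as T. A tiny numeric sketch with invented values:

f = 700.0                  # focal length in pixels (e.g. fx from the intrinsics matrix)
T = 0.12                   # baseline between the two cameras, in meters
x_l, x_r = 412.0, 398.0    # horizontal pixel coordinates of the same point, left/right
disparity = x_l - x_r      # in pixels
Z = f * T / disparity      # depth in the same unit as T, here meters (6.0 m)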
Thank you in advance.
EDIT:
So, as was said by #fana, the solution is indeed a projection.
To my understanding it is P(v) = K(Rv + t), where R is the 3x3 rotation matrix (obtained, for example, from calibration), t is the 3x1 translation vector and K is the 3x3 intrinsics matrix.
From the following video:
It can be seen that there is translation only in one dimension (because the images are parallel, so the translation takes place only along the X-axis), but in other situations, as far as I understand, if the cameras are not on the same parallel line, there is also translation along the Y-axis. What is the translation along the Z-axis that I get through the calibration; is it some rescale factor due to different image resolutions, for example? Did I write the projection formula correctly for the general case?
I also want to ask about the whole idea.
Suppose I have 3 cameras: one with a large FOV which gives me color and depth for each pixel, let's call it the first (a 3D tensor, color stacked with the corresponding depth), and two with which I want to do stereo, let's call them the second and the third.
Instead of calibrating the two cameras to each other, my idea is to use the depth from the first camera to calculate the xyz of pixel (u, v) of its corresponding color frame (that can be done easily), and then to project it onto the second and the third images using the R, t found by calibrating the first camera against the second and the third, and using their K intrinsics matrices. So the projection matrix seems to be fully known, am I right?
Assume for this case that the FOV of the color camera is big enough to include everything that can be seen by the second and the third cameras.
That way, by projecting each x, y, z point of the first camera, I can know where the corresponding pixels are in the two other cameras, is that correct?
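A minimal NumPy sketch of that idea for a single pixel, ignoring lens distortion (all matrices and values below are invented placeholders, not an actual calibration):

import numpy as np

K1 = np.array([[600.0,   0.0, 320.0],    # intrinsics of the first (color + depth) camera
               [  0.0, 600.0, 240.0],
               [  0.0,   0.0,   1.0]])
K2 = np.array([[700.0,   0.0, 320.0],    # intrinsics of the second camera
               [  0.0, 700.0, 240.0],
               [  0.0,   0.0,   1.0]])
R12 = np.eye(3)                          # rotation from camera-1 frame to camera-2 frame
t12 = np.array([0.1, 0.0, 0.0])          # translation, in the same unit as the depth

u, v = 400, 250                          # a pixel in the first camera's image
Z = 2.5                                  # its depth, e.g. in meters

# Back-project: X1 = Z * K1^{-1} * (u, v, 1) is the 3D point in the camera-1 frame.
X1 = Z * (np.linalg.inv(K1) @ np.array([u, v, 1.0]))

# Project into the second camera: p2 ~ K2 (R12 X1 + t12), then dehomogenize.
p2 = K2 @ (R12 @ X1 + t12)
u2, v2 = p2[0] / p2[2], p2[1] / p2[2]

# The same two lines with (R13, t13, K3) give the pixel in the third camera.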
I'm working on a problem where I have a calibrated stereo pair and am identifying stereo matches. I then project those matches using perspectiveTransform to give me (x, y, z) coordinates.
Later, I'm taking those coordinates and reprojecting them into my original, unrectified image using projectPoints, which takes my left camera's M and D parameters. I was surprised to find that, despite all of this happening within the same calibration, the points do not project onto the correct part of the image (they are off by about 5 pixels, depending on where they are in the image). This offset seems to change with different calibrations.
My question is: should I expect this, or am I likely doing something wrong? It seems like the calibration ought to be internally consistent.
Here's a screenshot of just a single point being remapped (drawn with the two lines):
(ignore the little boxes, those are something else)
I was doing something slightly wrong. When reprojecting from 3D to 2D, I missed that stereoRectify returns R1, the output rectification rotation matrix. When calling projectPoints, I needed to pass the inverse of that matrix as the second parameter (rvec).
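In code, the fix looks roughly like this (a Python sketch; R1, M, D and the 3D point are placeholders, and since projectPoints expects a rotation vector, R1^{-1} is converted with Rodrigues):

import cv2
import numpy as np

R1 = np.eye(3)                                # rectification rotation from stereoRectify
M = np.array([[700.0,   0.0, 320.0],          # original (unrectified) left camera matrix
              [  0.0, 700.0, 240.0],
              [  0.0,   0.0,   1.0]])
D = np.zeros(5)                               # left camera distortion coefficients
points_3d = np.float32([[0.1, -0.2, 3.0]])    # 3D points in the rectified left-camera frame

# Rotate the points back into the unrectified left-camera frame by R1^{-1}
# (= R1.T, since R1 is orthonormal), then project with the original camera model.
rvec, _ = cv2.Rodrigues(R1.T)
tvec = np.zeros(3)
reprojected, _ = cv2.projectPoints(points_3d, rvec, tvec, M, D)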
Suppose you have a homography H relating a planar surface depicted in two different images, Ir and I. Ir is the reference image, where the planar surface is parallel to the image plane (and practically occupies the entire image). I is a run-time image (a photo of the planar surface taken from an arbitrary viewpoint). Let H be such that:
p = Hp', where p is a point in Ir and p' is the corresponding point in I.
Suppose you have two points p1 = (x1, y) and p2 = (x2, y), with x1 < x2, relative to the image Ir. Note that they belong to the same row (common y). Let H' = H^(-1). Using H', you can compute the points in I that correspond to (x1, y), (x1+1, y), ..., (x2, y).
The question is: is there a way to avoid the matrix-vector multiplication needed to compute all those points? The easiest way that comes to mind is to use the homography to compute only the corresponding points of p1 and p2 (call them p1' and p2'). To obtain the others (that is, the images of (x1+1, y), (x1+2, y), ..., (x2-1, y)), linearly interpolate between p1' and p2' in the image I.
But since there is a projective transformation between Ir and I, I think this method is quite imprecise.
Any other ideas? This question comes from the fact that I need a computationally efficient way to extract a lot of (small) patches (around 10x10 pixels) around a point p in Ir in real-time software.
Thank you.
PS: Maybe the fact that I am using small patches would make linear interpolation a suitable approach?
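One way to settle that empirically is to map the two endpoints exactly with H', map the in-between pixels both exactly and by linear interpolation, and look at the worst-case error for your patch width. A small NumPy sketch (the homography values are invented):

import numpy as np

H_prime = np.array([[1.05, 0.02,  3.0],      # invented H' = H^(-1), mapping Ir -> I
                    [0.01, 0.98, -2.0],
                    [1e-4, 2e-4,  1.0]])

def apply_h(Hm, x, y):
    v = Hm @ np.array([x, y, 1.0])
    return v[:2] / v[2]

y = 40.0
x1, x2 = 100.0, 110.0                        # a segment about as wide as a 10x10 patch
p1p = apply_h(H_prime, x1, y)                # exact images of the two endpoints
p2p = apply_h(H_prime, x2, y)

max_err = 0.0
for x in np.arange(x1, x2 + 1.0):
    exact = apply_h(H_prime, x, y)
    alpha = (x - x1) / (x2 - x1)
    interp = (1.0 - alpha) * p1p + alpha * p2p   # linear interpolation between p1' and p2'
    max_err = max(max_err, np.linalg.norm(exact - interp))

print("worst-case interpolation error over the segment (pixels):", max_err)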
You have a projective transform and, unfortunately, ratios of lengths are not invariant under this type of transformation.
My suggestion: explore the cross ratio, because it is invariant under projective transformations. I think that for every 3 points you can get a "cheaper" 4th, avoiding the matrix-vector computation and using the cross ratio instead. But I have not put everything down on paper to verify whether the cross-ratio alternative is really that much computationally cheaper than the matrix-vector multiplication.
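For what it's worth, here is a rough sketch of that idea (my own illustration, with no claim about the actual operation count): map three reference points of the row exactly, parameterize the target line in I by a scalar t, and recover every other point from the cross-ratio equality. It reuses H_prime, apply_h, y, x1, x2, p1p and p2p from the sketch above.

xm = 0.5 * (x1 + x2)                         # a third reference point on the same row
pmp = apply_h(H_prime, xm, y)                # its exact image in I

# Parameterize the line through p1' and p2' in I as point(t) = p1' + t * (p2' - p1').
d = p2p - p1p
a, b = 0.0, 1.0                              # parameters of p1' and p2'
c = np.dot(pmp - p1p, d) / np.dot(d, d)      # parameter of the third reference point

for x in np.arange(x1 + 1.0, x2, 1.0):       # interior points of the segment
    # The cross ratio of (x1, x2; xm, x) on the source row ...
    cr = ((xm - x1) * (x - x2)) / ((xm - x2) * (x - x1))
    # ... equals the cross ratio of (a, b; c, t) on the target line; solve for t.
    t = ((c - a) * b - cr * (c - b) * a) / ((c - a) - cr * (c - b))
    p = p1p + t * d                          # recovered corresponding point in I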