Image point to point matching using intrinsics, extrinsics and third-party depth - image-processing

I want to reopen a similar question to one which somebody posted a while ago with some major difference.
The previous post is https://stackoverflow.com/questions/52536520/image-matching-using-intrinsic-and-extrinsic-camera-parameters]
and my question is can I do the matching if I do have the depth?
If it is possible can some describe a set of formulas which I have to solve to get the desirable matching ?
Here there is also some correspondence on slide 16/43:
Depth from Stereo Lecture
In what units all the variables here, can some one clarify please ?
Will this formula help me to calculate the desirable point to point correspondence ?
I know the Z (mm, cm, m, whatever unit it is) and the x_l (I guess this is y coordinate of the pixel, so both x_l and x_r are on the same horizontal line, correct if I'm wrong), I'm not sure if T is in mm (or cm, m, i.e distance unit) and f is in pixels/mm (distance unit) or is it something else ?
Thank you in advance.
EDIT:
So as it was said by #fana, the solution is indeed a projection.
For my understanding it is P(v) = K (Rv+t), where R is 3 x 3 rotation matrix (calculated for example from calibration), t is the 3 x 1 translation vector and K is the 3 x 3 intrinsics matrix.
from the following video:
It can be seen that there is translation only in one dimension (because the situation is where the images are parallel so the translation takes place only on X-axis) but in other situation, as much as I understand if the cameras are not on the same parallel line, there is also translation on Y-axis. What is the translation on the Z-axis which I get through the calibration, is it some rescale factor due to different image resolutions for example ? Did I wrote the projection formula correctly in the general case?
I also want to ask about the whole idea.
Suppose I have 3 cameras, one with large FOV which gives me color and depth for each pixel, lets call it the first (3d tensor, color stacked with depth correspondingly), and two with which I want to do stereo, lets call them second and third.
Instead of calibrating the two cameras, my idea is to use the depth from the first camera to calculate the xyz of pixel u,v of its correspondent color frame, that can be done easily and now to project it on the second and the third image using the R,t found by calibration between the first camera and the second and the third, and using the K intrinsics matrices so the projection matrix seems to be full known, am I right ?
Assume for the case that FOV of color is big enough to include all that can be seen from the second and the third cameras.
That way, by projection each x,y,z of the first camera I can know where is the corresponding pixels on the two other cameras, is that correct ?

Related

How to check whether Essential Matrix is correct or not without decomposing it?

On a very high level, my pose estimation pipeline looks somewhat like this:
Find features in image_1 and image_2 (let's say cv::ORB)
Match the features (let's say using the BruteForce-Hamming descriptor matcher)
Calculate Essential Matrix (using cv::findEssentialMat)
Decompose it to get the proper rotation matrix and translation unit vector (using cv::recoverPose)
Repeat
I noticed that at some point, the yaw angle (calculated using the output rotation matrix R of cv::recoverPose) suddenly jumps by more than 150 degrees. For that particular frame, the number of inliers is 0 (the return value of cv::recoverPose). So, to understand what exactly that means and what's going on, I asked this question on SO.
As per the answer to my question:
So, if the number of inliers is 0, then something went very wrong. Either your E is wrong, or the point matches are wrong, or both. In this case you simply cannot estimate the camera motion from those two images.
For that particular image pair, based on the visualization and from my understanding, matches look good:
The next step in the pipeline is finding the Essential Matrix. So, now, how can I check whether the calculated Essential Matrix is correct or not without decomposing it i.e. without calculating Roll Pitch Yaw angles (which can be done by finding the rotation matrix via cv::recoverPose)?
Basically, I want to double-check whether my Essential Matrix is correct or not before I move on to the next component (which is cv::recoverPose) in the pipeline!
The essential matrix maps each point p in image 1 to its epipolar line in image 2. The point q in image 2 that corresponds to p must be very close to the line. You can plot the epipolar lines and the matching points to see if they make sense. Remember that E is defined in normalized image coordinates, not in pixels. To get the epipolar lines in pixel coordinates, you would need to convert E to F (the fundamental matrix).
The epipolar lines must all intersect in one point, called the epipole. The epipole is the projection of the other camera's center in your image. You can find the epipole by computing the nullspace of F.
If you know something about the camera motion, then you can determine where the epipole should be. For example, if the camera is mounted on a vehicle that is moving directly forward, and you know the pitch and roll angles of the camera relative to the ground, then you can calculate exactly where the epipole will be. If the vehicle can turn in-plane, then you can find the horizontal line on which the epipole must lie.
You can get the epipole from F, and see if it is where it is supposed to be.

Measure distance to object with a single camera in a static scene

let's say I am placing a small object on a flat floor inside a room.
First step: Take a picture of the room floor from a known, static position in the world coordinate system.
Second step: Detect the bottom edge of the object in the image and map the pixel coordinate to the object position in the world coordinate system.
Third step: By using a measuring tape measure the real distance to the object.
I could move the small object, repeat this three steps for every pixel coordinate and create a lookup table (key: pixel coordinate; value: distance). This procedure is accurate enough for my use case. I know that it is problematic if there are multiple objects (an object could cover an other object).
My question: Is there an easier way to create this lookup table? Accidentally changing the camera angle by a few degrees destroys the hard work. ;)
Maybe it is possible to execute the three steps for a few specific pixel coordinates or positions in the world coordinate system and perform some "calibration" to calculate the distances with the computed parameters?
If the floor is flat, its equation is that of a plane, let
a.x + b.y + c.z = 1
in the camera coordinates (the origin is the optical center of the camera, XY forms the focal plane and Z the viewing direction).
Then a ray from the camera center to a point on the image at pixel coordinates (u, v) is given by
(u, v, f).t
where f is the focal length.
The ray hits the plane when
(a.u + b.v + c.f) t = 1,
i.e. at the point
(u, v, f) / (a.u + b.v + c.f)
Finally, the distance from the camera to the point is
p = √(u² + v² + f²) / (a.u + b.v + c.f)
This is the function that you need to tabulate. Assuming that f is known, you can determine the unknown coefficients a, b, c by taking three non-aligned points, measuring the image coordinates (u, v) and the distances, and solving a 3x3 system of linear equations.
From the last equation, you can then estimate the distance for any point of the image.
The focal distance can be measured (in pixels) by looking at a target of known size, at a known distance. By proportionality, the ratio of the distance over the size is f over the length in the image.
Most vision libraries (including opencv) have built in functions that will take a couple points from a camera reference frame and the related points from a Cartesian plane and generate your warp matrix (affine transformation) for you. (some are fancy enough to include non-linearity mappings with enough input points, but that brings you back to your time to calibrate issue)
A final note: most vision libraries use some type of grid to calibrate off of ie a checkerboard patter. If you wrote your calibration to work off of such a sheet, then you would only need to measure distances to 1 target object as the transformations would be calculated by the sheet and the target would just provide the world offsets.
I believe what you are after is called a Projective Transformation. The link below should guide you through exactly what you need.
Demonstration of calculating a projective transformation with proper math typesetting on the Math SE.
Although you can solve this by hand and write that into your code... I strongly recommend using a matrix math library or even writing your own matrix math functions prior to resorting to hand calculating the equations as you will have to solve them symbolically to turn it into code and that will be very expansive and prone to miscalculation.
Here are just a few tips that may help you with clarification (applying it to your problem):
-Your A matrix (source) is built from the 4 xy points in your camera image (pixel locations).
-Your B matrix (destination) is built from your measurements in in the real world.
-For fast recalibration, I suggest marking points on the ground to be able to quickly place the cube at the 4 locations (and subsequently get the altered pixel locations in the camera) without having to remeasure.
-You will only have to do steps 1-5 (once) during calibration, after that whenever you want to know the position of something just get the coordinates in your image and run them through step 6 and step 7.
-You will want your calibration points to be as far away from eachother as possible (within reason, as at extreme distances in a vanishing point situation, you start rapidly losing pixel density and therefore source image accuracy). Make sure that no 3 points are colinear (simply put, make your 4 points approximately square at almost the full span of your camera fov in the real world)
ps I apologize for not writing this out here, but they have fancy math editing and it looks way cleaner!
Final steps to applying this method to this situation:
In order to perform this calibration, you will have to set a global home position (likely easiest to do this arbitrarily on the floor and measure your camera position relative to that point). From this position, you will need to measure your object's distance from this position in both x and y coordinates on the floor. Although a more tightly packed calibration set will give you more error, the easiest solution for this may simply be to have a dimension-ed sheet(I am thinking piece of printer paper or a large board or something). The reason that this will be easier is that it will have built in axes (ie the two sides will be orthogonal and you will just use the four corners of the object and used canned distances in your calibration). EX: for a piece of paper your points would be (0,0), (0,8.5), (11,8.5), (11,0)
So using those points and the pixels you get will create your transform matrix, but that still just gives you a global x,y position on axes that may be hard to measure on (they may be skew depending on how you measured/ calibrated). So you will need to calculate your camera offset:
object in real world coords (from steps above): x1, y1
camera coords (Xc, Yc)
dist = sqrt( pow(x1-Xc,2) + pow(y1-Yc,2) )
If it is too cumbersome to try to measure the position of the camera from global origin by hand, you can instead measure the distance to 2 different points and feed those values into the above equation to calculate your camera offset, which you will then store and use anytime you want to get final distance.
As already mentioned in the previous answers you'll need a projective transformation or simply a homography. However, I'll consider it from a more practical view and will try to summarize it short and simple.
So, given the proper homography you can warp your picture of a plane such that it looks like you took it from above (like here). Even simpler you can transform a pixel coordinate of your image to world coordinates of the plane (the same is done during the warping for each pixel).
A homography is basically a 3x3 matrix and you transform a coordinate by multiplying it with the matrix. You may now think, wait 3x3 matrix and 2D coordinates: You'll need to use homogeneous coordinates.
However, most frameworks and libraries will do this handling for you. What you need to do is finding (at least) four points (x/y-coordinates) on your world plane/floor (preferably the corners of a rectangle, aligned with your desired world coordinate system), take a picture of them, measure the pixel coordinates and pass both to the "find-homography-function" of your desired computer vision or math library.
In OpenCV that would be findHomography, here an example (the method perspectiveTransform then performs the actual transformation).
In Matlab you can use something from here. Make sure you are using a projective transformation as transform type. The result is a projective tform, which can be used in combination with this method, in order to transform your points from one coordinate system to another.
In order to transform into the other direction you just have to invert your homography and use the result instead.

Nature of relationship between optic flow and depth

Assuming the static scene, with a single camera moving exactly sideways at small distance, there are two frames and a following computed optic flow (I use opencv's calcOpticalFlowFarneback):
Here scatter points are detected features, which are painted in pseudocolor with depth values (red is little depth, close to the camera, blue is more distant). Now, I obtain those depth values by simply inverting optic flow magnitude, like d = 1 / flow. Seems kinda intuitive, in a motion-parallax-way - the brighter the object, the closer it is to the observer. So there's a cube, exposing a frontal edge and a bit of a side edge to the camera.
But then I'm trying to project those feature points from camera plane to the real-life coordinates to make a kind of top view map (where X = (x * d) / f and Y = d (where d is depth, x is pixel coordinate, f is focal length, and X and Y are real-life coordinates). And here's what I get:
Well, doesn't look cubic to me. Looks like the picture is skewed to the right. I've spent some time thinking about why, and it seems that 1 / flow is not an accurate depth metric. Playing with different values, say, if I use 1 / power(flow, 1 / 3), I get a better picture:
But, of course, power of 1 / 3 is just a magic number out of my head. The question is, what is the relationship between optic flow in depth in general, and how do I suppose to estimate it for a given scene? We're just considering camera translation here. I've stumbled upon some papers, but no luck trying to find a general equation yet. Some, like that one, propose a variation of 1 / flow, which isn't going to work, I guess.
Update
What bothers me a little is that simple geometry points me to 1 / flow answer too. Like, optic flow is the same (in my case) as disparity, right? Then using this formula I get d = Bf / (x2 - x1), where B is distance between two camera positions, f is focal length, x2-x1 is precisely the optic flow. Focal length is a constant, and B is constant for any two given frames, so that leaves me with 1 / flow again multiplied by a constant. Do I misunderstand something about what optic flow is?
for a static scene, moving a camera precisely sideways a known amount, is exactly the same as a stereo camera setup. From this, you can indeed estimate depth, if your system is calibrated.
Note that calibration in this sense is rather broad. In order to get real accurate depth, you will need to in the end supply a scale parameter on top of the regular calibration stuff you have in openCV, or else there is a single uniform ambiguity of the 3D (This last step is often called going to the "metric" reconstruction from only the "Euclidean").
Another thing which is apart of broad calibration is lens distortion compensation. Before anything else, you probably want to force your cameras to behave like pin-hole cameras (which real-world cameras usually dont).
With that said, optical flow is definetely very different from a metric depth map. If you properly calibraty and rectify your system first, then optical flow is still not equivalent to disparity estimation. If your system is rectified, there is no point in doing a full optical flow estimation (such as Farnebäck), because the problem is thereafter constrained along the horizontal lines of the image. Doing a full optical flow estimation (giving 2 d.o.f) will introduce more error after said rectification likely.
A great reference for all this stuff is the classic "Multiple View Geometry in Computer Vision"

Should stereo projection be internally consistent?

I'm working on a problem where I have a calibrated stereo pair and am identifying stereo matches. I then project those matches using perspectiveTransform to give me (x, y, z) coordinates.
Later, I'm taking those coordinates and reprojecting them into my original, unrectified image using projectPoints with takes my left camera's M and D parameters. I was surprised to find that, despite all of this happening within the same calibration, the points do not project on the correct part of the image (they have about a 5 pixel offset, depending where they are in the image). This offset seems to change with different calibrations.
My question is: should I expect this, or am I likely doing something wrong? It seems like the calibration ought to be internally consistent.
Here's a screenshot of just a single point being remapped (drawn with the two lines):
(ignore the little boxes, those are something else)
I was doing something slightly wrong. When reprojecting from 3D to 2D, I missed that stereoRectify returns R1, the output rectification rotation matrix. When calling projectPoints, I needed to pass the inverse of that matrix as the second parameter (rvec).

Calibrate camera with opencv, how does it work and how do i have to move my chessboard

I'm using openCV the calibrateCamera function to calibrate my camera. I started from the tutorial implementation, but there seems something wrong.
The camera is looking down on a table and i use a chessboard with an area that covers about 1/2 or 1/4 of my total image. Since I aim to track a flat object that slides over this table, I also slide my chessboard over this table.
So my first question is: is it ok that i move my chessboard over this table? Or do I have to make some 3D movements in order to get some good result?
Because I was wondering: how does the function guesses the distance between the table and the camera? He has only a guess of his focal point, and he has only one "eye", so there is no depth vision.
My second question: how does the bloody thing work? :p Can anyone show me some implementation of this function?
Thx!
the camera calibration needs a seed of points to calculate the camera matrix and the the position of the central point of the camera, and the distortion matrices , if you want to use a chessboard you have to take in consideration its dimension(I never used the circles function because the detection of chessboard is easier ) , the dimension of the chessboard should be pair X unpair number so you can get a correct rotation matrix ! the calibration function needs minimum 8x set of chessboardCorners and ( I use 30 tell 50) it depends on how precise you want to be .the return value of the calibration function is the re-projection error this should be near to zero if the calibration is good.
the cameraCalibration take a the size of used chessboards ( you can use different chessboardSize) and the dimension ( in mm or cm or even m etc.. ) your result will depend on your given dimension.
By the way after getting the chessboardCorners you have to refines them with the function CornerSubPix you can set how good the refinement is in the function parameter.
In the internet you can find a lot docs about this subject.
http://www.ics.uci.edu/~majumder/vispercep/cameracalib.pdf
I hope it helps !
regarding the chessboard positions, I got best results with 25-30 images
First I do 3 -4 images that show the chessboard at different distances full frame half 1/3 1/4
then I make sure to go to each corner, each center of each edge plus 4 rotation on each axis XYZ. When using a 640x480 sensor my reprojection error was mostly around 0.1 or even better
here a few links that got me in the right direction:
How to verify the correctness of calibration of a webcam?

Resources