Is the translation vector in cross-calibration of a camera and LiDAR equal to the physical distance in millimeters? - opencv

I have an RGB camera and a LiDAR. I implemented a paper to cross-calibrate them in order to fuse their data.
The cross-calibration gives a rotation matrix (R) and a translation vector (t) to move points from the Lidar coordinate system to the camera coordinate system.
My question is this:
Is the translation vector (t) equal to the distance between the centers of the LiDAR and the camera as measured with a ruler?
For example, if t=[23, 450, 300], does it mean in the x direction we should move 23 mm (millimeters), in y we should move 450 mm, and in the z-direction, we should move 300 mm?
I ask this question because the projection of the LiDAR point cloud onto an image looks pretty good to my eye; however, the translation vector (t) differs significantly from the distance I measure with a ruler between the two sensors.

For example, if t=[23, 450, 300], does it mean in the x direction we should move 23 mm (millimeters), in y we should move 450 mm, and in the z-direction, we should move 300 mm?
-> Yes, it is, as long as you used the same unit to represent your point cloud during the optimization.
The reason your estimate differs from the distance between the actual devices might be an observability issue: did you make enough motion, or use enough data, to observe the translation space well?
Small translations between the sensors might not make a notable difference if the scene points are far from the sensor. That might be why your result doesn't look too bad.
See my paper for more detail:
Spatiotemporal Camera-LiDAR Calibration: A Targetless and Structureless Approach
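To make the unit-consistency point concrete, here is a minimal C++/OpenCV sketch (not the paper's pipeline; the values of R, t, K and the sample points are placeholders) that applies a cross-calibration result to LiDAR points and projects them into the image. The translation t only behaves as a physical offset in millimeters if the point cloud is expressed in millimeters too.

#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>
#include <vector>

int main() {
    // Placeholder intrinsics (pixels) and extrinsics; substitute your own calibration.
    cv::Mat K = (cv::Mat_<double>(3, 3) << 1000, 0, 640,  0, 1000, 360,  0, 0, 1);
    cv::Mat R = cv::Mat::eye(3, 3, CV_64F);   // rotation LiDAR -> camera
    cv::Vec3d t(23, 450, 300);                // translation in mm, same unit as the cloud

    // LiDAR points, also in millimeters.
    std::vector<cv::Point3d> lidarPoints = { {1000, 0, 5000}, {-500, 200, 7000} };

    // projectPoints applies p_cam = R * p_lidar + t, then the pinhole model with K.
    cv::Mat rvec;
    cv::Rodrigues(R, rvec);
    std::vector<cv::Point2d> pixels;
    cv::projectPoints(lidarPoints, rvec, t, K, cv::Mat(), pixels);

    // pixels[i] is where lidarPoints[i] lands in the image; if the cloud were in meters
    // while t stayed in millimeters, the overlay would be visibly wrong.
    return 0;
}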

Related

Opencv get R and t from Essential Matrix

I'm new to OpenCV and computer vision. I want to find the R and t matrix between two camera poses, so I am generally following Wikipedia:
https://en.wikipedia.org/wiki/Essential_matrix#Determining_R_and_t_from_E
I find a group of corresponding pixel locations of the same points in the two images, compute the essential matrix, run the SVD, and print the two possible R and two possible t.
[What runs as expected]
If I only change the rotation (one of roll, pitch, yaw) or only the translation (one of x, y, z), it works perfectly. For example, if I increase pitch to 15 degrees, I get an R whose delta pitch is +14.9 degrees. If I only increase x by 10 cm, the t vector is something like [0.96, -0.2, -0.2].
[What goes wrong]
However, if I change both the rotation and the translation, the recovered R and t are nonsense. For example, if I increase x by 10 cm and pitch to 15 degrees, the recovered rotation delta is something like [-23, 8, 0.5] degrees and the t vector is something like [0.7, 0.5, 0.5].
[Question]
I'm wondering why I cannot get a good result when I change the rotation and translation at the same time, and why the unrelated components (roll, yaw, y, z) also change so much.
Would anyone be willing to help me figure this out? Thanks.
[Solved and the reason]
OpenCV uses a right-handed coordinate system, which is to say the z-axis points out of the xy plane toward the viewer. Our system uses a left-handed coordinate system, so whenever the changes involve the z-axis, the result is nonsense.
This was solved; the cause was the difference between the coordinate systems in use, as explained above.
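To make the handedness fix concrete, here is a hedged C++/OpenCV sketch: relativePose recovers (R, t) from matched pixels with findEssentialMat/recoverPose (both assume OpenCV's right-handed convention, and t is only up to scale), and leftToRightHanded is a hypothetical helper showing how a ground-truth pose expressed in a left-handed frame (z flipped) can be conjugated with S = diag(1, 1, -1) before comparing.

#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>
#include <vector>

// Recover the relative pose from matched pixel coordinates (t is only up to scale).
void relativePose(const std::vector<cv::Point2d>& pts1,
                  const std::vector<cv::Point2d>& pts2,
                  const cv::Mat& K, cv::Mat& R, cv::Mat& t) {
    cv::Mat mask;
    cv::Mat E = cv::findEssentialMat(pts1, pts2, K, cv::RANSAC, 0.999, 1.0, mask);
    cv::recoverPose(E, pts1, pts2, K, R, t, mask);   // cheirality check picks the valid (R, t)
}

// Convert a pose expressed in a left-handed frame (z axis flipped) into OpenCV's
// right-handed convention by conjugating with S = diag(1, 1, -1).
void leftToRightHanded(const cv::Mat& R_lh, const cv::Mat& t_lh,
                       cv::Mat& R_rh, cv::Mat& t_rh) {
    cv::Mat S = (cv::Mat_<double>(3, 3) << 1, 0, 0,  0, 1, 0,  0, 0, -1);
    R_rh = S * R_lh * S;
    t_rh = S * t_lh;
}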

Overhead camera's pose estimation with OpenCV SolvePnP results in a few centimeter off height

I'd like to get the pose (translation: x, y, z and rotation: Rx, Ry, Rz in World coordinate system) of the overhead camera. I got many object points and image points by moving the ChArUco calibration board with a robotic arm (like this https://www.youtube.com/watch?v=8q99dUPYCPs). Because of that, I already have exact positions of all the object points.
In order to feed many points to solvePnP, I set the first detected pattern (ChArUco board) as the origin of the object coordinate system. Then I expressed the object points detected from the second pattern through the last in that first pattern's coordinate frame (so the origin of the object frame is the origin of the first pattern).
After I got the transformation between the camera and the object's coordinate frame, I calculated the camera's pose based on that transformation.
The result looked pretty good at first, but when I measured the camera's absolute pose with a ruler or a tape measure, I noticed that the extrinsic calibration result was around 15-20 millimeters off in the z direction (the height of the camera), though almost correct for the others (x, y, Rx, Ry, Rz). The result was the same even when I changed the range of the object points by moving the robotic arm differently; it always ended up a few centimeters off in height.
Has anyone experienced a similar problem before? I'd like to know anything I can try. What is a common mistake when the depth direction (z) is inaccurate?
I don't know how you measured z, but I believe that what you're measuring with the ruler is not z but the Euclidean distance, which is computed like so:
d=std::sqrt(x*x+y*y+z*z);
Let's take an example: if x = 2, y = 2, z = 2, then d ≈ 3.5, so 3.5 - 2 = 1.5 is the difference you get between z and the ruler, which is why you see something like 15-20 millimeters off in the z direction.
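As a quick check you can run yourself, here is a minimal sketch (the rvec/tvec values are placeholders) that turns a solvePnP result into the camera position in the world frame and compares its z component with the straight-line distance a tape measure from the world origin would report:

#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>
#include <iostream>

int main() {
    // Placeholder solvePnP output (world -> camera); substitute your own rvec/tvec (here in mm).
    cv::Mat rvec = (cv::Mat_<double>(3, 1) << 0.05, -0.02, 0.0);
    cv::Mat tvec = (cv::Mat_<double>(3, 1) << 120.0, -80.0, 950.0);

    // Camera position expressed in the world/object frame: C = -R^T * t.
    cv::Mat R;
    cv::Rodrigues(rvec, R);
    cv::Mat C = -R.t() * tvec;

    double height = C.at<double>(2);   // z component: the camera height above the world origin
    double tape   = cv::norm(C);       // straight-line distance a tape measure would report

    std::cout << "height (z): " << height << " mm, tape distance: " << tape << " mm\n";
    return 0;
}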

Measure distance to object with a single camera in a static scene

Let's say I am placing a small object on a flat floor inside a room.
First step: Take a picture of the room floor from a known, static position in the world coordinate system.
Second step: Detect the bottom edge of the object in the image and map the pixel coordinate to the object position in the world coordinate system.
Third step: Using a tape measure, measure the real distance to the object.
I could move the small object and repeat these three steps for every pixel coordinate, creating a lookup table (key: pixel coordinate; value: distance). This procedure is accurate enough for my use case. I know it is problematic if there are multiple objects (one object could cover another).
My question: Is there an easier way to create this lookup table? Accidentally changing the camera angle by a few degrees destroys the hard work. ;)
Maybe it is possible to execute the three steps for a few specific pixel coordinates or positions in the world coordinate system and perform some "calibration" to calculate the distances with the computed parameters?
If the floor is flat, its equation is that of a plane, let
a.x + b.y + c.z = 1
in the camera coordinates (the origin is the optical center of the camera, XY forms the focal plane and Z the viewing direction).
Then a ray from the camera center to a point on the image at pixel coordinates (u, v) is given by
(u, v, f).t
where f is the focal length and t ≥ 0 is the ray parameter.
The ray hits the plane when
(a.u + b.v + c.f) t = 1,
i.e. at the point
(u, v, f) / (a.u + b.v + c.f)
Finally, the distance from the camera to the point is
p = √(u² + v² + f²) / (a.u + b.v + c.f)
This is the function that you need to tabulate. Assuming that f is known, you can determine the unknown coefficients a, b, c by taking three non-aligned points, measuring the image coordinates (u, v) and the distances, and solving a 3x3 system of linear equations.
From the last equation, you can then estimate the distance for any point of the image.
The focal distance can be measured (in pixels) by looking at a target of known size, at a known distance. By proportionality, the ratio of the distance over the size is f over the length in the image.
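A hedged C++ sketch of that calibration step might look like the following; fitFloorPlane and floorDistance are hypothetical helper names, u and v are pixel coordinates taken relative to the principal point (as implied by the camera model above), f is the focal length in pixels, and each calibration entry stores (u, v, measured distance).

#include <opencv2/core.hpp>
#include <cmath>

// Fit the floor plane a*x + b*y + c*z = 1 from three non-aligned calibration points:
// for each point we know its pixel coordinates (u, v) and its measured distance p.
cv::Vec3d fitFloorPlane(const cv::Vec3d uvp[3], double f) {
    cv::Mat A(3, 3, CV_64F), rhs(3, 1, CV_64F);
    for (int i = 0; i < 3; ++i) {
        double u = uvp[i][0], v = uvp[i][1], p = uvp[i][2];
        A.at<double>(i, 0) = u;
        A.at<double>(i, 1) = v;
        A.at<double>(i, 2) = f;
        // p = sqrt(u^2 + v^2 + f^2) / (a*u + b*v + c*f)  =>  a*u + b*v + c*f = sqrt(...) / p
        rhs.at<double>(i, 0) = std::sqrt(u * u + v * v + f * f) / p;
    }
    cv::Mat abc;
    cv::solve(A, rhs, abc);   // 3x3 linear system for (a, b, c)
    return { abc.at<double>(0), abc.at<double>(1), abc.at<double>(2) };
}

// Distance from the camera to the floor point seen at pixel (u, v).
double floorDistance(double u, double v, double f, const cv::Vec3d& abc) {
    return std::sqrt(u * u + v * v + f * f) /
           (abc[0] * u + abc[1] * v + abc[2] * f);
}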
Most vision libraries (including OpenCV) have built-in functions that will take a few points in the camera reference frame and the corresponding points on a Cartesian plane and generate your warp matrix (an affine or, more generally, projective transformation) for you. (Some are fancy enough to include non-linear mappings with enough input points, but that brings you back to your time-to-calibrate issue.)
A final note: most vision libraries use some type of grid to calibrate from, e.g. a checkerboard pattern. If you write your calibration to work from such a sheet, then you only need to measure the distance to one target object, since the transformations are calculated from the sheet and the target just provides the world offsets.
I believe what you are after is called a Projective Transformation. The link below should guide you through exactly what you need.
Demonstration of calculating a projective transformation with proper math typesetting on the Math SE.
Although you can solve this by hand and write it into your code, I strongly recommend using a matrix math library (or at least writing your own matrix math functions) rather than hand-calculating the equations, since you would have to solve them symbolically to turn them into code, which gets very long and is prone to miscalculation.
Here are a few tips that may help clarify how this applies to your problem:
-Your A matrix (source) is built from the 4 xy points in your camera image (pixel locations).
-Your B matrix (destination) is built from your measurements in the real world.
-For fast recalibration, I suggest marking points on the ground to be able to quickly place the cube at the 4 locations (and subsequently get the altered pixel locations in the camera) without having to remeasure.
-You will only have to do steps 1-5 (once) during calibration, after that whenever you want to know the position of something just get the coordinates in your image and run them through step 6 and step 7.
-You will want your calibration points to be as far from each other as possible (within reason; at extreme distances in a vanishing-point situation you rapidly lose pixel density and therefore source-image accuracy). Make sure that no 3 points are collinear (simply put, make your 4 points approximately square, spanning almost the full camera FOV in the real world).
P.S. I apologize for not writing the math out here, but they have fancy math editing over there and it looks way cleaner!
Final steps to applying this method to this situation:
In order to perform this calibration, you will have to set a global home position (it is probably easiest to pick this arbitrarily on the floor and measure your camera position relative to that point). From this position, you will need to measure your object's distance in both x and y coordinates on the floor. Although a more tightly packed calibration set will give you more error, the easiest solution may simply be a sheet of known dimensions (a piece of printer paper, a large board, or something similar). The reason this is easier is that it has built-in axes (i.e. the two sides are orthogonal), so you just use the four corners of the object and canned distances in your calibration. E.g., for a piece of paper your points would be (0,0), (0,8.5), (11,8.5), (11,0).
Using those points and the pixels you get for them creates your transform matrix, but that still just gives you a global x, y position on axes that may be hard to measure against (they may be skewed depending on how you measured/calibrated). So you will need to calculate your camera offset:
object in real world coords (from steps above): x1, y1
camera coords (Xc, Yc)
dist = sqrt( pow(x1-Xc,2) + pow(y1-Yc,2) )
If it is too cumbersome to measure the position of the camera from the global origin by hand, you can instead measure the distances to 2 different known points and feed those values into the above equation to solve for your camera offset, which you then store and use any time you want the final distance.
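Here is a minimal C++/OpenCV sketch of that workflow under the paper-sheet example above (all coordinate values are placeholders): the four corner correspondences give the projective transform, perspectiveTransform maps a detected pixel onto the floor, and the last lines apply the camera-offset distance formula.

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <cmath>
#include <vector>

int main() {
    // Four pixel locations of the calibration corners (placeholder values) ...
    std::vector<cv::Point2f> imgPts   = { {212, 480}, {230, 120}, {560, 118}, {590, 475} };
    // ... and their floor coordinates, e.g. the corners of a sheet of paper (inches).
    std::vector<cv::Point2f> floorPts = { {0, 0}, {0, 8.5f}, {11, 8.5f}, {11, 0} };

    // 3x3 projective transform from image pixels to floor coordinates (the calibration step).
    cv::Mat H = cv::getPerspectiveTransform(imgPts, floorPts);

    // Map any detected pixel to floor coordinates ...
    std::vector<cv::Point2f> pixel = { {400, 300} }, onFloor;
    cv::perspectiveTransform(pixel, onFloor, H);

    // ... and compute the distance to the (separately measured) camera ground position.
    cv::Point2f cam(-20.0f, 35.0f);   // placeholder camera offset in the same floor units
    double dist = std::sqrt(std::pow(onFloor[0].x - cam.x, 2) +
                            std::pow(onFloor[0].y - cam.y, 2));
    (void)dist;
    return 0;
}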
As already mentioned in the previous answers, you'll need a projective transformation, or simply a homography. However, I'll consider it from a more practical point of view and try to keep the summary short and simple.
So, given the proper homography, you can warp your picture of a plane such that it looks like you took it from above (like here). Even simpler, you can transform a pixel coordinate of your image into world coordinates on the plane (the same thing is done for every pixel during warping).
A homography is basically a 3x3 matrix, and you transform a coordinate by multiplying it with the matrix. You may now think: wait, a 3x3 matrix and 2D coordinates? You'll need to use homogeneous coordinates.
However, most frameworks and libraries will do this handling for you. What you need to do is find (at least) four points (x/y coordinates) on your world plane/floor (preferably the corners of a rectangle aligned with your desired world coordinate system), take a picture of them, measure the pixel coordinates, and pass both to the "find-homography-function" of your desired computer vision or math library.
In OpenCV that would be findHomography, here an example (the method perspectiveTransform then performs the actual transformation).
In Matlab you can use something from here. Make sure you are using a projective transformation as transform type. The result is a projective tform, which can be used in combination with this method, in order to transform your points from one coordinate system to another.
In order to transform in the other direction, you just have to invert your homography and use the result instead.

Recover plane from homography

I have used OpenCV to calculate the homography relating two views of the same plane by detecting and matching features. Is there any way to recover the plane itself or the plane normal from this homography? (I am looking for an equation where H is the input and the normal n is the output.)
If you have the calibration of the cameras, you can extract the normal of the plane, but not the distance to the plane (i.e. the transformation that you obtain is only up to scale), as Wikipedia explains. I don't know of any implementation that does it, but here are a couple of papers that deal with the problem (I warn you, it is not straightforward): Faugeras & Lustman 1988, Vargas & Malis 2005.
You can recover the real translation of the transformation (i.e. the distance to the plane) if you have at least one real distance between two points on the plane. If that is the case, the easiest way to go with OpenCV is to first calculate the homography, then obtain four points on the plane with both their 2D coordinates and their real 3D ones (you should be able to obtain them if you have a real measurement on the plane), and finally use PnP. PnP will give you the transformation at real scale.
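For reference, newer OpenCV versions do ship cv::decomposeHomographyMat (based on the Malis & Vargas report cited above), so there is a built-in way to get candidate plane normals; here is a minimal sketch with placeholder H and K values.

#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>
#include <vector>

int main() {
    // Placeholder homography between the two views and camera intrinsics.
    cv::Mat H = (cv::Mat_<double>(3, 3) << 1.0, 0.01, 5.0,
                                           -0.02, 1.1, 3.0,
                                           1e-4, 2e-4, 1.0);
    cv::Mat K = (cv::Mat_<double>(3, 3) << 800, 0, 320,
                                           0, 800, 240,
                                           0, 0, 1);

    // Up to four (R, t/d, n) candidates; t is only known up to the plane distance d.
    std::vector<cv::Mat> rotations, translations, normals;
    int n = cv::decomposeHomographyMat(H, K, rotations, translations, normals);

    // normals[i] is a candidate plane normal expressed in the first camera's frame.
    // The physically valid candidate is usually selected with visibility/cheirality
    // checks on points known to lie on the plane.
    (void)n;
    return 0;
}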
Rectifying an image is defined as making epipolar lines horizontal and lie in the same rows in both images. From your description I gather that you simply want to warp the plane such that it is parallel to the camera sensor or the image plane. This has nothing to do with rectification; I'd rather call it obtaining a bird's-eye view or a top view.
I see the source of confusion, though. Rectification of images usually involves multiplying each image by a homography matrix. In your case, each point in sensor plane b satisfies
Xb = Hab * Xa = (Hb * Ha^-1) * Xa,
where Ha is the homography from the plane in the world to sensor a. Ha together with the intrinsic camera matrix would give you the plane orientation, but I don't see an easy way to decompose Hab into Ha and Hb.
A classic (and hard) way is to find a fundamental matrix, restore the essential matrix from it, decompose the essential matrix into camera rotation and translation (up to scale), rectify both images, perform dense stereo, and then fit a plane equation to the 3D points you reconstruct.
If you are interested in the ground plane and you operate an embedded device, you don't even need two frames: a top view can be recovered from a single photo, the camera elevation above the ground (H), and gyroscope (or orientation vector) readings. The process in the 2D case: first restore the Z (depth) coordinate of every point on the ground plane; then plot the top view with the vertical axis being Z and the horizontal axis x = (img.col - w/2) * Z / focal, where img.col is the image column, w is the image width, and focal is the camera focal length. Note that the camera frustum looks like a trapezoid in a bird's-eye view.
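A hedged sketch of that single-photo recovery, assuming the roll/pitch have already been compensated with the gyroscope readings so the optical axis is horizontal (groundPointFromPixel is a hypothetical helper; f is the focal length in pixels, (cx, cy) the principal point, H the camera height above the ground):

#include <opencv2/core.hpp>

// Recover ground-plane coordinates (X lateral, Z depth) of a pixel (u, v), assuming a
// pinhole camera at height H above a flat ground with its optical axis horizontal.
// Only pixels below the horizon (v > cy) correspond to ground points.
bool groundPointFromPixel(double u, double v, double f, double cx, double cy,
                          double H, cv::Point2d& ground) {
    if (v <= cy) return false;        // at or above the horizon: no ground intersection
    double Z = f * H / (v - cy);      // depth along the optical axis
    double X = (u - cx) * Z / f;      // matches x = (img.col - w/2) * Z / focal above
    ground = { X, Z };
    return true;
}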

How can I measure the distance of a detected object from the camera in video using OpenCV?

All I know is the height and width of the object in the video. Can someone guide me on how to calculate the distance of a detected object from the camera in a video using C or C++? Is there an algorithm or formula to do that?
Thanks in advance.
Martin Ch was correct in saying that you need to calibrate your camera, but as vasile pointed out, it is not a linear change. Calibrating your camera means finding this matrix
camera_matrix = [ fx, 0,  cx,
                  0,  fy, cy,
                  0,  0,  1  ];
This matrix operates on a 3 dimensional coordinate (x,y,z) and converts it into a 2 dimensional homogeneous coordinate. To convert to your regular euclidean (x,y) coordinate just divide the first and second component by the third. So now what are those variables doing?
cx/cy: They exist to let you change coordinate systems if you like. For instance you might want the origin in camera space to be in the top left of the image and the origin in world space to be in the center. In that case
cx = -width/2;
cy = -height/2;
If you are not changing coordinate systems just leave these as 0.
fx/fy: These specify your focal length in units of x pixels and y pixels; they are very often close to the same value, so you may be able to just give them the same value f. These parameters essentially define how strong perspective effects are. The mapping from a world coordinate to a screen coordinate (as you can work out for yourself from the above matrix), assuming cx and cy are zero, is
xsc = fx*xworld/zworld;
ysc = fy*yworld/zworld;
As you can see, the important quantity that makes things bigger close up and smaller farther away is the ratio f/z. It is not linear, but by using homogeneous coordinates we can still use linear transforms.
In short: with a calibrated camera and a known object size in world coordinates, you can calculate its distance from the camera. If you are missing either one of those, it is impossible. Without knowing the object size in world coordinates, the best you can do is map its screen position to a ray in world coordinates by determining the ratio xworld/zworld (knowing fx).
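As a hedged one-liner that follows from those equations (distanceFromSize is a hypothetical helper): with the focal length in pixels and the object's real width, the depth is z = fx * xworld / xsc.

// Distance to an object of known physical size from its apparent size in the image,
// using the pinhole relation xsc = fx * xworld / zworld from above.
// fx is in pixels; realWidth and the returned distance share the same unit.
double distanceFromSize(double fx, double realWidth, double pixelWidth) {
    return fx * realWidth / pixelWidth;
}
// Example: with fx = 1000 px, an object 0.5 m wide spanning 100 px is about 5 m away.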
I don't think it is easy if you have to use a camera only. Consider using a third device/sensor such as a Kinect or a stereo camera; then you will get the depth (z) from its data.
https://en.wikipedia.org/wiki/OpenNI
