How to calculate world coordinates from screen coordinates with OpenCV

I have calculated the intrinsic and extrinsic parameters of the camera with OpenCV.
Now, I want to calculate world coordinates (x,y,z) from screen coordinates (u,v).
How do I do this?
N.B.: since I use the Kinect, I already know the z coordinate.
Any help is much appreciated. Thanks!

First, to understand how to calculate it, it would help to read a bit about the pinhole camera model and simple perspective projection. For a quick glimpse, check this. I'll try to update with more.
So, let's start with the opposite, which describes how a camera works: projecting a 3D point in the world coordinate system to a 2D point in our image. According to the camera model:
P_screen = I * P_world
or (using homogeneous coordinates)
| x_screen |         | x_world |
| y_screen | = I  *  | y_world |
|    1     |         | z_world |
                     |    1    |
where
I = | f_x   0   c_x   0 |
    |  0   f_y  c_y   0 |
    |  0    0    1    0 |
is the 3x4 intrinsics matrix, f being the focal length and c the principal point (center of projection).
If you solve the system above, you get:
x_screen = (x_world/z_world)*f_x + c_x
y_screen = (y_world/z_world)*f_y + c_y
But, you want to do the reverse, so your answer is:
x_world = (x_screen - c_x) * z_world / f_x
y_world = (y_screen - c_y) * z_world / f_y
z_world is the depth the Kinect returns to you and you know f and c from your intrinsics calibration, so for every pixel, you apply the above to get the actual world coordinates.
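For illustration, here is a minimal C++ sketch of that back-projection (the function and variable names are mine, not from the question; the depth is assumed to already be in the units you want for x and y):

#include <opencv2/core.hpp>

// Back-project one pixel (u, v) with known depth z_world to a 3D point,
// using the intrinsics f_x, f_y, c_x, c_y obtained from calibration.
cv::Point3f backproject(float u, float v, float z_world,
                        float f_x, float f_y, float c_x, float c_y)
{
    float x_world = (u - c_x) * z_world / f_x;
    float y_world = (v - c_y) * z_world / f_y;
    return cv::Point3f(x_world, y_world, z_world);
}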
Edit 1 (why the above corresponds to world coordinates, and what the extrinsics we get during calibration are):
First, check this one; it explains the various coordinate systems very well.
Your 3D coordinate systems are: Object ---> World ---> Camera. There is a transformation that takes you from the object coordinate system to the world one, and another that takes you from the world system to the camera one (the extrinsics you refer to). Usually you assume that:
either the Object system coincides with the World system,
or the Camera system coincides with the World system.
1. While capturing an object with the Kinect
When you use the Kinect to capture an object, what is returned to you from the sensor is the distance from the camera. That means that the z coordinate is already in camera coordinates. By converting x and y using the equations above, you get the point in camera coordinates.
Now, the world coordinate system is defined by you. One common approach is to assume that the camera is located at (0,0,0) of the world coordinate system. So, in that case, the extrinsics matrix actually corresponds to the identity matrix and the camera coordinates you found, correspond to world coordinates.
Sidenote: Because the Kinect returns the z in camera coordinates, there is also no need for a transformation from the object coordinate system to the world coordinate system. Let's say, for example, that you had a different camera that captured faces and for each point returned the distance from the nose (which you considered to be the center of the object coordinate system). In that case, since the values returned would be in the object coordinate system, we would indeed need a rotation and translation matrix to bring them to the camera coordinate system.
2. While calibrating the camera
I suppose you are calibrating the camera with OpenCV, using a calibration board in various poses. The usual way is to assume that the board is stationary and the camera is moving, instead of the opposite (the transformation is the same in both cases). That means that now the world coordinate system corresponds to the object coordinate system. This way, for every frame, we find the checkerboard corners and assign them 3D coordinates, doing something like:
std::vector<cv::Point3f> objectCorners;
for (int i = 0; i < noOfCornersInHeight; i++)
{
    for (int j = 0; j < noOfCornersInWidth; j++)
    {
        objectCorners.push_back(cv::Point3f(float(i*squareSize), float(j*squareSize), 0.0f));
    }
}
where noOfCornersInWidth, noOfCornersInHeight and squareSize depend on your calibration board. If for example noOfCornersInWidth = 4, noOfCornersInHeight = 3 and squareSize = 100, we get the 3d points
(0 ,0,0) (0 ,100,0) (0 ,200,0) (0 ,300,0)
(100,0,0) (100,100,0) (100,200,0) (100,300,0)
(200,0,0) (200,100,0) (200,200,0) (200,300,0)
So, here our coordinates are actually in the object coordinate system. (We have arbitrarily assumed that the upper-left corner of the board is (0,0,0) and that the coordinates of the remaining corners are relative to it.) So here we do need the rotation and translation matrices to take us from the object (world) system to the camera system. These are the extrinsics that OpenCV returns for each frame.
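For completeness, a hedged sketch of how OpenCV hands those per-frame extrinsics back during calibration; objectPoints, imagePoints and imageSize are placeholders for your own data, not code from the question:

#include <opencv2/calib3d.hpp>
#include <vector>

void calibrateSketch(const std::vector<std::vector<cv::Point3f>>& objectPoints,  // e.g. objectCorners, one copy per frame
                     const std::vector<std::vector<cv::Point2f>>& imagePoints,   // detected checkerboard corners per frame
                     cv::Size imageSize)
{
    cv::Mat cameraMatrix, distCoeffs;
    std::vector<cv::Mat> rvecs, tvecs;  // one rotation/translation pair per frame
    double rms = cv::calibrateCamera(objectPoints, imagePoints, imageSize,
                                     cameraMatrix, distCoeffs, rvecs, tvecs);
    // rvecs[i], tvecs[i] take object (world) coordinates of frame i into the camera
    // coordinate system; cv::Rodrigues(rvecs[i], R) gives the corresponding 3x3 rotation.
    (void)rms;  // reprojection error, useful as a sanity check
}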
To sum up, in the Kinect case:
The Camera and World coordinate systems are considered the same, so there is no need for extrinsics there.
There is no need for an Object-to-World (Camera) transformation, since the values the Kinect returns are already in the Camera system.
Edit 2 (On the coordinate system used):
This is a convention and I think it depends also on which drivers you use and the kind of data you get back. Check for example that, that and that one.
Sidenote: It would help you a lot if you visualized a point cloud and played a little bit with it. You can save your points in a 3d object format (e.g. ply or obj) and then just import it into a program like Meshlab (very easy to use).

Regarding the coordinate system convention in Edit 2 above: if you, for instance, use the Microsoft SDK, then Z is not the distance to the camera but the "planar" distance to the camera. This might change the appropriate formulas.

Related

Find the Transformation Matrix that maps 3D local coordinates to global coordinates

I'm coding a calibration algorithm for my depth camera. This camera outputs a one-channel 2D image with the distance of every object in the image.
From that image, and using the camera and distortion matrices, I was able to create a 3D point cloud from the camera's perspective. Now I wish to convert those 3D coordinates to global/world coordinates. But since I can't use any pattern like the chessboard to calibrate the camera, I need another alternative.
So I was thinking: if I provide some ground points (in the camera perspective), I would define a plane that I know should have a Z coordinate close to zero in the global perspective. So, how should I proceed to find the transformation matrix that makes that plane horizontal?
[Image: the local-coordinate ground plane, with an object on top]
I tried using OpenCV's solvePnP, but it didn't give me the correct transformation. I also thought about using OpenCV's estimateAffine3D, but I don't know where the global coordinates should be mapped to, since the provided ground points need not lie on any specific pattern/shape.
Thanks in advance
What you need is what's commonly called extrinsic calibration: a rigid transformation relating the 3D camera reference frame to the 'world' reference frame. Usually, this is done by finding known 3D points in the world reference frame and their corresponding 2D projections in the image. This is what SolvePNP does.
To find the best rotation/translation between two sets of 3D points, in the sense of minimizing the root mean square error, the solution is:
Theory: https://igl.ethz.ch/projects/ARAP/svd_rot.pdf
Easier explanation: http://nghiaho.com/?page_id=671
Python code (from the easier explanation site): http://nghiaho.com/uploads/code/rigid_transform_3D.py_
So, if you want to transform 3D points from the camera reference frame, do the following:
As you proposed, define some 3D points with known positions in the world reference frame, for example (but not necessarily) with Z=0. Put the coordinates in an Nx3 matrix P.
Get the corresponding 3D points in the camera reference frame. Put them in an Nx3 matrix Q.
Using the Python file linked above, call rigid_transform_3D(P, Q). This will return a 3x3 matrix R and a 3x1 vector t.
Then, for any 3D point in the world reference frame p, as a 3x1 vector, you can obtain the corresponding camera point, q with:
q = R.dot(p)+t
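Since the rest of this thread is C++, here is a rough OpenCV equivalent of that rigid_transform_3D routine (Kabsch/SVD); it is an untested sketch rather than the linked script, with P and Q as Nx3 CV_64F matrices as described above:

#include <opencv2/core.hpp>

// Finds R (3x3) and t (3x1) such that Q = R*P + t in the least-squares sense,
// for corresponding rows of P (world points) and Q (camera points).
void rigidTransform3D(const cv::Mat& P, const cv::Mat& Q, cv::Mat& R, cv::Mat& t)
{
    // centroids of both point sets (1x3 rows)
    cv::Mat cP, cQ;
    cv::reduce(P, cP, 0, cv::REDUCE_AVG);
    cv::reduce(Q, cQ, 0, cv::REDUCE_AVG);

    // centered sets and the 3x3 covariance H = Pc^T * Qc
    cv::Mat Pc = P - cv::repeat(cP, P.rows, 1);
    cv::Mat Qc = Q - cv::repeat(cQ, Q.rows, 1);
    cv::Mat H = Pc.t() * Qc;

    // SVD: H = U * S * Vt, rotation R = V * U^T (fix a possible reflection)
    cv::SVD svd(H);
    R = svd.vt.t() * svd.u.t();
    if (cv::determinant(R) < 0) {
        cv::Mat V = svd.vt.t();
        V.col(2) *= -1.0;
        R = V * svd.u.t();
    }

    // translation so that Q maps to R*P + t
    t = cQ.t() - R * cP.t();
}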
EDIT: answer for when the 3D positions of the points in the world are unspecified
Indeed, for this procedure to work, you need to know (or better, to specify) the 3D coordinates of the points in your world reference frame. As stated in your comment, you only know the points are in a plane but don't have their coordinates in that plane.
Here is a possible solution:
Take the selected 3D points in camera reference frame, let's call them q'i.
Fit a plane to these points, for example as described in https://www.ilikebigbits.com/2015_03_04_plane_from_points.html. The result of this will be a normal vector n. To fully specify the plane, you need also to choose a point, for example the centroid (average) of q'i.
As the points surely don't perfectly lie in the plane, project them onto the plane, for example as described in: How to project a point onto a plane in 3D?. Let's call these projected points qi.
At this point you have a set of 3D points, qi, that lie on a perfect plane, which should correspond closely to the ground plane (z=0 in world coordinate frame). The coordinates are in the camera reference frame, though.
Now we need to specify an origin and the direction of the x and y axes in this ground plane. You don't seem to have any criteria for this, so an option is to arbitrarily set the origin just "below" the camera center, and align the X axis with the camera optical axis. For this:
Project the point (0,0,0) onto the plane, as you did with the other points above. Call this o. Project the point (0,0,1) onto the plane and call it a. Compute the vector a-o, normalize it and call it i.
o is the origin of the world reference frame, and i is the X axis of the world reference frame, in camera coordinates. Call j = n x i (cross product). j is the Y axis and we are almost finished.
Now, obtain the X-Y coordinates of the points qi in the world reference frame by projecting them on i and j. That is, take the dot product between each qi - o and i to get the X values, and the dot product between each qi - o and j to get the Y values. The Z values are all 0. Call these (X, Y, 0) coordinates pi.
Use these values of pi and qi to estimate R and t, as in the first part of the answer! (A rough sketch of these steps follows below.)
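Putting those steps together, here is a possible (untested) C++ sketch; the helper names are mine and OpenCV's PCA stands in for the plane fit of the linked article:

#include <opencv2/core.hpp>
#include <vector>

// Build ground-plane world coordinates p_world for camera-frame ground points qcam,
// and also return the points projected onto the fitted plane (the qi of the steps above).
void groundPlanePoints(const std::vector<cv::Vec3d>& qcam,
                       std::vector<cv::Vec3d>& p_world,
                       std::vector<cv::Vec3d>& q_plane)
{
    // Fit a plane with PCA: the normal n is the direction of least variance,
    // the centroid c is a point on the plane. The sign of n is arbitrary here.
    cv::Mat data((int)qcam.size(), 3, CV_64F, (void*)qcam.data());
    cv::PCA pca(data, cv::Mat(), cv::PCA::DATA_AS_ROW);
    cv::Vec3d c(pca.mean.at<double>(0, 0), pca.mean.at<double>(0, 1), pca.mean.at<double>(0, 2));
    cv::Vec3d n(pca.eigenvectors.at<double>(2, 0),
                pca.eigenvectors.at<double>(2, 1),
                pca.eigenvectors.at<double>(2, 2));

    // Orthogonal projection of a point onto the plane through c with normal n.
    auto project = [&](const cv::Vec3d& p) { return p - n * n.dot(p - c); };

    // Origin "below" the camera centre, X axis along the optical axis, Y = n x X.
    cv::Vec3d o = project(cv::Vec3d(0, 0, 0));
    cv::Vec3d i = cv::normalize(project(cv::Vec3d(0, 0, 1)) - o);
    cv::Vec3d j = n.cross(i);

    // Project each point onto the plane and read off its world coordinates (X, Y, 0).
    p_world.clear();
    q_plane.clear();
    for (const cv::Vec3d& q : qcam) {
        cv::Vec3d qi = project(q);
        q_plane.push_back(qi);
        p_world.push_back(cv::Vec3d((qi - o).dot(i), (qi - o).dot(j), 0.0));
    }
    // p_world and q_plane can now be fed to the rigid-transform estimation above.
}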
Maybe there is a simpler solution. Also, I haven't tested this, but I think it should work. Hope this helps.

Overhead camera's pose estimation with OpenCV solvePnP results in a height that is a few centimeters off

I'd like to get the pose (translation: x, y, z and rotation: Rx, Ry, Rz in World coordinate system) of the overhead camera. I got many object points and image points by moving the ChArUco calibration board with a robotic arm (like this https://www.youtube.com/watch?v=8q99dUPYCPs). Because of that, I already have exact positions of all the object points.
In order to feed many points to solvePnP, I set the first detected pattern (ChArUco board) as the first object and used it as the object coordinate system's origin. Then, I added the detected object points (from the second pattern to the last) to the first detected object points' coordinate system (the origin of the object frame is the origin of the first object).
After I got the transformation between the camera and the object's coordinate frame, I calculated the camera's pose based on that transformation.
The result looked pretty good at first, but when I measured the camera's absolute pose with a ruler or a tape measure, I noticed that the extrinsic calibration result was around 15-20 millimeters off in the z direction (the height of the camera), though almost correct for the others (x, y, Rx, Ry, Rz). The result was the same even when I changed the range of the object points by moving the robotic arm differently; it always ended up a few centimeters off in height.
Has anyone experienced a similar problem before? I'd like to know anything I can try. What is the common mistake when the depth direction (z) is inaccurate?
I don't know how you measure z, but I believe that what you're measuring with the ruler is not z but the Euclidean distance, which is computed like so:
d=std::sqrt(x*x+y*y+z*z);
Let's take an example: if x = 2, y = 2 and z = 2,
then d will be about 3.5, so 3.5 - 2 = 1.5 is the difference you get between z and the ruler measurement, which would explain the "around 15-20 millimeter off for z direction" you mention.

How do I determine the camera rotation in world space (not orientation of a Charuco board) in OpenCV?

In OpenCV, I am using a Charuco board, have calibrated the camera, and use SolvePnP to get rvec and tvec. (similar to the sample code). I am using a stationary board, with a camera on a circular rig which orbits the board. I'm a noob with this, so please bear with me if this is something simple I am missing.
I understand that I can use Rodrigues() to get the 3x3 rotation matrix for the board orientation from rvec, and that I can transform the tvec value into world coordinates using -R.t() * tvec (in c++).
However, as I understand it, this 3x3 rotation R gives the orientation of the board with respect to the camera, so it's not quite what I need. I want the rotation of the camera itself, which is offset from R by (I think) the angle between tvec and the z axis in camera space (because the camera is not always pointing at the board origin, but it is always pointing down the z axis in camera space). Is this correct?
How do I find the additional rotation offset and convert it to a 3x3 rotation matrix which I can combine with R to get the actual camera orientation?
Thanks!
Let's say you capture N frames of the Charuco board from the camera. Then, you have N transformations taking a point in the camera frame to the same point in the Charuco board frame. This is obtained from the Charuco board poses for every frame.
Say we denote a linear transformation from one coordinate frame to another as T_4x4 = [R_3x3, t_3x1; 0_1x3, 1]. If we look at a point P from the board coordinate frame, we refer to it as P_board. Similarly, looking at that point from camera 1 we call it P_c1, from camera 2 P_c2, and so on.
So, we could write:
P_board = T_1 * P_c1
P_board = T_2 * P_c2
...
P_board = T_N * P_cN
From what I understand, you need the camera's rotation relative to, say, the starting point (assume that the camera is at zero rotation in frame 1). So, you could express every subsequent frame in terms of P_c1 instead of P_board.
So, we could say
T_2 * P_c2 = T_1 * P_c1
or, P_c2 = T_2^-1 * T_1 * P_c1
Similarly, P_cN = T_N^-1 * T_1 * P_c1
You can recover R_N, the rotation relating camera pose N to camera pose 1, as the rotation part of T_N^-1 * T_1; note that this rotation takes point coordinates expressed in camera frame 1 to camera frame N (transpose it for the opposite direction).
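A possible C++ sketch of that algebra (the helper names are mine; rvec/tvec are the per-frame board poses from solvePnP / estimatePoseCharucoBoard, which map board coordinates into camera coordinates):

#include <opencv2/calib3d.hpp>

// T_i of the answer: takes a point from camera frame i to the board frame,
// i.e. the inverse of the (rvec, tvec) pose returned for that frame.
cv::Matx44d cameraToBoard(const cv::Vec3d& rvec, const cv::Vec3d& tvec)
{
    cv::Matx33d R;
    cv::Rodrigues(rvec, R);
    cv::Matx33d Ri = R.t();          // inverse rotation
    cv::Vec3d   ti = -(Ri * tvec);   // inverse translation
    return cv::Matx44d(Ri(0,0), Ri(0,1), Ri(0,2), ti[0],
                       Ri(1,0), Ri(1,1), Ri(1,2), ti[1],
                       Ri(2,0), Ri(2,1), Ri(2,2), ti[2],
                       0, 0, 0, 1);
}

// Rotation part of T_N^-1 * T_1: it takes point coordinates expressed in camera
// frame 1 to camera frame N (transpose the result for the opposite direction).
cv::Matx33d relativeRotation(const cv::Vec3d& rvec1, const cv::Vec3d& tvec1,
                             const cv::Vec3d& rvecN, const cv::Vec3d& tvecN)
{
    cv::Matx44d rel = cameraToBoard(rvecN, tvecN).inv() * cameraToBoard(rvec1, tvec1);
    return cv::Matx33d(rel(0,0), rel(0,1), rel(0,2),
                       rel(1,0), rel(1,1), rel(1,2),
                       rel(2,0), rel(2,1), rel(2,2));
}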

Converting a 2D image point to a 3D world point

I know that in the general case, making this conversion is impossible since depth information is lost going from 3d to 2d.
However, I have a fixed camera and I know its camera matrix. I also have a planar calibration pattern of known dimensions - let's say that in world coordinates it has corners (0,0,0) (2,0,0) (2,1,0) (0,1,0). Using opencv I can estimate the pattern's pose, giving the translation and rotation matrices needed to project a point on the object to a pixel in the image.
Now: this 3d to image projection is easy, but how about the other way? If I pick a pixel in the image that I know is part of the calibration pattern, how can I get the corresponding 3d point?
I could iteratively choose some random 3d point on the calibration pattern, project to 2d, and refine the 3d point based on the error. But this seems pretty horrible.
Given that this unknown point has world coordinates something like (x,y,0) -- since it must lie on the z=0 plane -- it seems like there should be some transformation that I can apply, instead of doing the iterative nonsense. My maths isn't very good though - can someone work out this transformation and explain how you derive it?
Here is a closed form solution that I hope can help someone. Using the conventions in the image from your comment above, you can use centered-normalized pixel coordinates (usually after distortion correction) u and v, and extrinsic calibration data, like this:
| Tx |   | r11 r21 r31 |   | -t1 |
| Ty | = | r12 r22 r32 | . | -t2 |
| Tz |   | r13 r23 r33 |   | -t3 |

| dx |   | r11 r21 r31 |   | u |
| dy | = | r12 r22 r32 | . | v |
| dz |   | r13 r23 r33 |   | 1 |
With these intermediate values, the coordinates you want are:
X = (-Tz/dz)*dx + Tx
Y = (-Tz/dz)*dy + Ty
Explanation:
The vector [t1, t2, t3]^t is the position of the origin of the world coordinate system (the (0,0) of your calibration pattern) with respect to the camera optical center; by reversing the signs and inverting the rotation transformation we obtain the vector T = [Tx, Ty, Tz]^t, which is the position of the camera center in the world reference frame.
Similarly, [u, v, 1]^t is the direction along which the observed point lies in the camera reference frame (starting from the camera center). By inverting the rotation transformation we obtain the vector d = [dx, dy, dz]^t, which represents the same direction in the world reference frame.
To invert the rotation transformation we take advantage of the fact that the inverse of a rotation matrix is its transpose (link).
Now we have a line with direction vector d starting from point T, the intersection of this line with plane Z=0 is given by the second set of equations. Note that it would be similarly easy to find the intersection with the X=0 or Y=0 planes or with any plane parallel to them.
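A small C++ sketch of those formulas (the function name is mine; u and v are the centered-normalized, undistorted coordinates, e.g. u = (px - cx)/fx, and R, tvec are the pattern pose from solvePnP):

#include <opencv2/core.hpp>

// Intersect the viewing ray through (u, v) with the world plane Z = 0.
cv::Point3d intersectRayWithZ0(double u, double v,
                               const cv::Matx33d& R, const cv::Vec3d& tvec)
{
    cv::Matx33d Rt = R.t();                    // inverse of the rotation
    cv::Vec3d T = Rt * (-tvec);                // camera centre in world coordinates
    cv::Vec3d d = Rt * cv::Vec3d(u, v, 1.0);   // ray direction in world coordinates

    double s = -T[2] / d[2];                   // -Tz/dz: scale that reaches Z = 0
    return cv::Point3d(s * d[0] + T[0], s * d[1] + T[1], 0.0);
}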
Yes, you can. If you have a transformation matrix that maps a point in the 3D world to the image plane, you can just use the inverse of this transformation to map an image-plane point back to the 3D world point. If you already know that z = 0 for the 3D world point, this will result in one solution for the point; there is no need to iteratively choose some random 3D point. I had a similar problem where I had a camera mounted on a vehicle with a known position and camera calibration matrix, and I needed to know the real-world location of a lane marking captured on the image plane of the camera.
If you have Z=0 for your points in world coordinates (which should be true for a planar calibration pattern), instead of inverting the rotation transformation you can compute a homography between the camera image and the calibration pattern.
Once you have the homography, you can select a point on the image and get its location in world coordinates using the inverse homography.
This is true as long as the point in world coordinates lies on the same plane as the points used for computing this homography (in this case Z=0).
This approach to this problem was also discussed below this question on SO: Transforming 2D image coordinates to 3D world coordinates with z = 0
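A minimal C++ sketch of that homography route (the point sets and query pixel are placeholders; the image points should be undistorted first):

#include <opencv2/calib3d.hpp>
#include <vector>

// imagePts: pattern points in the image; worldPts: the same points as (X, Y) on the
// Z = 0 world plane. Returns the world (X, Y) of an arbitrary pixel on that plane.
cv::Point2f pixelToPlane(const std::vector<cv::Point2f>& imagePts,
                         const std::vector<cv::Point2f>& worldPts,
                         const cv::Point2f& pixel)
{
    // Estimating the homography directly as image -> plane avoids inverting it.
    cv::Mat H = cv::findHomography(imagePts, worldPts);

    std::vector<cv::Point2f> in = { pixel }, out;
    cv::perspectiveTransform(in, out, H);
    return out[0];   // the 3D world point is (out[0].x, out[0].y, 0)
}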

How can I measure the distance of a detected object from the camera in video using OpenCV?

All I know is the height and width of an object in the video. Can someone guide me on how to calculate the distance of a detected object from the camera in a video using C or C++? Is there an algorithm or formula to do that?
thanks in advance
Martin Ch was correct in saying that you need to calibrate your camera, but as vasile pointed out, it is not a linear change. Calibrating your camera means finding this matrix
camera_matrix = [ fx, 0,  cx,
                  0,  fy, cy,
                  0,  0,  1  ];
This matrix operates on a 3 dimensional coordinate (x,y,z) and converts it into a 2 dimensional homogeneous coordinate. To convert to your regular euclidean (x,y) coordinate just divide the first and second component by the third. So now what are those variables doing?
cx/cy: They exist to let you change coordinate systems if you like. For instance you might want the origin in camera space to be in the top left of the image and the origin in world space to be in the center. In that case
cx = -width/2;
cy = -height/2;
If you are not changing coordinate systems just leave these as 0.
fx/fy: These specify your focal length in units of x pixels and y pixels, these are very often close to the same value so you may be able to just give them the same value f. These parameters essentially define how strong perspective effects are. The mapping from a world coordinate to a screen coordinate (as you can work out for yourself from the above matrix) assuming no cx and cy is
xsc = fx*xworld/zworld;
ysc = fy*yworld/zworld;
As you can see, the important quantity that makes things bigger closer up and smaller farther away is the ratio f/z. It is not linear, but by using homogeneous coordinates we can still use linear transforms.
In short: with a calibrated camera and a known object size in world coordinates, you can calculate its distance from the camera. If you are missing either one of those, it is impossible. Without knowing the object size in world coordinates, the best you can do is map its screen position to a ray in world coordinates by determining the ratio xworld/zworld (knowing fx).
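For instance, a tiny hedged sketch of the distance-from-known-size idea (assuming the object is roughly fronto-parallel and you measured its width in pixels; all names are placeholders):

// From xsc = fx * xworld / zworld: an object of real width W at distance Z spans
// roughly fx * W / Z pixels, so Z can be estimated as fx * W / pixelWidth.
double distanceFromWidth(double fx, double realWidth, double pixelWidth)
{
    return fx * realWidth / pixelWidth;
}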
I don't think it is easy if you have to use the camera only.
Consider using a third device/sensor like a Kinect or a stereo camera;
then you will get the depth (z) from the data.
https://en.wikipedia.org/wiki/OpenNI
