Here's my Setup: Kinect mounted on an actuator for horizontal movement.
Here's a short demo of what I am doing. http://www.youtube.com/watch?v=X1aSMvDQhDM
Here's my Scenario:
Please refer to the figure above. Assume the distance between the center of the actuator, 'M', and the center of the Kinect's optic axis, 'C', is 'dx' (millimeters). The depth information 'D' (millimeters) obtained from the Kinect is relative to the optic axis. Since I now have an actuator mounted at the center of the Kinect, the actual depth between the object and the Kinect is 'Z'.
X is the distance between the optic axis and the object, in pixels. Theta2 is the angle between the optic axis and the object. 'dy' can be ignored.
Here's my Problem.
To obtain Z, I can simply use the distance equation in Figure 2. However, I do not know the real-world value of X in mm. If I had the angle between the object and the optic axis, 'theta2', I could use D*sin(theta2) to obtain X in mm. However, theta2 is also unknown: if X (in mm) is known, I can get theta2, and if theta2 is known, I can get X. So how should I obtain either the X value in mm or the angle between the optic axis and object P?
Here's what I've tried:
Since I know the max field of view of the Kinect is 57 degrees, and the max horizontal resolution of the Kinect is 640 pixels, I can say that 1 degree of the Kinect's view covers about 11.228 (640/57) pixels. However, through experiments I discovered that this results in an error of at least 2 degrees. I suspect it's due to lens distortion on the Kinect, but I don't know how to compensate for/normalize it.
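For reference, a minimal sketch of one pinhole-model alternative to the linear degrees-per-pixel assumption: derive a focal length from the field of view and convert a pixel offset to an angle with atan. The 640-pixel width, 57-degree FOV, centered principal point, and the example pixel/depth values below are all assumptions, not measured values.

#include <cmath>
#include <cstdio>

int main() {
    const double kPi      = 3.14159265358979323846;
    const double width_px = 640.0;                  // assumed horizontal resolution
    const double fov_rad  = 57.0 * kPi / 180.0;     // assumed horizontal field of view

    // Focal length in pixels from the horizontal FOV: fx = (w/2) / tan(FOV/2).
    const double fx = (width_px / 2.0) / std::tan(fov_rad / 2.0);

    const double x_px = 500.0;             // hypothetical pixel column of object P
    const double cx   = width_px / 2.0;    // assume the principal point is centered

    // Angle between the optic axis and the ray through that pixel (theta2).
    const double theta2 = std::atan((x_px - cx) / fx);

    // If D is the Kinect depth along the optic axis, the lateral offset is
    // X = D * tan(theta2); if D were the slant range to P, X = D * sin(theta2).
    const double D    = 1500.0;            // hypothetical depth in mm
    const double X_mm = D * std::tan(theta2);

    std::printf("theta2 = %.2f deg, X = %.1f mm\n", theta2 * 180.0 / kPi, X_mm);
    return 0;
}

A proper intrinsic calibration (fx, cx plus distortion coefficients) would replace the FOV-derived focal length and would also let the lens distortion be corrected.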
Any ideas/help would be greatly appreciated.
I set up a stereo-vision system to triangulate a 3D point given two 2D points from two views (corresponding to the same point). I have some questions about interpreting the results.
My calibration squares are 25 mm on a side, so after triangulating and normalizing the homogeneous coordinates (dividing the array of points by the fourth coordinate), I multiply all of them by 25 mm and divide by 10 (to get cm) to get the actual distance from the camera setup.
For example, the final coordinates I got were something like [-13.29, -5.94, 68.41]. How do I interpret this? 68.41 is the distance in the z direction, -5.94 is the position in y, and -13.29 is the position in x. But what is the origin here? By convention, is it the left camera? Or is it the center of the epipolar baseline? I am using OpenCV for reference.
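For what it's worth, here is a minimal sketch of the normalization and scaling step described above, with made-up homogeneous coordinates; the last comment states OpenCV's convention for the origin.

#include <opencv2/core.hpp>
#include <cstdio>

int main() {
    // One homogeneous point as produced by cv::triangulatePoints (values made up).
    cv::Vec4d Xh(0.8, -0.4, 5.6, 0.2);

    // Divide by the fourth coordinate to get Euclidean coordinates in the same
    // units used for the calibration pattern (here, "squares").
    cv::Vec3d X(Xh[0] / Xh[3], Xh[1] / Xh[3], Xh[2] / Xh[3]);

    // Each square is 25 mm, so multiply by 25 and divide by 10 to report cm.
    const double square_mm = 25.0;
    cv::Vec3d X_cm = X * (square_mm / 10.0);

    // By OpenCV's stereo convention these coordinates are expressed in the
    // left (first) camera's frame: origin at its optical center, +X right,
    // +Y down, +Z along the optical axis into the scene.
    std::printf("X = %.1f, Y = %.1f, Z = %.1f (cm)\n", X_cm[0], X_cm[1], X_cm[2]);
    return 0;
}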
As I understand OpenCV's coordinate system (as in this diagram), the left camera of a calibrated stereo pair is located at the origin, facing the +Z direction.
I have a pair of 2464x2056 pixel cameras that I have calibrated (with a stereo RMS of around 0.35). I computed the disparity on a pair of images and reprojected it to get the 3D pointcloud. However, I've noticed that the Z axis is not in line with the optical centre of the camera.
This does kind of mess with some of the pointcloud manipulation I'm hoping to do: is this expected, or does it indicate that something has gone wrong along the way?
Below is the pointcloud I've generated, plus the axes: the red, green, and blue lines indicate the x, y, and z axes respectively, coming out from the origin.
As you can see, the Z axis intercepts the pointcloud between the head and the post; this corresponds to a pixel coordinate of approximately x = 637, y = 1028 when I fix the principal point during calibration to cx = 1232, cy = 1028. When I remove the CV_FIX_PRINCIPAL_POINT flag, the principal point is calculated as approximately cx = 1310, cy = 1074, and the Z axis intercepts at around x = 310, y = 1050.
Compared to the rectified image here, where the midpoint x = 1232, y = 1028 is marked by a yellow cross, the centre of the image is over the mannequin head, yet the point where the Z axis intersects is significantly off from where I would expect.
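One sanity check I can think of (the P1 values below are placeholders, not my actual calibration): project a point that lies on the camera's Z axis with the rectified projection matrix P1 from stereoRectify. For a pinhole model it must land exactly on P1's principal point (cx', cy'), so if the cloud's Z axis hits the image somewhere else, the Q matrix used for reprojection and P1 are probably inconsistent.

#include <opencv2/core.hpp>
#include <cstdio>

int main() {
    // Rectified projection matrix of the left camera (placeholder numbers).
    cv::Matx34d P1(3000, 0,    1232, 0,
                   0,    3000, 1028, 0,
                   0,    0,    1,    0);

    // Any point straight ahead of the camera, i.e. on the Z axis.
    cv::Vec4d Zpoint(0, 0, 1000, 1);

    cv::Vec3d uvw = P1 * Zpoint;
    const double u = uvw[0] / uvw[2];
    const double v = uvw[1] / uvw[2];

    // This should print exactly (P1(0,2), P1(1,2)), i.e. the rectified principal point.
    std::printf("Z axis projects to (%.1f, %.1f)\n", u, v);
    return 0;
}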
Does anyone have any idea as to why this could be occurring? Any help would be greatly appreciated.
I'm new to OpenCV and computer vision. I want to find the R and t between two camera poses, so I generally followed the Wikipedia article:
https://en.wikipedia.org/wiki/Essential_matrix#Determining_R_and_t_from_E
I find a set of corresponding pixel locations of the same points in the two images, compute the essential matrix, run the SVD, and print the two possible R and two possible t.
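For reference, the step I mean is roughly the following (the camera matrix and the point lists are placeholders, not my exact code); cv::recoverPose performs the same SVD-based decomposition as the Wikipedia article and uses a cheirality check to pick one of the four (R, t) candidates.

#include <opencv2/core.hpp>
#include <opencv2/calib3d.hpp>
#include <vector>

int main() {
    // Placeholder intrinsics; use the calibrated camera matrix here.
    cv::Mat K = (cv::Mat_<double>(3, 3) << 600, 0, 320,
                                             0, 600, 240,
                                             0,   0,   1);

    std::vector<cv::Point2f> pts1, pts2;  // matching pixel locations in the two images
    // ... fill pts1 / pts2 with at least 5 correspondences ...

    if (pts1.size() >= 5 && pts1.size() == pts2.size()) {
        cv::Mat E = cv::findEssentialMat(pts1, pts2, K, cv::RANSAC, 0.999, 1.0);

        // recoverPose resolves the fourfold ambiguity and returns the single
        // physically valid (R, t); note that t is only recovered up to scale.
        cv::Mat R, t;
        cv::recoverPose(E, pts1, pts2, K, R, t);
    }
    return 0;
}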
[What runs as expected]
If I change only the rotation (one of roll, pitch, yaw) or only the translation (one of x, y, z), it works perfectly. For example, if I increase the pitch to 15 degrees, then I get an R whose delta pitch is +14.9 degrees. If I only increase x by 10 cm, then the t vector is something like [0.96, -0.2, -0.2].
[What goes wrong]
However, if I change both the rotation and the translation, the R and t are nonsense. For example, if I increase x by 10 cm and increase the pitch to 15 degrees, then the recovered rotation deltas are something like [-23, 8, 0.5] degrees, and the t vector is something like [0.7, 0.5, 0.5].
[Question]
I'm wondering why I can't get a good result when I change the rotation and translation at the same time. It is also confusing why the unrelated rotation and translation components (roll, yaw, y, z) change so much.
Would anyone be willing to help me figure this out? Thanks.
[Solved and the reason]
OpenCV uses a right-handed coordinate system, that is, the z-axis points out of the xy plane toward the viewer, while our system uses a left-handed coordinate system. So whenever the changes involve the z-axis, the result is nonsense.
This was solved; the cause was the difference between the coordinate systems in use.
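As a sketch of one way to reconcile the two conventions (placeholder values, and the choice of flipping the z axis is an assumption about the left-handed system), the left-handed pose can be conjugated with S = diag(1, 1, -1) before comparing it with OpenCV's output:

#include <opencv2/core.hpp>

int main() {
    // Reflection that flips the z axis, converting between left- and
    // right-handed frames that differ only in the direction of z.
    const cv::Matx33d S(1, 0, 0,
                        0, 1, 0,
                        0, 0, -1);

    // Pose reported by the left-handed system (placeholder values).
    cv::Matx33d R_lh = cv::Matx33d::eye();
    cv::Vec3d   t_lh(0.10, 0.0, 0.0);

    // Change of basis with a reflection: R_rh = S * R_lh * S, t_rh = S * t_lh.
    cv::Matx33d R_rh = S * R_lh * S;
    cv::Vec3d   t_rh = S * t_lh;
    (void)R_rh;
    (void)t_rh;
    return 0;
}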
I'd like to get the pose (translation x, y, z and rotation Rx, Ry, Rz in the world coordinate system) of the overhead camera. I collected many object points and image points by moving a ChArUco calibration board with a robotic arm (like this: https://www.youtube.com/watch?v=8q99dUPYCPs). Because of that, I already have the exact positions of all the object points.
In order to feed many points to solvePnP, I set the first detected pattern (ChArUco board) as the first object and used it as the origin of the object coordinate system. Then I added the detected object points (from the second pattern to the last) into the first detected pattern's coordinate system (so the origin of the object frame is the origin of the first pattern).
After I got the transformation between the camera and the object's coordinate frame, I calculated the camera's pose based on that transformation.
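Roughly, that step looks like the sketch below (the intrinsics, distortion, and point lists are placeholders, not my exact code): solvePnP gives the object-to-camera transform, and inverting it gives the camera pose in the object (first board) frame.

#include <opencv2/core.hpp>
#include <opencv2/calib3d.hpp>
#include <vector>

int main() {
    // Placeholder intrinsics and distortion; use the calibrated values here.
    cv::Mat K = (cv::Mat_<double>(3, 3) << 2400, 0, 1232,
                                              0, 2400, 1028,
                                              0,    0,    1);
    cv::Mat dist = cv::Mat::zeros(1, 5, CV_64F);

    std::vector<cv::Point3f> obj_pts;  // accumulated board corners in the first board's frame
    std::vector<cv::Point2f> img_pts;  // the matching detections in the image
    // ... fill obj_pts / img_pts from the ChArUco detections ...

    if (obj_pts.size() >= 4 && obj_pts.size() == img_pts.size()) {
        cv::Mat rvec, tvec;
        cv::solvePnP(obj_pts, img_pts, K, dist, rvec, tvec);

        // rvec/tvec map object coordinates into the camera frame; invert to get
        // the camera's pose expressed in the object frame.
        cv::Mat R;
        cv::Rodrigues(rvec, R);
        cv::Mat R_cam = R.t();
        cv::Mat t_cam = -R.t() * tvec;  // camera position (x, y, z) in the object frame
        (void)R_cam;
        (void)t_cam;
    }
    return 0;
}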
The result looked pretty good at first, but when I measured the camera's absolute pose with a ruler or a tape measure, I noticed that the extrinsic calibration result was around 15-20 millimeters off in the z direction (the height of the camera), though almost correct for the others (x, y, Rx, Ry, Rz). The result was the same even when I changed the range of the object points by moving the robotic arm differently; it always ended up a couple of centimeters off in height.
Has anyone experienced a similar problem before? I'd like to know anything I can try. What are common mistakes when the depth direction (z) is inaccurate?
I don't know how you measure z, but I believe that what you're measuring with the ruler is not z but the Euclidean distance, which is computed like so:
d=std::sqrt(x*x+y*y+z*z);
Let's take an example: if x = 2, y = 2, z = 2, then d ≈ 3.46, so 3.46 - 2 ≈ 1.5 is the difference you get between z and the ruler measurement, which would explain being "around 15-20 millimeters off in the z direction".
All I know is the height and width of an object in the video. Can someone guide me on how to calculate the distance of a detected object from the camera in a video using C or C++? Is there an algorithm or formula to do that?
Thanks in advance.
Martin Ch was correct in saying that you need to calibrate your camera, but as vasile pointed out, it is not a linear change. Calibrating your camera means finding this matrix
camera_matrix = [ fx, 0,  cx,
                  0,  fy, cy,
                  0,  0,  1  ];
This matrix operates on a 3-dimensional coordinate (x, y, z) and converts it into a 2-dimensional homogeneous coordinate. To convert to your regular Euclidean (x, y) coordinate, just divide the first and second components by the third. So now what are those variables doing?
cx/cy: They exist to let you change coordinate systems if you like. For instance you might want the origin in camera space to be in the top left of the image and the origin in world space to be in the center. In that case
cx = -width/2;
cy = -height/2;
If you are not changing coordinate systems just leave these as 0.
fx/fy: These specify your focal length in units of x pixels and y pixels. They are very often close to the same value, so you may be able to just give them the same value f. These parameters essentially define how strong perspective effects are. The mapping from a world coordinate to a screen coordinate (as you can work out for yourself from the above matrix), assuming no cx and cy, is
xsc = fx*xworld/zworld;
ysc = fy*yworld/zworld;
As you can see, the important quantity that makes things bigger close up and smaller farther away is the ratio f/z. It is not linear, but by using homogeneous coordinates we can still use linear transforms.
In short: with a calibrated camera and a known object size in world coordinates, you can calculate its distance from the camera. If you are missing either one of those, it is impossible. Without knowing the object size in world coordinates, the best you can do is map its screen position to a ray in world coordinates by determining the ratio xworld/zworld (knowing fx).
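To make that concrete, a minimal sketch under assumed numbers (the focal length, real-world height, and pixel height below are made up): rearranging the pinhole relation h_pixels = fy * h_world / z gives the distance.

#include <cstdio>

int main() {
    const double fy       = 700.0;   // focal length in pixels, from calibration (assumed)
    const double h_world  = 1700.0;  // known real-world object height in mm (assumed)
    const double h_pixels = 350.0;   // measured object height in the image (assumed)

    // Pinhole relation: h_pixels = fy * h_world / z  =>  z = fy * h_world / h_pixels.
    const double z = fy * h_world / h_pixels;
    std::printf("estimated distance: %.0f mm\n", z);   // 3400 mm with these numbers
    return 0;
}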
I don't think it is easy if you have to use the camera only.
Consider using a third device/sensor like a Kinect or a stereo camera;
then you will get the depth (z) from that data.
https://en.wikipedia.org/wiki/OpenNI