How to find Camera matrix for Augmented Reality? - opencv

I want to augment a virtual object at x,y,z meters wrt camera. OpenCV has camera calibration functions but I don't understand how exactly I can give coordinates in meters
I tried simulating a Camera in Unity but don't get expected result.
I set the projection matrix as follows and create a unit cube at z = 2.415 + 0.5 .
Where 2.415 is the distance between the eye and the projection plane (Pinhole camera model)
Since the cube's face is at the front clipping plane and it's dimension are unit shouldn't it cover the whole viewport?
Matrix4x4 m = new Matrix4x4();
m[0, 0] = 1;
m[0, 1] = 0;
m[0, 2] = 0;
m[0, 3] = 0;
m[1, 0] = 0;
m[1, 1] = 1;
m[1, 2] = 0;
m[1, 3] = 0;
m[2, 0] = 0;
m[2, 1] = 0;
m[2, 2] = -0.01f;
m[2, 3] = 0;
m[3, 0] = 0;
m[3, 1] = 0;
m[3, 2] = -2.415f;
m[3, 3] = 0;

The global scale of your calibration (i.e. the units of measure of 3D space coordinates) is determined by the geometry of the calibration object you use. For example, when you calibrate in OpenCV using images of a flat checkerboard, the inputs to the calibration procedure are corresponding pairs (P, p) of 3D points P and their images p, the (X, Y, Z) coordinates of the 3D points are expressed in mm, cm, inches, miles, whatever, as required by the size of target you use (and the optics that images it), and the 2D coordinates of the images are in pixels. The output of the calibration routine is the set of parameters (the components of the projection matrix P and the non-linear distortion parameters k) that "convert" 3D coordinates expressed in those metrical units into pixels.
If you don't know (or don't want to use) the actual dimensions of the calibration target, you can just fudge them but leave their ratios unchanged (so that, for example, a square remains a square even though the true length of its side may be unknown). In this case your calibration will be determined up to an unknown global scale. This is actually the common case: in most virtual reality applications you don't really care what the global scale is, as long as the results look correct in the image.
For example, if you want to add an even puffier pair of 3D lips on a video of Angelina Jolie, and composite them with the original video so that the brand new fake lips stay attached and look "natural" on her face, you just need to rescale the 3D model of the fake lips so that it overlaps correctly the image of the lips. Whether the model is 1 yard or one mile away from the CG camera in which you render the composite is completely irrelevant.

For finding augmenting an object you need to find camera pose and orientation. That is the same as finding the camera extrinsics. You also have to calculate first the camera intrinsics (which is called calibraiton).
OpenCV allows you to do all of this, but is not trivial, it requires work on your own. I give you a clue, you first need to recognize something in the scene that you know how it looks, so you can calculate the camera pose by analyzing this object, call it a marker. You can start by the tipical fiducials, they are easy to detect.
Have a look at this thread.

I ended up measuring the field of view manually. Once you know FOV you can easily create the projection matrix. No need to worry about units because in the end the projection is of the form ( X*d/Z, Y*d/Z ). Whatever the units of X,Y,Z may be the ratio of X/Z remains the same.

Related

How to obtain the orientation of a square in the real world from an image knowing the centroid position in the real world and the image projection

I'm trying to obtain the orientation of a square in the real world from an image. I know the projection of each vertex in the image and with this and a depth camera I can obtain the position of the centroid in the real world.
I need the orientation of the square (actually, the normal vector to the plane) and the depth camera has not enough resolution. The camera parameters are also known.
I've search and I've only found estimation algorithms too overkill for problems with much less information. But in this case, I have a lot of data of the shape, distance, camera, image, etc. but I am not being able to get it.
Thanks in advance.
I assume the image is captured with an ordinary camera, and that your "square" is well approximated by an actual geometrical rectangle, with parallel opposite sides and orthogonal adjacent ones
If you only need the square's normal, and the camera is calibrated (in particular, the nonlinear lens distortion is removed from the image), then it can trivially be obtained from the vanishing points and the center. The algorithm is as follows:
Express the images of the four vertices p_i, i=1..4, in homogeneous coordinates: p_i = (u_i, v_i, 1). The ordering of i is unimportant, but in the following I assume it's clockwise starting from any one vertex. Also, for convenience, where in the following I write, say, i + n, it's assumed that the addition is modulo 4, so that, e.g., i + 1 = 1 when i = 4.
Compute the equations of the lines covering the square sides: l_i = p_(i+1) X p_i, where X represents the cross product.
Compute the equations of the diagonals: d_13 = p_1 X p_3, d_24 = p_2 X p_4.
Compute the center: c = d_13 X d_24.
Compute the vanishing points of the pairs of parallel sides: v_13 = l_1 X l_3, v_24 = l_2 X l_4. They represent the directions of the images of two lines which, in 3D, are orthogonal to each other.
Compute the images of the axes the 3D orthogonal coordinate frame rooted at the square center, and with two of the axes parallel to the square sides: x = c X v_13, y = c X v_24.
Lastly, the plane normal, in 3D camera coordinate frame, is their cross product: z = x X y .
Note that removing the distortion is important, because even a small amount of distortion can greatly affects the location of the vanishing points when the square sides are nearly parallel.
If you want to know why this works, the following excerpt from Hartley and Zisserman's "Multiple View Geometry in Computer Vision" should sufice:

How to get the transformation matrix of a 3d model to object in a 2d image

Given an object's 3D mesh file and an image that contains the object, what are some techniques to get the orientation/pose parameters of the 3d object in the image?
I tried searching for some techniques, but most seem to require texture information of the object or at least some additional information. Is there a way to get the pose parameters using just an image and a 3d mesh file (wavefront .obj)?
Here's an example of a 2D image that can be expected.
FOV of camera
Field of view of camera is absolute minimum to know to even start with this (how can you determine how to place object when you have no idea how it would affect scene). Basically you need transform matrix that maps from world GCS (global coordinate system) to Camera/Screen space and back. If you do not have a clue what about I am writing then perhaps you should not try any of this before you learn the math.
For unknown camera you can do some calibration based on markers or etalones (known size and shape) in the view. But much better is use real camera values (like FOV angles in x,y direction, focal length etc ...)
The goal for this is to create function that maps world GCS(x,y,z) into Screen LCS(x,y).
For more info read:
transform matrix anatomy
3D graphic pipeline
Perspective projection
Silhouette matching
In order to compare rendered and real image similarity you need some kind of measure. As you need to match geometry I think silhouette matching is the way (ignoring textures, shadows and stuff).
So first you need to obtain silhouettes. Use image segmentation for that and create ROI mask of your object. For rendered image is this easy as you van render the object with single color without any lighting directly into ROI mask.
So you need to construct function that compute the difference between silhouettes. You can use any kind of measure but I think you should start with non overlapping areas pixel count (it is easy to compute).
Basically you count pixels that are present only in one ROI (region of interest) mask.
estimate position
as you got the mesh then you know its size so place it in the GCS so rendered image has very close bounding box to real image. If you do not have FOV parameters then you need to rescale and translate each rendered image so it matches images bounding box (and as result you obtain only orientation not position of object of coarse). Cameras have perspective so the more far from camera you place your object the smaller it will be.
fit orientation
render few fixed orientations covering all orientations with some step 8^3 orientations. For each compute the difference of silhouette and chose orientation with smallest difference.
Then fit the orientation angles around it to minimize difference. If you do not know how optimization or fitting works see this:
How approximation search works
Beware too small amount of initial orientations can cause false positioves or missed solutions. Too high amount will be slow.
Now that was some basics in a nutshell. As your mesh is not very simple you may need to tweak this like use contours instead of silhouettes and using distance between contours instead of non overlapping pixels count which is really hard to compute ... You should start with simpler meshes like dice , coin etc ... and when grasping all of this move to more complex shapes ...
[Edit1] algebraic approach
If you know some points in the image that coresponds to known 3D points (in your mesh) then you can along with the FOV of the camera used compute the transform matrix placing your object ...
if the transform matrix is M (OpenGL style):
M = xx,yx,zx,ox
xy,yy,zy,oy
xz,yz,zz,oz
0, 0, 0, 1
Then any point from your mesh (x,y,z) is transformed to global world (x',y',z') like this:
(x',y',z') = M * (x,y,z)
The pixel position (x'',y'') is done by camera FOV perspective projection like this:
y''=FOVy*(z'+focus)*y' + ys2;
x''=FOVx*(z'+focus)*x' + xs2;
where camera is at (0,0,-focus), projection plane is at z=0 and viewing direction is +z so for any focal length focus and screen resolution (xs,ys):
xs2=xs*0.5;
ys2=ys*0.5;
FOVx=xs2/focus;
FOVy=ys2/focus;
When put all this together you obtain this:
xi'' = ( xx*xi + yx*yi + zx*zi + ox ) * ( xz*xi + yz*yi + zz*zi + ox + focus ) * FOVx
yi'' = ( xy*xi + yy*yi + zy*zi + oy ) * ( xz*xi + yz*yi + zz*zi + oy + focus ) * FOVy
where (xi,yi,zi) is i-th known point 3D position in mesh local coordinates and (xi'',yi'') is corresponding known 2D pixel positions. So unknowns are the M values:
{ xx,xy,xz,yx,yy,yx,zx,zy,zz,ox,oy,oz }
So we got 2 equations per each known point and 12 unknowns total. So you need to know 6 points. Solve the system of equations and construct your matrix M.
Also you can exploit that M is a uniform orthogonal/orthonormal matrix so vectors
X = (xx,xy,xz)
Y = (yx,yy,yz)
Z = (zx,zy,zz)
Are perpendicular to each other so:
(X.Y) = (Y.Z) = (Z.X) = 0.0
Which can lower the number of needed points by introducing these to your system. Also you can exploit cross product so if you know 2 vectors the thirth can be computed
Z = (X x Y)*scale
So instead of 3 variables you need just single scale (which is 1 for orthonormal matrix). If I assume orthonormal matrix then:
|X| = |Y| = |Z| = 1
so we got 6 additional equations (3 x dot, and 3 for cross) without any additional unknowns so 3 point are indeed enough.

calculate the real distance between to point using image

I do some image processing task in 3D and I have a problem.
I use a simulator which provides me an special kind of cameras which can tell the distance between the position of camera and any arbitrary point, using the pixels of that point in the image of camera. For example I can get the distance between camera and the object which is placed in pixel 21:34.
Now I need to calculate the real distance between two arbitrary pixels in the image of camera.
It is easy when camera is vertical and placed on the above of the field and all objects are on the ground but when camera is horizontal the depth of objects in image is different.
So, how should I do?
Simple 3D reconstruction will accomplish this. The distance from camera to points in 3D is along optical axis that is Z, which you already have. You will need X, Y as well:
X = u*Z/f;
Y = v*Z/f,
where f is camera focal length in pixels, Z your distance in mm or meters and u,v is an image centered coordinates: u = column-width/2, v = height/2-row. Note the asymmetry due to the fact that rows go down while Y and v go up. As soon as you get your X, Y, Z the distance in 3D is given by Euclidean formula:
dist = sqrt((X1-X2)2+(Y1-Y2)2+(Z1-Z2)2)

Camera motion from corresponding images

I'm trying to calculate a new camera position based on the motion of corresponding images.
the images conform to the pinhole camera model.
As a matter of fact, I don't get useful results, so I try to describe my procedure and hope that somebody can help me.
I match the features of the corresponding images with SIFT, match them with OpenCV's FlannBasedMatcher and calculate the fundamental matrix with OpenCV's findFundamentalMat (method RANSAC).
Then I calculate the essential matrix by the camera intrinsic matrix (K):
Mat E = K.t() * F * K;
I decompose the essential matrix to rotation and translation with singular value decomposition:
SVD decomp = SVD(E);
Matx33d W(0,-1,0,
1,0,0,
0,0,1);
Matx33d Wt(0,1,0,
-1,0,0,
0,0,1);
R1 = decomp.u * Mat(W) * decomp.vt;
R2 = decomp.u * Mat(Wt) * decomp.vt;
t1 = decomp.u.col(2); //u3
t2 = -decomp.u.col(2); //u3
Then I try to find the correct solution by triangulation. (this part is from http://www.morethantechnical.com/2012/01/04/simple-triangulation-with-opencv-from-harley-zisserman-w-code/ so I think that should work correct).
The new position is then calculated with:
new_pos = old_pos + -R.t()*t;
where new_pos & old_pos are vectors (3x1), R the rotation matrix (3x3) and t the translation vector (3x1).
Unfortunately I got no useful results, so maybe anyone has an idea what could be wrong.
Here are some results (just in case someone can confirm that any of them is definitely wrong):
F = [8.093827077399547e-07, 1.102681999632987e-06, -0.0007939604310854831;
1.29246107737264e-06, 1.492629957878578e-06, -0.001211264339006535;
-0.001052930954975217, -0.001278667878010564, 1]
K = [150, 0, 300;
0, 150, 400;
0, 0, 1]
E = [0.01821111092414898, 0.02481034499174221, -0.01651092283654529;
0.02908037424088439, 0.03358417405226801, -0.03397110489649674;
-0.04396975675562629, -0.05262169424538553, 0.04904210357279387]
t = [0.2970648246214448; 0.7352053067682792; 0.6092828956013705]
R = [0.2048034356172475, 0.4709818957303019, -0.858039396912323;
-0.8690270040802598, -0.3158728880490416, -0.3808101689488421;
-0.4503860776474556, 0.8236506374002566, 0.3446041331317597]
First of all you should check if
x' * F * x = 0
for your point correspondences x' and x. This should be of course only the case for the inliers of the fundamental matrix estimation with RANSAC.
Thereafter, you have to transform your point correspondences to normalized image coordinates (NCC) like this
xn = inv(K) * x
xn' = inv(K') * x'
where K' is the intrinsic camera matrix of the second image and x' are the points of the second image. I think in your case it is K = K'.
With these NCCs you can decompose your essential matrix like you described. You triangulate the normalized camera coordinates and check the depth of your triangulated points. But be careful, in literature they say that one point is sufficient to get the correct rotation and translation. From my experience you should check a few points since one point can be an outlier even after RANSAC.
Before you decompose the essential matrix make sure that E=U*diag(1,1,0)*Vt. This condition is required to get correct results for the four possible choices of the projection matrix.
When you've got the correct rotation and translation you can triangulate all your point correspondences (the inliers of the fundamental matrix estimation with RANSAC). Then, you should compute the reprojection error. Firstly, you compute the reprojected position like this
xp = K * P * X
xp' = K' * P' * X
where X is the computed (homogeneous) 3D position. P and P' are the 3x4 projection matrices. The projection matrix P is normally given by the identity. P' = [R, t] is given by the rotation matrix in the first 3 columns and rows and the translation in the fourth column, so that P is a 3x4 matrix. This only works if you transform your 3D position to homogeneous coordinates, i.e. 4x1 vectors instead of 3x1. Then, xp and xp' are also homogeneous coordinates representing your (reprojected) 2D positions of your corresponding points.
I think the
new_pos = old_pos + -R.t()*t;
is incorrect since firstly, you only translate the old_pos and you do not rotate it and secondly, you translate it with a wrong vector. The correct way is given above.
So, after you computed the reprojected points you can calculate the reprojection error. Since you are working with homogeneous coordinates you have to normalize them (xp = xp / xp(2), divide by last coordinate). This is given by
error = (x(0)-xp(0))^2 + (x(1)-xp(1))^2
If the error is large such as 10^2 your intrinsic camera calibration or your rotation/translation are incorrect (perhaps both). Depending on your coordinate system you can try to inverse your projection matrices. On that account you need to transform them to homogeneous coordinates before since you cannot invert a 3x4 matrix (without the pseudo inverse). Thus, add the fourth row [0 0 0 1], compute the inverse and remove the fourth row.
There is one more thing with reprojection error. In general, the reprojection error is the squared distance between your original point correspondence (in each image) and the reprojected position. You can take the square root to get the Euclidean distance between both points.
To update your camera position, you have to update the translation first, then update the rotation matrix.
t_ref += lambda * (R_ref * t);
R_ref = R * R_ref;
where t_ref and R_ref are your camera state, R and t are new calculated camera rotation and translation, and lambda is the scale factor.

OpenCV extrinsic camera from feature points

How do I retrieve the rotation matrix, the translation vector and maybe some scaling factors of each camera using OpenCV when I have pictures of an object from the view of each of these cameras? For every picture I have the image coordinates of several feature points. Not all feature points are visible in all of the pictures.
I want to map the computed 3D coordinates of the feature points of the object to a slightly different object to align the shape of the second object to the first object.
I heard it is possible using cv::calibrateCamera(...) but I can't get quite through it...
Does someone have experiences with that kind of problem?
I was confronted with the same problem as you, in OpenCV. I had a stereo image pair and I wanted to computed the external parameters of the cameras and the world coordinates of all observed points. This problem has been treated here:
Berthold K. P. Horn. Relative orientation revisited. Berthold K. P. Horn. Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 545 Technology ...
http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.64.4700
However, I wasn't able to find a suitable implementation of this problem (perhaps you will find one). Due to time limitations I did not have time to understand all the maths in this paper and implement it myself, so I came up with a quick-and-dirty solution that works for me. I will explain what I did to solve it:
Assuming we have two cameras, where the first camera has external parameters RT = Matx::eye(). Now make a guess about the the rotation R of the second camera. For every pair of image points observed in both images, we compute the directions of their corresponding rays in world coordinates and store them in a 2d-array dirs (EDIT: The internal camera parameters are assumed to be known). We can do this since we assume that we know the orientation of every camera. Now we build an overdetermined linear system AC = 0 where C is the centre of the second camera. I provide you with the function to compute A:
Mat buildA(Matx<double, 3, 3> &R, Array<Vec3d, 2> dirs)
{
CV_Assert(dirs.size(0) == 2);
int pointCount = dirs.size(1);
Mat A(pointCount, 3, DataType<double>::type);
Vec3d *a = (Vec3d *)A.data;
for (int i = 0; i < pointCount; i++)
{
a[i] = dirs(0, i).cross(toVec(R*dirs(1, i)));
double length = norm(a[i]);
if (length == 0.0)
{
CV_Assert(false);
}
else
{
a[i] *= (1.0/length);
}
}
return A;
}
Then calling cv::SVD::solveZ(A) will give you the least-squares solution of norm 1 to this system. This way, you obtain the rotation and translation of the second camera. However, since I just made a guess about the rotation of the second camera, I make several guesses about its rotation (parameterized using a 3x1 vector omega from which i compute the rotation matrix using cv::Rodrigues) and then I refine this guess by solving the system AC = 0 repetedly in a Levenberg-Marquardt optimizer with numeric jacobian. It works for me but it is a bit dirty, so you if you have time, I encourage you to implement what is explained in the paper.
EDIT:
Here is the routine in the Levenberg-Marquardt optimizer for evaluating the vector of residues:
void Stereo::eval(Mat &X, Mat &residues, Mat &weights)
{
Matx<double, 3, 3> R2Ref = getRot(X); // Map the 3x1 euler angle to a rotation matrix
Mat A = buildA(R2Ref, _dirs); // Compute the A matrix that measures the distance between ray pairs
Vec3d c;
Mat cMat(c, false);
SVD::solveZ(A, cMat); // Find the optimum camera centre of the second camera at distance 1 from the first camera
residues = A*cMat; // Compute the output vector whose length we are minimizing
weights.setTo(1.0);
}
By the way, I searched a little more on the internet and found some other code that could be useful for computing the relative orientation between cameras. I haven't tried any code yet, but it seems useful:
http://www9.in.tum.de/praktika/ppbv.WS02/doc/html/reference/cpp/toc_tools_stereo.html
http://lear.inrialpes.fr/people/triggs/src/
http://www.maths.lth.se/vision/downloads/
Are these static cameras which you wish to calibrate for future use as a stereo pair? In this case you would want to use the cv::stereoCalibrate() function. OpenCV contains some sample code, one of which is stereo_calib.cpp which may be worth investigating.

Resources