I am trying to compute the bird's eye view of a given image. I also have the rotations and translations (and the intrinsic matrix) required to convert it into the bird's eye plane. My aim is to find an inverse homography matrix (3x3).
import numpy as np
import cv2

# Rotation about the x axis by angle R_x (4x4, homogeneous)
rotation_x = np.asarray([[1, 0,            0,            0],
                         [0, np.cos(R_x), -np.sin(R_x),  0],
                         [0, np.sin(R_x),  np.cos(R_x),  0],
                         [0, 0,            0,            1]], np.float32)

# Translation along the camera axis (4x4, homogeneous)
translation = np.asarray([[1, 0, 0, 0],
                          [0, 1, 0, 0],
                          [0, 0, 1, -t_y / (dp_y * np.sin(R_x))],
                          [0, 0, 0, 1]], np.float32)

# Intrinsic matrix (3x4)
intrinsic = np.asarray([[s_x * f / dp_x, 0,        0, 0],
                        [0,              f / dp_y, 0, 0],
                        [0,              0,        1, 0]], np.float32)

# The projection matrix to lift the image coordinates from (x, y, 1)
# to the 3-D domain (x, y, 0, 1); not sure if this is the right approach
projection = np.asarray([[1, 0, 0],
                         [0, 1, 0],
                         [0, 0, 0],
                         [0, 0, 1]], np.float32)

homography_matrix = intrinsic @ translation @ rotation_x @ projection
inv = cv2.warpPerspective(source_image, homography_matrix, (w, h),
                          flags=cv2.INTER_CUBIC | cv2.WARP_INVERSE_MAP)
My question is: is this the right approach? It works when I manually pick a suitable t_y and R_x, but not with the (t_y, R_x) values that are actually provided.
First premise: your bird's eye view will be correct only for one specific plane in the image, since a homography can only map planes (including the plane at infinity, corresponding to a pure camera rotation).
Second premise: if you can identify a quadrangle in the first image that is the projection of a rectangle in the world, you can directly compute the homography that maps the quad into the rectangle (i.e. the "bird's eye view" of the quad), and warp the image with it, setting the scale so the image warps to a desired size. No need to use the camera intrinsics. Example: you have the image of a building with rectangular windows, and you know the width/height ratio of these windows in the world.
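As a minimal sketch of this second approach (the corner coordinates and output size below are made up for illustration), cv2.getPerspectiveTransform maps the four image corners of the quad onto the corners of the desired rectangle:

import numpy as np
import cv2

# Four corners of the quad in the source image (hypothetical values),
# ordered top-left, top-right, bottom-right, bottom-left
quad = np.float32([[205, 120], [430, 105], [470, 350], [180, 365]])

# Corners of the target rectangle; the 300x200 size encodes the known
# width/height ratio of the rectangle and sets the output scale
rect = np.float32([[0, 0], [300, 0], [300, 200], [0, 200]])

H = cv2.getPerspectiveTransform(quad, rect)          # 3x3 homography
top_view = cv2.warpPerspective(source_image, H, (300, 200))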
Sometimes you can't find rectangles, but your camera is calibrated, and thus the problem you describe comes into play. Let's do the math. Assume the plane you are observing in the given image is Z=0 in world coordinates. Let K be the 3x3 intrinsic camera matrix and [R, t] the 3x4 matrix representing the camera pose in XYZ world frame, so that if Pc and Pw represent the same 3D point respectively in camera and world coordinates, it is Pc = R*Pw + t = [R, t] * [Pw.T, 1].T, where .T means transposed. Then you can write the camera projection as:
s * p = K * [R, t] * [Pw.T, 1].T
where s is an arbitrary scale factor and p is the pixel that Pw projects onto. But if Pw=[X, Y, Z].T is on the Z=0 plane, the 3rd column of R only multiplies zeros, so we can ignore it. If we then denote with r1 and r2 the first two columns of R, we can rewrite the above equation as:
s * p = K * [r1, r2, t] * [X, Y, 1].T
But K * [r1, r2, t] is a 3x3 matrix that transforms points on a 3D plane to points on the camera plane, so it is a homography.
If the plane is not Z=0, you can repeat the same argument replacing [R, t] with [R, t] * inv([Rp, tp]), where [Rp, tp] is the coordinate transform that maps a frame on the plane, with the plane normal being the Z axis, to the world frame.
Finally, to obtain the bird's eye view, you select a rotation R whose third column (the components of the world's Z axis in camera frame) is opposite to the plane's normal.
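Putting this together in code, here is a minimal numpy/OpenCV sketch (assuming K, R and t for the Z=0 plane are already known as numpy arrays, and that scale, offset_x, offset_y, w, h are chosen by hand to place the result inside the output image):

import numpy as np
import cv2

# Homography from the Z=0 world plane to the image: H = K * [r1, r2, t]
H_plane_to_image = K @ np.column_stack((R[:, 0], R[:, 1], t))

# Similarity that maps world-plane units to output pixels
S = np.array([[scale, 0,     offset_x],
              [0,     scale, offset_y],
              [0,     0,     1]], dtype=np.float64)

# Bird's eye view: go from image to plane (inverse homography),
# then from plane units to output pixels
bird_eye = cv2.warpPerspective(source_image,
                               S @ np.linalg.inv(H_plane_to_image),
                               (w, h), flags=cv2.INTER_CUBIC)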
I have an input image and a grid passed to torch.nn.functional.grid_sample(). Now if I have a random pixel location (x, y) in the input image, how can I find out its location in the output of grid_sample()?
To be precise I am looking for the delta of each pixel in terms of coordinates.
Would this be sufficient for finding the new location of a pixel:
ix = ((ix + 1) / 2) * (IW-1);
iy = ((iy + 1) / 2) * (IH-1);
as mentioned in https://github.com/pytorch/pytorch/blob/f064c5aa33483061a48994608d890b968ae53fb5/aten/src/THNN/generic/SpatialGridSamplerBilinear.c
How did you compute the grid? It must be based on some transform. Often, the affine_grid function is used. And this function takes the transformation matrix as input.
Given this transformation matrix (and its inverse), you can go in both directions: from input image pixel location to output image pixel location, and the other way round.
Here is some sample code showing how to compute the transforms for both the forward and backward direction. In the last lines you see how to map a pixel location in both directions.
import torch
import torch.nn.functional as F
# given a transform mapping from output to input, create the sample grid
input_tensor = torch.zeros([1, 1, 2, 2]) # batch x channels x height x width
transform = torch.tensor([[[0.5, 0, 0], [0, 1, 3]]]).float()
grid = F.affine_grid(transform, input_tensor.size(), align_corners=True)
# show the grid
print('GRID')
print('x', grid[0, ..., 0])
print('y', grid[0, ..., 1])
# compute both transformation matrices (forward and backward) with shape 3x3
print('TRANSFORM AND INVERSE')
transform_full = torch.zeros([1, 3, 3])
transform_full[0, 2, 2] = 1
transform_full[0, :2, :3] = transform
transform_inv_full = torch.inverse(transform_full)
print(transform_full)
print(transform_inv_full)
# map pixel location x=2, y=3 in both directions (forward and backward)
print('TRANSFORMED PIXEL LOCATIONS')
print(transform_full @ torch.tensor([[2, 3, 1]]).float().T)
print(transform_inv_full @ torch.tensor([[2, 3, 1]]).float().T)
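Note that affine_grid and grid_sample work in normalized coordinates in [-1, 1], so to get actual pixel indices you still need the denormalization step quoted in the question. A small sketch (IW and IH are the input width and height, align_corners=True convention):

# Convert a normalized grid coordinate back to a pixel index,
# as in the THNN source linked above
def to_pixel(ix, iy, IW, IH):
    px = ((ix + 1) / 2) * (IW - 1)
    py = ((iy + 1) / 2) * (IH - 1)
    return px, py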
Sample code: https://developer.apple.com/documentation/arkit/visualizing_a_point_cloud_using_scene_depth
In the code, when unprojecting the depth map into world points, a positive z value (depth value) is used. But in my understanding, ARKit uses a right-handed coordinate system, which means points with a positive z value are behind the camera. So maybe we need to do some extra work to align the coordinate systems (using the rotateToARCamera matrix?). But I cannot understand why we need to flip both the Y and Z axes.
static func makeRotateToARCameraMatrix(orientation: UIInterfaceOrientation) -> matrix_float4x4 {
    // flip to ARKit Camera's coordinate
    let flipYZ = matrix_float4x4(
        [1,  0,  0, 0],
        [0, -1,  0, 0],
        [0,  0, -1, 0],
        [0,  0,  0, 1])

    let rotationAngle = Float(cameraToDisplayRotation(orientation: orientation)) * .degreesToRadian
    return flipYZ * matrix_float4x4(simd_quaternion(rotationAngle, Float3(0, 0, 1)))
}
Update: I guess the key point is that the coordinate system used by the camera intrinsics matrix's pinhole model points the opposite way compared to the normal camera space in ARKit.
The depth map uses a coordinate system where, like image data, the Y coordinate is smaller at the top and larger at the bottom, whereas ARKit uses a coordinate system where the Y coordinate is smaller at the bottom and larger at the top.
For this reason, I think it is necessary to invert the Y coordinate.
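A small numpy sketch of what that flip does; this is only an illustration of the flipYZ matrix from the Swift code above, with a made-up point, not ARKit API code:

import numpy as np

# Pinhole/image convention: x right, y down, z forward (in front of the camera)
point_pinhole = np.array([0.1, 0.2, 1.5, 1.0])   # hypothetical point 1.5 m ahead

# Same flip as the flipYZ matrix above
flip_yz = np.diag([1.0, -1.0, -1.0, 1.0])

# ARKit camera convention: x right, y up, z backward, so a point in front
# of the camera ends up with a negative z (and its y is mirrored)
point_arkit = flip_yz @ point_pinhole
print(point_arkit)   # [ 0.1 -0.2 -1.5  1. ]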
I am trying to create a 2D perspective transform matrix from individual components like translation, rotation, scale and shear. But at the end the matrix is not producing a true perspective effect like the image below. I think I am missing some component in the code that I wrote to create the matrix. Could someone help me add the missing components and their formulation in the function below? I have used the OpenCV library for my code.
cv::Mat getPerspMatrix2D(double rz, double s, double tx, double ty, double shx, double shy)
{
    cv::Mat R = (cv::Mat_<double>(3,3) <<
        cos(rz), -sin(rz), 0,
        sin(rz),  cos(rz), 0,
        0,        0,       1);

    cv::Mat S = (cv::Mat_<double>(3,3) <<
        s, 0, 0,
        0, s, 0,
        0, 0, 1);

    cv::Mat Sh = (cv::Mat_<double>(3,3) <<
        1,   shx, 0,
        shy, 1,   0,
        0,   0,   1);

    cv::Mat T = (cv::Mat_<double>(3,3) <<
        1, 0, tx,
        0, 1, ty,
        0, 0, 1);

    return T * Sh * S * R;
}
Keywords are homography and 8 DOF. As shown in 1 and 2, there exist two additional coefficients for the perspective part of the transformation, but a second step is needed to apply them. I'm not familiar with OpenCV, but I hope to answer your question, a bit late, in a basic way ;-)
Step 1
You can think of lx as describing a vanishing point on the x axis. The image shows a31 = lx = 1. A larger value such as lx = 100 gives less transformation; for lx = 0 the vanishing point is infinitely far away, which means no perspective transform (identity matrix).

     [ 1   0   0 ]
PL = [ 0   1   0 ]
     [ lx  ly  1 ]

lx and ly are the perspective foreshortening parameters.
Step 2
When you apply a right-hand matrix multiplication P * [u; v; 1] you will notice that the last value in the result is sometimes different from 1. For an affine transformation it is always 1; for a perspective projection it is not. In the 2nd step the result is scaled so that the last coefficient becomes 1. This is part of the effect.
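A tiny numpy illustration of that second step (the values of lx, ly and the pixel are arbitrary):

import numpy as np

lx, ly = 0.001, 0.002                 # arbitrary foreshortening parameters
PL = np.array([[1,  0,  0],
               [0,  1,  0],
               [lx, ly, 1]], dtype=np.float64)

p = PL @ np.array([200, 100, 1])      # arbitrary pixel [u, v, 1]
print(p)                              # last coefficient is 1.4, not 1
print(p / p[2])                       # step 2: scale so it becomes 1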
Your Example Image
Image' = P4 x P3 x P2 x P1 x Image
1. I would translate the center of the blue rectangle to the origin: tx = -w/2 and ty = -h/2 (= P1).
2. Apply the projective transformation with ly = h, to make both sides slant at an angle (= P2).
3. Then translate back so that all points are located in one quadrant (= P3).
4. Finally, scale to the desired size (= P4).
The scaling from Step 2 of the perspective projection can be done after step 2 or at the end. A sketch of this chain follows below.
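Here is a minimal Python/OpenCV sketch of this chain; img is assumed to be the source image, and the choice of ly (written here as 1/h so that the effect stays moderate) as well as the output size are just illustrative:

import numpy as np
import cv2

h, w = img.shape[:2]
P1 = np.array([[1, 0, -w / 2],          # move the rectangle center to the origin
               [0, 1, -h / 2],
               [0, 0, 1]], dtype=np.float64)
P2 = np.array([[1, 0, 0],               # perspective part (lx = 0, ly = 1/h)
               [0, 1, 0],
               [0, 1.0 / h, 1]], dtype=np.float64)
P3 = np.array([[1, 0, w / 2],           # translate back into one quadrant
               [0, 1, h / 2],
               [0, 0, 1]], dtype=np.float64)
P4 = np.eye(3)                          # optional scaling to the desired size

P = P4 @ P3 @ P2 @ P1
warped = cv2.warpPerspective(img, P, (w, h))   # warpPerspective does the division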
How can I calculate distance from camera to a point on a ground plane from an image?
I have the intrinsic parameters of the camera and the position (height, pitch).
Is there any OpenCV function that can estimate that distance?
You can use undistortPoints to compute the rays backprojecting the pixels, but that API is rather hard to use for your purpose. It may be easier to do the calculation "by hand" in your code. Doing it at least once will also help you understand what exactly that API is doing.
Express your "position (height, pitch)" of the camera as a rotation matrix R and a translation vector t, representing the coordinate transform from the origin of the ground plane to the camera. That is, given a point in ground plane coordinates Pg = [Xg, Yg, Zg], its coordinates in camera frame are given by
Pc = R * Pg + t
The camera center is Cc = [0, 0, 0] in camera coordinates. In ground coordinates it is then:
Cg = inv(R) * (-t) = -R' * t
where inv(R) is the inverse of R, R' is its transpose, and the last equality is due to R being an orthogonal matrix.
Let's assume, for simplicity, that the ground plane is Zg = 0.
Let K be the matrix of intrinsic parameters. Given a pixel q = [u, v], write it in homogeneous image coordinates Q = [u, v, 1]. Its location in camera coordinates is
Qc = Ki * Q
where Ki = inv(K) is the inverse of the intrinsic parameters matrix. The same point in world coordinates is then
Qg = R' * Qc + Cg
All the points Pg = [Xg, Yg, Zg] that belong to the ray from the camera center through that pixel, expressed in ground coordinates, are then on the line
Pg = Cg + lambda * (Qg - Cg)
for lambda going from 0 to positive infinity. This last formula represents three equations in ground XYZ coordinates, and you want to find the values of X, Y, Z and lambda where the ray intersects the ground plane. But that means Zg=0, so you have only 3 unknowns. Solve them (you recover lambda from the 3rd equation, then substitute in the first two), and you get Xg and Yg of the solution to your problem.
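A numpy sketch of this calculation, assuming K is the 3x3 intrinsic matrix, R the 3x3 rotation and t the length-3 translation defined above, and (u, v) the pixel you want the distance for:

import numpy as np

# Camera center in ground coordinates: Cg = -R' * t
Cg = -R.T @ t

# Backproject the pixel: Qc = inv(K) * [u, v, 1], then move to ground coordinates
Qc = np.linalg.inv(K) @ np.array([u, v, 1.0])
Qg = R.T @ Qc + Cg

# Intersect the ray Pg = Cg + lambda * (Qg - Cg) with the plane Zg = 0
d = Qg - Cg
lam = -Cg[2] / d[2]
Pg = Cg + lam * d                      # point on the ground plane

distance = np.linalg.norm(Pg - Cg)     # distance from the camera center to that point
print(Pg, distance)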
I already know that solvePnP() finds the position (rotation and translation) of the camera using 2D point coordinates and the corresponding 3D point coordinates, but I don't really understand why I have to use it after I have triangulated some 3D points from 2 cameras and their corresponding 2D points.
Because while triangulating a new 3D point, I already have (need) the projection matrices P1 and P2 of the two cameras (which consist of the rotations R1, R2 and translations t1, t2, and already give the location of the cameras w.r.t. the newly triangulated 3D point).
My workflow is:
Get 2D-correspondences from 2 images.
Get Essential Matrix E using these 2D-correspondences.
Get relative orientation (R, t) of the 2 images from the Essential Matrix E.
Set Projection Matrix P1 of camera1 to
P1 = (1, 0, 0, 0,
      0, 1, 0, 0,
      0, 0, 1, 0);
and set Projection Matrix P2 of camera2 to
P2 = (R.at<double>(0, 0), R.at<double>(0, 1), R.at<double>(0, 2), t.at<double>(0),
R.at<double>(1, 0), R.at<double>(1, 1), R.at<double>(1, 2), t.at<double>(1),
R.at<double>(2, 0), R.at<double>(2, 1), R.at<double>(2, 2), t.at<double>(2));
Solve least squares problem
P1 * X = x1
P2 * X = x2
(solving for X, the 3D point), and so on.
After that I get a triangulated 3D point X from these projection matrices P1 and P2 and the 2D point correspondences x1 and x2.
My question is now again:
Why do I need to use solvePnP() now to get the camera locations? I already have P1 and P2, which should already be the locations of the cameras (w.r.t. the triangulated 3D points).
You don't have to have each camera's absolute pose - only the relative R|t is needed. You can't take the bare identity as the projection matrix for the first or second camera - the intrinsic projection has to be modeled as well. You can calculate the intrinsic matrices with a planar pattern and the calibrate-camera method.
You can assume R1 = I and t1 = 0 (rotation and translation of the first camera). Therefore R2 = R and t2 = t. The triangulated 3D points will then have coordinates in the first camera's coordinate system.
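A short OpenCV/Python sketch of this setup; pts1 and pts2 are assumed to be the matched 2D correspondences as Nx2 float arrays and K the intrinsic matrix, all existing already:

import numpy as np
import cv2

# Relative pose from the essential matrix (first camera taken as R1 = I, t1 = 0)
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

# Projection matrices include the intrinsics, as noted above
P1 = K @ np.hstack((np.eye(3), np.zeros((3, 1))))
P2 = K @ np.hstack((R, t))

# Triangulate; the result is expressed in the first camera's coordinate system
X_h = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)   # 4xN homogeneous
X = (X_h[:3] / X_h[3]).T                              # Nx3 Euclidean points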