Point Cloud from KITTI stereo images - opencv

I try to create a Point Cloud based on the images from the KITTI stereo images dataset so then later I could estimate 3D position of some objects.
Original images looks like this.
What I have so far:
generated disparity with cv2.StereoSGBM_create
window_size = 9
minDisparity = 1
stereo = cv2.StereoSGBM_create(
blockSize=10,
numDisparities=64,
preFilterCap=10,
minDisparity=minDisparity,
P1=4 * 3 * window_size ** 2,
P2=32 * 3 * window_size ** 2
)
calculated Q matrix with cv2.stereoRectify using data from KITTI calibration files.
# K_xx: 3x3 calibration matrix of camera xx before rectification
K_L = np.matrix(
[[9.597910e+02, 0.000000e+00, 6.960217e+02],
[0.000000e+00, 9.569251e+02, 2.241806e+02],
[0.000000e+00, 0.000000e+00, 1.000000e+00]])
K_R = np.matrix(
[[9.037596e+02, 0.000000e+00, 6.957519e+02],
[0.000000e+00, 9.019653e+02, 2.242509e+02],
[0.000000e+00, 0.000000e+00, 1.000000e+00]])
# D_xx: 1x5 distortion vector of camera xx before rectification
D_L = np.matrix([-3.691481e-01, 1.968681e-01, 1.353473e-03, 5.677587e-04, -6.770705e-02])
D_R = np.matrix([-3.639558e-01, 1.788651e-01, 6.029694e-04, -3.922424e-04, -5.382460e-02])
# R_xx: 3x3 rotation matrix of camera xx (extrinsic)
R_L = np.transpose(np.matrix([[9.999758e-01, -5.267463e-03, -4.552439e-03],
[5.251945e-03, 9.999804e-01, -3.413835e-03],
[4.570332e-03, 3.389843e-03, 9.999838e-01]]))
R_R = np.matrix([[9.995599e-01, 1.699522e-02, -2.431313e-02],
[-1.704422e-02, 9.998531e-01, -1.809756e-03],
[2.427880e-02, 2.223358e-03, 9.997028e-01]])
# T_xx: 3x1 translation vector of camera xx (extrinsic)
T_L = np.transpose(np.matrix([5.956621e-02, 2.900141e-04, 2.577209e-03]))
T_R = np.transpose(np.matrix([-4.731050e-01, 5.551470e-03, -5.250882e-03]))
IMG_SIZE = (1392, 512)
rotation = R_L * R_R
translation = T_L - T_R
# output matrices from stereoRectify init
R1 = np.zeros(shape=(3, 3))
R2 = np.zeros(shape=(3, 3))
P1 = np.zeros(shape=(3, 4))
P2 = np.zeros(shape=(3, 4))
Q = np.zeros(shape=(4, 4))
R1, R2, P1, P2, Q, validPixROI1, validPixROI2 = cv2.stereoRectify(K_L, D_L, K_R, D_R, IMG_SIZE, rotation, translation,
R1, R2, P1, P2, Q,
newImageSize=(1242, 375))
The resulting matrix look like this (at this point I have a doubt that it is correct):
[[ 1. 0. 0. -614.37893072]
[ 0. 1. 0. -162.12583194]
[ 0. 0. 0. 680.05186262]
[ 0. 0. -1.87703644 0. ]]
Generated Point Cloud with reprojectImageTo3D which looks like this: point cloud
And now the questions part begins :)
Is it OK that all values returned by reprojectImageTo3D are negative?
What are the units of those values, taking into account that it is the KITTI dataset and their camera calibration data is available?
And finally, is it possible to convert those values to something like longitude\latitude if I have GPS coordinate of the camera that took those photos?
Would be appreciated for any help!

Is it OK for all values returned by reprojectImageTo3D to be negative?
Generally speaking, no, at least for Z values. The values returned by reprojectImageTo3D are real-world coordinates relative to the camera origin, so for a Z value to be negative it means the point is behind the camera (which is geometrically incorrect). The X and Y values can be negative, since the camera origin is at the center of the FOV, so a negative X value means the point is "to the left" and a negative Y value means the point is "below". But for Z values, no, they should not be negative.
Your Q matrix is turning out almost the identity, since I think you are incorrectly setting up the rotation matrices in your call to stereoRectify. When you pass rotation and translation, that is the single rotation from camera 1 to camera 2, not the combined rotation from camera 1 to camera 2. What you are doing is multiplying the two rotations together after transposing one of them; instead you should be passing only R_L (since from your description I assume this means that it is the rotation from left to right camera).
What are the units of those values, taking into account that it is the KITTI dataset and their camera calibration data is available?
I am not familiar with the KITTI dataset, but the values returned after calling reprojectImageTo3D are in real-world units, typically meters.
And finally, is it possible to convert those values to something like longitude\latitude if I have GPS coordinate of the camera that took those photos?
The coordinates returned by reprojectImageTo3D are in real-world coordinates relative to the camera origin. If you have the GPS coordinate of the camera that took the photos, you can manipulate the latitude/longitude values with the (X, Y, Z) coordinates returned from the reprojection.

Related

Apply relative radial distortion function to image w/o knowing anything about the camera

I've a radial distortion function which gives me relative distortion from 0 (image center) to the relative full image field (field height 1) in percent. For example this function would give me a distortion of up to 5% at the full relative field height of 1.
I tried to use this together with opencv undistort function to apply distortion but don't know how to fill the matrices.
As said, I've a source image only and don't know anything about the camera parameters like focal length, except that I know the distortion function.
How should I set the matrix in cv2.undistort(src_image, matrix, ...) ?
The OpenCv routine that's easier to use in your case is cv::remap, not undistort.
In the following I assume your distortion purely radial. Similar considerations apply if you have it already decomposed in (x, y).
So you have a distortion function d(r) of the distance r = sqrt((x - x_c)^2 + (y - y_c)^2) of a pixel (x, y) from the image center (x_c, y_c). The function expresses the relative change of the radius r_d of a pixel in the distorted image from the undistorted one r: (r_d - r) / r = d(r), or, equivalently, r_d = r * (1 - d(r)).
If you are given a distorted image, and want to remove the distortion, you need to invert the above equation (i.e. solve it analytically or numerically), finding the value of r for every r_d in the range of interest. Then you can trivially create two arrays, map_x and map_y, that represent the mapping from distorted to undistorted coordinates: for a given pair (x_d, y_d) of integer pixel coordinates in the distorted image, you compute the associated r_d = sqrt(((x_d - x_c)^2 + (y_d - y_c)^2), then the corresponding r as function of r_d from solving the equation, go back to (x, y), and assign map_x[y_d, x_d] = x; map_y[y_d, x_d] = y. Finally, you pass those to cv::remap.

Correct non-nadir view for GSD calculation (UAV)

Hello stackoverflow community,
So I am working on a project that requires calculating the ground sampling distance (GSD) in order to retrive the meter/pixel scale.
The GSD for nadir view (camera looking directly to the ground) formula is as follow :
GSD = (flight altitude x sensor height) / (focal length x image height and/or width).
and I read on multiple article like : https://www.mdpi.com/2072-4292/13/4/573
That if the camera has a tilt angle on one axis a correction as follow is requried :
where θ is the tilt angle and phi as they said in the article :
φ describes the angular position of the pixel in the image: it is
zero in correspondence of the optical axis of the camera, while it can
have positive or negative values for the other pixels
and the figure on their article is this :
So I hope you are on the same page as me, now I have two questions :
1- First how do I exactly calculate the angular position of a given pixel with respect to the optical axis (how to calculate the phi)
2- The camera in my case is rotated on two axis & not just one like their example, like the camera doesn't look exactly to the road but like oriented to one of the sides, more like this one :
So would there be more changes on the formula ? I am not sure how to get the right formula geometrically
The angular position of a pixel
As explained in the article you linked, you can compute the pixel angle by knowing the camera intrinsic parameters. Firstly let's do a bit of theory: the intrinsics matrix is used to compute the projection of a world point in the image plane of the camera. The OpenCV documentation explains it very well, it is expressed like this:
( x ) ( fx 0 cx ) ( X )
s * ( y ) = ( 0 fy cy ) * ( Y )
( 1 ) ( 0 0 1 ) ( Z )
where fx,fy is your focals, cx,cy is the optical centre, x,y is the position of the pixel in your image and X,Y,Z is your world point in meters or millimetres or whatever.
Now by inverting the matrix you can instead compute the world vector from the pixel position. World vector and not world point because the distance d between the camera and the real object is unknown.
( X ) ( x )
d * ( Y ) = A^-1 * ( y )
( Z ) ( 1 )
And then you can simply compute the angle between the optical axis and this world vector to get your phi angle, for example with the formula detailed in this answer using the y-axis of the camera as normal. In pseudo-code:
intrinsic_inv = invert(intrinsic)
world_vector = multiply(intrinsic_inv, (x, y, 1))
optical axis = (0, 0, 1)
normal = (0, 1, 0)
dot = dot_product(world_vector, optical_axis)
det = dot_product(normal, cross_product(world_vector, optical_axis))
phi = atan2(det, dot)
The camera angles
You can express the rotation of the camera by three angles: the tilt, the pan, and the roll angles. Take a look at this image I quickly googled if you want to visualize what they correspond to.
The tilt angle is the one named theta in your article, you already know it. The pan angle doesn't have an impact on the GSD, at least if we suppose that the ground is perfectly flat. If the pan angle was what you were referring to with the second rotation axis, then you'll have nothing to do.
However, if you have a non-zero roll angle this will become tricky. If you are in that case I would recommend a paradigm change to avoid dealing with angles. You can instead express the camera position using an affine transformation (rotation matrix and translation vector). This will allow you to transform the problem into a general analytical geometry problem, and then estimate the depths and scales by doing the intersection of the world vector with the ground plane. It would change the previous pseudo-code to give something like:
intrinsic_inv = invert(intrinsic)
world_vector = multiply(intrinsic_inv, (x, y, 1))
world_vector = multiply(rotation, world_vector) + translation
world_point = intersection(world_vector, ground_plane)
And then the scale can be computed by doing the differences between adjacent pixel world points.

Camera motion from corresponding images

I'm trying to calculate a new camera position based on the motion of corresponding images.
the images conform to the pinhole camera model.
As a matter of fact, I don't get useful results, so I try to describe my procedure and hope that somebody can help me.
I match the features of the corresponding images with SIFT, match them with OpenCV's FlannBasedMatcher and calculate the fundamental matrix with OpenCV's findFundamentalMat (method RANSAC).
Then I calculate the essential matrix by the camera intrinsic matrix (K):
Mat E = K.t() * F * K;
I decompose the essential matrix to rotation and translation with singular value decomposition:
SVD decomp = SVD(E);
Matx33d W(0,-1,0,
1,0,0,
0,0,1);
Matx33d Wt(0,1,0,
-1,0,0,
0,0,1);
R1 = decomp.u * Mat(W) * decomp.vt;
R2 = decomp.u * Mat(Wt) * decomp.vt;
t1 = decomp.u.col(2); //u3
t2 = -decomp.u.col(2); //u3
Then I try to find the correct solution by triangulation. (this part is from http://www.morethantechnical.com/2012/01/04/simple-triangulation-with-opencv-from-harley-zisserman-w-code/ so I think that should work correct).
The new position is then calculated with:
new_pos = old_pos + -R.t()*t;
where new_pos & old_pos are vectors (3x1), R the rotation matrix (3x3) and t the translation vector (3x1).
Unfortunately I got no useful results, so maybe anyone has an idea what could be wrong.
Here are some results (just in case someone can confirm that any of them is definitely wrong):
F = [8.093827077399547e-07, 1.102681999632987e-06, -0.0007939604310854831;
1.29246107737264e-06, 1.492629957878578e-06, -0.001211264339006535;
-0.001052930954975217, -0.001278667878010564, 1]
K = [150, 0, 300;
0, 150, 400;
0, 0, 1]
E = [0.01821111092414898, 0.02481034499174221, -0.01651092283654529;
0.02908037424088439, 0.03358417405226801, -0.03397110489649674;
-0.04396975675562629, -0.05262169424538553, 0.04904210357279387]
t = [0.2970648246214448; 0.7352053067682792; 0.6092828956013705]
R = [0.2048034356172475, 0.4709818957303019, -0.858039396912323;
-0.8690270040802598, -0.3158728880490416, -0.3808101689488421;
-0.4503860776474556, 0.8236506374002566, 0.3446041331317597]
First of all you should check if
x' * F * x = 0
for your point correspondences x' and x. This should be of course only the case for the inliers of the fundamental matrix estimation with RANSAC.
Thereafter, you have to transform your point correspondences to normalized image coordinates (NCC) like this
xn = inv(K) * x
xn' = inv(K') * x'
where K' is the intrinsic camera matrix of the second image and x' are the points of the second image. I think in your case it is K = K'.
With these NCCs you can decompose your essential matrix like you described. You triangulate the normalized camera coordinates and check the depth of your triangulated points. But be careful, in literature they say that one point is sufficient to get the correct rotation and translation. From my experience you should check a few points since one point can be an outlier even after RANSAC.
Before you decompose the essential matrix make sure that E=U*diag(1,1,0)*Vt. This condition is required to get correct results for the four possible choices of the projection matrix.
When you've got the correct rotation and translation you can triangulate all your point correspondences (the inliers of the fundamental matrix estimation with RANSAC). Then, you should compute the reprojection error. Firstly, you compute the reprojected position like this
xp = K * P * X
xp' = K' * P' * X
where X is the computed (homogeneous) 3D position. P and P' are the 3x4 projection matrices. The projection matrix P is normally given by the identity. P' = [R, t] is given by the rotation matrix in the first 3 columns and rows and the translation in the fourth column, so that P is a 3x4 matrix. This only works if you transform your 3D position to homogeneous coordinates, i.e. 4x1 vectors instead of 3x1. Then, xp and xp' are also homogeneous coordinates representing your (reprojected) 2D positions of your corresponding points.
I think the
new_pos = old_pos + -R.t()*t;
is incorrect since firstly, you only translate the old_pos and you do not rotate it and secondly, you translate it with a wrong vector. The correct way is given above.
So, after you computed the reprojected points you can calculate the reprojection error. Since you are working with homogeneous coordinates you have to normalize them (xp = xp / xp(2), divide by last coordinate). This is given by
error = (x(0)-xp(0))^2 + (x(1)-xp(1))^2
If the error is large such as 10^2 your intrinsic camera calibration or your rotation/translation are incorrect (perhaps both). Depending on your coordinate system you can try to inverse your projection matrices. On that account you need to transform them to homogeneous coordinates before since you cannot invert a 3x4 matrix (without the pseudo inverse). Thus, add the fourth row [0 0 0 1], compute the inverse and remove the fourth row.
There is one more thing with reprojection error. In general, the reprojection error is the squared distance between your original point correspondence (in each image) and the reprojected position. You can take the square root to get the Euclidean distance between both points.
To update your camera position, you have to update the translation first, then update the rotation matrix.
t_ref += lambda * (R_ref * t);
R_ref = R * R_ref;
where t_ref and R_ref are your camera state, R and t are new calculated camera rotation and translation, and lambda is the scale factor.

How to transform an image based on the position of camera

I'm trying to create a perspective projection of an image based on the look direction. I'm unexperienced on this field and can't manage to do that myself, however. Will you help me, please?
There is an image and an observer (camera). If camera can be considered an object on an invisible sphere and the image a plane going through the middle of the sphere, then camera position can be expressed as:
x = d cos(θ) cos(φ)
y = d sin(θ)
z = d sin(φ) cos(θ)
Where θ is latitude, φ is longitude and d is the distance (radius) from the middle of the sphere where the middle of the image is.
I found these formulae somwhere, but I'm not sure about the coordinates (I don't know but it looks to me that x should be z but I guess it depends on the coordinate system).
Now, what I need to do is make a proper transformation of my image so it looks as if viewed from the camera (in a proper perspective). Would you be so kind to tell me a few words how this could be done? What steps should I take?
I'm developing an iOS app and I thought I could use the following method from the QuartzCore. But I have no idea what angle I should pass to this method and how to derive the new x, y, z coordinates from the camera position.
CATransform3D CATransform3DRotate (CATransform3D t, CGFloat angle,
CGFloat x, CGFloat y, CGFloat z)
So far I have successfully created a simple viewing perspective by:
using an identity matrix (as the CATransform3D parameter) with .m34 set to 1/-1000,
rotating my image by the angle of φ with the (0, 1, 0) vector,
concatenating the result with a rotation by θ and the (1, 0, 0) vector,
scaling based on the d is ignored (I scale the image based on some other criteria).
But the result I got was not what I wanted (which was obvious) :-/. The perspective looks realistic as long as one of these two angles is close to 0. Therefore I thought there could be a way to calculate somehow a proper angle and the x, y and z coordinates to achieve a proper transformation (which might be wrong because it's just my guess).
I think I managed to find a solution, but unfortunately based on my own calculations, thoughts and experiments, so I have no idea if it is correct. Seems to be OK, but you know...
So if the coordinate system is like this:
and the plane of the image to be transformed goes through the X and the Y axis, and its centre is in the origin of the system, then the following coordinates:
x = d sin(φ) cos(θ)
y = d sin(θ)
z = d cos(θ) cos(φ)
define a vector that starts in the origin of the coordinate system and points to the position of the camera that is observing the image. The d can be set to 1 so we get a unit vector at once without further normalization. Theta is the angle in the ZY plane and phi is the angle in the ZX plane. Theta raises from 0° to 90° from the Z+ to the Y+ axis, whereas phi raises from 0° to 90° from the Z+ to the X+ axis (and to -90° in the opposite direction, in both cases).
Hence the transformation vector is:
x1 = -y / z
y1 = -x / z
z1 = 0.
I'm not sure about z1 = 0, however rotation around the Z axis seemed wrong to me.
The last thing to calculate is the angle by which the image has to be transformed. In my humble opinion this should be the angle between the vector that points to the camera (x, y, z) and the vector normal to the image, which is the Z axis (0, 0, 1).
The dot product of two vectors gives the cosine of the angle between them, so the angle is:
α = arccos(x * 0 + y * 0 + z * 1) = arccos(z).
Therefore the alpha angle and the x1, y1, z1 coordinates are the parameters of CATransform3DRotate method I mentioned in my question.
I would be grateful if somebody could tell me if this approach is correct. Thanks a lot!

Calculating homography matrix using arbitrary known geometrical relations

I am using OpenCV for an optical measurement system. I need to carry out a perspective transformation between two images, captured by a digital camera. In the field of view of the camera I placed a set of markers (which lie in a common plane), which I use as corresponding points in both images. Using the markers' positions I can calculate the homography matrix. The problem is, that the measured object, whose images I actually want to transform is positioned in a small distance from the markers and in parallel to the markers' plane. I can measure this distance.
My question is, how to take that distance into account when calculating the homography matrix, which is necessary to perform the perspective transformation.
In my solution it is a strong requirement not to use the measured object points for calculation of homography (and that is why I need other markers in the field of view).
Please let me know if the description is not precise.
Presented in the figure is the exemplary image.
The red rectangle is the measured object. It is physically placed in a small distance behind the circular markers.
I capture images of the object from different camera's positions. The measured object can deform between each acquisition. Using circular markers, I want to transform the object's image to the same coordinates. I can measure the distance between object and markers but I do not know, how should I modify the homography matrix in order to work on the measured object (instead of the markers).
This question is quite old, but it is interesting and it might be useful to someone.
First, here is how I understood the problem presented in the question:
You have two images I1 and I2 acquired by the same digital camera at two different positions. These images both show a set of markers which all lie in a common plane pm. There is also a measured object, whose visible surface lies in a plane po parallel to the marker's plane but with a small offset. You computed the homography Hm12 mapping the markers positions in I1 to the corresponding markers positions in I2 and you measured the offset dm-o between the planes po and pm. From that, you would like to calculate the homography Ho12 mapping points on the measured object in I1 to the corresponding points in I2.
A few remarks on this problem:
First, notice that an homography is a relation between image points, whereas the distance between the markers' plane and the object's plane is a distance in world coordinates. Using the latter to infer something about the former requires to have a metric estimation of the camera poses, i.e. you need to determine the euclidian and up-to-scale relative position & orientation of the camera for each of the two images. The euclidian requirement implies that the digital camera must be calibrated, which should not be a problem for an "optical measurement system". The up-to-scale requirement implies that the true 3D distance between two given 3D points must be known. For instance, you need to know the true distance l0 between two arbitrary markers.
Since we only need the relative pose of the camera for each image, we may choose to use a 3D coordinate system centered and aligned with the coordinate system of the camera for I1. Hence, we will denote the projection matrix for I1 by P1 = K1 * [ I | 0 ]. Then, we denote the projection matrix for I2 (in the same 3D coordinate system) by P2 = K2 * [ R2 | t2 ]. We will also denote by D1 and D2 the coefficients modeling lens distortion respectively for I1 and I2.
As a single digital camera acquired both I1 and I2, you may assume that K1 = K2 = K and D1 = D2 = D. However, if I1 and I2 were acquired with a long delay between the acquisitions (or with a different zoom, etc), it will be more accurate to consider that two different camera matrices and two sets of distortion coefficients are involved.
Here is how you could approach such a problem:
The steps in order to estimate P1 and P2 are as follows:
Estimate K1, K2 and D1, D2 via calibration of the digital camera
Use D1 and D2 to correct images I1 and I2 for lens distortion, then determine the marker positions in the corrected images
Compute the fundamental matrix F12 (mapping points in I1 to epilines in I2) from the corresponding markers positions and infer the essential matrix E12 = K2T * F12 * K1
Infer R2 and t2 from E12 and one point correspondence (see this answer to a related question). At this point, you have an affine estimation of the camera poses, but not an up-to-scale one since t2 has unit norm.
Use the measured distance l0 between two arbitrary markers to infer the correct norm for t2.
For the best accuracy, you may refine P1 and P2 using a bundle adjustment, with K1 and ||t2|| fixed, based on the corresponding marker positions in I1 and I2.
At this point, you have an accurate metric estimation of the camera poses P1 = K1 * [ I | 0 ] and P2 = K2 * [ R2 | t2 ]. Now, the steps to estimate Ho12 are as follows:
Use D1 and D2 to correct images I1 and I2 for lens distortion, then determine the marker positions in the corrected images (same as 2. above, no need to re-do that) and estimate Hm12 from these corresponding positions
Compute the 3x1 vector v describing the markers' plane pm by solving this linear equation: Z * Hm12 = K2 * ( R2 - t2 * vT ) * K1-1 (see HZ00 chapter 13, result 13.5 and equation 13.2 for a reference on that), where Z is a scaling factor. Infer the distance to origin dm = ||v|| and the normal n = v / ||v||, which describe the markers' plane pm in 3D.
Since the object plane po is parallel to pm, they have the same normal n. Hence, you can infer the distance to origin do for po from the distance to origin dm for pm and from the measured plane offset dm-o, as follows: do = dm ± dm-o (the sign depends of the relative position of the planes: positive if pm is closer to the camera for I1 than po, negative otherwise).
From n and do describing the object plane in 3D, infer the homography Ho12 = K2 * ( R2 - t2 * nT / do ) * K1-1 (see HZ00 chapter 13, equation 13.2)
The homography Ho12 maps points on the measured object in I1 to the corresponding points in I2, where both I1 and I2 are assumed to be corrected for lens distortion. If you need to map points from and to the original distorted image, don't forget to use the distortion coefficients D1 and D2 to transform the input and output points of Ho12.
The reference I used:
[HZ00] "Multiple view geometry for computer vision", by R.Hartley and A.Zisserman, 2000.

Resources