OpenCV depth map accuracy

I want to measure the distance to an object using a 3D stereoscopic camera phone with OpenCV. I am looking for a formula that will give the accuracy of the distance measurement, depending on the focal length, the distance between the two cameras, the image resolution, and the size of the measured object.
Googling a little, I found this formula:
d = Z^2 * p / (f*b)
Z - distance to object, p - disparity accuracy, f - focal length, b - baseline (distance between cameras).
I know the baseline and the focal length, but I don't know the disparity accuracy.
Is this formula what I need? If so, how do I find the disparity accuracy?
Thanks.

I realize this is a year late, but just in case someone finds this.
The formula is this:
dD = dd * D^2 / (f * B)
where:
dd = disparity error
dD = depth error
D = depth
f = focal length
B = baseline
Say f = 6mm = 0.006m, B = 24mm = 0.024m, D = 10m, and dd is 1 pixel [let's call the pixel size P for now; it's usually about 1.4um].
Plugging all the numbers in gives:
dD = P * 10^2 / (0.006 * 0.024) ~ 694444 P
For P=1.4um, dD = 0.97 m (which is about 9.7%).
Note that this assumes your correspondence search has a one-pixel error. You can do sub-pixel search, and depending on the noise level and texture in the image you can get sub-pixel accurate correspondence, in which case your accuracy would be somewhat better.
NOTE that this formula is for error. The map between disparity and depth is as follows:
d = fB / D
where:
d = disparity
D = depth
f = focal length
B = baseline
Similarly, plugging the numbers in gives:
d = (0.006 * 0.024 / 10) m = 0.0000144 m = 0.0144 mm = 14.4 um.
If you assume that your pixel size is about 1.4um, then 14.4um is about 10 pixels. This is consistent with the error above -- meaning that a 1-pixel error represents roughly 10%.
A car that is 10 meters away is shifted 10 pixels between the left and right sensors.
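A quick numeric check of both formulas, using the example numbers above (plain Python; the 1.4um pixel pitch is the assumed value from this answer, not a measured one):
f = 0.006        # focal length in metres (6 mm)
B = 0.024        # baseline in metres (24 mm)
D = 10.0         # depth in metres
pixel = 1.4e-6   # assumed pixel pitch in metres (1.4 um)
# depth error for a 1-pixel disparity error: dD = dd * D^2 / (f * B)
dD = pixel * D**2 / (f * B)
print("depth error: %.2f m (%.1f%% of D)" % (dD, 100 * dD / D))   # ~0.97 m, ~9.7%
# disparity of an object at depth D: d = f * B / D
d = f * B / D
print("disparity: %.1f um = %.1f pixels" % (d * 1e6, d / pixel))  # ~14.4 um, ~10 px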
I hope that helps.

If you look at the paragraph after formula 8 in the document you link, you can see that they use a disparity accuracy of 0.18*10^-6 m. Reading a bit further, I conclude that the disparity accuracy they use is the distance in metres between two pixels on the CCD of the cameras used. For a 1/4" CCD (which measures 3.2mm by 2.4mm) with a resolution of 640x480 (a very old VGA camera) this would be 5*10^-6. I don't know the sensor size of the LG Optimus 3D, but assuming a 1/4" CCD and 2592 pixels of horizontal resolution, the lower bound on the disparity accuracy would be 1.23*10^-6, giving a depth accuracy at 10m of about 0.85m. That looks reasonable to me. If the CCD is smaller, the accuracy improves (i.e. the accuracy value gets lower).
This is the lowest possible value that assumes perfect matching of features between the two stereo images. This value just represents the physical limitations of your stereo pair.
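For reference, a short sketch of that calculation (the sensor width, resolution, focal length and baseline are the assumed values from the discussion above, not the phone's actual specs):
sensor_width = 3.2e-3     # assumed 1/4" CCD width in metres
h_resolution = 2592       # assumed horizontal resolution in pixels
f = 0.006                 # assumed focal length in metres
B = 0.024                 # assumed baseline in metres
D = 10.0                  # depth in metres
dd = sensor_width / h_resolution          # pixel pitch = best-case disparity accuracy
dD = dd * D**2 / (f * B)                  # depth error formula from the first answer
print("disparity accuracy: %.2e m" % dd)      # ~1.23e-06 m
print("depth accuracy at 10 m: %.2f m" % dD)  # ~0.86 m, the ~0.85 m quoted above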

Related

Code for a multiple quadratic (or polynomial) least squares (surface fit)?

For a machine vision project I am trying to search image data for quadratic surfaces (f(x,y) = Ax^2+Bx+Cy^2+Dy+Exy+F). My plan is to iterate through regions of the data, perform a surface fit, look at the error, and see whether it forms a continuous surface (which would probably indicate a feature in the image).
I was previously able to find quadratic curves (f(x) = Ax^2+Bx+C) in the image data by sampling lines, by using the equations on this site
Link
This worked well and was promising, but it would be much more useful for my task to find 2D regions that form continuous surfaces.
I see lots of articles indicating that least squares regression scales up to multiple dimensions, but I'm not able to find code for this. Hopefully there is a "closed form" (non-iterative, computed directly from your data points) solution, like the one described above for 1D data. Does anybody know of source code or pseudocode that accomplishes this? Thanks.
(Sorry if my terminology is a bit off.)
I'm not sure what your background is, but if you know some linear algebra you will find the Wikipedia article on linear least squares useful.
Let's take the following example. Say we have the following image
and we want to know how well it fits a 2D quadratic function in a least squares sense.
Probably the most straightforward way to solve the problem is to compute the optimal coefficients in a least squares sense, then check the error.
First we need to describe the matrices.
Let X be a matrix containing every x,y coordinate in the image, taking the form
X = [x1 x1^2 y1 y1^2 x1*y1 1;
x2 x2^2 y2 y2^2 x2*y2 1;
...
xN xN^2 yN yN^2 xN*yN 1];
For the example image above, X would be a 100x6 matrix.
Let y be the image intensity values in a vector of the form
y = [img(x1,y1);
img(x2,y2);
...
img(xN,yN)]
In this case y is a 100 element column vector.
We want to minimize the least squares objective function S with respect to the vector of coefficients b
S(b) = |y - X*b|^2
where |.| is the L2 norm and b is the desired coefficients
b = [A;
B;
C;
D;
E;
F]
Taking the vector derivative of S(b) with respect to b, setting to zero, and solving for b leads to the standard least squares solution.
b = inv(X'X)*X'*y
where inv is the matrix inverse, ' is transpose, and * is matrix multiplication.
MATLAB example.
% Generate an image
% define x,y coordinates for each location in the image
[x,y] = meshgrid(1:10,1:10);
% true coefficients
b_true = [0.1 0.5 0.3 -0.4 0.4 124];
% magnitude of noise
P = 2;
% create image
img = b_true(1).*x + b_true(2).*x.^2 + b_true(3).*y + b_true(4).*y.^2 + b_true(5).*x.*y + b_true(6);
noise = P*randn(10,10);
img = img + noise;
% Begin least squares optimization
% create matrices
X = [x(:) x(:).^2 y(:) y(:).^2 x(:).*y(:) ones(size(x(:)))];
y = img(:);
% estimated coefficients
b = (X.'*X)\(X.')*y
% mean square error (expected to be near P^2)
E = 1/numel(y) * sum((y - X*b).^2)
Output
b =
0.0906
0.5093
0.1245
-0.3733
0.3776
124.5412
E =
3.4699
In your application you would probably want to define some threshold such that when E < threshold you accept the image (or image region) as a quadratic polynomial.
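In case you are working in Python rather than MATLAB, here is a rough NumPy translation of the same approach (a sketch under the same assumptions; np.linalg.lstsq is used instead of forming the normal equations explicitly, which is numerically a bit safer):
import numpy as np
# Generate a noisy quadratic test image, mirroring the MATLAB example above.
x, y = np.meshgrid(np.arange(1, 11), np.arange(1, 11))
b_true = np.array([0.1, 0.5, 0.3, -0.4, 0.4, 124.0])
img = (b_true[0] * x + b_true[1] * x**2 + b_true[2] * y +
       b_true[3] * y**2 + b_true[4] * x * y + b_true[5])
img = img + 2.0 * np.random.randn(10, 10)   # noise magnitude P = 2
# Design matrix with columns [x, x^2, y, y^2, x*y, 1].
X = np.column_stack([x.ravel(), x.ravel()**2, y.ravel(), y.ravel()**2,
                     x.ravel() * y.ravel(), np.ones(x.size)])
z = img.ravel()
# Least squares fit and mean squared error (expected to be near P^2 = 4).
b, *_ = np.linalg.lstsq(X, z, rcond=None)
mse = np.mean((z - X @ b)**2)
print(b)
print(mse)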

Point Cloud from KITTI stereo images

I am trying to create a point cloud based on images from the KITTI stereo dataset, so that later I can estimate the 3D position of some objects.
The original images look like this.
What I have so far:
Generated disparity with cv2.StereoSGBM_create:
window_size = 9
minDisparity = 1
stereo = cv2.StereoSGBM_create(
    blockSize=10,
    numDisparities=64,
    preFilterCap=10,
    minDisparity=minDisparity,
    P1=4 * 3 * window_size ** 2,
    P2=32 * 3 * window_size ** 2
)
Calculated the Q matrix with cv2.stereoRectify using data from the KITTI calibration files:
# K_xx: 3x3 calibration matrix of camera xx before rectification
K_L = np.matrix(
    [[9.597910e+02, 0.000000e+00, 6.960217e+02],
     [0.000000e+00, 9.569251e+02, 2.241806e+02],
     [0.000000e+00, 0.000000e+00, 1.000000e+00]])
K_R = np.matrix(
    [[9.037596e+02, 0.000000e+00, 6.957519e+02],
     [0.000000e+00, 9.019653e+02, 2.242509e+02],
     [0.000000e+00, 0.000000e+00, 1.000000e+00]])
# D_xx: 1x5 distortion vector of camera xx before rectification
D_L = np.matrix([-3.691481e-01, 1.968681e-01, 1.353473e-03, 5.677587e-04, -6.770705e-02])
D_R = np.matrix([-3.639558e-01, 1.788651e-01, 6.029694e-04, -3.922424e-04, -5.382460e-02])
# R_xx: 3x3 rotation matrix of camera xx (extrinsic)
R_L = np.transpose(np.matrix([[9.999758e-01, -5.267463e-03, -4.552439e-03],
                              [5.251945e-03, 9.999804e-01, -3.413835e-03],
                              [4.570332e-03, 3.389843e-03, 9.999838e-01]]))
R_R = np.matrix([[9.995599e-01, 1.699522e-02, -2.431313e-02],
                 [-1.704422e-02, 9.998531e-01, -1.809756e-03],
                 [2.427880e-02, 2.223358e-03, 9.997028e-01]])
# T_xx: 3x1 translation vector of camera xx (extrinsic)
T_L = np.transpose(np.matrix([5.956621e-02, 2.900141e-04, 2.577209e-03]))
T_R = np.transpose(np.matrix([-4.731050e-01, 5.551470e-03, -5.250882e-03]))
IMG_SIZE = (1392, 512)
rotation = R_L * R_R
translation = T_L - T_R
# output matrices from stereoRectify init
R1 = np.zeros(shape=(3, 3))
R2 = np.zeros(shape=(3, 3))
P1 = np.zeros(shape=(3, 4))
P2 = np.zeros(shape=(3, 4))
Q = np.zeros(shape=(4, 4))
R1, R2, P1, P2, Q, validPixROI1, validPixROI2 = cv2.stereoRectify(
    K_L, D_L, K_R, D_R, IMG_SIZE, rotation, translation,
    R1, R2, P1, P2, Q,
    newImageSize=(1242, 375))
The resulting matrix looks like this (at this point I doubt that it is correct):
[[ 1. 0. 0. -614.37893072]
[ 0. 1. 0. -162.12583194]
[ 0. 0. 0. 680.05186262]
[ 0. 0. -1.87703644 0. ]]
Generated Point Cloud with reprojectImageTo3D which looks like this: point cloud
And now the questions part begins :)
Is it OK that all values returned by reprojectImageTo3D are negative?
What are the units of those values, taking into account that it is the KITTI dataset and their camera calibration data is available?
And finally, is it possible to convert those values to something like longitude/latitude if I have the GPS coordinates of the camera that took those photos?
I would appreciate any help!
Is it OK for all values returned by reprojectImageTo3D to be negative?
Generally speaking, no, at least for Z values. The values returned by reprojectImageTo3D are real-world coordinates relative to the camera origin, so for a Z value to be negative it means the point is behind the camera (which is geometrically incorrect). The X and Y values can be negative, since the camera origin is at the center of the FOV, so a negative X value means the point is "to the left" and a negative Y value means the point is "below". But for Z values, no, they should not be negative.
Your Q matrix is turning out to be almost the identity because I think you are setting up the rotation incorrectly in your call to stereoRectify. The rotation and translation you pass should be the single rotation and translation from camera 1 to camera 2, not a combination of the two cameras' individual extrinsic rotations. What you are doing is multiplying the two rotations together after transposing one of them; instead you should pass only R_L (since from your description I assume it is the rotation from the left to the right camera).
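For what it's worth, a sketch of what that corrected call might look like, reusing the variables from the question and following the assumption above that R_L already encodes the left-to-right rotation (the translation is left as the question computes it):
# Sketch only: pass a single camera-1-to-camera-2 rotation to stereoRectify,
# instead of the product of the two per-camera rotations.
rotation = R_L                  # assumed: rotation from left to right camera
translation = T_L - T_R         # translation between the cameras, as in the question
R1, R2, P1, P2, Q, validPixROI1, validPixROI2 = cv2.stereoRectify(
    K_L, D_L, K_R, D_R, IMG_SIZE,
    rotation, translation,
    newImageSize=(1242, 375))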
What are the units of those values, taking into account that it is the KITTI dataset and their camera calibration data is available?
I am not familiar with the KITTI dataset, but the values returned after calling reprojectImageTo3D are in real-world units, typically meters.
And finally, is it possible to convert those values to something like longitude/latitude if I have the GPS coordinates of the camera that took those photos?
The coordinates returned by reprojectImageTo3D are real-world coordinates relative to the camera origin. If you have the GPS coordinates of the camera that took the photos, you can offset that latitude/longitude by the (X, Y, Z) coordinates returned from the reprojection.
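As a very rough illustration (not part of the answer above), assuming the reprojected values are in metres and you know the compass heading of the camera, you could convert a point's local offset to latitude/longitude with a flat-earth approximation, something like:
import math
def offset_to_latlon(cam_lat, cam_lon, east_m, north_m):
    # Very rough flat-earth approximation: shift a GPS position by a local
    # east/north offset in metres. Reasonable for offsets of tens of metres.
    dlat = north_m / 111320.0                                   # metres per degree of latitude
    dlon = east_m / (111320.0 * math.cos(math.radians(cam_lat)))
    return cam_lat + dlat, cam_lon + dlon
# Hypothetical example: a point 3 m to the right (X) and 15 m in front (Z)
# of a camera facing due north, so X maps to east and Z maps to north.
print(offset_to_latlon(49.0, 8.4, east_m=3.0, north_m=15.0))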

OpenCV calculate distance from object with known size

Is it possible to calculate the distance to an object of known size?
I would like to do this with a ball that has a 7 cm diameter. For the first calculation I would place it at 30 cm from the webcam, and for the second at 50 cm.
Is there a linear function or formula to calculate the distance?
Let's say in the first measurement it has a diameter of 6 pixels and in the second only 4. There must be a formula for this?
Best regards
In the optical scheme you have two similar right triangles with edges F (the objective's focal distance), PixelSize, Distance and Size:
Distance / Size = F / PixelSize
So, given measured parameters at some known Distance0, you can get F (in pixel units; consider it a constant):
F = Distance0 * PixelSize0 / Size0
and use it to calculate an unknown distance (as long as the zoom doesn't change):
Distance = F * Size / PixelSize
(Note that you can vary object size)
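A minimal sketch of this, using the ball numbers from the question (7 cm diameter, 6 px at the 30 cm calibration distance) as the known measurement:
# Calibrate F (in pixel units) from one known distance, then reuse it.
size_m = 0.07          # real ball diameter: 7 cm
dist0_m = 0.30         # known calibration distance: 30 cm
pixels0 = 6.0          # measured diameter in pixels at dist0
F = dist0_m * pixels0 / size_m          # F = Distance0 * PixelSize0 / Size0
def distance_from_pixels(pixels, size=size_m, F=F):
    # Distance = F * Size / PixelSize
    return F * size / pixels
print(distance_from_pixels(4.0))   # ball seen at 4 px -> ~0.45 m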

Essential Matrix from Fundamental Matrix in OpenCV

I've already computed the Fundamental Matrix of a stereo pair from corresponding points, found using SURF. According to Hartley and Zisserman, the Essential Matrix is computed as:
E = K.t() * F * K
How do I get K? Is there another way to compute E?
I don't know where you got that formula, but the correct one is
E = K'^T . F . K (see Hartley & Zisserman, §9.6, page 257 of the second edition)
K is the intrinsic camera matrix, holding the scale factors and the position of the image center, expressed in pixel units.
    | \alpha_u    0      u_0 |
K = |    0     \alpha_v  v_0 |
    |    0        0       1  |
(sorry, Latex not supported on SO)
Edit : To get those values, you can either:
calibrate the camera
compute approximate values if you have the manufacturer data. If the lens is correctly centered on the sensor, then u_0 and v_0 are half the width and half the height of the image resolution, respectively. And alpha = k*f, with f the focal length (in metres) and k the pixel scale factor: if you have a pixel of, say, 6 um, then k = 1/6um.
For example, if the lens is 8mm and the pixel size is 8um, then alpha = 1000.
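As a sketch of both steps in NumPy, applying the E = K'^T . F . K formula from above (the focal length, pixel size and resolution are made-up illustration values, and the F here is a placeholder for your estimated fundamental matrix):
import numpy as np
# Approximate K from manufacturer data (assumed: 8 mm lens, 8 um pixels,
# 1280x960 resolution, lens centered on the sensor).
alpha = 8e-3 / 8e-6            # alpha = f / pixel_size = 1000
u0, v0 = 1280 / 2.0, 960 / 2.0
K = np.array([[alpha, 0.0,   u0],
              [0.0,   alpha, v0],
              [0.0,   0.0,   1.0]])
# E = K'^T * F * K; both cameras are assumed identical here, so K' = K.
F = np.eye(3)                  # placeholder: use your estimated fundamental matrix
E = K.T @ F @ K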
Computing E
Sure, there are several ways to compute E. For example, if you have a strongly calibrated camera rig, you can extract R and t (the rotation matrix and translation vector) between the two cameras, and E is defined as the product of the skew-symmetric matrix of t and the matrix R, i.e. E = [t]_x * R.
But if you have the book, all of this is inside.
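A small sketch of that construction (R and t here are placeholders for the rotation and translation you extracted from your calibrated rig):
import numpy as np
def skew(t):
    # Skew-symmetric matrix [t]_x such that skew(t) @ v == np.cross(t, v).
    return np.array([[0.0,  -t[2],  t[1]],
                     [t[2],  0.0,  -t[0]],
                     [-t[1], t[0],  0.0]])
R = np.eye(3)                    # placeholder rotation between the cameras
t = np.array([1.0, 0.0, 0.0])    # placeholder translation between the cameras
E = skew(t) @ R                  # E = [t]_x * R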
Edit: Just noticed there is even a Wikipedia page on this topic!

Camera motion from corresponding images

I'm trying to calculate a new camera position based on the motion of corresponding images.
The images conform to the pinhole camera model.
Unfortunately, I don't get useful results, so I'll describe my procedure and hope that somebody can help me.
I detect features in the corresponding images with SIFT, match them with OpenCV's FlannBasedMatcher, and calculate the fundamental matrix with OpenCV's findFundamentalMat (RANSAC method).
Then I calculate the essential matrix by the camera intrinsic matrix (K):
Mat E = K.t() * F * K;
I decompose the essential matrix to rotation and translation with singular value decomposition:
SVD decomp = SVD(E);
Matx33d W(0, -1, 0,
          1,  0, 0,
          0,  0, 1);
Matx33d Wt(0,  1, 0,
          -1,  0, 0,
           0,  0, 1);
R1 = decomp.u * Mat(W) * decomp.vt;
R2 = decomp.u * Mat(Wt) * decomp.vt;
t1 = decomp.u.col(2); //u3
t2 = -decomp.u.col(2); //u3
Then I try to find the correct solution by triangulation. (This part is from http://www.morethantechnical.com/2012/01/04/simple-triangulation-with-opencv-from-harley-zisserman-w-code/ so I think it should work correctly.)
The new position is then calculated with:
new_pos = old_pos + -R.t()*t;
where new_pos & old_pos are vectors (3x1), R the rotation matrix (3x3) and t the translation vector (3x1).
Unfortunately I get no useful results, so maybe someone has an idea of what could be wrong.
Here are some results (just in case someone can confirm that any of them is definitely wrong):
F = [8.093827077399547e-07, 1.102681999632987e-06, -0.0007939604310854831;
1.29246107737264e-06, 1.492629957878578e-06, -0.001211264339006535;
-0.001052930954975217, -0.001278667878010564, 1]
K = [150, 0, 300;
0, 150, 400;
0, 0, 1]
E = [0.01821111092414898, 0.02481034499174221, -0.01651092283654529;
0.02908037424088439, 0.03358417405226801, -0.03397110489649674;
-0.04396975675562629, -0.05262169424538553, 0.04904210357279387]
t = [0.2970648246214448; 0.7352053067682792; 0.6092828956013705]
R = [0.2048034356172475, 0.4709818957303019, -0.858039396912323;
-0.8690270040802598, -0.3158728880490416, -0.3808101689488421;
-0.4503860776474556, 0.8236506374002566, 0.3446041331317597]
First of all you should check whether
x'^T * F * x ≈ 0
for your point correspondences x' and x. Of course, this should only hold for the inliers of the fundamental matrix estimation with RANSAC.
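In NumPy-ish pseudocode, that check could look like this (pts1/pts2 stand for your inlier correspondences in pixel coordinates; just a sanity-check sketch):
import numpy as np
def epipolar_residuals(F, pts1, pts2):
    # x'^T F x for each correspondence; should be close to 0 for inliers.
    # pts1, pts2: Nx2 arrays of matching pixel coordinates (x and x').
    x  = np.hstack([pts1, np.ones((len(pts1), 1))])   # homogeneous points, image 1
    xp = np.hstack([pts2, np.ones((len(pts2), 1))])   # homogeneous points, image 2
    return np.einsum('ij,jk,ik->i', xp, F, x)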
Thereafter, you have to transform your point correspondences to normalized image coordinates (NCC) like this
xn = inv(K) * x
xn' = inv(K') * x'
where K' is the intrinsic camera matrix of the second image and x' are the points of the second image. I think in your case it is K = K'.
With these NCCs you can decompose your essential matrix as you described. You triangulate the normalized camera coordinates and check the depth of your triangulated points. But be careful: in the literature they say that one point is sufficient to get the correct rotation and translation. From my experience you should check a few points, since a single point can be an outlier even after RANSAC.
Before you decompose the essential matrix make sure that E=U*diag(1,1,0)*Vt. This condition is required to get correct results for the four possible choices of the projection matrix.
When you've got the correct rotation and translation you can triangulate all your point correspondences (the inliers of the fundamental matrix estimation with RANSAC). Then, you should compute the reprojection error. Firstly, you compute the reprojected position like this
xp = K * P * X
xp' = K' * P' * X
where X is the computed (homogeneous) 3D position. P and P' are the 3x4 projection matrices. The projection matrix P of the first camera is normally the identity, i.e. [I | 0]. P' = [R, t] is given by the rotation matrix in the first 3 columns and rows and the translation in the fourth column, so that P' is a 3x4 matrix. This only works if you transform your 3D position to homogeneous coordinates, i.e. 4x1 vectors instead of 3x1. Then xp and xp' are also homogeneous coordinates representing the (reprojected) 2D positions of your corresponding points.
I think the
new_pos = old_pos + -R.t()*t;
is incorrect since, firstly, you only translate old_pos without rotating it and, secondly, you translate it with the wrong vector. The correct update is given below.
So, after you have computed the reprojected points you can calculate the reprojection error. Since you are working with homogeneous coordinates you have to normalize them first (xp = xp / xp(2), i.e. divide by the last coordinate). The error is given by
error = (x(0)-xp(0))^2 + (x(1)-xp(1))^2
If the error is large, say around 10^2, your intrinsic camera calibration or your rotation/translation is incorrect (perhaps both). Depending on your coordinate system you can try to invert your projection matrices. To do that you first need to extend them to homogeneous form, since you cannot invert a 3x4 matrix (without the pseudo-inverse): add the fourth row [0 0 0 1], compute the inverse, and remove the fourth row again.
There is one more thing with reprojection error. In general, the reprojection error is the squared distance between your original point correspondence (in each image) and the reprojected position. You can take the square root to get the Euclidean distance between both points.
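Put together, computing the reprojection error for one correspondence might look like this (a NumPy sketch under the assumptions above; X_h is the homogeneous 4-vector of the triangulated point, and x1/x2 are the measured pixel positions):
import numpy as np
def reprojection_error(K1, K2, R, t, X_h, x1, x2):
    # Squared reprojection error of one triangulated point in both images.
    # X_h: homogeneous 3D point (4-vector), x1/x2: measured 2D points (2-vectors).
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])   # first camera: [I | 0]
    P2 = np.hstack([R, t.reshape(3, 1)])            # second camera: [R | t]
    xp1 = K1 @ P1 @ X_h
    xp2 = K2 @ P2 @ X_h
    xp1 = xp1[:2] / xp1[2]                          # normalize homogeneous coordinates
    xp2 = xp2[:2] / xp2[2]
    return np.sum((x1 - xp1)**2), np.sum((x2 - xp2)**2)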
To update your camera position, you have to update the translation first, then update the rotation matrix.
t_ref += lambda * (R_ref * t);
R_ref = R * R_ref;
where t_ref and R_ref are your accumulated camera state, R and t are the newly computed camera rotation and translation, and lambda is the scale factor.
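In code, that update step would look something like this (a sketch; R_ref/t_ref are your accumulated camera rotation and position, and scale stands for the unknown scale factor lambda):
import numpy as np
def update_pose(R_ref, t_ref, R, t, scale=1.0):
    # Accumulate a new relative motion (R, t) into the reference pose,
    # following the rule above: update the translation first, then the rotation.
    t_ref = t_ref + scale * (R_ref @ t)
    R_ref = R @ R_ref
    return R_ref, t_ref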
