I'm working on a Swift app that should measure physically correct z-values of a person standing in front of a statically mounted iPhone 7+. To do this, I am using AVDepthData objects that contain the depth maps coming from the dual camera system.
However, the resulting point clouds indicate that the depth maps do not have subpixel accuracy: the point clouds consist of slices along the z-direction, and the distance between adjacent slices increases with depth. This seems to be caused by integer discretization.
Here are two files that visualize the problem:
Captured Depthmap, cropped Z-Values after 4.0m: DepthMap with Z-values in Legend
Textured Pointcloud, view from the side (90°): Pointcloud rendered from iPhone
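The slicing pattern is what one would expect if the depth is derived from a disparity that is stored in discrete steps: since z = f * B / d, equal steps in the disparity d map to depth gaps that grow with depth. A minimal sketch of that effect follows. The focal length, baseline and disparity step are made-up values for illustration, not the iPhone 7+ calibration.

```python
import numpy as np

# Hypothetical values for illustration only (not the iPhone 7+ calibration).
focal_px = 2800.0      # focal length in pixels
baseline_m = 0.01      # distance between the two lenses in metres
disparity_step = 0.25  # assumed quantization step of the disparity in pixels

def depth_levels(d_min, d_max):
    """Depths reachable when the disparity is quantized to disparity_step."""
    disparities = np.arange(d_max, d_min - 1e-9, -disparity_step)
    return focal_px * baseline_m / disparities

levels = depth_levels(5.0, 30.0)  # depths from near to far
gaps = np.diff(levels)            # spacing between adjacent depth slices

print("nearest slices:", levels[:3], "gaps:", gaps[:2])
print("farthest slices:", levels[-3:], "gaps:", gaps[-2:])
```

With these numbers the gaps between adjacent slices grow monotonically from near to far, which matches the side view of the point cloud.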
According to Apple's documentation, I've already deactivated the temporal filtering and unwarped the images using the distortion coefficients from the lookup table, in order to have correct world coordinates.
Filtering depth data makes it more useful for applying visual effects to a companion image, but alters the data such that it may no longer be suitable for computer vision tasks. (In an unfiltered depth map, missing values are represented as NaN.)
Is there any way to retrieve depth maps that have subpixel accuracy in order to perform good measurements of a person standing in front of the camera?
Below is the Python code I wrote to create the point clouds offline. The method calculate_rectified_point was provided by Apple to remove lens distortion from the images.
for v in range(height):
    for u in range(width):
        r, g, b = rgb_texture[v, u]
        z = depth_map[v, u]
        # Skip pixels without a valid depth (unfiltered maps use NaN for missing values)
        if np.isnan(z) or z <= 0:
            continue
        # Step 1: invert the intrinsic parameters
        x = (u - center[0]) / focal_lengths[0]
        y = (v - center[1]) / focal_lengths[1]
        # Step 2: remove the radial and tangential distortion
        x_un, y_un = calculate_rectified_point((x, y), dist_coefficients, optical_center, (width, height))
        # Step 3: invert the extrinsic parameters
        x, y, z = extrinsic_matrix_inv.dot(np.array([x_un * z, y_un * z, z]))
I'm trying to obtain the orientation of a square in the real world from an image. I know the projection of each vertex in the image and with this and a depth camera I can obtain the position of the centroid in the real world.
I need the orientation of the square (actually, the normal vector to the plane), and the depth camera does not have enough resolution. The camera parameters are also known.
I've searched, and I've only found estimation algorithms that are overkill, designed for problems with much less information. In this case I have a lot of data about the shape, distance, camera, image, etc., but I have not been able to work it out.
Thanks in advance.
I assume the image is captured with an ordinary camera, and that your "square" is well approximated by an actual geometrical rectangle, with parallel opposite sides and orthogonal adjacent ones.
If you only need the square's normal, and the camera is calibrated (in particular, the nonlinear lens distortion is removed from the image), then it can trivially be obtained from the vanishing points and the center. The algorithm is as follows:
Express the images of the four vertices p_i, i=1..4, in homogeneous coordinates: p_i = (u_i, v_i, 1). The ordering of i is unimportant, but in the following I assume it's clockwise starting from any one vertex. Also, for convenience, where in the following I write, say, i + n, it's assumed that the addition is modulo 4, so that, e.g., i + 1 = 1 when i = 4.
Compute the equations of the lines covering the square sides: l_i = p_(i+1) X p_i, where X represents the cross product.
Compute the equations of the diagonals: d_13 = p_1 X p_3, d_24 = p_2 X p_4.
Compute the center: c = d_13 X d_24.
Compute the vanishing points of the pairs of parallel sides: v_13 = l_1 X l_3, v_24 = l_2 X l_4. They represent the directions of the images of two lines which, in 3D, are orthogonal to each other.
Compute the images of the axes of the 3D orthogonal coordinate frame rooted at the square center, with two of the axes parallel to the square sides: x = c X v_13, y = c X v_24.
Lastly, the plane normal, in 3D camera coordinate frame, is their cross product: z = x X y.
Note that removing the distortion is important, because even a small amount of distortion can greatly affect the location of the vanishing points when the square sides are nearly parallel.
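A minimal numpy sketch of the vanishing-point part of this recipe, under stated assumptions: the function name is hypothetical, K is the intrinsic matrix of the calibrated camera, and the pixels are already undistorted. Note that for the last step the sketch uses the standard relation that a vanishing point v back-projects to the 3D direction K^-1 v, and takes the cross product of the two side directions. That differs from the image-line construction written in the final two steps above.

```python
import numpy as np

def square_normal_from_image(pts_px, K):
    """Unit normal (camera frame, sign ambiguous) of a rectangle from its 4 image vertices.

    pts_px: 4x2 undistorted pixel coordinates, ordered around the rectangle.
    K:      3x3 intrinsic matrix (an assumption of this sketch).
    """
    p = np.hstack([np.asarray(pts_px, dtype=float), np.ones((4, 1))])
    # Lines through consecutive vertices: l_i = p_(i+1) x p_i
    lines = [np.cross(p[(i + 1) % 4], p[i]) for i in range(4)]
    # Vanishing points of the two pairs of opposite sides
    v13 = np.cross(lines[0], lines[2])
    v24 = np.cross(lines[1], lines[3])
    # Back-project the vanishing points to 3D side directions
    K_inv = np.linalg.inv(K)
    d1 = K_inv @ v13
    d2 = K_inv @ v24
    # The plane normal is orthogonal to both side directions
    n = np.cross(d1, d2)
    return n / np.linalg.norm(n)
```

The homogeneous form keeps working when a pair of sides is parallel in the image, because the vanishing point then simply has a third coordinate close to zero.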
If you want to know why this works, the following excerpt from Hartley and Zisserman's "Multiple View Geometry in Computer Vision" should suffice:
Given an object's 3D mesh file and an image that contains the object, what are some techniques to get the orientation/pose parameters of the 3d object in the image?
I tried searching for some techniques, but most seem to require texture information of the object or at least some additional information. Is there a way to get the pose parameters using just an image and a 3d mesh file (wavefront .obj)?
Here's an example of a 2D image that can be expected.
FOV of camera
The field of view of the camera is the absolute minimum you need to know to even start with this (how can you determine how to place the object when you have no idea how it would affect the scene?). Basically you need a transform matrix that maps from the world GCS (global coordinate system) to camera/screen space and back. If you do not have a clue what I am writing about, then perhaps you should not try any of this before you learn the math.
For an unknown camera you can do some calibration based on markers or etalons (objects of known size and shape) in the view. But it is much better to use the real camera values (like FOV angles in the x,y directions, focal length etc ...)
The goal of this is to create a function that maps world GCS(x,y,z) into Screen LCS(x,y).
For more info read:
transform matrix anatomy
3D graphic pipeline
Perspective projection
Silhouette matching
In order to compare rendered and real image similarity you need some kind of measure. As you need to match geometry I think silhouette matching is the way (ignoring textures, shadows and stuff).
So first you need to obtain the silhouettes. Use image segmentation for that and create an ROI mask of your object. For the rendered image this is easy, as you can render the object with a single color without any lighting directly into the ROI mask.
So you need to construct a function that computes the difference between silhouettes. You can use any kind of measure, but I think you should start with the pixel count of the non-overlapping areas (it is easy to compute).
Basically you count pixels that are present only in one ROI (region of interest) mask.
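A minimal sketch of that measure, assuming both silhouettes are already available as boolean masks of the same size (the function name is made up for illustration). The pixels present in only one mask are exactly the XOR of the two masks:

```python
import numpy as np

def silhouette_difference(mask_a, mask_b):
    """Count the pixels that are set in exactly one of two ROI masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    if a.shape != b.shape:
        raise ValueError("masks must have the same shape")
    return int(np.count_nonzero(a ^ b))

# Toy example: a 3x3 square and the same square shifted by one pixel.
real = np.zeros((6, 6), dtype=bool)
real[1:4, 1:4] = True
rendered = np.zeros((6, 6), dtype=bool)
rendered[2:5, 2:5] = True
print(silhouette_difference(real, rendered))
```

A value of 0 means the silhouettes coincide, and larger values mean a worse match.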
estimate position
As you have the mesh, you know its size, so place it in the GCS so that the rendered image has a bounding box very close to that of the real image. If you do not have the FOV parameters, you need to rescale and translate each rendered image so it matches the image's bounding box (and as a result you obtain only the orientation, not the position, of the object, of course). Cameras have perspective, so the farther from the camera you place your object, the smaller it will be.
fit orientation
Render a few fixed orientations covering all orientations with some step (for example 8^3 orientations). For each, compute the silhouette difference and choose the orientation with the smallest difference.
Then fit the orientation angles around it to minimize difference. If you do not know how optimization or fitting works see this:
How approximation search works
Beware: too few initial orientations can cause false positives or missed solutions, while too many will be slow.
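The coarse stage could be sketched like this. It is only an illustration: render_mask is a hypothetical callable that you would have to supply (it renders the mesh at the given Euler angles and returns a boolean silhouette mask), and a plain grid over three Euler angles is just one simple way to cover the orientations.

```python
import itertools
import numpy as np

def coarse_orientation_search(target_mask, render_mask, steps=8):
    """Try steps^3 Euler-angle combinations and keep the best silhouette match.

    render_mask: hypothetical callable (rx, ry, rz) -> boolean mask with the
                 same shape as target_mask; it is not provided by this sketch.
    """
    angles = np.linspace(0.0, 2.0 * np.pi, steps, endpoint=False)
    best_angles, best_cost = None, np.inf
    for rx, ry, rz in itertools.product(angles, repeat=3):
        rendered = render_mask(rx, ry, rz)
        # Cost: number of pixels present in only one of the two silhouettes
        cost = np.count_nonzero(np.logical_xor(target_mask, rendered))
        if cost < best_cost:
            best_angles, best_cost = (rx, ry, rz), cost
    return best_angles, best_cost
```

The returned angles are only a starting point for the finer fit around them.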
Now that was some basics in a nutshell. As your mesh is not very simple, you may need to tweak this, for example by using contours instead of silhouettes and the distance between contours instead of the non-overlapping pixel count, which is really hard to compute ... You should start with simpler meshes like a dice, a coin etc ... and, once you grasp all of this, move on to more complex shapes ...
[Edit1] algebraic approach
If you know some points in the image that correspond to known 3D points (in your mesh), then you can, along with the FOV of the camera used, compute the transform matrix placing your object ...
if the transform matrix is M (OpenGL style):
M = xx,yx,zx,ox
    xy,yy,zy,oy
    xz,yz,zz,oz
     0, 0, 0, 1
Then any point from your mesh (x,y,z) is transformed to global world (x',y',z') like this:
(x',y',z') = M * (x,y,z)
The pixel position (x'',y'') is obtained by the camera FOV perspective projection like this:
y'' = FOVy * y' / (z' + focus) + ys2;
x'' = FOVx * x' / (z' + focus) + xs2;
where camera is at (0,0,-focus), projection plane is at z=0 and viewing direction is +z so for any focal length focus and screen resolution (xs,ys):
When you put all this together you obtain this:
xi'' = FOVx * ( xx*xi + yx*yi + zx*zi + ox ) / ( xz*xi + yz*yi + zz*zi + oz + focus ) + xs2
yi'' = FOVy * ( xy*xi + yy*yi + zy*zi + oy ) / ( xz*xi + yz*yi + zz*zi + oz + focus ) + ys2
where (xi,yi,zi) is the i-th known point's 3D position in mesh local coordinates and (xi'',yi'') is the corresponding known 2D pixel position. So the unknowns are the M values:
{ xx,xy,xz,yx,yy,yz,zx,zy,zz,ox,oy,oz }
So we got 2 equations per each known point and 12 unknowns total. So you need to know 6 points. Solve the system of equations and construct your matrix M.
Also you can exploit that M is a uniform orthogonal/orthonormal matrix so vectors
X = (xx,xy,xz)
Y = (yx,yy,yz)
Z = (zx,zy,zz)
Are perpendicular to each other so:
(X.Y) = (Y.Z) = (Z.X) = 0.0
This can lower the number of needed points by introducing these constraints into your system. You can also exploit the cross product, so if you know 2 vectors the third can be computed:
Z = (X x Y)*scale
So instead of 3 variables you need just single scale (which is 1 for orthonormal matrix). If I assume orthonormal matrix then:
|X| = |Y| = |Z| = 1
so we get 6 additional equations (3 x dot, and 3 for cross) without any additional unknowns, so 3 points are indeed enough.
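For the unconstrained variant with 6 or more known points, the linear system can be solved with a direct linear transform (DLT). The sketch below is a generic version under stated assumptions: it estimates a single 3x4 matrix P with (u, v, 1) ~ P * (x, y, z, 1), so the camera projection and M are folded together, the result is only defined up to scale, and the orthonormality constraints are not enforced. The function name is made up for illustration.

```python
import numpy as np

def estimate_projection_dlt(points_3d, points_2d):
    """Estimate a 3x4 projection matrix (up to scale) from >= 6 correspondences."""
    X = np.asarray(points_3d, dtype=float)
    x = np.asarray(points_2d, dtype=float)
    if len(X) < 6 or len(X) != len(x):
        raise ValueError("need at least 6 matching 3D/2D points")
    rows = []
    for (px, py, pz), (u, v) in zip(X, x):
        Xh = [px, py, pz, 1.0]
        # u = (P1 . Xh) / (P3 . Xh)  ->  P1 . Xh - u * (P3 . Xh) = 0
        rows.append([*Xh, 0.0, 0.0, 0.0, 0.0, *[-u * c for c in Xh]])
        # v = (P2 . Xh) / (P3 . Xh)  ->  P2 . Xh - v * (P3 . Xh) = 0
        rows.append([0.0, 0.0, 0.0, 0.0, *Xh, *[-v * c for c in Xh]])
    # The solution is the right singular vector of the smallest singular value
    _, _, vt = np.linalg.svd(np.asarray(rows))
    return vt[-1].reshape(3, 4)
```

The known points must not all lie in one plane, otherwise the system is degenerate.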
I have a parallel trinocular setup where all 3 cameras are aligned in a collinear fashion as depicted below.
The baseline (distance between cameras) between left and centre camera is the shortest and the baseline between left and right camera is the longest.
In theory I can obtain 3 sets of disparity images using different camera combinations (L-R, L-C and C-R). I can generate depth maps (3D points) for each disparity map using triangulation. I now have 3 depth maps.
The L-C combination has higher depth accuracy (measured distance is more accurate) for objects that are near (since the baseline is short), whereas the L-R combination has higher depth accuracy for objects that are far (since the baseline is long). Similarly the C-R combination is accurate for objects at medium distance.
In stereo setups, normally we define the left (RGB) image as the reference image. In my project, by thresholding the depth values, I obtain an ROI on the reference image. For example I find all the pixels that have a depth value between 10-20m and find their respective pixel locations. In this case, I have a relationship between 3D points and their corresponding pixel location.
Since in normal stereo setups, we can have higher depth accuracy only for one of the two regions depending upon the baseline (near and far), I plan on using 3 cameras. This helps me to generate 3D points of higher accuracy for three regions (near, medium and far).
I now want to merge the 3 depth maps to obtain a global map. My problems are as follows -
How to merge the three depth maps ?
After merging, how do I know which depth value corresponds to which pixel location in the reference (Left RGB) image ?
Your help will be much appreciated :)
1) I think that simple "merging" of depth maps (as matrices of values) is not possible, if you are thinking of a global 2D depth map as an image or a matrix of depth values. Instead, you can consider merging the 3 sets of 3D points using some similarity criterion such as the distance (refining your point cloud). If two points are too close, delete one of them (pseudocode):
for i in range(points):
    for j in range(i + 1, points):
        if distance(i, j) < threshold:
            delete point j
            # or delete the 2 points and add a point that has the average coordinates
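A runnable version of the averaging variant could look like the following sketch. The function name is hypothetical, and the greedy grouping is just one simple way to do it. It compares every point with every other one, so for real point clouds you would probably want a spatial index such as a KD-tree instead.

```python
import numpy as np

def merge_close_points(points, threshold):
    """Greedily replace groups of points closer than threshold by their mean."""
    remaining = [np.asarray(p, dtype=float) for p in points]
    merged = []
    while remaining:
        seed = remaining.pop()
        # Split the rest into points near the seed and points far from it
        close = [p for p in remaining if np.linalg.norm(p - seed) < threshold]
        remaining = [p for p in remaining if np.linalg.norm(p - seed) >= threshold]
        merged.append(np.mean([seed, *close], axis=0))
    return np.array(merged)

# Two nearly identical points (e.g. from two camera pairs) plus one distinct point.
cloud = [[0.0, 0.0, 10.00], [0.0, 0.0, 10.02], [1.0, 0.0, 10.0]]
print(merge_close_points(cloud, threshold=0.05))
```

A fixed threshold is a simplification here, since the depth error of each camera pair changes with distance.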
2) Following from point one, this question becomes "how to connect a 3D point to the related pixel in the left image" (it is the only interpretation).
The answer is simply: use the projection equation. If you have K (intrinsic matrix), R (rotation matrix) and t (translation vector) from the calibration of the left camera, join R and t into a 3x4 matrix [R|t]
and then map the 3D point M, written in homogeneous coordinates (X,Y,Z,1), to an image point m = (u,v,w):
m = K*[R|t]*M
divide m by its third coordinate w and you obtain
m = (u', v', 1)
u' and v' are the pixel coordinates in the left image.
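The same projection in numpy, as a small sketch. The function name is made up, and K, R and t are assumed to come from the calibration of the left camera, with the 3D points given in the frame that R and t map from.

```python
import numpy as np

def project_to_left_image(points_3d, K, R, t):
    """Project Nx3 points to pixel coordinates with m = K [R|t] M."""
    pts = np.asarray(points_3d, dtype=float).reshape(-1, 3)
    P = K @ np.hstack([R, np.reshape(t, (3, 1))])   # 3x4 projection matrix
    M = np.hstack([pts, np.ones((len(pts), 1))])    # homogeneous (X, Y, Z, 1)
    m = M @ P.T                                     # rows are (u, v, w)
    return m[:, :2] / m[:, 2:3]                     # divide by w -> (u', v')
```

Rounding (u', v') to integers gives the pixel in the reference image that each merged 3D point belongs to.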
There are many posts about 3D reconstruction from stereo views of known internal calibration, some of which are excellent. I have read a lot of them, and based on what I have read I am trying to compute my own 3D scene reconstruction with the below pipeline / algorithm. I'll set out the method then ask specific questions at the bottom.
0. Calibrate your cameras:
This means retrieve the camera calibration matrices K1 and K2 for Camera 1 and Camera 2. These are 3x3 matrices encapsulating each camera's internal parameters: focal length, principal point offset / image centre. These don't change, you should only need to do this once, well, for each camera as long as you don't zoom or change the resolution you record in.
Do this offline. Do not argue.
I'm using OpenCV's CalibrateCamera() and checkerboard routines, but this functionality is also included in the Matlab Camera Calibration toolbox. The OpenCV routines seem to work nicely.
1. Fundamental Matrix F:
With your cameras now set up as a stereo rig, determine the fundamental matrix (3x3) of that configuration using point correspondences between the two images/views.
How you obtain the correspondences is up to you and will depend a lot on the scene itself.
I am using OpenCV's findFundamentalMat() to get F, which provides a number of options method wise (8-point algorithm, RANSAC, LMEDS).
You can test the resulting matrix by plugging it into the defining equation of the Fundamental matrix: x'Fx = 0 where x' and x are the raw image point correspondences (x, y) in homogeneous coordinates (x, y, 1) and one of the three-vectors is transposed so that the multiplication makes sense. The nearer to zero for each correspondence, the better F is obeying its relation. This is equivalent to checking how well the derived F actually maps from one image plane to another. I get an average deflection of ~2px using the 8-point algorithm.
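A small sketch of such a check, with hypothetical names. One caveat worth keeping in mind: the raw value x'Fx depends on the arbitrary scale of F and is not a pixel quantity, so the sketch also returns the distance of each point to its epipolar line, which is measured in pixels.

```python
import numpy as np

def epipolar_residuals(F, pts1, pts2):
    """Check x2^T F x1 = 0 for Nx2 pixel correspondences.

    Returns the raw algebraic residuals and the mean of the two
    point-to-epipolar-line distances (in pixels) per correspondence.
    """
    pts1 = np.asarray(pts1, dtype=float)
    pts2 = np.asarray(pts2, dtype=float)
    x1 = np.hstack([pts1, np.ones((len(pts1), 1))])
    x2 = np.hstack([pts2, np.ones((len(pts2), 1))])
    lines2 = x1 @ F.T   # epipolar lines F x1 in image 2
    lines1 = x2 @ F     # epipolar lines F^T x2 in image 1
    algebraic = np.sum(x2 * lines2, axis=1)
    d2 = np.abs(algebraic) / np.hypot(lines2[:, 0], lines2[:, 1])
    d1 = np.abs(algebraic) / np.hypot(lines1[:, 0], lines1[:, 1])
    return algebraic, 0.5 * (d1 + d2)
```

Comparing the pixel distances between the 8-point and RANSAC estimates is more meaningful than comparing the raw residuals.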
2. Essential Matrix E:
Compute the Essential matrix directly from F and the calibration matrices.
E = K2.t * F * K1
3. Internal Constraint upon E:
E should obey certain constraints. In particular, if decomposed by SVD into U*S*V.t, then its singular values should be a, a, 0: the first two diagonal elements of S should be equal, and the third zero.
I was surprised to read here that if this is not true when you test for it, you might choose to fabricate a new Essential matrix from the prior decomposition like so: E_new = U * diag(1,1,0) * V.t which is of course guaranteed to obey the constraint. You have essentially set S = (100,010,000) artificially.
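Steps 2 and 3 together, as a numpy sketch. The function name is hypothetical, and F is assumed to satisfy x2.t * F * x1 = 0 with x1 in the image of camera 1.

```python
import numpy as np

def essential_from_fundamental(F, K1, K2):
    """Compute E = K2.t * F * K1 and force its singular values to (1, 1, 0)."""
    E = K2.T @ F @ K1
    U, S, Vt = np.linalg.svd(E)
    # Rebuild E from the decomposition with the ideal singular values
    E_fixed = U @ np.diag([1.0, 1.0, 0.0]) @ Vt
    return E_fixed, S   # S shows how far the raw E was from the constraint
```

Looking at the returned S is a cheap sanity check: if the first two values differ a lot, or the third is far from zero, then F or the calibration matrices are probably off.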
4. Full Camera Projection Matrices:
There are two camera projection matrices P1 and P2. These are 3x4 and obey the x = PX relation. Also, P = K[R|t] and therefore K_inv.P = [R|t] (where the camera calibration has been removed).
The first matrix P1 (excluding the calibration matrix K) can be set to [I|0], then P2 (excluding K) is [R|t].
Compute the Rotation and translation between the two cameras R, t from the decomposition of E. There are two possible ways to calculate R (U*W*V.t and U*W.t*V.t) and two ways to calculate t (±third column of U), which means that there are four combinations of Rt, only one of which is valid.
Compute all four combinations, and choose the one that geometrically corresponds to the situation where a reconstructed point is in front of both cameras. I actually do this by carrying through and calculating the resulting P2 = [R|t] and triangulating the 3d position of a few correspondences in normalised coordinates to ensure that they have a positive depth (z-coord)
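A sketch of this selection step in numpy. All names are hypothetical, the correspondences are assumed to be in normalised coordinates (K already removed), and the triangulation inside is a plain linear (DLT) one used only for the depth test.

```python
import numpy as np

W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])

def triangulate(P1, P2, x1, x2):
    """Linear triangulation of one correspondence given two 3x4 matrices."""
    A = np.array([x1[0] * P1[2] - P1[0],
                  x1[1] * P1[2] - P1[1],
                  x2[0] * P2[2] - P2[0],
                  x2[1] * P2[2] - P2[1]])
    X = np.linalg.svd(A)[2][-1]
    return X[:3] / X[3]

def pose_from_essential(E, x1, x2):
    """Pick the (R, t) candidate with the most points in front of both cameras."""
    U, _, Vt = np.linalg.svd(E)
    # Force proper rotations (det = +1); this only changes the sign of E
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    best, best_count = None, -1
    for R in (U @ W @ Vt, U @ W.T @ Vt):
        for t in (U[:, 2], -U[:, 2]):
            P2 = np.hstack([R, t[:, None]])
            count = 0
            for a, b in zip(x1, x2):
                X = triangulate(P1, P2, a, b)
                # Positive depth in camera 1 and in camera 2
                if X[2] > 0 and (R @ X + t)[2] > 0:
                    count += 1
            if count > best_count:
                best, best_count = (R, t), count
    return best
```

Counting over several correspondences instead of testing a single point makes the choice less sensitive to one bad match. The recovered t has unit length, since the scale of the baseline cannot be recovered from E.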
5. Triangulate in 3D
Finally, combine the recovered 3x4 projection matrices with their respective calibration matrices: P'1 = K1P1 and P'2 = K2P2
And triangulate the 3-space coordinates of each 2d point correspondence accordingly, for which I am using the LinearLS method from here.
Are there any howling omissions and/or errors in this method?
My F matrix is apparently accurate (0.22% deflection in the mapping compared to typical coordinate values), but when testing E against x'Ex = 0 using normalised image correspondences the typical error in that mapping is >100% of the normalised coordinates themselves. Is testing E against x'Ex = 0 valid, and if so where is that jump in error coming from?
The error in my fundamental matrix estimation is significantly worse when using RANSAC than the 8pt algorithm, ±50px in the mapping between x and x'. This deeply concerns me.
'Enforcing the internal constraint' still sits very weirdly with me - how can it be valid to just manufacture a new Essential matrix from part of the decomposition of the original?
Is there a more efficient way of determining which combo of R and t to use than calculating P and triangulating some of the normalised coordinates?
My final re-projection error is hundreds of pixels in 720p images. Am I likely looking at problems in the calibration, determination of P-matrices or the triangulation?
The error in my fundamental matrix estimation is significantly worse
when using RANSAC than the 8pt algorithm, ±50px in the mapping between
x and x'. This deeply concerns me.
Using the 8pt algorithm does not exclude using the RANSAC principle.
When using the 8pt algorithm directly which points do you use? You have to choose 8 (good) points by yourself.
In theory you can compute a fundamental matrix from any point correspondences, but you often get a degenerate fundamental matrix because the linear equations are not independent. Another point is that the 8pt algorithm uses an overdetermined system of linear equations, so a single outlier can destroy the fundamental matrix.
Have you tried to use the RANSAC result? I bet it represents one of the correct solutions for F.
My F matrix is apparently accurate (0.22% deflection in the mapping
compared to typical coordinate values), but when testing E against
x'Ex = 0 using normalised image correspondences the typical error in
that mapping is >100% of the normalised coordinates themselves. Is
testing E against xEx = 0 valid, and if so where is that jump in error
coming from?
Again, if F is degenerate, x'Fx = 0 can hold for every point correspondence.
Another reason for your incorrect E may be that the cameras are switched (K1.t * F * K2 instead of K2.t * F * K1). Remember to check: x'Ex = 0
'Enforcing the internal constraint' still sits very weirdly with me -
how can it be valid to just manufacture a new Essential matrix from
part of the decomposition of the original?
It is explained in 'Multiple View Geometry in Computer Vision' by Hartley and Zisserman. As far as I know, it has to do with minimizing the Frobenius norm of the difference between the original matrix and the corrected one.
You can Google it and there are pdf resources.
Is there a more efficient way of determining which combo of R and t to
use than calculating P and triangulating some of the normalised
coordinates?
Not as far as I know.
My final re-projection error is hundreds of pixels in 720p images. Am
I likely looking at problems in the calibration, determination of
P-matrices or the triangulation?
Your rigid body transformation P2 is incorrect because E is incorrect.
I am trying to rectify two sequences of images for stereo matching. The usual approach of using stereoCalibrate() with a checkerboard pattern is not of use to me, since I am only working with the footage.
What I have is the correct calibration data of the individual cameras (camera matrix and distortion parameters) as well as measurements of their distance and angle between each other.
How can I construct the rotation matrix and translation vector needed for stereoRectify()?
The naive approach of using
Mat T = (Mat_<double>(3,1) << distance, 0, 0);
Mat R = (Mat_<double>(3,3) << cos(angle), 0, sin(angle), 0, 1, 0, -sin(angle), 0, cos(angle));
resulted in a heavily warped image. Do these matrices need to relate to a different origin point I am not aware of? Or do I need to convert the distance/angle values somehow so that they depend on the pixel size?
Any help would be appreciated.
It's not clear whether you have enough information about the camera poses to perform an accurate rectification.
Both T and R are measured in 3D, but in your case:
T is one-dimensional (along the x-axis only), which means that you are confident that the two cameras are perfectly aligned along the other axes (in particular, that you have less than 1 pixel of error on the y-axis, i.e. a few microns by today's standards);
R leaves the Y-coordinates untouched. Thus, all you have is a rotation around this axis. Does it match your experimental setup?
Finally, you need to check the consistency of the units that you are using for the translation and rotation to match with the units from the intrinsic data.
If it is feasible, you can check your results by finding some matching points between the two cameras and proceeding to a projective calibration: the accurate knowledge of the 3D position of the calibration points is required for metric reconstruction only. Other tasks rely on the essential or fundamental matrices, that can be computed from image-to-image point correspondences.
If the intrinsics and extrinsics are known, I recommend this method: http://link.springer.com/article/10.1007/s001380050120#page-1
It is easy to implement. Basically, you rotate the right camera until both cameras have the same orientation, meaning both share a common R. The epipoles are then mapped to infinity, and the epipolar lines become parallel to the image x-axis.
The first row of the new R (x) is simply the baseline direction, i.e. the normalised difference of the two camera centers. The second row (y) is the cross product of the old left z-axis with the new x-axis. The third row (z) is the cross product of the first two rows.
Finally, you need to calculate the 3x3 homography described in the above link and use warpPerspective() to get a rectified version.
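A numpy sketch of those steps, under stated assumptions: the function names are made up, the camera centers are given in world coordinates, R_left is the world-to-camera rotation of the left camera (so its rows are the camera axes), and the homography is taken as K_new * R_new * inverse(K_old * R_old), which is my reading of the linked method.

```python
import numpy as np

def rectifying_rotation(c_left, c_right, R_left):
    """Common rotation for both rectified cameras, built row by row."""
    baseline = np.asarray(c_right, dtype=float) - np.asarray(c_left, dtype=float)
    r1 = baseline / np.linalg.norm(baseline)   # new x-axis: along the baseline
    old_z = np.asarray(R_left, dtype=float)[2] # old left optical axis
    r2 = np.cross(old_z, r1)                   # new y-axis: orthogonal to both
    r2 = r2 / np.linalg.norm(r2)
    r3 = np.cross(r1, r2)                      # new z-axis completes the frame
    return np.vstack([r1, r2, r3])

def rectifying_homography(K_new, R_new, K_old, R_old):
    """3x3 homography mapping original pixels to rectified pixels."""
    return K_new @ R_new @ np.linalg.inv(K_old @ R_old)
```

One homography is computed per camera, each with that camera's old K and R, and passed to warpPerspective() together with the corresponding image.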