How to match RGB image pixels with corresponding point cloud points - image-processing

I have a color image and the corresponding point cloud captured by an OAK-D camera (see the image below), and I want to relate pixels in the color image to the corresponding points in the point cloud.
How can I get this information? For instance, given a pixel location (200, 250) in the color image, how do I find the corresponding point in the point cloud?
Any help would be appreciated.

It sounds like you want to project the 2D image to the 3D point cloud using the computed disparity map. To do this you will also need to know your camera intrinsics. Since you are using the OAK-D, you should be able to get everything you need with the following piece of code.
import depthai as dai

# assumes `pipeline`, `monoLeft`, and `monoRight` were set up earlier
with dai.Device(pipeline) as device:
    calibData = device.readCalibration()

    # get right intrinsic matrix
    w, h = monoRight.getResolutionSize()
    K_right = calibData.getCameraIntrinsics(dai.CameraBoardSocket.RIGHT, dai.Size2f(w, h))

    # get left intrinsic matrix
    w, h = monoLeft.getResolutionSize()
    K_left = calibData.getCameraIntrinsics(dai.CameraBoardSocket.LEFT, dai.Size2f(w, h))

    # rectification rotations and stereo baseline
    R_left = calibData.getStereoLeftRectificationRotation()
    R_right = calibData.getStereoRightRectificationRotation()
    x_baseline = calibData.getBaselineDistance()
Once you have all your camera parameters, you should be able to use OpenCV to approach this.
First you will need to construct the Q matrix (the rectified transformation matrix). You will need to provide:
- the left and right intrinsic calibration matrices,
- the translation vector from the coordinate system of the first camera to the second camera,
- the rotation matrix from the coordinate system of the first camera to the second camera.
Here's a coded example:
import numpy as np
import cv2

# the intrinsics come back as nested lists; make them arrays for OpenCV
K_left = np.array(K_left)
K_right = np.array(K_right)

image_size = (w, h)                      # the mono camera resolution from above
T = np.array([x_baseline, 0.0, 0.0])     # translation from camera 1 to camera 2 (baseline along x)

R1, R2, P1, P2, Q, roi_left, roi_right = cv2.stereoRectify(
    cameraMatrix1=K_left,                # left intrinsic matrix
    distCoeffs1=np.zeros(5),             # distortion assumed negligible / already corrected
    cameraMatrix2=K_right,               # right intrinsic matrix
    distCoeffs2=np.zeros(5),
    imageSize=image_size,                # pass in the image size
    R=R_left,                            # rotation matrix from camera 1 to camera 2
    T=T)                                 # translation from camera 1 to camera 2
Next you will need to reproject the image to 3D, using the known disparity map and the Q matrix. OpenCV makes this much easier than doing it by hand:
xyz = cv2.reprojectImageTo3D(disparity, Q)
This will give you an array of 3D points. This array has the shape (rows, columns, 3), where the 3 corresponds to the (x, y, z) coordinate of the point cloud. Now you can use a pixel location to index into xyz and find its corresponding (x, y, z) point.
pix_row = 200
pix_col = 250
point_cloud_coordinate = xyz[pix_row, pix_col, :]
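If you also want to pair every pixel's color with its 3D point in one go, here is a minimal sketch (assuming the color image has been aligned to the same resolution and rectification as the disparity map; color and disparity are placeholders for your own frames):

import numpy as np

# keep only pixels with a valid, finite reconstruction
mask = np.isfinite(xyz[:, :, 2]) & (disparity > 0)
points_xyz = xyz[mask]     # N x 3 array of (x, y, z) points
points_rgb = color[mask]   # N x 3 array of matching RGB values, in the same order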
See the docs for more details:
cv2.stereoRectify()
cv2.reprojectImageTo3D()

Related

Reconstruct 3D object with OpenCV

I am following the OpenCV camera calibration tutorial https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_calib3d/py_calibration/py_calibration.html to run camera calibration:
ret, mtx, dist, rvecs, tvecs = cv2.calibrateCamera(objpoints, imgpoints, gray.shape[::-1], None, None)
What I want to do next is to reconstruct the 3D location of some feature points. The feature points are defined in image space. Here is what I am planning to do:
Find the new camera matrix:
h, w = ...  # my image dimensions
newcameramtx, roi = cv2.getOptimalNewCameraMatrix(mtx, dist, (w, h), 1, (w, h))
Undistort the feature point locations:
new_points = cv2.undistortPoints(my_feature_points, mtx, dist, P=newcameramtx)
Reconstruct the 3D coordinates of the feature points for a given Z. I have two problems here. First, I do not know how to reconstruct the 3D coordinates. Second, when I do it, should I use the original camera matrix "mtx" or the new camera matrix "newcameramtx"? And what about "roi"? Where should I apply it?
Thank you very much.
Take a look at this version of the docs, which I find easier to read. The key equation is this one:

        [x]   [fx  0 cx]   [X]
    s * [y] = [ 0 fy cy] * [Y]
        [1]   [ 0  0  1]   [Z]

Once you have undistorted your image this equation applies. The matrix with fx, fy, cx, and cy is your camera matrix, often called M.
This equation tells you how to go from 2D pixel locations on the left, (x, y), to 3D locations in the world on the right, [X, Y, Z].
First, I do not know how to reconstruct the 3D coordinate
To do that, we can apply the equation. Given a pixel location (x, y) and a range (the depth, plugged in as the scale factor s above), inverting the equation gives:

    [X, Y, Z]^T = s * M^-1 * [x, y, 1]^T

which we can do in code like so:
import numpy as np
pixels = range * np.array([x, y, 1.0])   # s * [x, y, 1]^T, with s = the known range (depth)
XYZ = np.linalg.inv(mtx) @ pixels        # [X, Y, Z]^T = M^-1 * s * [x, y, 1]^T
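As a usage sketch for the asker's feature points (assuming they are undistorted back into the original camera frame with P=mtx so the same matrix applies, and that the range of each point is known; known_range and the input point shape are assumptions):

import cv2
import numpy as np

# my_feature_points is an N x 1 x 2 float array, as cv2.undistortPoints expects
undistorted = cv2.undistortPoints(my_feature_points, mtx, dist, P=mtx).reshape(-1, 2)

Minv = np.linalg.inv(mtx)
for (u, v) in undistorted:
    XYZ = known_range * (Minv @ np.array([u, v, 1.0]))   # (X, Y, Z) in camera coordinates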
I'm not sure that you want to be calling getOptimalNewCameraMatrix, because that is cropping out pixels that may not be valid. I'd skip that for the moment until you have a better understanding of the system. The ROI is telling you where the undistorted image won't have any blank pixels.
I really recommend the book Learning OpenCV (or the new version 3 one); it helped me a huge amount. It took me from getting really frustrated reading the docs (which assume a lot of prior knowledge) to actually understanding what was going on.

How to get back the co-ordinate points corresponding to the intensity points obtained from a faster r-cnn object detection process?

As a result of the Faster R-CNN method of object detection, I have obtained a set of boxes of intensity values (each bounding box can be thought of as a 3D matrix with a depth of 3 for RGB intensity, a width, and a height, which can then be converted into a 2D matrix by taking the grayscale) corresponding to the region containing the object. What I want to do is to obtain the corresponding coordinate points in the original image for each cell of intensity inside the bounding box. Any ideas how to do so?
From what I understand, you have an R-CNN model that outputs cropped pieces of the input image, and you now want to trace those output crops back to their coordinates in the original image.
What you can do is simply use a patch-similarity measure to find the original position.
Since the output crop should look exactly like itself in the original image, just use a pixel-based distance:
Find the place in the image with the smallest distance (it should be zero), and from that you can find your desired coordinates.
In Python:
import numpy as np

d_min = 10**6
crop_size = crop.shape
coord = None
for x in range(org_image.shape[0] - crop_size[0]):
    for y in range(org_image.shape[1] - crop_size[1]):
        d = np.abs(np.sum(org_image[x:x+crop_size[0], y:y+crop_size[1]] - crop))
        if d <= d_min:
            d_min = d
            coord = [x, y]
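A faster route to the same answer is OpenCV's template matching, which performs this exhaustive comparison in optimized code (assuming org_image and crop are grayscale arrays of the same dtype):

import cv2

res = cv2.matchTemplate(org_image, crop, cv2.TM_SQDIFF)    # squared difference at every offset
min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(res)    # the best match has the smallest value
coord = [min_loc[1], min_loc[0]]                           # minMaxLoc returns (x, y); swap to (row, col)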
However, your model should have that info available in it (after all, it crops the output based on some coordinates), so maybe add some info on your implementation.

How to get the transformation matrix of a 3d model to object in a 2d image

Given an object's 3D mesh file and an image that contains the object, what are some techniques to get the orientation/pose parameters of the 3d object in the image?
I tried searching for some techniques, but most seem to require texture information of the object or at least some additional information. Is there a way to get the pose parameters using just an image and a 3d mesh file (wavefront .obj)?
Here's an example of a 2D image that can be expected.
FOV of camera
The field of view of the camera is the absolute minimum you need to know to even start with this (how can you determine how to place the object when you have no idea how it would affect the scene?). Basically you need a transform matrix that maps from the world GCS (global coordinate system) to camera/screen space and back. If you have no clue what I am writing about, then perhaps you should not try any of this before you learn the math.
For an unknown camera you can do some calibration based on markers or etalons (of known size and shape) in the view. But it is much better to use real camera values (like FOV angles in the x and y directions, focal length, etc.).
The goal of this is to create a function that maps world GCS (x,y,z) into screen LCS (x,y).
For more info read:
transform matrix anatomy
3D graphic pipeline
Perspective projection
Silhouette matching
In order to compare the similarity of the rendered and the real image you need some kind of measure. As you need to match geometry, I think silhouette matching is the way to go (ignoring textures, shadows and stuff).
So first you need to obtain the silhouettes. Use image segmentation for that and create a ROI mask of your object. For the rendered image this is easy, as you can render the object with a single color without any lighting directly into the ROI mask.
Then you need to construct a function that computes the difference between the silhouettes. You can use any kind of measure, but I think you should start with the non-overlapping-area pixel count (it is easy to compute), as sketched below.
Basically you count the pixels that are present in only one of the two ROI (region of interest) masks.
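A minimal sketch of that measure, for two equally sized boolean ROI masks (names are placeholders):

import numpy as np

def silhouette_diff(mask_a, mask_b):
    # count pixels that are set in exactly one of the two masks (XOR)
    return np.count_nonzero(np.logical_xor(mask_a, mask_b))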
estimate position
As you have the mesh, you know its size, so place it in the GCS so that the rendered image has a bounding box very close to that of the real image. If you do not have the FOV parameters, then you need to rescale and translate each rendered image so that it matches the image's bounding box (and as a result you obtain only the orientation, not the position, of the object, of course). Cameras have perspective, so the farther from the camera you place your object, the smaller it will be.
fit orientation
Render a few fixed orientations covering all orientations with some step, e.g. 8^3 orientations. For each, compute the silhouette difference and choose the orientation with the smallest difference.
Then fit the orientation angles around it to minimize the difference (a coarse search is sketched below). If you do not know how optimization or fitting works, see this:
How approximation search works
Beware: too small a number of initial orientations can cause false positives or missed solutions; too high a number will be slow.
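A sketch of the coarse search (render_silhouette and real_mask are hypothetical placeholders: the first renders the mesh as a ROI mask at the given orientation, the second is the segmented mask of the real image; silhouette_diff is the measure sketched above):

import numpy as np

angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)   # 8 steps per axis -> 8^3 orientations

best_orientation = None
best_diff = np.inf
for yaw in angles:
    for pitch in angles:
        for roll in angles:
            d = silhouette_diff(render_silhouette(yaw, pitch, roll), real_mask)
            if d < best_diff:
                best_diff = d
                best_orientation = (yaw, pitch, roll)
# then refine: repeat with smaller angular steps around best_orientation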
Now that was some basics in a nutshell. As your mesh is not very simple, you may need to tweak this, like using contours instead of silhouettes and the distance between contours instead of the non-overlapping pixel count, which gets really hard to compute... You should start with simpler meshes like a die or a coin, and once you grasp all of this, move to more complex shapes...
[Edit1] algebraic approach
If you know some points in the image that correspond to known 3D points (in your mesh), then along with the FOV of the camera used you can compute the transform matrix placing your object...
If the transform matrix is M (OpenGL style):

    M = xx,yx,zx,ox
        xy,yy,zy,oy
        xz,yz,zz,oz
         0, 0, 0, 1
Then any point from your mesh (x,y,z) is transformed to global world (x',y',z') like this:
(x',y',z') = M * (x,y,z)
The pixel position (x'',y'') is then obtained by the camera FOV perspective projection like this:
x'' = FOVx*x'/(z'+focus) + xs2;
y'' = FOVy*y'/(z'+focus) + ys2;
where the camera is at (0,0,-focus), the projection plane is at z=0 and the viewing direction is +z, so for any focal length focus and screen resolution (xs,ys):
xs2 = xs*0.5;
ys2 = ys*0.5;
FOVx = xs2/focus;
FOVy = ys2/focus;
When you put all this together you obtain this:
xi'' = FOVx*( xx*xi + yx*yi + zx*zi + ox ) / ( xz*xi + yz*yi + zz*zi + oz + focus ) + xs2
yi'' = FOVy*( xy*xi + yy*yi + zy*zi + oy ) / ( xz*xi + yz*yi + zz*zi + oz + focus ) + ys2
where (xi,yi,zi) is the i-th known point's 3D position in mesh local coordinates and (xi'',yi'') is the corresponding known 2D pixel position. So the unknowns are the M values:
{ xx,xy,xz, yx,yy,yz, zx,zy,zz, ox,oy,oz }
So we get 2 equations per known point and 12 unknowns in total, so you need to know 6 points. Solve the system of equations and construct your matrix M (or let OpenCV do it, as sketched below).
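If you would rather not solve the system by hand, OpenCV's solvePnP solves exactly this known-2D/known-3D pose problem; a minimal sketch (object_points, image_points and the camera matrix K are placeholders for your own data, and lens distortion is assumed to be negligible):

import numpy as np
import cv2

object_points = np.asarray(object_points, dtype=np.float64)   # N x 3 mesh points, local coordinates
image_points = np.asarray(image_points, dtype=np.float64)     # N x 2 corresponding pixels

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None)

R, _ = cv2.Rodrigues(rvec)    # 3x3 rotation: the X, Y, Z vectors of M
M = np.eye(4)
M[:3, :3] = R                 # rotation part
M[:3, 3] = tvec.ravel()       # the (ox, oy, oz) translation column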
You can also exploit the fact that M is a uniform orthogonal/orthonormal matrix, so the vectors
X = (xx,xy,xz)
Y = (yx,yy,yz)
Z = (zx,zy,zz)
are perpendicular to each other, so:
(X.Y) = (Y.Z) = (Z.X) = 0.0
which can lower the number of needed points by adding these equations to your system. You can also exploit the cross product: if you know 2 of the vectors, the third can be computed as
Z = (X x Y)*scale
so instead of 3 variables you need just a single scale (which is 1 for an orthonormal matrix). If I assume an orthonormal matrix then also:
|X| = |Y| = |Z| = 1
So we get 6 additional equations (3 from the dot products and 3 from the cross product) without any additional unknowns, so 3 points are indeed enough.
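A tiny numpy illustration of those constraints (X and Y are assumed to be two already-known, unit-length axis vectors of M):

import numpy as np

Z = np.cross(X, Y)                          # third axis from the other two (scale = 1 when orthonormal)
assert abs(np.dot(X, Y)) < 1e-6             # X and Y are perpendicular
assert abs(np.linalg.norm(Z) - 1.0) < 1e-6  # Z automatically has unit length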

OpenCV: get perspective matrix from translation & rotation

I'm trying to verify my camera calibration, so I'd like to rectify the calibration images. I expect that this will involve using a call to warpPerspective but I do not see an obvious function that takes the camera matrix, and the rotation and translation vectors to generate the perspective matrix for this call.
Essentially I want to do the process described here (see especially the images towards the end) but starting with a known camera model and pose.
Is there a straightforward function call that takes the camera intrinsic and extrinsic parameters and computes the perspective matrix for use in warpPerspective?
I'll be calling warpPerspective after having called undistort on the image.
In principle, I could derive the solution by solving the system of equations defined at the top of the opencv camera calibration documentation after specifying the constraint Z=0, but I figure that there must be a canned routine that will allow me to orthorectify my test images.
In my searches, I'm finding it hard to wade through all of the stereo calibration results -- I only have one camera, but want to rectify the image under the constraint that I'm only looking at a planar test pattern.
Actually there is no need to involve an orthographic camera. Here is how you can get the appropriate perspective transform.
If you calibrated the camera using cv::calibrateCamera, you obtained a camera matrix K, a vector of lens distortion coefficients D for your camera and, for each image that you used, a rotation vector rvec (which you can convert to a 3x3 matrix R using cv::Rodrigues, doc) and a translation vector T. Consider one of these images and the associated R and T. After you call cv::undistort using the distortion coefficients, the image will be as if it was acquired by a camera of projection matrix K * [ R | T ].
Basically (as #DavidNilosek intuited), you want to cancel the rotation and get the image as if it was acquired by the projection matrix of form K * [ I | -C ] where C=-R.inv()*T is the camera position. For that, you have to apply the following transformation:
Hr = K * R.inv() * K.inv()
The only potential problem is that the warped image might go outside the visible part of the image plane. Hence, you can use an additional translation to solve that issue, as follows:
     [ 1 0 |         ]
Ht = [ 0 1 | -K*C/Cz ]
     [ 0 0 |         ]
where Cz is the component of C along the Oz axis.
Finally, with the definitions above, H = Ht * Hr is a rectifying perspective transform for the considered image.
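A minimal numpy sketch of that recipe (K, rvec, tvec and the undistorted image img are assumed to come from your calibration; this takes one reading of Ht, using the first two components of -K*C/Cz as a pure pixel translation):

import numpy as np
import cv2

R, _ = cv2.Rodrigues(rvec)
C = -R.T @ tvec.reshape(3)        # camera position, C = -R^-1 * T

Hr = K @ R.T @ np.linalg.inv(K)   # cancels the rotation: K * R^-1 * K^-1

t = -(K @ C) / C[2]               # the -K*C/Cz column from above
Ht = np.array([[1.0, 0.0, t[0]],
               [0.0, 1.0, t[1]],
               [0.0, 0.0, 1.0]])

H = Ht @ Hr                       # rectifying perspective transform
rectified = cv2.warpPerspective(img, H, (img.shape[1], img.shape[0]))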
This is a sketch of what I mean by "solving the system of equations" (in Python):
import cv2
import numpy as np

# rvec = the rotation vector
# tvec = the translation vector
# A = the camera intrinsic matrix

def unit_vector(v):
    return v / np.sqrt(np.sum(v * v))

(fx, fy) = (A[0, 0], A[1, 1])
Ainv = np.array([[1.0/fx,    0.0, -A[0, 2]/fx],
                 [   0.0, 1.0/fy, -A[1, 2]/fy],
                 [   0.0,    0.0,         1.0]], dtype=np.float32)
R, _ = cv2.Rodrigues(rvec)        # Rodrigues returns (R, jacobian)
Rinv = np.transpose(R)
u = np.dot(Rinv, tvec).ravel()    # displacement between camera and world coordinate origin, in world coordinates

# corners of the image, here hard coded for a 640x480 image
pixel_corners = [np.array(c, dtype=np.float32) for c in
                 [(0+0.5, 0+0.5, 1), (0+0.5, 640-0.5, 1), (480-0.5, 640-0.5, 1), (480-0.5, 0+0.5, 1)]]
scene_corners = []
for c in pixel_corners:
    lhat = np.dot(Rinv, np.dot(Ainv, c))   # direction of the ray that the corner images, in world coordinates
    s = u[2] / lhat[2]
    # now we have (s*lhat - u)[2] == 0,
    # i.e. s is how far along the line of sight we need
    # to move to get to the Z == 0 plane.
    g = s * lhat - u
    scene_corners.append((g[0], g[1]))

# now we have 4 pixel_corners (image coordinates) and 4 corresponding scene_corners,
# and can call cv2.getPerspectiveTransform on them and so on (see the sketch below).
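Completing that last step, a sketch continuing the variables above (assuming a consistent (x, y) ordering in both corner lists; the scene coordinates may first need scaling and offsetting into pixel units, and output_size and undistorted_img are placeholders):

src = np.float32([c[:2] for c in pixel_corners])   # pixel corners
dst = np.float32(scene_corners)                    # corresponding plane coordinates
H = cv2.getPerspectiveTransform(src, dst)
rectified = cv2.warpPerspective(undistorted_img, H, output_size)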

finding the real world coordinates of an image point

I have been searching through lots of resources on the internet for many days, but I couldn't solve the problem.
I have a project in which I am supposed to detect the position of a circular object on a plane. Since it is on a plane, all I need is the x and y position (not z). For this purpose I have chosen to go with image processing. The camera (single view, not stereo) position and orientation are fixed with respect to a reference coordinate system on the plane and are known.
I have detected the image pixel coordinates of the centers of the circles by using OpenCV. All I need now is to convert these coordinates to the real world.
http://www.packtpub.com/article/opencv-estimating-projective-relations-images
On this site and other sites as well, a homographic transformation is given as:
p = C[R|T]P, where P is the real world coordinates and p is the pixel coordinates (in homogeneous coordinates). C is the camera matrix representing the intrinsic parameters, R is the rotation matrix and T is the translation matrix. I have followed a tutorial on calibrating the camera with OpenCV (applied the cameraCalibration source file); I have 9 fine chessboard images, and as output I have the intrinsic camera matrix and the translational and rotational parameters for each of the images.
I have the 3x3 intrinsic camera matrix (focal lengths and center pixels) and a 3x4 extrinsic matrix [R|T], in which R is the left 3x3 and T is the right 3x1. According to the p = C[R|T]P formula, I assume that by multiplying these parameter matrices with P (world) we get p (pixel). But what I need is to project the p (pixel) coordinates to P (world coordinates) on the ground plane.
I am studying electrical and electronics engineering. I did not take image processing or advanced linear algebra classes. As I remember from my linear algebra course, we could manipulate a transformation as P = [R|T]^-1 * C^-1 * p. However, this is in a Euclidean coordinate system, and I don't know whether such a thing is possible with homogeneous coordinates. Moreover, the 3x4 [R|T] matrix is not invertible, and I don't know whether this is the correct way to go.
The intrinsic and extrinsic parameters are known. All I need is the real-world coordinates of the projected point on the ground plane. Since the point is on a plane, the coordinates will be 2-dimensional (depth is not important, as opposed to general single view geometry). The camera is fixed (position and orientation). How should I find the real world coordinates of a point in an image captured by a camera (single view)?
EDIT
I have been reading "Learning OpenCV" by Gary Bradski & Adrian Kaehler. On page 386, under the Calibration -> Homography section, it is written: q = s*M*W*Q, where M is the camera intrinsic matrix, W is the 3x4 [R|T], and s is an "up to" scale factor (I assume related to the homography concept, I don't know exactly); q is the pixel coordinates and Q is the real coordinates. It is said that in order to get the real world coordinates (on the chessboard plane) of an object detected in the image plane: since Z=0, the third column of W (the one multiplying Z) can be dropped; trimming these unnecessary parts, W becomes a 3x3 matrix and H = M*W is a 3x3 homography matrix. Now we can invert the homography matrix and left-multiply it with q to get Q = [X Y 1], where the Z coordinate was trimmed.
I applied the mentioned algorithm and got some results that cannot lie within the image corners (the image plane was parallel to the camera plane just ~30 cm in front of the camera, and I got results like 3000). The chessboard square sizes were entered in millimeters, so I assume the outputted real world coordinates are again in millimeters. Anyway, I am still trying things. By the way, the results were previously very, very large, but I divide all values in Q by the third component of Q to get (X, Y, 1).
FINAL EDIT
I could not get the camera calibration methods to work for me. Anyway, I should have started with perspective projection and transforms. This way I made very good estimations with a perspective transform between the image plane and the physical plane (having generated the transform from 4 pairs of corresponding coplanar points on the two planes), then simply applied the transform to the image pixel points.
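A sketch of that final approach (the four point pairs and the detected center are placeholders; the plane coordinates are in whatever units you measured them in, e.g. millimeters):

import numpy as np
import cv2

img_pts = np.float32([[u1, v1], [u2, v2], [u3, v3], [u4, v4]])     # pixel coordinates
world_pts = np.float32([[X1, Y1], [X2, Y2], [X3, Y3], [X4, Y4]])   # matching plane coordinates
H = cv2.getPerspectiveTransform(img_pts, world_pts)

center_px = np.float32([[[u_c, v_c]]])                             # detected circle center, shape 1x1x2
world_xy = cv2.perspectiveTransform(center_px, H)[0, 0]            # (X, Y) on the physical plane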
You said "I have the intrinsic camera matrix, and translational and rotational params of each of the image", but those are the translation and rotation from your camera to your chessboard. They have nothing to do with your circle. However, if you really have the translation and rotation matrices, then getting the 3D point is easy.
Apply the inverse intrinsic matrix to your screen points in homogeneous notation: C^-1 * [u, v, 1]^T, where u = col - w/2 and v = h/2 - row, where col, row are the image column and row and w, h are the image width and height. As a result you obtain a 3D point in so-called camera-normalized coordinates, p = [x, y, z]^T. All you need to do now is to subtract the translation and apply a transposed rotation: P = R^T(p - T). The order of operations is the inverse of the original, which was rotate and then translate; note that the transposed rotation performs the inverse of the original rotation but is much faster to calculate than R^-1.
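A short numpy sketch following that recipe literally (C, R, T are the calibration matrices, row/col the pixel and w/h the image size, all placeholders; note that if C already contains the principal point cx, cy, you would pass the raw (col, row, 1) instead of recentring by w/2 and h/2, and that p is only known up to scale unless the depth is fixed, e.g. by the ground plane):

import numpy as np

u = col - w / 2.0
v = h / 2.0 - row
p = np.linalg.inv(C) @ np.array([u, v, 1.0])   # camera-normalized coordinates
P = R.T @ (p - T.reshape(3))                   # world coordinates: P = R^T (p - T)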
