I am facing the challenge of recovering the Z coordinate from an inverse disparity map and the camera matrix K. The Z coordinate does not have to be metric, but it has to be scale aware.
K = [[721, 0, 596],
     [0, 721, 149],
     [0, 0, 1]]
I know that z = f*(T/d), where T is the distance between the two co-planar cameras, f is the focal length and d is the disparity. However, I am not given any extrinsics other than that the camera is 1.5 m above the floor. The disparity is already an inverse disparity, so I guess z = f*d? Also, what do I do with the other parameters in my K matrix (596, 149)? Is there any implementation in a library like OpenCV to recover the scale-aware 3D coordinates?
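For reference, the remaining entries of K are the principal point (cx, cy) = (596, 149); together with the focal length they let you back-project each pixel into a 3D point once a (scale-aware) depth Z is available. A minimal NumPy sketch, assuming a dense depth map Z has already been derived from the inverse disparity and that fx = fy = 721:

import numpy as np

fx, fy = 721.0, 721.0   # focal lengths from K
cx, cy = 596.0, 149.0   # principal point from K

def backproject(Z):
    """Back-project a dense depth map Z (H x W) into camera-frame 3D points."""
    h, w = Z.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    X = (u - cx) * Z / fx   # horizontal offset from the optical axis
    Y = (v - cy) * Z / fy   # vertical offset from the optical axis
    return np.dstack([X, Y, Z])   # H x W x 3 array of (X, Y, Z)

For a calibrated stereo pair, OpenCV's cv2.reprojectImageTo3D performs the same kind of reprojection from a disparity map and a Q matrix.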
When trying to use torch.nn.functional.affine_grid, it requires a theta affine matrix of size (N x 3 x 4) according to the documentation. I thought a general affine matrix is (N x 4 x 4). What is the expected affine matrix format in PyTorch?
An example of a 3D rotation affine input would be ideal. I appreciate your help.
The dimensions you mention apply to the case of 3D inputs, that is, when you wish to apply a 3D geometric transform to an input tensor x of shape b x c x d x h x w.
A transformation of points in 3D (represented in homogeneous coordinates as 4-vectors (x, y, z, 1)) is, in the general case, a 4x4 matrix, as you noted.
However, since we restrict ourselves to homogeneous coordinates, i.e. the fourth coordinate must be 1, the 4th row of the matrix must be (0, 0, 0, 1) (see this).
Therefore, there is no need to explicitly code this last row.
To conclude, a 3D transformation composed of a 3x3 rotation R and a 3D translation t is simply the 3x4 matrix:
theta = [R t]
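A minimal sketch of what that looks like in code, using a rotation about the depth axis (note that affine_grid works in normalized [-1, 1] coordinates, so any translation is also expressed in that range; the tensor shape and angle below are arbitrary):

import math
import torch
import torch.nn.functional as F

# Example volume: batch 1, 1 channel, depth 8, height 16, width 16
x = torch.randn(1, 1, 8, 16, 16)

# 3x3 rotation of 30 degrees in the width-height plane (i.e. about the depth axis)
a = math.radians(30.0)
R = torch.tensor([[math.cos(a), -math.sin(a), 0.0],
                  [math.sin(a),  math.cos(a), 0.0],
                  [0.0,          0.0,         1.0]])
t = torch.zeros(3, 1)                          # no translation

theta = torch.cat([R, t], dim=1).unsqueeze(0)  # shape (1, 3, 4)

grid = F.affine_grid(theta, size=x.shape, align_corners=False)  # (1, 8, 16, 16, 3)
y = F.grid_sample(x, grid, align_corners=False)                 # transformed volume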
I am currently experimenting with ORB SLAM 2 and a stereo camera like this. I am using 2.8mm and optionally 3.6mm lenses with a resolution of 640x480 pixels for the left and right camera/image.
ORB SLAM 2 lets me define several distortion/rectification parameters within the settings file (*.yaml), such as:
fx, fy, cx, cy
k1, k2, p1, p2
I conducted the OpenCV camera calibration using a chessboard as described here (9x7 inner corners and 70 mm square length). Later on I used this automated calibration program from MRPT, which gives me the same results with fewer stumbling blocks.
However, ORB SLAM 2 lets me define these additional parameters for pre-rectifying the images (if I understand this correctly):
D: 1x5 Matrix -> Distortion coefficients acquired from calibration (k1, k2, p1, p2, k3)?
K: 3x3 Matrix -> Intrinsic matrix acquired from calibration (fx, fy, cx, cy)?
R: 3x3 Matrix -> Rectification transformation?
P: 3x4 Matrix -> New projection matrix?
My questions are the following (see below for an example settings.yaml file):
A.) Is my assumption correct that D contains the distortion coefficients and K is the intrinsic matrix acquired from the chessboard calibration procedure?
B.) Is defining fx, fy, cx, cy in settings.yaml sufficient for pre-rectifying the images and successful operation of ORB SLAM 2?
C.) Do I need the R and P matrices for successful operation of ORB SLAM 2?
D.) How can I acquire the R and P matrices? The OpenCV camera calibration procedure with the chessboard does not provide these matrices, correct?
Here's an example of the above mentioned settings.yaml file of ORB SLAM 2:
%YAML:1.0
#--------------------------------------------------------------------------------------------
# Camera Parameters. Adjust them!
#--------------------------------------------------------------------------------------------
# Camera calibration and distortion parameters (OpenCV)
Camera.fx: 646.53807309613160
Camera.fy: 647.36136487241527
Camera.cx: 320.94123353073792
Camera.cy: 219.07092188981900
Camera.k1: -0.43338537102343577
Camera.k2: 0.46801812273859494
Camera.p1: 0.0039978632628183738
Camera.p2: 0.00023265675941025371
Camera.width: 640
Camera.height: 480
# Camera frames per second
Camera.fps: 20.0
# stereo baseline times fx
Camera.bf: 38.76
# Color order of the images (0: BGR, 1: RGB. It is ignored if images are grayscale)
Camera.RGB: 1
# Close/Far threshold. Baseline times.
ThDepth: 50
#--------------------------------------------------------------------------------------------
# Stereo Rectification. Only if you need to pre-rectify the images.
# Camera.fx, .fy, etc must be the same as in LEFT.P
#--------------------------------------------------------------------------------------------
LEFT.width: 640
LEFT.height: 480
LEFT.D: !!opencv-matrix
   rows: 1
   cols: 5
   dt: d
   data: [-0.28340811, 0.07395907, 0.00019359, 1.76187114e-05, 0.0]
LEFT.K: !!opencv-matrix
   rows: 3
   cols: 3
   dt: d
   data: [458.654, 0.0, 367.215, 0.0, 457.296, 248.375, 0.0, 0.0, 1.0]
LEFT.R: !!opencv-matrix
   rows: 3
   cols: 3
   dt: d
   data: [0.999966347530033, -0.001422739138722922, 0.008079580483432283, 0.001365741834644127, 0.9999741760894847, 0.007055629199258132, -0.008089410156878961, -0.007044357138835809, 0.9999424675829176]
LEFT.P: !!opencv-matrix
   rows: 3
   cols: 4
   dt: d
   data: [435.2046959714599, 0, 367.4517211914062, 0, 0, 435.2046959714599, 252.2008514404297, 0, 0, 0, 1, 0]
RIGHT.width: 640
RIGHT.height: 480
RIGHT.D: !!opencv-matrix
   rows: 1
   cols: 5
   dt: d
   data: [-0.28368365, 0.07451284, -0.00010473, -3.555907e-05, 0.0]
RIGHT.K: !!opencv-matrix
   rows: 3
   cols: 3
   dt: d
   data: [457.587, 0.0, 379.999, 0.0, 456.134, 255.238, 0.0, 0.0, 1]
RIGHT.R: !!opencv-matrix
   rows: 3
   cols: 3
   dt: d
   data: [0.9999633526194376, -0.003625811871560086, 0.007755443660172947, 0.003680398547259526, 0.9999684752771629, -0.007035845251224894, -0.007729688520722713, 0.007064130529506649, 0.999945173484644]
RIGHT.P: !!opencv-matrix
   rows: 3
   cols: 4
   dt: d
   data: [435.2046959714599, 0, 367.4517211914062, -47.90639384423901, 0, 435.2046959714599, 252.2008514404297, 0, 0, 0, 1, 0]
#--------------------------------------------------------------------------------------------
# ORB Parameters
#--------------------------------------------------------------------------------------------
# ORB Extractor: Number of features per image
ORBextractor.nFeatures: 800
# ORB Extractor: Scale factor between levels in the scale pyramid
ORBextractor.scaleFactor: 1.2
# ORB Extractor: Number of levels in the scale pyramid
ORBextractor.nLevels: 8
# ORB Extractor: Fast threshold
# Image is divided in a grid. At each cell, FAST corners are extracted, imposing a minimum response.
# Firstly we impose iniThFAST. If no corners are detected, we impose a lower value minThFAST.
# You can lower these values if your images have low contrast
ORBextractor.iniThFAST: 12
ORBextractor.minThFAST: 3
#--------------------------------------------------------------------------------------------
# Viewer Parameters
#--------------------------------------------------------------------------------------------
Viewer.KeyFrameSize: 0.05
Viewer.KeyFrameLineWidth: 1
Viewer.GraphLineWidth: 0.9
Viewer.PointSize: 2
Viewer.CameraSize: 0.08
Viewer.CameraLineWidth: 3
Viewer.ViewpointX: 0
Viewer.ViewpointY: -0.7
Viewer.ViewpointZ: -1.8
Viewer.ViewpointF: 500
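As a side note on one value in the file above: Camera.bf is the stereo baseline b (in meters) multiplied by fx (see the "stereo baseline times fx" comment). With Camera.fx of about 646.5 as in this file, Camera.bf = 38.76 corresponds to a baseline of roughly 38.76 / 646.5 ≈ 0.06 m, so this value has to be recomputed for your own camera's baseline.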
There are several calibration toolboxes that can be used to calibrate monocular, stereo or multi-camera setups.
The first one is ros_camera_calibration. When running ORB SLAM, I prefer to use this package to obtain the intrinsic parameters of a single moving camera. The intrinsic parameters, distortion coefficients and projection matrices are acquired after moving the calibration board in front of the camera.
The second one, which I recently used, is Kalibr. It is not only designed to calibrate multi-camera rigs but can also jointly calibrate the cameras and an inertial measurement unit (IMU).
Besides, you can also use MATLAB to get the camera's intrinsic parameters.
As for your questions, here are my (imperfect) answers.
Q.A: K (fx, fy, cx, cy) holds the intrinsic parameters of the camera, and the distortion coefficients are k1, k2, p1, p2, respectively.
Q.B: As far as I'm concerned, obtaining the intrinsic parameters fx, fy, cx, cy is sufficient to run ORB SLAM 2 with your own cameras.
Q.C & Q.D: If you choose to use the ROS package, you will receive the rectification transformation and the projection matrix at the end of the calibration.
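If you prefer to stay within OpenCV rather than ROS, a rough sketch of how R and P can be obtained after a chessboard calibration: cv2.stereoCalibrate estimates the rotation and translation between the two cameras, and cv2.stereoRectify then returns R1/R2 and P1/P2. The names objpoints, imgpoints_l, imgpoints_r, K_l, D_l, K_r, D_r and the image path are placeholders for your own calibration data, not values from the question.

import cv2
import numpy as np

# objpoints / imgpoints_l / imgpoints_r: chessboard corner lists collected during
# calibration; K_l, D_l, K_r, D_r: per-camera intrinsics and distortion
# coefficients from cv2.calibrateCamera (all placeholders here).
image_size = (640, 480)

# Estimate the rotation R and translation T between the two cameras,
# keeping the already-calibrated intrinsics fixed.
ret, K_l, D_l, K_r, D_r, R, T, E, F = cv2.stereoCalibrate(
    objpoints, imgpoints_l, imgpoints_r,
    K_l, D_l, K_r, D_r, image_size,
    flags=cv2.CALIB_FIX_INTRINSIC)

# Rectification transforms (R1, R2) and projection matrices (P1, P2), which
# correspond to the LEFT/RIGHT.R and LEFT/RIGHT.P entries in the yaml file.
R1, R2, P1, P2, Q, roi_l, roi_r = cv2.stereoRectify(
    K_l, D_l, K_r, D_r, image_size, R, T)

# Pre-rectifying an image with these matrices (essentially what the ORB SLAM 2
# stereo examples do internally with the LEFT.* / RIGHT.* entries):
img_l = cv2.imread('left.png', cv2.IMREAD_GRAYSCALE)  # placeholder path
map1_l, map2_l = cv2.initUndistortRectifyMap(
    K_l, D_l, R1, P1[:, :3], image_size, cv2.CV_32FC1)
rectified_l = cv2.remap(img_l, map1_l, map2_l, cv2.INTER_LINEAR)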
I am building a project that is a basic neural network that takes in a 2x2 image, with the goal of classifying the image as either a forward slash (1-class) or back slash (0-class) shape. The input is a flat numpy array in which 1 represents a black pixel and 0 represents a white pixel.
0-class: [1, 0, 0, 1]
1-class: [0, 1, 1, 0]
If I start my filter as a random 4x1 matrix, how can I use gradient descent to arrive at either perfect matrix, [1, -1, -1, 1] or [-1, 1, 1, -1], to classify the data points?
Side note: even when multiplied with the "perfect" answer matrix and then summed, the outputs would be -2 and 2. Would my data labels need to be -2 and 2? What if I want my classes labeled as 0 and 1?
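One simple way to keep the labels as 0 and 1 is to squash the dot product through a sigmoid and run plain gradient descent on the resulting probability. A minimal sketch (the learning rate, iteration count and loss choice here are arbitrary assumptions):

import numpy as np

# Training data: flattened 2x2 images and binary labels (0 = back slash, 1 = forward slash)
X = np.array([[1, 0, 0, 1],    # back slash
              [0, 1, 1, 0]],   # forward slash
             dtype=float)
y = np.array([0.0, 1.0])

rng = np.random.default_rng(0)
w = rng.normal(size=4)         # random 4x1 "filter"
b = 0.0
lr = 0.5                       # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(2000):
    z = X @ w + b              # raw scores (about -2 / +2 for the "perfect" filter)
    p = sigmoid(z)             # squashed into (0, 1) so 0/1 labels work directly
    grad_z = p - y             # gradient of binary cross-entropy w.r.t. z
    w -= lr * X.T @ grad_z / len(y)
    b -= lr * grad_z.mean()

print(np.round(w, 2), round(b, 2), np.round(sigmoid(X @ w + b), 3))

With this setup the weights converge toward the [-1, 1, 1, -1] direction (up to scale), and the sigmoid maps the roughly ±2 raw scores to probabilities near 0 and 1, so the labels themselves never need to be -2 and 2.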
I am trying to create a point cloud from images in the KITTI stereo dataset so that I can later estimate the 3D position of some objects.
The original images look like this.
What I have so far:
Generated a disparity map with cv2.StereoSGBM_create (the actual compute and reprojection calls are shown in the sketch after this list):
window_size = 9
minDisparity = 1
stereo = cv2.StereoSGBM_create(
blockSize=10,
numDisparities=64,
preFilterCap=10,
minDisparity=minDisparity,
P1=4 * 3 * window_size ** 2,
P2=32 * 3 * window_size ** 2
)
Calculated the Q matrix with cv2.stereoRectify, using data from the KITTI calibration files:
# K_xx: 3x3 calibration matrix of camera xx before rectification
K_L = np.matrix(
[[9.597910e+02, 0.000000e+00, 6.960217e+02],
[0.000000e+00, 9.569251e+02, 2.241806e+02],
[0.000000e+00, 0.000000e+00, 1.000000e+00]])
K_R = np.matrix(
[[9.037596e+02, 0.000000e+00, 6.957519e+02],
[0.000000e+00, 9.019653e+02, 2.242509e+02],
[0.000000e+00, 0.000000e+00, 1.000000e+00]])
# D_xx: 1x5 distortion vector of camera xx before rectification
D_L = np.matrix([-3.691481e-01, 1.968681e-01, 1.353473e-03, 5.677587e-04, -6.770705e-02])
D_R = np.matrix([-3.639558e-01, 1.788651e-01, 6.029694e-04, -3.922424e-04, -5.382460e-02])
# R_xx: 3x3 rotation matrix of camera xx (extrinsic)
R_L = np.transpose(np.matrix([[9.999758e-01, -5.267463e-03, -4.552439e-03],
[5.251945e-03, 9.999804e-01, -3.413835e-03],
[4.570332e-03, 3.389843e-03, 9.999838e-01]]))
R_R = np.matrix([[9.995599e-01, 1.699522e-02, -2.431313e-02],
[-1.704422e-02, 9.998531e-01, -1.809756e-03],
[2.427880e-02, 2.223358e-03, 9.997028e-01]])
# T_xx: 3x1 translation vector of camera xx (extrinsic)
T_L = np.transpose(np.matrix([5.956621e-02, 2.900141e-04, 2.577209e-03]))
T_R = np.transpose(np.matrix([-4.731050e-01, 5.551470e-03, -5.250882e-03]))
IMG_SIZE = (1392, 512)
rotation = R_L * R_R
translation = T_L - T_R
# output matrices from stereoRectify init
R1 = np.zeros(shape=(3, 3))
R2 = np.zeros(shape=(3, 3))
P1 = np.zeros(shape=(3, 4))
P2 = np.zeros(shape=(3, 4))
Q = np.zeros(shape=(4, 4))
R1, R2, P1, P2, Q, validPixROI1, validPixROI2 = cv2.stereoRectify(K_L, D_L, K_R, D_R, IMG_SIZE, rotation, translation,
R1, R2, P1, P2, Q,
newImageSize=(1242, 375))
The resulting Q matrix looks like this (at this point I doubt that it is correct):
[[ 1. 0. 0. -614.37893072]
[ 0. 1. 0. -162.12583194]
[ 0. 0. 0. 680.05186262]
[ 0. 0. -1.87703644 0. ]]
Generated a point cloud with reprojectImageTo3D, which looks like this: point cloud
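For completeness, a sketch of how the pieces above fit together; the image paths are placeholders, stereo, minDisparity and Q come from the snippets above, and StereoSGBM returns 16.4 fixed-point disparities, hence the division by 16:

import cv2
import numpy as np

# Load the rectified left/right KITTI frames as grayscale (paths are placeholders)
img_l = cv2.imread('left.png', cv2.IMREAD_GRAYSCALE)
img_r = cv2.imread('right.png', cv2.IMREAD_GRAYSCALE)

# StereoSGBM returns disparities scaled by 16, so convert to float pixel units
disparity = stereo.compute(img_l, img_r).astype(np.float32) / 16.0

# Reproject into 3D using the Q matrix from stereoRectify and mask out
# pixels with no valid disparity before building the point cloud
points_3d = cv2.reprojectImageTo3D(disparity, Q)
mask = disparity > minDisparity
point_cloud = points_3d[mask]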
And now the questions part begins :)
Is it OK that all values returned by reprojectImageTo3D are negative?
What are the units of those values, taking into account that it is the KITTI dataset and their camera calibration data is available?
And finally, is it possible to convert those values to something like longitude/latitude if I have the GPS coordinates of the camera that took those photos?
I would appreciate any help!
Is it OK for all values returned by reprojectImageTo3D to be negative?
Generally speaking, no, at least not for the Z values. The values returned by reprojectImageTo3D are real-world coordinates relative to the camera origin, so a negative Z value means the point lies behind the camera, which is geometrically impossible for a point the camera actually observed. The X and Y values can be negative, since the camera origin is at the center of the FOV: a negative X value means the point is to the left of the optical axis, and (with OpenCV's y-down camera convention) a negative Y value means it is above it. But the Z values should not be negative.
Your Q matrix is coming out almost as the identity because, I think, you are setting up the rotation incorrectly in your call to stereoRectify. The R argument expects the single rotation from camera 1 to camera 2, not a product of the two cameras' individual rotations. What you are doing is multiplying the two rotation matrices together after transposing one of them; instead you should be passing only R_L (since, from your description, I assume it is the rotation from the left to the right camera).
What are the units of those values, taking into account that it is the KITTI dataset and their camera calibration data is available?
I am not familiar with the KITTI dataset, but the values returned by reprojectImageTo3D are in real-world units, i.e. the same units as the translation vectors used in the calibration (typically meters).
And finally, is it possible to convert those values to something like longitude\latitude if I have GPS coordinate of the camera that took those photos?
The coordinates returned by reprojectImageTo3D are real-world coordinates relative to the camera origin. If you have the GPS coordinates of the camera that took the photos (and its heading, so that the camera-frame offsets can be rotated into an East-North-Up frame), you can offset that latitude/longitude by the (X, Y, Z) coordinates returned from the reprojection.
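A rough sketch of that last step, assuming the camera-frame point has already been rotated into local East/North offsets in meters and using a spherical-Earth approximation (fine for offsets of a few hundred meters; the example coordinates below are arbitrary):

import math

EARTH_RADIUS = 6378137.0  # meters (WGS-84 equatorial radius)

def offset_to_latlon(cam_lat, cam_lon, east, north):
    """Shift a GPS position by a local East/North offset in meters
    (small-offset, spherical-Earth approximation)."""
    dlat = math.degrees(north / EARTH_RADIUS)
    dlon = math.degrees(east / (EARTH_RADIUS * math.cos(math.radians(cam_lat))))
    return cam_lat + dlat, cam_lon + dlon

# Example: a point 10 m east and 5 m north of the camera
print(offset_to_latlon(49.0, 8.4, 10.0, 5.0))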