How is it possible to determine an object's 3D position using one camera and OpenCV when the camera is kept at (say) 45 degrees with respect to the ground?
Two types of motion can be applied to a camera in the 3D world: translation and rotation. It is not possible to infer depth from a monocular camera if there is no translation; you should look into stereo vision for the details.
Put simply, you need to recover the essential matrix E = [t]_x R. If the translation t is zero, [t]_x vanishes and the essential matrix degenerates, which is exactly the situation of a static (or purely rotating) monocular camera: there is no way to recover depth with classical two-view geometry.
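If the camera does translate between two frames, the two-view machinery can recover the rotation and the direction of translation (but not its scale). A minimal sketch with OpenCV in Python, where the matched pixel coordinates pts1/pts2 and the intrinsic matrix K are assumed inputs you already have:

```python
import cv2
import numpy as np

def relative_pose(pts1, pts2, K):
    """Estimate R and the unit-norm translation direction t between two views,
    given matched pixel coordinates (Nx2 arrays) and the intrinsic matrix K."""
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t  # note: ||t|| == 1, the metric scale cannot be recovered here
```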
However, there are some methods that use the depth of a training dataset to infer the depth of a test image. Please check this slide. The authors published their code for Matlab; however, you can easily implement it yourself.
If you want a more accurate result, you can use deep learning models to estimate the depth of the pixels in an input image. There are some open-source models available, such as this one. However, note that the BTS model is trained on the KITTI dataset from an autonomous-vehicle perspective. To get better results, you need a dataset that is relevant to your application; then use a framework such as BTS to train a model for depth estimation. Such a model will give you a point cloud of a single image with (x, y, z) coordinates.
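Once such a model gives you a per-pixel depth map, converting it to (x, y, z) coordinates only needs the camera intrinsics. A rough sketch (the intrinsic values below are made up, and depth is assumed to be an HxW array in metres):

```python
import numpy as np

# Hypothetical intrinsics; replace with your own calibrated values.
fx, fy, cx, cy = 525.0, 525.0, 319.5, 239.5

def depth_to_point_cloud(depth):
    """Back-project an HxW depth map (metres) into an Nx3 point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop pixels with no valid depth
```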
I am very new to AR, ARCore and ARKit. The problem I am trying to solve is measuring the distance between visual features that are detected by AI algorithms. I want to use AR for the measuring.
I am approaching this problem like this:
the AR session captures the scene along with its feature points
a 2D image is extracted and the required visual features are identified with a neural network
map each 2D point to a 3D coordinate in the same scene to create an anchor
calculate the distance between the anchors
I am very confused about the following things:
Can we save an AR session and get an image from it that can be used as input for the neural network?
If yes to 1, how do I transfer the 2D coordinates predicted by the neural network back into the ARCore or ARKit session to create an anchor, given that the depth information is not there?
What is the best approach to solving this problem? If there are any other possible solutions to this problem, any help is appreciated.
How important is it to do camera calibration for ArUco? What if I don't calibrate the camera? What if I use calibration data from another camera? Do you need to recalibrate if the camera focus changes? What is a practical way of doing calibration for a consumer application?
Before answering your questions, let me introduce some generic concepts related to camera calibration. A camera is a sensor that captures the 3D world and projects it onto a 2D image; this 3D-to-2D transformation is performed by the camera. The following OpenCV doc is a good reference for understanding how this process works and the camera parameters involved. You can find detailed ArUco documentation in the following document.
In general, the camera model used by the main libraries is the pinhole model. In the simplified form of this model (ignoring radial distortion) the camera transformation is represented by the following equation (from the OpenCV docs):
The following image (from OpenCV doc) illustrates the whole projection process:
In summary:
P_im = K · [R|T] · P_world
Where:
P_im: 2D point projected onto the image (in homogeneous coordinates, up to scale)
P_world: 3D point in the world
K is the camera intrinsics matrix (this depends on the camera lens parameters; every time you change the camera focus, for example, the focal lengths fx and fy within this matrix change)
R and T are the extrinsics of the camera. They represent the rotation and translation of the camera respectively; in other words, they encode the camera's position/orientation in the 3D world.
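To make the equation concrete, here is a small sketch that projects one 3D world point into pixel coordinates with OpenCV; the intrinsics, pose and point below are invented values purely for illustration:

```python
import cv2
import numpy as np

# Hypothetical intrinsics K and extrinsics (rvec, tvec), for illustration only.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])
dist = np.zeros(5)                      # assume no lens distortion
rvec = np.zeros(3)                      # rotation as a Rodrigues vector
tvec = np.array([0.0, 0.0, 5.0])        # camera 5 units away from the point

P_world = np.array([[0.1, -0.2, 0.0]])  # one 3D point
P_im, _ = cv2.projectPoints(P_world, rvec, tvec, K, dist)
print(P_im)                             # pixel coordinates of K·[R|T]·P_world
```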
Now, let's go through your questions one by one:
How important is it to do camera calibration for ArUco?
Camera calibration is important in ArUco (or any other AR library) because you need to know how the camera maps the 3D world onto the 2D image so you can project your augmented objects onto the physical world.
What if I don't calibrate the camera?
Camera calibration is the process of obtaining the camera parameters: intrinsic and extrinsic. The intrinsic ones are in general fixed and depend on the camera's physical characteristics, unless you change some parameter such as the focus; in that case you have to recalculate them. Otherwise, if you are working with a camera that has a fixed focal length, you only have to calculate them once.
The extrinsic ones depend on the camera's location/orientation in the world. Each time you move the camera the R and T matrices change and you have to recalculate them. This is where libraries such as ArUco come in handy, because by using markers you can obtain these values automatically.
In a few words, if you don't calibrate the camera you won't be able to project objects onto the physical world at the exact location, which is essential for AR.
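As mentioned above, once the intrinsics are known, markers give you the extrinsics automatically. A rough sketch with OpenCV's aruco module (the exact function names vary between OpenCV releases, and in very recent versions you may need cv2.solvePnP instead of estimatePoseSingleMarkers; the intrinsics, image file and marker size below are placeholders):

```python
import cv2
import numpy as np

# K and dist are assumed to come from a previous calibration of *this* camera.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
dist = np.zeros(5)

img = cv2.imread("frame.png")  # hypothetical input image
dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
corners, ids, _ = cv2.aruco.detectMarkers(img, dictionary)

if ids is not None:
    # Marker side length in metres (hypothetical); the pose is metric only if
    # this value matches the printed marker.
    rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(corners, 0.05, K, dist)
    print(rvecs[0], tvecs[0])  # R (Rodrigues) and T of the first marker w.r.t. the camera
```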
What if I use calibration data from other camera?
It won't work; this is similar to using an uncalibrated camera.
Do you need to recalibrate if the camera focus changes?
Yes, you have to recalculate the intrinsic parameters, because the focal length changes in that case.
What is a practical way of doing calibration for a consumer application?
It depends on your application, but in general you have to provide some method for manual recalibration. There are also methods for automatic calibration using some 3D pattern.
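For the common chessboard case, the recipe is: print a board, capture 10-20 photos of it from different angles, detect the corners and run the calibration. A hedged sketch in Python/OpenCV (the board dimensions, square size and file names are assumptions):

```python
import glob
import cv2
import numpy as np

# Inner-corner grid of the printed chessboard (hypothetical 9x6 board, 25 mm squares).
pattern = (9, 6)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * 0.025

objpoints, imgpoints = [], []
for path in glob.glob("calib_*.png"):        # hypothetical calibration images
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        objpoints.append(objp)
        imgpoints.append(corners)

# Needs at least a handful of successful detections to give a usable result.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    objpoints, imgpoints, gray.shape[::-1], None, None)
print("RMS reprojection error:", rms)        # sanity check of the calibration
```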
I have two calibrated cameras looking at an overlapping scene. I am trying to estimate the pose of camera2 with respect to camera1 (because camera2 can be moving; but both camera1 and 2 will always have some features that are overlapping).
I am identifying features using SIFT and computing the fundamental matrix and, from it, the essential matrix. Once I solve for R and t (one of the four possible solutions), I obtain the translation only up to scale. Is it possible to somehow compute the translation in real-world units? There are no objects of known size in the scene, but I do have the calibration data for both cameras. I've gone through some material on structure from motion and stereo pose estimation, but the concept of scale and its relation to real-world translation is confusing me.
Thanks!
This is the classical scale problem with structure from motion.
The short answer is that you must have some other source of information in order to resolve scale.
This information can be about points in the scene (e.g. terrain map), or some sensor reading from the moving camera (IMU, GPS, etc.)
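Once you have that one extra measurement, applying it is straightforward: rescale the unit-norm translation and triangulate. A minimal sketch with OpenCV, where the matched points, intrinsics and measured baseline are all assumed inputs:

```python
import cv2
import numpy as np

def metric_points(K1, K2, R, t_unit, pts1, pts2, baseline_m):
    """Triangulate matched pixels (2xN arrays) into metric 3D points, given the
    real camera-to-camera distance baseline_m (the external scale information,
    e.g. measured by hand, GPS or an IMU)."""
    t = baseline_m * (t_unit / np.linalg.norm(t_unit))    # rescale translation
    P1 = K1 @ np.hstack([np.eye(3), np.zeros((3, 1))])    # camera 1 at the origin
    P2 = K2 @ np.hstack([R, t.reshape(3, 1)])             # camera 2 pose
    pts4d = cv2.triangulatePoints(P1, P2, pts1, pts2)     # homogeneous 4xN
    return (pts4d[:3] / pts4d[3]).T                       # Nx3 metric (x, y, z)
```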
I'm creating a static gesture recognition system using an OpenCV Haar cascade classifier, and I would ultimately like to turn it into a stereoscopic recognition system. Here is my question: can I take the 2D recognition system created by the Haar cascade classifier and run it on both cameras in order to create a disparity map, after using the stereoscopic calibration functions contained in OpenCV? Or would I have to take pictures with my already calibrated stereoscopic system to create the cascade classifier?
It's hard to find good information on this topic, and I would like to plan my project and make sure I'm doing the correct things before buying and creating everything.
Thanks.
First, you should clarify what it is you are trying to accomplish.
Do you need to detect an object and then localize it in 3D world coordinates? Or do you need the 3D information in order to detect the object in the first place?
In the former case, there are a couple of ways to go. One is to calibrate your stereo camera system, detect the object in both cameras, and then find its 3D location by triangulation. For example, you may want to triangulate the centroid of the object. The problem with this approach is that the 2D localization of the cascade object detector may not be precise enough to get a reliable 3D point.
The other way is to calibrate your cameras and then rectify the images to make them look as though the cameras were parallel and row-aligned. Now, instead of triangulating specific points, you can compute a disparity map for the whole image and get a corresponding 3D location (in theory) for every pixel. You can then detect your object of interest in camera 1 and use the disparity map to find the 3D location of any point on the object.
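As a rough sketch of that second pipeline in Python/OpenCV (the calibration results are assumed inputs, and the cascade detector should be run on the rectified image so the pixel indices line up with the disparity map):

```python
import cv2
import numpy as np

def stereo_depth_map(img1, img2, K1, dist1, K2, dist2, R, T):
    """Rectify a calibrated stereo pair and return the rectified left image plus
    an H x W x 3 map of (x, y, z) coordinates for every pixel.
    Run the cascade detector on the returned rectified image, then look up
    points_3d[v, u] for any detected pixel."""
    size = (img1.shape[1], img1.shape[0])
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, dist1, K2, dist2, size, R, T)
    m1x, m1y = cv2.initUndistortRectifyMap(K1, dist1, R1, P1, size, cv2.CV_32FC1)
    m2x, m2y = cv2.initUndistortRectifyMap(K2, dist2, R2, P2, size, cv2.CV_32FC1)
    rect1 = cv2.remap(img1, m1x, m1y, cv2.INTER_LINEAR)
    rect2 = cv2.remap(img2, m2x, m2y, cv2.INTER_LINEAR)

    matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
    disparity = matcher.compute(rect1, rect2).astype(np.float32) / 16.0

    points_3d = cv2.reprojectImageTo3D(disparity, Q)   # (x, y, z) per pixel
    return rect1, points_3d
```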
On the other hand, if you want to use 3D information to improve your detection, you will have to read up on some very recent research. For example, here is a paper on people detection using RGB-D sensors. The paper talks about the HOD (Histogram of Oriented Depths) descriptor, as opposed to the HOG descriptor. The reason this is relevant, is that if you calibrate your cameras and rectify your images you can get the same kind of depth map that you get from an RGB-D sensor like Kinect.
I am aware of the chessboard camera calibration technique, and have implemented it.
If I have 2 cameras viewing the same scene, and I calibrate both simultaneously with the chessboard technique, can I compute the rotation matrix and translation vector between them? How?
If you have the 3D camera coordinates of the corresponding points, you can compute the optimal rotation matrix and translation vector with a rigid body transformation.
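For reference, here is a minimal NumPy sketch of such a rigid body fit (the Kabsch/Procrustes approach), assuming A and B are Nx3 arrays of corresponding 3D points expressed in the two camera frames:

```python
import numpy as np

def rigid_transform_3d(A, B):
    """Find R, t such that R @ A_i + t ≈ B_i for corresponding Nx3 point sets."""
    cA, cB = A.mean(axis=0), B.mean(axis=0)
    H = (A - cA).T @ (B - cB)          # cross-covariance of the centred points
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:           # guard against a reflection solution
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = cB - R @ cA
    return R, t
```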
If you are already using OpenCV, why don't you use cv::stereoCalibrate?
It returns the rotation and translation matrices. The only thing you have to do is make sure that the calibration chessboard is seen by both cameras.
The exact way is shown in the .cpp samples provided with the OpenCV library (I have version 2.2 and the samples were installed by default in /usr/local/share/opencv/samples).
The code example is called stereo_calib.cpp. Although it's not clearly explained what they are doing there (for that you might want to look at "Learning OpenCV"), it's something you can build on.
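The Python equivalent is essentially one call; a hedged sketch, assuming you have already gathered chessboard corner lists from image pairs in which both cameras see the board, plus each camera's intrinsics from a single-camera calibration:

```python
import cv2

def calibrate_stereo_pair(objpoints, imgpoints1, imgpoints2,
                          K1, dist1, K2, dist2, image_size):
    """Recover the rotation R and translation T of camera 2 w.r.t. camera 1 from
    chessboard views seen by BOTH cameras (corner lists gathered exactly as in a
    single-camera calibration, but from synchronized image pairs)."""
    ret, K1, dist1, K2, dist2, R, T, E, F = cv2.stereoCalibrate(
        objpoints, imgpoints1, imgpoints2,
        K1, dist1, K2, dist2, image_size,
        flags=cv2.CALIB_FIX_INTRINSIC)   # keep the per-camera intrinsics fixed
    return R, T
```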
If I understood you correctly, you have two calibrated cameras observing a common scene, and you wish to recover their spatial arrangement. This is possible (provided you find enough image correspondences) but only up to an unknown factor on translation scale. That is, we can recover rotation (3 degrees of freedom, DOF) and only the direction of the translation (2 DOF). This is because we have no way to tell whether the projected scene is big and the cameras are far, or the scene is small and cameras are near. In the literature, the 5 DOF arrangement is termed relative pose or relative orientation (Google is your friend).
If your measurements are accurate and in general position, 6 point correspondences may be enough for recovering a unique solution. A relatively recent algorithm does exactly that.
Nistér, D., "An efficient solution to the five-point relative pose problem," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 6, pp. 756-770, June 2004.
doi: 10.1109/TPAMI.2004.17
Update:
Use a structure-from-motion/bundle-adjustment package like Bundler to solve simultaneously for the 3D locations of the scene points and the relative camera parameters.
Any such package requires several inputs:
the camera calibrations, which you already have.
2D pixel locations of points of interest in each camera (use an interest point detector such as Harris or DoG, the first stage of SIFT).
Correspondences between points of interest from each camera (use a descriptor such as SIFT, SURF or SSD to do the matching); see the sketch below.
Note that the solution is only determined up to a scale ambiguity. You'll thus need to supply a distance measurement, either between the cameras or between a pair of objects in the scene.
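For inputs 2 and 3, a minimal OpenCV sketch of detection plus matching (the file names are placeholders, and SIFT's location in OpenCV has moved between releases):

```python
import cv2

img1 = cv2.imread("cam1.png", cv2.IMREAD_GRAYSCALE)   # hypothetical image pair
img2 = cv2.imread("cam2.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des1, des2, k=2)

# Lowe's ratio test keeps only distinctive correspondences.
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
pts1 = [kp1[m.queryIdx].pt for m in good]   # 2D points of interest, camera 1
pts2 = [kp2[m.trainIdx].pt for m in good]   # corresponding points, camera 2
```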
Original answer (applies primarily to uncalibrated cameras as the comments kindly point out):
This camera calibration toolbox from Caltech can solve for and visualize both the intrinsics (lens parameters, etc.) and the extrinsics (how each camera is positioned when each photo is taken). The latter is what you're interested in.
The Hartley and Zisserman blue book is also a great reference. In particular, you may want to look at the chapter on epipolar lines and the fundamental matrix, which is free online at the link.