I'm creating a static gesture recognition system using an OpenCV Haar Cascade Classifier. I ultimately would like to turn this recognition system into a stereoscopic recognition system. Here is my question: can I take the 2D recognition system created by the Haar Cascade Classifier and apply it to both cameras in order to create a disparity map after using the stereoscopic calibration functions contained in OpenCV? Or would I have to take pictures with my already-calibrated stereoscopic system to create the cascade classifier?
It's hard to find good information on this topic, and I would like to plan my project and make sure I'm doing the correct things before buying and creating everything.
Thanks.
First, you should clarify what it is you are trying to accomplish.
Do you need to detect an object and then localize it in 3D world coordinates? Or do you need the 3D information in order to detect the object in the first place?
In the former case, there are a couple of ways to go. One is to calibrate your stereo camera system, detect the object in both cameras, and then find its 3D location by triangulation. For example, you may want to triangulate the centroid of the object. The problem with this approach is that the 2D localization of the cascade object detector may not be precise enough to get a reliable 3D point.
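For illustration, here is a minimal sketch of that triangulation route in Python/OpenCV, assuming you have already run cv2.stereoCalibrate and have the intrinsics K1, K2 and stereo extrinsics R, T (the function and variable names here are hypothetical placeholders):

```python
import cv2
import numpy as np

# Assumed inputs (hypothetical names): intrinsics K1, K2 and stereo extrinsics
# R, T from cv2.stereoCalibrate, plus the detected object centroid in each
# image as (x, y) pixel coordinates.
def triangulate_centroid(K1, K2, R, T, pt_left, pt_right):
    # Projection matrices: camera 1 at the origin, camera 2 at [R|T]
    P1 = K1 @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K2 @ np.hstack([R, T.reshape(3, 1)])

    # cv2.triangulatePoints expects 2xN arrays of image points
    pts1 = np.array(pt_left, dtype=np.float64).reshape(2, 1)
    pts2 = np.array(pt_right, dtype=np.float64).reshape(2, 1)

    X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)  # 4x1 homogeneous point
    X = (X_h[:3] / X_h[3]).ravel()                   # 3D point in camera-1 frame
    return X
```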
The other way is to calibrate your cameras, and then to rectify the images to make them look as though the cameras are parallel and row-aligned. Now instead of triangulating specific points, you can compute disparity map for the whole image and get a corresponding 3D location (in theory) for any pixel. Now you can detect your object of interest in camera 1, and then use the disparity map to find a 3D location of any point on the object.
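A rough sketch of that rectification-plus-disparity route, again assuming stereo calibration results (K1, D1, K2, D2, R, T) are already available; the SGBM parameters below are only a starting point, not tuned values:

```python
import cv2
import numpy as np

def depth_from_stereo(img_left, img_right, K1, D1, K2, D2, R, T, image_size):
    # Rectify so the cameras behave as if parallel and row-aligned
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, D1, K2, D2, image_size, R, T)
    map1x, map1y = cv2.initUndistortRectifyMap(K1, D1, R1, P1, image_size, cv2.CV_32FC1)
    map2x, map2y = cv2.initUndistortRectifyMap(K2, D2, R2, P2, image_size, cv2.CV_32FC1)
    rect_l = cv2.remap(img_left, map1x, map1y, cv2.INTER_LINEAR)
    rect_r = cv2.remap(img_right, map2x, map2y, cv2.INTER_LINEAR)

    # Dense disparity via semi-global block matching (returned values are fixed-point x16)
    sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
    disparity = sgbm.compute(rect_l, rect_r).astype(np.float32) / 16.0

    # Back-project every pixel to 3D using the reprojection matrix Q
    points_3d = cv2.reprojectImageTo3D(disparity, Q)
    return rect_l, disparity, points_3d
```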
On the other hand, if you want to use 3D information to improve your detection, you will have to read up on some very recent research. For example, here is a paper on people detection using RGB-D sensors. The paper talks about the HOD (Histogram of Oriented Depths) descriptor, as opposed to the HOG descriptor. The reason this is relevant is that if you calibrate your cameras and rectify your images, you can get the same kind of depth map that you get from an RGB-D sensor like Kinect.
How important is it to do camera calibration for ArUco? What if I don't calibrate the camera? What if I use calibration data from another camera? Do you need to recalibrate if the camera focus changes? What is the practical way of doing calibration for a consumer application?
Before answering your questions, let me introduce some generic concepts related to camera calibration. A camera is a sensor that captures the 3D world and projects it onto a 2D image; this 3D-to-2D transformation is performed by the camera. The OpenCV documentation is a good reference for understanding how this process works and the camera parameters involved. You can find detailed ArUco documentation in the following document.
In general, the camera model used by the main libraries is the pinhole model. In the simplified form of this model (without considering radial distortions) the camera transformation is represented using the following equation (from OpenCV docs):
The following image (from OpenCV doc) illustrates the whole projection process:
In summary:
P_im = K · [R | T] · P_world
Where:
P_im: 2D point projected onto the image
P_world: 3D point in the world
K is the camera intrinsics matrix (this depends on the camera lens parameters; every time you change the camera focus, for example, the focal-length values fx and fy within this matrix change)
R and T are the extrinsics of the camera. They represent the rotation and translation matrices of the camera, respectively. These are basically the matrices that represent the camera position/orientation in the 3D world.
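As a small numeric sanity check of the equation above (made-up intrinsics, camera at the world origin looking along +Z):

```python
import numpy as np

# Worked example of P_im = K · [R|T] · P_world with made-up numbers.
K = np.array([[800.0,   0.0, 320.0],   # fx, 0, cx
              [  0.0, 800.0, 240.0],   # 0, fy, cy
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                          # no rotation
T = np.zeros((3, 1))                   # camera at the world origin

P_world = np.array([[0.1], [0.2], [2.0], [1.0]])   # homogeneous 3D point (x, y, z, 1)

RT = np.hstack([R, T])                 # 3x4 extrinsic matrix [R|T]
p = K @ RT @ P_world                   # homogeneous 2D point
P_im = (p[:2] / p[2]).ravel()          # divide by the last coordinate to get pixels
print(P_im)                            # -> [360. 320.]
```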
Now, let's go through your questions one by one:
How important is it to do camera calibration for ArUco?
Camera calibration is important in ArUco (or any other AR library) because you need to know how the camera maps the 3D world to the 2D image so you can project your augmented objects onto the physical world.
What if I don't calibrate the camera?
Camera calibration is the process of obtaining the camera parameters: intrinsic and extrinsic. The first are in general fixed and depend on the camera's physical characteristics, unless you change some parameter such as the focus; in that case you have to re-calculate them. Otherwise, if you are working with a camera that has a fixed focal distance, you only have to calculate them once.
The second depend on the camera location/orientation in the world. Each time you move the camera, the R and T matrices change and you have to recalculate them. This is where libraries such as ArUco come in handy, because using markers you can obtain these values automatically.
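For example, here is a minimal sketch of how ArUco hands you R and T once the intrinsics are known (the .npy and image file names are hypothetical, and the exact cv2.aruco function names vary slightly between OpenCV versions):

```python
import cv2
import numpy as np

# Intrinsics from a previous calibration (hypothetical file names)
camera_matrix = np.load("camera_matrix.npy")
dist_coeffs = np.load("dist_coeffs.npy")

dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_6X6_250)

frame = cv2.imread("frame.png")                       # hypothetical input image
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
corners, ids, _ = cv2.aruco.detectMarkers(gray, dictionary)

if ids is not None:
    # 0.05 is the physical marker side length (meters); adjust to your marker
    rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(
        corners, 0.05, camera_matrix, dist_coeffs)
    R, _ = cv2.Rodrigues(rvecs[0])                    # rotation of marker w.r.t. camera
    T = tvecs[0].reshape(3, 1)                        # translation of marker w.r.t. camera
```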
In a few words, if you don't calibrate the camera, you won't be able to project objects onto the physical world at the exact location (which is essential for AR).
What if I use calibration data from another camera?
It won't work; this is similar to using an uncalibrated camera.
Do you need to recalibrate if the camera focus changes?
Yes, you have to recalculate the intrinsic parameters because the focal distance changes in this case.
What is the practical way of doing calibration for a consumer application?
It depends on your application, but in general you have to provide some method for manual re-calibration. There are also methods for automatic calibration using some 3D pattern.
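For reference, a minimal sketch of the standard chessboard-based intrinsic calibration in OpenCV (the file-name pattern, board size, and square units are hypothetical):

```python
import cv2
import numpy as np
import glob

# Chessboard with 9x6 inner corners; object points in arbitrary square units
pattern = (9, 6)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in glob.glob("calib_*.png"):            # hypothetical file names
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Returns the RMS reprojection error, intrinsic matrix, and distortion coefficients
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("RMS reprojection error:", rms)
```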
How is it possible to determine an object's 3D position using one camera and OpenCV when the camera is kept at (say) 45 degrees with respect to the ground?
Two types of motion can be applied to a camera in the 3D world: translation and rotation. It is not possible to infer depth from a single camera if there is no translation. You should check stereo vision for the details.
Simply put, you need to recover the essential matrix E = [t]_x R, where [t]_x is the skew-symmetric matrix of the translation t. If t = 0, which is the case for a single stationary (monocular) camera, there is no way to recover depth by classical stereo vision.
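If you do have two views separated by a translation (for example, one moving camera), here is a minimal sketch of recovering E and the relative pose from matched points, assuming calibrated intrinsics K; pts1 and pts2 are hypothetical matched-feature arrays:

```python
import cv2
import numpy as np

# pts1, pts2: Nx2 arrays of matched pixel coordinates (e.g. from feature matching)
# K: 3x3 camera intrinsic matrix from calibration
def relative_pose(pts1, pts2, K):
    # Robustly estimate the essential matrix E = [t]_x R
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    # Decompose E into R and a unit-length t (translation scale is unrecoverable)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t
```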
However, there are some methods that use the depth of a training dataset to infer the depth of a test image. Please check this slide. The authors published their code for Matlab; however, you can easily implement it yourself.
If you want a more accurate result, you can use deep-learning models to estimate the depth of the pixels in an input image. There are some open-source models available, such as this one. However, note that the BTS model is trained on the KITTI dataset from an autonomous-vehicle perspective. To get better results, you need a dataset that is relevant to your application; then use a framework such as BTS to train a model for depth estimation. Such a model will provide you with a point cloud of a single image with (x, y, z) coordinates.
I am looking for algorithms/publications on face detection. There are plenty on the web, but my scenario is somewhat specialized: I want to detect faces accurately in images taken by wearable devices (e.g. Narrative Clip), so there will be motion blur and the image quality will not be that good. I want to accurately detect faces that are within 15 feet of the camera. The next goal is to estimate the head pose, primarily to find out whether the person is looking toward the camera (or, better, at the camera owner).
Any suggestion?
My go-to for this would be either a deep-learning framework using convolutional layers for pixel classification, or a K-means / K-Nearest-Neighbour approach.
This does depend on your data, however. From your post I am assuming that your data isn't labelled, meaning you are unable to feed the ground truth to the algorithm for classification.
You could perhaps use a CNN (convolutional neural network) for pixel classification (image segmentation), which should identify the location of a person. Given this, you could perhaps run a 'local' CNN on a region close to the identified face to classify the pose of the body in that region.
This would probably be my first take on the problem but would depend on the exact structure of your data, and the structure of your labels (if you have any).
I have to say it does sound like a fun project!
I found OpenCV's Haar cascades for face detection pretty accurate and robust to motion blur and "live" face recognition.
I'm saying that because I used them for implementing an Eye-Tracker in C++ with a laptop webcam (whose resolution was not excellent and motion blur was naturally always present).
They work at multiple resolutions and are therefore able to detect faces of any size, but you can easily tune them for your distance of interest.
They might not be your final optimal solution, but since they are already implemented and come with the OpenCV package, they could constitute a good starting point.
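For completeness, a minimal sketch of the OpenCV Haar-cascade detector; the minSize/maxSize arguments of detectMultiScale are the knobs to tune for your distance of interest (the input file name is hypothetical, and the cascade XML ships with OpenCV):

```python
import cv2

# Standard frontal-face cascade bundled with opencv-python
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("frame.jpg")                    # hypothetical input image
gray = cv2.equalizeHist(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))

# A face at ~15 feet covers only a small patch: raise/lower minSize to match
# the expected face size in pixels at your distance of interest.
faces = cascade.detectMultiScale(
    gray, scaleFactor=1.1, minNeighbors=5, minSize=(24, 24))

for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
```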
I am a beginner when it comes to computer vision, so I apologize in advance. Basically, the idea I am trying to code is this: given two cameras that can simulate a multiple-baseline stereo system, I am trying to estimate the pose of one camera given the other.
Looking at the same scene, I would add some noise to the pose of the second camera, and given the clean image from camera 1 and the slightly distorted/skewed image from camera 2, I would like to estimate the pose of camera 2 from this data as well as the known baseline between the cameras. I have been reading up on homography matrices and the related implementation in OpenCV, but I am just trying to get some suggestions about possible approaches. Most of the applications of the homography matrix that I have seen are about stitching or overlaying images, whereas here I am looking to recover the six-degrees-of-freedom attitude of the camera.
It'd be great if someone can shed some light on these questions too: Can an approach used for this be extended to more than two cameras? And is it also possible for both the cameras to have some 'noise' in their pose, and yet recover the 6dof attitude at every instant?
Let's clear up your question first. I guess you are looking for the pose of the camera relative to another camera location. This is described by a homography only for pure camera rotations; for general motion that includes translation, it is described by rotation and translation matrices.

If the fields of view of the cameras overlap, the task can be solved with structure from motion, which still estimates only 5 dof: the translation is recovered only up to scale. If there is a chessboard with known dimensions in the cameras' field of view, you can easily solve for the full 6 dof by running a PnP algorithm (see the sketch below). Of course, the cameras should be calibrated first.

Finally, in 2008 Marc Pollefeys came up with an idea for estimating 6 dof from two moving cameras with non-overlapping fields of view without using any chessboards. To give you more detail, please tell us a bit about the intended application.
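Here is a minimal sketch of that chessboard + PnP route, assuming the camera intrinsics (K, dist) are already calibrated; the pattern size and square size are hypothetical:

```python
import cv2
import numpy as np

def board_pose(gray, K, dist, pattern=(9, 6), square_size=0.025):
    # 3D coordinates of the inner corners in the board's own frame (meters)
    objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)
    objp *= square_size

    found, corners = cv2.findChessboardCorners(gray, pattern)
    if not found:
        return None

    # PnP gives the full 6-dof pose of the camera relative to the board
    ok, rvec, tvec = cv2.solvePnP(objp, corners, K, dist)
    R, _ = cv2.Rodrigues(rvec)        # 3 dof of rotation as a matrix
    return R, tvec                    # rotation + translation = 6 dof
```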
Is it necessary to calibrate the camera if I were to implement a natural marker tracker?
Actually, I don't quite get the idea of camera calibration, although I have read that it is required for augmenting 3D/2D objects onto the image feed.
Camera calibration means finding the intrinsic parameters of the camera. It is necessary, of course, if you want to detect natural features successfully, and here I explain why.
Then you only have to look for the extrinsic parameters. You only have to do the calibration once, as the camera is always the same (assuming you cannot zoom in/out, change the focal length, etc.). Without camera calibration you will have many problems with the natural-feature tracking task, as it is more challenging than fiducial tracking.
In the link I passed you, you will also find how to calculate the pose from a planar marker. It is theoretical, but you can find a lot of code on the web. If you need more help, tell me; I can explain in more detail if necessary.
Strictly speaking, you could detect features, do pattern matching to recognize the marker, and then track those features without camera calibration. Calibration lets you determine both intrinsic (e.g. distortion coefficients) and extrinsic (e.g. rotation) camera parameters, which are required when you want to determine marker boundaries or perform 3D pose estimation.
Is it necessary? No.
Is it useful? You bet. The rule of thumb: if you can perform camera calibration for your stationary camera, ALWAYS do it.
You can do many things with that information: remove distortion, get distances in some kind of metric space, ... Most trackers have underlying assumptions/models, and these models fit best when the data is in a space where the model makes sense. Camera calibration is one easy way to achieve this.
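As one concrete example of that payoff, a minimal sketch of undistorting an image once calibration results are available (the .npy and image file names are hypothetical):

```python
import cv2
import numpy as np

# Intrinsics and distortion coefficients from cv2.calibrateCamera (hypothetical files)
K = np.load("camera_matrix.npy")
dist = np.load("dist_coeffs.npy")

img = cv2.imread("frame.jpg")
h, w = img.shape[:2]

# Refine the camera matrix for the undistorted view, then remove lens distortion
new_K, roi = cv2.getOptimalNewCameraMatrix(K, dist, (w, h), alpha=0)
undistorted = cv2.undistort(img, K, dist, None, new_K)
```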