I am working on an algorithm to estimate the height of detected people in a video, and I'm stuck.
The part that I have working is the detection of people using the HoG algorithm, so I have a bounding box for every person in the frame. And I have calibrated the camera, so I have my intrinsic and extrinsic camera parameters.
The problem is that now I have a formula for the perspective projection with 2 unknowns: height of the object and the distance from the object to the camera. I am using one mono web camera to detect people so I have no information about the distance from the object to the camera. And the height is what I'm trying to estimate, so I don't have that as well.
I know this problem is solvable if I use a kinect or a stereo camera in order to get the distance, but I'm limited to only one mono web camera.
Does anyone have an idea on how to approach this problem? I have read about using reference objects but I can't figure out how to use them to help my problem.
Related
I have a multi-camera system where the field of views are mostly non-overlapping. I have been researching on methods to calibrate the camera extrinsics and the first thing I'm going to try is to take a picture of a chessboard at a known location and use solvePnP from OpenCV to find the extrinsic rotation and translation vectors for each camera separately (following the method described in the answer here).
My problem is, this method uses only one measurement and as every measurement it is prone to errors. I assume that by taking multiple measurements, either by changing the position or the orientation of the chessboard, the accuracy can be improved. But what would be the best way to combine the rotation and translation obtained from the different measurements? A simple average?
In theory I would think that an option could be using solvePnP on all the points at the same time. Since I am calculating extrinsics the camera can't be moved so I would have to change to position and/or orientation of the board for each picture and measure the 3D points positions as accurately as possible each time.
I'm also wondering if using two chessboards in the same picture would be a possible solution, even if OpenCV doesn't seem to support multiple chessboard detection.
Is there a better way to measure extrinsics or anything that I'm missing?
I'm working with an stereo camera setup that has auto-focus (which I cannot turn off) and a really low baseline of less than 1cm.
Auto-focus process can actually change any intrinsic parameter of both cameras (as focal length and principal point, for example) and without a fix relation (left camera may add focus while right one decrease it). Luckily cameras always report the current state of intrinsics with great precision.
On every frame an object of interest is being detected and disparities between camera images are calculated. As baseline is quite low and resolution is not the greatest, performing stereo triangulation leads to quite poor results and for this matter several succeeding computer vision algorithms relay only on image keypoints and disparities.
Now, disparities being calculated between stereo frames cannot be directly related. If principal point changes disparities will be in very different magnitudes after the auto-focus process.
Is there any way to relate keypoint corners and/or disparities between frames after auto-focus process? For example, calculate where would the object lie in the image with the previous intrinsics?
Maybe using a bearing vector towards object and then seek for intersection with image plane defined by previous intrinsics?
Quite challenging your project, perhaps these patents could help you in some way:
Stereo yaw correction using autofocus feedback
Autofocus for stereo images
Depth information for auto focus using two pictures and two-dimensional Gaussian scale space theory
Assume I have two independent cameras looking at the same scene (there are features that are visible from both) and that I know the calibration parameters of both the cameras individually (I can also perform stereo calibration at a certain baseline but I don't know if that would be useful). One of the cameras is fixed and stable, the other is noisy in terms of its pose (translation and rotation). As the pose keeps changing over time, is it possible to accurately estimate the pose of the moving camera with respect to the stationary one using image data from both cameras (in opencv)?
I've been doing a little bit of reading, and this is what I've gathered so far:
Find features using SIFT and the point correspondences.
Find the fundamental matrix.
Find essential matrix and perform SVD to obtain the R and t values between the cameras.
Does this approach work on a frame-by-frame basis? And how does the setup help in getting the scale factor? Pointers and suggestions would be very helpful.
Thanks!
I am a beginner when it comes to computer vision so I apologize in advance. Basically, the idea I am trying to code is that given two cameras that can simulate a multiple baseline stereo system; I am trying to estimate the pose of one camera given the other.
Looking at the same scene, I would incorporate some noise in the pose of the second camera, and given the clean image from camera 1, and slightly distorted/skewed image from camera 2, I would like to estimate the pose of camera 2 from this data as well as the known baseline between the cameras. I have been reading up about homography matrices and related implementation in opencv, but I am just trying to get some suggestions about possible approaches. Most of the applications of the homography matrix that I have seen talk about stitching or overlaying images, but here I am looking for a six degrees of freedom attitude of the camera from that.
It'd be great if someone can shed some light on these questions too: Can an approach used for this be extended to more than two cameras? And is it also possible for both the cameras to have some 'noise' in their pose, and yet recover the 6dof attitude at every instant?
Let's clear up your question first. I guess You are looking for the pose of the camera relative to another camera location. This is described by Homography only for pure camera rotations. For General motion that includes translation this is described by rotation and translation matrices. If the fields of view of the cameras overlap the task can be solved with structure from motion which still estimates only 5 dof. This means that translation is estimated up to scale. If there is a chessboard with known dimensions in the cameras' field of view you can easily solve for 6dof by running a PnP algorithm. Of course, cameras should be calibrated first. Finally, in 2008 Marc Pollefeys came up with an idea how to estimate 6 dof from two moving cameras with non-overlapping fields of view without using any chess boards. To give you more detail please tell a bit for the intended appljcation you are looking for.
I have a problem in which i have a stationary video camera in a room and several videos from it, i need to transform the image coordinates into world coordinates.
What i know:
1. all the measurements of the room.
2. 16 image coordinates and their respected world coordinates.
The problem i encounter:
At first i thought i just need to create a geometric transformation (According to http://xenia.media.mit.edu/~cwren/interpolator/), but i have a problem since the edge of the room are distorted in the image, and i cant calibrate the camera because i can't get a hold of the room or the camera.
Is there anyway i can overcome those difficulties and measure the distance in the room with some accuracy?
Thanks
You can calibrate the distortion of the camera by extracting first the edges of your room and then finding the best set of distortion parameters (that will minimize edge distortion).
There are few works that implement this approach though:
you can find a skeleton of distortion estimation procedure in R. Szeliski's book, but without an implementation;
alternatively, you can find a method + implementation (+ an online demo where you can upload your images) on IPOL.
Regarding the perspective distortion, after removing the lens distortion just proceed with the link that you have found by applying this method to the image of the four corners of the room floor.
This will give you the mapping between an image pixel and a ground pixel (and thus the object world coordinate, assuning you only want the X-Y coordinates). If you need the height measurement, then you need to find an object with a known height in your images to calibrate it too.