I need to reconstruct a depth map from an image sequence of a moving object, taken by a single static camera.
As far as I understand, with a stereo camera I can calculate the depth of a point found in both images using the intercept theorem. Is there any way to calculate depth information using only a single camera and matching points across multiple images instead?
Any comments and alternative solutions are welcome. Thanks in advance for your help!
There are some algorithms that help you get depth from a single image. A list of them is given here: http://make3d.cs.cornell.edu/results_stateoftheart.html
These techniques use MRFs (Markov random fields) and assume that the scene is made up of a collection of planes.
A moving object does not, by itself, provide absolute depth information (unless you know the depth of some other moving object); however, a single camera that moves (not just rotates in place, which produces no parallax) can help in extracting depth.
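To make the two-view idea concrete: if the camera is translated sideways by a known distance B between two shots, the pair behaves like a rectified stereo pair, and the intercept theorem (similar triangles) gives Z = f·B/d, where f is the focal length in pixels and d is the horizontal shift (disparity) of the matched point in pixels. A minimal sketch, assuming a pinhole camera, a translation parallel to the image x-axis, and known f and B; the function name is hypothetical.

```cpp
#include <stdexcept>

// Depth of a matched point seen in two images taken by the same pinhole camera,
// shifted sideways by a known baseline (the rectified-stereo case).
// Intercept theorem: Z / B = f / d  =>  Z = f * B / d.
//   focal_px   : focal length in pixels
//   baseline   : camera translation between the two shots (e.g. metres)
//   x_left_px  : x coordinate of the point in the image from the left position
//   x_right_px : x coordinate of the same point in the image from the right position
double depthFromDisparity(double focal_px, double baseline,
                          double x_left_px, double x_right_px) {
    const double disparity = x_left_px - x_right_px;  // in pixels
    if (disparity <= 0.0) {
        throw std::invalid_argument("disparity must be positive");
    }
    return focal_px * baseline / disparity;  // same unit as baseline
}
```

For example, f = 700 px, B = 0.1 m and a 35 px shift give Z = 2 m. Because Z is inversely proportional to d, depth accuracy drops quickly for distant points with small disparities.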
We have this camera array arranged in an arc around a person (red dot). Think The Matrix: each camera fires at the same time, and then we create an animated GIF from the output. The problem is that it is nearly impossible to align the cameras exactly, so I am looking for a way in OpenCV to align the images better and make the result smoother.
I'm looking for the general steps, and I'm unsure of the order in which to do them. If I start with image 1 and align image 2 to it, image 2 ends up further from image 3 than it was at the start, so aligning 3 to 2 requires an even larger change, and the error propagates. I have seen similar alignments done, though. Any help is much appreciated.
Here's a thought: how about performing a quick and very simple "calibration" of the imaging system using a single reference point?
The best thing about this is that you can try it out quickly, and even if the results are not good enough for you, they can give you more insight into the problem. The downside is that it may simply not be good enough, because it's hard to think of anything "less advanced" than this. Here's the description:
Remove the object from the scene
Place a small object (let's call it a "dot") at a position that roughly corresponds to the center of mass of the object you are about to record (the center of the area marked by the red circle).
Record a single image with each camera
Use some simple algorithm to find the position of the dot on every image
Compute the offset (x, y) of the dot position from the image center on every image
Shift the images by (-x, -y), where (x, y) is the offset computed above; after that, the dot should be located at the center of every image.
When recording the actual object, use these precomputed offsets to shift all images. After translation, the images will be roughly aligned. But since you are shooting an object that is three-dimensional and of considerable size, I am not sure whether the alignment will be very convincing... I wonder what results you'd get, actually.
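The calibration steps above can be sketched with the standard library alone. This assumes the calibration shot is a grayscale image stored row-major, and that the dot is the only region brighter than a chosen threshold; the "simple algorithm" used to find it here is the centroid of those bright pixels. All names and the threshold parameter are hypothetical. In OpenCV, the shift itself would typically be done with cv::warpAffine and a translation matrix; the plain loops just make the arithmetic explicit.

```cpp
#include <cmath>
#include <cstdint>
#include <optional>
#include <vector>

// Minimal grayscale image: row-major pixels, width * height entries.
struct Gray {
    int width = 0, height = 0;
    std::vector<std::uint8_t> px;
    std::uint8_t at(int x, int y) const { return px[y * width + x]; }
};

struct Offset { int dx = 0, dy = 0; };

// "Find the dot": centroid of all pixels brighter than `threshold`.
// Returns the dot's offset from the image centre, rounded to whole pixels,
// or nullopt if no pixel exceeds the threshold.
std::optional<Offset> dotOffset(const Gray& img, std::uint8_t threshold) {
    double sx = 0, sy = 0;
    long n = 0;
    for (int y = 0; y < img.height; ++y)
        for (int x = 0; x < img.width; ++x)
            if (img.at(x, y) > threshold) { sx += x; sy += y; ++n; }
    if (n == 0) return std::nullopt;
    const double cx = (img.width - 1) / 2.0, cy = (img.height - 1) / 2.0;
    return Offset{static_cast<int>(std::lround(sx / n - cx)),
                  static_cast<int>(std::lround(sy / n - cy))};
}

// "Shift by (-x, -y)": output pixel (x, y) takes input pixel (x + dx, y + dy),
// so the dot moves to the centre. Pixels coming from outside the frame are black.
Gray shiftBy(const Gray& img, Offset off) {
    Gray out{img.width, img.height, std::vector<std::uint8_t>(img.px.size(), 0)};
    for (int y = 0; y < img.height; ++y)
        for (int x = 0; x < img.width; ++x) {
            const int srcX = x + off.dx, srcY = y + off.dy;
            if (srcX >= 0 && srcX < img.width && srcY >= 0 && srcY < img.height)
                out.px[y * img.width + x] = img.at(srcX, srcY);
        }
    return out;
}
```

The offsets from dotOffset would be computed once per camera from the calibration shots and then passed to shiftBy for every frame of the real recording.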
If I understand the application correctly, you should be able to obtain the relative pose of each camera in your array using homographies:
https://docs.opencv.org/3.4.0/d9/dab/tutorial_homography.html
From here, the next step would be to correct for alignment errors by estimating the transform between each camera's actual position and its 'ideal' position in the array. These ideal positions could be computed relative to a single camera, or relative to the focus point of the array (which may help simplify the calculation). Applying this corrective transform to each image will result in an image that 'looks like' it was taken from the 'ideal' position.
Note that you may need to estimate the relative camera poses in 3-4 array 'sections', since it looks like you have a full 180° array (e.g. estimate homographies for 4-5 cameras at a time). As long as there is some overlap between sections, it should work out.
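A minimal sketch of one such corrective transform, assuming the only error worth correcting is a small rotation of each camera about its own optical centre (translation errors are ignored) and that the intrinsics f, cx, cy are known from calibration. In that special case the correction is the homography H = K·R·K⁻¹, which does not depend on scene depth. The helper names are hypothetical, not OpenCV API; in OpenCV the resulting 3x3 matrix would be applied to the whole image with cv::warpPerspective.

```cpp
#include <array>
#include <cmath>

using Mat3 = std::array<std::array<double, 3>, 3>;

Mat3 mul(const Mat3& a, const Mat3& b) {
    Mat3 c{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k) c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Pinhole intrinsics (square pixels, no skew) and their closed-form inverse.
Mat3 intrinsics(double f, double cx, double cy) {
    return {{{f, 0, cx}, {0, f, cy}, {0, 0, 1}}};
}
Mat3 intrinsicsInv(double f, double cx, double cy) {
    return {{{1 / f, 0, -cx / f}, {0, 1 / f, -cy / f}, {0, 0, 1}}};
}

// Rotation about the camera's vertical (y) axis by `rad` radians.
Mat3 rotY(double rad) {
    const double c = std::cos(rad), s = std::sin(rad);
    return {{{c, 0, s}, {0, 1, 0}, {-s, 0, c}}};
}

// Corrective homography for a camera that is only rotated away from its ideal
// orientation: H = K * R_corr * K^-1. Translation errors are NOT corrected.
Mat3 rotationHomography(double f, double cx, double cy, const Mat3& R_corr) {
    return mul(mul(intrinsics(f, cx, cy), R_corr), intrinsicsInv(f, cx, cy));
}

// Map pixel (u, v) through H using homogeneous coordinates.
std::array<double, 2> applyHomography(const Mat3& H, double u, double v) {
    const double x = H[0][0] * u + H[0][1] * v + H[0][2];
    const double y = H[1][0] * u + H[1][1] * v + H[1][2];
    const double w = H[2][0] * u + H[2][1] * v + H[2][2];
    return {x / w, y / w};
}
```

For example, with f = 800 and the principal point at (320, 240), a yaw correction of atan(0.1) moves the principal point 80 pixels to the right, to (400, 240).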
Most of my experience with this sort of thing comes from using MATLAB's stereo camera calibrator app and related functions. Their help page gives a good overview of how to get started estimating camera pose. OpenCV has similar functionality.
https://www.mathworks.com/help/vision/ug/stereo-camera-calibrator-app.html
The cited paper by Zhang gives a great description of the mathematics of pose estimation from correspondence, if you're interested.
I have the following task: build a 3D model of a room from multiple images (possibly a video stream; it doesn't matter). There will be a spherical camera (in fact, multiple cameras on a sphere-like rig), so my case is the one on the right in the image.
I decided to code it for iOS, since I'm an iOS developer, and to model the cameras with the iPhone camera, rotating it as shown in the picture above. As I see it, the first step is to get the real distance to the objects (walls, in most cases, I think). Is that possible? Which algorithms/methods should I use to achieve this? I'm obviously not asking you to do the task for me, but please point me in the right direction, because I have no idea where to start: maybe some equations, tutorials, or algorithms with explanations relevant to my case. Thank you!
The task of building a 3D model from multiple 2D images is called "scene reconstruction." It's still an active area of research, but solutions involve recognizing the same keypoint (e.g. a distinctive part of an object) in two images. Once you have that, you can use the known camera geometry to solve for the 3D position of that keypoint in the world.
Here's a reference:
http://docs.opencv.org/3.1.0/d4/d18/tutorial_sfm_scene_reconstruction.html#gsc.tab=0
You can google "scene reconstruction" to find much more, including papers that go into more detail.
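The "solve for the 3D position" step can be sketched with the simple midpoint method, assuming each camera's centre and the direction of the viewing ray through the matched keypoint are already known (from calibration and pose estimation): the 3D point is taken as the midpoint of the shortest segment between the two rays. This is one standard textbook method, not necessarily what the OpenCV sfm module uses internally, and the names are hypothetical.

```cpp
#include <array>
#include <optional>

using Vec3 = std::array<double, 3>;

Vec3 add(const Vec3& a, const Vec3& b) { return {a[0] + b[0], a[1] + b[1], a[2] + b[2]}; }
Vec3 sub(const Vec3& a, const Vec3& b) { return {a[0] - b[0], a[1] - b[1], a[2] - b[2]}; }
Vec3 scale(const Vec3& a, double s) { return {a[0] * s, a[1] * s, a[2] * s}; }
double dot(const Vec3& a, const Vec3& b) { return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]; }

// Midpoint triangulation of the rays c1 + s*d1 and c2 + t*d2.
// Finds the s, t minimising the distance between the two rays and returns the
// midpoint of the two closest points. Returns nullopt for (near-)parallel rays,
// for which depth cannot be recovered.
std::optional<Vec3> triangulateMidpoint(const Vec3& c1, const Vec3& d1,
                                        const Vec3& c2, const Vec3& d2) {
    const Vec3 w = sub(c1, c2);
    const double a = dot(d1, d1), b = dot(d1, d2), c = dot(d2, d2);
    const double d = dot(d1, w), e = dot(d2, w);
    const double denom = a * c - b * b;
    if (denom < 1e-12 * a * c) return std::nullopt;
    const double s = (b * e - c * d) / denom;
    const double t = (a * e - b * d) / denom;
    const Vec3 p1 = add(c1, scale(d1, s));
    const Vec3 p2 = add(c2, scale(d2, t));
    return scale(add(p1, p2), 0.5);
}
```

With noise-free rays the two closest points coincide with the true 3D point; with real, noisy matches the midpoint is a reasonable compromise between them.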
Hi, I am using an ASUS Xtion PRO LIVE camera for object detection, and I am also new to OpenCV. I'm trying to get the distance of an object from the camera. The detected object is in a 2D image. I'm not sure what I should use to get this information, or how to follow up with the calculations to get the distance between the camera and the detected object. Could someone advise me, please?
In short: You can't.
With a single camera image you lose the depth information: any visible pixel in your camera image essentially corresponds to a ray originating from your camera.
So once you've got an object at pixel X, all you know is that the object lies somewhere along the ray cast through this pixel, as determined by the camera's intrinsic and extrinsic parameters.
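In camera coordinates (ignoring lens distortion), pixel (u, v) back-projects to the ray Z·((u − cx)/fx, (v − cy)/fy, 1) for any depth Z > 0. The sketch below, with hypothetical helper names and assumed pinhole intrinsics, shows that points at very different depths along that ray project onto the same pixel, which is exactly why one image cannot tell them apart.

```cpp
#include <array>

struct Intrinsics { double fx, fy, cx, cy; };  // pinhole model, no distortion

// Direction (camera coordinates) of the ray through pixel (u, v):
// every 3D point Z * dir with Z > 0 projects onto that same pixel.
std::array<double, 3> pixelRay(const Intrinsics& K, double u, double v) {
    return {(u - K.cx) / K.fx, (v - K.cy) / K.fy, 1.0};
}

// Pinhole projection of a camera-frame point (X, Y, Z) to pixel coordinates.
std::array<double, 2> project(const Intrinsics& K, const std::array<double, 3>& P) {
    return {K.fx * P[0] / P[2] + K.cx, K.fy * P[1] / P[2] + K.cy};
}
```

Scaling the ray direction by 1 or by 10 and projecting again yields the same pixel; the depth has to come from extra information such as the options listed next.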
You'll essentially need more information. One of the following should suffice:
Know at least one coordinate of the 3D point (e.g. everything detected is on the ground or in some known plane).
Know the relation between two projected points:
Either the same point from different positions (known camera movement/offset)
or two points a known, significant distance apart (like the two ends of a staff or bar of known length).
Once you've got either, you're able to use simple trigonometry (rule of three) to calculate the missing values.
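Two of those cases reduce to one-line similar-triangle formulas. The sketch assumes a pinhole camera with known focal length in pixels and no lens distortion; for the known-length case, the object should be roughly parallel to the image plane, and for the known-plane case, a level camera (zero pitch) mounted at a known height above a flat floor, with image y pointing down. The function names are hypothetical.

```cpp
#include <stdexcept>

// Known length: an object of real length L, roughly facing the camera, spans
// l pixels. Similar triangles: l = f * L / Z  =>  Z = f * L / l.
double depthFromKnownSize(double focal_px, double real_length, double pixel_length) {
    if (pixel_length <= 0.0) throw std::invalid_argument("pixel_length must be positive");
    return focal_px * real_length / pixel_length;
}

// Known plane: level camera at height h above a flat floor. A floor point imaged
// at row v, below the principal-point row cy, satisfies v - cy = f * h / Z,
// so its forward distance is Z = f * h / (v - cy).
double depthOnFloor(double focal_px, double cam_height, double v, double cy) {
    if (v <= cy) throw std::invalid_argument("point must be imaged below the horizon row");
    return focal_px * cam_height / (v - cy);
}
```

For example, with f = 800 px, a 0.5 m bar spanning 200 px is about 2 m away, and with the camera 1.5 m above the floor and cy = 240, a floor point on row 440 is 6 m away.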
Edit: I initially missed that this camera has an OpenNI-compatible depth sensor. OpenCV can be built with support for it by enabling the WITH_OPENNI option (WITH_OPENNI2 for OpenNI 2) when configuring the library with CMake.
I don't like being the one to break this to you, but what you are trying to do is either impossible or extremely difficult with a single camera.
You need to move the camera, record a video, and use a complex technique such as this. Usually 3D information is created from at least two 2D images taken from two different places. You also need to know quite precisely the distance and the rotation between the two camera positions. The common technique is to use two cameras with a precisely measured distance between them.
The Xtion is not a basic webcam. It's a depth-sensing camera similar to the Kinect (both are based on PrimeSense structured-light technology). The main API for it is OpenNI; see http://structure.io/openni.
I am currently looking for a proper solution to the following problem, which is not directly programming-oriented, but I am guessing that OpenCV users might have an idea:
My stereo camera has a 1/3.2" sensor with 752x480 resolution. I am using the two stereo images from this camera to create a point cloud with the Point Cloud Library (PCL).
The problem is that I would like to reduce the number of points in the point cloud by directly lowering the resolution of the input images (from 752x480 to 376x240).
As indicated in the title, I have to adapt the camera's focal length in pixels accordingly.
I compute this parameter with the following formula:
float focal_pixel = (FOCAL_METERS / SENSOR_WIDTH_METERS)*InputImg.cols;
However, SENSOR_WIDTH_METERS is currently a constant corresponding to the 1/3.2" format converted to meters, AND I would like to adapt it to the resolution I want: 376x240.
I am not at all sure whether I have phrased my problem clearly enough to be answered; if not, it may mean that I am going in the wrong direction.
Thank you in advance
Edit: the function used to process the stereo images (after the computation):
getPointCloud(hori_c_pp, vert_c_pp, focal_pixel, BASELINE_METERS, out_stereo_cloud, ref_texture);
where the first two parameters are simply the coordinates of the image center, BASELINE_METERS is the baseline of my camera, out_stereo_cloud is my output cloud, and ref_texture is the color information. This function is taken from the stereo_matching sub-library.
For some reason, if I just resize the stereo images, this seems to conflict with the focal_pixel parameter, since the dimensions are no longer the same.
I'm very lost on this issue.
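A minimal sketch of the relationship involved, assuming square pixels, the same downsampling factor in both directions for both images, and parameters originally given for the full-resolution images. The struct and function names are hypothetical, not PCL API. The physical sensor width is a property of the hardware and does not change; what changes is the number of pixels across it, so the focal length in pixels and the principal point scale with the image width (ignoring the half-pixel shift of the pixel-centre convention). Disparities measured on the smaller images shrink by the same factor, so the metric depth f·B/d stays the same.

```cpp
struct StereoParams {
    double focal_px;    // focal length in pixels
    double cx, cy;      // principal point in pixels
    double baseline_m;  // metric, independent of image size
};

// Focal length in pixels from physical quantities, as in the formula above:
// the sensor width stays constant; only the pixel count across it changes.
double focalPixels(double focal_m, double sensor_width_m, int image_cols) {
    return focal_m / sensor_width_m * image_cols;
}

// Rescale the pixel-unit parameters when both images go from old_cols to new_cols.
StereoParams rescale(const StereoParams& p, int old_cols, int new_cols) {
    const double s = static_cast<double>(new_cols) / old_cols;
    return {p.focal_px * s, p.cx * s, p.cy * s, p.baseline_m};
}

// Metric depth of a point with the given disparity (in pixels of the same images).
double depth(const StereoParams& p, double disparity_px) {
    return p.focal_px * p.baseline_m / disparity_px;
}
```

In other words, keeping SENSOR_WIDTH_METERS constant and evaluating the formula with the resized image's cols already halves the focal length; the image-center coordinates passed to getPointCloud have to be halved as well.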
Since I don't really follow the formulas and method calls you're posting, I advise you to use another approach.
OpenCV already lets you compute a 3D point for every pixel of a disparity map obtained from your stereo images, using cv::reprojectImageTo3D. Another question already discusses the conversion to the corresponding PCL data type.
If you only want to reproject a certain ROI of your image, you should opt for cv::perspectiveTransform, as explained in the documentation I pointed to in the first link.
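For intuition, what cv::reprojectImageTo3D computes per pixel is [X Y Z W]ᵀ = Q·[x y d 1]ᵀ followed by division by W, where Q is the 4x4 reprojection matrix returned by cv::stereoRectify. The standard-library sketch below builds Q for the simple case of rectified cameras with identical principal points (so the bottom-right entry is zero) and a horizontal baseline B, assuming the usual left/right rig where the −1/Tx entry works out to 1/B. It only illustrates the arithmetic; the names are hypothetical and it is not a replacement for the OpenCV function.

```cpp
#include <array>
#include <optional>

using Mat4 = std::array<std::array<double, 4>, 4>;
using Vec3 = std::array<double, 3>;

// Reprojection matrix for rectified cameras with equal principal points,
// focal length f in pixels and baseline B (e.g. metres).
Mat4 makeQ(double f, double cx, double cy, double B) {
    return {{{1, 0, 0, -cx},
             {0, 1, 0, -cy},
             {0, 0, 0, f},
             {0, 0, 1 / B, 0}}};
}

// [X Y Z W]^T = Q * [x y d 1]^T, then divide by W.
// Gives X = (x - cx) B / d, Y = (y - cy) B / d, Z = f B / d.
std::optional<Vec3> reprojectPixel(const Mat4& Q, double x, double y, double d) {
    const std::array<double, 4> in{x, y, d, 1.0};
    std::array<double, 4> out{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) out[i] += Q[i][j] * in[j];
    if (out[3] == 0.0) return std::nullopt;  // zero disparity: point at infinity
    return Vec3{out[0] / out[3], out[1] / out[3], out[2] / out[3]};
}
```

Halving x, y, d, f, cx and cy together yields the same 3D point, which is why Q (or any focal length in pixels) has to be rebuilt whenever the images are resized.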
Say I have an object, and I obtained RGB data and Depth data from a kinect on one angle. Then I moved the kinect slightly so that I can take a second picture of the object from a different angle. I'm trying to figure out the best way to determine how much the kinect camera has translated and rotated from its original position based on the correspondences from both images.
I'm using OpenCV for certain image processing tasks, and I've looked into the SURF algorithm, which seems to find good correspondences between two 2D images. However, it doesn't take the depth data into account, and I don't think it works very well with multiple pictures. Additionally, I can't figure out how to obtain the translation/rotation from the correspondences. Perhaps I'm looking in the wrong direction?
Note: my long-term goal is to "merge" the multiple images into a 3D model built from all the data. At the moment it somewhat works if I specify the angles at which the Kinect is positioned, but I think it's much better to reduce the "errors" involved (i.e. points shifted slightly from where they should be) by finding correspondences instead of specifying parameters manually.