How to convert a webcam image to RGB-D - machine-learning

I'm building an iPhone-like FaceID program using my PC's webcam. I'm following this notebook, which uses a Kinect to create RGB-D images. So can I use my webcam to capture several images for the same purpose?
Here's how they predict the person in the Kinect image; it uses a .dat file.
file1 = ('faceid_train/(2012-05-16)(154211)/011_1_d.dat')
inp1 = create_input_rgbd(file1)
inp2 = create_input_rgbd(file1)  # in practice the second input would come from a different .dat file
model_final.predict([inp1, inp2])

The notebook uses a Kinect to create RGB-D images, whereas you want to use only an RGB camera to do something similar. The hardware is different, so there is no direct method.
You first have to estimate a depth map from the monocular image alone.
You can try Revisiting Single Image Depth Estimation: Toward Higher Resolution Maps with Accurate Object Boundaries, as shown below. The depth obtained is fairly close to the real ground truth. For non-life-threatening cases (e.g. controlling a UAV or a car), you can use it.
The code and model are available at
https://github.com/JunjH/Revisiting_Single_Depth_Estimation
Edit the demo .py file in that repository to run detection on a single image:
image = your_rgb_image                     # e.g. a single frame captured from the webcam
deep_learned_fake_depth = model(image)     # monocular depth estimate
# Add your additional classification routine afterwards.
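For illustration, here is a minimal sketch of gluing a webcam frame and a predicted depth map into a 4-channel RGB-D array. The estimate_depth call is a placeholder for whatever monocular depth model you load (e.g. the network from the repository above); it is not an existing API.

import cv2
import numpy as np

cap = cv2.VideoCapture(0)          # default webcam
ok, frame_bgr = cap.read()         # grab one BGR frame
cap.release()

frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)

# estimate_depth() is a hypothetical placeholder for the monocular depth model;
# it should return an HxW depth map for the given RGB image.
depth = estimate_depth(frame_rgb)
depth = cv2.resize(depth, (frame_rgb.shape[1], frame_rgb.shape[0]))

rgbd = np.dstack([frame_rgb, depth])   # HxWx4 "fake" RGB-D image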
Take note that this method cannot run in real time, so you can only apply it at keyframes. The common practice is to use feature tracking between keyframes to fake continuous detection (a rough sketch follows below).
Also take note that some phone devices have a small depth-estimation sensor that you can make use of. I'm not sure of the details, as I deal with Android and iOS at a very minimal level.
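A rough sketch of the keyframe-plus-tracking idea, assuming the expensive depth/recognition pipeline runs only every Nth frame and Lucas-Kanade optical flow carries feature points forward in between. The keyframe interval and parameters are illustrative.

import cv2

cap = cv2.VideoCapture(0)
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200, qualityLevel=0.01, minDistance=7)

frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    if frame_idx % 30 == 0:
        # Keyframe: (re)run the slow monocular-depth + recognition step here,
        # then re-seed the tracked points.
        pts = cv2.goodFeaturesToTrack(gray, maxCorners=200, qualityLevel=0.01, minDistance=7)
    else:
        # In-between frames: just track the existing points.
        pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
        pts = pts[status.ravel() == 1].reshape(-1, 1, 2)
    prev_gray = gray
    frame_idx += 1
cap.release()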

Related

How can I make a simulation rendered depth image look like a stereo matching created depth image

For an AI project I am collecting images from a simulated environment and a real environment. In both scenarios a grayscale depth image is generated. However, the simulated environment generates perfect depth images which are not representative of the real world. This is why I want to artificially make the simulated depth image look like the one from the real world.
I am looking for functions, for example in OpenCV, to generate noise that makes the simulated image look like the real-world one. I already tried OpenCV's filter2D, which improved the image a bit, but I am looking for other functions that work better.
The real-world depth image is generated using a ZED 2 stereo vision camera.
Note that these images are not from the same situation, but they both contain trees, so they should give a rough idea.
Simulated image: (image not included)
Real-world image: (image not included)
You can try synthesizing the 2D images and then reconstructing the depth map with noise; a rough code sketch follows these steps. Detailed steps:
Assume one camera matrix with random extrinsic parameters, then create another camera matrix with a similar pose so that the two look like a stereo pair.
Project the simulated depth map to 2D images using these two camera poses to generate a stereo pair of images.
Add random noise to the images to introduce noise in the feature-point correspondences.
Given the two views of the same scene, estimate the depth map, which will now contain real-world-like noise.
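A rough sketch of steps 2-4 under assumed intrinsics: the file names, focal length fx and baseline are placeholders, the right view is fabricated with a crude backward warp that reuses the left-view disparity, and the depth units are whatever your simulator exports.

import cv2
import numpy as np

depth_sim = cv2.imread("sim_depth.png", cv2.IMREAD_UNCHANGED).astype(np.float32)  # depth in metres (scale as needed)
left_img = cv2.imread("sim_gray.png", cv2.IMREAD_GRAYSCALE)
fx, baseline = 525.0, 0.12            # assumed focal length (px) and baseline (m)

# Step 2: synthesize a right view by shifting pixels by their disparity.
disparity = fx * baseline / np.maximum(depth_sim, 1e-3)
h, w = left_img.shape
xs, ys = np.meshgrid(np.arange(w, dtype=np.float32), np.arange(h, dtype=np.float32))
right_img = cv2.remap(left_img, xs + disparity, ys, cv2.INTER_LINEAR)

# Step 3: corrupt both views so feature correspondences become imperfect.
def add_noise(img, sigma=5.0):
    return np.clip(img + np.random.normal(0, sigma, img.shape), 0, 255).astype(np.uint8)
left_n, right_n = add_noise(left_img), add_noise(right_img)

# Step 4: re-estimate depth with a real stereo matcher; its artefacts
# (holes, quantisation, edge fattening) make the result look like a ZED-style map.
matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=7)
disp = matcher.compute(left_n, right_n).astype(np.float32) / 16.0
depth_noisy = fx * baseline / np.maximum(disp, 1e-3)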

How do I generate stereo images from mono camera?

I have a stationary mono camera which captures a single image frame at some fps.
Assuming the camera is not allowed to move, how do I generate a stereo image pair from the obtained single image frame? Do any algorithms exist for this? If so, are they available in OpenCV?
To get a stereo image, you need a stereo camera, i.e. a camera with two calibrated lenses. So you cannot get a stereo image from a single camera with traditional techniques.
However, with the magic of deep learning, you can obtain a depth image from a single camera.
And no, there is no built-in OpenCV function to do that.
The most common use of this kind of technique is in 3D TVs, which often offer 2D-to-3D conversion, and thus mono-to-stereo conversion.
Various algorithms are used for this; you can look at this state-of-the-art report. A naive sketch of the 2D-to-3D idea follows.
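As an illustration only: given a single image and an estimated depth map, a naive depth-image-based rendering step can fabricate the second view by shifting each pixel by its disparity. The focal length and baseline below are assumed values, and occlusion/hole handling is deliberately omitted.

import numpy as np

def synthesize_right_view(left_rgb, depth, fx=500.0, baseline=0.06):
    # Naive forward warp: shift every left-image pixel left by its disparity.
    # Holes left by occlusions and depth ordering are not handled here.
    h, w, _ = left_rgb.shape
    disparity = (fx * baseline / np.maximum(depth, 1e-3)).astype(np.int32)
    right = np.zeros_like(left_rgb)
    xs = np.arange(w)
    for y in range(h):
        tx = xs - disparity[y]                # target column in the right image
        valid = (tx >= 0) & (tx < w)
        right[y, tx[valid]] = left_rgb[y, xs[valid]]
    return right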
There is also an optical way to do this.
If you can add binocular prisms/mirrors in front of your camera lens, then you can obtain a real stereoscopic image from a single camera. That of course needs access to the camera and setting up the optics, and it also introduces some problems like wrong auto-focusing, the need for image calibration, etc.
You can also merge red/cyan filtered images together to maintain the camera's full resolution (see the sketch below).
Here is a publication which might be helpful: Stereo Panorama with a Single Camera.
You might also want to have a look at the OpenCV camera calibration module and at this page.
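If the two optical paths are filtered red and cyan and merged on the sensor, the two views can later be recovered by channel separation. A minimal sketch, assuming such a capture and placeholder file names:

import cv2

frame = cv2.imread("prism_capture.png")           # placeholder: red/cyan filtered capture
left_view = frame[:, :, 2]                        # red channel (OpenCV uses BGR order)
right_view = cv2.addWeighted(frame[:, :, 1], 0.5, frame[:, :, 0], 0.5, 0)  # cyan = mean(G, B)
cv2.imwrite("left_gray.png", left_view)
cv2.imwrite("right_gray.png", right_view)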

Stitching Aerial images with OpenCV with a warper that projects images to the ground

Has anyone done something like that?
My problem with the OpenCV stitcher is that it warps the images for panoramas, meaning the images get stretched a lot as one moves away from the first image.
From what I can tell, OpenCV also builds on the assumption that the camera is in the same position. I am seeking a little guidance on this: is it just the warper I need to change, or do I also need to relax the assumption that the camera position is fixed?
I noticed that OpenCV uses a bundle adjuster as well; is it using the same assumption that the camera is fixed?
Aerial image mosaicing
The image warping routines used in remote sensing and digital geography (for example to produce GeoTIFF files, or more generally orthoimages) rely on both:
estimating the relative image motion (often improved with aircraft motion sensors such as inertial measurement units), and
the availability of a Digital Elevation Model of the observed scene.
This allows the exact ground projection of each measured pixel to be estimated.
This is well beyond what OpenCV provides with its built-in stitcher.
OpenCV's Stitcher
OpenCV's Stitcher class is indeed dedicated to the assembly of images taken from the same point.
This would not be so bad, except that the functions try to estimate just a rotation (to be more robust) instead of full homographies (this is where the fixed-camera assumption will bite you).
It does, however, add more functionality that is useful in the context of panorama creation, especially the seam-cut detection and the image blending in overlapping areas.
What you can do
With aerial sensors, it is usually sound to assume (except when creating orthoimages) that the camera-to-scene distance is large enough that you can approximate the inter-frame transform by homographies (especially if your application does not require very accurate panoramas).
You can try to customize OpenCV's stitcher to replace the transform estimation and the warper so that they work with homographies instead of rotations.
I can't guess whether it will be difficult or not; for the most part it will consist of using the intermediate transform results and bypassing the final rotation estimation part. You may, however, have to modify the bundle adjuster too. A rough homography-based sketch follows.
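As an illustration of the homography-based route (not a drop-in replacement for the Stitcher internals), the sketch below chains pairwise RANSAC homographies between consecutive aerial frames and warps everything into the first frame's plane. The file names and canvas size are placeholders, and seam finding and blending are omitted.

import cv2
import numpy as np

frames = [cv2.imread(f) for f in ("aerial_0.jpg", "aerial_1.jpg", "aerial_2.jpg")]

orb = cv2.ORB_create(4000)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def pairwise_homography(img_a, img_b):
    # Homography mapping img_b into img_a's frame, from ORB matches + RANSAC.
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    matches = matcher.match(des_b, des_a)
    src = np.float32([kp_b[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_a[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H

# Accumulate transforms so that every frame maps into the first frame's plane.
H_to_first = [np.eye(3)]
for i in range(1, len(frames)):
    H = pairwise_homography(frames[i - 1], frames[i])
    H_to_first.append(H_to_first[-1] @ H)

# Warp onto a generously sized canvas; computing the proper mosaic extent,
# seam cutting and blending are left out for brevity.
canvas = np.zeros((3000, 5000, 3), dtype=np.uint8)
for img, H in zip(frames, H_to_first):
    warped = cv2.warpPerspective(img, H, (canvas.shape[1], canvas.shape[0]))
    canvas = np.where(warped > 0, warped, canvas)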

SURF feature detection for linear panoramas OpenCV

I have started a project to create linear/strip panoramas of long scenes using video, meaning that the panorama doesn't revolve around a center but moves parallel to a scene, e.g. a video camera mounted on a vehicle looking perpendicular to the street facade.
The steps I will be following are:
Capture frames from video
Feature detection (SURF)
Feature tracking (Kanade-Lucas-Tomasi)
Homography estimation
Stitching the mosaic
So far I have been able to save individual frames from the video and complete SURF feature detection on only two images. I am not asking anyone to solve my entire project, but I am stuck trying to complete the SURF detection on the remaining captured frames.
Question: How do I apply SURF detection to successive frames? Do I save the results as YAML or XML?
For my feature detection I used OpenCV's sample find_obj.cpp and just changed the images used.
Has anyone worked on such a project? An example of what I would like to achieve is from Iwane Technologies: http://www.iwane.com/en/2dpcci.php
While working on a similar project, I created an std::vector of SURF keypoints (both points and descriptors) and then used them to compute the pairwise matchings.
The vector was filled while reading a movie frame by frame, but it works the same with a sequence of images.
There are not enough points to saturate your memory (and justify using YAML/XML files) unless you have very limited resources or a very, very long sequence.
Note that you do not need the feature-tracking part, at least in most standard cases: SURF descriptor matching can also provide you with a homography estimate, without the need for tracking (a sketch follows below).
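A rough Python/OpenCV transcription of that idea is sketched below (the original answer is written with the C++ API in mind). SURF lives in the contrib xfeatures2d module and may require a build with non-free algorithms enabled; cv2.ORB_create() is a drop-in alternative if it is unavailable. The video file name is a placeholder.

import cv2
import numpy as np

detector = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)

cap = cv2.VideoCapture("street_facade.mp4")   # placeholder video file
features = []                                 # (keypoints, descriptors) for every frame
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    features.append(detector.detectAndCompute(gray, None))
cap.release()

# Pairwise matching between consecutive frames, then a RANSAC homography.
homographies = []
for (kp_a, des_a), (kp_b, des_b) in zip(features, features[1:]):
    matches = matcher.match(des_a, des_b)
    src = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    homographies.append(H)                    # maps frame i into frame i+1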
Reading to a vector
Start by declaring a vector of Mats, for example std::vector<cv::Mat> my_sequence;.
Then you have two choices:
either you know the number of frames, in which case you resize the vector to the correct size and, for each frame, read the image into some variable and copy it to the correct place in the sequence using my_sequence.at(i) = frame.clone(); or frame.copyTo(my_sequence.at(i));
or you don't know the size beforehand, and you simply call the push_back() method as usual: my_sequence.push_back(frame);

Is it possible to extract the foreground in a video where the biggest part of the background is a huge screen (playing a video)?

I am working on a multi-view telepresence project using an array of Kinect cameras.
To improve the visual quality we want to extract the foreground, e.g. the person standing in the middle, using the color image and not the depth image, because we want to use the more reliable color image to repair some artefacts in the depth image.
The problem is that the foreground objects (usually 1-2 persons) are standing in front of a huge screen showing the other party of the telepresence system, which is also moving all the time, and this screen is visible to some of the Kinects. Is it still possible to extract the foreground for these Kinects, and if so, could you point me in the right direction?
Some more information regarding the existing system:
We already have a system running that merges the depth maps of all the Kinects, but that only gets us so far. There are a lot of issues with the Kinect depth sensor, e.g. interference and distance to the sensor.
Also, the color and depth sensors are slightly shifted, so when you map the color (like a texture) onto a mesh reconstructed from the depth data, you sometimes map the floor onto the person.
All these issues decrease the overall quality of the depth data, but not of the color data, so one could view the color-image silhouette as the "real" one and the depth one as a "broken" one. Nevertheless, the mesh is constructed using the depth data, so improving the depth data equals improving the quality of the system.
If you have the silhouette, you could try to remove/modify incorrect depth values outside of the silhouette and/or add missing depth values inside it.
Thanks for every hint you can provide.
In my experience with this kind of problem, the strategy you propose is not the best way to go.
Since you have a non-constant background, the problem you want to solve is actually 2D segmentation. This is a hard problem, and people typically use depth to make segmentation easier, not the other way round. I would try to combine/merge the multiple depth maps from your Kinects in order to improve your depth images, maybe in a KinectFusion kind of way, or using classic sensor-fusion techniques.
If you are absolutely determined to follow your strategy, you could try to use your imperfect depth maps to combine the RGB camera images of the Kinects in order to reconstruct a complete view of the background (without occlusion by the people in front of it). However, due to the changing background image on the screen, this would require your Kinects' RGB cameras to be synchronized, which I think is not possible.
Edit in the light of comments / updates
I think exploiting your knowledge of the image on the screen is your only chance of doing background subtraction for silhouette enhancement. I see that this is a tough problem, as the screen is a stereoscopic display, if I understand you correctly.
You could try to compute a model that describes what your Kinect RGB cameras see (given the stereoscopic display, their placement, the type of sensor, etc.) when you display a certain image on your screen: essentially a function telling you "Kinect K sees (r, g, b) at pixel (x, y) when I show (r', g', b') at pixel (x', y') on my display". To build it, you will have to create a sequence of calibration images, show them on the display without a person standing in front of it, and film them with the Kinects. This would allow you to predict the appearance of your screen in the Kinect cameras and thus compute the background subtraction (a rough sketch follows). This is a pretty challenging task (but it would make a good research paper if it works).
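Once such a model exists, the subtraction step itself is simple. A rough sketch, where predict_background() is a hypothetical stand-in for the learned display-to-camera appearance model described above and the file names and threshold are placeholders:

import cv2
import numpy as np

def predict_background(displayed_img):
    # Hypothetical stand-in: a real model would account for how the display
    # appears in this particular Kinect's RGB camera (placement, color response, etc.).
    return displayed_img

displayed = cv2.imread("currently_displayed.png")    # placeholder: image sent to the screen
observed = cv2.imread("kinect_rgb_frame.png")        # placeholder: current Kinect RGB frame

background = predict_background(displayed)            # predicted appearance of the screen
diff = cv2.absdiff(observed, background)
mask = (diff.max(axis=2) > 25).astype(np.uint8) * 255  # threshold value is arbitrary
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
silhouette = cv2.bitwise_and(observed, observed, mask=mask)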
A side note: you can easily compute the geometric relation of a Kinect's depth camera to its color camera, in order to avoid mapping the floor onto the person. Some Kinect APIs allow you to retrieve the raw image of the depth (IR) camera. If you cover the IR projector, you can film a calibration pattern with both the depth and the RGB camera and compute an extrinsic calibration (sketched below).
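A minimal sketch of that extrinsic calibration with OpenCV, assuming you have already captured matching checkerboard views from the IR/depth camera (projector covered) and the RGB camera; the paths and board geometry are placeholders.

import glob
import cv2
import numpy as np

board_size = (9, 6)                 # inner corners of the checkerboard
square = 0.025                      # square edge length in metres
objp = np.zeros((board_size[0] * board_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:board_size[0], 0:board_size[1]].T.reshape(-1, 2) * square

obj_pts, ir_pts, rgb_pts = [], [], []
for ir_file, rgb_file in zip(sorted(glob.glob("calib/ir_*.png")),
                             sorted(glob.glob("calib/rgb_*.png"))):
    ir = cv2.imread(ir_file, cv2.IMREAD_GRAYSCALE)
    rgb = cv2.imread(rgb_file, cv2.IMREAD_GRAYSCALE)
    ok_ir, c_ir = cv2.findChessboardCorners(ir, board_size)
    ok_rgb, c_rgb = cv2.findChessboardCorners(rgb, board_size)
    if ok_ir and ok_rgb:
        obj_pts.append(objp)
        ir_pts.append(c_ir)
        rgb_pts.append(c_rgb)

# Per-camera intrinsics, then the rotation R and translation T that take
# points from the IR/depth camera frame into the RGB camera frame.
_, K_ir, d_ir, _, _ = cv2.calibrateCamera(obj_pts, ir_pts, ir.shape[::-1], None, None)
_, K_rgb, d_rgb, _, _ = cv2.calibrateCamera(obj_pts, rgb_pts, rgb.shape[::-1], None, None)
_, _, _, _, _, R, T, _, _ = cv2.stereoCalibrate(
    obj_pts, ir_pts, rgb_pts, K_ir, d_ir, K_rgb, d_rgb, ir.shape[::-1],
    flags=cv2.CALIB_FIX_INTRINSIC)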
