I don't understand what a synthetic image is in computer vision.
And what are the differences between an optical image and a synthetic image?
Here's an example of what prompted the question. It's a screenshot of a research paper:
A real image is obtained by an imaging device such as a camera, which converts the light from a scene to pixel values. Due to the image formation process that obeys the laws of physics, real images are rich, complex and often noisy and textured. The real world contains a lot of information.
A synthetic image is obtained "out of the blue" by pure computation, i.e. by modelling the real world and simulating the laws of optics.
Two decades ago, you could spot a synthetic image at a glance, because it lacked realism and was generated with overly simple models (in part due to heavy computation costs). This is no longer true; nowadays the two tend to be indistinguishable.
Note that in scientific contexts, very simple images (say, a chessboard) may be used for experimental purposes, for instance to test an image filter.
For instance, the scene below has been synthesized by armies of researchers, with the goal of finding the most realistic lighting simulation. This room never existed.
Related
I'm working on a computer vision application that needs to infer the texture of a surface from a photo taken under illumination from a known direction. Take this embossed paper for example:
The brain makes various assumptions, such as that illumination comes from above and that the surface is contiguous, and arrives at a model of texture such that you can imagine what it would be like to run your finger over this surface.
This problem is called "shape from shading"; it is recognised to be very hard in the general case, is an active topic of research, and even has its own book.
My question is: for the specific case of a mostly flat surface with bumps in it, such as embossed paper or a white-painted wall, are there any algorithms for doing this that are either released in an open source library, or simple enough that they can be assembled from primitives that are available in open source libraries like OpenCV or SciPy without requiring degree-level maths?
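For what it's worth, for the restricted case above (a near-flat, roughly Lambertian surface lit from a known direction) a crude linearized height recovery can be assembled from NumPy/OpenCV primitives alone. The sketch below assumes the light lies in the x-z plane at a known elevation angle, a constant albedo estimated from the mean intensity, and a hypothetical file name; it is an illustration of the idea, not a general shape-from-shading solver.

```python
import numpy as np
import cv2

def height_from_shading(gray, light_elevation_deg=45.0):
    """Rough relative-height recovery for a near-flat Lambertian surface
    lit from a known direction assumed to lie in the x-z plane.
    Illustrative linearization only, not a general solver."""
    I = gray.astype(np.float64) / 255.0
    # Smooth to suppress texture and noise before differentiating.
    I = cv2.GaussianBlur(I, (9, 9), 0)

    theta = np.deg2rad(light_elevation_deg)      # avoid 90 degrees (lx would be 0)
    lx, lz = np.cos(theta), np.sin(theta)

    # Linearized Lambertian model: I ~ albedo * (lz - lx * dz/dx),
    # with the albedo estimated from the mean intensity of the image.
    albedo = I.mean() / lz
    dzdx = (albedo * lz - I) / (albedo * lx)

    # Integrate the slope along x to get a relative height map.
    height = np.cumsum(dzdx, axis=1)
    height -= height.mean(axis=1, keepdims=True)  # remove per-row offset
    return height

# Usage (hypothetical file name):
# gray = cv2.imread("embossed_paper.png", cv2.IMREAD_GRAYSCALE)
# z = height_from_shading(gray, light_elevation_deg=45.0)
```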
Context:
I have the RGB-D video from a Kinect, which is aimed straight down at a table. There is a library of around 12 objects I need to identify, alone or several at a time. I have been working with SURF extraction and detection from the RGB image, preprocessing by downscaling to 320x240, converting to grayscale, stretching the contrast and equalizing the histogram before applying SURF. I built a lasso tool to choose among detected keypoints in a still of the video image. Those keypoints are then used to build object descriptors, which in turn are used to identify objects in the live video feed.
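For reference, a minimal sketch of that preprocessing pipeline might look like the following (it assumes an OpenCV build with the contrib/xfeatures2d module so that SURF is available, and the Hessian threshold is just an arbitrary starting value):

```python
import cv2

def preprocess_and_detect(frame_bgr):
    # Downscale, convert to grayscale, stretch contrast, equalize histogram.
    small = cv2.resize(frame_bgr, (320, 240))
    gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
    stretched = cv2.normalize(gray, None, 0, 255, cv2.NORM_MINMAX)
    equalized = cv2.equalizeHist(stretched)

    # SURF lives in the contrib module (opencv-contrib-python).
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    keypoints, descriptors = surf.detectAndCompute(equalized, None)
    return keypoints, descriptors
```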
Problem:
SURF examples show successful identification of objects with a decent amount of text-like feature detail, e.g. logos and patterns. The objects I need to identify are relatively plain but have distinctive geometry. The SURF features found in my stills are sometimes consistent but mostly unimportant surface features. For instance, say I have a wooden cube: SURF detects a few bits of grain on one face, then fails on other faces. I need to detect (something like) that there are four corners at equal distances and right angles. None of my objects has much of a pattern, but all have distinctive symmetric geometry and color. Think cellphone, lollipop, knife, bowling pin.
My thought was that I could build object descriptors for each significantly different-looking orientation of the object, e.g. two descriptors for a bowling pin: one standing up and one lying down. For a cellphone, one lying on the front and one on the back. My recognizer needs rotational invariance and some degree of scale invariance in case objects are stacked. The ability to deal with some occlusion is preferable (SURF behaves well enough) but not the most important characteristic. Skew invariance would be preferable, and SURF does well with paper printouts of my objects held by hand at a skew.
Questions:
Am I using the wrong SURF parameters to find features at the wrong scale? Is there a better algorithm for this kind of object identification? Is there something as readily usable as SURF that uses the depth data from the Kinect along with or instead of the RGB data?
I was doing something similar for a project and ended up using a super simple method for object recognition: OpenCV blob detection, recognizing objects based on their areas. Obviously, there needs to be enough variance between the objects' areas for this method to work.
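A rough sketch of what that looks like with OpenCV's SimpleBlobDetector; the per-object area ranges are made-up placeholders you would calibrate from your own footage:

```python
import cv2

# Configure the detector to filter blobs by area; bounds are placeholders.
params = cv2.SimpleBlobDetector_Params()
params.filterByArea = True
params.minArea = 200
params.maxArea = 50000
detector = cv2.SimpleBlobDetector_create(params)

AREA_RANGES = {              # hypothetical per-object area ranges (pixels)
    "lollipop": (200, 800),
    "cellphone": (3000, 6000),
}

def classify_blobs(gray):
    keypoints = detector.detect(gray)
    labels = []
    for kp in keypoints:
        # kp.size is the blob diameter; approximate the area from it.
        area = 3.14159 * (kp.size / 2.0) ** 2
        for name, (lo, hi) in AREA_RANGES.items():
            if lo <= area <= hi:
                labels.append(name)
    return labels
```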
You can see my results here: http://portfolio.jackkalish.com/Secondhand-Stories
I know there are other methods out there; one possible solution for you could be approxPolyDP, which is described here:
How to detect simple geometric shapes using OpenCV
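For completeness, here is a minimal sketch of that approxPolyDP approach; the threshold value and epsilon factor are just common starting points, not anything tuned for your objects:

```python
import cv2

def detect_shapes(gray):
    # Threshold, find external contours (OpenCV 4 returns contours, hierarchy),
    # approximate each contour with a polygon and use the vertex count
    # as a crude shape label.
    _, thresh = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    shapes = []
    for cnt in contours:
        perimeter = cv2.arcLength(cnt, True)
        approx = cv2.approxPolyDP(cnt, 0.02 * perimeter, True)
        if len(approx) == 3:
            shapes.append("triangle")
        elif len(approx) == 4:
            shapes.append("quadrilateral")
        else:
            shapes.append("other (%d vertices)" % len(approx))
    return shapes
```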
Would love to hear about your progress on this!
I have read several papers on using graph cuts for 3D reconstruction and I have noticed that there seem to be two alternative approaches to posing this problem.
One approach is volumetric and describes a 3D region of voxels for which a graph cut is used to infer a binary labelling (contains object of interest or does not) for each voxel. Papers which take this approach include Multi-View Stereo via Volumetric Graph Cuts and Occlusion Robust Photo-Consistency and A Surface Reconstruction Using Global Graph Cut Optimization.
The second approach is 2D and seeks to label each pixel of a reference image with the depth of the 3D point that projects there. Papers which take this approach include Computing Visual Correspondence with Occlusions via Graph Cuts.
I want to understand the advantages/disadvantages of each method and which are the most significant when choosing which method to use. So far I understand that some advantages of the first approach are:
It is a binary labelling problem, so it is solvable exactly with max-flow algorithms.
Provides simple methods of modelling occlusion.
And some advantages of the second approach are:
Smaller neighbor set for each node of the graph.
Easier to model smoothness (but does it give better results?).
Additionally, I would be interested in which situations I would be better off choosing one representation or the other and why.
The most significant difference is the type of scenes the algorithms are typically used with, and the way they represent the 3D shape of the object.
Volumetric approaches perform best with a large number of images, taken from different viewpoints and well distributed around the object, of a more or less compact "object" (e.g. an artifact, in contrast, for example, to an outdoor scene observed by a vehicle camera).
Volumetric approaches are popular for reconstructing "objects" (especially artifacts). Given sufficient views (i.e. images), the algorithms give a complete volumetric (i.e. voxel) representation of the object's shape. This can be converted to a surface representation using Marching Cubes or a similar method.
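To illustrate the voxel-to-surface step, here is a small sketch using scikit-image's marching cubes on a synthetic binary volume (a sphere standing in for the output of a volumetric reconstruction; recent scikit-image versions expose this as measure.marching_cubes):

```python
import numpy as np
from skimage import measure

# Synthetic binary voxel volume: a sphere, standing in for the output of a
# volumetric graph-cut reconstruction.
x, y, z = np.mgrid[-32:32, -32:32, -32:32]
volume = (x**2 + y**2 + z**2 < 20**2).astype(np.float32)

# Extract a triangle mesh from the 0.5 iso-surface.
verts, faces, normals, values = measure.marching_cubes(volume, level=0.5)
print(verts.shape, faces.shape)
```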
The second type of algorithm you identified is called a stereo algorithm, and graph cuts are just one of many methods of solving such problems. Stereo works best if you have only two images, taken with a fairly narrow baseline (i.e. distance between the cameras).
Generalizations to more than two images (with narrow baselines) exist, but most of the literature deals with the binocular (i.e. two image) case. Some algorithms generalize more easily to more views than others.
Stereo algorithms only give you a depth map, i.e. an image with a depth value for each pixel. This does not allow you to look "around" the object. There are, however, 3D reconstruction systems that start with stereo on image pairs and combine the depth maps in order to get a representation of the complete object, which is a non-trivial problem in its own right. Interestingly, this is often approached using a volumetric representation as an intermediate step.
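As an illustration of the binocular case, a disparity map (from which depth follows) can be computed with OpenCV's semi-global matcher roughly like this; the file names and matcher parameters are placeholders, and the image pair is assumed to be rectified:

```python
import cv2

# Rectified left/right pair (placeholder file names).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

stereo = cv2.StereoSGBM_create(minDisparity=0,
                               numDisparities=64,   # must be divisible by 16
                               blockSize=7)
# compute() returns fixed-point disparities scaled by 16.
disparity = stereo.compute(left, right).astype("float32") / 16.0

# Depth is inversely proportional to disparity:
# depth = focal_length * baseline / disparity (for disparity > 0).
```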
Stereo algorithms can be, and often are, used for "scenes", e.g. the road observed by a pair of cameras in a vehicle, or people in a room for 3D video conferencing.
Some closing remarks
For both stereo and volumetric reconstruction, graph cuts are just one of several methods to solve the problem. Stereo, for example, can also be formulated as a continuous optimization problem, rather than a discrete one, which implies other optimization methods for its solution.
My answer contains a bunch of generalizations and simplifications. It is not meant to be a definitive treatment of the subject.
I don't necessarily agree that smoothness is easier in the stereo case. Why do you think so?
What approach would you recommend for finding obstacles in a 2D image?
Here are some key points I have come up with so far:
I doubt I can use object recognition based on a "database of obstacles" search, since I don't know what the obstruction might look like.
I assume color recognition might be problematic if the path does not differ a lot from the object itself.
Possibly, adding one more camera and computing a 3D image (like a Kinect does) would work, but that would not run as smoothly as I require.
To illustrate the problem: the robot can ride on either the left or the right side of the pavement. In the following picture, the left side is the correct choice:
If you know what the path looks like, this is largely a classification problem. Acquire a bunch of images of the path at different distances, illumination, etc. and manually label the ground in each image. Use this labeled data to train a classifier that classifies each pixel as either "road" or "not road." Depending upon the texture of the road, this could be as simple as classifying each pixel's RGB (or HSV) values or using OpenCV's built-in histogram back-projection (i.e. cv::CalcBackProjectPatch()).
I suggest beginning with manual thresholds, moving to histogram-based matching, and only using a full-fledged machine learning classifier (such as a Naive Bayes classifier or an SVM) if the simpler techniques fail. Once the entire image is classified, all pixels that are identified as "not road" are obstacles. By classifying the road instead of the obstacles, we completely avoid building a "database of objects".
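A minimal sketch of the histogram back-projection variant, using the cv2.calcBackProject binding; the hand-cropped road sample, histogram bins and threshold are assumptions you would tune:

```python
import cv2
import numpy as np

# Build a hue-saturation histogram from a hand-labelled patch of road,
# then back-project it onto new frames; pixels that match the road
# histogram poorly are treated as "not road", i.e. potential obstacles.
road_patch = cv2.imread("road_patch.png")          # hand-cropped road sample
patch_hsv = cv2.cvtColor(road_patch, cv2.COLOR_BGR2HSV)
road_hist = cv2.calcHist([patch_hsv], [0, 1], None, [30, 32],
                         [0, 180, 0, 256])
cv2.normalize(road_hist, road_hist, 0, 255, cv2.NORM_MINMAX)

def obstacle_mask(frame_bgr):
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    backproj = cv2.calcBackProject([hsv], [0, 1], road_hist,
                                   [0, 180, 0, 256], scale=1)
    backproj = cv2.GaussianBlur(backproj, (9, 9), 0)
    # Low back-projection scores -> not road -> obstacle (threshold is a guess).
    _, mask = cv2.threshold(backproj, 50, 255, cv2.THRESH_BINARY_INV)
    return mask
```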
Somewhat out of the scope of the question, the easiest solution is to add additional sensors ("throw more hardware at the problem!") and directly measure the three-dimensional position of obstacles. In order of preference:
Microsoft Kinect: Cheap, easy, and effective. Due to ambient IR light, it only works indoors.
Scanning Laser Rangefinder: Extremely accurate, easy to set up, and works outside. Also very expensive (~$1200-10,000 depending upon maximum range and sample rate).
Stereo Camera: Not as good as a Kinect, but it works outside. If you cannot afford a pre-made stereo camera (~$1800), you can make a decent custom stereo camera using USB webcams.
Note that professional stereo vision cameras can be very fast by using custom hardware (Stereo On-Chip, STOC). Software-based stereo is also reasonably fast (10-20 Hz) on a modern computer.
I want to develop an application in which the user inputs an image of a person, and the system should be able to identify the face in that image. The system should also work if there is more than one person in the image.
I need the logic; I don't have any idea how to work on image pixel data in such a way that it identifies people's faces.
Eigenface might be a good algorithm to start with if you're looking to build a system for educational purposes, since it's relatively simple and serves as the starting point for a lot of other algorithms in the field. Basically, you take a bunch of face images (training data), convert them to grayscale if they're RGB, and resize them so that every image has the same dimensions. You then turn the images into vectors by stacking the columns of each image (which is now a 2D matrix) on top of each other, compute the mean of every pixel value across all the images, and subtract that mean from every entry so that the data is centered. Once that's done, you compute the covariance matrix of the result, solve for its eigenvalues and eigenvectors, and find the principal components. These components serve as the basis for a vector space, and together describe the most significant ways in which face images differ from one another.
Once you've done that, you can compute a similarity score for a new face image by converting it into a face vector, projecting it into the new vector space, and computing the linear distance between it and the other projected face vectors.
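A compressed sketch of those training and matching steps, using NumPy's SVD to obtain the principal components; the number of components and the assumption that all images already share the same size are placeholders:

```python
import numpy as np

def train_eigenfaces(faces, num_components=20):
    # faces: list of N equally sized grayscale images (2D arrays).
    X = np.stack([f.astype(np.float64).ravel() for f in faces])  # N x (h*w)
    mean_face = X.mean(axis=0)
    X_centered = X - mean_face
    # Principal components via SVD of the centered data matrix.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:num_components]          # num_components x (h*w)
    projections = X_centered @ components.T   # training faces in face space
    return mean_face, components, projections

def match(new_face, mean_face, components, projections):
    # Project the probe face and find the nearest training face
    # by Euclidean distance in face space.
    v = new_face.astype(np.float64).ravel() - mean_face
    p = components @ v
    dists = np.linalg.norm(projections - p, axis=1)
    return int(np.argmin(dists)), float(dists.min())
```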
If you decide to go this route, be careful to choose face images that were taken under an appropriate range of lighting conditions and pose angles. Those two factors play a huge role in how well your system will perform when presented with new faces. If the training gallery doesn't account for the properties of a probe image, you're going to get nonsense results. (I once trained an eigenface system on random pictures pulled down from the internet, and it gave me Bill Clinton as the strongest match for a picture of Elizabeth II, even though there was another picture of the Queen in the gallery. They both had white hair, were facing in the same direction, and were photographed under similar lighting conditions, and that was good enough for the computer.)
If you want to pull faces from multiple people in the same image, you're going to need a full system to detect faces, pull them into separate files, and preprocess them so that they're comparable with other faces drawn from other pictures. Those are all huge subjects in their own right. I've seen some good work done by people using skin color and texture-based methods to cut out image components that aren't faces, but these are also highly subject to variations in training data. Color casting is particularly hard to control, which is why grayscale conversion and/or wavelet representations of images are popular.
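For the detection-and-cropping step, OpenCV's bundled frontal-face Haar cascade is a common starting point. A rough sketch (the output patch size and detector parameters are arbitrary defaults):

```python
import cv2

# Frontal-face Haar cascade shipped with the opencv-python package.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_faces(image_bgr, out_size=(100, 100)):
    # Detect faces and crop them to same-sized grayscale patches so they
    # can be fed to the recognition stage.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    faces = []
    for (x, y, w, h) in boxes:
        faces.append(cv2.resize(gray[y:y + h, x:x + w], out_size))
    return faces
```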
Machine learning is the keystone of many important processes in an FR system, so I can't stress the importance of good training data enough. There are a bunch of learning algorithms out there, but the most important one in my view is the naive Bayes classifier; the other methods converge on Bayes as the size of the training dataset increases, so you only need to get fancy if you plan to work with smaller datasets. Just remember that the quality of your training data will make or break the system as a whole, and as long as it's solid, you can pick whatever trees you like from the forest of algorithms that have been written to support the enterprise.
EDIT: A good sanity check for your training data is to compute average faces for your probe and gallery images. (This is exactly what it sounds like; after controlling for image size, take the sum of the RGB channels for every image and divide each pixel by the number of images.) The better your preprocessing, the more human the average faces will look. If the two average faces look like different people -- different gender, ethnicity, hair color, whatever -- that's a warning sign that your training data may not be appropriate for what you have in mind.
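The average-face check itself is only a few lines, assuming the images have already been resized to common dimensions:

```python
import numpy as np

def average_face(images):
    # Sum the images pixel-wise and divide by the count.
    # `images` is a list of equally sized arrays (grayscale or RGB).
    acc = np.zeros_like(images[0], dtype=np.float64)
    for img in images:
        acc += img
    return (acc / len(images)).astype(np.uint8)
```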
Have a look at the Face Recognition Homepage - there are algorithms, papers, and even some source code.
There are many, many different algorithms out there. Basically, what you are looking for is "computer vision". We made a project at university based around facial recognition and detection. What you need to do is google extensively and try to understand all this stuff. There is a bit of mathematics involved, so be prepared. First go to Wikipedia. Then you will want to search for PDF publications of specific algorithms.
You can go the hard way - write an implementation of all the algorithms yourself. Or the easy way - use a computer vision library like OpenCV or OpenVIDIA.
And actually it is not that hard to make something that will work, so be brave. It is a lot harder to make software that will work under different and constantly varying conditions, and that is where Google won't help you. But I suppose you don't want to go that deep.