I am a newbie to the field. My problem is to recognize whether an object similar to the objects in my training images is present in a test image. I want to use SIFT descriptors for recognition. Is the bag-of-words approach, which clusters SIFT descriptors and is used for classifying objects into different classes, suitable for this, or is there a simpler approach using SIFT descriptors?
Thanks in advance
The bag of visual words (BoW) is indeed the classic approach, originally proposed by Sivic & Zisserman in 2003 [Paper]. It was among the first to depart from previous methods that favored global descriptors, as opposed to local features like SIFT and SURF. I recommend implementing this classic pipeline if you are just beginning to learn about object detection and recognition.
I've started working with computer vision techniques quite a bit, mainly deep learning, but I want to get a good understanding of the more traditional techniques as well for a solid grounding. I have been playing around with some manual feature engineering techniques for classification with RF and SVM classifiers. I've looked at texture representations like HOG and LBP descriptors, as well as edge filters, Gabor filters, and spatial features such as Fourier descriptors. What I'm lacking is a good idea of how the different features group together and which categories each belongs to. I know some are defined as global and local, but what does this mean exactly, and which ones are which? Are there other categories, like texture and geometric, that I should consider? Any explanation would be useful and much appreciated (I've looked a lot online, but it all seems a bit fragmented).
Thanks!
Features are the information extracted from images in the form of numerical values that are difficult for a human to understand and correlate directly. If we consider the image as data, the information extracted from that data is known as features. Generally, features extracted from an image have a much lower dimension than the original image itself. This reduction in dimensionality reduces the overhead of processing a large set of images.
Basically, two types of features are extracted from images, depending on the application: local and global features. Features are sometimes referred to as descriptors. Global descriptors are generally used in image retrieval, object detection, and classification, while local descriptors are used for object recognition/identification. There is a large difference between detection and identification: detection is finding the existence of something (whether an object exists in an image/video), whereas recognition is finding the identity of an object (recognizing a specific person or object).
Global features describe the image as a whole to generalize the entire object, whereas local features describe image patches (around keypoints) of an object. Global features include contour representations, shape descriptors, and texture features, while local features represent the texture of an image patch. Shape matrices, invariant moments (Hu, Zernike), Histograms of Oriented Gradients (HOG), and Co-HOG are some examples of global descriptors. SIFT, SURF, LBP, BRISK, MSER, and FREAK are some examples of local descriptors.
Generally, for lower-level applications such as object detection and classification, global features are used, and for higher-level applications such as object recognition, local features are used. Combining global and local features improves recognition accuracy, at the cost of extra computational overhead.
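To make the global/local distinction concrete, here is a minimal Python/OpenCV sketch (not from the answer above; the file name and parameters are illustrative). HOG, used here as a global descriptor, yields one fixed-length vector for the whole resized image, while ORB yields one descriptor per detected keypoint:

# Minimal sketch contrasting a global and a local descriptor.
import cv2

img = cv2.imread("object.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input image

# Global descriptor: HOG summarizes the whole (resized) image as one vector.
img_small = cv2.resize(img, (64, 128))   # canonical HOG detection window size
hog = cv2.HOGDescriptor()                # default people-detector geometry
global_vec = hog.compute(img_small)      # one fixed-length vector per image
print("global HOG vector:", global_vec.shape)

# Local descriptors: ORB finds keypoints and describes each surrounding patch.
orb = cv2.ORB_create()
keypoints, local_desc = orb.detectAndCompute(img, None)
print("local ORB descriptors:", None if local_desc is None else local_desc.shape)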
I'm working on an automatic image annotation problem in which I'm trying to associate tags with images. For that, I'm trying to use SIFT features for learning. But the problem is that the SIFT features form a set of keypoints, each image yielding a 2-D array of descriptors, and the number of keypoints is huge. How many of them, and in what form, do I give to my learning algorithm, which typically accepts only one-dimensional features?
You can represent each SIFT descriptor as a "visual word", which is a single number, and use that as SVM input; I think this is what you need. It is usually done by k-means clustering.
This method is called "bag of words" and is described in this paper.
A short presentation reviewing the method is also available.
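A minimal sketch of that pipeline, assuming OpenCV with SIFT available (cv2.SIFT_create) and scikit-learn's KMeans; images is a placeholder list of grayscale arrays and k is the vocabulary size:

# Bag-of-visual-words: cluster SIFT descriptors into k "words", then
# represent each image as a fixed-length histogram of word occurrences.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def bow_histograms(images, k=100):
    sift = cv2.SIFT_create()
    per_image_desc = []
    for img in images:                         # grayscale numpy arrays
        _, desc = sift.detectAndCompute(img, None)
        if desc is None:
            desc = np.empty((0, 128), np.float32)
        per_image_desc.append(desc)
    # 1. Build the vocabulary: cluster all descriptors into k visual words.
    kmeans = KMeans(n_clusters=k, n_init=10).fit(np.vstack(per_image_desc))
    # 2. Each image becomes a k-bin histogram of its descriptors' nearest
    #    words; this fixed-length vector is what you feed to an SVM.
    hists = []
    for desc in per_image_desc:
        words = kmeans.predict(desc) if len(desc) else []
        hist, _ = np.histogram(words, bins=np.arange(k + 1))
        hists.append(hist / max(hist.sum(), 1))  # L1-normalize
    return np.array(hists), kmeans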
You should read the original paper about SIFT; it tells you what SIFT is and how to use it. Read chapter 7 (and the rest) carefully to understand how to use it in practice.
Here is the link to the original paper.
You can use the bag-of-words approach, which you can read about in the following post:
http://gilscvblog.wordpress.com/2013/08/23/bag-of-words-models-for-visual-categorization/
SIFT and SURF are invariant feature extractors, so matching their features helps solve a lot of problems.
But there is a matching problem, since not all points will appear in both images (and similarly in the near-duplicate case). You should therefore keep only the features that match reliably and discard the others (see the sketch below).
Another problem is that these algorithms extract lots of features, which makes exhaustive matching impractical on large datasets.
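One standard way to keep only the reliable matches is Lowe's ratio test, which accepts a match only when its best candidate is clearly better than the second-best. A minimal Python/OpenCV sketch (the image file names are placeholders):

# Match SIFT descriptors between two images and filter with the ratio test.
import cv2

img1 = cv2.imread("train.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical files
img2 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_L2)
pairs = matcher.knnMatch(des1, des2, k=2)   # two best matches per descriptor

good = [m for m, n in pairs if m.distance < 0.75 * n.distance]
print(len(good), "matches survive the ratio test")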
There is a good solution to these problems, called the "bag of visual words".
https://github.com/dermotte/LIRE implements the complete bag of visual words. Here is the LIRE demo site.
The code is very simple; if you know the bag of visual words, you can also modify it.
After getting the visual words, you should use the information retrieval approaches used in search engines. By the way, LIRE also includes an information retrieval library called Lucene. You should follow the LIRE way until you get the complete idea, then implement your own.
I am trying to use OpenCV's cascade classifier with the Histograms of Oriented Gradients (HOG) feature type, as in the paper "Fast Human Detection Using a Cascade of Histograms of Oriented Gradients".
Searching the web, I found that OpenCV's Cascade Classifier only supports the HAAR/LBP feature types (OpenCV Cascade Classification).
Is there a way to use HOG with the OpenCV cascade classifier? What do you suggest?
Is there a patch or another library that I can use?
Thanks in advance!
EDIT 1
I've kept searching, and I finally found in android-opencv that there is a trunk revision of the Cascade Classifier which allows it to work with HOG features. But I don't know if it works...
Link: http://code.opencv.org/projects/opencv/repository/revisions/6853
EDIT 2
I have not tested the fork above because my problem has changed. But I found an interesting link which may be very useful in the future (when I come back to this problem).
This page contains the source code of the paper "Histograms of Oriented Gradients for Human Detection", along with more information: http://pascal.inrialpes.fr/soft/olt/
If you use OpenCV-Python, then you have the option of using additional libraries, such as scikits.image, that have a built-in Histogram of Oriented Gradients implementation.
I had to solve exactly this same problem a few months ago, and documented much of the work (including very basic Python implementations of HoG, plus GPU implementations of HoG using PyCUDA) at this project page. There is code available there. The GPU code should be reasonably easy to modify for use in C++.
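For example, here is a minimal sketch using scikit-image (the current name of the scikits.image package mentioned above); the file name and parameter values are illustrative:

# Compute a HOG feature vector (and a visualization) with scikit-image.
from skimage import color, io
from skimage.feature import hog

img = color.rgb2gray(io.imread("person.jpg"))   # hypothetical RGB input
features, hog_image = hog(
    img,
    orientations=9,                # gradient orientation bins per cell
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
    visualize=True,                # also return an image of the histograms
)
print("HOG feature vector length:", features.shape[0])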
It now seems to be available in the non-Python code as well: opencv_traincascade in 2.4.3 has a HOG featureType option (which I did not try):
[-featureType <{HAAR(default), LBP, HOG}>]
Yes, you can use cv::CascadeClassifier with HOG features. To do this, just load it with hogcascade_pedestrians.xml, which you can find in opencv_src-dir/data/hogcascades.
The classifier works faster, and its results are much better, when it is trained with a HOG cascade compared with a Haar cascade...
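A minimal sketch of that loading step with the Python bindings (the image name is a placeholder, and the XML path may differ across OpenCV versions):

# Load a HOG-trained cascade and run it over an image.
import cv2

cascade = cv2.CascadeClassifier("hogcascade_pedestrians.xml")
img = cv2.imread("street.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical image
hits = cascade.detectMultiScale(img, scaleFactor=1.1, minNeighbors=3)
for (x, y, w, h) in hits:
    cv2.rectangle(img, (x, y), (x + w, y + h), 255, 2)  # mark each detection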
There don't seem to be any implementations of HOG training in OpenCV, and there are few sources about how HOG training works. From what I gathered, HOG training can be done in real time. But what are the requirements for training? How does the training process actually work?
As with most computer vision algorithms, Google Scholar is your friend :) I would suggest reading a few papers on how it works. Here is one of the most-referenced papers on HoG for you to start with.
Another tip when researching in computer vision is to note the authors of the papers you find interesting and try to find their websites. They tend to have implementations of their algorithms, as well as rules of thumb on how to use them. Also, look up the references cited in the paper about your algorithm. This can be very helpful in acquiring the background knowledge to truly understand how the algorithm works and why.
Your terminology is a bit mixed up. HOG is a feature descriptor. You can train a classifier using HOG, which can in turn be used for object detection. OpenCV includes a people detector that uses HOG features and an SVM classifier. It also includes CascadeClassifier, which can use HOG, and which is typically used for face detection.
There is a program in OpenCV called opencv_traincascade, which lets you train a cascade object detector and gives you the option to use HOG. There is a function in the Computer Vision System Toolbox for MATLAB called trainCascadeObjectDetector, which does the same thing.
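As an illustration of the built-in HOG + SVM people detector mentioned above, here is a minimal sketch with OpenCV's Python bindings (the image name is a placeholder):

# Run OpenCV's pretrained HOG + linear-SVM pedestrian detector.
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

img = cv2.imread("pedestrians.jpg")                    # hypothetical image
rects, weights = hog.detectMultiScale(img, winStride=(8, 8))
for (x, y, w, h) in rects:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)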
I know that most common object detection involves Haar cascades, and that there are many techniques for feature detection such as SIFT, SURF, STAR, ORB, etc., but if my end goal is to recognize objects, don't both approaches end up giving me the same result? I understand using feature techniques on simple shapes and patterns, but for complex objects these feature algorithms seem to work as well.
I don't need to know the difference in how they function, but whether or not having one of them is enough to exclude the other. If I use Haar cascades, do I need to bother with SIFT? Why bother?
thanks
EDIT: for my purposes I want to implement object recognition for a broad class of things, meaning that anything shaped like a cup will be picked up as part of the class "cups". But I also want to identify specific instances, meaning an NYC cup will be picked up as the instance "NYC cup".
Object detection usually consists of two steps: feature detection and classification. In the feature detection step, the relevant features of the object to be detected are gathered. These features are the input to the second step, classification. (Even Haar cascades can be used for feature detection, to my knowledge.) Classification involves algorithms such as neural networks, k-nearest neighbors, and so on. The goal of classification is to find out whether the detected features correspond to features that the object to be detected would have. Classification generally belongs to the realm of machine learning. Face detection is one example of object detection.
EDIT (Jul. 9, 2018):
With the advent of deep learning, neural networks with multiple hidden layers have come into wide use, making it relatively easy to see the difference between feature detection and object detection. A deep learning neural network consists of two or more hidden layers, each of which is specialized for a specific part of the task at hand. For neural networks that detect objects from an image, the earlier layers arrange low-level features into a many-dimensional space (feature detection), and the later layers classify objects according to where those features are found in that many-dimensional space (object detection). A nice introduction to neural networks of this kind is found in the Wolfram Blog article "Launching the Wolfram Neural Net Repository".
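As a minimal sketch of that layered structure (using Keras purely as an illustration; the text does not name a framework, and the layer sizes and class count are arbitrary):

# Early convolutional layers learn low-level features; the final dense
# layer classifies based on where those features land in feature space.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(64, 64, 3)),
    layers.Conv2D(16, 3, activation="relu"),   # low-level features (edges, blobs)
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),   # mid-level features (object parts)
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),    # classification over 10 classes
])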
Normally objects are collections of features. A feature tends to be a very low-level primitive thing. An object implies moving the understanding of the scene to the next level up.
A feature might be something like a corner, an edge etc. whereas an object might be something like a book, a box, a desk. These objects are all composed of multiple features, some of which may be visible in any given scene.
Invariance, speed, and storage are a few reasons I can think of off the top of my head. The alternative would be to keep the complete image and then check whether the given image is similar to the glass images you have in your database. But if you have a compressed representation of the glass, it will need less computation (and thus be faster), it will need less storage, and the features give you invariance across images.
Both of the methods you mentioned are essentially the same, with slight differences. In the case of Haar, you detect the Haar features and then boost them to increase the confidence. Boosting is nothing but a meta-classifier which smartly chooses which Haar features to include in your final meta-classification, so that it can give a better estimate. The other method also does more or less the same, except that you have more "sophisticated" features. The main difference is that you don't use boosting directly. You tend to use some sort of classification or clustering, like MoG (Mixture of Gaussians), k-means, or some other heuristic to cluster your data. Your clustering largely depends on your features and application.
What will work in your case is a tough question. If I were you, I would play around with Haar, and if it doesn't work, I would try the other method (obs :>). Be aware that you might want to segment the image and provide some sort of boundary around the region so it can detect the glasses.
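To illustrate the "segment first" advice, here is a minimal sketch assuming the opencv-python package, where cv2.data.haarcascades points at the stock cascade files (including an eyeglasses cascade); the image name is a placeholder:

# Find the face first, then restrict the glasses search to the eye region.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
glasses_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye_tree_eyeglasses.xml")

img = cv2.imread("portrait.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical image
for (x, y, w, h) in face_cascade.detectMultiScale(img, 1.1, 5):
    roi = img[y:y + h // 2, x:x + w]        # upper half of the detected face
    hits = glasses_cascade.detectMultiScale(roi, 1.1, 3)
    print(len(hits), "eye/glasses regions in this face")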