I am using a bag of visual words (BOW) model to cluster image features. All the work I have come across so far uses BOW to cluster features computed with SIFT, SURF, etc. Maybe I missed it, but is it also possible to represent a color histogram feature in BOW, and likewise an edge orientation histogram feature?
I am working on an image classifier and I want to use SIFT together with an RGB color histogram as feature descriptors in OpenCV. Is it correct to concatenate the two feature vectors into one and add that to the BOW, or is it correct to add the SIFT features to the BOW and concatenate the histogram feature to the resulting BOW vector? (I am using the latter model as of now, but I am wondering which one is correct.)
Historically, the bag-of-words model was used for document retrieval.
As mentioned in this post, in computer vision the bag-of-visual-words model is also referred to as bag-of-features (BOF). If you want to use two types of features, you can concatenate the two feature vectors and build the BOF model on the concatenated vectors. For example, in human action recognition, people often model an action with a BOF built on the concatenation of an appearance feature (gradient-based HOG) and a motion feature (optical-flow-based HOF).
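As a minimal sketch of that idea in Python with OpenCV and NumPy: for each SIFT keypoint, a small color histogram is computed from the patch around the keypoint and concatenated to the 128-dimensional SIFT descriptor before clustering. The patch size and bin count here are arbitrary choices for illustration, not recommended values.

    import cv2
    import numpy as np

    def concatenated_descriptors(image_bgr, patch_size=16, bins=8):
        """Concatenate each SIFT descriptor with a local color histogram."""
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        sift = cv2.SIFT_create()
        keypoints, descriptors = sift.detectAndCompute(gray, None)
        if descriptors is None:
            return np.empty((0, 128 + 3 * bins), dtype=np.float32)

        combined = []
        half = patch_size // 2
        for kp, desc in zip(keypoints, descriptors):
            x, y = int(round(kp.pt[0])), int(round(kp.pt[1]))
            patch = image_bgr[max(0, y - half):y + half, max(0, x - half):x + half]
            # Per-channel color histogram of the patch around the keypoint
            hist = np.hstack([cv2.calcHist([patch], [c], None, [bins], [0, 256]).ravel()
                              for c in range(3)])
            hist /= (hist.sum() + 1e-7)  # normalize so the scales stay comparable
            combined.append(np.hstack([desc, hist]).astype(np.float32))
        return np.vstack(combined)

    # These concatenated vectors are then fed to k-means to build the BOF codebook.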
I have to calculate the similarity between two images, and I was advised to use feature embeddings of images extracted by auto-encoders rather than features extracted by a CNN.
Can someone explain the exact difference: why can feature embeddings be used to calculate similarity, but not image features extracted by a CNN?
I have a high-level idea of image features: they are data generated by a single forward pass through a pre-trained network, taken from the (N-1)th layer rather than the prediction layer (softmax or sigmoid).
And I know that a word embedding projects a given word into a more convenient feature-dimensional space.
But what is the intuition behind embeddings for images?
When should I use one over the other?
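For reference, this is roughly what I mean by features from the (N-1)th layer: a minimal sketch with PyTorch/torchvision, where the choice of ResNet-18 and the preprocessing values are just an example, not part of my actual setup.

    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    # Pre-trained CNN with the final prediction (fc) layer removed,
    # so the output is the (N-1)th-layer feature vector.
    model = models.resnet18(pretrained=True)
    feature_extractor = torch.nn.Sequential(*list(model.children())[:-1])
    feature_extractor.eval()

    preprocess = T.Compose([
        T.Resize(256),
        T.CenterCrop(224),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def cnn_features(path):
        img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            return feature_extractor(img).flatten(1)  # shape (1, 512) for ResNet-18

    # Cosine similarity between two such feature vectors:
    sim = torch.nn.functional.cosine_similarity(cnn_features("a.jpg"), cnn_features("b.jpg"))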
I am trying to implement a wrinkle feature extraction method using Gabor filters in OpenCV C++ as outlined in this study:
Choi, Sung Eun, et al. "Age estimation using a hierarchical classifier based on global and local facial features." Pattern Recognition 44.6 (2011): 1262-1281.
The filters take as input patches of skin from a face image and need to pick up both thick and fine wrinkles. I don't know what parameters to use for wrinkles, since most examples on the internet deal with edges formed by objects. After computing the convolution output, I need to find the mean and variance of the magnitude response.
It would be great if I could find a study on choosing Gabor kernels for wrinkles, sample code, or even a systematic way to figure out the best parameters.
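For context, this is a sketch of how the filtering and the mean/variance step could look (written in Python with OpenCV; the same calls exist in the C++ API). The kernel parameters (ksize, sigma, lambda, gamma, the eight orientations) and the file name are placeholders I am unsure about, not values from the paper.

    import cv2
    import numpy as np

    def gabor_response_stats(skin_patch_gray, ksize=21, sigma=4.0, lambd=10.0, gamma=0.5):
        """Convolve a skin patch with a bank of Gabor kernels at several orientations
        and return the mean and variance of the magnitude response for each one."""
        stats = []
        for theta in np.arange(0, np.pi, np.pi / 8):  # 8 orientations as an example
            kernel = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma,
                                        psi=0, ktype=cv2.CV_32F)
            response = cv2.filter2D(skin_patch_gray.astype(np.float32), cv2.CV_32F, kernel)
            magnitude = np.abs(response)
            stats.append((magnitude.mean(), magnitude.var()))
        return stats

    patch = cv2.imread("forehead_patch.png", cv2.IMREAD_GRAYSCALE)
    print(gabor_response_stats(patch))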
I have implemented the SIFT algorithm in OpenCV for feature detection and matching using the following steps (a code sketch of the pipeline follows the list):
Background Removal using Otsu's thresholding
Feature Detection using SIFT feature detector
Descriptor Extraction using SIFT feature extractor
Matching feature vectors using a BFMatcher (L2 norm) and filtering good matches with the ratio test
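This is a condensed sketch of those steps in Python with OpenCV; the 0.75 ratio threshold and the file names are just the values I am using, not anything prescribed.

    import cv2

    def good_matches(path1, path2, ratio=0.75):
        img1 = cv2.imread(path1, cv2.IMREAD_GRAYSCALE)
        img2 = cv2.imread(path2, cv2.IMREAD_GRAYSCALE)

        # Background removal with Otsu's thresholding (foreground used as a mask)
        _, mask1 = cv2.threshold(img1, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        _, mask2 = cv2.threshold(img2, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

        # SIFT keypoint detection and descriptor extraction
        sift = cv2.SIFT_create()
        kp1, des1 = sift.detectAndCompute(img1, mask1)
        kp2, des2 = sift.detectAndCompute(img2, mask2)

        # Brute-force matching with L2 norm and Lowe's ratio test
        bf = cv2.BFMatcher(cv2.NORM_L2)
        matches = bf.knnMatch(des1, des2, k=2)
        return [m for m, n in matches if m.distance < ratio * n.distance]

    print(len(good_matches("heel1.jpg", "heel2.jpg")))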
My objective is to classify images into different categories such as shoes, shirts etc. based on their similarity. For example two different heels should be more similar to each other than a heel and a sports shoe or a heel and a t-shirt.
However, this algorithm works well only when my template image is present in the search image (at any scale and orientation). If I compare two different heels, they don't match well and the matches are essentially random (the heel of one image matches the flat surface of the other image). There are also many false positives when I compare a heel with a sports shoe, a heel with a t-shirt, or a heel with the picture of a baby!
I would like to look at a heel, identify it as a heel, and return how similar the heel is to different images in my database, giving maximum similarity for other heels, followed by other shoes. It should not report any similarity with irrelevant objects such as shirts, phones, or pens.
I understand that the SIFT algorithm produces a descriptor vector for each keypoint based on the gradient values of the pixels around the keypoint, and that images are matched purely on this attribute. Hence it is quite possible that a keypoint located near the heel of one shoe is matched to a keypoint on the surface of the other shoe. What I gather from this is that the algorithm can only detect exact matches, not similarity between images.
Could you please tell me whether this algorithm can be used for my objective, whether I am doing something wrong, or suggest another approach that I should use?
For classification of similar objects, I certainly would go for cascade classifiers.
Basically, a cascade classifier is a machine learning method where you train your classifier to detect an object in different images. For it to work well, you need to train your classifier with a lot of positive images (where your object is present) and negative images (where it is not). The method was introduced by Viola and Jones in 2001.
There is a ready-made implementation in OpenCV for face detection, and the OpenCV documentation gives more explanation.
Now, for the caveats:
First, you need a lot of positive and negative images. The more images you have, the better the algorithm will perform. Beware of overfitting: if your training dataset for heels contains, for instance, too many images of a given model, it is possible that other models will not be detected properly.
Training the cascade classifier can be long and difficult. The end result will depend on how well you choose the training parameters. Some information on this can be found on this webpage: http://coding-robin.de/2013/07/22/train-your-own-opencv-haar-classifier.html
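Once a cascade has been trained, using it from Python looks roughly like the sketch below; heel_cascade.xml is a placeholder for whatever cascade file your training produces, and the detection parameters usually need tuning.

    import cv2

    # Load a trained cascade (the XML file produced by the training step)
    cascade = cv2.CascadeClassifier("heel_cascade.xml")

    img = cv2.imread("query.jpg")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Detect object instances; scaleFactor and minNeighbors usually need tuning
    detections = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    for (x, y, w, h) in detections:
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imwrite("detections.jpg", img)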
I am working on developing an object classifier using three different features, i.e. SIFT, histogram, and edge.
However, these three features have vectors of different dimensions, e.g. SIFT = 128 dimensions, HIST = 256.
These features cannot be concatenated into one vector because of their different sizes. What I am planning to do, though I am not sure it is the correct way, is this:
For each feature I train a classifier separately, then apply classification separately for the three different features, count the votes, and finally assign the image the label with the majority of votes.
Do you think this is a correct way?
There are several ways to get classification results that take into account multiple features. What you have suggested is one possibility, where instead of combining features you train multiple classifiers and, through some protocol, arrive at a consensus between them. This falls under the field of ensemble methods. Try googling for boosting and random forests for more details on how to combine the classifiers.
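As a minimal sketch of the voting scheme you describe, assuming you already have one fixed-length feature matrix per feature type and integer class labels (the scikit-learn SVMs here are just an example choice of base classifier):

    import numpy as np
    from sklearn.svm import SVC

    def majority_vote_predict(feature_sets_train, y_train, feature_sets_test):
        """Train one classifier per feature type and combine predictions by majority vote.
        Assumes integer class labels starting at 0."""
        predictions = []
        for X_train, X_test in zip(feature_sets_train, feature_sets_test):
            clf = SVC(kernel="rbf").fit(X_train, y_train)
            predictions.append(clf.predict(X_test))
        predictions = np.vstack(predictions)  # shape: (n_feature_types, n_samples)
        # For each test sample, pick the label predicted by the most classifiers
        return np.array([np.bincount(col).argmax() for col in predictions.T])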
However, it is not true that your feature vectors cannot be concatenated because they have different dimensions. You can still concatenate the features together into one long vector. E.g., joining your SIFT and HIST features together will give you a vector of 384 dimensions. Depending on the classifier you use, you will likely have to normalize the entries of the vector so that no single feature dominates simply because, by construction, it has larger values.
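For example, a rough sketch of the concatenation and per-dimension normalization with NumPy and scikit-learn (the shapes and names are placeholders):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # One fixed-length vector per image for each feature type (placeholder data)
    sift_vec = np.random.rand(100, 128)    # e.g. a 128-dim vector derived from SIFT
    color_hist = np.random.rand(100, 256)  # e.g. a 256-bin color histogram per image

    X = np.hstack([sift_vec, color_hist])  # 384-dim combined vector per image

    # Standardize each dimension so no feature dominates purely because of its scale
    X = StandardScaler().fit_transform(X)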
EDIT in response to your comment:
It appears that your histogram is a feature vector describing a characteristic of the entire object (e.g. color), whereas your SIFT descriptors are extracted at local interest keypoints of that object. Since the number of SIFT descriptors may vary from image to image, you cannot pass them directly to a typical classifier, because classifiers usually take in one feature vector per sample you wish to classify. In such cases, you will have to build a codebook (also called a visual dictionary) using the SIFT descriptors you have extracted from many images. You will then use this codebook to derive a SINGLE feature vector from the many SIFT descriptors you extract from each image. This is what is known as a "bag of visual words (BOW)" model. Now that you have a single vector that "summarizes" the SIFT descriptors, you can concatenate it with your histogram to form a bigger vector. This single vector now summarizes the ENTIRE image (or the object in the image).
For details on how to build the bag of words codebook and how to use it to derive a single feature vector from the many SIFT descriptors extracted from each image, look at this book (free for download from the author's website) http://programmingcomputervision.com/ under the chapter "Searching Images". It is actually a lot simpler than it sounds.
Roughly, just run k-means to cluster the SIFT descriptors from many images and take the centroids (each centroid is a vector called a "visual word") as the codebook. E.g. for K = 1000 you have a codebook of 1000 visual words.
Then, for each image, create a result vector of size K (in this case 1000), where each element corresponds to a visual word. For each SIFT descriptor extracted from the image, find its closest matching vector in the codebook and increment the count in the corresponding cell of the result vector. When you are done, this result vector counts how often the different visual words appear in the image. Similar images will have similar counts for the same visual words, so this vector effectively represents your images. You will also need to "normalize" this vector so that images with different numbers of SIFT descriptors (and hence total counts) are comparable. This can be as simple as dividing each entry by the total count in the vector, or a more sophisticated measure such as tf-idf as described in the book.
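A minimal sketch of this codebook-and-counting procedure using scikit-learn's KMeans (the value of K and the simple count normalization are just illustrative):

    import numpy as np
    from sklearn.cluster import KMeans

    def build_codebook(all_sift_descriptors, k=1000):
        """Cluster SIFT descriptors from many images; the centroids are the visual words."""
        return KMeans(n_clusters=k, n_init=4).fit(np.vstack(all_sift_descriptors))

    def bow_vector(image_descriptors, codebook):
        """Count how often each visual word appears in one image, then normalize."""
        k = codebook.n_clusters
        words = codebook.predict(image_descriptors)  # nearest visual word per descriptor
        counts = np.bincount(words, minlength=k).astype(np.float64)
        return counts / counts.sum()                 # simple normalization by total count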
I believe the author also provides Python code on his website to accompany the book. Take a look at it or experiment with it if you are unsure.
A more sophisticated method for combining features is Multiple Kernel Learning (MKL). In this case, you compute a different kernel matrix for each feature. You then find the optimal weights to combine the kernel matrices and use the combined kernel matrix to train an SVM. You can find code for this in the Shogun Machine Learning Library.
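To illustrate the idea only (this is not a real MKL implementation, which would also learn the weights), here is a sketch that combines two precomputed kernel matrices with fixed weights and trains an SVM on the result using scikit-learn; the kernel choices and the data are placeholders:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.metrics.pairwise import rbf_kernel, chi2_kernel

    # Placeholder data: one feature matrix per feature type, plus class labels
    X_bow = np.random.rand(60, 128)    # e.g. BOW vectors
    X_hist = np.random.rand(60, 256)   # e.g. color histograms
    y = np.random.randint(0, 3, 60)

    # One kernel matrix per feature; the kernel choices here are arbitrary examples
    K_bow = chi2_kernel(X_bow)
    K_hist = rbf_kernel(X_hist)

    # Fixed combination weights; proper MKL would learn these instead
    K = 0.5 * K_bow + 0.5 * K_hist

    clf = SVC(kernel="precomputed").fit(K, y)
    print(clf.predict(K[:5]))  # predict on the first training samples just to show the call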
I am working with SVM-light. I would like to use SVM-light to train a classifier for object detection. I figured out the syntax to start training:
svm_learn example2/train_induction.dat example2/model
My problem: how can I build the "train_induction.dat" file from a set of positive and negative pictures?
There are two parts to this question:
What feature representation should I use for object detection in images with SVMs?
How do I create an SVM-light data file with (whatever feature representation)?
For an intro to the first question, see Wikipedia's outline. Bag of words models based on SIFT or sometimes SURF or HOG features are fairly standard.
For the second, it depends a lot on what language / libraries you want to use. The features can be extracted from the images using something like OpenCV, vlfeat, or many others. You can then convert those features to the SVM-light format as described on the SVM-light homepage (no anchors on that page; search for "The input file").
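For instance, assuming Python and one fixed-length feature vector per image (e.g. a bag-of-words vector), writing the SVM-light file can be as simple as the sketch below; the format is one line per example, with the target (+1 or -1) followed by 1-based index:value pairs.

    def write_svmlight(filename, feature_vectors, labels):
        """Write features in SVM-light format: '<label> <index>:<value> ...' per line."""
        with open(filename, "w") as f:
            for vec, label in zip(feature_vectors, labels):
                # label is +1 for positive images, -1 for negative; indices are 1-based
                pairs = " ".join("%d:%g" % (i + 1, v) for i, v in enumerate(vec) if v != 0)
                f.write("%d %s\n" % (label, pairs))

    # usage: write_svmlight("example2/train_induction.dat", vectors, labels)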
If you update with what language and library you want to use, we can give more specific advice.