I am trying to use SIFT for object classification with a Normal Bayes Classifier. When I compute the descriptors for images of variable size, I get differently sized feature vectors. E.g.:
Feature Size: [128 x 39]
Feature Size: [128 x 54]
Feature Size: [128 x 69]
Feature Size: [128 x 64]
Feature Size: [128 x 14]
For development, I am using 20 training images and therefore have 20 labels. My classification involves only 3 classes: car, book, and ball. So my label vector size is [1 x 20].
As far as I understand, to perform machine learning the number of feature vectors and the number of labels should be the same, so my training data should have size [__ x 20] and the labels [1 x 20].
But my problem is that SIFT has a 128-dimensional feature space and each image has a different number of features, as shown above. How do I convert them all to the same size without losing features?
Or perhaps I am doing this incorrectly, so please help me out here.
PS: I have actually done this with a BOW model and it works, but for my own learning I am trying to do it this way out of interest, so any hints and advice are welcome. Thank you.
You are right, a SIFT descriptor is a 128-dimensional feature.
A SIFT descriptor is computed for every key-point detected in the image. Before computing descriptors, you probably used a detector (such as the Harris, SIFT, or SURF detector) to detect points of interest.
Detecting key-points and computing descriptors are two independent steps!
When your program prints Feature Size: [128 x Y], Y is the number of key-points detected in the current image.
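As a minimal sketch of those two steps with the OpenCV Python bindings (assuming a build that ships SIFT; the file name is a hypothetical placeholder):

```python
import cv2

# Minimal sketch: detection and description are two separate calls, and the
# number of descriptor rows equals the number of detected key-points.
img = cv2.imread("car_01.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical file name
sift = cv2.SIFT_create()

keypoints = sift.detect(img, None)                 # step 1: find key-points
keypoints, desc = sift.compute(img, keypoints)     # step 2: 128-dim descriptor per key-point

print(len(keypoints), desc.shape)                  # e.g. 39 and (39, 128)
```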
Generally, using BOW allows you to assign to each key-point descriptor the index of the closest cluster in the BOW vocabulary. Depending on your application, you can then make a decision ... (voting on the presence of an object in the scene or ...)
If you do not want to use BOW you could try to match the individual SIFT features as described in the original SIFT paper by Lowe.
The basic idea is that you compare two images with each other and decide whether they are similar or not. You do that by comparing the individual SIFT features and deciding whether they match. Then, to check that the spatial positions are consistent, you verify that the matched features can be transformed from one image to the other.
It is described in more detail in the SIFT wikipedia article.
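As a rough illustration of that matching procedure (not the exact code from the paper; the file names are placeholders and the 0.75 ratio is just a commonly used threshold):

```python
import cv2

# Rough sketch of matching individual SIFT features between two images
# with a nearest/second-nearest distance ratio test.
sift = cv2.SIFT_create()
img1 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder file names
img2 = cv2.imread("train.jpg", cv2.IMREAD_GRAYSCALE)
kp1, d1 = sift.detectAndCompute(img1, None)
kp2, d2 = sift.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_L2)
good = []
for pair in matcher.knnMatch(d1, d2, k=2):
    if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
        good.append(pair[0])                           # keep unambiguous matches
print(len(good), "tentative matches")
# A geometric check (e.g. cv2.findHomography with RANSAC on the matched
# key-point coordinates) can then verify that the spatial layout is consistent.
```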
I am having a problem understanding the size of the HOG feature vector...
Scenario: I took a 286x286 image. Then I calculated HOG for each 8x8 patch, meaning I got 8x8x2 = 128 numbers represented by a 9-bin histogram for each patch. So can I call this 9-bin histogram a 9-dimensional vector? The total number of patches used to estimate HOG over the whole image was approximately 1225 (since the image is square, I estimated the total number of patches by squaring 286/8 ≈ 35). I iterated over the 1225 patches and calculated a 9-bin histogram for each (I did not apply 16x16 block normalization). After concatenating all the vectors together I obtained a HOG descriptor of size 1225x9 = 11,025 for the whole image.
Questions:
1. Is it right to say I obtained an 11,025-dimensional HOG vector for the given image?
2. Am I going in the right direction (if I opt for classification via a neural network)?
3. Can this concatenated HOG feature be fed directly to PCA for dimensionality reduction, or is more preprocessing needed (in general, not in advance)?
Thank you in advance!
Yes
Probably not. What are you trying to do? For example, if you are doing classification, you should use bag-of-words (actually, you should stop using HOG and try deep learning). If you are doing image retrieval/matching, you should compute HOG features for local patches.
You can almost always use PCA for dimensionality reduction for your features, even for 128 dimensional SIFT.
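As a quick sanity check of the numbers above, and of the PCA step, here is a hedged sketch using scikit-image's hog and scikit-learn's PCA; cells_per_block=(1, 1) roughly corresponds to skipping the 16x16 block normalization, and the training matrix is a random placeholder:

```python
import numpy as np
from skimage.feature import hog
from sklearn.decomposition import PCA

image = np.random.rand(286, 286)                 # stand-in for a real 286x286 image
feat = hog(image, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(1, 1))
print(feat.shape)                                # (11025,) = 35 * 35 * 9

X = np.random.rand(50, feat.size)                # placeholder: 50 HOG vectors, one per image
X_reduced = PCA(n_components=20).fit_transform(X)
print(X_reduced.shape)                           # (50, 20)
```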
So, I've seen that many of the first CNN examples in machine learning use the MNIST dataset. Each image there is 28x28, so we know the shape of the input beforehand. How would this be done for variable-size input, say you have some images that are 56x56 and some 28x28?
I'm looking for a language- and framework-agnostic answer if possible, or preferably in TensorFlow terms.
In some cases, resizing the images appropriately (for example to keep the aspect ratio) will be sufficient. But this can introduce distortion, and in case this is harmful, another solution is to use Spatial Pyramid Pooling (SPP). The problem with different image sizes is that they produce layers of different sizes; for example, taking the features of the n-th layer of some network, you can end up with a feature map of size 128*fw*fh where fw and fh vary depending on the size of the input example. What SPP does to alleviate this problem is turn this variable-size feature map into a fixed-length vector of features. It operates on different scales, by dividing the image into equal patches and performing max pooling on them. I think this paper does a great job at explaining it. An example application can be seen here.
As a quick explanation, imagine you have a feature map of size k*fw*fh. You can consider it as k maps of the form
X Y
Z T
where each of the blocks is of size fw/2 * fh/2. Now, performing max pooling on each of those blocks separately gives you a vector of size 4 per map, so you can roughly describe the k*fw*fh map as a fixed-size vector of k*4 features.
Now, call this fixed-size vector w and set it aside, and this time, consider the k*fw*fh feature map as k feature planes written as
A B C D
E F G H
I J K L
M N O P
and again, perform max pooling separately on each block. Using this, you obtain a more fine-grained representation, as a vector v of length k*16.
Now, concatenating the two vectors, u = [v; w], gives you a fixed-size representation. This is exactly what a 2-scale SPP does (well, of course you can change the number/sizes of divisions).
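For concreteness, here is a small NumPy-only sketch of this two-scale pooling (assuming a channels-first k x fh x fw feature map; the sizes below are arbitrary):

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2)):
    """Collapse a variable-size (k, fh, fw) feature map into a fixed-length
    vector by max pooling over an n x n grid of blocks at each pyramid level."""
    k, fh, fw = feature_map.shape
    pooled = []
    for n in levels:
        h_edges = np.linspace(0, fh, n + 1, dtype=int)   # block boundaries
        w_edges = np.linspace(0, fw, n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                block = feature_map[:, h_edges[i]:h_edges[i + 1],
                                       w_edges[j]:w_edges[j + 1]]
                pooled.append(block.max(axis=(1, 2)))    # one value per channel
    return np.concatenate(pooled)                         # k * sum(n*n) values

# Feature maps of different spatial sizes give vectors of the same length.
a = spatial_pyramid_pool(np.random.rand(128, 13, 17))
b = spatial_pyramid_pool(np.random.rand(128, 7, 9))
assert a.shape == b.shape == (128 * (1 + 4),)
```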
Hope this helps.
When you use a CNN for a classification task, your network has two parts:
Feature generator. This part generates a feature map of size WF x HF with CF channels from an image of size WI x HI with CI channels. The relation between image size and feature map size depends on the structure of your NN (for example, on the number of pooling layers and their strides).
Classifier. This part classifies vectors with WF*HF*CF components into classes.
You can put images of different sizes into the feature generator and get feature maps of different sizes. But the classifier can only be trained on vectors of some fixed length. Therefore you normally train your network for some fixed image size. If you have images of a different size, you resize them to the input size of the network, or crop some part of the image.
Another way is described in the article
K. He, X. Zhang, S. Ren, J. Sun, "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition," arXiv:1406.4729, 2014.
The authors proposed spatial pyramid pooling, which solves the problem of differently sized images at the input of a CNN. But I am not sure whether a spatial pyramid pooling layer exists in TensorFlow.
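A single-scale stand-in that does exist in tf.keras is global pooling, which already maps a variable-size feature map to a fixed-length vector. A hedged sketch (layer sizes and class count are arbitrary):

```python
import tensorflow as tf

# Variable-height/width input; global max pooling produces a fixed 64-dim
# vector regardless of the spatial size of the feature map.
inputs = tf.keras.Input(shape=(None, None, 3))
x = tf.keras.layers.Conv2D(32, 3, activation="relu")(inputs)
x = tf.keras.layers.Conv2D(64, 3, activation="relu")(x)
x = tf.keras.layers.GlobalMaxPooling2D()(x)              # fixed-length vector
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

model.predict(tf.random.uniform((1, 56, 56, 3)))          # works for 56x56...
model.predict(tf.random.uniform((1, 28, 28, 3)))          # ...and for 28x28
```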
What is the point of image segmentation algorithms like SLIC? Most object detection algorithms work over the entire set of (square) sub-images anyway.
The only conceivable benefit to segmenting the image is that now, the classifier has shape information available to it. Is that right?
Most classifiers I know of take rectangular input images. What classifiers allow you to pass variable sized image segments to them?
First, SLIC, and the kind of algorithms I'm guessing you refer to, are not segmentation algorithms; they are oversegmentation algorithms. There is a difference between those two terms: segmentation methods split the image into objects, while oversegmentation methods split the image into small clusters (spatially adjacent groups of pixels with similar characteristics); these clusters are usually called superpixels. See the image** of superpixels below:
Now, answering parts of your original question:
Why to use superpixels?
They reduce the dimensionality of your data and the complexity of the problem, from N = w x h pixels to M superpixels, with M << N. That is, for an image composed of N = 320x480 = 153,600 pixels, M = 400 or M = 800 superpixels seem to be enough to oversegment it. Now, leaving for later how to classify them, just consider how much easier your problem has become, going from N = 100K to M = 800 training examples to train/classify. The superpixels still represent your data properly, as they adhere to image boundaries.
They allow you to compute more complex features. With pixels, you can only compute some simple statistics on them, or use a filter bank / feature extractor to extract features in their vicinity. This, however, represents a pixel's neighbourhood very locally, without taking the context into consideration. With superpixels, you can compute a superpixel descriptor from all the pixels that belong to it. That is, features are usually computed at the pixel level as before, but they are then merged into a superpixel descriptor by different methods. Some of the methods to do that are: the simple mean of all pixels inside a superpixel, histograms, bag-of-words, correlation. As a simple example, imagine you only consider grayscale as a feature for your image/classifier. If you use plain pixels, all you have is the pixel's intensity, which is very local and noisy. If you use superpixels, you can compute a histogram of the intensities of all the pixels inside, which describes the region much better than a single local intensity value.
They allow you to compute new features. Over superpixels you can compute regional statistics (first-order such as mean or variance, or second-order such as covariance). You can now extract other information not available before (e.g. shape, length, diameter, area...).
How to use them?
Compute pixel features
Merge pixel features into superpixel descriptors
Classify/Optimize superpixel descriptors
In step 2, whether by averaging, using a histogram, or using a bag-of-words model, the superpixel descriptor has a fixed size (e.g. 100 bins for a histogram). Therefore, in the end you have reduced the X = N x K training data (N = 100K pixels times K features) to X = M x D (with M = 1K superpixels and D the length of the superpixel descriptor). Usually D > K but M << N, so you end up with regional, more robust features that represent your data better with lower dimensionality, which is great and reduces the complexity of your problem (classification, optimization) by 2-3 orders of magnitude on average!
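For illustration, here is a minimal sketch of this pipeline with scikit-image's SLIC (the sample image, the 400-segment setting, and the 32-bin grayscale histogram are arbitrary choices, not a prescription):

```python
import numpy as np
from skimage import data, color
from skimage.segmentation import slic

image = data.astronaut()                               # any RGB image
labels = slic(image, n_segments=400, compactness=10)   # superpixel label map
gray = color.rgb2gray(image)

descriptors = []
for sp in np.unique(labels):
    pixels = gray[labels == sp]                        # all pixels in this superpixel
    hist, _ = np.histogram(pixels, bins=32, range=(0, 1), density=True)
    descriptors.append(hist)                           # fixed-size (32-bin) descriptor

X = np.vstack(descriptors)                             # M superpixels x 32 features
print(X.shape)                                         # e.g. (~400, 32)
```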
Conclusions
You can compute more complex (robust) features, but you have to be careful how, when, and what for you use superpixels as your data representation. You might lose some information (e.g. you lose your 2D grid lattice), or, if you don't have enough training examples, you might make the problem more difficult since the features are more complex, and it could be that you transform a linearly separable problem into a non-linear one.
** Image from SLIC superpixels: http://ivrl.epfl.ch/research/superpixels
My problem is as follows:
I have 6 types of images, or 6 classes. For example, cat, dog, bird, etc.
For every type of image, I have many variations of that image. For example, brown cat, black dog, etc.
I'm currently using a Support Vector Machine (SVM) to classify the images using one-versus-rest classification. I'm unfolding each image into a single pixel vector and using that as the feature vector for a given image. I'm getting decent classification accuracy, but I want to try something different.
The issue is, I can only have a single feature vector per image, and I get a variable number of SURF features from the feature extraction process. For example, one picture of a cat may give me 40 SURF features, while one picture of a dog will give me 68 SURF features. I could pick the n strongest features, but I have no way of guaranteeing that the chosen SURF features are the ones that describe my image (for example, they could focus on the background). There's also no guarantee that ANY SURF features are found at all.
So, my problem is: how can I take many observations (each being a SURF feature vector) and "fold" them into a single feature vector that describes the raw image and can be fed to an SVM for training?
Thanks for your help!
Typically the SURF descriptors are quantized using a K-means dictionary and aggregated into one l1-normalized histogram. So your inputs to the SVM algorithm are now fixed in size.
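A hedged sketch of that pipeline (SIFT stands in for SURF, since SURF lives in opencv-contrib and may not be available in every build; the paths, labels, and dictionary size are placeholders for your own data and choices):

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

train_paths = ["cat_01.jpg", "dog_01.jpg"]     # placeholder: your image files
train_labels = [0, 1]                          # placeholder: your class labels
detector = cv2.SIFT_create()
K = 100                                        # visual dictionary size (a choice)

def descriptors_of(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = detector.detectAndCompute(img, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

all_desc = [descriptors_of(p) for p in train_paths]
kmeans = KMeans(n_clusters=K, n_init=10).fit(np.vstack(all_desc))

def bow_histogram(desc):
    hist = np.zeros(K)
    if len(desc):
        for w in kmeans.predict(desc):         # nearest visual word per descriptor
            hist[w] += 1
        hist /= hist.sum()                     # l1-normalize
    return hist

X = np.array([bow_histogram(d) for d in all_desc])   # one fixed-size vector per image
svm = LinearSVC().fit(X, train_labels)               # one-vs-rest by default
```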
I want to generate a SURF descriptor (a vector of length 64).
In the original paper, it says:
The region is split up regularly into smaller 4*4 square sub-regions. For each sub-region, we compute Haar wavelet responses at 5*5 regularly spaced sample points.
With OpenCV's help, I can get a keypoint and its corresponding region (using KeyPoint::size and angle), but my questions are:
1. How do I compute the Haar wavelet responses?
2. What does "at 5 x 5 regularly spaced sample points" mean?
I've read the wiki and an introduction to Haar wavelets but still have no idea how to write the code.
I know how to use the OpenCV SurfDescriptorExtractor, but I cannot use it because I need to enlarge or shrink the original region and get a new descriptor.
Thanks for any hints about how to generate the SURF descriptor.
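To make the quoted paragraph a bit more concrete, here is a hedged, simplified sketch of what a single Haar wavelet response amounts to (box sums over the two halves of a window); it deliberately ignores SURF details such as the integral image, Gaussian weighting, and rotation to the dominant orientation:

```python
import numpy as np

def haar_responses(patch, x, y, s):
    """Haar wavelet responses of side 2s centred at (x, y) of a grayscale patch:
    dx = right half minus left half, dy = bottom half minus top half."""
    win = patch[y - s:y + s, x - s:x + s].astype(float)
    dx = win[:, s:].sum() - win[:, :s].sum()
    dy = win[s:, :].sum() - win[:s, :].sum()
    return dx, dy

# "5 x 5 regularly spaced sample points" means: inside each of the 4x4 sub-regions
# of the oriented key-point region, the responses are evaluated on a regular 5x5
# grid of points; summing dx, dy, |dx|, |dy| over each sub-region gives 4 values
# per sub-region, i.e. 4 * 16 = 64 values for the whole descriptor.
```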