I am working on developing a object classifier by using 3 different features i.e SIFT, HISTOGRAM and EGDE.
However these 3 features have different dimensional vector e.g. SIFT = 128 dimension. HIST = 256.
Now these features cannot be concatenated into once vector due to different sizes. What I am planning to do but I am not sure if it is going to be correct way is this:
For each features i train the classifier separately and than i apply classification separately for 3 different features and than count the majority and finally declare the image with majority votes.
Do you think this is a correct way?
There are several ways to get classification results that take into account multiple features. What you have suggested is one possibility where instead of combining features you train multiple classifiers and through some protocol, arrive at a consensus between them. This is typically under the field of ensemble methods. Try googling for boosting, random forests for more details on how to combine the classifiers.
However, it is not true that your feature vectors cannot be concatenated because they have different dimensions. You can still concatenate the features together into a huge vector. E.g., joining your SIFT and HIST features together will give you a vector of 384 dimensions. Depending on the classifier you use, you will likely have to normalize the entries of the vector so that no one feature dominate simply because by construction it has larger values.
EDIT in response to your comment:
It appears that your histogram is some feature vector describing a characteristic of the entire object (e.g. color) whereas your SIFT descriptors are extracted at local interest keypoints of that object. Since the number of SIFT descriptors may vary from image to image, you cannot pass them directly to a typical classifier as they often take in one feature vector per sample you wish to classify. In such cases, you will have to build a codebook (also called visual dictionary) using the SIFT descriptors you have extracted from many images. You will then use this codebook to help you derive a SINGLE feature vector from the many SIFT descriptors you extract from each image. This is what is known as a "bag of visual words (BOW)" model. Now that you have a single vector that "summarizes" the SIFT descriptors, you can concatenate that with your histogram to form a bigger vector. This single vector now summarizes the ENTIRE image/(object in the image).
For details on how to build the bag of words codebook and how to use it to derive a single feature vector from the many SIFT descriptors extracted from each image, look at this book (free for download from author's website) http://programmingcomputervision.com/ under the chapter "Searching Images". It is actually a lot simpler than it sounds.
Roughly, just run KMeans to cluster the SIFT descriptors from many images and take their centroids (which is a vector called a "visual word") as the codebook. E.g. for K = 1000 you have a 1000 visual word codebook. Then, for each image, create a result vector the same size as K (in this case 1000). Each element of this vector corresponds to a visual word. Then, for each SIFT descriptor extracted from an image, find its closest matching vector in the codebook and increment the count in the corresponding cell in the result vector. When you are done, this result vector essentially counts how often the different visual words appear in the image. Similar images will have similar counts for the same visual words and hence this vector effectively represents your images. You will also need to "normalize" this vector to make sure that images with different number of SIFT descriptors (and hence total counts) are comparable. This can be as simple as simply dividing each entry by the total count in the vector or through a more sophisticated measure such as tf/idf as described in the book.
I believe the author also provide python code on his website to accompany the book. Take a look or experiment with them if you are unsure.
More sophisticated method for combining features include Multiple Kernel Learning (MKL). In this case, you compute different kernel matrices, each using one feature. You then find the optimal weights to combine the kernel matrices and use the combined kernel matrix to train a SVM. You can find the code for this in the Shogun Machine Learning Library.
Related
I am new to image processing, and I want to extract image features in order to do some classification. I am having problems understanding the pipeline.
As far as I understand, I have a images and I run the SIFT algorithm on them. This gives me a set of descriptors for each images, the number varies, with fixed length of 128.
I then proceed to cluster them, since it is not possible to apply algorithms on varying number of features. For this, I stack up all the descriptors of all images and I run the k means algorithm with the desired number of clusters. What I get are k number of features of length 128.
Here is where I am confused, so I now have these new descriptors right, what do I do with them? I don't understand how I can plug them into a classifier if they represent all images? Should each images have their own separate features to be fed into a classifier?
I am sure I did not understand the concept, but can anybody please clarify what happens after I get a k*128 sized matrix? What is fed into for example an SVM classifier and how? How does this k means result suffice to train a classifier?
Thanks!
EDIT: I might have confused keypoints and descriptors, sorry new to image processing!
You should look into the image classification/image retrieval approach known as 'bag of visual words' - it is extremely relevant. A bag of visual words is a fixed-length feature vector v which summarises the occurrences of the features in an image. This makes use of what is called a codebook (also called a dictionary from historical uses in text retrieval), which in your case is built from your K-means clustering. To make v for a given image, the simplest approach is to assign v[j] the proportion of SIFT descriptors that are closest to the jth cluster centroid. This means the length of V is K, so it is independent of the number of SIFT features that are detected in the image.
Concretely, suppose you've done K means clustering with K = 100. Let's use ci to denote the ith cluster centre. For SIFT, this would be a vector of size 128. Now, for a given input image, you make this vector v, which is of size 100 and initialized with zeros. You then extract features from the image, and their corresponding descriptors. Let's say there are N descriptors, and we will call them d0, d2,...,d(N-1), where dj is the jth descriptor. For each dj you compute the vector distance between it and the cluster centres c0, c2,...c99. You then take the cluster index k with the lowest distance to dj, and increment: v[k]+=1. Note that this process can be parallelised very well particularly on GPUs. Also it can be faster to replace this process using what is known as Approximate Nearest Neighbours, using e.g. the FLANN library.
I sort of understand what features are, say a ML algorithm that learns SPAM, certain keywords could be a feature?
But in the famous MNIST digits data set, I see a matrix of numbers, is the entire matrix one single feature? Or is a feature each number in the matrix?
In my opinion, you are lacking some critical literature review.
Here are some good papers about RNN and CNN that can be used for image recognition appications :
https://pdfs.semanticscholar.org/86ef/e7769f2b8a0e15ca213ab09881e6705caeb0.pdf
https://arxiv.org/pdf/1506.00019.pdf
What is a feature? A feature represents one of the elements of the input vector which will be used to train the model and produce output.
The feature set is to be determined depending on the application.
Each element of the input vector is a different (dependent or independent) feature.
Look at this tutorial for example using the MNIST digit data set:
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/recurrent_network.py
It says:
'''
To classify images using a recurrent neural network, we consider every image
row as a sequence of pixels. Because MNIST image shape is 28*28px, we will then
handle 28 sequences of 28 steps for every sample.
'''
The RNN is built on sequences, hence if the image is 28 by 28 you can break it in 28 sequences of 28 features.
# Network Parameters
num_input = 28 # MNIST data input (img shape: 28*28)
timesteps = 28 # timesteps
This is what you see in the network parameters. The 28 features (num_input = 28) representing one sequence of the image.
To repeat again, each element of the input vector is considered a feature. Furthermore, is the analyst's responsibility to properly define these features.
Technically, a feature is a numerical value which discriminatively represents (or attempts to discriminatively represent) input or some part(s) of input. In case of MNIST, where image size is 28 x 28, the entire image matrix is flattened (generally row-wise) into a 1D feature vector, each element of this feature vector is a feature (in this case, simply image intensity). The type or kind of feature which one wants to use is completely problem specific. For e.g., instead of flattening the entire MNIST digit image, you could have used number of white pixels as your feature; however, it boils down to how discriminative such a feature could be for the given problem.
In case of spam classification, generally the features are frequency of words (there are several other things involved, such as stop word elimination, stemming, etc.).
One can off-course select or design multiple features for a given problem, such as stroke length, curvature, number of edges, etc. which you mentioned in the comment above. However, the main idea is that features should be discriminative enough for all the classes and they should not be derived from each other (this point leads us to another problem called feature or dimensionality reduction). I suggest you to read this Wikipedia page here and then go on to read an academic presentation on feature extraction and dimensionality reduction, such as this (this one is specific to images). This would help you to understand the overall idea.
An additional note, the features are combined into a compact representation called a feature vector. In this particular case, as mentioned before, you have a 1-D feature vector, which contains image intensities as a features.
I am building a software to classify cells from images taken by a microscope.
I have a dataset of images of cells to use as training dataset - I have extracted Keypoints from each image using ORB - Here is my problem - some image produce a lot of keypoints and some small number of keypoints. Thus the descriptor vectors produced are of different lentgh. So when i try to build a training matrix from them i have to 'Normalize' the number of Keypoints chosen from each Image so that the length of the descriptor vectors will be the same.
How many key points should i pick and which? how to pick the 'Best' Keypoints? (this question also rises when i want to preform a prediction on an object i want to classify) are there known approaches to this problem?
Regards.
You could use bag of words approach to classify your images. You first have to collect all keypoint descriptors and cluster them into a certain number of groups. Each group (cluster) is your word. A descriptor corresponds to a word. For each image now, you can build a histogram by counting the occurrence of words. You can then normalize the histogram to remove the effect of varying number of keypoints in different images.
Using spatial pyramid matching could be another solution.
The simplest approach, as described by Ajay, is to cluster keypoints into N clusters and then define N binary features, such that for a given sample, feature i equals 1 if the sample shows a keypoint in cluster i, and 0 otherwise.
Another approach is to use a kernel classifier, like Support Vector Machines (SVM), and use a kernel that accepts variable-length vectors (e.g. Fisher kernel).
i have some doubts incase of bag of words based image classification, i will first of tell what i have done
i have extracted the features from the training image with two different categories using SURF method,
i have then made clustering of the features for the two categories.
in order to classify my test image (i.e) to which of the two category the test image belongs to. for this classifying purpose i am using SVM classifier, but here is what i have a doubt , how do we input the test image do we have to do the same step from 1 to 2 again and then use it as a test set or is there any other method to do,
also would be great to know the efficiency of the bow approach,
kindly some one provide me with an clarification
The classifier needs the representation for the test data to have the same meaning as the training data. So, when you're evaluating a test image, you extract the features and then make the histogram of which words from your original vocabulary they're closest to.
That is:
Extract features from your entire training set.
Cluster those features into a vocabulary V; you get K distinct cluster centers.
Encode each training image as a histogram of the number of times each vocabulary element shows up in the image. Each image is then represented by a length-K vector.
Train the classifier.
When given a test image, extract the features. Now represent the test image as a histogram of the number of times each cluster center from V was closest to a feature in the test image. This is a length K vector again.
It's also often helpful to discount the histograms by taking the square root of the entries. This approximates a more realistic model for image features.
I have a large set of plant images labeled with the botanical name. What would be the best algorithm to use to train on this dataset in order to classify an unlabel photo? The photos are processed so that 100% of the pixels contain the plant (e.g. either closeups of the leaves or bark), so there are no other objects/empty-space/background that the algorithm would have to filter out.
I've already tried generating SIFT features for all the photos and feeding these (feature,label) pairs to a LibLinear SVM, but the accuracy was a miserable 6%.
I also tried feeding this same data to a few Weka classifiers. The accuracy was a little better (25% with Logistic, 18% with IBk), but Weka's not designed for scalability (it loads everything into memory). Since the SIFT feature dataset is a several million rows, I could only test Weka with a random 3% slice, so it's probably not representative.
EDIT: Some sample images:
Normally, you would not train on the SIFT features directly. Cluster them (using k-means) and then train on the histogram of cluster membership identifiers (i.e., a k-dimensional vector, which counts, at position i, how many features were assigned to the i-th cluster).
This way, you obtain a single output per image (and a single, k-dimensional, feature vector).
Here's the quasi-code (using mahotas and milk in Pythonn):
from mahotas.surf import surf
from milk.unsupervised.kmeans import kmeans,assign_centroids
import milk
# First load your data:
images = ...
labels = ...
local_features = [surfs(im, 6, 4, 2) for im in imgs]
allfeatures = np.concatenate(local_features)
_, centroids = kmeans(allfeatures, k=100)
histograms = []
for ls in local_features:
hist = assign_centroids(ls, centroids, histogram=True)
histograms.append(hist)
cmatrix, _ = milk.nfoldcrossvalidation(histograms, labels)
print "Accuracy:", (100*cmatrix.trace())/cmatrix.sum()
This is a fairly hard problem.
You can give BoW model a try.
Basically, you extract SIFT features on all the images, then use K-means to cluster the features into visual words. After that, use the BoW vector to train you classifiers.
See the Wikipedia article above and the references papers in that for more details.
You probably need better alignment, and probably not more features. There is no way you can get acceptable performance unless you have correspondences. You need to know what points in one leaf correspond to points on another leaf. This is one of the "holy grail" problems in computer vision.
People have used shape context for this problem. You should probably look at this link. This paper describes the basic system behind leafsnap.
You can implement the BoW model according to this Bag-of-Features Descriptor on SIFT Features with OpenCV. It is a very good tutorial to implement the BoW model in OpenCV.