I am building software to classify cells from images taken by a microscope.
I have a dataset of cell images to use for training, and I have extracted keypoints from each image using ORB. Here is my problem: some images produce many keypoints and others only a few, so the sets of descriptors have different sizes. To build a training matrix from them, I have to 'normalize' the number of keypoints chosen from each image so that every image yields a descriptor vector of the same length.
How many keypoints should I pick, and which ones? How do I pick the 'best' keypoints? (The same question arises when I want to run a prediction on a new object I want to classify.) Are there known approaches to this problem?
Regards.
You could use a bag-of-words approach to classify your images. You first collect all keypoint descriptors and cluster them into a certain number of groups. Each group (cluster) is a word, and each descriptor corresponds to the word of the cluster it falls into. For each image, you can then build a histogram by counting the occurrences of each word. Finally, you can normalize the histogram to remove the effect of the varying number of keypoints across images.
Using spatial pyramid matching could be another solution.
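As a rough illustration of the spatial pyramid idea, here is a minimal sketch. It assumes every keypoint has already been assigned a visual-word index, and the function name `spatial_pyramid_histogram` and its parameters are made up for this example. It only concatenates word histograms over progressively finer grids and leaves out the per-level weighting used in spatial pyramid matching proper.

```python
import numpy as np

def spatial_pyramid_histogram(xy, words, image_size, n_words, levels=2):
    """Concatenate per-cell visual-word histograms over a pyramid of grids.

    xy         : (N, 2) array of keypoint (x, y) positions
    words      : (N,) array of visual-word indices in [0, n_words)
    image_size : (width, height) of the image
    levels     : level l uses a 2**l x 2**l grid (level 0 is the whole image)
    """
    xy = np.asarray(xy, dtype=float).reshape(-1, 2)
    words = np.asarray(words, dtype=int)
    width, height = image_size
    parts = []
    for level in range(levels + 1):
        cells = 2 ** level
        # Grid cell of every keypoint; clip so points on the far edge stay inside.
        cx = np.clip((xy[:, 0] / width * cells).astype(int), 0, cells - 1)
        cy = np.clip((xy[:, 1] / height * cells).astype(int), 0, cells - 1)
        # One flat bin per (cell, word) pair.
        flat = (cy * cells + cx) * n_words + words
        parts.append(np.bincount(flat, minlength=cells * cells * n_words))
    hist = np.concatenate(parts).astype(float)
    # Divide by the keypoint count so images with many keypoints do not dominate.
    return hist / max(len(words), 1)
```

The result has a fixed length of n_words * (1 + 4 + 16 + ...) regardless of how many keypoints the image produced, while still recording roughly where in the image each word occurred.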
The simplest approach, as described by Ajay, is to cluster keypoints into N clusters and then define N binary features, such that for a given sample, feature i equals 1 if the sample shows a keypoint in cluster i, and 0 otherwise.
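A minimal sketch of those binary features, assuming the cluster centroids are already available as rows of an array. The name `presence_features` is hypothetical, and Euclidean distance is used for simplicity (for binary descriptors such as ORB, Hamming distance would be the more natural choice).

```python
import numpy as np

def presence_features(descriptors, centroids):
    """Return a 0/1 vector: entry i is 1 if any descriptor falls in cluster i."""
    descriptors = np.asarray(descriptors, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    features = np.zeros(len(centroids), dtype=int)
    if len(descriptors) == 0:
        return features
    # Squared Euclidean distance from every descriptor to every centroid.
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Mark every cluster that is the nearest one for at least one descriptor.
    features[np.unique(d2.argmin(axis=1))] = 1
    return features
```

The output always has one entry per cluster, so images with 10 keypoints and images with 1000 keypoints end up with vectors of the same length.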
Another approach is to use a kernel classifier, like Support Vector Machines (SVM), and use a kernel that accepts variable-length vectors (e.g. Fisher kernel).
Related
I have to calculate the similarity between two images, and I was advised to use feature embeddings of the images extracted by autoencoders rather than features extracted by a CNN.
What exactly is the difference? Why can feature embeddings be used to calculate similarity, but not image features extracted by a CNN?
I have a high-level idea of image features: they are the output of a single forward pass through a pre-trained network, taken at the (N-1)th layer rather than the prediction layer (softmax or sigmoid).
I also know that a word embedding projects a given word into a more convenient feature space.
But what is the intuition behind embeddings for images?
When should one be used over the other?
I am new to image processing, and I want to extract image features in order to do some classification. I am having problems understanding the pipeline.
As far as I understand, I have a set of images and I run the SIFT algorithm on them. This gives me a set of descriptors for each image; their number varies, but each has a fixed length of 128.
I then proceed to cluster them, since it is not possible to apply algorithms to a varying number of features. For this, I stack the descriptors of all images and run the k-means algorithm with the desired number of clusters. What I get are k vectors of length 128.
Here is where I am confused: I now have these new descriptors, but what do I do with them? I don't understand how I can plug them into a classifier if they represent all images. Shouldn't each image have its own separate features to be fed into a classifier?
I am sure I did not understand the concept, but can anybody please clarify what happens after I get a k*128 matrix? What is fed into, for example, an SVM classifier, and how? How does this k-means result suffice to train a classifier?
Thanks!
EDIT: I might have confused keypoints and descriptors, sorry new to image processing!
You should look into the image classification/image retrieval approach known as 'bag of visual words'; it is extremely relevant. A bag of visual words is a fixed-length feature vector v which summarises the occurrences of the features in an image. This makes use of what is called a codebook (also called a dictionary, from its historical use in text retrieval), which in your case is built from your K-means clustering. To make v for a given image, the simplest approach is to assign v[j] the proportion of SIFT descriptors that are closest to the jth cluster centroid. This means the length of v is K, so it is independent of the number of SIFT features detected in the image.
Concretely, suppose you've done K-means clustering with K = 100. Let's use ci to denote the ith cluster centre. For SIFT, this would be a vector of size 128. Now, for a given input image, you make this vector v, which is of size 100 and initialized with zeros. You then extract features from the image, and their corresponding descriptors. Let's say there are N descriptors, and we will call them d0, d1, ..., d(N-1), where dj is the jth descriptor. For each dj you compute the vector distance between it and the cluster centres c0, c1, ..., c99. You then take the cluster index k with the lowest distance to dj, and increment: v[k] += 1. Dividing v by N at the end turns these counts into proportions. Note that this process can be parallelised very well, particularly on GPUs. It can also be faster to replace the exhaustive search with approximate nearest neighbours, using e.g. the FLANN library.
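The assignment loop can be sketched in vectorised form as follows. This is only an illustration: the name `bow_histogram` is made up, brute-force Euclidean distance is assumed, and no approximate nearest-neighbour search is used.

```python
import numpy as np

def bow_histogram(descriptors, centres):
    """Count, for each cluster centre, how many descriptors are nearest to it."""
    centres = np.asarray(centres, dtype=float)                        # shape (K, D)
    descriptors = np.asarray(descriptors, dtype=float).reshape(-1, centres.shape[1])
    # ||d - c||^2 = ||d||^2 - 2 d.c + ||c||^2, computed for all pairs at once.
    d2 = (
        (descriptors ** 2).sum(axis=1)[:, None]
        - 2.0 * descriptors @ centres.T
        + (centres ** 2).sum(axis=1)[None, :]
    )
    nearest = d2.argmin(axis=1)               # index k of the closest centre for each d_j
    # bincount performs all the v[k] += 1 increments in one call.
    return np.bincount(nearest, minlength=len(centres)).astype(float)
```

The returned vector holds raw counts; dividing it by the number of descriptors gives the proportions, so the result no longer depends on how many features were detected.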
I want to analyze the similarity of two images.
A conventional way to do this is:
1. Detect features (keypoints) in both images.
2. Compute a descriptor for every keypoint.
3. Compute a match using these two sets of descriptors.
However, in my case I already have the matched point sets from the two images.
So I think I can proceed directly to the second step, descriptor computation.
Is that reasonable and possible?
Yes, it's possible to compute descriptors for any point, whether it is a keypoint or not. After all, a descriptor is just a representation of the intensities in a patch. For computing similarity, you could use bag of words.
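To illustrate the "intensities in a patch" point, here is a toy descriptor computed at arbitrary pixel locations. It is deliberately not SIFT, SURF or ORB: `patch_descriptors` and `half_size` are hypothetical names, and the descriptor is just the mean-subtracted, unit-length patch around each point. With a real library you would more likely hand your own points to its descriptor extractor.

```python
import numpy as np

def patch_descriptors(image, points, half_size=4):
    """Describe each (x, y) point by the normalized intensities of the patch around it."""
    image = np.asarray(image, dtype=float)
    # Pad by reflection so patches near the border are still full size.
    padded = np.pad(image, half_size, mode="reflect")
    side = 2 * half_size + 1
    out = []
    for x, y in points:
        x, y = int(round(x)), int(round(y))
        # Pixel (x, y) of the original sits at (x + half_size, y + half_size) after padding,
        # so the patch centred on it starts at row y and column x of the padded image.
        patch = padded[y:y + side, x:x + side].ravel()
        patch = patch - patch.mean()                      # remove brightness offset
        norm = np.linalg.norm(patch)
        out.append(patch / norm if norm > 0 else patch)   # remove contrast scale
    return np.array(out)
```

Because the points are supplied by the caller, nothing here depends on a keypoint detector; matched point sets from any source can be described this way and then compared.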
A better way to compute image similarity is to compare mid-level deep features, which can be easily computed using Caffe (caffe.berkeleyvision.org).
I am working on developing an object classifier using 3 different features, i.e. SIFT, HISTOGRAM and EDGE.
However, these 3 features have vectors of different dimensions, e.g. SIFT = 128 dimensions, HIST = 256.
These features cannot be concatenated into one vector because of their different sizes. Here is what I am planning to do, though I am not sure it is the correct way:
For each feature I train a classifier separately, then I apply the 3 classifiers separately, count the votes, and finally assign the image the label with the majority of votes.
Do you think this is a correct way?
There are several ways to get classification results that take into account multiple features. What you have suggested is one possibility where instead of combining features you train multiple classifiers and through some protocol, arrive at a consensus between them. This is typically under the field of ensemble methods. Try googling for boosting, random forests for more details on how to combine the classifiers.
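The voting protocol from the question can be sketched in a few lines. This assumes each per-feature classifier has already produced a label; `majority_vote` is an illustrative name, and breaking ties in favour of the earliest classifier is an arbitrary choice.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label; ties go to the label that appears first."""
    counts = Counter(predictions)
    best = max(counts.values())
    # Walk the predictions in order so that ties are resolved deterministically.
    for label in predictions:
        if counts[label] == best:
            return label
```

For example, `majority_vote(["cat", "dog", "cat"])` returns `"cat"`. With three classifiers and more than two classes, all three can disagree, so some tie-breaking rule is needed.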
However, it is not true that your feature vectors cannot be concatenated because they have different dimensions. You can still concatenate the features into one large vector. E.g., joining your SIFT and HIST features will give you a vector of 384 dimensions. Depending on the classifier you use, you will likely have to normalize the entries of the vector so that no single feature dominates simply because, by construction, it has larger values.
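One possible way to do that normalization is to scale each feature block to unit length before joining them. This is just a sketch: `concat_normalized` is a made-up name, and per-block L2 scaling is one option among several (standardizing each dimension over the training set is another common one).

```python
import numpy as np

def concat_normalized(*blocks, eps=1e-12):
    """L2-normalize each feature block, then join them into one vector."""
    scaled = []
    for block in blocks:
        block = np.asarray(block, dtype=float).ravel()
        # eps guards against division by zero for an all-zero block.
        scaled.append(block / (np.linalg.norm(block) + eps))
    return np.concatenate(scaled)
```

With a 128-dimensional block and a 256-dimensional block this returns a 384-dimensional vector in which both blocks have the same overall magnitude, whatever their original value ranges were.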
EDIT in response to your comment:
It appears that your histogram is some feature vector describing a characteristic of the entire object (e.g. color) whereas your SIFT descriptors are extracted at local interest keypoints of that object. Since the number of SIFT descriptors may vary from image to image, you cannot pass them directly to a typical classifier as they often take in one feature vector per sample you wish to classify. In such cases, you will have to build a codebook (also called visual dictionary) using the SIFT descriptors you have extracted from many images. You will then use this codebook to help you derive a SINGLE feature vector from the many SIFT descriptors you extract from each image. This is what is known as a "bag of visual words (BOW)" model. Now that you have a single vector that "summarizes" the SIFT descriptors, you can concatenate that with your histogram to form a bigger vector. This single vector now summarizes the ENTIRE image/(object in the image).
For details on how to build the bag of words codebook and how to use it to derive a single feature vector from the many SIFT descriptors extracted from each image, look at this book (free for download from author's website) http://programmingcomputervision.com/ under the chapter "Searching Images". It is actually a lot simpler than it sounds.
Roughly, just run KMeans to cluster the SIFT descriptors from many images and take the centroids (each of which is a vector called a "visual word") as the codebook. E.g. for K = 1000 you have a 1000-visual-word codebook. Then, for each image, create a result vector of length K (in this case 1000). Each element of this vector corresponds to a visual word. Then, for each SIFT descriptor extracted from an image, find its closest matching vector in the codebook and increment the count in the corresponding cell of the result vector. When you are done, this result vector essentially counts how often the different visual words appear in the image. Similar images will have similar counts for the same visual words, and hence this vector effectively represents your images. You will also need to "normalize" this vector to make sure that images with different numbers of SIFT descriptors (and hence total counts) are comparable. This can be as simple as dividing each entry by the total count in the vector, or it can use a more sophisticated measure such as tf/idf, as described in the book.
I believe the author also provides Python code on his website to accompany the book. Take a look or experiment with it if you are unsure.
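For the normalization step, here is a small sketch of one common tf-idf variant applied to a matrix of visual-word counts (one row per image). It is not taken from the book, which may define the weights slightly differently, and `tfidf_weight` is a hypothetical name.

```python
import numpy as np

def tfidf_weight(counts):
    """Weight a (n_images, K) matrix of visual-word counts by tf-idf.

    tf  = word count / total count in the image
    idf = log(n_images / number of images containing the word)
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    tf = counts / np.maximum(totals, 1.0)          # rows sum to 1 (or stay all-zero)
    doc_freq = (counts > 0).sum(axis=0)            # images in which each word occurs
    idf = np.log(len(counts) / np.maximum(doc_freq, 1))
    return tf * idf
```

The tf part is the simple "divide by the total count" normalization; the idf part additionally down-weights visual words that occur in almost every image, and a word present in all images gets weight zero under this variant.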
More sophisticated methods for combining features include Multiple Kernel Learning (MKL). In this case, you compute different kernel matrices, each using one feature. You then find the optimal weights to combine the kernel matrices and use the combined kernel matrix to train an SVM. You can find the code for this in the Shogun Machine Learning Library.
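To make the "combine the kernel matrices" step concrete, here is a sketch that builds one RBF kernel per feature type and mixes them with fixed weights. Note the simplification: real MKL learns the weights, whereas here they are chosen by hand. The function names and the choice of RBF kernels are assumptions for illustration, and this is not Shogun's API.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    X = np.asarray(X, dtype=float)
    sq = (X ** 2).sum(axis=1)
    # Clamp tiny negative values caused by floating-point round-off.
    d2 = np.maximum(sq[:, None] - 2.0 * X @ X.T + sq[None, :], 0.0)
    return np.exp(-gamma * d2)

def combine_kernels(kernels, weights):
    """Weighted sum of kernel matrices; weights are non-negative and rescaled to sum to 1."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("kernel weights must be non-negative")
    weights = weights / weights.sum()
    return sum(w * K for w, K in zip(weights, kernels))
```

A non-negative weighted sum of valid kernels is again a valid kernel, so the combined matrix can be passed to any SVM implementation that accepts a precomputed kernel. Each feature type keeps its own dimensionality, since only the n_samples x n_samples matrices are mixed.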
I have some doubts about bag-of-words-based image classification. First, here is what I have done:
1. I have extracted features from training images of two different categories using the SURF method.
2. I have then clustered the features for the two categories.
I now want to classify a test image, i.e. decide which of the two categories it belongs to. For this I am using an SVM classifier, but here is my doubt: how do we input the test image? Do we have to repeat steps 1 and 2 and then use the result as a test set, or is there another method?
It would also be great to know how efficient the BoW approach is.
Could someone kindly clarify?
The classifier needs the representation for the test data to have the same meaning as the training data. So, when you're evaluating a test image, you extract the features and then make the histogram of which words from your original vocabulary they're closest to.
That is:
1. Extract features from your entire training set.
2. Cluster those features into a vocabulary V; you get K distinct cluster centers.
3. Encode each training image as a histogram of the number of times each vocabulary element shows up in the image. Each image is then represented by a length-K vector.
4. Train the classifier.
5. When given a test image, extract the features. Now represent the test image as a histogram of the number of times each cluster center from V was closest to a feature in the test image. This is a length-K vector again.
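The key point, that the test image is encoded with the vocabulary learned from the training set and is never clustered again, can be sketched like this. Everything here is illustrative: random arrays stand in for real SURF descriptors, the k-means is a bare-bones Lloyd's loop, and the classifier training itself is left out.

```python
import numpy as np

def nearest_word(descriptors, vocabulary):
    """Index of the closest vocabulary entry for every descriptor."""
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def build_vocabulary(descriptors, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means over all training descriptors; returns (k, D) centres."""
    rng = np.random.default_rng(seed)
    centres = descriptors[rng.choice(len(descriptors), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = nearest_word(descriptors, centres)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):                  # leave an empty cluster where it is
                centres[j] = members.mean(axis=0)
    return centres

def encode(descriptors, vocabulary):
    """Length-K histogram of word occurrences for one image."""
    words = nearest_word(descriptors, vocabulary)
    return np.bincount(words, minlength=len(vocabulary)).astype(float)

# Synthetic stand-ins for real descriptors: each "image" has a different number of them.
rng = np.random.default_rng(1)
train_images = [rng.normal(size=(n, 8)) for n in (30, 55, 12)]
test_image = rng.normal(size=(41, 8))

# Cluster the training descriptors only.
vocabulary = build_vocabulary(np.vstack(train_images), k=5)
# Encode every training image against that vocabulary.
X_train = np.array([encode(d, vocabulary) for d in train_images])
# Encode the test image against the SAME vocabulary; no new clustering.
x_test = encode(test_image, vocabulary)
```

`X_train` has one length-5 row per training image and `x_test` is a length-5 vector as well, so both can go to the same classifier even though the images produced 30, 55, 12 and 41 descriptors.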
It's also often helpful to discount the histograms by taking the square root of the entries. This approximates a more realistic model for image features.
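A short sketch of that square-root step, assuming non-negative histogram entries. The name `sqrt_histogram` is made up, and the L1 normalization before the square root is an added assumption that makes the result unit-length.

```python
import numpy as np

def sqrt_histogram(hist):
    """L1-normalize a non-negative histogram, then take element-wise square roots."""
    hist = np.asarray(hist, dtype=float)
    total = hist.sum()
    if total > 0:
        hist = hist / total
    return np.sqrt(hist)
```

After this mapping a word that occurs 16 times weighs only 4 times as much as a word that occurs once, so a few very frequent words dominate less. The dot product of two such vectors is the Bhattacharyya coefficient of the original normalized histograms.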