What is difference between Features embedding and Image Features? - machine-learning

I have to calculate the similarity between 2 images, and I was guided to use feature embeddings of images extracted by Auto-encoders rather than Features extracted by CNN.
Can I know what is exact difference why Feature embeddings & why it can be used to calculate similarity but not image features extracted by CNN?
I have a high-level idea of Image features, that it is a generated data by running a single Foward prop on a pre-trained network (N-1)th layer, not the prediction layer(softmax or sigmoid).
And I know word embedding that projecting a dimension of a given word into more convenient feature dimensional space.
But what is the intuition of embeddings in Image?
When to use one over another ?

Related

How to compute similarity score between two images using their feature vectors?

I am working on face recognition project using deep learning architecture to classify the images into respective classes. The output of network at softmax layer is the predicted class label and the output of last but one layer at the dense layer is a feature representation of the input image. Here the feature vector is a 1-D matrix of size 1000 for each image. Predicting classes is recognition type problem, but I'm interested in verification problem.
So given two sample images, I need to compare the similarity/dissimilarity score between two given images using their feature representations. If the match score is greater than the threshold then it's a hit else no hit. Please let me know if there are any standard approaches?
Example of similar faces (which should ideally generate matchscore>threshold): https://3c1703fe8d.site.internapcdn.net/newman/gfx/news/hires/2014/yvyughbujh.jpg
Your project has two solutions:
Train your own network (using pretrained one) with output in 1000 classes. This approach is not the simplest one because of the necessity of having enough (say huge) amount of data for each class, approximately 1000 samples per class.
Another approach is to use Distance Metrics Learning. By this "distance" we usually mean Euclidean norm. This approach is much wider and deeper than just extract features and match them to the nearest one. Try to search for it.
Good luck!

Regarding the number of features extracted from an image for training

I am building a software to classify cells from images taken by a microscope.
I have a dataset of images of cells to use as training dataset - I have extracted Keypoints from each image using ORB - Here is my problem - some image produce a lot of keypoints and some small number of keypoints. Thus the descriptor vectors produced are of different lentgh. So when i try to build a training matrix from them i have to 'Normalize' the number of Keypoints chosen from each Image so that the length of the descriptor vectors will be the same.
How many key points should i pick and which? how to pick the 'Best' Keypoints? (this question also rises when i want to preform a prediction on an object i want to classify) are there known approaches to this problem?
Regards.
You could use bag of words approach to classify your images. You first have to collect all keypoint descriptors and cluster them into a certain number of groups. Each group (cluster) is your word. A descriptor corresponds to a word. For each image now, you can build a histogram by counting the occurrence of words. You can then normalize the histogram to remove the effect of varying number of keypoints in different images.
Using spatial pyramid matching could be another solution.
The simplest approach, as described by Ajay, is to cluster keypoints into N clusters and then define N binary features, such that for a given sample, feature i equals 1 if the sample shows a keypoint in cluster i, and 0 otherwise.
Another approach is to use a kernel classifier, like Support Vector Machines (SVM), and use a kernel that accepts variable-length vectors (e.g. Fisher kernel).

Bag of visual word model for feature clustering

I am using bag of visual words to cluster the features of the image. So far all the work i came across used BOW for clustering features computed using SIFT, SURF, etc. Maybe I missed this out but is it possible to represent Color Histogram Feature into BOW and also Edge oriented Histogram feature in BOW.
I am working on image classifier and I want to used SIFT with RGB color Histogram as feature descriptors in Opencv. So i was wondering is it correct to concatenate the 2 feature vectors into one and add to BOW or is it correct to add SIFT feature to BOW and concatenate the Histogram feature to BOW (I am using this model as of now but I am wondering which one is correct.)
Historically we use bag-of-words model for document retrieval.
As mentioned in this post, in computer vision bag-of-visual-words refers to bag-of-features (BOF). If you want to use two types of features, you can concatenate the two feature vectors and build the BOF model. (As an example,) For human action recognition task people often models human action using BOF built on the concatenation of appearance feature (Gradient -- HOG) and motion feature (Optical Flow -- HOF).

Image detection features: SIFT, HISTOGRAM and EGDE

I am working on developing a object classifier by using 3 different features i.e SIFT, HISTOGRAM and EGDE.
However these 3 features have different dimensional vector e.g. SIFT = 128 dimension. HIST = 256.
Now these features cannot be concatenated into once vector due to different sizes. What I am planning to do but I am not sure if it is going to be correct way is this:
For each features i train the classifier separately and than i apply classification separately for 3 different features and than count the majority and finally declare the image with majority votes.
Do you think this is a correct way?
There are several ways to get classification results that take into account multiple features. What you have suggested is one possibility where instead of combining features you train multiple classifiers and through some protocol, arrive at a consensus between them. This is typically under the field of ensemble methods. Try googling for boosting, random forests for more details on how to combine the classifiers.
However, it is not true that your feature vectors cannot be concatenated because they have different dimensions. You can still concatenate the features together into a huge vector. E.g., joining your SIFT and HIST features together will give you a vector of 384 dimensions. Depending on the classifier you use, you will likely have to normalize the entries of the vector so that no one feature dominate simply because by construction it has larger values.
EDIT in response to your comment:
It appears that your histogram is some feature vector describing a characteristic of the entire object (e.g. color) whereas your SIFT descriptors are extracted at local interest keypoints of that object. Since the number of SIFT descriptors may vary from image to image, you cannot pass them directly to a typical classifier as they often take in one feature vector per sample you wish to classify. In such cases, you will have to build a codebook (also called visual dictionary) using the SIFT descriptors you have extracted from many images. You will then use this codebook to help you derive a SINGLE feature vector from the many SIFT descriptors you extract from each image. This is what is known as a "bag of visual words (BOW)" model. Now that you have a single vector that "summarizes" the SIFT descriptors, you can concatenate that with your histogram to form a bigger vector. This single vector now summarizes the ENTIRE image/(object in the image).
For details on how to build the bag of words codebook and how to use it to derive a single feature vector from the many SIFT descriptors extracted from each image, look at this book (free for download from author's website) http://programmingcomputervision.com/ under the chapter "Searching Images". It is actually a lot simpler than it sounds.
Roughly, just run KMeans to cluster the SIFT descriptors from many images and take their centroids (which is a vector called a "visual word") as the codebook. E.g. for K = 1000 you have a 1000 visual word codebook. Then, for each image, create a result vector the same size as K (in this case 1000). Each element of this vector corresponds to a visual word. Then, for each SIFT descriptor extracted from an image, find its closest matching vector in the codebook and increment the count in the corresponding cell in the result vector. When you are done, this result vector essentially counts how often the different visual words appear in the image. Similar images will have similar counts for the same visual words and hence this vector effectively represents your images. You will also need to "normalize" this vector to make sure that images with different number of SIFT descriptors (and hence total counts) are comparable. This can be as simple as simply dividing each entry by the total count in the vector or through a more sophisticated measure such as tf/idf as described in the book.
I believe the author also provide python code on his website to accompany the book. Take a look or experiment with them if you are unsure.
More sophisticated method for combining features include Multiple Kernel Learning (MKL). In this case, you compute different kernel matrices, each using one feature. You then find the optimal weights to combine the kernel matrices and use the combined kernel matrix to train a SVM. You can find the code for this in the Shogun Machine Learning Library.

bag of words - image classification

i have some doubts incase of bag of words based image classification, i will first of tell what i have done
i have extracted the features from the training image with two different categories using SURF method,
i have then made clustering of the features for the two categories.
in order to classify my test image (i.e) to which of the two category the test image belongs to. for this classifying purpose i am using SVM classifier, but here is what i have a doubt , how do we input the test image do we have to do the same step from 1 to 2 again and then use it as a test set or is there any other method to do,
also would be great to know the efficiency of the bow approach,
kindly some one provide me with an clarification
The classifier needs the representation for the test data to have the same meaning as the training data. So, when you're evaluating a test image, you extract the features and then make the histogram of which words from your original vocabulary they're closest to.
That is:
Extract features from your entire training set.
Cluster those features into a vocabulary V; you get K distinct cluster centers.
Encode each training image as a histogram of the number of times each vocabulary element shows up in the image. Each image is then represented by a length-K vector.
Train the classifier.
When given a test image, extract the features. Now represent the test image as a histogram of the number of times each cluster center from V was closest to a feature in the test image. This is a length K vector again.
It's also often helpful to discount the histograms by taking the square root of the entries. This approximates a more realistic model for image features.

Resources