Best Model for judging if two images are similar - machine-learning

My dataset is in the form of pairs of images with a rating of 1 or 0. 1 indicates similar and 0 is dissimilar. The model has to be trained in such a way that it gives similarity between two input images not present in training. The number of classes is also indeterminate.
I have used ITML (Information Theoretic Metric Learning), LSML (Least Squares Metric Learning), and CSML (Cosine Similarity Metric Learning), so I have interpreted this as a metric learning problem.
Is there any other way to look at this problem or any other metric learning models I can use?

From your description, I think the key to your problem is feature extraction from the images; from those features you can calculate a suitable metric for the problem at hand. Once the features are descriptive enough, most of the similarity measures you mention could be useful.
By the way, your classes are not indeterminate. They are 0 (not-similar) and 1 (similar), as you are formulating it as a classification problem.

Related

Machine Learning: Weighting Training Points by Importance

I have a set of labeled training data, and I am training a ML algorithm to predict the label. However, some of my data points are more important than others. Or, analogously, these points have less uncertainty than the others.
Is there a general method to include an importance-representing weight to each training point in the model? Are there instead some specific models which are capable of this while others are not?
I can imagine duplicating these points (and perhaps smearing their features slightly to avoid exact duplicates), or downsampling the less important points. Is there a more elegant way to approach this problem?
Scikit-learn allows you to pass an array of sample weights while fitting the model. Vowpal Wabbit (an online ML library) also has this option.
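As a rough illustration of the scikit-learn option, here is a minimal sketch that passes per-sample weights to `fit` through the `sample_weight` argument. The toy data and the weight values are made up for the example, and logistic regression is just one estimator that accepts the argument.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 1-D data with overlapping classes (illustrative values only).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 1, 0, 1, 1])

# Hypothetical importance weights: the point at x=3.0 (label 0) counts 10x.
weights = np.array([1.0, 1.0, 1.0, 10.0, 1.0, 1.0])

unweighted = LogisticRegression().fit(X, y)
weighted = LogisticRegression().fit(X, y, sample_weight=weights)

# Predicted probability of class 1 at the heavily weighted point.
p_unweighted = unweighted.predict_proba([[3.0]])[0, 1]
p_weighted = weighted.predict_proba([[3.0]])[0, 1]
print(p_unweighted, p_weighted)
```

The weighted model should assign a lower class-1 probability at x=3.0, because errors on that point cost more during fitting. Not every estimator supports `sample_weight`, so it is worth checking the `fit` signature of the model you plan to use.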

Image similarity detection with TensorFlow

Recently I started to play with TensorFlow. While trying to learn the popular algorithms, I ran into a situation where I need to find the similarity between images.
Image A is supplied to the system by me, and userx supplies an image B; the system should retrieve image A for userx if image B is similar (color and class).
Now I have a few questions:
Do we consider this scenario to be supervised learning? I am asking because I don't see it as a classification problem (confused!).
What algorithms should I use to train, etc.?
Re-training should be done quite often. How should I tackle this problem so I don't train from scratch every time (fine-tuning?)?
Do we consider this scenario to be supervised learning?
It is supervised learning when you have labels to optimize your model. So for most neural networks, it is supervised.
However, you might also look at the complete task. I guess you don't have any ground truth for image pairs and the "desired" similarity value your model should output?
One way to solve this problem, which sounds inherently unsupervised, is to take a CNN (convolutional neural network) trained (in a supervised way) on the 1000 classes of ImageNet. To get the similarity of two images, you could then simply take the Euclidean distance between their output probability distributions. This will not lead to excellent results, but it is probably a good start.
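The distance step can be sketched in plain Python. The three short vectors below are made-up stand-ins for the 1000-dimensional probability output of a pretrained network; obtaining the real vectors from a model is not shown here.

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two equal-length probability vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical 4-class outputs (a real ImageNet model would give 1000 values).
probs_a = [0.70, 0.20, 0.05, 0.05]
probs_b = [0.60, 0.30, 0.05, 0.05]  # similar class distribution to A
probs_c = [0.05, 0.05, 0.20, 0.70]  # very different class distribution

d_ab = euclidean_distance(probs_a, probs_b)
d_ac = euclidean_distance(probs_a, probs_c)
print(d_ab, d_ac)
```

A smaller distance means the network sees the two images as belonging to similar classes. Note that this only captures class similarity, not color similarity.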
What algorithms should I use to train, etc.?
First, you should define what "similar" means for you. Are two images similar when they contain the same object (classes)? Are they similar if the general color of the image is the same?
For example, how similar are the following 3 pairs of images?
Have a look at FaceNet and search for "Content based image retrieval" (CBIR):
Wikipedia
Google Scholar
This can be supervised learning. You can classify the images into categories; if two images are in the same category (or close in a category), you can think of them as similar.
You can use deep convolutional neural networks trained on ImageNet, such as the Inception model. The Inception model outputs a probability vector over 1000 classes (a vector whose values sum to 1). You can calculate the distance between the vectors of two images to get their similarity.
On the same page of the inception model, you will also find the instructions to retrain a model: https://github.com/tensorflow/models/tree/master/inception#how-to-fine-tune-a-pre-trained-model-on-a-new-task

Suggested unsupervised feature selection / extraction method for 2 class classification?

I've got a set of F features, e.g. Lab color space, entropy. By concatenating all features together, I obtain a feature vector of dimension d (between 12 and 50, depending on which features are selected).
I usually get between 1000 and 5000 new samples, denoted x. A Gaussian Mixture Model is then trained with the vectors, but I don't know which class the features are from. What I know though, is that there are only 2 classes. Based on the GMM prediction I get a probability of that feature vector belonging to class 1 or 2.
My question now is: How do I obtain the best subset of features, for instance only entropy and normalized rgb, that will give me the best classification accuracy? I guess this is achieved, if the class separability is increased, due to the feature subset selection.
Maybe I can utilize Fisher's linear discriminant analysis? Since I already have the mean and covariance matrices obtained from the GMM. But wouldn't I have to calculate the score for each combination of features then?
It would be nice to get some help on whether this is an unrewarding approach and I'm on the wrong track, and/or any other suggestions.
One way of finding "informative" features is to use the features that will maximise the log likelihood. You could do this with cross validation.
https://www.cs.cmu.edu/~kdeng/thesis/feature.pdf
Another idea might be to use another unsupervised algorithm that automatically selects features, such as a clustering forest:
http://research.microsoft.com/pubs/155552/decisionForests_MSR_TR_2011_114.pdf
In that case the clustering algorithm will automatically split the data based on information gain.
Fisher LDA will not select features but project your original data into a lower-dimensional subspace. If you are looking into subspace methods, another interesting approach might be spectral clustering, which also happens in a subspace, or unsupervised neural networks such as autoencoders.

What are the metrics to evaluate a machine learning algorithm

I would like to know what are the various techniques and metrics used to evaluate how accurate/good an algorithm is and how to use a given metric to derive a conclusion about a ML model.
One way to do this is to use precision and recall, as defined here in Wikipedia.
Another way is to use the accuracy metric as explained here. So, what I would like to know is whether there are other metrics for evaluating an ML model?
A while ago I compiled a list of metrics used to evaluate classification and regression algorithms, in the form of a cheatsheet. Some metrics for classification: precision, recall, sensitivity, specificity, F-measure, Matthews correlation, etc. They are all based on the confusion matrix. Others exist for regression (continuous output variable).
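To make the link to the confusion matrix concrete, here is a small plain-Python sketch that derives several of these metrics from the four confusion-matrix counts. The function names and the toy labels are made up for illustration, and returning 0.0 for an undefined ratio is one convention among several.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def classification_metrics(tp, fp, tn, fn):
    """Derive common metrics; a ratio with a zero denominator becomes 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # also called sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1, "mcc": mcc}

# Toy labels: 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts, metrics)
```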
The technique is mostly to run an algorithm on some data to get a model, and then apply that model on new, previously unseen data, and evaluate the metric on that data set, and repeat.
Some techniques (actually resampling techniques from statistics):
Jackknife
Cross-validation
K-fold validation
Bootstrap
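As an example of one of these, k-fold splitting can be sketched in a few lines of plain Python. The helper name `k_fold_indices` is made up for illustration; it uses contiguous folds, so real data would normally be shuffled first (or a library routine used instead).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    # Train on train_idx, evaluate the chosen metric on test_idx,
    # then average the metric over all folds.
    print(len(train_idx), test_idx)
```

Each sample lands in the test set exactly once, so the averaged metric uses all the data while never testing on points the model was trained on.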
ML in general is quite a vast field, but I'll try to answer anyway. The Wikipedia definition of ML is the following:
Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.
In this context, learning can be defined as the parameterization of an algorithm. The parameters of the algorithm are derived using input data with a known output. When the algorithm has "learned" the association between input and output, it can be tested with further input data for which the output is also known.
Let's suppose your problem is to obtain words from speech. Here the input is some kind of audio file containing one word (not necessarily, but I assume this case to keep it simple). You'd record X words N times and then use (for example) N/2 of the repetitions to parameterize your algorithm, disregarding, for the moment, what your algorithm would look like.
Now on the one hand, depending on the algorithm, if you feed it one of the remaining repetitions, it may give you some certainty estimate, which may be used to characterize the recognition of just that one repetition. On the other hand, you may use all of the remaining repetitions to test the learned algorithm. You pass each repetition to the algorithm and compare the expected output with the actual output. In the end you'll have an accuracy value for the learned algorithm, calculated as the quotient of correct and total classifications.
Anyway, the actual accuracy will depend on the quality of your learning and test data.
A good starting point for further reading would be Pattern Recognition and Machine Learning by Christopher M. Bishop.
There are various metrics for evaluating the performance of an ML model, and there is no fixed list of 20 or 30 metrics. You can create your own metrics depending on your problem, and there are real-world problems where you will need to create a custom metric.
As for the existing ones, they are already listed in the first answer; I will just highlight the merits and demerits of a few of them to give a better understanding.
Accuracy is the simplest metric and is commonly used. It is the number of correctly classified points divided by the total number of points in your dataset. In a 2-class problem, some points belong to class 1 and some belong to class 2. Accuracy is not preferred when the dataset is imbalanced, because a model that always predicts the majority class still scores high, and the single number is not very interpretable.
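A quick sketch of the imbalance problem, taking accuracy as the fraction of correct predictions. The 95/5 class split and the "always predict the majority class" model are made up for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical imbalanced dataset: 95 points of class 0, 5 points of class 1.
y_true = [0] * 95 + [1] * 5

# A useless "model" that always predicts the majority class.
always_majority = [0] * len(y_true)

acc = accuracy(y_true, always_majority)

# How many of the class-1 points were actually found?
found_minority = sum(t == 1 and p == 1
                     for t, p in zip(y_true, always_majority))
print(acc, found_minority)
```

The accuracy is 0.95 even though not a single class-1 point was detected, which is why metrics such as precision and recall are usually reported alongside it on imbalanced data.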
Log loss is a metric computed on the predicted probability scores, which gives you a better understanding of how strongly the model believes a specific point belongs to class 1. The best part of this metric is that it is built into logistic regression, a well-known ML technique that minimizes it during training.
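For reference, binary log loss can be written out in plain Python as below. The clipping constant `eps` and the example probabilities are assumptions made for the sketch; library implementations may differ in details.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log loss; p_pred holds predicted P(class 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0]
confident_right = log_loss(y_true, [0.9, 0.1])
hesitant_right = log_loss(y_true, [0.6, 0.4])
confident_wrong = log_loss(y_true, [0.1, 0.9])
print(confident_right, hesitant_right, confident_wrong)
```

All three sets of predictions above either get both labels right or both wrong after thresholding at 0.5, yet log loss separates them: confident correct predictions score lowest and confident wrong predictions are penalized heavily.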
The confusion matrix is best suited to 2-class classification problems, where it gives four numbers; the diagonal entries give you an idea of how good your model is. From this matrix, other interpretable metrics such as precision, recall and F1-score are derived.

Using weka to classify sensor data

I am working on a classification problem that involves different sensors. Each sensor collects a set of numeric values.
I think it's a classification problem and want to use Weka as the ML tool for it. But I am not sure how to use Weka to deal with the input values. And which classifier will best fit this problem (one instance of a feature is a set of numeric values)?
For example, I have three sensors A, B, C. Can I define 5 collected data points from all sensors as one instance? For example, one instance of A is {1,2,3,4,5,6,7}, one instance of B is {3,434,534,213,55,4,7}, and one instance of C is {424,24,24,13,24,5,6}.
Thanks a lot for your time on reviewing my question.
Commonly, the first classifier to try is Naive Bayes (you can find it under the "Bayes" directory in Weka) because it's fast, parameter-free, and its classification accuracy is hard to beat whenever the training sample is small.
Random Forest (you can find it under the "Tree" directory in Weka) is another pleasant classifier since it processes almost any data. Just run it and see whether it gives better results. It may be necessary to increase the number of trees from the default 10 to some higher value. Since you have 7 attributes, 100 trees should be enough.
Then I would try k-NN (you can find it under the "Lazy" directory in Weka, where it's called "IBk") because it commonly ranks among the best single classifiers for a wide range of datasets. The only issues with k-NN are that it scales badly to large datasets (> 1GB) and that k, the number of neighbors, needs to be tuned. By default this value is set to 1, but with an increasing number of training samples it's commonly better to set it to some higher integer value in the range from 2 to 60.
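To show why k matters, here is a tiny k-NN sketch in plain Python. This is not Weka's IBk, just an illustration of the idea; the function name, the toy points, and the deliberately mislabeled point at (2.0, 2.0) are all made up.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=1):
    """Majority vote among the k training points closest to query."""
    neighbours = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D data: class "a" near (1, 1), class "b" near (5, 5),
# plus one noisy "b" point sitting next to the "a" cluster.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
           (5.0, 5.0), (5.2, 4.9), (2.0, 2.0)]
train_y = ["a", "a", "a", "b", "b", "b"]

query = (1.9, 1.9)
pred_k1 = knn_predict(train_X, train_y, query, k=1)
pred_k3 = knn_predict(train_X, train_y, query, k=3)
print(pred_k1, pred_k3)
```

With k=1 the single noisy neighbour decides the outcome, while k=3 lets the surrounding cluster outvote it. In practice k is usually chosen by cross-validation rather than by hand.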
And finally, for some datasets where both Naive Bayes and k-NN perform poorly, it's best to use SVM (under "Functions", where it's called "LibSVM"). However, it can be a hassle to set up all the parameters of the SVM to get competitive results. Hence I leave it to the end, when I already know what classification accuracies to expect. This classifier may not be the most convenient if you have more than two classes to classify.
