So far I have read some highly cited metric learning papers. The general idea of such papers is to learn a mapping such that mapped data points with the same label lie close to each other and far from samples of other classes. To evaluate such techniques, they report the accuracy of a KNN classifier on the generated embedding. So my question is: if we have a labelled dataset and we are interested in increasing the accuracy of a classification task, why don't we just learn a classifier on the original data points? I mean, instead of finding a new embedding that suits a KNN classifier, we could learn a classifier that fits the (non-embedded) data points. Based on what I have read so far, the classification accuracy of such classifiers is much better than that of metric learning approaches. Is there a study that shows metric learning + KNN performs better than fitting a (good) classifier, at least on some datasets?
Metric learning models can themselves be classifiers, so I will answer the question of why we need metric learning for classification at all.
Let me give you an example. Suppose you have a dataset with millions of classes, and some classes have only a handful of examples, say fewer than 5. If you use classifiers such as SVMs or ordinary CNNs, you will find them practically impossible to train, because those classifiers (discriminative models) will largely ignore the classes with only a few examples.
But for metric learning models this is not a problem, since they are based on generative models.
By the way, a very large number of classes is itself a challenge for discriminative models.
Such real-life challenges motivate us to explore better models.
As #Tengerye mentioned, you can use models trained with metric learning for classification. KNN is the simplest approach, but you can also take the embeddings of your data and train another classifier on top of them, be it KNN, SVM, a neural network, etc. The use of metric learning in this case is to transform the original input space into one that is easier for a classifier to handle.
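As a minimal sketch of that idea (the `embed` function below is only a placeholder standing in for whatever metric-learning model you have trained, not a real library call), the learned embeddings simply become the features for any off-the-shelf classifier:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# embed() stands in for your trained metric-learning model:
# it maps raw inputs into the learned embedding space.
def embed(X):
    # placeholder: a real model would be e.g. a triplet-loss network
    return np.asarray(X)

X_train, y_train = np.random.rand(100, 32), np.random.randint(0, 5, 100)
X_test = np.random.rand(10, 32)

Z_train, Z_test = embed(X_train), embed(X_test)

# Any classifier can be fit on the embeddings; KNN is just the simplest choice.
knn = KNeighborsClassifier(n_neighbors=5).fit(Z_train, y_train)
svm = SVC().fit(Z_train, y_train)

print(knn.predict(Z_test))
print(svm.predict(Z_test))
```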
Apart from discriminative models being hard to train when the data is unbalanced or, even worse, has very few examples per class, they also cannot easily be extended to new classes.
Take facial recognition, for example. If a facial recognition model is trained as a classification model, it will only work for the faces it has seen and will not work for any new face. Of course, you could add images of the faces you wish to support and retrain or fine-tune the model if possible, but this is highly impractical. On the other hand, a facial recognition model trained with metric learning can generate embeddings for new faces, which can simply be added to the KNN index, and your system can then identify the new person from his or her image.
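A hedged sketch of that enrollment step (the embeddings and names below are made up; in practice they would come from your trained face-embedding network):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical embeddings produced by a metric-learning face model (e.g. trained with triplet loss).
known_embeddings = np.random.rand(20, 128)           # 20 enrolled face images
known_labels = ["alice"] * 10 + ["bob"] * 10

# Enrolling a new person requires no retraining of the embedding network:
new_embeddings = np.random.rand(3, 128)              # 3 images of a new person, "carol"
all_embeddings = np.vstack([known_embeddings, new_embeddings])
all_labels = known_labels + ["carol"] * 3

knn = KNeighborsClassifier(n_neighbors=1).fit(all_embeddings, all_labels)

query = np.random.rand(1, 128)                       # embedding of a probe image
print(knn.predict(query))
```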
I know that support vector machines, random forests and logistic regression are well-known machine learning (ML) algorithms for classification.
I'm confused about the terminology around feature extraction, feature selection and classification.
Are the above ML algorithms used for extracting features rather than for selecting them?
Do these ML algorithms include both the feature extraction and the classification step?
Does the result of training the ML algorithm (accuracy, specificity, sensitivity, ...) tell us the result of classifying a disease after the feature extraction?
Regarding your confusion about the three terms:
Feature extraction: when you want to create new features out of raw data (say you have a transaction_day column but you are only interested in the month, so you create a new "transaction_month" column out of "transaction_day").
Feature selection: you have many features but want to keep only the important ones (how many to keep is another topic in itself). This can speed up learning and, with the right strategy, you will not sacrifice accuracy in many applications.
Classification: a family of supervised (labeled) machine learning tasks in which the goal is to assign observations to known classes (for example, emails to the spam or normal class).
Note: some machine learning algorithms, like the Lasso, have built-in feature selection; for others, a large coefficient for a feature after training usually indicates that the feature is important (read more about recursive feature elimination (RFE)). A short sketch of all three steps follows below.
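To make the three terms concrete, here is a small sketch (the column names and the tiny synthetic table are made up purely for illustration), showing extraction with pandas, selection with RFE, and classification with logistic regression:

```python
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Feature extraction: derive transaction_month from transaction_day (a raw date column).
df = pd.DataFrame({
    "transaction_day": pd.to_datetime(["2021-01-15", "2021-02-03", "2021-02-20", "2021-03-09"]),
    "amount": [120.0, 35.5, 80.0, 410.0],
    "label": [0, 1, 0, 1],
})
df["transaction_month"] = df["transaction_day"].dt.month

# Feature selection: keep the most informative feature(s) with recursive feature elimination.
X = df[["transaction_month", "amount"]]
y = df["label"]
selector = RFE(LogisticRegression(), n_features_to_select=1).fit(X, y)
print(dict(zip(X.columns, selector.support_)))

# Classification: fit a classifier on the selected features.
clf = LogisticRegression().fit(X.loc[:, selector.support_], y)
print(clf.predict(X.loc[:, selector.support_]))
```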
You may also find a good discussion in this post.
I'm learning transfer learning with some pre-trained models (VGG16, VGG19, …), and I wonder why I need to load pre-trained weights to train on my own dataset.
I can understand it if the classes in my dataset are included in the dataset the pre-trained model was trained on. For example, the VGG models were trained on the 1000 classes of the ImageNet dataset, and my model is to classify cat vs. dog, which are also in ImageNet. But in my case the classes in my dataset are not in that dataset, so how can the pre-trained weights help?
You don't have to use a pretrained network in order to train a model for your task. However, in practice, using a pretrained network and retraining it on your task/dataset is usually faster, and you often end up with better models yielding higher accuracy. This is especially the case if you do not have a lot of training data.
Why faster?
It turns out that, (relatively) independently of the dataset and target classes, the first couple of layers converge to similar results. This is because the low-level layers usually act as detectors of edges, corners and other simple structures. Check out this example that visualizes the structures that filters of different layers "react" to. With the lower layers already trained, adapting the higher-level layers to your use case is much faster.
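A minimal sketch of this, assuming a Keras-style workflow (the number of classes and the training data are placeholders): the pretrained convolutional base is frozen and only a new head is trained.

```python
import tensorflow as tf

# Load a convolutional base pretrained on ImageNet, without the 1000-class head.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False  # freeze the low-level edge/corner detectors

num_classes = 5  # placeholder: however many classes your dataset has
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, epochs=5)  # train only the new head; optionally unfreeze later to fine-tune
```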
Why more accurate?
This question is harder to answer. IMHO it is because the pretrained models you use as the basis for transfer learning were trained on massive datasets. The knowledge acquired there flows into your retrained network and helps you find a better local minimum of your loss function.
If you are in the comfortable situation of having a lot of training data, you should probably train a model from scratch, as the pretrained model might "point you in the wrong direction".
In this master's thesis you can find a number of tasks (small datasets, medium datasets, small semantic gap, large semantic gap) on which three methods are compared: fine-tuning, feature extraction + SVM, and training from scratch. Fine-tuning a model pretrained on ImageNet is almost always the better choice.
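For completeness, the feature extraction + SVM variant mentioned above looks roughly like this (again a hedged sketch; the images and labels are random placeholders): the frozen pretrained network only produces features, and a classical classifier is trained on top of them.

```python
import numpy as np
import tensorflow as tf
from sklearn.svm import SVC

# Use the frozen pretrained network purely as a feature extractor.
extractor = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                        pooling="avg", input_shape=(224, 224, 3))

# Placeholders for your data: images of shape (n, 224, 224, 3) and binary labels.
images = np.random.rand(8, 224, 224, 3).astype("float32")
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])

features = extractor.predict(tf.keras.applications.vgg16.preprocess_input(images * 255.0))
svm = SVC().fit(features, labels)  # train a classical classifier on the extracted features
```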
Why does the behaviour of different classifiers differ for different data?
Based on what parameters can we decide which classifier is good for a particular dataset?
For some datasets naive Bayes gives better accuracy than an SVM classifier, and for other datasets the SVM performs better than naive Bayes. Why is that? What is the reason?
Those are completely different classifiers. If you had one classifier that was always better than the other, why would you need the "bad" one at all?
The first Google hit about when SVMs are not the best choice:
https://www.quora.com/For-what-kind-of-classification-problems-is-SVM-a-bad-approach
There is no general answer to this question. To understand which classifier to use when, you need to understand the algorithm behind the classification procedure.
For instance, logistic regression models the log-odds of y as a linear combination of the features and is generally useful when no single parameter is a uniquely deciding factor but the combined weight of the factors makes a difference, for instance in text classification.
A decision tree, on the other hand, splits on the parameter that gives the most information. So if you have a set of parameters that is highly correlated with the label, it makes more sense to use decision-tree-based classifiers.
SVMs work by identifying adequate hyperplanes. They are generally useful when the data cannot be separated in the original space, but projecting it into a higher-dimensional space makes it easy to separate. This is a nice tutorial on SVMs: https://blog.statsbot.co/support-vector-machines-tutorial-c1618e635e93
In short, the only way to learn which classifier will be better in which situation is to understand how they work and then figure out whether they are the best fit for your situation.
Another, cruder way is to try every classifier and pick the best one (a quick cross-validation sketch of that is shown below), but I don't think that is what you are after.
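Something along these lines, assuming scikit-learn and using a built-in dataset as a stand-in for yours:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)  # any labelled dataset will do

candidates = {
    "naive_bayes": GaussianNB(),
    "svm": SVC(),
    "decision_tree": DecisionTreeClassifier(),
    "logistic_regression": LogisticRegression(max_iter=5000),
}

# Compare mean cross-validated accuracy and pick the best performer.
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")
```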
I intend to build a yes/no classifier. The problem is that the data does not come from me, so I have to work with what I have been given. I have around 150 samples; each sample contains 3 features, which are continuous numeric variables. I know the dataset is quite small. I would like to ask you two questions:
A) What would be the best machine learning algorithm for this? An SVM? A neural network? Everything I have read seems to require a big dataset.
B) I could make the dataset a little bigger by adding some samples that do not contain all the features, only one or two. I have read that you can use sparse vectors in this case; is this possible with every machine learning algorithm? (I have seen them used with SVMs.)
Thanks a lot for your help!!!
My recommendation is to use a simple and straightforward algorithm, like a decision tree or logistic regression, although the ones you refer to should work equally well.
The dataset size shouldn't be a problem, given that you have far more samples than variables. But having more data always helps.
Naive Bayes is a good choice when there are few training examples. Compared to logistic regression, Ng and Jordan showed that naive Bayes converges towards its optimum performance faster, i.e. with fewer training examples. (See section 4 of this book chapter.) Informally speaking, naive Bayes models a joint probability distribution, which performs better in this situation.
Do not use a decision tree in this situation. Decision trees have a tendency to overfit, a problem that is exacerbated when you have little training data.
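As a quick, hedged sanity check of this advice on data shaped like yours (the arrays below are synthetic, just mimicking ~150 samples with 3 continuous features and a yes/no label), you can cross-validate the candidate models directly:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# Stand-in for your data: ~150 samples, 3 continuous features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=150) > 0).astype(int)

for name, clf in [("naive_bayes", GaussianNB()),
                  ("logistic_regression", LogisticRegression())]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```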
Is this process correct?
Suppose we have a dataset such as MNIST.
We just feed all these data (without labels) into an RBM and resample each data point from the trained model.
The output can then be treated as new data for classification.
Do I understand it correctly?
What is the purpose of using an RBM?
You are correct. RBMs are a form of unsupervised learning algorithm that is commonly used to reduce the dimensionality of your feature space. Another common approach is to use autoencoders.
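A minimal sketch of that pipeline with scikit-learn (the digits dataset stands in for MNIST, and the hyperparameters are arbitrary): the RBM is fit without labels and its hidden-unit activations become the new features for a downstream classifier.

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import minmax_scale

# Small digits dataset as a stand-in for MNIST.
X, y = load_digits(return_X_y=True)
X = minmax_scale(X)  # BernoulliRBM expects values in [0, 1]

# The RBM is trained without labels; its hidden activations become the new features,
# on which an ordinary classifier is then trained.
model = Pipeline([
    ("rbm", BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)
print("train accuracy:", model.score(X, y))
```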
RBMs are trained using the contrastive divergence algorithm. The best overview of this algorithm comes from Geoffrey Hinton, who came up with it:
https://www.cs.toronto.edu/~hinton/absps/guideTR.pdf
A great paper about how unsupervised pre-training improves performance can be found at http://jmlr.org/papers/volume11/erhan10a/erhan10a.pdf. The paper shows that unsupervised learning provides better generalization and better filters (if using CRBMs).