I'm a newbie to Machine Learning. I have a question about how Normal Bayes is implemented in OpenCV.
I have a misunderstanding regarding the terms Normal Bayes and Naive Bayes.
This site says that Normal Bayes and Naive Bayes mean the same thing.
The NormalBayes documentation on the OpenCV website specifies that the features are normally distributed and not necessarily independent.
The Wikipedia article on the Naive Bayes classifier says that the features are assumed to be independent, so a covariance matrix need not be determined.
However, when I look at the source of the Normal Bayes classifier implementation, it does calculate a covariance matrix.
I also found a similar question over here which wasn't answered.
Am I missing something here? Or is the Normal Bayes classifier in OpenCV not a standard Naive Bayes classifier?
Theoretically, the Naive Bayes model assumes "complete independence between causes of an effect", while the Normal model assumes that "feature vectors from each class are normally distributed (though, not necessarily independently distributed)". Note that both use mean vectors and covariance matrices; however, the model assumptions are different.
In OpenCV the "data distribution function is assumed to be a Gaussian mixture, one component per class", and the model does not make an independence assumption about the features within a class.
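For reference, here is a minimal sketch of how that classifier is used from the Python bindings; the data, class labels, and covariances below are made up for illustration, and only cv2.ml.NormalBayesClassifier itself comes from OpenCV:

```python
import numpy as np
import cv2

# Two classes of 2-D feature vectors; the features within each class are
# deliberately correlated, which a per-feature ("naive") model would ignore.
rng = np.random.default_rng(0)
class0 = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], 100)
class1 = rng.multivariate_normal([3, 3], [[1.0, -0.6], [-0.6, 1.0]], 100)

samples = np.vstack([class0, class1]).astype(np.float32)
responses = np.hstack([np.zeros(100), np.ones(100)]).astype(np.int32)

# Normal Bayes classifier: one Gaussian (mean vector + covariance matrix) per class.
model = cv2.ml.NormalBayesClassifier_create()
model.train(samples, cv2.ml.ROW_SAMPLE, responses)

_, predictions = model.predict(samples)
print("training accuracy:", np.mean(predictions.ravel() == responses))
```

Because each class gets its own full covariance matrix, correlated features within a class are handled, which is exactly where it departs from a naive (per-feature) model.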
Related
What are the advantages and disadvantages of LDA vs Naive Bayes in terms of machine learning classification?
I know some of the differences, such as that Naive Bayes assumes variables to be independent while LDA assumes Gaussian class-conditional density models, but I don't understand when to use LDA and when to use NB, depending on the situation.
Both methods are pretty simple, so it's hard to say which one is going to work much better. It's often faster just to try both and compute the test accuracy (a quick sketch of doing that follows at the end of this answer). But here is a list of characteristics that usually indicate whether a certain method is less likely to give good results. It all comes down to the data.
Naive Bayes
The first disadvantage of the Naive Bayes classifier is the feature independence assumption. In practice, data is multi-dimensional and different features do correlate. Because of this, the results can be fairly poor, though not always dramatically so. If you know for sure that the features are dependent (e.g. pixels of an image), don't expect Naive Bayes to perform well.
Another problem is data scarcity. For any possible value of a feature, the likelihood is estimated from its frequency in the training data. This can result in probabilities close to 0 or 1, which in turn leads to numerical instabilities and worse results.
A third problem arises with continuous features. The basic Naive Bayes classifier works only with categorical variables, so one has to discretize continuous features, thereby throwing away a lot of information. If there are continuous variables in the data, it's a strong sign against Naive Bayes.
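To make the discretization point concrete, here is a rough sketch of the usual workaround in scikit-learn: bin the continuous features, then fit a categorical Naive Bayes. The dataset, bin count, and other settings are arbitrary choices for illustration (scikit-learn also offers GaussianNB, which models continuous features directly at the cost of a distributional assumption):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import CategoricalNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

X, y = load_iris(return_X_y=True)

# Bin each continuous feature into 5 ordinal categories, then fit a
# categorical Naive Bayes; the binning step is exactly where information
# gets thrown away.
model = make_pipeline(
    KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform"),
    CategoricalNB(min_categories=5),  # so bins unseen in a CV fold don't break prediction
)
print(cross_val_score(model, X, y, cv=5).mean())
```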
Linear Discriminant Analysis
LDA does not work well if the classes are imbalanced, i.e. the numbers of objects in the various classes differ greatly. The solution is to get more data, which can be quite easy or almost impossible, depending on the task.
Another disadvantage of LDA is that it's not applicable to non-linear problems, e.g. separating donut-shaped point clouds, and in high-dimensional spaces this is hard to spot right away. Usually you only realize it after seeing LDA fail, but if the data is known to be very non-linear, that's a strong sign against LDA.
In addition, LDA can be sensitive to overfitting and needs careful validation/testing.
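As suggested above, the quickest way to decide is often to just run both. A rough sketch with scikit-learn (the dataset here is only a stand-in; substitute your own):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

# Cross-validated accuracy for both models on the same folds.
for name, model in [("Naive Bayes", GaussianNB()),
                    ("LDA", LinearDiscriminantAnalysis())]:
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```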
So far, I have read some highly cited metric learning papers. The general idea of such papers is to learn a mapping such that mapped data points with the same label lie close to each other and far from samples of other classes. To evaluate such techniques, they report the accuracy of a KNN classifier on the generated embedding.

So my question is: if we have a labelled dataset and we are interested in increasing the accuracy of a classification task, why don't we learn a classifier on the original data points? I mean, instead of finding a new embedding that suits a KNN classifier, we could learn a classifier that fits the (non-embedded) data points. Based on what I have read so far, the classification accuracy of such classifiers is much better than that of the metric learning approaches. Is there a study that shows metric learning + KNN performs better than fitting a (good) classifier, at least on some datasets?
Metric learning models can themselves be classifiers, so I will answer the question of why we need metric learning for classification.
Let me give you an example. Suppose you have a dataset with millions of classes and some classes have only a handful of examples, say fewer than 5. If you use classifiers such as SVMs or ordinary CNNs, you will find it nearly impossible to train them, because those classifiers (discriminative models) will largely ignore the classes with few examples.
But for metric learning models this is not a problem, since they are based on generative models.
By the way, a large number of classes is itself a challenge for discriminative models.
Such real-life challenges push us to explore better models.
As @Tengerye mentioned, you can use models trained with metric learning for classification. KNN is the simplest approach, but you can also take the embeddings of your data and train another classifier on them, be it KNN, SVM, a neural network, etc. (see the sketch below). The use of metric learning in this case is to map the original input space to another one that is easier for a classifier to handle.
Apart from discriminative models being hard to train when the data is unbalanced, or worse, has very few examples per class, they cannot easily be extended to new classes.
Take facial recognition, for example: if facial recognition models are trained as classification models, they only work for the faces they have seen and fail for any new face. Of course, you could add images of the faces you wish to recognize and retrain or fine-tune the model if possible, but this is highly impractical. On the other hand, facial recognition models trained using metric learning can generate embeddings for new faces, which can simply be added to the KNN index, and your system can then identify the new person from his/her image.
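Here is a small sketch of that idea using scikit-learn's NeighborhoodComponentsAnalysis, a simple linear metric learning method, so it is not the deep metric learning from the papers, but the mechanics are the same: learn an embedding, then run KNN in it. The dataset and hyperparameters are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Plain KNN on the raw features.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn.fit(X_train, y_train)
print("KNN on raw features:      ", knn.score(X_test, y_test))

# Learn a linear embedding with NCA, then run the same KNN in that space.
nca_knn = make_pipeline(
    StandardScaler(),
    NeighborhoodComponentsAnalysis(n_components=32, random_state=0),
    KNeighborsClassifier(n_neighbors=3))
nca_knn.fit(X_train, y_train)
print("KNN on learned embedding: ", nca_knn.score(X_test, y_test))
```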
I intend to build a yes/no classifier. The problem is that the data does not come from me, so I have to work with what I have been given. I have around 150 samples; each sample contains 3 features, which are continuous numeric variables. I know the dataset is quite small. I would like to ask two questions:
A) What would be the best machine learning algorithm for this? An SVM? A neural network? Everything I have read seems to require a big dataset.
B) I could make the dataset a little bigger by adding some samples that do not contain all the features, only one or two. I have read that you can use sparse vectors in this case; is this possible with every machine learning algorithm? (I have seen them used with SVMs.)
Thanks a lot for your help!!!
My recommendation is to use a simple and straightforward algorithm, like a decision tree or logistic regression, although the ones you refer to should work equally well.
The dataset size shouldn't be a problem, given that you have far more samples than variables. But having more data always helps.
Naive Bayes is a good choice when there are few training examples. Compared to logistic regression, it was shown by Ng and Jordan that Naive Bayes converges towards its optimum performance faster, with fewer training examples. (See section 4 of this book chapter.) Informally speaking, Naive Bayes models the joint probability distribution, which tends to perform better in this situation.
Do not use a decision tree in this situation. Decision trees have a tendency to overfit, a problem that is exacerbated when you have little training data.
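A rough sketch of what that comparison might look like with scikit-learn; the data below is a synthetic stand-in for your ~150 samples with 3 continuous features, so only the overall recipe (cross-validation over a couple of simple models) is the point:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Stand-in for the real data: 150 samples, 3 continuous features, yes/no label.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=150) > 0).astype(int)

for name, model in [("Gaussian Naive Bayes", GaussianNB()),
                    ("Logistic regression", LogisticRegression())]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.2f}")
```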
I am trying to implement Multiclass classification in WEKA.
I have a lot of rows, say bank transactions, and each one is tagged as Food, Medicine, Rent, etc. I want to develop a classifier that can be trained on the previous data I have and predict the class that future transactions belong to. If I am right, this is multiclass and not multilabel, since each transaction can belong to only one class.
Below are a few algorithms I am considering
Naive Bayes
Multinomial Logistic Regression
Multiclass SVM
Max Entropy
Neural Networks (if possible)
In my data, the number of features <<< the number of transactions, and hence I am thinking of a one-vs-rest binary classifier instead of one-vs-one.
Are there any other algorithms I should look into that would help with my goal?
Are any of the algorithms I listed useless for my goal?
Also, I found that scikit-learn in Python is better than WEKA, but I read that scikit-learn can only run on one processor. Is this true?
Answers to any question would be helpful.
Thanks!
You can look at RandomForest, which is a well-known classifier and quite efficient.
In scikit-learn, some classes can run over several cores, RandomForestClassifier among them. They have a constructor parameter, n_jobs, that sets the number of cores to use; a value of -1 uses every available core. Look at the documentation: any estimator whose constructor takes an n_jobs parameter can run over several cores.
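For example (a minimal sketch; the synthetic data is just a placeholder for your transaction features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder multiclass data standing in for the transaction features.
X, y = make_classification(n_samples=10_000, n_classes=4,
                           n_informative=6, random_state=0)

# n_jobs=-1 uses every available core; any positive integer fixes the count.
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))
```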
I'm using an OpenCV Haar classifier in my work, but I keep reading conflicting reports on whether the OpenCV Haar classifier is an SVM or not. Can anyone clarify whether it uses an SVM? Also, if it does not use an SVM, what advantages does the Haar method offer over an SVM approach?
SVM and boosting (AdaBoost, GentleBoost, etc.) are feature classification strategies/algorithms. Support Vector Machines solve a complex optimization problem, often using kernel functions, which allow us to separate samples by working in a much higher-dimensional feature space. On the other hand, boosting is a strategy based on combining lots of "cheap" classifiers in a smart way, which leads to very fast classification. Those weak classifiers can even be SVMs.
Haar-like features are a kind of feature computed from integral images and are very well suited to computer vision problems.
That is, you can combine Haar features with either of the two classification schemes.
It isn't SVM. Here is the documentation:
http://docs.opencv.org/modules/objdetect/doc/cascade_classification.html#haar-feature-based-cascade-classifier-for-object-detection
It uses boosting (supporting AdaBoost and a variety of other similar methods -- all based on boosting).
The important difference is evaluation speed: cascade classifiers and their stage-based boosting algorithms allow very fast evaluation with high accuracy (and in particular support training with many negatives), striking a better speed/accuracy balance than an SVM for this particular application.
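For completeness, here is a minimal sketch of running one of the pretrained Haar cascades from Python; "photo.jpg" is a placeholder path, and the detection parameters are just typical defaults:

```python
import cv2

# Load a pretrained Haar cascade shipped with OpenCV (path exposed via cv2.data).
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("photo.jpg")  # placeholder: any test image of your own
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Each candidate window passes through the boosted cascade of Haar-feature stages;
# most windows are rejected by the cheap early stages, which is what makes it fast.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
print(f"found {len(faces)} face(s)")
```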