Linear Discriminant Analysis vs Naive Bayes - machine-learning

What are the advantages and disadvantages of LDA vs Naive Bayes in
terms of machine learning classification?
I know some of the differences like Naive Bayes assumes variables to be independent, while LDA assumes Gaussian class-conditional density models, but I don't understand when to use LDA and when to use NB depending on the situation?

Both methods are pretty simple, so it's hard to say which one is going to work much better. It's often faster just to try both and calculate the test accuracy. But here's the list of characteristics that usually indicate if certain method is less likely to give good results. It all boils down to the data.
Naive Bayes
The first disadvantage of the Naive Bayes classifier is the feature independence assumption. In practice, the data is multi-dimensional and different features do correlate. Due to this, the result can be potentially pretty bad, though not always significantly. If you know for sure, that features are dependent (e.g. pixels of an image), don't expect Naive Bayes to show off.
Another problem is data scarcity. For any possible value of a feature, a likelihood is estimated by a frequentist approach. This can result in probabilities being close to 0 or 1, which in turn leads to numerical instabilities and worse results.
A third problem arises for continuous features. The Naive Bayes classifier works only with categorical variables, so one has to transform continuous features to discrete, by which throwing away a lot of information. If there's a continuous variable in the data, it's a strong sign against Naive Bayes.
Linear Discriminant Analysis
The LDA does not work well if the classes are not balanced, i.e. the number of objects in various classes are highly different. The solution is to get more data, which can be pretty easy or almost impossible, depending on a task.
Another disadvantage of LDA is that it's not applicable for non-linear problems, e.g. separation of donut-shape point clouds, but in high dimensional spaces it's hard to spot it right away. Usually you understand this after you see LDA not working, but if the data is known to be very non-linear, this is a strong sign against LDA.
In addition, LDA can be sensitive to overfitting and need careful validation / testing.

Related

Machine Learning - Feature Ranking by Algorithms

I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome. I have 5 algorithms:
Neural Networks
Logistics
Naive
Random Forest
Adaboost
I read a lot about Information Gain technique and it seems it is independent of the machine learning algorithm used. It is like a preprocess technique.
My question follows, is it best practice to perform feature importance for each algorithm dependently or just use Information Gain. If yes what are the technique used for each ?
First of all, it's worth stressing that you have to perform the feature selection based on the training data only, even if it is a separate algorithm. During testing, you then select the same features from the test dataset.
Some approaches that spring to mind:
Mutual information based feature selection (eg here), independent of the classifier.
Backward or forward selection (see stackexchange question), applicable to any classifier but potentially costly since you need to train/test many models.
Regularisation techniques that are part of the classifier optimisation, eg Lasso or elastic net. The latter can be better in datasets with high collinearity.
Principal components analysis or any other dimensionality reduction technique that groups your features (example).
Some models compute latent variables which you can use for interpretation instead of the original features (e.g. Partial Least Squares or Canonical Correlation Analysis).
Specific classifiers can aid interpretability by providing extra information about the features/predictors, off the top of my head:
Logistic regression: you can obtain a p-value for every feature. In your interpretation you can focus on those that are 'significant' (eg p-value <0.05). (same for two-classes Linear Discriminant Analysis)
Random Forest: can return a variable importance index that ranks the variables from most to least important.
I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome.
This will depend on the algorithm. If you have 5 algorithms, you will likely get 5 slightly different answers, unless you perform the feature selection prior to classification (eg using mutual information). One reason is that Random Forests and neural networks would pick up nonlinear relationships while logistic regression wouldn't. Furthermore, Naive Bayes is blind to interactions.
So unless your research is explicitly about these 5 models, I would rather select one model and proceed with it.
Since your purpose is to get some intuition on what's going on, here is what you can do:
Let's start with Random Forest for simplicity, but you can do this with other algorithms too. First, you need to build a good model. Good in the sense that you need to be satisfied with its performance and it should be Robust, meaning that you should use a validation and/or a test set. These points are very important because we will analyse how the model takes its decisions, so if the model is bad you will get bad intuitions.
After having built the model, you can analyse it at two level : For the whole dataset (understanding your process), or for a given prediction. For this task I suggest you to look at the SHAP library which computes features contributions (i.e how much does a feature influences the prediction of my classifier) that can be used for both puproses.
For detailled instructions about this process and more tools, you can look fast.ai excellent courses on the machine learning serie, where lessons 2/3/4/5 are about this subject.
Hope it helps!

Classifiers of machine learning for different parameters of dataset

Why behaviour of different classifier differ for different data?
Based on what parameters we can decide the good classifier for particular dataset?
For some dataset naive bayes gives better accuracy than SVM classifier
and for other dataset SVM performs better than naive bayes. Why is it
so? What is the reason?
Those are completely different classifiers. If you would have one classifier which is allways better than the other one. Why would you need the "bad one" then?
First google hit about when SVM's are not the best choice:
https://www.quora.com/For-what-kind-of-classification-problems-is-SVM-a-bad-approach
There is no general answer for this question. To understand which classifier to be used when you will need to understand the algorithm behind the classification procedure.
For instance logistic regression assumes a normal distribution of y and is generally useful when a particular parameter is not a uniquely deciding factor however combined weightage of the factors make a difference, for instance in text classification.
Decision tree on the other hand splits on the basis of parameter which gives most information. So if you have a set of parameters which is highly correlated with the label, then it makes more sense to use decision tree based classifiers.
SVM, work based on identifying adequate hyperplanes. These are generally useful when it is not possible to classify data in one plane but projecting them into higher plane classifies them easily. This is a nice tutorial on SVM https://blog.statsbot.co/support-vector-machines-tutorial-c1618e635e93
In short the only way to learn which classifier will be better in which situation is to understand how they work, and then figure out if they are best for your situation.
Another, crude way will be try every classifier and pick the best one, but i don't think you are interested in that.

Machine learning algorithm for few samples and features

I am intended to do a yes/no classifier. The problem is that the data does not come from me, so I have to work with what I have been given. I have around 150 samples, each sample contains 3 features, these features are continuous numeric variables. I know the dataset is quite small. I would like to make you two questions:
A) What would be the best machine learning algorithm for this? SVM? a neural network? All that I have read seems to require a big dataset.
B)I could make the dataset a little bit bigger by adding some samples that do not contain all the features, only one or two. I have read that you can use sparse vectors in this case, is this possible with every machine learning algorithm? (I have seen them in SVM)
Thanks a lot for your help!!!
My recommendation is to use a simple and straightforward algorithm, like decision tree or logistic regression, although, the ones you refer to should work equally well.
The dataset size shouldn't be a problem, given that you have far more samples than variables. But having more data always helps.
Naive Bayes is a good choice for a situation when there are few training examples. When compared to logistic regression, it was shown by Ng and Jordan that Naive Bayes converges towards its optimum performance faster with fewer training examples. (See section 4 of this book chapter.) Informally speaking, Naive Bayes models a joint probability distribution that performs better in this situation.
Do not use a decision tree in this situation. Decision trees have a tendency to overfit, a problem that is exacerbated when you have little training data.

Is supervised learning synonymous to classification and unsupervised learning synonymous to clustering?

I am a beginner in machine learning and recently read about supervised and unsupervised machine learning. It looks like supervised learning is synonymous to classification and unsupervised learning is synonymous to clustering, is it so?
No.
Supervised learning is when you know correct answers (targets). Depending on their type, it might be classification (categorical targets), regression (numerical targets) or learning to rank (ordinal targets) (this list is by no means complete, there might be other types that I either forgot or unaware of).
On the contrary, in unsupervised learning setting we don't know correct answers, and we try to infer, learn some structure from data. Be it cluster number or low-dimensional approximation (dimensionality reduction, actually, one might think of clusterization as of extreme 1D case of dimensionality reduction). Again, this might be far away from completeness, but the general idea is about hidden structure, that we try to discover from data.
Supervised learning is when you have labeled training data. In other words, you have a well-defined target to optimize your method for.
Typical (supervised) learning tasks are classification and regression: learning to predict categorial (classification), numerical (regression) values or ranks (learning to rank).
Unsupservised learning is an odd term. Because most of the time, the methods aren't "learning" anything. Because what would they learn from? You don't have training data?
There are plenty of unsupervised methods that don't fit the "learning" paradigm well. This includes dimensionality reduction methods such as PCA (which by far predates any "machine learning" - PCA was proposed in 1901, long before the computer!). Many of these are just data-driven statistics (as opposed to parameterized statistics). This includes most cluster analysis methods, outlier detection, ... for understanding these, it's better to step out of the "learning" mindset. Many people have trouble understanding these approaches, because they always think in the "minimize objective function f" mindset common in learning.
Consider for example DBSCAN. One of the most popular clustering algorithms. It does not fit the learning paradigm well. It can nicely be interpreted as a graph-theoretic construct: (density-) connected components. But it doesn't optimize any objective function. It computes the transitive closure of a relation; but there is no function maximized or minimized.
Similarly APRIORI finds frequent itemsets; combinations of items that occur more than minsupp times, where minsupp is a user parameter. It's an extremely simple definition; but the search space can be painfully large when you have large data. The brute-force approach just doesn't finish in acceptable time. So APRIORI uses a clever search strategy to avoid unnecessary hard disk accesses, computations, and memory. But there is no "worse" or "better" result as in learning. Either the result is correct (complete) or not - nothing to optimize on the result (only on the algorithm runtime).
Calling these methods "unsupervised learning" is squeezing them into a mindset that they don't belong into. They are not "learning" anything. Neither optimizes a function, or uses labels, or uses any kind of feedback. They just SELECT a certain set of objects from the database: APRIORI selects columns that frequently have a 1 at the same time; DBSCAN select connected components in a density graph. Either the result is correct, or not.
Some (but by far not all) unsupervised methods can be formalized as an optimization problem. At which point they become similar to popular supervised learning approaches. For example k-means is a minimization problem. PCA is a minimization problem, too - closely related to linear regression, actually. But it is the other way around. Many machine learning tasks are transformed into an optimization problem; and can be solved with general purpose statistical tools, which just happen to be highly popular in machine learning (e.g. linear programming). All the "learning" part is then wrapped into the way the data is transformed prior to feeding it into the optimizer. And in some cases, like for PCA, a non-iterative way to compute the optimum solution was found (in 1901). So in these cases, you don't need the usual optimization hammer at all.

Advantages of SVM over decion trees and AdaBoost algorithm

I am working on binary classification of data and I want to know the advantages and disadvantages of using Support vector machine over decision trees and Adaptive Boosting algorithms.
Something you might want to do is use weka, which is a nice package that you can use to plug in your data and then try out a bunch of different machine learning classifiers to see how each works on your particular set. It's a well-tread path for people who do machine learning.
Knowing nothing about your particular data, or the classification problem you are trying to solve, I can't really go beyond just telling you random things I know about each method. That said, here's a brain dump and links to some useful machine learning slides.
Adaptive Boosting uses a committee of weak base classifiers to vote on the class assignment of a sample point. The base classifiers can be decision stumps, decision trees, SVMs, etc.. It takes an iterative approach. On each iteration - if the committee is in agreement and correct about the class assignment for a particular sample, then it becomes down weighted (less important to get right on the next iteration), and if the committee is not in agreement, then it becomes up weighted (more important to classify right on the next iteration). Adaboost is known for having good generalization (not overfitting).
SVMs are a useful first-try. Additionally, you can use different kernels with SVMs and get not just linear decision boundaries but more funkily-shaped ones. And if you put L1-regularization on it (slack variables) then you can not only prevent overfitting, but also, you can classify data that isn't separable.
Decision trees are useful because of their interpretability by just about anyone. They are easy to use. Using trees also means that you can also get some idea of how important a particular feature was for making that tree. Something you might want to check out is additive trees (like MART).

Resources