On a given problem, I've checked using k-fold (k=10) a few standalone algorithms among them are CART and KNN.
KNN came with the upper hand over CART and I furthermore tuned it using the best k I found (3).
When I went on to check ensemble methods I tried AdaBoost with the tuned KNN and the default CART and surprisingly the CART performed better.
I'm not so deep to the math behind the scenes but I don't really understand how this makes sense.
So two questions:
How come AdaBoost with CART performed better than the tuned KNN?
If standalone performance does not predict ensemble base estimator, how do I wisely choose base estimators for ensemble methods?
Related
I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome. I have 5 algorithms:
Neural Networks
Logistics
Naive
Random Forest
Adaboost
I read a lot about Information Gain technique and it seems it is independent of the machine learning algorithm used. It is like a preprocess technique.
My question follows, is it best practice to perform feature importance for each algorithm dependently or just use Information Gain. If yes what are the technique used for each ?
First of all, it's worth stressing that you have to perform the feature selection based on the training data only, even if it is a separate algorithm. During testing, you then select the same features from the test dataset.
Some approaches that spring to mind:
Mutual information based feature selection (eg here), independent of the classifier.
Backward or forward selection (see stackexchange question), applicable to any classifier but potentially costly since you need to train/test many models.
Regularisation techniques that are part of the classifier optimisation, eg Lasso or elastic net. The latter can be better in datasets with high collinearity.
Principal components analysis or any other dimensionality reduction technique that groups your features (example).
Some models compute latent variables which you can use for interpretation instead of the original features (e.g. Partial Least Squares or Canonical Correlation Analysis).
Specific classifiers can aid interpretability by providing extra information about the features/predictors, off the top of my head:
Logistic regression: you can obtain a p-value for every feature. In your interpretation you can focus on those that are 'significant' (eg p-value <0.05). (same for two-classes Linear Discriminant Analysis)
Random Forest: can return a variable importance index that ranks the variables from most to least important.
I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome.
This will depend on the algorithm. If you have 5 algorithms, you will likely get 5 slightly different answers, unless you perform the feature selection prior to classification (eg using mutual information). One reason is that Random Forests and neural networks would pick up nonlinear relationships while logistic regression wouldn't. Furthermore, Naive Bayes is blind to interactions.
So unless your research is explicitly about these 5 models, I would rather select one model and proceed with it.
Since your purpose is to get some intuition on what's going on, here is what you can do:
Let's start with Random Forest for simplicity, but you can do this with other algorithms too. First, you need to build a good model. Good in the sense that you need to be satisfied with its performance and it should be Robust, meaning that you should use a validation and/or a test set. These points are very important because we will analyse how the model takes its decisions, so if the model is bad you will get bad intuitions.
After having built the model, you can analyse it at two level : For the whole dataset (understanding your process), or for a given prediction. For this task I suggest you to look at the SHAP library which computes features contributions (i.e how much does a feature influences the prediction of my classifier) that can be used for both puproses.
For detailled instructions about this process and more tools, you can look fast.ai excellent courses on the machine learning serie, where lessons 2/3/4/5 are about this subject.
Hope it helps!
Does it overfit if I build a machine learning model where it use the output from another machine learning model while both models are trained on the same data?
Basically I was wondering if I can use the KNN prediction result as an input for a deep neural network model while both of the models are trained on the very same data.
Nesting machine learning models is possible. For example, neuronal networks can be seen as multiple nested perceptrons (see https://en.wikipedia.org/wiki/Perceptron).
However you are right - nesting machine learning models increase the VC-dimension (https://en.wikipedia.org/wiki/VC_dimension) of your complete machine learning system and thus the risk of overfitting.
In practice cross-validation is often used in order to reduce the risk of overfitting.
Edit:
#MatiasValdenegro +1 for pointing towards a point I do not specify very clearly in my answer. Pure cross-validation can indeed only be used in order to detect overfitting.
However when we training certain machine learning systems like neuronal networks, it is possible to use some sort of cross-validation in order to reduce the risk of overfitting. In order to do so, we simply discard e.g. 10% of the training data for training. Then after each training round, the trained machine learning system is evaluated on the discarded training data. Once the trained neuronal network is getting worse on the discarded part, the training algorithm stops. This is for example done by the python pybrain (http://pybrain.org/) library.
So far, I have read some highly cited metric learning papers. The general idea of such papers is to learn a mapping such that mapped data points with same label lie close to each other and far from samples of other classes. To evaluate such techniques they report the accuracy of the KNN classifier on the generated embedding. So my question is if we have a labelled dataset and we are interested in increasing the accuracy of classification task, why do not we learn a classifier on the original datapoints. I mean instead of finding a new embedding which suites KNN classifier, we can learn a classifier that fits the (not embedded) datapoints. Based on what I have read so far the classification accuracy of such classifiers is much better than metric learning approaches. Is there a study that shows metric learning+KNN performs better than fitting a (good) classifier at least on some datasets?
Metric learning models CAN BE classifiers. So I will answer the question that why do we need metric learning for classification.
Let me give you an example. When you have a dataset of millions of classes and some classes have only limited examples, let's say less than 5. If you use classifiers such as SVMs or normal CNNs, you will find it impossible to train because those classifiers (discriminative models) will totally ignore the classes of few examples.
But for the metric learning models, it is not a problem since they are based on generative models.
By the way, the large number of classes is a challenge for discriminative models itself.
The real-life challenge inspires us to explore more better models.
As #Tengerye mentioned, you can use models trained using metric learning for classification. KNN is the simplest approach but you can take the embeddings of your data and train another classifier, be it KNN, SVM, Neural Network, etc. The use of metric learning, in this case, would be to change the original input space to another one which would be easier for a classifier to handle.
Apart from discriminative models being hard to train when data is unbalanced, or even worse, have very few examples per class, they cannot be easily extended for new classes.
Take for example facial recognition, if facial recognition models are trained as classification models, these models would only work for the faces it has seen and wouldn't work for any new face. Of course, you could add images for the faces you wish to add and retrain the model or fine-tune the model if possible, but this is highly impractical. On the other hand, facial recognition models trained using metric learning can generate embeddings for new faces, which can be easily added to the KNN and your system then can identify the new person given his/her image.
I am a beginner in machine learning and recently read about supervised and unsupervised machine learning. It looks like supervised learning is synonymous to classification and unsupervised learning is synonymous to clustering, is it so?
No.
Supervised learning is when you know correct answers (targets). Depending on their type, it might be classification (categorical targets), regression (numerical targets) or learning to rank (ordinal targets) (this list is by no means complete, there might be other types that I either forgot or unaware of).
On the contrary, in unsupervised learning setting we don't know correct answers, and we try to infer, learn some structure from data. Be it cluster number or low-dimensional approximation (dimensionality reduction, actually, one might think of clusterization as of extreme 1D case of dimensionality reduction). Again, this might be far away from completeness, but the general idea is about hidden structure, that we try to discover from data.
Supervised learning is when you have labeled training data. In other words, you have a well-defined target to optimize your method for.
Typical (supervised) learning tasks are classification and regression: learning to predict categorial (classification), numerical (regression) values or ranks (learning to rank).
Unsupservised learning is an odd term. Because most of the time, the methods aren't "learning" anything. Because what would they learn from? You don't have training data?
There are plenty of unsupervised methods that don't fit the "learning" paradigm well. This includes dimensionality reduction methods such as PCA (which by far predates any "machine learning" - PCA was proposed in 1901, long before the computer!). Many of these are just data-driven statistics (as opposed to parameterized statistics). This includes most cluster analysis methods, outlier detection, ... for understanding these, it's better to step out of the "learning" mindset. Many people have trouble understanding these approaches, because they always think in the "minimize objective function f" mindset common in learning.
Consider for example DBSCAN. One of the most popular clustering algorithms. It does not fit the learning paradigm well. It can nicely be interpreted as a graph-theoretic construct: (density-) connected components. But it doesn't optimize any objective function. It computes the transitive closure of a relation; but there is no function maximized or minimized.
Similarly APRIORI finds frequent itemsets; combinations of items that occur more than minsupp times, where minsupp is a user parameter. It's an extremely simple definition; but the search space can be painfully large when you have large data. The brute-force approach just doesn't finish in acceptable time. So APRIORI uses a clever search strategy to avoid unnecessary hard disk accesses, computations, and memory. But there is no "worse" or "better" result as in learning. Either the result is correct (complete) or not - nothing to optimize on the result (only on the algorithm runtime).
Calling these methods "unsupervised learning" is squeezing them into a mindset that they don't belong into. They are not "learning" anything. Neither optimizes a function, or uses labels, or uses any kind of feedback. They just SELECT a certain set of objects from the database: APRIORI selects columns that frequently have a 1 at the same time; DBSCAN select connected components in a density graph. Either the result is correct, or not.
Some (but by far not all) unsupervised methods can be formalized as an optimization problem. At which point they become similar to popular supervised learning approaches. For example k-means is a minimization problem. PCA is a minimization problem, too - closely related to linear regression, actually. But it is the other way around. Many machine learning tasks are transformed into an optimization problem; and can be solved with general purpose statistical tools, which just happen to be highly popular in machine learning (e.g. linear programming). All the "learning" part is then wrapped into the way the data is transformed prior to feeding it into the optimizer. And in some cases, like for PCA, a non-iterative way to compute the optimum solution was found (in 1901). So in these cases, you don't need the usual optimization hammer at all.
I am working on binary classification of data and I want to know the advantages and disadvantages of using Support vector machine over decision trees and Adaptive Boosting algorithms.
Something you might want to do is use weka, which is a nice package that you can use to plug in your data and then try out a bunch of different machine learning classifiers to see how each works on your particular set. It's a well-tread path for people who do machine learning.
Knowing nothing about your particular data, or the classification problem you are trying to solve, I can't really go beyond just telling you random things I know about each method. That said, here's a brain dump and links to some useful machine learning slides.
Adaptive Boosting uses a committee of weak base classifiers to vote on the class assignment of a sample point. The base classifiers can be decision stumps, decision trees, SVMs, etc.. It takes an iterative approach. On each iteration - if the committee is in agreement and correct about the class assignment for a particular sample, then it becomes down weighted (less important to get right on the next iteration), and if the committee is not in agreement, then it becomes up weighted (more important to classify right on the next iteration). Adaboost is known for having good generalization (not overfitting).
SVMs are a useful first-try. Additionally, you can use different kernels with SVMs and get not just linear decision boundaries but more funkily-shaped ones. And if you put L1-regularization on it (slack variables) then you can not only prevent overfitting, but also, you can classify data that isn't separable.
Decision trees are useful because of their interpretability by just about anyone. They are easy to use. Using trees also means that you can also get some idea of how important a particular feature was for making that tree. Something you might want to check out is additive trees (like MART).