What is the effect of boosting with a strong classifier (instead of a weak one, with error rate close to random)? Could it be possible that a strong classifier performs better by itself than when it is used in AdaBoost along with a bunch of weak classifiers?
Yes, it is possible. It all depends on your training dataset. Look at the no-free-lunch theorem: there is always a dataset that doesn't fit a particular algorithm or heuristic (even a combination of those).
Things get more interesting with boosting when you use algorithms with similar error rates on different datasets. Whether the classifiers are strong or weak doesn't change the benefit of boosting. But the theorem at the foundation of boosting specifies that its lower limit is a set of weak classifiers: if your base classifiers are worse than weak (i.e., no better than random), it won't work.
In my experience, I have never found a problem with a classifier so good/strong that boosting it together with other classifiers (better than random) did not improve performance on some dataset.
I have heard Haar-like features described as weak descriptors, and that the AdaBoost method is advantageous over SVM in this case because of this. My question is: what are weak and strong descriptors, and why does a boosting method perform better than, for example, an SVM?
A weak descriptor would be something that is not very refined or tuned (e.g. Haar features, edge maps, etc.). A strong descriptor (SIFT/SURF/MSER) would be something that is accurate and has high repeatability under blur, viewpoint/illumination change, and JPEG compression. A boosting method performs better with weak descriptors, while an SVM is more suitable for a strong descriptor. This is because the idea of boosting is to combine a lot of weak classifiers to learn a strong classifier. In the case of Haar-like features, AdaBoost combines many such weak features to learn a strong classifier. An SVM tries to fit a hyperplane between the most confusing features of the two classes, so for an SVM to perform better, the confusion between the classes should be low and the features should be robust and accurate.
I want to compare different error rates of different classifiers with the error rate from a weak learner (better than random guessing). So, my question is, what are a few choices for a simple, easy to process weak learner? Or, do I understand the concept incorrectly, and is a weak learner simply any benchmark that I choose (for example, a linear regression)?
better than random guessing
That is basically the only requirement for a weak learner. So long as you can consistently beat random guessing, any true boosting algorithm will be able to increase the accuracy of the final ensemble. Which weak learner you should choose is then a trade-off between three factors:
The bias of the model. Lower bias is almost always better, but you don't want to pick something that will overfit (yes, boosting can and does overfit).
The training time for the weak learner. Generally we want to be able to learn a weak learner quickly, as we are going to be building a few hundred (or thousand) of them.
The prediction time for our weak learner. If we use a model that has a slow prediction rate, our ensemble of them is going to be a few hundred times slower!
The classic weak learner is a decision tree. By changing the maximum depth of the tree, you can control all three factors. This makes them incredibly popular for boosting. What you should be using depends on your individual problem, but decision trees are a good starting point.
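As a concrete illustration, here is a minimal sketch using Weka; it is only a sketch under assumptions: the iteration count is an arbitrary placeholder, and the fragment is meant to run inside a method that declares throws Exception, with "training" being an already-loaded weka.core.Instances object.

import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.DecisionStump;

// Boost one-level decision trees (stumps): cheap to train, cheap to apply,
// and the ensemble's capacity is controlled through the number of rounds.
AdaBoostM1 boostedStumps = new AdaBoostM1();
boostedStumps.setClassifier(new DecisionStump());  // the weak learner
boostedStumps.setNumIterations(200);               // placeholder: number of boosting rounds
boostedStumps.buildClassifier(training);           // "training" is a weka.core.Instances object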
NOTE: Any algorithm can be used for boosting, so long as it supports weighted data instances. A guest speaker at my university was boosting 5-layer deep neural networks for his work in computational biology.
Weak learners are basically thresholds on individual features. One simple example is a one-level decision tree called a decision stump, applied in bagging or boosting. It just chooses a threshold for one feature and splits the data on that threshold (for example, to determine whether an iris flower is Iris versicolor or Iris virginica based on the petal width). Then it is trained on this specific feature by bagging or AdaBoost.
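To make that concrete, a decision stump reduces to a single threshold test; the sketch below is purely illustrative, and the 1.75 cm cut-off on petal width is an assumed value for the versicolor/virginica split, not something taken from the question.

// A decision stump in its simplest form: one feature, one threshold, two outputs.
// The threshold value is illustrative, not fitted to real data.
static String classifyIris(double petalWidthCm) {
    return petalWidthCm <= 1.75 ? "Iris-versicolor" : "Iris-virginica";
}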
In my bachelor thesis I am supposed to use AdaBoostM1 with a MultinomialNaiveBayes classifier on a text classification problem. The problem is that in most cases, the M1 performs worse than or equal to the MultinomialNaiveBayes without boosting.
I use the following code:
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.meta.AdaBoostM1;
AdaBoostM1 m1 = new AdaBoostM1();
m1.setClassifier(new NaiveBayesMultinomial()); // multinomial naive Bayes as the base learner
m1.buildClassifier(training);                  // training is a weka.core.Instances object
So I don't understand why AdaBoost would not be able to improve the results. Unfortunately, I couldn't find anything else about this on the web, as most people seem to be very satisfied with AdaBoost.
AdaBoost is a binary/dichotomous/2-class classifier, designed to boost a weak learner that is just better than 1/2 accuracy. AdaBoostM1 is an M-class classifier, but it still requires the weak learner to be better than 1/2 accuracy, whereas one would expect chance level to be around 1/M. Balancing/weighting is used to get equal-prevalence classes initially, but the reweighting inherent to AdaBoost can destroy this quickly. A solution is to base boosting on chance-corrected measures like Kappa or Informedness (AdaBook).
As M grows, e.g. with text classification, this mismatch grows, and thus a much stronger than chance classifier is needed. Thus with M=100, chance is about 1% but 50% minimum accuracy is needed by AdaBoostM1.
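To see why, recall the standard AdaBoost.M1 update (stated here from the usual formulation of the algorithm, not from this thread): each round's weighted error $\epsilon_t$ must satisfy $\epsilon_t < 1/2$, because the weight given to that round's classifier is $\alpha_t = \ln\frac{1 - \epsilon_t}{\epsilon_t}$, which becomes zero or negative once $\epsilon_t \ge 1/2$, at which point the algorithm stops. Nothing in that condition depends on the number of classes M, which is exactly the mismatch described above.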
As base classifiers get stronger (viz. no longer barely above chance), the scope for boosting to improve things shrinks: the strong learner has already pulled us to a very specific part of the search space. It is increasingly likely to have overfitted to errors and outliers, so there is no room left to balance a wide variety of variants.
A number of resources on informedness (including Matlab code, xls sheets, and early papers) can be found at http://david.wardpowers.info/BM and a comparison with other chance-corrected kappa measures at http://aclweb.org/anthology-new/E/E12/E12-1035.pdf
A Weka implementation of, and experiments with, AdaBoost using Bookmaker informedness are available; contact the author.
It's hard to beat Naive Bayes on text classification. Furthermore, boosting was designed for weak classifiers with high bias, and that's where boosting performs well. Boosting decreases bias but increases variance. Hence, if you want the combination AdaBoost + Naive Bayes to outperform Naive Bayes, you need a big training data set and to cross the point where enlarging the training set no longer increases Naive Bayes's performance (while AdaBoost still benefits from the enlarged training data set).
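One way to check where you are on that curve is to compare the two models directly under cross-validation (and repeat on subsets of your corpus to see the learning-curve effect described above). The fragment below is only a sketch under assumptions: the ARFF path is a placeholder, and the code is meant to run inside a method that declares throws Exception.

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.meta.AdaBoostM1;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Load the text-classification data (placeholder path) and set the class attribute.
Instances data = DataSource.read("corpus.arff");
data.setClassIndex(data.numAttributes() - 1);

// Plain multinomial naive Bayes.
Classifier plainNB = new NaiveBayesMultinomial();

// The same model wrapped in AdaBoostM1.
AdaBoostM1 boostedNB = new AdaBoostM1();
boostedNB.setClassifier(new NaiveBayesMultinomial());

// 10-fold cross-validation of both; on a small dataset, expect the boosted
// version to match or trail the plain one, as discussed above.
for (Classifier c : new Classifier[] { plainNB, boostedNB }) {
    Evaluation eval = new Evaluation(data);
    eval.crossValidateModel(c, data, 10, new Random(1));
    System.out.println(c.getClass().getSimpleName() + ": " + eval.pctCorrect() + "% correct");
}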
You may want to read the following paper, which examines boosting applied to naive Bayes. It demonstrates that boosting does not improve the accuracy of the naive Bayesian classifier as much as is usually expected across a set of natural domains:
http://onlinelibrary.wiley.com/doi/10.1111/1467-8640.00219/abstract
Hope it provides a good insight.
I've read some documentation on how Adaboost works but have some questions regarding it.
I've also read that, apart from weighting weak classifiers, AdaBoost also picks the best features from the data and uses them in the testing phase to perform classification efficiently.
How does AdaBoost pick the best features from the data?
Correct me if my understanding of Adaboost is wrong!
In some cases the weak classifiers in Adaboost are (almost) equal to features. In other words, using a single feature to classify can result in slightly better than random performance, so it can be used as a weak classifier. Adaboost will find the set of best weak classifiers given the training data, so if the weak classifiers are equal to features then you will have an indication of the most useful features.
An example of weak classifiers resembling features are decision stumps.
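A rough sketch of that mechanism (plain Java, not tied to any library; all names are made up for illustration): in each boosting round, AdaBoost keeps the stump with the lowest weighted error, and the feature used by that stump is the feature "selected" in that round.

// One round of stump selection: labels y[i] are +1/-1, w[i] are the current
// AdaBoost example weights, thresholds[f] is a candidate cut for feature f.
static int bestFeature(double[][] x, int[] y, double[] w, double[] thresholds) {
    int best = -1;
    double bestErr = Double.MAX_VALUE;
    for (int f = 0; f < x[0].length; f++) {
        double err = 0.0;
        for (int i = 0; i < x.length; i++) {
            int pred = x[i][f] > thresholds[f] ? 1 : -1;  // stump built on feature f
            if (pred != y[i]) err += w[i];                // weighted training error
        }
        if (err < bestErr) { bestErr = err; best = f; }
    }
    return best;  // feature whose stump minimizes the weighted error this round
}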
OK, AdaBoost selects features based on its base learner, a tree. For a single tree, there are several ways to estimate how much a single feature contributes to the tree, sometimes called its relative importance. For AdaBoost, an ensemble method containing several such trees, the relative importance of each feature to the final model can be calculated by measuring the importance of each feature in each tree and then averaging.
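Written out (in my own notation, not the original answer's): if $I_t(j)$ denotes the relative importance of feature $j$ in the $t$-th tree, the ensemble-level importance is the average $I(j) = \frac{1}{T}\sum_{t=1}^{T} I_t(j)$ over the $T$ boosted trees (implementations may also weight each tree by its boosting weight).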
Hope this can help you.
In many applications, creating a large training dataset can be very costly, if not outright impossible. So what steps can one take to limit the size that is needed for good accuracy?
Well, there is a branch of machine learning specifically dedicated to this problem (labeling datasets is costly): semi-supervised learning.
Honestly, in my experience the computation is quite horrendously long and the results pale in comparison with fully labeled datasets... but it's better to train with a large unlabeled dataset than with nothing!
Edit: Well, I first understood the question as "labeling a dataset is expensive" rather than "the size of the dataset will be small no matter what".
Well, among other things, I would:
Tune my parameters with leave-one-out cross-validation. It is the most computationally expensive option, but the best one (see the sketch after this list).
Choose algorithms that converge rather quickly. (You would need a comparison table, which I do not have right now.)
Use algorithms with very good generalization properties. Linear combinations of weak classifiers are quite good in this case; kNN (k-nearest neighbours) is extremely bad.
Bias the "generalization" parameter. Most algorithms involve a compromise between generalization (regularity) and fit quality (is the training set well classified by the classifier?). If your dataset is small, you should bias the algorithm toward generalization (after tuning the parameters with cross-validation).
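For the leave-one-out point above, in Weka this simply means setting the number of folds equal to the number of instances. A minimal fragment, assuming "data" is an already-loaded weka.core.Instances object with its class index set and that the surrounding method declares throws Exception:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;

// Leave-one-out cross-validation: one fold per instance.
// AdaBoostM1's default base learner is a decision stump, i.e. a linear
// combination of weak classifiers as suggested above.
Evaluation loo = new Evaluation(data);
loo.crossValidateModel(new AdaBoostM1(), data, data.numInstances(), new Random(1));
System.out.println(loo.pctCorrect() + "% correct under leave-one-out CV");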