I've read some documentation on how AdaBoost works but have some questions regarding it.
I've also read that AdaBoost picks the best features from the data, in addition to weighting weak classifiers, and uses them in the testing phase to perform classification efficiently.
How does AdaBoost pick the best features from the data?
Correct me if my understanding of AdaBoost is wrong!
In some cases the weak classifiers in AdaBoost are (almost) equivalent to features. In other words, using a single feature to classify can give slightly-better-than-random performance, so that feature can be used as a weak classifier. AdaBoost will find the set of best weak classifiers given the training data, so if the weak classifiers correspond to features, you get an indication of the most useful features.
A classic example of weak classifiers that resemble features is the decision stump.
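Below is a minimal sketch of this idea with scikit-learn (assumed here, not part of the original question; in older scikit-learn versions the parameter is base_estimator rather than estimator): each weak classifier is a depth-1 tree, so the feature it splits on can be read back from the fitted ensemble.

    # Minimal sketch (scikit-learn assumed): AdaBoost over decision stumps,
    # so each weak classifier corresponds to one feature plus one threshold.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    stump = DecisionTreeClassifier(max_depth=1)   # weak classifier ~ a single feature
    clf = AdaBoostClassifier(estimator=stump, n_estimators=50).fit(X, y)

    # Each fitted stump stores the index of the feature it split on at the root,
    # so the set of "useful" features falls out of the ensemble directly.
    chosen = {est.tree_.feature[0] for est in clf.estimators_ if est.tree_.feature[0] >= 0}
    print("features used by the weak classifiers:", sorted(chosen))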
OK, AdaBoost selects features through its base learner, typically a decision tree. For a single tree there are several ways to estimate how much a given feature contributes to the tree, sometimes called its relative importance. For AdaBoost, an ensemble method containing several such trees, the relative importance of each feature to the final model can be computed by measuring the importance of that feature in each tree and then averaging.
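As a rough illustration of that averaging (a scikit-learn sketch, not part of the original answer), the ensemble's feature_importances_ attribute is effectively the estimator-weighted average of the per-tree importances:

    # Sketch (scikit-learn assumed): relative importance of each feature,
    # averaged over the boosted trees and weighted by each tree's contribution.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import AdaBoostClassifier

    X, y = load_breast_cancer(return_X_y=True)
    model = AdaBoostClassifier(n_estimators=100).fit(X, y)

    # Print the five most important features by averaged relative importance.
    for idx in model.feature_importances_.argsort()[::-1][:5]:
        print(f"feature {idx}: importance {model.feature_importances_[idx]:.4f}")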
Hope this can help you.
I want to compare different error rates of different classifiers with the error rate from a weak learner (better than random guessing). So, my question is, what are a few choices for a simple, easy to process weak learner? Or, do I understand the concept incorrectly, and is a weak learner simply any benchmark that I choose (for example, a linear regression)?
better than random guessing
That is basically the only requirement for a weak learner. So long as you can consistently beat random guessing, any true boosting algorithm will be able to increase the accuracy of the final ensemble. Which weak learner you should choose is then a trade-off between three factors:
The bias of the model. A lower bias is almost always better, but you don't want to pick something that will overfit (yes, boosting can and does overfit)
The training time for the weak learner. Generally we want to be able to learn a weak learner quickly, as we are going to be building a few hundred (or thousand) of them.
The prediction time for our weak learner. If we use a model that has a slow prediction rate, our ensemble of them is going to be a few hundred times slower!
The classic weak learner is a decision tree. By changing the maximum depth of the tree, you can control all three factors. This makes decision trees incredibly popular for boosting. What you should use depends on your individual problem, but decision trees are a good starting point.
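For illustration, here is a rough sketch of how the max_depth knob trades training cost against accuracy (scikit-learn and synthetic data assumed, not part of the original answer; older scikit-learn versions name the parameter base_estimator rather than estimator):

    # Sketch: tree depth controls the bias / training-cost / prediction-cost trade-off.
    import time
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

    for depth in (1, 2, 4):
        weak = DecisionTreeClassifier(max_depth=depth)
        start = time.time()
        clf = AdaBoostClassifier(estimator=weak, n_estimators=200).fit(Xtr, ytr)
        print(f"max_depth={depth}  fit time={time.time() - start:.2f}s  "
              f"test accuracy={clf.score(Xte, yte):.3f}")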
NOTE: So long as the algorithm supports weighted data instances, any algorithm can be used for boosting. A guest speaker at my university was boosting 5-layer deep neural networks for his work in computational biology.
Weak learners are basically thresholds on individual features. One simple example is a one-level decision tree, called a decision stump, used in bagging or boosting. It just chooses a threshold for one feature and splits the data on that threshold (for example, to determine whether an iris flower is Iris versicolor or Iris virginica based on petal width). The stump for that specific feature is then trained by bagging or AdaBoost.
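Here is a hand-rolled sketch of such a stump on that iris example (petal width is column 3 of the standard iris data; the brute-force threshold search is just for illustration):

    # Toy decision stump: threshold petal width to separate versicolor from virginica.
    import numpy as np
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)
    keep = y > 0                        # drop setosa, keep versicolor (1) and virginica (2)
    petal_width, labels = X[keep, 3], y[keep]

    best_threshold, best_accuracy = None, 0.0
    for t in np.unique(petal_width):    # try every observed value as the split point
        predictions = np.where(petal_width <= t, 1, 2)
        accuracy = (predictions == labels).mean()
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = t, accuracy

    print(f"threshold = {best_threshold:.2f} cm, training accuracy = {best_accuracy:.3f}")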
What is the effect of boosting with a strong classifier (instead of a weak one, with error rate close to random)? Could it be that a strong classifier performs better by itself than when it is used in AdaBoost along with a bunch of weak classifiers?
Yes, it is possible. It all depends on your training dataset. Consider the no-free-lunch theorem: there is always a dataset that doesn't suit a particular algorithm or heuristic (or even a combination of them).
Things get more interesting with boosting when you use algorithms with the same error rate on different datasets. Whether the classifier is strong or weak doesn't change the benefit of boosting. But the theorem at the foundation of boosting sets a lower limit of weak classifiers: if your classifiers are worse than weak (no better than random), it won't work.
In my experience, I have never found a problem with a classifier so good/strong that no other classifier (better than random) could improve its performance through boosting on some dataset.
Is there a way to extract the features corresponding to the weak learners from the AdaBoost algorithm implemented in OpenCV?
I know that adaboost combines a set of weak learners based on a set of input features.
The same features are measured for each sample in the training set.
Usually AdaBoost uses decision stumps: it sets a threshold for each feature and chooses the decision stump with the minimum error. I want to find out which features generated the weak learners.
Thanks.
You simply have to save the model and extract the trees/stumps from the text file.
The save() API is quite simple to use. In the file you will find items like this:
"splits:
- { var:448, quality:5.0241161137819290e-002,
le:1.7250000000000000e+002 }"
The number next to "var" is the feature index and the "le" is the "less than" value for this feature.
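For example, here is a rough sketch in Python of saving a boosted model and pulling the var indices back out of the saved file (the cv2.ml.Boost API is assumed here; older OpenCV versions expose the same idea through CvBoost, and the toy data below is made up):

    # Sketch: train an OpenCV boosted classifier, save it, and search the saved
    # splits for "var:<index>" to recover the features used by the weak learners.
    import re
    import numpy as np
    import cv2

    samples = np.random.rand(200, 10).astype(np.float32)      # toy feature vectors
    responses = (samples[:, 3] > 0.5).astype(np.int32)        # label driven by feature 3

    boost = cv2.ml.Boost_create()
    boost.train(samples, cv2.ml.ROW_SAMPLE, responses)
    boost.save("boost_model.yml")

    with open("boost_model.yml") as f:
        text = f.read()
    used_features = sorted({int(v) for v in re.findall(r"var:\s*(\d+)", text)})
    print("feature indices used by the weak learners:", used_features)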
I would like to classify text documents into four categories. I also have a lot of samples which are already classified that can be used for training. I would like the algorithm to learn on the fly. Please suggest an optimal algorithm that works for this requirement.
If by "on the fly" you mean online learning (where training and classification can be interleaved), I suggest the k-nearest neighbor algorithm. It's available in Weka and in the package TiMBL.
A perceptron will also be able to do this.
"Optimal" isn't a well-defined term in this context.
There are several algorithms that can learn on the fly. Examples: k-nearest neighbors, naive Bayes, neural networks. You can try each of these methods on a sample corpus to see how appropriate it is.
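Here is a small sketch of the interleaved train/predict loop with an incrementally trainable naive Bayes (scikit-learn assumed; the four category names and the documents are made up for illustration):

    # Online text classification sketch: hash the documents into features and
    # update a multinomial naive Bayes model as labelled examples arrive.
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.naive_bayes import MultinomialNB

    categories = ["sports", "politics", "tech", "finance"]        # hypothetical labels
    vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
    classifier = MultinomialNB()

    # First batch of labelled documents; classes must be given on the first call.
    docs = ["the team won the final match", "parliament passed the new bill"]
    labels = ["sports", "politics"]
    classifier.partial_fit(vectorizer.transform(docs), labels, classes=categories)

    # Classification and further training can now be interleaved freely.
    print(classifier.predict(vectorizer.transform(["a new phone was released today"])))
    classifier.partial_fit(vectorizer.transform(["stock markets fell sharply"]), ["finance"])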
Since you have unlabeled data you might want to use a model where this helps. The first thing that comes to my mind is nonlinear NCA: Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure (Salakhutdinov, Hinton).
Well, I have to say that document classification is a bit different from what you are thinking.
Typically, in document classification, the data after preprocessing is extremely large, so an algorithm with, say, O(N^2) complexity may be too computationally expensive.
Another typical classifier that comes to mind is a discriminative classifier, which doesn't need a generative model of your dataset. After training, all you have to do is feed a single entry to the algorithm and it will be classified.
Good luck with this. For example, you can check E. Alpaydin's book, Introduction to Machine Learning.
How do you find the negative and positive training data sets of Haar features for the AdaBoost algorithm? So say you have a certain type of blob that you want to locate in an image and there are several of them in your entire array - how do you go about training it? I'd appreciate a nontechnical explanation as much as possible. I'm new to this. Thanks.
First, AdaBoost does not necessarily have anything to do with Haar features. AdaBoost is a learning algorithm that combines weak learners to form a strong learner. Haar features are just a type of data on which an AdaBoost algorithm can learn.
Second, the best way to get them is to prearrange your data. So, if you want to do facial recognition a la Viola and Jones, you'll want to mark the faces in your images in a mask/overlay image. When you're training, you select samples from the image, as well as whether the sample you select is positive or negative. That positivity/negativity comes from your previous marking of the face (or whatever) in the image.
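Here is a rough, purely illustrative sketch of that sampling step (all names are hypothetical; the mask is the overlay you marked by hand):

    # Crop random patches from an image and label each positive or negative
    # according to whether the hand-made mask marks the patch centre.
    import numpy as np

    def sample_patches(image, mask, patch_size=24, n_samples=500, seed=0):
        rng = np.random.default_rng(seed)
        height, width = image.shape[:2]
        patches, labels = [], []
        for _ in range(n_samples):
            y = rng.integers(0, height - patch_size)
            x = rng.integers(0, width - patch_size)
            patches.append(image[y:y + patch_size, x:x + patch_size])
            centre_on = mask[y + patch_size // 2, x + patch_size // 2] > 0
            labels.append(1 if centre_on else 0)   # 1 = positive sample, 0 = negative
        return np.array(patches), np.array(labels)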
You'll have to do the actual implementation yourself, but you can use existing projects either to guide you or as a base that you modify.