I'm a machine learning newbie trying to understand how Adaboost works.
I've read many articles explaining how Adaboost makes use of a set of weak *classifiers* to create a strong classifier.
However, I seem to have a problem understanding the statement that "Adaboost creates a Strong Classifier".
When I looked at implementations of Adaboost, I realized that it doesn't "actually" create a strong classifier, but rather figures out, for the testing phase, how to use the set of weak classifiers to get more accurate results, so that they act like a strong classifier "collectively".
So technically there is no single strong classifier created (just a set of weak classifiers that collectively act as a strong classifier).
Please correct me if I'm wrong. It would be nice if someone can throw in some comments regarding this.
A classifier is a black box that receives an input (a feature vector) and returns an output (a class label). So to call something a classifier, you only care about what it does, not how it does it. AdaBoost's classifier can be seen as such a black box, so it is indeed a single classifier, even if internally it uses several weak classifiers to produce that output.
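To make that concrete, here is a rough sketch of what that black box computes (placeholder names only: `weak_classifiers` stands for whatever weak learners training produced, and `alphas` for the weights it assigned them):

```python
import numpy as np

def adaboost_predict(x, weak_classifiers, alphas):
    """The 'strong' classifier is just a weighted vote over the trained weak classifiers.

    weak_classifiers : list of callables, each mapping a sample x to -1 or +1
    alphas           : list of floats, the weights assigned during training
    """
    weighted_vote = sum(a * h(x) for h, a in zip(weak_classifiers, alphas))
    return np.sign(weighted_vote)  # a single output label, like any other classifier
```

So from the outside it is one classifier, H(x) = sign(sum_t alpha_t * h_t(x)); the weak classifiers and their weights are just its internals.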
I am trying to determine the optimal group of variables for a classification task. Sometimes, instead of a group of variables, only a single variable ends up being selected (the data looked pretty weak when examining each variable on its own).
I used several classifiers (Random Forest, Logistic regression, SVM) and I have a small problem understanding the results (the best results were achieved with RF).
Can someone with a deeper conceptual understanding of random forests than I have please explain what a random forest using one variable is doing? Since there is only one variable, it is hard for me to see how the random forest can achieve a better sens/spec than that single variable can ever achieve alone (which it does). Is the RF, in this case, just a decision tree? I was thinking that it might be, and after testing I observed that all the scores (accuracy, F1, precision, recall) were the same for the two of them.
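A comparison along these lines could look like this (sketch only; `X_one` is assumed to be an (n_samples, 1) array holding just that single variable and `y` the class labels):

```python
# Sketch: compare a random forest that sees only one variable with a single decision tree.
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(n_estimators=500, random_state=0)
tree = DecisionTreeClassifier(random_state=0)

print("RF   mean CV accuracy:", cross_val_score(rf, X_one, y, cv=5).mean())
print("Tree mean CV accuracy:", cross_val_score(tree, X_one, y, cv=5).mean())
```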
Thanks for the help.
When making a predictive model (specifically in telecommunications, regarding churn), is it essential to have a 1:1 split between the classes in the training set (the actual distribution is more like 1:50)? When reading about what other people have done, this seems to be the case, but they don't necessarily state it as a requirement. What is recommended?
Your problem is frequently referred to as "Class Imbalance". Whether and how it will impact your result depends on the algorithm and the evaluation metric you use. The logistic regression algorithm, and the model accuracy, for example, can be very susceptible to this problem. Simple envelope models, and the model AUC, on the other hand, are more resilient against class imbalance. I am aware of five broad possible approaches to deal with this:
1) Up-sampling: Basically artificially increase the number of the rare class. This may be the go-to solution when you have very little data but you are confident that it is quite representative of the wider population.
2) Down-sampling: Just leave out a part of the abundant class. This is an option when you have a very large quantity of data.
3) Weighting: Telling your algorithm to give more importance to the information obtained from the rare class.
4) Bagging: Here, you are randomly sub-sampling your data and fitting "weak" learners to each subsample. Later, these weak learners are aggregated to create one final prediction.
5) Boosting: Similar to bagging, but each "weak" learner is not agnostic to the previously fitted ones; instead, each new learner is fitted to the errors (e.g. residuals or re-weighted samples) of the ensemble built so far.
There is a really nice article here that goes through these in great detail, including some worked examples in R, and another one here which focuses more on Python.
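For what it's worth, a minimal scikit-learn sketch of options 2 (down-sampling) and 3 (weighting) could look like this, assuming a pandas DataFrame `df` with a binary `churn` target column (all names are placeholders):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Option 2: down-sample the abundant class so both classes have the same size.
churners = df[df["churn"] == 1]
non_churners = df[df["churn"] == 0].sample(n=len(churners), random_state=0)
balanced = pd.concat([churners, non_churners]).sample(frac=1, random_state=0)
X_bal, y_bal = balanced.drop(columns="churn"), balanced["churn"]
model_downsampled = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

# Option 3: keep the original ~1:50 data but re-weight the classes in the loss.
X, y = df.drop(columns="churn"), df["churn"]
model_weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```

Whichever option you pick, evaluate with an imbalance-robust metric such as AUC rather than plain accuracy.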
I'm using a multiclass classifier (a Support Vector Machine, via One-Vs-All) to classify data samples. Let's say I currently have n distinct classes.
However, in the scenario I'm facing, it is possible that a new data sample may belong to a new class n+1 that hasn't been seen before.
So I guess you can say that I need a form of Online Learning, as there is no distinct training set in the beginning that suits all data appearing later. Instead I need the SVM to adapt dynamically to new classes that may appear in the future.
So I'm wondering if and how I can...
identify that a new data sample does not quite fit into the existing classes but instead should result in creating a new class.
integrate that new class into the existing classifier.
I can vaguely think of a few ideas that might be approaches to solve this problem:
If none of the binary SVM classifiers (as I have one for each class in the OVA case) predicts a fairly high probability (e.g. > 0.5) for the new data sample, I could assume that this new data sample may represent a new class.
I could train a new binary classifier for that new class and add it to the multiclass SVM.
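A sketch of what I mean (purely illustrative; `binary_svms` would be the per-class SVMs trained with probability estimates enabled, and the 0.5 cutoff is arbitrary):

```python
import numpy as np

def classify_or_flag_new(x, binary_svms, threshold=0.5):
    """Return the index of the best-matching class, or -1 if no one-vs-all SVM
    is confident enough (a hint that x may belong to a new class)."""
    # probability of the positive ("this class") label from each binary SVM
    probs = np.array([svm.predict_proba(x.reshape(1, -1))[0, 1] for svm in binary_svms])
    if probs.max() < threshold:
        return -1  # candidate for a new class n+1
    return int(probs.argmax())
```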
However, these are just my naive thoughts. I'm wondering if there is some "proper" approach for this instead, e.g. using a clustering algorithm to find all classes.
Or maybe my approach of trying to use an SVM for this is not even appropriate for this kind of problem?
Help on this is greatly appreciated.
As in any other machine learning problem, if you do not have a quality criterion, you suck.
When people say "classification", they have supervised learning in mind: there is some ground truth against which you can train and check your algorithms. If new classes can appear, this ground truth is ambiguous. Imagine one class is "horse", and you see many horses: black horses, brown horses, even white ones. And suddenly you see a zebra. Whoa! Is it a new class or just an unusual horse? The answer will depend on how you are going to use your class labels. The SVM itself cannot decide, because SVM does not use these labels, it only produces them. The decision is up to a human (or to some decision-making algorithm which knows what is "good" and "bad", that is, has its own "loss function" or "utility function").
So you need a supervisor. But how can you assist this supervisor? Two options come to mind:
Anomaly detection. This can help you with early occurrences of new classes. After the very first zebra it sees, your algorithm can raise an alarm: "There is something unusual!". For example, in sklearn various algorithms, from isolation forest to one-class SVM, can be used to detect unusual observations. Then your supervisor can look at them and decide whether they deserve to form an entirely new class.
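For instance, a minimal sklearn sketch (the contamination/nu values and the `X_train` / `X_new` feature matrices are placeholder assumptions):

```python
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

# Fit the detectors on the data you currently trust (the known classes).
iso = IsolationForest(contamination=0.01, random_state=0).fit(X_train)
ocsvm = OneClassSVM(nu=0.01, gamma="scale").fit(X_train)

# Both return -1 for "unusual" observations and +1 for "normal" ones.
suspicious = (iso.predict(X_new) == -1) | (ocsvm.predict(X_new) == -1)
# Anything flagged here goes to the supervisor for a decision.
```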
Clustering. It can help you make decisions about splitting your classes. For example, after the first zebra, you decided it was not worth making a new class. But over time, your algorithm has accumulated dozens of such images. So if you run a clustering algorithm on all the observations labeled as "horses", you might end up with two well-separated clusters. And it will again be up to the supervisor to decide whether the striped horses should be detached from the plain ones into a new class.
If you want this decision to be purely automatic, you can split classes if the ratio of within-cluster mean distance to between-cluster distance is low enough. But it will work well only if you have a good distance metric in the first place. And what is "good" is again defined by how you use your algorithms and what your ultimate goal is.
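A rough sketch of such an automatic rule, using the silhouette score as a convenient stand-in for that within/between distance ratio (assuming `X_horses` holds the feature vectors currently labeled "horse"; the 2-cluster choice and the 0.6 cutoff are arbitrary assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Try splitting the "horse" class into two candidate sub-clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_horses)

# The silhouette score compares within-cluster to between-cluster distances
# (values near 1.0 mean the clusters are well separated).
score = silhouette_score(X_horses, labels)
if score > 0.6:
    print("Well-separated sub-clusters found; propose a new class to the supervisor.")
```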
I'm building a recognizer of antibodies in blood-cell images. It is based on libsvm. The prototype works well when it comes to recognizing an instance which belongs to one of the trained classes.
But when I give it any image, even one not containing blood cells (e.g. the microscope had a bad offset/focus), it still suggests one of the classes known by the model.
I first considered implementing an "Unknown" class, but I'm afraid training it with all possible noise images would make the model's performance worse.
So my idea is to check whether one or several features of an instance to be recognized fall outside the value range seen in training, and discard it if so.
Is it a good method?
If yes, how should the cut-off be selected (e.g. in terms of standard deviations)?
Thank you very much!
In problems with "possible non-class samples", the most obvious solution seems to be to create a one-class SVM (an outlier detection algorithm) in one of two ways:
Train two one-class SVMs (one per class) and discard samples marked by both models as "outliers"
Train one one-class SVM on the whole dataset (instances of both classes) and discard data marked as outlier
The suggested "out of range check" approach is good as long as there is an obvious threshold value; the fact that you are asking here what the best choice would be suggests that it is not a good way. If you cannot (as an expert) figure it out by yourself, it seems a much better and safer option to train an outlier detection method as suggested above, which will actually do the same thing, but in an automatic fashion (it will find the rules for discarding "bad data" without being trained on any "bad images").
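A minimal sketch of the second option, using scikit-learn's libsvm-based OneClassSVM (the `nu` value, `X_train`, and the model/feature names are placeholder assumptions you would tune to your data):

```python
from sklearn.svm import OneClassSVM

# Fit the outlier detector on all training features (instances of both classes).
detector = OneClassSVM(nu=0.05, gamma="scale").fit(X_train)

def classify_with_rejection(features, multiclass_model, detector):
    """Reject the sample before asking the antibody classifier about it."""
    if detector.predict(features.reshape(1, -1))[0] == -1:
        return "Unknown"  # likely not a usable blood-cell image (bad focus, no cells, ...)
    return multiclass_model.predict(features.reshape(1, -1))[0]
```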
I've read some documentation on how Adaboost works but have some questions regarding it.
I've also read that, apart from weighting weak classifiers, Adaboost also picks the best features from the data and uses them in the testing phase to perform classification efficiently.
How does Adaboost pick the best features from the data?
Correct me if my understanding of Adaboost is wrong!
In some cases the weak classifiers in Adaboost are (almost) equal to features. In other words, using a single feature to classify can result in slightly better than random performance, so it can be used as a weak classifier. Adaboost will find the set of best weak classifiers given the training data, so if the weak classifiers are equal to features then you will have an indication of the most useful features.
An example of weak classifiers resembling features are decision stumps.
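For instance, a decision stump is a one-level tree: it looks at a single feature and a single threshold, so each weak classifier essentially "is" one feature. A minimal sketch (all names and values are illustrative):

```python
def stump_predict(x, feature_index, threshold, polarity=1):
    """A decision stump: classify using one feature and one threshold only."""
    return polarity if x[feature_index] > threshold else -polarity

# e.g. a stump that votes +1 whenever feature 3 exceeds 0.7
print(stump_predict([0.1, 0.2, 0.5, 0.9], feature_index=3, threshold=0.7))  # -> 1
```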
OK, Adaboost selects features based on its base learner, typically a tree. For a single tree, there are several ways to estimate how much a single feature contributes to that tree, often called its relative importance. For Adaboost, an ensemble method containing several such trees, the relative importance of each feature to the final model can be calculated by measuring the importance of that feature in each tree and then averaging across the trees.
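If you happen to use scikit-learn, this averaged relative importance is already exposed as `feature_importances_` (a sketch, assuming a feature matrix `X` and labels `y`; the default base learner is a depth-1 decision tree):

```python
from sklearn.ensemble import AdaBoostClassifier

# The default base estimator is a decision stump (a depth-1 tree).
model = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)

# Per-feature relative importance, averaged over the boosted trees.
for i, importance in enumerate(model.feature_importances_):
    print(f"feature {i}: {importance:.3f}")
```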
Hope this can help you.