Using random forest as base classifier with AdaBoost

Can I use AdaBoost with random forest as a base classifier? I searched on the internet and didn't find anyone who does this.
Something like the following code; when I try to run it, it takes a lot of time:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

estimators = Pipeline([('vectorizer', CountVectorizer()),
                       ('transformer', TfidfTransformer()),
                       ('classifier', AdaBoostClassifier(learning_rate=1))])

RF = RandomForestClassifier(criterion='entropy', n_estimators=100, max_depth=500,
                            min_samples_split=100, max_leaf_nodes=None,
                            max_features='log2')

param_grid = {
    'vectorizer__ngram_range': [(1, 2), (1, 3)],
    'vectorizer__min_df': [5],
    'vectorizer__max_df': [0.7],
    'vectorizer__max_features': [1500],
    'transformer__use_idf': [True, False],
    'transformer__norm': ('l1', 'l2'),
    'transformer__smooth_idf': [True, False],
    'transformer__sublinear_tf': [True, False],
    'classifier__base_estimator': [RF],
    'classifier__algorithm': ("SAMME.R", "SAMME"),
    'classifier__n_estimators': [4, 7, 11, 13, 16, 19, 22, 25, 28, 31, 34, 43, 50]
}
I tried this with GridSearchCV, adding the RF classifier into the AdaBoost parameter grid.
If I use it, will the accuracy increase?
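For reference, the search itself would be wired up roughly like this; X_train and y_train are placeholders for the text corpus and its labels:

from sklearn.model_selection import GridSearchCV

# X_train: list of raw documents, y_train: class labels (placeholders)
grid = GridSearchCV(estimators, param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)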

No wonder you have not actually seen anyone doing it - it is an absurd and bad idea.
You are trying to build an ensemble (AdaBoost) which in itself consists of ensemble base classifiers (RFs) - essentially an "ensemble squared"; so, no wonder about the high computation time.
But even if it were practical, there are good theoretical reasons not to do it; quoting from my own answer to Execution time of AdaBoost with SVM base classifier:
AdaBoost (and similar ensemble methods) were conceived using decision trees as base classifiers (more specifically, decision stumps, i.e. DTs with a depth of only 1); there is a good reason why, still today, if you don't explicitly specify the base_estimator argument, it assumes a value of DecisionTreeClassifier(max_depth=1). DTs are suitable for such ensembling because they are essentially unstable classifiers, which is not the case with SVMs, hence the latter are not expected to offer much when used as base classifiers.
On top of this, SVMs are computationally much more expensive than decision trees (let alone decision stumps), which is the reason for the long processing times you have observed.
The argument holds for RFs, too - they are not unstable classifiers, hence there is no reason to expect performance improvements when using them as base classifiers for boosting algorithms like AdaBoost.
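To make the default concrete, here is a minimal sketch (note that in recent sklearn versions base_estimator has been renamed estimator); the two ensembles below are equivalent:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Leaving the base estimator unset boosts decision stumps (depth-1 trees)
ada_default = AdaBoostClassifier(n_estimators=50)
ada_stumps = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                                n_estimators=50)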

Short answer:
It's not impossible.
I don't know if there's anything wrong with doing so in theory, but I tried this once and the accuracy increased.
Long answer:
I tried it on a typical dataset with n rows of p real-valued features and a label list of length n. In case it matters, the features are embeddings of nodes in a graph obtained with the DeepWalk algorithm, and the nodes are categorized into two classes. I trained a few classification models on this data using 5-fold cross validation and measured common evaluation metrics for them (precision, recall, AUC, etc.). The models I used were SVM, logistic regression, random forest, a 2-layer perceptron, and AdaBoost with random forest base classifiers. The last model, AdaBoost with random forests, yielded the best results (95% AUC compared to the multilayer perceptron's 89% and random forest's 88%). Sure, the runtime has increased by a factor of, let's say, 100, but it's still about 20 minutes, so it's not a constraint for me.
Here's what I thought: firstly, I'm using cross validation, so there's probably no overfitting flying under the radar. Secondly, both are ensemble learning methods, but random forest is a bagging method, whereas AdaBoost is a boosting technique. Perhaps they're still different enough for their combination to make sense?
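A minimal sketch of that experiment, assuming the feature matrix X (e.g. the DeepWalk embeddings) and binary labels y are already loaded:

from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Boost a handful of small random forests; X and y are placeholders
ada_rf = AdaBoostClassifier(base_estimator=RandomForestClassifier(n_estimators=50),
                            n_estimators=10)
print(cross_val_score(ada_rf, X, y, cv=5, scoring='roc_auc').mean())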

Related

Gradient Boosting vs Random Forest

According to my understanding, RF selects features randomly and hence is hard to overfit. But in sklearn, Gradient Boosting also offers the max_features option, which can help prevent overfitting. So, why would anyone use Random Forest?
Can anyone explain when to use Gradient boosting vs Random forest based on the given data?
Any help is highly appreciated.
According to my personal experience, Random Forest could be a better choice when:
You train a model on a small data set.
Your data set has few features to learn from.
Your data set has a low positive-class count, i.e. you are trying to predict a situation that has a low chance of occurring or rarely occurs.
In these situations, Gradient Boosting algorithms like XGBoost and LightGBM can overfit (even when their parameters are tuned), while simpler algorithms like Random Forest or even Logistic Regression may perform better. To illustrate, for XGBoost and LightGBM the ROC AUC on the test set may be higher than Random Forest's, but show too large a gap from the ROC AUC on the training set.
Despite the sharp predictions from Gradient Boosting algorithms, in some cases Random Forest takes advantage of the model stability that comes from the bagging methodology (random selection) and outperforms XGBoost and LightGBM. However, Gradient Boosting algorithms perform better in general situations.
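A rough sketch of that train/test gap check, using sklearn's GradientBoostingClassifier as a stand-in for XGBoost/LightGBM on a small, imbalanced toy set; a large gap between train and test AUC is the overfitting signal:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Small data set with a rare positive class (~10%)
X, y = make_classification(n_samples=500, n_features=10, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for model in (GradientBoostingClassifier(), RandomForestClassifier()):
    model.fit(X_tr, y_tr)
    auc_tr = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    auc_te = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(type(model).__name__, round(auc_tr, 3), round(auc_te, 3))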
Similar question asked on Quora:
https://www.quora.com/How-do-random-forests-and-boosted-decision-trees-compare
I agree with the author at the link that random forests are more robust -- they don't require much problem-specific tuning to get good results. Besides that, a couple of other items based on my own experience:
Random forests can perform better on small data sets; gradient boosted trees are data hungry
Random forests are easier to explain and understand. This perhaps seems silly, but it can lead to better adoption of the model if it needs to be used by less technical people
I think that's also true. I have also read the page How Random Forest Works, which explains the advantages of random forest, such as:
For applications to classification problems, the Random Forest algorithm will avoid the overfitting problem.
The same random forest algorithm can be used for both classification and regression tasks.
The Random Forest algorithm can be used for identifying the most important features in the training dataset, in other words, feature engineering.

Balanced corpus for Naive Bayes Classifier

I'm working on sentiment analysis using an NB classifier. I've found some information (blogs, tutorials, etc.) saying that the training corpus should be balanced:
33.3% Positive
33.3% Neutral
33.3% Negative
My question is:
Why should the corpus be balanced? Bayes' theorem is built on class probabilities (priors). So for training purposes, isn't it important that in the real world, for example, negative tweets are only 10%, not 33.3%?
You are correct, balancing data is important for many discriminative models, but not really for NB.
However, it might still be beneficial to bias the P(y) estimators to get better predictive performance (since, due to the various simplifications such models use, the probability assigned to the minority class can be heavily underestimated). For NB this is not about balancing the data, but about literally modifying the estimated P(y) so that accuracy on the validation set is maximised.
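A minimal sketch of that idea with sklearn's MultinomialNB, whose class_prior argument overrides the priors estimated from the data; the candidate priors and the X_train/X_val splits are placeholders:

from sklearn.naive_bayes import MultinomialNB

# Keep whichever hand-set prior maximises accuracy on the validation set
best_prior, best_acc = None, 0.0
for p in ([0.1, 0.9], [0.3, 0.7], [0.5, 0.5]):
    nb = MultinomialNB(class_prior=p).fit(X_train, y_train)
    acc = nb.score(X_val, y_val)
    if acc > best_acc:
        best_prior, best_acc = p, acc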
In my opinion the best dataset for training purposes is a sample of the real-world data that your classifier will be used with.
This is true for all classifiers (though some of them are indeed not suited to unbalanced training sets, in which case you don't really have a choice but to skew the distribution), but particularly for probabilistic classifiers such as Naive Bayes. So the best sample should reflect the natural class distribution.
Note that this is important not only for the class prior estimates. Naive Bayes will calculate, for each feature, the likelihood of predicting the class given the feature. If your Bayesian classifier is built specifically to classify texts, it will use global document frequency measures (the number of times a given word occurs in the dataset, across all categories). If the number of documents per category in the training set doesn't reflect their natural distribution, the global term frequency of terms usually seen in infrequent categories will be overestimated, and that of frequent categories underestimated. Thus not only will the prior class probabilities be incorrect, but so will all the P(category=c|term=t) estimates.

Is there any classifier which is able to make decisions very fast?

Most classification algorithms are developed to improve training speed. However, is there any classifier or algorithm that focuses on decision-making speed (low computational complexity and a simple, realizable structure)? I can get enough training data, and I can endure a long training time.
There are many methods which classify quickly; you could more or less sort models by classification speed as follows (fastest first, slowest last):
Decision Tree (especially with limited depth)
Linear models (linear regression, logistic regression, linear svm, lda, ...) and Naive Bayes
Non-linear models based on explicit data transformation (Nystroem kernel approximation, RVFL, RBFNN, EEM), Kernel methods (such as kernel SVM) and shallow neural networks
Random Forest and other committees
Big Neural Networks (ie. CNN)
KNN with arbitrary distance
Obviously this list is not exhaustive, it just shows some general ideas.
One way of obtaining such a model is to build a complex, slow model, then use it as a black-box label generator to train a simpler model (on a potentially infinite training set) - thus getting a fast classifier at the cost of a very expensive training procedure. There are many works showing that one can do this, for example by training a shallow neural network on the outputs of a deep NN.
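A rough sketch of that black-box distillation idea, with a random forest as the slow teacher and logistic regression as the fast student; X_train, y_train and the unlabeled pool X_pool are placeholders:

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Train the slow, accurate teacher on the labeled data
teacher = RandomForestClassifier(n_estimators=500).fit(X_train, y_train)

# Label a (potentially huge) unlabeled pool with the teacher, then
# fit a fast linear model on the generated labels
pseudo_labels = teacher.predict(X_pool)
student = LogisticRegression(max_iter=1000).fit(X_pool, pseudo_labels)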
In general classification speed should not be a problem. Some exceptions are algorithms which have a time complexity depending on the number of samples you have for training. One example is k-Nearest-Neighbors which has no training time, but for classification it needs to check all points (if implemented in a naive way). Other examples are all classifiers which work with kernels since they compute the kernel between the current sample and all training samples.
Many classifiers work with a scalar product of the features and a learned coefficient vector. These should be fast enough in almost all cases. Examples are logistic regression, linear SVM, perceptrons, and many more. See lejlot's answer above for a nice list.
If these are still too slow you might try to reduce the dimension of your feature space first and then try again (this also speeds up training time).
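For instance, a quick pipeline along those lines might look like this (the number of components is an arbitrary choice):

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Project onto 50 principal components before fitting a fast linear model
fast_clf = make_pipeline(PCA(n_components=50), LogisticRegression())
fast_clf.fit(X_train, y_train)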
Btw, this question might not be suited for StackOverflow as it is quite broad and recommendation instead of problem oriented. Maybe try https://stats.stackexchange.com/ next time.
I have a decision tree which is represented in compressed form and which is at least 4 times faster than the actual tree at classifying an unseen instance.

weka AdaBoost does not improve results

In my bachelor thesis I am supposed to use AdaBoostM1 with a MultinomialNaiveBayes classifier on a text classification problem. The problem is that in most cases, AdaBoostM1 yields worse or at best equal results compared to MultinomialNaiveBayes without boosting.
I use the following code:
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.meta.AdaBoostM1;

AdaBoostM1 m1 = new AdaBoostM1();
m1.setClassifier(new NaiveBayesMultinomial());
m1.buildClassifier(training);  // 'training' is the weka Instances object
So I don't understand why AdaBoost is not able to improve the results. Unfortunately, I couldn't find anything else about this on the web, as most people seem to be very satisfied with AdaBoost.
AdaBoost is a binary/dichotomous/2-class classifier, designed to boost a weak learner that is just better than 1/2 accuracy. AdaBoostM1 is an M-class classifier but still requires the weak learner to be better than 1/2 accuracy, whereas one would expect chance level to be around 1/M. Balancing/weighting is used to get classes of equal prevalence initially, but the reweighting inherent to AdaBoost can destroy this quickly. A solution is to base boosting on chance-corrected measures like Kappa or Informedness (AdaBook).
As M grows, e.g. with text classification, this mismatch grows, and thus a much stronger than chance classifier is needed. Thus with M=100, chance is about 1% but 50% minimum accuracy is needed by AdaBoostM1.
As base classifiers get stronger (viz. no longer barely above chance) the scope for boosting to improve things reduces - it has already pulled us to a very specific part of the search space. It is increasingly likely to have overfitted to errors and outliers, so there is no scope to balance a wide variety of variants.
A number of resources on informedness (including matlab code and xls sheets and early papers) are here: http://david.wardpowers.info/BM A comparison with other chance-corrected kappa measures is here: http://aclweb.org/anthology-new/E/E12/E12-1035.pdf
A weka implementation and experimentation for Adaboost using Bookmaker informedness is available - contact author.
It's hard to beat Naive Bayes on text classification. Furthermore, boosting was designed for weak classifiers with high bias, and that's where boosting performs well. Boosting decreases bias but increases variance. Hence, if you want the combo AdaBoost + Naive Bayes to outperform Naive Bayes alone, you have to have a big training data set and cross the threshold where enlarging the training set no longer increases Naive Bayes's performance (while AdaBoost still benefits from the enlarged training data set).
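As a rough illustration of that comparison in sklearn terms (the question uses weka, so this is only an analogue): MultinomialNB supports sample weights and can therefore serve as an AdaBoost base estimator; X here would be non-negative count or tf-idf features, y the labels:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

nb_alone = MultinomialNB()
nb_boosted = AdaBoostClassifier(base_estimator=MultinomialNB(), n_estimators=50)

# Compare mean cross-validated accuracy of the two setups
print(cross_val_score(nb_alone, X, y, cv=5).mean())
print(cross_val_score(nb_boosted, X, y, cv=5).mean())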
You may want to read the following paper, which examines boosting on Naive Bayes. It demonstrates that boosting does not improve the accuracy of the naive Bayesian classifier as much as is usually expected across a set of natural domains:
http://onlinelibrary.wiley.com/doi/10.1111/1467-8640.00219/abstract
Hope it provides a good insight.

SVM versus MLP (Neural Network): compared by performance and prediction accuracy

I need to decide between SVM and neural networks for an image processing application. The classifier must be fast enough for near-real-time use, and accuracy is important too. Since this is a medical application, it is important that the classifier have a low failure rate.
Which one is the better choice?
A couple of provisos:
Performance of an ML classifier can refer to either (i) performance of the classifier itself, or (ii) performance of the predicate step: the execution speed of the model-building algorithm. Particularly in this case, the answer is quite different depending on which of the two is intended in the OP, so I'll answer each separately.
Second, by Neural Network I'll assume you're referring to the most common implementation--i.e., a feed-forward, back-propagating, single-hidden-layer perceptron.
Training Time (execution speed of the model builder)
For SVM compared to NN: SVMs are much slower. There is a straightforward reason for this: SVM training requires solving the associated Lagrangian dual (rather than primal) problem. This is a quadratic optimization problem in which the number of variables is very large--i.e., equal to the number of training instances (the 'length' of your data matrix).
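A quick, hedged way to see the training-time difference in practice; the dataset and hyperparameters here are arbitrary stand-ins:

import time
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)

# Time the fit of a kernel SVM against a single-hidden-layer perceptron
for model in (SVC(kernel='rbf'), MLPClassifier(hidden_layer_sizes=(50,))):
    start = time.time()
    model.fit(X, y)
    print(type(model).__name__, round(time.time() - start, 2), 's')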
In practice, two factors, if present in your scenario, could nullify any advantage SVMs might otherwise have:
NN training is trivial to parallelize (via map reduce); parallelizing SVM training is not trivial, but it's also not impossible--within the past eight or so years, several implementations have been published and proven to work (https://bibliographie.uni-tuebingen.de/xmlui/bitstream/handle/10900/49015/pdf/tech_21.pdf)
multi-class classification problems: SVMs are two-class classifiers. They can be adapted for multi-class problems, but this is never straightforward because SVMs use direct decision functions. (An excellent source for modifying SVMs for multi-class problems is S. Abe, Support Vector Machines for Pattern Classification, Springer, 2005.) This modification could wipe out any performance advantage SVMs have over NNs: for instance, if your data has more than two classes and you choose to configure the SVM using successive classification (aka one-against-many classification), data is fed to a first SVM classifier which classifies each point as either class I or other; if the class is other, the point is fed to a second classifier which classifies it as class II or other, and so on.
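As a sketch, sklearn's OneVsRestClassifier implements the closely related one-vs-rest scheme, training one binary SVM per class (X_train and y_train are placeholders):

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# One binary SVM per class, multiplying the training cost by the class count
multi_svm = OneVsRestClassifier(SVC(kernel='rbf')).fit(X_train, y_train)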
Prediction Performance (execution speed of the model)
Prediction performance of an SVM is substantially higher than that of an NN. For a three-layer (one hidden layer) NN, prediction requires successive multiplication of an input vector by two 2D matrices (the weight matrices). For an SVM, classification involves determining on which side of the decision boundary a given point lies, in other words, essentially a single dot product.
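To make the comparison concrete, here is a toy numpy sketch of both prediction paths; the shapes and weights are arbitrary:

import numpy as np

x = np.random.rand(100)                          # input vector
W1, W2 = np.random.rand(50, 100), np.random.rand(1, 50)

# MLP prediction: two successive matrix-vector products (plus a nonlinearity)
hidden = np.tanh(W1 @ x)
nn_out = W2 @ hidden

# Linear SVM prediction: a single dot product against the learned weights
w, b = np.random.rand(100), 0.5
svm_out = np.sign(w @ x + b)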
Prediction Accuracy
By "failure rate" i assume you mean error rate rather than failure of the classifier in production use. If the latter, then there is very little if any difference between SVM and NN--both models are generally numerically stable.
Comparing prediction accuracy of the two models, and assuming both are competently configured and trained, the SVM will outperform the NN.
The superior resolution of SVM versus NN is well documented in the scientific literature. It is true that such a comparison depends on the data, the configuration, and the parameter choices of the two models. In fact, this comparison has been so widely studied--over perhaps all conceivable parameter space--and the results are so consistent, that even the existence of a few exceptions (though I'm not aware of any) under impractical circumstances shouldn't interfere with the conclusion that SVMs outperform NNs.
Why does SVM outperform NN?
These two models are based on fundamentally different learning strategies.
In NN, network weights (the NN's fitting parameters, adjusted during training) are adjusted such that the sum-of-square error between the network output and the actual value (target) is minimized.
Training an SVM, by contrast, means an explicit determination of the decision boundaries directly from the training data. This is of course required as the predicate step to the optimization problem required to build an SVM model: minimizing the aggregate distance between the maximum-margin hyperplane and the support vectors.
In practice, though, it is harder to configure the training algorithm for an SVM. The reason is the large (compared to NN) number of parameters required for configuration:
choice of kernel
selection of kernel parameters
selection of the value of the margin parameter
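A small sketch of searching over exactly those three choices; the candidate values are illustrative, and X_train, y_train are placeholders:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

svm_grid = GridSearchCV(SVC(), {
    'kernel': ['rbf', 'poly'],    # choice of kernel
    'gamma': [0.01, 0.1, 1.0],    # kernel parameter
    'C': [0.1, 1, 10],            # margin (regularization) parameter
}, cv=5)
svm_grid.fit(X_train, y_train)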
