In what scenario bagging can be used over boosting? - machine-learning

I am new to data science. So far I have learnt that bagging only reduces variance, whereas boosting reduces both variance and bias and thus increases accuracy on both the training and test sets.
I understand how both work. In terms of accuracy, boosting seems to always perform better than bagging. Please correct me if I am wrong.
Is there any criterion on which bagging, or bagging-based algorithms, beat boosting, be it memory, speed, handling of complex data, or anything else?

There are two properties of bagging that can make it more attractive than boosting:
It's parallelizable: because bagging is embarrassingly parallel, you can speed up training by roughly 4-8x, depending on the number of CPU cores.
Bagging is comparatively more robust to noise (paper). Real-life data are rarely as clean as the toy datasets we play with while learning data science. Boosting tends to overfit to noise, while bagging handles it comparatively better.
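A minimal sketch of the parallelism point, assuming scikit-learn is installed. The synthetic dataset, the estimator counts and the variable names are illustrative assumptions, not part of the answer. Bagging fits its members independently, so n_jobs can spread them over cores; AdaBoost exposes no such option because each round depends on the sample weights from the previous one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data; flip_y=0.1 assigns random labels to 10% of the samples
# to mimic label noise.
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: each tree is fit on its own bootstrap sample, so the fits are
# independent and can be spread over all cores with n_jobs=-1.
bagging = BaggingClassifier(n_estimators=50, n_jobs=-1, random_state=0)
bagging.fit(X_train, y_train)

# Boosting: round t reweights the samples using the errors of round t-1,
# so the rounds are inherently sequential.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)
boosting.fit(X_train, y_train)

bagging_acc = bagging.score(X_test, y_test)
boosting_acc = boosting.score(X_test, y_test)
print(f"bagging accuracy:  {bagging_acc:.3f}")
print(f"boosting accuracy: {boosting_acc:.3f}")
```

Which of the two scores higher depends on the data and the noise level, so treat the printed numbers as something to measure rather than a fixed outcome.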

You're right. Both are good at increasing model accuracy. In fact, boosting is better than bagging in most cases because each stage learns from the errors of the previous one.
But when your model is overfitting, boosting will keep on overfitting, while bagging will help, because each tree is built on a new subset of the data.
In short, bagging is better than boosting when you have an overfitting problem.

The goals of bagging and boosting are quite different. Bagging is an ensemble technique that tries to reduce variance, so one should use it in the case of low bias but high variance, e.g. KNN with a low neighbour count or a fully grown decision tree. Boosting, on the other hand, tries to reduce bias, and hence it can handle problems of high bias but low variance, e.g. a shallow decision tree.
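The variance argument can be made concrete with a standard result, stated here under assumptions that are not in the answer: B identically distributed base predictors f_b, each with variance sigma^2 and pairwise correlation rho. The variance of their average is then

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} f_b(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

Adding more models shrinks the second term but leaves the bias of each f_b untouched, which is consistent with using bagging on low-bias, high-variance learners and reaching for boosting when bias is the problem.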

Related

What to do if neural network always performs poorly even after addressing overfitting?

I have a medical image dataset of ~10K 256x256 images with which I am training a deep neural classifier for disease classification. I have been working with popular CNNs like InceptionV3 and ResNets.
These models have achieved validation set accuracies in the 50-60% range and I noticed that they were overfitting. So to improve the performance, I then tried common strategies like a dropout in the dense layers, smaller learning rates, and L2 regularization. After these modifications showed no reduction in overfitting, I next moved to smaller and simpler architectures with just 2-3 convolution layers + 1 FC classification layer which I thought would mitigate the issue. However, with the simpler models, the learning curves still showed signs of overfitting. Particularly, when training for 100 epochs, the models would have similar train and validation losses for the first 20-30 epochs, but then diverge after that.
I'm not sure what other strategies I can experiment with at this point and I'm worried that trying more experiments aimlessly is inefficient. Should I just accept that the models cannot generalize to this task well?
Additionally, FYI, the dataset is imbalanced. I have dealt with this using data augmentation and a weighted cross-entropy loss, but neither made a real difference.
Try modern classification architectures such as transformers or EfficientNets; their accuracy is typically higher. To compare modern architectures, use paperswithcode.
Augmentation and regularization are must-haves in the training process, whether your data is balanced or imbalanced.
You can try over- or undersampling your data to get better results.
Try warmup and learning rate schedules; these improve the convergence of the model.
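For the last suggestion, here is a framework-free sketch of one common schedule: linear warmup followed by cosine decay. The function name lr_at_step and all the numbers are hypothetical choices for illustration; the answer does not prescribe a particular schedule.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay towards min_lr."""
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rate for each of 100 training steps, with 10 warmup steps.
schedule = [
    lr_at_step(s, total_steps=100, warmup_steps=10, base_lr=1e-3)
    for s in range(100)
]
```

The value for each step would be handed to the optimizer before that step's update; most deep learning frameworks ship their own scheduler classes that play the same role.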

using random forest as base classifier with adaboost

Can I use AdaBoost with random forest as a base classifier? I searched on the internet and I didn't find anyone who does it.
Like in the following code; I try to run it but it takes a lot of time:
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

estimators = Pipeline([('vectorizer', CountVectorizer()),
                       ('transformer', TfidfTransformer()),
                       ('classifier', AdaBoostClassifier(learning_rate=1))])

# Random forest that the grid search plugs in as the AdaBoost base estimator.
RF = RandomForestClassifier(criterion='entropy', n_estimators=100, max_depth=500,
                            min_samples_split=100, max_leaf_nodes=None,
                            max_features='log2')

param_grid = {
    'vectorizer__ngram_range': [(1, 2), (1, 3)],
    'vectorizer__min_df': [5],
    'vectorizer__max_df': [0.7],
    'vectorizer__max_features': [1500],
    'transformer__use_idf': [True, False],
    'transformer__norm': ('l1', 'l2'),
    'transformer__smooth_idf': [True, False],
    'transformer__sublinear_tf': [True, False],
    # Newer scikit-learn releases name this parameter 'classifier__estimator'.
    'classifier__base_estimator': [RF],
    'classifier__algorithm': ("SAMME.R", "SAMME"),
    'classifier__n_estimators': [4, 7, 11, 13, 16, 19, 22, 25, 28, 31, 34, 43, 50]
}
I ran it with GridSearchCV, passing the RF classifier in through the AdaBoost parameters.
If I use it, would the accuracy increase?
No wonder you have not actually seen anyone doing it - it is an absurd and bad idea.
You are trying to build an ensemble (Adaboost) which in itself consists of ensemble base classifiers (RFs) - essentially an "ensemble-squared"; so, no wonder about the high computation time.
But even if it was practical, there are good theoretical reasons not to do it; quoting from my own answer in Execution time of AdaBoost with SVM base classifier:
Adaboost (and similar ensemble methods) were conceived using decision trees as base classifiers (more specifically, decision stumps, i.e. DTs with a depth of only 1); there is good reason why still today, if you don't specify explicitly the base_estimator argument, it assumes a value of DecisionTreeClassifier(max_depth=1). DTs are suitable for such ensembling because they are essentially unstable classifiers, which is not the case with SVMs, hence the latter are not expected to offer much when used as base classifiers.
On top of this, SVMs are computationally much more expensive than decision trees (let alone decision stumps), which is the reason for the long processing times you have observed.
The argument holds for RFs, too - they are not unstable classifiers, hence there is not any reason to actually expect performance improvements when using them as base classifiers for boosting algorithms, like Adaboost.
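The default base learner mentioned in the quoted answer can be checked directly. This is a small sketch assuming scikit-learn is installed; the synthetic data and variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# No base learner is passed, so scikit-learn falls back to its default.
clf = AdaBoostClassifier(n_estimators=10, random_state=0).fit(X, y)

# Inspect the first fitted member: its type, its max_depth setting and
# the depth it actually reached.
first = clf.estimators_[0]
print(type(first).__name__, first.max_depth, first.get_depth())
```

Each boosting round here fits a single split, which is why the default configuration trains quickly compared with boosting a 100-tree forest per round.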
Short answer:
It's not impossible.
I don't know if there's anything wrong with doing so in theory, but I tried this once and the accuracy increased.
Long answer:
I tried it on a typical dataset with n rows of p real-valued features, and a label list of length n. In case it matters, they are embeddings of nodes in a graph obtained by the DeepWalk algorithm, and the nodes are categorized into two classes. I trained a few classification models on this data using 5-fold cross validation, and measured common evaluation metrics for them (precision, recall, AUC etc.). The models I have used are SVM, logistic regression, random Forest, 2-layer perceptron and Adaboost with random forest classifiers. The last model, Adaboost with random forest classifiers, yielded the best results (95% AUC compared to multilayer perceptron's 89% and random forest's 88%). Sure, now the runtime has increased by a factor of, let's say, 100, but it's still about 20 mins, so it's not a constraint to me.
Here's what I thought: Firstly, I'm using cross validation, so there's probably no overfitting flying under the radar. Secondly, both are ensemble learning methods, but random forest is a bagging method, whereas Adaboost is a boosting technique. Perhaps they're still different enough for their combination to make sense?

Gradient Boosting vs Random forest

According to my understanding, RF selects features randomly and hence is hard to overfit. But, in sklearn Gradient boosting also offers the option of max_features which can help to prevent overfitting. So, why would anyone use Random forest?
Can anyone explain when to use Gradient boosting vs Random forest based on the given data?
Any help is highly appreciated.
In my personal experience, Random Forest can be a better choice when:
You train a model on a small data set.
Your data set has few features to learn from.
Your data set has a low count of positive (Y) flags, or you are trying to predict an event that rarely occurs.
In these situations, gradient boosting algorithms like XGBoost and LightGBM can overfit (even when their parameters are tuned), while simpler algorithms like Random Forest or even Logistic Regression may perform better. To illustrate, for XGBoost and LightGBM, the ROC AUC on the test set may be higher than for Random Forest, yet differ too much from the ROC AUC on the training set.
Despite the sharp predictions from gradient boosting algorithms, in some cases Random Forest benefits from the model stability of the bagging methodology (random selection) and outperforms XGBoost and LightGBM. However, gradient boosting algorithms perform better in general.
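One way to measure the train/test ROC AUC gap described above, sketched with scikit-learn's own GradientBoostingClassifier standing in for XGBoost/LightGBM. The small imbalanced synthetic dataset and the helper name auc_gap are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Small, imbalanced synthetic dataset (about 10% positives); illustrative only.
X, y = make_classification(n_samples=400, n_features=10, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

def auc_gap(model):
    """Fit the model and return (train AUC, test AUC, train minus test)."""
    model.fit(X_train, y_train)
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return train_auc, test_auc, train_auc - test_auc

results = {
    "random forest": auc_gap(RandomForestClassifier(n_estimators=200, random_state=0)),
    "gradient boosting": auc_gap(GradientBoostingClassifier(random_state=0)),
}
for name, (train_auc, test_auc, gap) in results.items():
    print(f"{name}: train={train_auc:.3f} test={test_auc:.3f} gap={gap:.3f}")
```

Which model shows the larger gap depends on the data and the hyperparameters, so this is a measuring tool rather than evidence for either side.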
Similar question asked on Quora:
https://www.quora.com/How-do-random-forests-and-boosted-decision-trees-compare
I agree with the author at the link that random forests are more robust -- they don't require much problem-specific tuning to get good results. Besides that, a couple other items based on my own experience:
Random forests can perform better on small data sets; gradient boosted trees are data hungry.
Random forests are easier to explain and understand. This may seem silly, but it can lead to better adoption of a model if it needs to be used by less technical people.
I think that's also true. I have also read the page How Random Forest Works, which lists the advantages of random forest like this:
For applications in classification problems, the Random Forest algorithm will avoid the overfitting problem.
For both classification and regression tasks, the same random forest algorithm can be used.
The Random Forest algorithm can be used for identifying the most important features from the training dataset, in other words, feature engineering.
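The last advantage, ranking features by importance, can be sketched with scikit-learn as follows. The synthetic data is an assumption: with shuffle=False, make_classification places the three informative features in the first three columns, which makes the ranking easy to sanity-check.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 8 features, of which only the first 3 carry signal (shuffle=False keeps
# the informative columns first; the rest are random noise).
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances, normalised so that they sum to 1.
importances = forest.feature_importances_
ranking = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
for i in ranking:
    print(f"feature {i}: {importances[i]:.3f}")
```

These are impurity-based scores, so they describe how the forest used each feature rather than proving a causal effect.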

Is there any classifier which is able to make decisions very fast?

Most classification algorithms are developed to improve the training speed. However, is there any classifier or algorithm focusing on decision-making speed (low computational complexity and a simple, realizable structure)? I can get enough training data, and can endure a long training time.
There are many methods that classify fast. You could more or less sort models by classification speed in the following way (fastest first, slowest last):
Decision Tree (especially with limited depth)
Linear models (linear regression, logistic regression, linear svm, lda, ...) and Naive Bayes
Non-linear models based on explicit data transformation (Nystroem kernel approximation, RVFL, RBFNN, EEM), Kernel methods (such as kernel SVM) and shallow neural networks
Random Forest and other committees
Big neural networks (e.g. CNNs)
KNN with arbitrary distance
Obviously this list is not exhaustive; it just shows some general ideas.
One way of obtaining such a model is to build a complex, slow model, then use it as a black-box label generator to train a simpler model (on a potentially infinite training set), thus getting a fast classifier at the cost of very expensive training. There are many works showing that one can do this, for example by training a shallow neural network on the outputs of a deep one.
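The black-box label generator idea can be sketched as follows, assuming scikit-learn. A random forest plays the slow teacher and a single shallow tree the fast student; the data split, model sizes and variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
X_labeled, y_labeled = X[:500], y[:500]   # small set with true labels
X_unlabeled = X[500:2500]                 # extra inputs; their labels are not used
X_holdout = X[2500:]                      # unseen inputs for the comparison

# Slow, accurate "teacher" trained on the labelled data.
teacher = RandomForestClassifier(n_estimators=300, random_state=0)
teacher.fit(X_labeled, y_labeled)

# The teacher acts as a black-box label generator for the unlabelled inputs.
pseudo_labels = teacher.predict(X_unlabeled)

# Fast "student": one shallow tree trained on the teacher's labels.
student = DecisionTreeClassifier(max_depth=6, random_state=0)
student.fit(X_unlabeled, pseudo_labels)

# How often the student reproduces the teacher on unseen inputs.
agreement = (student.predict(X_holdout) == teacher.predict(X_holdout)).mean()
print(f"student/teacher agreement: {agreement:.3f}")
```

At prediction time only the student is evaluated, so the cost per query is a handful of comparisons instead of 300 tree traversals.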
In general classification speed should not be a problem. Some exceptions are algorithms which have a time complexity depending on the number of samples you have for training. One example is k-Nearest-Neighbors which has no training time, but for classification it needs to check all points (if implemented in a naive way). Other examples are all classifiers which work with kernels since they compute the kernel between the current sample and all training samples.
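To illustrate why naive k-NN is slow at prediction time, here is a plain-Python sketch. The function name and the toy points are made up for illustration. Every query computes a distance to every stored training point, so the cost grows linearly with the size of the training set.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Naive k-NN: measures the distance from the query to every training point."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Sort by distance and let the k closest points vote.
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_labels = ["a", "a", "a", "b", "b", "b"]
prediction = knn_predict(train_points, train_labels, query=(0.15, 0.15))
```

Spatial indexes such as k-d trees avoid the full scan in low dimensions, which is what "if implemented in a naive way" hints at.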
Many classifiers work with a scalar product of the features and a learned coefficient vector. These should be fast enough in almost all cases. Examples are: Logistic regression, linear SVM, perceptrons and many more. See @lejlot's answer for a nice list.
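For comparison, prediction with such a linear model is a single dot product whose cost depends only on the number of features, not on the training set size. A plain-Python sketch with hypothetical coefficients:

```python
def linear_predict(weights, bias, features):
    """Binary linear classifier: thresholds the dot product plus bias at zero."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else 0

weights = [0.8, -0.5, 0.3]   # hypothetical learned coefficients
bias = -0.2
label = linear_predict(weights, bias, [1.0, 0.5, 2.0])
```

Logistic regression, a linear SVM and a perceptron differ in how the weights are learned, but all of them predict this way.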
If these are still too slow you might try to reduce the dimension of your feature space first and then try again (this also speeds up training time).
Btw, this question might not be suited for StackOverflow, as it is quite broad and recommendation-oriented rather than problem-oriented. Maybe try https://stats.stackexchange.com/ next time.
I have a decision tree which is represented in the compressed form and which is at least 4 times faster than the actual tree in classifying an unseen instance.

weka AdaBoost does not improve results

In my bachelor thesis I am supposed to use AdaBoostM1 with a MultinomialNaiveBayes classifier on a text classification problem. The problem is that in most cases, the M1 is worse or equal to the MultinomialNaiveBayes without boosting.
I use the following code:
AdaBoostM1 m1 = new AdaBoostM1();
m1.setClassifier(new NaiveBayesMultinomial());
m1.buildClassifier(training);
So I don't get how the AdaBoost would not be able to improve the results? Unfortunately, I couldn't find anything else about that on the web as most people seem to be very satisfied with the AdaBoost.
AdaBoost is a binary/dichotomous/2-class classifier and is designed to boost a weak learner that is just better than 1/2 accuracy. AdaBoostM1 is an M-class classifier but still requires the weak learner to be better than 1/2 accuracy, when one would expect chance level to be around 1/M. Balancing/weighting is used to get equal prevalence classes initially, but the reweighting inherent to AdaBoost can destroy this quickly. A solution is to base boosting on chance-corrected measures like Kappa or Informedness (AdaBook).
As M grows, e.g. with text classification, this mismatch grows, and thus a much stronger than chance classifier is needed. Thus with M=100, chance is about 1% but 50% minimum accuracy is needed by AdaBoostM1.
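The mismatch can be seen in the vote weight that the textbook AdaBoost.M1 formulation gives a weak learner, log((1 - error) / error). The sketch below is plain Python. The function name and the 30% figure are hypothetical, and it makes no claim about the internals of Weka's implementation.

```python
import math

def m1_vote_weight(error):
    """AdaBoost.M1 vote weight log((1 - error) / error) for a weak learner."""
    return math.log((1.0 - error) / error)

num_classes = 100
chance_accuracy = 1.0 / num_classes   # about 1% for balanced classes
learner_accuracy = 0.30               # hypothetical: 30 times better than chance
weight = m1_vote_weight(1.0 - learner_accuracy)

print(f"chance accuracy: {chance_accuracy:.2f}")
print(f"vote weight at 30% accuracy: {weight:.3f}")
print(f"vote weight at 60% accuracy: {m1_vote_weight(0.40):.3f}")
```

A learner at 30% accuracy is far above the 1% chance level, yet its weight is negative because its error exceeds 1/2, so M1 cannot make use of it.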
As base classifiers get stronger (viz. no longer barely above chance) the scope for boosting to improve things reduces - it has already pulled us to a very specific part of the search space. It is increasingly likely to have overfitted to errors and outliers, so there is no scope to balance a wide variety of variants.
A number of resources on informedness (including matlab code and xls sheets and early papers) are here: http://david.wardpowers.info/BM A comparison with other chance-corrected kappa measures is here: http://aclweb.org/anthology-new/E/E12/E12-1035.pdf
A weka implementation and experimentation for Adaboost using Bookmaker informedness is available - contact author.
It's hard to beat Naive Bayes on text classification. Furthermore, boosting was designed for weak classifiers with high bias and that's where boosting performs well. Boosting decreases bias but increases variance. Hence if you want the combo AdaBoost + Naive Bayes to outperform Naive Bayes you have to have a big training data set and cross the border, where enlarging of the training set doesn't further increase Naive Bayes's performance (while AdaBoost still benefits from the enlarged training data set).
You may want to read the following paper, which examines boosting on Naive Bayes. It demonstrates that, in a set of natural domains, boosting does not improve the accuracy of the naive Bayesian classifier as much as is usually expected:
http://onlinelibrary.wiley.com/doi/10.1111/1467-8640.00219/abstract
Hope it provides a good insight.
