Is it because in GBMs each decision tree depends on the previous decision trees? In other words, is there no independence between the trees?
As you have already suspected, it is exactly because in GBM each decision tree depends on the previous ones that the trees cannot be fit independently, so parallelizing the tree building is in principle not possible.
Consider the following excerpt, quoted from The Elements of Statistical Learning, Ch. 10 (Boosting and Additive Trees), pp. 337-339 (emphasis mine):
A weak classifier is one whose error rate is only slightly better than random guessing. The purpose of boosting is to sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers G_m(x), m = 1, 2, ..., M. The predictions from all of them are then combined through a weighted majority vote to produce the final prediction.
[...]
Each successive classifier is thereby forced to concentrate on those training observations that are missed by previous ones in the sequence.
[Figure: schematic of AdaBoost (ibid., p. 338); classifiers are trained on weighted versions of the dataset and then combined to produce the final prediction.]
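For reference, the weighted majority vote mentioned above takes the form (ibid., Eq. 10.1)

$$G(x) = \operatorname{sign}\Big(\sum_{m=1}^{M} \alpha_m G_m(x)\Big),$$

where the weights $\alpha_m$ are computed by the boosting algorithm and give more influence to the more accurate classifiers in the sequence.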
In Random Forest, on the other hand, all trees are independent, so parallelizing the algorithm is relatively straightforward.
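To make the contrast concrete, here is a minimal scikit-learn sketch (the dataset is a placeholder): random forests expose an n_jobs parameter precisely because their trees are independent, while GradientBoostingClassifier offers no such parameter, since each tree needs the errors of the previous ones.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Independent trees: scikit-learn can fit them in parallel on all cores
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1).fit(X, y)

# Sequential trees: each one is fit to the previous ensemble's errors,
# so there is no n_jobs option for the tree building
gbm = GradientBoostingClassifier(n_estimators=100).fit(X, y)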
Can I use AdaBoost with random forest as a base classifier? I searched on the internet and didn't find anyone who does this.
For example, the following code; when I try to run it, it takes a lot of time:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# Text-classification pipeline: bag-of-words -> tf-idf -> AdaBoost
estimators = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('transformer', TfidfTransformer()),
    ('classifier', AdaBoostClassifier(learning_rate=1)),
])

# Random forest intended as the AdaBoost base estimator
RF = RandomForestClassifier(criterion='entropy', n_estimators=100,
                            max_depth=500, min_samples_split=100,
                            max_leaf_nodes=None, max_features='log2')

param_grid = {
    'vectorizer__ngram_range': [(1, 2), (1, 3)],
    'vectorizer__min_df': [5],
    'vectorizer__max_df': [0.7],
    'vectorizer__max_features': [1500],
    'transformer__use_idf': [True, False],
    'transformer__norm': ('l1', 'l2'),
    'transformer__smooth_idf': [True, False],
    'transformer__sublinear_tf': [True, False],
    'classifier__base_estimator': [RF],
    'classifier__algorithm': ('SAMME.R', 'SAMME'),
    'classifier__n_estimators': [4, 7, 11, 13, 16, 19, 22, 25, 28, 31, 34, 43, 50],
}
I tried this with GridSearchCV, adding the RF classifier to the AdaBoost parameters.
If I use it, will the accuracy increase?
No wonder you have not actually seen anyone doing it - it is an absurd and bad idea.
You are trying to build an ensemble (Adaboost) which in itself consists of ensemble base classifiers (RFs) - essentially an "ensemble-squared"; so, no wonder about the high computation time.
But even if it were practical, there are good theoretical reasons not to do it; quoting from my own answer in Execution time of AdaBoost with SVM base classifier:
Adaboost (and similar ensemble methods) were conceived using decision trees as base classifiers (more specifically, decision stumps, i.e. DTs with a depth of only 1); there is good reason why, still today, if you don't explicitly specify the base_estimator argument, it assumes a value of DecisionTreeClassifier(max_depth=1). DTs are suitable for such ensembling because they are essentially unstable classifiers, which is not the case with SVMs, hence the latter are not expected to offer much when used as base classifiers.
On top of this, SVMs are computationally much more expensive than decision trees (let alone decision stumps), which is the reason for the long processing times you have observed.
The argument holds for RFs, too: they are not unstable classifiers, hence there is no reason to expect performance improvements when using them as base classifiers for boosting algorithms like Adaboost.
Short answer:
It's not impossible.
I don't know if there's anything wrong with doing so in theory, but I tried this once and the accuracy increased.
Long answer:
I tried it on a typical dataset with n rows of p real-valued features and a label list of length n. In case it matters, the features are embeddings of nodes in a graph obtained by the DeepWalk algorithm, and the nodes are categorized into two classes. I trained a few classification models on this data using 5-fold cross-validation and measured common evaluation metrics for them (precision, recall, AUC, etc.). The models I used are SVM, logistic regression, random forest, a 2-layer perceptron, and AdaBoost with random forest classifiers. The last model, AdaBoost with random forest classifiers, yielded the best results (95% AUC, compared to the multilayer perceptron's 89% and the random forest's 88%). Sure, the runtime has increased by a factor of, let's say, 100, but it's still about 20 minutes, so it's not a constraint for me.
Here's what I thought: firstly, I'm using cross-validation, so there's probably no overfitting flying under the radar. Secondly, both are ensemble learning methods, but random forest is a bagging method, whereas AdaBoost is a boosting technique. Perhaps they're still different enough for their combination to make sense?
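For what it's worth, a minimal sketch of that setup (synthetic data standing in for the graph embeddings; depending on your scikit-learn version the AdaBoost parameter is named estimator or base_estimator):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder for the n x p embedding matrix and the binary labels
X, y = make_classification(n_samples=500, n_features=64, random_state=0)

# "Ensemble-squared": each AdaBoost base learner is itself a small forest
model = AdaBoostClassifier(
    estimator=RandomForestClassifier(n_estimators=10),
    n_estimators=20,
)
print(cross_val_score(model, X, y, cv=5, scoring='roc_auc').mean())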
I am analyzing medical data for a hospital study. If I use a random forest with only one tree, the cross-validation scores are quite bad (indicating overfitting), whereas if I use a decision tree the scores are actually quite good. Both classifiers have the same depth parameter. How can this behaviour be explained?
The construction procedure for decision trees usually includes pruning, a step performed a posteriori to reduce the depth and avoid overfitting. Random Forest does not use pruning, as it actually takes advantage of the high variance of the overfitted decision trees by averaging them.
Moreover, the decision tree is built by training on the full dataset, while the "random forest" tree is built on a bootstrap sample of the training dataset, which likely translates into poorer performance, since the tree is biased towards records that were included multiple times in the sampling. Again, Random Forest takes advantage of this by averaging over multiple trees, but here it is a disadvantage.
All in all, the difference in performance is not surprising.
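A quick way to see this effect is to compare a plain decision tree against a one-tree forest of the same depth; a sketch, using a built-in scikit-learn dataset as a stand-in for the medical data:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Same depth for both; the forest's single tree additionally sees only a
# bootstrap sample of the rows and, by default, a random feature subset
# at each split
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
one_tree_rf = RandomForestClassifier(n_estimators=1, max_depth=5, random_state=0)

print(cross_val_score(tree, X, y, cv=5).mean())
print(cross_val_score(one_tree_rf, X, y, cv=5).mean())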
Most classification algorithms are developed to improve training speed. However, is there any classifier or algorithm focusing on decision-making speed (low computational complexity and a simple, realizable structure)? I can get enough training data, and I can endure a long training time.
There are many methods which classify fast; you could more or less sort models by classification speed in the following way (fastest first, slowest last):
Decision trees (especially with limited depth)
Linear models (linear regression, logistic regression, linear SVM, LDA, ...) and Naive Bayes
Non-linear models based on explicit data transformation (Nystroem kernel approximation, RVFL, RBFNN, EEM), kernel methods (such as kernel SVM) and shallow neural networks
Random Forest and other committees
Big neural networks (e.g. CNNs)
KNN with an arbitrary distance
Obviously this list is not exhaustive; it just shows some general ideas.
One way of obtaining such a model is to build a complex, slow model, then use it as a black-box label generator to train a simpler model (but on a potentially infinite training set), thus getting a fast classifier at the cost of very expensive training. There are many works showing that one can do this, for example by training a shallow neural network on the outputs of a deep NN.
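A toy sketch of that black-box idea, under the assumption that you can generate unlabeled inputs freely; the slow model here is a random forest and the fast one a logistic regression, but any pair would do:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Slow but accurate "teacher" model
teacher = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# "Potentially infinite" training set: perturbed copies of the inputs,
# labeled by the teacher
rng = np.random.RandomState(1)
X_big = X[rng.randint(0, len(X), 20000)] + 0.1 * rng.normal(size=(20000, 20))
y_big = teacher.predict(X_big)

# Fast "student": a single dot product per prediction
student = LogisticRegression(max_iter=1000).fit(X_big, y_big)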
In general, classification speed should not be a problem. Some exceptions are algorithms whose time complexity depends on the number of training samples. One example is k-Nearest-Neighbors, which has no training time but, if implemented naively, needs to check all points at classification time. Other examples are all classifiers that work with kernels, since they compute the kernel between the current sample and all training samples.
Many classifiers work with a scalar product of the features and a learned coefficient vector. These should be fast enough in almost all cases; examples are logistic regression, linear SVM, perceptrons, and many more. See lejlot's answer for a nice list.
If these are still too slow you might try to reduce the dimension of your feature space first and then try again (this also speeds up training time).
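A rough illustration of the difference (timings are machine-dependent and the dataset is synthetic): a linear model predicts with one dot product per sample, while kNN must search the training set for neighbors (exhaustively in a naive implementation).

import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=20000, n_features=50, random_state=0)

for model in (LogisticRegression(max_iter=1000), KNeighborsClassifier()):
    model.fit(X, y)
    start = time.perf_counter()
    model.predict(X[:2000])  # time prediction only, not training
    print(type(model).__name__, time.perf_counter() - start)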
By the way, this question might not be suited for StackOverflow, as it is quite broad and recommendation-oriented rather than problem-oriented. Maybe try https://stats.stackexchange.com/ next time.
I have a decision tree which is represented in compressed form and which is at least 4 times faster than the actual tree at classifying an unseen instance.
I am using AdaBoost from scikit-learn with the typical DecisionTree weak learners. I would like to understand the runtime complexity in terms of data size N and number of weak learners T. I have searched for this information, including in the original AdaBoost papers by Yoav Freund and Robert Schapire, and have not seen a clear-cut answer.
No disrespect meant to orgrisel, but his answer is lacking, as it completely ignores the number of features.
AdaBoost's time complexity is trivially O(T · f), where f is the runtime of the weak learner in use.
For a normal-style decision tree such as C4.5, the time complexity is O(N · D^2), where D is the number of features. A single-level decision tree (a stump) would be O(N · D).
You should never use experimentation to determine the runtime complexity of an algorithm, as has been suggested. First, you will be unable to easily distinguish between similar complexities such as O(N log(N)) and O(N log(N)^2). Second, you risk being fooled by underlying implementation details: for example, many sorts exhibit O(N) behavior when the data is mostly sorted or contains only a few unique values. If you supplied an input with few unique values, the runtime would be faster than the expected general case.
It's O(N · T). The linear dependency on T is certain, as the user can select the number of trees and they are trained sequentially.
I think the complexity of fitting trees in sklearn is O(N), where N is the number of samples in the training set. The number of features also has a linear impact when max_features is left at its default value.
To make sure, you can write a script that measures the training time of AdaBoost models for 10%, 20%, ..., 100% of your data and for n_estimators = 10, 20, ..., 100, then plot the results with matplotlib.
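Something along these lines (synthetic data standing in for yours; the grids are shortened to keep the run fast):

import time
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

for n_est in (10, 50, 100):
    fractions, times = [], []
    for frac in (0.1, 0.2, 0.4, 0.6, 0.8, 1.0):
        n = int(frac * len(X))
        start = time.perf_counter()
        AdaBoostClassifier(n_estimators=n_est).fit(X[:n], y[:n])
        times.append(time.perf_counter() - start)
        fractions.append(frac)
    plt.plot(fractions, times, label='n_estimators=%d' % n_est)

plt.xlabel('fraction of training data')
plt.ylabel('fit time (seconds)')
plt.legend()
plt.show()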
Edit: as AdaBoost is generally applied to shallow trees (with max_depth between 1 and 7 in general), it might be the case that the complexity is actually not linear in N. I think I measured a linear dependency on fully developed trees in the past (e.g. as in random forests). Shallow trees might have a complexity closer to O(N · log(N)), but I am not sure.
This is a question about linear regression with n-grams, using tf-idf (term frequency - inverse document frequency). To do this, I am using scipy sparse matrices and sklearn for the linear regression.
I have 53 cases and over 6000 features when using unigrams. The predictions are based on cross validation using LeaveOneOut.
When I create a tf-idf sparse matrix of only unigram scores, I get slightly better predictions than when I create a tf-idf sparse matrix of unigram + bigram scores. The more columns I add to the matrix (columns for trigrams, quadgrams, quintgrams, etc.), the less accurate the regression prediction becomes.
Is this common? How is this possible? I would have thought that the more features, the better.
It's not common for bigrams to perform worse than unigrams, but there are situations where it can happen. In particular, adding extra features may lead to overfitting, and tf-idf is unlikely to alleviate this, as longer n-grams will be rarer, leading to higher idf values.
I'm not sure what kind of variable you're trying to predict, and I've never done regression on text, but here are some comparable results from the literature to get you thinking:
In random text generation with small (but non-trivial) training sets, 7-grams tend to reconstruct the input text almost verbatim, i.e. cause complete overfitting, while trigrams are more likely to generate "new" but still somewhat grammatical/recognizable text (see Jurafsky & Martin; can't remember which chapter and I don't have my copy handy).
In classification-style NLP tasks performed with kernel machines, quadratic kernels tend to fare better than cubic ones because the latter often overfit on the training set. Note that unigram+bigram features can be thought of as a subset of the quadratic kernel's feature space, and {1,2,3}-grams of that of the cubic kernel.
Exactly what is happening depends on your training set; it might simply be too small.
As larsmans said, adding more variables / features makes it easier for the model to overfit, and hence to lose test accuracy. In the master branch of scikit-learn there is now a min_df parameter to cut off any feature with fewer than that number of occurrences; hence min_df=2 to min_df=5 might help you get rid of spurious bigrams.
Alternatively, you can use L1 or L1 + L2 penalized linear regression (or classification) using the following classes:
sklearn.linear_model.Lasso (regression)
sklearn.linear_model.ElasticNet (regression)
sklearn.linear_model.SGDRegressor (regression) with penalty='elasticnet' or 'l1'
sklearn.linear_model.SGDClassifier (classification) with penalty='elasticnet' or 'l1'
This will make it possible to ignore spurious features and will lead to a sparse model with many zero weights for noisy features. Grid-searching the regularization parameters will be very important, though.
You can also try univariate feature selection, as done in the text classification example of scikit-learn (check the SelectKBest and chi2 utilities).
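Putting the first two suggestions together, a minimal sketch (the corpus and targets are placeholders): tf-idf features with min_df to drop rare n-grams, followed by an L1-penalized regression so that noisy features get exactly zero weight.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Lasso

docs = ["first example document", "second example document", "a third one"]
y = [1.0, 0.5, 0.0]  # illustrative real-valued targets

# min_df=2 discards any n-gram seen in fewer than two documents
X = TfidfVectorizer(ngram_range=(1, 2), min_df=2).fit_transform(docs)

# The L1 penalty drives the weights of uninformative n-grams to zero
model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)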