What is the O() runtime complexity of AdaBoost? - machine-learning

I am using AdaBoost from scikit-learn with the typical DecisionTree weak learners. I would like to understand the runtime complexity in terms of data size N and number of weak learners T. I have searched for this information, including in the original AdaBoost papers by Yoav Freund and Robert Schapire, and have not found a clear-cut answer.

No disrespect meant to orgrisel, but his answer is lacking, as it completely ignores the number of features.
AdaBoost's time complexity is trivially O(T f), where f is the runtime of the weak learner in use.
For a normal-style decision tree such as C4.5, the time complexity is O(N D^2), where D is the number of features. A single-level decision tree (a stump) would be O(N D).
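For concreteness, a minimal sketch of this setting in scikit-learn (note the base-estimator argument is named estimator in recent releases and base_estimator in older ones): a depth-1 stump costs roughly O(N D) per fit by the estimate above, and boosting fits T of them sequentially, so the total is roughly O(T N D).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)                   # single-level tree: ~O(N * D) per fit
ada = AdaBoostClassifier(estimator=stump, n_estimators=100)   # T = 100 sequential stump fits
ada.fit(X, y)                                                 # overall ~O(T * N * D) by the estimate above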
You should never use experimentation to determine the runtime complexity of an algorithm, as has been suggested. First, you will be unable to easily distinguish between similar complexities such as O(N log(N)) and O(N log(N)^2). You also risk being fooled by underlying implementation details. For example, many sorts can exhibit O(N) behavior when the data is mostly sorted or contains only a few unique values. If you fed in an input with few unique values, the runtime would be faster than the expected general case.

It's O(N · T). The linear dependency on T is certain, as the user can select the number of trees and they are trained sequentially.
I think the complexity of fitting trees in sklearn is O(N), where N is the number of samples in the training set. The number of features also has a linear impact when max_features is left at its default value.
To make sure, you can write a script that measures the training time of AdaBoost models for 10%, 20%, ..., 100% of your data and for n_estimators = 10, 20, ..., 100, then plot the results with matplotlib.
Edit: as AdaBoost is generally applied to shallow trees (with max_depth between 1 and 7 in general), it might be the case that the dependency of the complexity on N is actually not linear. I think I measured a linear dependency for fully developed trees in the past (e.g. as in random forests). Shallow trees might have a complexity closer to O(N · log(N)), but I am not sure.
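For reference, a rough sketch of the benchmark suggested above (the dataset and parameter grids are arbitrary placeholders):
import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)

# training time vs. training-set size, at a fixed number of estimators
fractions = np.arange(0.1, 1.01, 0.1)
times_n = []
for frac in fractions:
    n = int(frac * len(X))
    t0 = time.perf_counter()
    AdaBoostClassifier(n_estimators=50).fit(X[:n], y[:n])
    times_n.append(time.perf_counter() - t0)

# training time vs. number of estimators, on the full data
estimators = list(range(10, 101, 10))
times_t = []
for n_est in estimators:
    t0 = time.perf_counter()
    AdaBoostClassifier(n_estimators=n_est).fit(X, y)
    times_t.append(time.perf_counter() - t0)

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(fractions * len(X), times_n)
ax1.set_xlabel("N"); ax1.set_ylabel("fit time (s)")
ax2.plot(estimators, times_t)
ax2.set_xlabel("n_estimators")
plt.show()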

Related

Why are Random Forests easier to parallelize than Gradient Boosted Machines?

Is it because in GBMs each decision tree is dependent on the previous decision trees? In other words, there is no independence?
As you have already suspected, it is exactly because in GBM, each decision tree depends on the previous ones, so the trees cannot be fit independently, thus parallelization is in principle not possible.
Consider the following excerpt, quoted from The Elements of Statistical Learning, Ch. 10 (Boosting and Additive Trees), pp. 337-339 (emphasis mine):
A weak classifier is one whose error rate is only slightly better than
random guessing. The purpose of boosting is to sequentially apply the
weak classification algorithm to repeatedly modified versions of the data,
thereby producing a sequence of weak classifiers Gm(x), m = 1, 2, . . . , M. The predictions from all of them are then combined through a weighted
majority vote to produce the final prediction.
[...]
Each successive classifier is thereby forced to concentrate on those training observations that are missed by previous ones in the sequence.
In a picture (ibid, p. 338; the schematic is not reproduced here).
In Random Forest, on the other hand, all trees are independent, thus the parallelization of the algorithm is relatively straightforward.
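In scikit-learn terms, a minimal sketch of the practical consequence (not from the quoted text): a random forest exposes n_jobs because its trees can be built concurrently, whereas gradient boosting has to add trees one after another.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# independent trees: can be fit on all cores at once
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1).fit(X, y)

# each tree corrects the previous ones: trees are fit strictly one by one
gbm = GradientBoostingClassifier(n_estimators=100).fit(X, y)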

random forest tuning - tree depth and number of trees

I have a basic question about tuning a random forest classifier. Is there any relation between the number of trees and the tree depth? Is it necessary that the tree depth should be smaller than the number of trees?
For most practical concerns, I agree with Tim.
Yet, other parameters do affect when the ensemble error converges as a function of added trees. I guess limiting the tree depth typically would make the ensemble converge a little earlier. I would rarely fiddle with tree depth: although computing time is lowered, it does not give any other bonus. Lowering the bootstrap sample size gives both lower run time and lower tree correlation, and thus often better model performance at comparable run time.
A not-so-often-mentioned trick: when the RF model's explained variance is lower than 40% (seemingly noisy data), one can lower the sample size to ~10-50% and increase the number of trees to e.g. 5000 (usually unnecessarily many). The ensemble error will converge later as a function of trees, but, due to lower tree correlation, the model becomes more robust and will reach a lower OOB-error plateau.
You can see below that samplesize gives the best long-run convergence, whereas maxnodes starts from a lower point but converges less. For this noisy data, limiting maxnodes is still better than default RF. For low-noise data, the decrease in variance from lowering maxnodes or sample size does not make up for the increase in bias due to lack of fit.
For many practical situations, you would simply give up if you could only explain 10% of the variance; thus default RF is typically fine. If you're a quant who can bet on hundreds or thousands of positions, 5-10% explained variance is awesome.
The green curve is maxnodes, which is something like tree depth, but not exactly.
library(randomForest)
X = data.frame(replicate(6,(runif(1000)-.5)*3))
ySignal = with(X, X1^2 + sin(X2) + X3 + X4)
yNoise = rnorm(1000,sd=sd(ySignal)*2)
y = ySignal + yNoise
plot(y, ySignal, main = paste("cor =", round(cor(ySignal, y), 3)))
#std RF
rf1 = randomForest(X,y,ntree=5000)
print(rf1)
plot(rf1,log="x",main="black default, red samplesize, green tree depth")
#reduced sample size
rf2 = randomForest(X,y,sampsize=.1*length(y),ntree=5000)
print(rf2)
points(1:5000,rf2$mse,col="red",type="l")
#limiting tree depth (not exact )
rf3 = randomForest(X,y,maxnodes=24,ntree=5000)
print(rf3)
points(1:5000,rf3$mse,col="darkgreen",type="l")
It is true that generally more trees will result in better accuracy. However, more trees also mean more computational cost, and after a certain number of trees the improvement is negligible. An article by Oshiro et al. (2012) pointed out that, based on their tests with 29 data sets, there is no significant improvement after 128 trees (which is in line with the graph from Soren).
Regarding tree depth, the standard random forest algorithm grows full decision trees without pruning. A single decision tree does need pruning to overcome over-fitting. However, in a random forest this issue is mitigated by the random selection of variables and the OOB (out-of-bag) mechanism.
Reference:
Oshiro, T.M., Perez, P.S. and Baranauskas, J.A., 2012. How many trees in a random forest? In MLDM 2012, pp. 154-168.
I agree with Tim that there is no rule-of-thumb ratio between the number of trees and the tree depth. Generally you want as many trees as will improve your model. More trees also mean more computational cost, and after a certain number of trees the improvement is negligible. As you can see in the figure below, after some point there is no significant improvement in the error rate even as we increase the number of trees.
The depth of a tree means how long you let the tree grow. A larger tree conveys more information, whereas a smaller tree gives less precise information. So the depth should be large enough to split each node down to your desired number of observations.
Below is an example of a short tree (leaf nodes = 3) and a long tree (leaf nodes = 6) for the Iris dataset: the short tree gives less precise information than the long tree.
Short tree (leaf nodes = 3): [figure not reproduced]
Long tree (leaf nodes = 6): [figure not reproduced]
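For reference, a small sketch that would regenerate comparable trees (assuming scikit-learn's DecisionTreeClassifier with max_leaf_nodes as the knob):
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

short_tree = DecisionTreeClassifier(max_leaf_nodes=3).fit(X, y)  # coarse, less precise splits
long_tree = DecisionTreeClassifier(max_leaf_nodes=6).fit(X, y)   # finer, more precise splits

print(export_text(short_tree))
print(export_text(long_tree))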
It all depends on your data set.
I have an example where I was building a Random Forest classifier on the Adult Income dataset, and reducing the depth of the trees (from 42 to 6) improved the performance of the model. A side effect of reducing the tree depth was a smaller model size (in RAM, and in disk space after saving).
Regarding the number of trees, I ran an experiment on 72 classification tasks from the OpenML-CC18 benchmark and found that:
the more rows in the data, the more trees are needed;
the best performance is obtained by tuning the number of trees with 1-tree precision. Train a large Random Forest (for example with 1000 trees) and then use validation data to find the optimal number of trees, as sketched below.
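A minimal sketch of that procedure, assuming scikit-learn and a warm_start forest grown in batches (the batch size, data, and the 1000-tree ceiling are placeholders; with batches of 1 you get the 1-tree precision mentioned above):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(warm_start=True, n_estimators=10, n_jobs=-1, random_state=0)
scores = {}
for n in range(10, 1001, 10):
    rf.set_params(n_estimators=n)   # warm_start keeps the already-fitted trees
    rf.fit(X_tr, y_tr)
    scores[n] = rf.score(X_val, y_val)

best_n = max(scores, key=scores.get)
print(best_n, scores[best_n])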

unigrams & bigrams (tf-idf) less accurate than just unigrams (tf-idf)?

This is a question about linear regression with ngrams, using Tf-IDF (term frequency - inverse document frequency). To do this, I am using numpy sparse matrices and sklearn for linear regression.
I have 53 cases and over 6000 features when using unigrams. The predictions are based on cross validation using LeaveOneOut.
When I create a tf-idf sparse matrix of only unigram scores, I get slightly better predictions than when I create a tf-idf sparse matrix of unigram+bigram scores. The more columns I add to the matrix (columns for trigram, quadgram, quintgrams, etc.), the less accurate the regression prediction.
Is this common? How is this possible? I would have thought that the more features, the better.
It's not common for bigrams to perform worse than unigrams, but there are situations where it may happen. In particular, adding extra features may lead to overfitting. Tf-idf is unlikely to alleviate this, as longer n-grams will be rarer, leading to higher idf values.
I'm not sure what kind of variable you're trying to predict, and I've never done regression on text, but here's some comparable results from literature to get you thinking:
In random text generation with small (but non-trivial) training sets, 7-grams tend to reconstruct the input text almost verbatim, i.e. cause complete overfit, while trigrams are more likely to generate "new" but still somewhat grammatical/recognizable text (see Jurafsky & Martin; can't remember which chapter and I don't have my copy handy).
In classification-style NLP tasks performed with kernel machines, quadratic kernels tend to fare better than cubic ones because the latter often overfit on the training set. Note that unigram+bigram features can be thought of as a subset of the quadratic kernel's feature space, and {1,2,3}-grams of that of the cubic kernel.
Exactly what is happening depends on your training set; it might simply be too small.
As larsmans said, adding more variables/features makes it easier for the model to overfit and hence lose test accuracy. In the master branch of scikit-learn there is now a min_df parameter to cut off any feature occurring in fewer than that number of documents. Hence min_df=2 to min_df=5 might help you get rid of spurious bigrams.
Alternatively you can use L1 or L1 + L2 penalized linear regression (or classification) using any of the following classes:
sklearn.linear_model.Lasso (regression)
sklearn.linear_model.ElasticNet (regression)
sklearn.linear_model.SGDRegressor (regression) with penalty == 'elastic_net' or 'l1'
sklearn.linear_model.SGDClassifier (classification) with penalty == 'elastic_net' or 'l1'
This will make it possible to ignore spurious features and lead to a sparse model with many zero weights for noisy features. Grid Searching the regularization parameters will be very important though.
You can also try univariate feature selection, such as is done in the text classification example of scikit-learn (check the SelectKBest and chi2 utilities).
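A minimal sketch combining both suggestions (the corpus, targets, min_df value, and alpha are placeholders; alpha would need grid searching in practice):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Lasso
from sklearn.model_selection import LeaveOneOut, cross_val_score

docs = ["the cat sat on the mat", "the dog sat on the log",
        "cats and dogs play outside", "the mat is on the floor"]   # placeholder corpus
targets = [1.0, 2.0, 3.0, 1.5]                                     # placeholder regression targets

# unigrams + bigrams, dropping any n-gram that occurs in fewer than 2 documents
vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
X = vec.fit_transform(docs)

# the L1 penalty drives the weights of spurious n-grams to exactly zero
model = Lasso(alpha=0.1)
scores = cross_val_score(model, X, targets, cv=LeaveOneOut(),
                         scoring="neg_mean_absolute_error")
print(scores.mean())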

Gradient boosting predictions in low-latency production environments?

Can anyone recommend a strategy for making predictions using a gradient boosting model in the <10-15ms range (the faster the better)?
I have been using R's gbm package, but the first prediction takes ~50ms (subsequent vectorized predictions average to 1ms, so there appears to be overhead, perhaps in the call to the C++ library). As a guideline, there will be ~10-50 inputs and ~50-500 trees. The task is classification and I need access to predicted probabilities.
I know there are a lot of libraries out there, but I've had little luck finding information even on rough prediction times for them. The training will happen offline, so only predictions need to be fast -- also, predictions may come from a piece of code / library that is completely separate from whatever does the training (as long as there is a common format for representing the trees).
I'm the author of the scikit-learn gradient boosting module, a Gradient Boosted Regression Trees implementation in Python. I put some effort into optimizing prediction time, since the method was targeted at low-latency environments (in particular ranking problems); the prediction routine is written in C, but there is still some overhead due to Python function calls. Having said that: prediction time for single data points with ~50 features and about 250 trees should be << 1ms.
In my use cases prediction time is often governed by the cost of feature extraction. I strongly recommend profiling to pinpoint the source of the overhead (if you use Python, I can recommend line_profiler).
If the source of the overhead is prediction rather than feature extraction, you might check whether it's possible to do batch predictions instead of predicting single data points, thus limiting the overhead due to the Python function call (e.g. in ranking you often need to score the top-K documents, so you can do the feature extraction first and then run predict on the K x n_features matrix).
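A hedged sketch of the batch idea (the data and model sizes are placeholders roughly matching the numbers in the question):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
clf = GradientBoostingClassifier(n_estimators=250, random_state=0).fit(X, y)

candidates = X[:100]   # e.g. the top-K documents to score

# one Python call per row: the per-call overhead dominates at this scale
slow = np.array([clf.predict_proba(row.reshape(1, -1))[0, 1] for row in candidates])

# one call for the whole (K, n_features) matrix: same probabilities, far less overhead
fast = clf.predict_proba(candidates)[:, 1]
assert np.allclose(slow, fast)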
If this doesn't help either, you should try to limit the number of trees, because the runtime cost of prediction is basically linear in the number of trees.
There are a number of ways to limit the number of trees without affecting the model accuracy:
Proper tuning of the learning rate; the smaller the learning rate, the more trees are needed and thus the slower the prediction.
Post-process the GBM with L1 regularization (Lasso); see Elements of Statistical Learning, Section 16.3.1 - use the predictions of each tree as new features and run that representation through an L1-regularized linear model, then remove those trees that don't get any weight (a rough sketch follows this list).
Fully-corrective weight updates; instead of doing the line-search/weight update just for the most recent tree, update all trees (see [Warmuth2006] and [Johnson2012]). Better convergence - fewer trees.
If none of the above does the trick, you could investigate cascades or early-exit strategies (see [Chen2012]).
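A minimal sketch of the Lasso post-processing idea, assuming a regression GBM in scikit-learn (alpha is a placeholder value that would need tuning):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=2000, n_features=20, random_state=0)
gbm = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05, random_state=0).fit(X, y)

# one column per tree: that tree's raw prediction on the training data
tree_preds = np.column_stack([tree.predict(X) for tree in gbm.estimators_[:, 0]])

# L1-regularised reweighting of the trees; larger alpha keeps fewer trees
lasso = Lasso(alpha=1.0).fit(tree_preds, y)
kept = np.flatnonzero(lasso.coef_)
print(f"{len(kept)} of {gbm.n_estimators} trees kept")

# prediction then only needs the surviving trees
pred = tree_preds[:, kept] @ lasso.coef_[kept] + lasso.intercept_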
References:
[Warmuth2006] M. Warmuth, J. Liao, and G. Ratsch. Totally corrective boosting algorithms that maximize the margin. In Proceedings of the 23rd international conference on Machine learning, 2006.
[Johnson2012] Rie Johnson, Tong Zhang, Learning Nonlinear Functions Using Regularized Greedy Forest, arxiv, 2012.
[Chen2012] Minmin Chen, Zhixiang Xu, Kilian Weinberger, Olivier Chapelle, Dor Kedem, Classifier Cascade for Minimizing Feature Evaluation Cost, JMLR W&CP 22: 218-226, 2012.

Multivariate Decision Tree learner

A lot of univariate decision tree learner implementations (C4.5, etc.) exist, but does anyone actually know of multivariate decision tree learning algorithms?
Bennett and Blue's A Support Vector Machine Approach to Decision Trees does multivariate splits by using embedded SVMs for each decision in the tree.
Similarly, in Multicategory classification via discrete support vector machines (2009), Orsenigo and Vercellis embed a multicategory variant of discrete support vector machines (DSVM) into the decision tree nodes.
The CART algorithm for decision trees can be made multivariate. CART is a binary-splitting algorithm, as opposed to C4.5, which creates one node per unique value of a discrete attribute. The same approach is also used in MARS and for handling missing values.
To create a multivariate tree, you compute the best split at each node, but instead of throwing away all the splits that weren't the best, you keep a portion of them (maybe all of them). Then you evaluate each of the data's attributes against each of the potential splits at that node, weighted by their rank: the first split (the one that led to the maximum gain) is weighted at 1, the next-highest-gain split is weighted by some fraction < 1.0, and so on, with the weights decreasing as the gain of the split decreases. That number is then compared to the same calculation for the node's left child: if it is above that number, go left; otherwise, go right. That's a pretty rough description, but that's a multivariate split for decision trees.
Yes, there are some, such as OC1, but they are less common than ones which make univariate splits. Adding multivariate splits expands the search space enormously. As a sort of compromise, I have seen some logical learners which simply calculate linear discriminant functions and add them to the candidate variable list.
