Is it possible to calculate AUC using OOB sample in Bagged trees? - machine-learning

I have few questions on OOB sample in Bagged trees.
1.Do we always calculate only error on OOB samples? If yes, which error metric is used for evaluation(like rmse, misclassification err)?
2.Also, do we have this OOB concept in boosting also?

Is it possible to calculate AUC using OOB sample in Bagged trees?
An ROC curve is the most commonly used way to visualize the performance of a binary classifier, and AUC is (arguably) the best way to summarize its performance in a single number. It does not matter wether you are using Bagged Trees or not. You can find a nice explanation here
1.Do we always calculate only error on OOB samples?
Not necessarily, before bootstrapping, you can set aside validation set and do cross validation
If yes, which error metric is used for evaluation(like rmse, misclassification err)?
If it is Regression problem, sum of squared errors(RSS) for the tree can be used
For a Classification problem, Misclassification error rate can be used.
2.Also, do we have this OOB concept in boosting also?
Let's see what is OOB ? The key to bagging is that trees are repeatedly fit to bootstrapped subsets of the observations.On average, each bagged tree makes use of around two-thirds of the observations.The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations. Reference: An Introduction to Statistical Learning, Section 8.2.1, Out-of-Bag Error Estimation
Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set. Reference: An Introduction to Statistical Learning, Section 8.2.3
Therefore going by the definition,OOB concept is not applicable for Boosting.
But note that most implementation of Boosted Tree algorithms will have an option to set OOB in some way. Please refer to documentation of respective implementation to understand their version.

Related

Relation between coefficients in linear regression and feature importance in decision trees

Recently I have a Machine Learning(ML) project, which needs to identify the features(inputs, a1,a2,a3 ... an) that have large impacts on target/outputs.
I used linear regression to get the coefficients of the feature, and decision trees algorithm (for example Random Forest Regressor) to get important features (or feature importance).
Is my understanding right that the feature with large coefficient in linear regression shall be among the top list of importance of features in Decision tree algorithm?
Not really, if your input features are not normalized, you could have a relatively big co-efficient for features with a relatively big mean/std. If your features are normalized, then yes, this could be an indicator to the features importance, but there are still other things to consider.
You could try some of sklearn's feature selection classes that should do this automatically for you here.
Short answer to your question is No, not necessarily. Considering the fact that we do not know what are your different inputs, if they are in the same unit system, range of variation and etc.
I am not sure why you have combined Linear regression with Decision tree. But I just assume you have a working model, say a linear regression which provides good accuracy on the test set. From what you have asked, you probably need to look at sensitivity analysis based on the obtained model. I would suggest doing some reading on "SALib" library and generally the subject of sensitivity analysis.

Machine Learning Experiment Design with Small Positive Sample Set in Sci-kit Learn

I am interested in any tips on how to train a set with a very limited positive set and a large negative set.
I have about 40 positive examples (quite lengthy articles about a particular topic), and about 19,000 negative samples (most drawn from the sci-kit learn newsgroups dataset). I also have about 1,000,000 tweets that I could work with.. negative about the topic I am trying to train on. Is the size of the negative set versus the positive going to negatively influence training a classifier?
I would like to use cross-validation in sci-kit learn. Do I need to break this into train / test-dev / test sets? Is know there are some pre-built libraries in sci-kit. Any implementation examples that you recommend or have used previously would be helpful.
Thanks!
The answer to your first question is yes, the amount by which it will affect your results depends on the algorithm. My advive would be to keep an eye on the class-based statistics such as recall and precision (found in classification_report).
For RandomForest() you can look at this thread which discusses
the sample weight parameter. In general sample_weight is what
you're looking for in scikit-learn.
For SVM's have a look at either this example or this
example.
For NB classifiers, this should be handled implicitly by Bayes
rule, however in practice you may see some poor performances.
For you second question it's up for discussion, personally I break my data into a training and test split, perform cross validation on the training set for parameter estimation, retrain on all the training data and then test on my test set. However the amount of data you have may influence the way you split your data (more data means more options).
You could probably use Random Forest for your classification problem. There are basically 3 parameters to deal with data imbalance. Class Weight, Samplesize and Cutoff.
Class Weight-The higher the weight a class is given, the more its error rate is decreased.
Samplesize- Oversample the minority class to improve class imbalance while sampling the defects for each tree[not sure if Sci-kit supports this, used to be param in R)
Cutoff- If >x% trees vote for the minority class, classify it as minority class. By default x is 1/2 in Random forest for 2-class problem. You can set it to a lower value for the minority class.
Check out balancing predict error at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
For the 2nd question if you are using Random Forest, you do not need to keep separate train/validation/test set. Random Forest does not choose any parameters based on a validation set, so validation set is un-necessary.
Also during the training of Random Forest, the data for training each individual tree is obtained by sampling by replacement from the training data, thus each training sample is not used for roughly 1/3 of the trees. We can use the votes of these 1/3 trees to predict the out of box probability of the Random forest classification. Thus with OOB accuracy you just need a training set, and not validation or test data to predict performance on unseen data. Check Out of Bag error at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm for further study.

What is an example of using Adaboost (Adaptive Boosting) approach with Decision Trees

Is there any good tutorial that explains how to weight the samples during successive iterations of constructing the decision trees for a sample training set? I want to specifically how to the weights are assigned after the first decision tree is constructed.
Decision tree is designed using Information Gain as an anchor and I am wondering how is this affected due to the misclassifications in the previous iterations being weighted.
Any good tutorial / example is highly appreciated.
A Short Introduction to Boosting from Freund and Schapire supplies an example of the AdaBoost algorithm using Quinlan's C4.5 Decision Tree model.

What are the metrics to evaluate a machine learning algorithm

I would like to know what are the various techniques and metrics used to evaluate how accurate/good an algorithm is and how to use a given metric to derive a conclusion about a ML model.
one way to do this is to use precision and recall, as defined here in wikipedia.
Another way is to use the accuracy metric as explained here. So, what I would like to know is whether there are other metrics for evaluating an ML model?
I've compiled, a while ago, a list of metrics used to evaluate classification and regression algorithms, under the form of a cheatsheet. Some metrics for classification: precision, recall, sensitivity, specificity, F-measure, Matthews correlation, etc. They are all based on the confusion matrix. Others exist for regression (continuous output variable).
The technique is mostly to run an algorithm on some data to get a model, and then apply that model on new, previously unseen data, and evaluate the metric on that data set, and repeat.
Some techniques (actually resampling techniques from statistics):
Jacknife
Crossvalidation
K-fold validation
bootstrap.
Talking about ML in general is a quite vast field, but I'll try to answer any way. The Wikipedia definition of ML is the following
Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.
In this context learning can be defined parameterization of an algorithm. The parameters of the algorithm are derived using input data with a known output. When the algorithm has "learned" the association between input and output, it can be tested with further input data for which the output is well known.
Let's suppose your problem is to obtain words from speech. Here the input is some kind of audio file containing one word (not necessarily, but I supposed this case to keep it quite simple). You'd record X words N times and then use (for example) N/2 of the repetitions to parameterize your algorithm, disregarding - at the moment - how your algorithm would look like.
Now on the one hand - depending on the algorithm - if you feed your algorithm with one of the remaining repetitions, it may give you some certainty estimate which may be used to characterize the recognition of just one of the repetitions. On the other hand you may use all of the remaining repetitions to test the learned algorithm. For each of the repetitions you pass it to the algorithm and compare the expected output with the actual output. After all you'll have an accuracy value for the learned algorithm calculated as the quotient of correct and total classifications.
Anyway, the actual accuracy will depend on the quality of your learning and test data.
A good start to read on would be Pattern Recognition and Machine Learning by Christopher M Bishop
There are various metrics for evaluating the performance of ML model and there is no rule that there are 20 or 30 metrics only. You can create your own metrics depending on your problem. There are various cases wherein when you are solving real - world problem where you would need to create your own custom metrics.
Coming to the existing ones, it is already listed in the first answer, I would just highlight each metrics merits and demerits to better have an understanding.
Accuracy is the simplest of the metric and it is commonly used. It is the number of points to class 1/ total number of points in your dataset. This is for 2 class problem where some points belong to class 1 and some to belong to class 2. It is not preferred when the dataset is imbalanced because it is biased to balanced one and it is not that much interpretable.
Log loss is a metric that helps to achieve probability scores that gives you better understanding why a specific point is belonging to class 1. The best part of this metric is that it is inbuild in logistic regression which is famous ML technique.
Confusion metric is best used for 2-class classification problem which gives four numbers and the diagonal numbers helps to get an idea of how good is your model.Through this metric there are others such as precision, recall and f1-score which are interpretable.

SGD model "overconfidence"

I'm working on binary classification problem using Apache Mahout. The algorithm I use is OnlineLogisticRegression and the model which I currently have strongly tends to produce predictions which are either 1 or 0 without any middle values.
Please suggest a way to tune or tweak the algorithm to make it produce more intermediate values in predictions.
Thanks in advance!
What is the test error rate of the classifier? If it's near zero then being confident is a feature, not a bug.
If the test error rate is high (or at least not low), then the classifier might be overfitting the training set: measure the difference between of the training error and the test error. In that case, increasing regularization as rrenaud suggested might help.
If your classifier is not overfitting, then there might be an issue with the probability calibration. Logistic Regression models (e.g. using the logit link function) should yield good enough probability calibrations (if the problem is approximately linearly separable and the label not too noisy). You can check the calibration of the probabilities with a plot as explained in this paper. If this is really a calibration issue, then implementing a custom calibration based on Platt scaling or isotonic regression might help fix the issue.
From reading the Mahout AbstractOnlineLogisticRegression docs, it looks like you can control the regularization parameter lambda. Increasing lambda should mean your weights are closer to 0, and hence your predictions are more hedged.

Resources