caret: using random forest and include cross-validation - random-forest

I used the caret package to train a random forest, including repeated cross-validation. I’d like to know whether the OOB, as in the original RF by Breiman, is used or whether this is replaced by the cross-validation. If it is replaced, do I have the same advantages as described in Breiman 2001, like increased accuracy by reducing the correlation between input data? As OOB is drawn with replacement and CV is drawn without replacement, are both procedures comparable? What is the OOB estimate of error rate (based on CV)?
How are the trees grown? Is CART used?
As this is my first thread, please let me know if you need more details. Many thanks in advance.

There are a lot of basic questions here and you would be better served by reading a book on machine learning or predictive modeling. That's probably why you haven't gotten much of a response.
For caret you should also consult the package website where some of these questions are answered.
Here are some notes:
CV and OOB estimation for RF are somewhat different. This post might help explain how. For this application, the OOB rate from random forest is computed while the model is being built, whereas CV uses holdout samples that are predicted after the random forest model is computed.
The original random forest model (used here) uses unpruned CART trees. Again, this is in many textbooks and papers.
Max

I recently got a little confused with this too, but reading chapter 4 in Applied Predictive Modeling by Max Kuhn helped me to understand the difference.
If you use randomForest in R, you grow a number of decision trees by sampling N cases with replacement (N is the number of cases in the training set). You then sample m variables at each node where m is less than the number of predictors. Each tree is then grown fully and terminal nodes are assigned to a class based on the mode of cases in that node. New cases are classified by sending them down all the trees and then taking a vote; the majority vote wins.
The key points to note here are:
how the trees are grown - sampling WITH replacement (a bootstrap). This means that some cases will be represented many times in your bootstrap sample and others may not be represented at all. The bootstrap sample will be the same size as your training dataset.
The cases that are not selected for building a tree are referred to as that tree's OOB (out-of-bag) samples; an OOB error estimate is calculated by classifying each case using only the trees for which it was not selected. About 63% of the distinct training cases appear at least once in a given bootstrap sample, so roughly 37% are OOB for each tree.
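To see where the "about 63%" figure comes from, here is a minimal simulation sketch. The function name and the choice of 10,000 cases are assumptions made for illustration; nothing here uses randomForest itself.

```python
import random

def bootstrap_unique_fraction(n_cases, seed=0):
    """Draw one bootstrap sample of size n_cases and return the share of
    distinct training cases that appear in it at least once."""
    rng = random.Random(seed)
    drawn = [rng.randrange(n_cases) for _ in range(n_cases)]  # sampling WITH replacement
    return len(set(drawn)) / n_cases

n_cases = 10_000
in_bag = bootstrap_unique_fraction(n_cases)
# Probability that a given case is drawn at least once; tends to 1 - 1/e as n grows.
expected = 1 - (1 - 1 / n_cases) ** n_cases

print(f"simulated in-bag fraction: {in_bag:.3f}")
print(f"theoretical value:         {expected:.3f}")
print(f"out-of-bag fraction:       {1 - in_bag:.3f}")
```

The remaining cases (a bit over a third) are the ones each tree can be tested on without ever having seen them.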
If you use caret in R, you will normally call caret::train(...) with method = "rf" and pass trControl = trainControl(method = "repeatedcv"). You can change the method to "oob" if you want the out-of-bag estimate instead. The way this works is as follows (I'm going to use a simple example of 10-fold CV repeated 5 times): the training dataset is split into 10 folds of roughly equal size, and a forest is built using only 9 of the folds, omitting the 1st fold (which is held out). The held-out fold is predicted by running its cases through the trees, and those predictions are used to estimate performance measures. The first fold is returned to the training set and the procedure repeats with the 2nd fold held out, and so on, until each of the 10 folds has been held out once. This whole procedure can be repeated multiple times (in my example, 5 times); for each of the 5 repeats, the training dataset will be split into 10 slightly different folds. In total, 50 different held-out samples are used to calculate model efficacy.
The key points to note are:
this involves sampling WITHOUT replacement - you split the training data into 10 folds, build a model on 9 of them, predict the held-out fold (the remaining 1 of the 10), and repeat this process as above
the model is built using a dataset that is smaller than the training dataset; this is different to the bootstrap method discussed above
You are using 2 different resampling techniques, which will yield different results, so the two estimates are not directly comparable. k-fold repeated CV tends to have low bias (for large k); where k is 2 or 3, bias is high and comparable to the bootstrap method. k-fold CV tends to have high variance though...
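For anyone who wants to see the two estimates side by side, here is a rough sketch. It uses scikit-learn rather than caret (a deliberate swap, so that the examples in this thread stay in one language), with synthetic data standing in for the real training set and stratified folds as an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic data standing in for the poster's training set (an assumption).
X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

# OOB estimate: computed while the forest is being built, from one fit on all the data.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)
oob_accuracy = forest.oob_score_

# Repeated CV estimate: 10 folds x 5 repeats = 50 forests, each scored on its held-out fold.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=cv
)

print(f"OOB accuracy:         {oob_accuracy:.3f}")
print(f"repeated-CV accuracy: {cv_scores.mean():.3f} over {len(cv_scores)} held-out folds")
```

Note that each forest inside the CV loop still bootstraps its own trees; the CV only changes which cases the performance estimate is computed on.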


Maximum number of feature dimensions

I have a classification problem and my current feature vector does not seem to hold enough information.
My training set has 10k entries and I am using a SVM as classifier (scikit-learn).
What is the maximum reasonable feature vector size (how many dimension)?
(Training and evaluation using a laptop CPU)
100? 1k? 10k? 100k? 1M?
The question is not how many features you can have for a certain number of cases (i.e. entries), but rather the opposite: how many cases you have for your features.
It’s not who has the best algorithm that wins. It’s who has the most data. (Banko and Brill, 2001)
In 2001, Banko and Brill compared 4 different algorithms while increasing the training set size into the millions, and came up with the above-quoted conclusion.
Moreover, Prof. Andrew Ng clearly covered this topic, and I’m quoting here:
If a learning algorithm is suffering from high variance, getting more training data is likely to help.
If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much
So as a rule of thumb, the number of data cases should be greater than the number of features in your dataset, with all features being as informative as possible (i.e. not highly collinear or redundant).
I have read in more than one place, including somewhere in the scikit-learn documentation, that the number of inputs (i.e. samples) should be at least the square of the number of features (i.e. n_samples > n_features ** 2).
Nevertheless, for SVM in particular, the number of features n versus the number of entries m is an important factor in choosing which kernel to try first. A second rule of thumb, for SVM in particular (also according to Prof. Andrew Ng):
If the number of features is much greater than the number of entries (e.g. n is up to 10K and m is up to 1K) --> use SVM without a kernel (i.e. "linear kernel") or use Logistic Regression.
If the number of features is small and the number of entries is intermediate (e.g. n is up to 1K and m is up to 10K) --> use SVM with Gaussian kernel.
If the number of features is small and the number of entries is much larger (e.g. n is up to 1K and m > 50K) --> create/add more features, then use SVM without a kernel or use Logistic Regression.
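The three rules above can be written down as a small helper. This is only a sketch: the function name is made up, the cut-offs (1K, 10K, 50K) are the rough figures quoted above rather than hard limits, and treating "much greater" as n >= m is my own simplification.

```python
def suggest_svm_setup(n_features, n_samples):
    """Return a first thing to try, following the three rules of thumb above.
    Thresholds are approximate; the final branch is a fallback added here."""
    if n_features >= n_samples:  # simplification of "n much greater than m"
        return "linear kernel (or logistic regression)"
    if n_features <= 1_000 and n_samples <= 10_000:
        return "Gaussian (RBF) kernel"
    if n_features <= 1_000 and n_samples > 50_000:
        return "add features, then linear kernel (or logistic regression)"
    return "no rule of thumb applies; compare kernels with cross-validation"

# The asker has 10k entries; try a few candidate feature-vector sizes.
for n_features in (100, 1_000, 100_000):
    print(n_features, "->", suggest_svm_setup(n_features, n_samples=10_000))
```

For the 10k-entry training set in the question, this suggests a Gaussian kernel up to roughly 1K features and a linear kernel once the feature count reaches the number of entries.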

Information leakage in Cross-validation

Description of classification problem:
Assume a regular dataset X with n samples and d features.
This classification problem is somewhat hard (many features, few samples, low overall AUC ~70%).
It might be useful to mention that feature selection/extraction, dimension reduction, kernels, and many classifiers have already been applied, so I am not interested in trying these.
I am not looking for an improvement in overall AUC. The goal is to find relevant features in a haystack of features.
Description of my approach:
1. I select all pairwise combinations of the d features and create many two-dimensional sub-datasets x with n samples.
2. On each sub-dataset x, I perform a 10-fold cross-validation (using all samples of the main dataset X). A very long process, assume weeks of computation.
3. I select the top k pairs (according to highest AUC, for example) and label them as +. All other pairs are labeled as -.
4. For each pair, I can compute several properties (e.g. relations between each pair using Expert's knowledge). These properties can be calculated without using the labels in main dataset X.
5. Now I have pairs which are labeled as + or -. In addition, each pair has many properties calculated based on Expert's knowledge (i.e. features). Hence, I have a new classification problem. Let's call this newly generated dataset Y.
6. I train a classifier on Y while following cross-validation rules. Surprisingly, I can predict the + and - labels with 90% AUC.
As far as I can see, it means that I am able to select relevant features. However, seeing a 90% AUC makes me worried about information leakage somewhere in this long process, especially in step 3.
I was wondering if anyone can see any leakage in this approach.
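To make the first part of the approach concrete (scoring every feature pair by 10-fold CV AUC and labeling the top k as +), here is a small sketch. The classifier (logistic regression), the synthetic data, and the names pair_auc, top_k and pair_labels are all assumptions; the original post does not say which classifier was used.

```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Small synthetic stand-in for the main dataset X (6 features -> 15 pairs).
X, y = make_classification(
    n_samples=100, n_features=6, n_informative=2, n_redundant=0, random_state=0
)

# 10-fold CV AUC for every two-feature sub-dataset.
pair_auc = {}
for i, j in combinations(range(X.shape[1]), 2):
    scores = cross_val_score(
        LogisticRegression(), X[:, [i, j]], y, cv=10, scoring="roc_auc"
    )
    pair_auc[(i, j)] = scores.mean()

# Label the top-k pairs as "+" and the rest as "-"; these become the labels of Y.
top_k = 3
ranked = sorted(pair_auc, key=pair_auc.get, reverse=True)
pair_labels = {pair: ("+" if pair in ranked[:top_k] else "-") for pair in pair_auc}

for pair in ranked:
    print(pair, f"{pair_auc[pair]:.3f}", pair_labels[pair])
```

Note that the + and - labels are derived from the labels of every sample in X, which is the part of the process the leakage question is about.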
Information Leakage:
Incorporation of target labels in the actual features. Your classifier will produce good predictions without having learned anything.
Showing your test set to your classifier during the training phase. Your classifier will "memorize" the test set and its corresponding labels without "learning" anything.
Update 1:
I want to stress that indeed I am using all data points of X in step 1. However, I am not using them ever again (even for testing). The final 90% AUC is obtained from predicting labels of dataset Y.
On the other hand, it would be useful to note that, even if I randomize the values of my main dataset X, the computed features for dataset Y are going to be the same. However, the sample labels in Y would change, because the previous + pairs might not be good ones anymore and would therefore be labeled as -.
Update 2:
Although I haven't received any opinions, I am going to state what I have learned during 4 days of talking with pattern recognition researchers. Briefly, I became confident that there is no information leakage (as long as I don't go back to the first dataset X and use its labels). Later on, if I want to check whether I can get better performance on X (i.e. predicting sample labels), I need to use only a part of dataset X for the pairwise comparison (as the training set). Then I can use the rest of the samples in X as the test set, using the positively predicted pairs of Y as features.
I will set this as the answer if no one can refute this method.
If your process in step 1 uses all the data, then the features you are learning have information from the whole dataset. Since you selected based on the whole dataset and THEN validated, you are leaking serious information.
You should probably stick with tools that are well known / already done for you before running out and trying weird strategies like this. Try using a model with L1 regularization to do feature selection for you, or start with some of the simpler searches like Sequential Backward Selection.
If you do cross-validation correctly in the end, each training fold will perform its own independent feature selection. If you do one global feature selection and then do CV, you are doing it wrong and probably leaking information.
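The difference between the two orderings is easy to demonstrate on pure noise, where the honest answer is chance-level accuracy. The sketch below assumes scikit-learn; the univariate filter (SelectKBest) and the classifier are stand-ins chosen for brevity, not the asker's pairwise procedure.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))   # pure noise features
y = np.array([0, 1] * 30)         # labels unrelated to X, so true accuracy is about 50%

# Wrong: select features once on ALL the data, then cross-validate.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_selected, y, cv=5).mean()

# Right: selection is a pipeline step, so it is refit inside every training fold.
pipeline = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest = cross_val_score(pipeline, X, y, cv=5).mean()

print(f"global selection, then CV: {leaky:.2f}")
print(f"selection inside each fold: {honest:.2f}")
```

The first number looks impressive even though there is nothing to learn, because the held-out folds already influenced which features were kept.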

Optimal Feature-to-Instance Ratio in Back Propagation Neural Network

I'm trying to perform leave-one-out cross-validation for modelling a particular problem using a Back Propagation Neural Network. I have 8 features in my training data and 20 instances. I'm trying to make the NN learn a function in building a prediction model. Now, the problem is that the error rate is quite high in the prediction. My guess is that the number of instances in the training set is too small compared to the number of features under consideration. Is this conclusion correct? Is there an optimal feature-to-instance ratio?
(This topic is often phrased in the ML literature as acceptable size or shape of the data set, given that a data set is often described as an m x n matrix in which m is the number of rows (data points) and n is the number of columns (features); obviously m >> n is preferred.)
In any event, I am not aware of a general rule for an acceptable range of features-to-observations; there are probably a couple of reasons for this:
such a ratio would depend strongly on the quality of the data (signal-to-noise ratio); and
the number of features is just one element of model complexity (e.g., interaction among the features); and model complexity is the strongest determinant of the number of data instances (data points) required.
So there are two sets of approaches to this problem which, because they work from opposite directions, can both be applied to the same model:
reduce the number of features; or
use a statistical technique to leverage the data that you do have
A couple of suggestions, one for each of the two paths above:
Eliminate "non-important" features--i.e., those features that don't contribute to the variability in your response variable. Principal Component Analysis (PCA) is a fast and reliable way to do this, though there are a number of other techniques which are generally subsumed under the rubric "dimension reduction."
Use Bootstrap methods instead of cross-validation. The difference in methodology seems slight but the (often substantial) improvement in reducing prediction error is well documented for multi-layer perceptrons (neural networks) (see e.g., Efron, B. and Tibshirani, R.J., Improvements on cross-validation: The .632+ bootstrap method, J. of the American Statistical Association, 92, 548-560, 1997). If you are not familiar with Bootstrap methods for splitting training and testing data, the general technique is similar to cross-validation except that instead of partitioning the data set into disjoint subsets you draw resamples from it with replacement. Section 7.11 of Elements is a good introduction to Bootstrap methods.
The best single source on this general topic that I have found is Chapter 7, Model Assessment and Selection, from the excellent treatise Elements of Statistical Learning by Hastie, Tibshirani, and Friedman. This book is available free to download from the book's homepage.
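As a rough illustration of bootstrap splitting for a data set as small as the one in the question (20 instances), here is a sketch using only the standard library. The function name is hypothetical, and this is the plain bootstrap with out-of-sample testing, not the .632+ estimator from the paper cited above.

```python
import random

def bootstrap_splits(n_instances, n_rounds, seed=0):
    """Yield (train_indices, test_indices) pairs. Each training set is drawn
    WITH replacement; the instances never drawn form that round's test set."""
    rng = random.Random(seed)
    for _ in range(n_rounds):
        train = [rng.randrange(n_instances) for _ in range(n_instances)]
        in_bag = set(train)
        test = [i for i in range(n_instances) if i not in in_bag]
        if test:  # skip the (very rare) round in which every instance was drawn
            yield train, test

splits = list(bootstrap_splits(n_instances=20, n_rounds=100))
average_test_size = sum(len(test) for _, test in splits) / len(splits)

print(f"rounds: {len(splits)}, average test-set size: {average_test_size:.1f} of 20")
```

Each round you would train the network on the train indices, measure the error on the test indices, and average the errors over all rounds.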

measuring the accuracy of a model and the importance of a feature in SVM

I'm starting to use LIBSVM for regression analysis. My world has about 20 features and thousands to millions of training samples.
I'm curious about two things:
Is there a metric that indicates the accuracy or confidence of the model, perhaps in the .model file or elsewhere?
How can I determine whether or not a feature is significant? E.g., if I'm trying to predict body weight as a function of height, shoulder width, gender and hair color, I might discover that hair color is not a significant feature in predicting weight. Is that reflected in the .model file, or is there some way to find out?
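LIBSVM can calculate probability estimates (not p-values) for test points if you train and predict with the -b 1 option; for classification these reflect the certainty of the classifier (i.e., how far the test point is from the decision boundary). For regression, cross-validation with the -v option reports the mean squared error and the squared correlation coefficient.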
I think you should consider the determination of feature importance a separate problem from training your SVMs. There are tons of approaches for "feature selection" (just open any text book) but one easy to understand, straightforward approach would be a simple cross-validation as follows:
1. Divide your dataset into k folds (e.g., k = 10 is common)
2. For each of the k folds:
   a. Separate your data into train/test sets (the current fold is the test set, the rest are the training set)
   b. Train your SVM classifier using only n-1 of your n features
   c. Measure the prediction performance
3. Average the performance of your n-1 feature classifier for all k test folds
4. Repeat 1-3 for all remaining features
You could also do the reverse where you test each of the n features separately but you will likely miss out on important second and higher order interactions between the features.
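The leave-one-feature-out procedure described above can be sketched as follows. This assumes scikit-learn's SVC (which wraps LIBSVM) on synthetic classification data; for the regression case in the question you would swap in SVR and a regression score. The function name and the scaling step are my own choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: 6 features, of which 3 carry signal.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

def leave_one_feature_out(X, y, k=10):
    """Return the k-fold CV accuracy with all features, plus the drop in
    accuracy observed when each single feature is removed."""
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    baseline = cross_val_score(model, X, y, cv=k).mean()
    drops = {}
    for j in range(X.shape[1]):
        X_without_j = np.delete(X, j, axis=1)  # train on n-1 of the n features
        score = cross_val_score(model, X_without_j, y, cv=k).mean()
        drops[j] = baseline - score
    return baseline, drops

baseline, drops = leave_one_feature_out(X, y)
print(f"accuracy with all features: {baseline:.3f}")
for j, drop in sorted(drops.items(), key=lambda item: item[1], reverse=True):
    print(f"feature {j}: accuracy drop when removed = {drop:+.3f}")
```

A large positive drop suggests the feature matters; a drop near zero (or negative) suggests the model does fine without it.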
In general, however, SVMs are good at ignoring irrelevant features.
You may also want to try and visualize your data using Principal Components Analysis to get a feel for how the data is distributed.
The F-score is a metric commonly used for feature selection in Machine Learning.
Since version 3.0, the LIBSVM library includes a directory called tools. In that directory is a Python script called fselect.py, which calculates the F-score. To use it, just execute it from the command line and pass in the training data file (and optionally a testing data file).
python fselect.py data_training data_testing
The output consists of an F-score for each of the features in your data set, which corresponds to the importance of that feature to the model result (regression score).
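For intuition about what such a score measures, here is a small standalone sketch of the F-score as it is commonly defined for two-class data: the between-class separation of a feature's means divided by the sum of its within-class variances. It is an assumption that this matches fselect.py's computation exactly; the function name and the toy data are made up.

```python
from statistics import mean, variance

def f_scores(X, y):
    """Per-feature F-score for two-class data (label 1 vs. everything else).
    Needs at least two rows per class, since sample variances are used."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label != 1]
    scores = []
    for j in range(len(X[0])):
        col_all = [row[j] for row in X]
        col_pos = [row[j] for row in pos]
        col_neg = [row[j] for row in neg]
        between = (mean(col_pos) - mean(col_all)) ** 2 + (mean(col_neg) - mean(col_all)) ** 2
        within = variance(col_pos) + variance(col_neg)
        scores.append(between / within if within > 0 else 0.0)
    return scores

# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [1.2, 3.0], [0.9, 4.0], [3.0, 4.5], [3.1, 3.5], [2.9, 4.0]]
y = [-1, -1, -1, 1, 1, 1]
scores = f_scores(X, y)
print(scores)
```

A higher score means the feature discriminates better between the two classes on its own; like any univariate score, it says nothing about interactions between features.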

How to select training data for naive bayes classifier

I want to double check some concepts I am uncertain of regarding the training set for classifier learning. When we select records for our training data, do we select an equal number of records per class, summing to N, or do we randomly pick N records (regardless of class)?
Intuitively I was thinking of the former, but then the prior class probabilities would be equal and not really helpful?
It depends on the distribution of your classes, and the determination can only be made with domain knowledge of the problem at hand.
You can ask the following questions:
Are there any two classes that are very similar and does the learner have enough information to distinguish between them?
Is there a large difference in the prior probabilities of each class?
If so, you should probably redistribute the classes.
In my experience, there is no harm in redistributing the classes, but it's not always necessary.
It really depends on the distribution of your classes. In the case of fraud or intrusion detection, the distribution of the prediction class can be less than 1%.
In this case you must distribute the classes evenly in the training set if you want the classifier to learn differences between each class. Otherwise, it will produce a classifier that correctly classifies over 99% of the cases without ever correctly identifying a fraud case, which is the whole point of creating a classifier to begin with.
Once you have a set of evenly distributed classes you can use any technique, such as k-fold, to perform the actual training.
Another example where class distributions need to be adjusted, but not necessarily in an equal number of records for each, is the case of determining upper-case letters of the alphabet from their shapes.
If you take a distribution of letters commonly used in the English language to train the classifier, there will be almost no cases, if any, of the letter Q. On the other hand, the letter O is very common. If you don't redistribute the classes to allow for the same number of Q's and O's, the classifier doesn't have enough information to ever distinguish a Q. You need to feed it enough information (i.e. more Qs) so it can determine that Q and O are indeed different letters.
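One simple way to redistribute the classes, as in the fraud example above, is to undersample the majority class down to the size of the smallest one. The sketch below is only one option (oversampling the rare class is another), and the function name and the 1% fraud rate are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def balance_by_undersampling(records, labels, seed=0):
    """Return (record, label) pairs with every class cut down, at random,
    to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for record, label in zip(records, labels):
        by_class[label].append(record)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for label, group in by_class.items():
        for record in rng.sample(group, target):  # sample without replacement
            balanced.append((record, label))
    rng.shuffle(balanced)
    return balanced

# 1% fraud, 99% legitimate: the kind of split described above.
labels = ["fraud"] * 10 + ["ok"] * 990
records = list(range(1000))
balanced = balance_by_undersampling(records, labels)
counts = Counter(label for _, label in balanced)
print(counts)
```

For a naive Bayes classifier, keep in mind that training on the balanced set makes the learned class priors equal, which is exactly the effect the question asks about.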
The preferred approach is to use K-fold cross-validation for picking the training and testing data.
Quote from wikipedia:
K-fold cross-validation
In K-fold cross-validation, the original sample is randomly partitioned into K subsamples. Of the K subsamples, a single subsample is retained as the validation data for testing the model, and the remaining K − 1 subsamples are used as training data. The cross-validation process is then repeated K times (the folds), with each of the K subsamples used exactly once as the validation data. The K results from the folds then can be averaged (or otherwise combined) to produce a single estimation. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used.
In stratified K-fold cross-validation, the folds are selected so that the mean response value is approximately equal in all the folds. In the case of a dichotomous classification, this means that each fold contains roughly the same proportions of the two types of class labels.
You should always take the common approach in order to have comparable results with other scientific data.
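The stratified variant described in the quote can be sketched in a few lines of plain Python. The function name and the 30/70 class split are made up for illustration; in practice a library routine would normally be used instead.

```python
import random
from collections import defaultdict

def stratified_k_folds(labels, k, seed=0):
    """Partition the indices 0..len(labels)-1 into k folds so that each fold
    has roughly the same class proportions as the whole data set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        for index in indices:
            folds[position % k].append(index)  # deal indices out round-robin
            position += 1
    return folds

labels = [1] * 30 + [0] * 70
folds = stratified_k_folds(labels, k=10)

for fold_number, held_out in enumerate(folds):
    # Everything outside the held-out fold is that round's training data.
    train = [i for fold in folds if fold is not held_out for i in fold]
    positives = sum(labels[i] for i in held_out)
    print(f"fold {fold_number}: {len(train)} train, {len(held_out)} test, {positives} positive")
```

Every record lands in exactly one test fold, and each fold keeps the 30/70 class mix of the full data set.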
I built an implementation of a Bayesian classifier to determine if a sample is NSFW (Not safe for work) by examining the occurrence of words in examples. When training a classifier for NSFW detection, I tried making it so that each class in the training sets has the same number of examples. This didn't work out as well as I had planned, because one of the classes had many more words per example than the other class.
Since I was computing the likelihood of NSFW based on these words, I found that balancing the classes by their actual size (in MB) worked. I tried 10-fold cross-validation for both approaches (balancing by number of examples and by size of classes) and found that balancing by the size of the data worked well.

Resources