I have a classification problem and my current feature vector does not seem to hold enough information.
My training set has 10k entries and I am using an SVM as the classifier (scikit-learn).
What is the maximum reasonable feature vector size (how many dimensions)?
(Training and evaluation are done on a laptop CPU.)
100? 1k? 10k? 100k? 1M?

The question is not how many features you should have for a given number of cases (i.e. entries), but rather the opposite:
It’s not who has the best algorithm that wins. It’s who has the most data. (Banko and Brill, 2001)
In 2001, Banko and Brill compared 4 different algorithms while increasing the training set size into the millions, and came up with the above-quoted conclusion.
Moreover, Prof. Andrew Ng clearly covered this topic, and I’m quoting here:
If a learning algorithm is suffering from high variance, getting more training data is likely to help.
If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much
So, as a rule of thumb, the number of cases must be greater than the number of features in your dataset, provided that all features are as informative as possible (i.e. not highly collinear, and therefore not redundant).
I have read in more than one place, and I believe somewhere in the scikit-learn documentation, that the number of samples should be at least the square of the number of features (i.e. n_samples > n_features ** 2).
Nevertheless, for SVM in particular, the number of features n vs. the number of entries m is an important factor in choosing which kernel to try first. A second rule of thumb for SVM (also according to Prof. Andrew Ng):
If the number of features is much greater than the number of entries (e.g. n is up to 10K and m is up to 1K) --> use SVM without a kernel (i.e. "linear kernel") or use Logistic Regression.
If the number of features is small and the number of entries is intermediate (e.g. n is up to 1K and m is up to 10K) --> use SVM with a Gaussian kernel.
If the number of features is small and the number of entries is much larger (e.g. n is up to 1K and m > 50K) --> create/add more features, then use SVM without a kernel or use Logistic Regression.
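The three cases above can be written down as a small lookup. This is only a sketch: the function name and the numeric cut-offs (a 10x ratio, 1,000 features, 50,000 samples) are assumptions picked to match the examples in the rule of thumb, not fixed thresholds.

```python
def suggest_svm_setup(n_features: int, n_samples: int) -> str:
    """Encode the three rule-of-thumb cases quoted above.

    The cut-offs below are illustrative assumptions, not hard limits.
    """
    # Many more features than entries: a linear decision boundary is usually enough.
    if n_features >= 10 * n_samples:
        return "linear kernel or logistic regression"
    # Few features, very many entries: kernel SVM training becomes slow.
    if n_features <= 1_000 and n_samples > 50_000:
        return "add features, then linear kernel or logistic regression"
    # Few features, intermediate number of entries.
    if n_features <= 1_000:
        return "Gaussian (RBF) kernel"
    # Anything else is not covered by the rule of thumb.
    return "no rule applies; compare kernels by cross-validation"


# The setting from the question: 10k entries, a feature vector of modest size.
print(suggest_svm_setup(n_features=125, n_samples=10_000))
```

For the 10k-entry training set in the question, any feature vector up to roughly 1k dimensions falls into the Gaussian-kernel case under these assumed cut-offs.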


random forest tuning - tree depth and number of trees

I have a basic question about tuning a random forest classifier. Is there any relation between the number of trees and the tree depth? Is it necessary that the tree depth should be smaller than the number of trees?
For most practical concerns, I agree with Tim.
Yet other parameters do affect when the ensemble error converges as a function of added trees. I guess limiting the tree depth would typically make the ensemble converge a little earlier. I would rarely fiddle with tree depth: although computing time is lowered, it gives no other benefit. Lowering the bootstrap sample size gives both lower run time and lower tree correlation, and thus often better model performance at comparable run time.
A rarely mentioned trick: when the explained variance of the RF model is below 40% (seemingly noisy data), one can lower the sample size to ~10-50% and increase the number of trees to e.g. 5000 (usually unnecessarily many). The ensemble error will converge later as a function of trees, but due to the lower tree correlation the model becomes more robust and will converge to a lower OOB error plateau.
As you can see below, reducing samplesize gives the best long-run convergence, whereas maxnodes starts from a lower point but converges less. For this noisy data, limiting maxnodes is still better than default RF. For low-noise data, the decrease in variance from lowering maxnodes or sample size does not make up for the increase in bias due to lack of fit.
In many practical situations you would simply give up if you could explain only 10% of the variance, so default RF is typically fine. If you're a quant who can bet on hundreds or thousands of positions, 5-10% explained variance is awesome.
The green curve is maxnodes, which is kind of like tree depth, but not exactly.
library(randomForest)
# six uniform predictors; only X1..X4 contribute to the signal
X = data.frame(replicate(6,(runif(1000)-.5)*3))
ySignal = with(X, X1^2 + sin(X2) + X3 + X4)
# noise with twice the standard deviation of the signal
yNoise = rnorm(1000,sd=sd(ySignal)*2)
y = ySignal + yNoise
# standard RF
rf1 = randomForest(X,y,ntree=5000)
plot(rf1,log="x",main="black default, red samplesize, green tree depth")
# reduced sample size
rf2 = randomForest(X,y,sampsize=.1*length(y),ntree=5000)
lines(rf2$mse,col="red")
# limiting tree depth (not exact)
rf3 = randomForest(X,y,maxnodes=24,ntree=5000)
lines(rf3$mse,col="green")
It is true that more trees generally result in better accuracy. However, more trees also mean more computational cost, and after a certain number of trees the improvement is negligible. An article by Oshiro et al. (2012) pointed out that, based on their tests with 29 data sets, there is no significant improvement beyond 128 trees (which is in line with the graph from Soren).
Regarding the tree depth, the standard random forest algorithm grows full decision trees without pruning. A single decision tree does need pruning in order to overcome over-fitting. In a random forest, however, this issue is largely mitigated by randomly selecting the variables and by bootstrap aggregation (which also provides the OOB estimate).
Oshiro, T.M., Perez, P.S. and Baranauskas, J.A., 2012, July. How many trees in a random forest? In MLDM (pp. 154-168).
I agree with Tim that there is no rule-of-thumb ratio between the number of trees and the tree depth. Generally you want as many trees as will improve your model. More trees also mean more computational cost, and after a certain number of trees the improvement is negligible. As you can see in the figure below, after some point there is no significant improvement in the error rate even if we keep increasing the number of trees.
The depth of the tree means how long you allow the tree to grow. A larger tree conveys more information, whereas a smaller tree gives less precise information. So the depth should be large enough to split each node down to your desired number of observations.
Below is an example of a short tree (leaf node=3) and a long tree (leaf node=6) for the Iris dataset. The short tree gives less precise information than the long tree.
Short tree(leaf node=3):
Long tree(leaf node=6):
It all depends on your data set.
I have an example where I was building a Random Forest classifier on the Adult Income dataset, and reducing the depth of the trees (from 42 to 6) improved the performance of the model. A side effect of reducing the depth of the trees was a smaller model size (in RAM and in disk space after saving).
Regarding the number of trees, I ran an experiment on 72 classification tasks from the OpenML-CC18 benchmark and found that:
the more rows in the data, the more trees are needed,
the best performance is obtained by tuning the number of trees with 1-tree precision. Train a large Random Forest (for example with 1000 trees) and then use validation data to find the optimal number of trees.
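A sketch of the last point, assuming you can get each tree's prediction on the validation rows from an already trained large forest. The function name and the `tree_votes` layout are hypothetical; how you extract per-tree predictions depends on your library.

```python
from collections import Counter


def best_tree_count(tree_votes, y_true):
    """Return (n_trees, accuracy) for the best prefix of the ensemble.

    tree_votes[t][i] is the class predicted by tree t for validation row i.
    The forest made of the first n trees predicts by majority vote.
    Ties in accuracy are resolved in favour of the smaller forest.
    """
    counters = [Counter() for _ in y_true]  # running vote counts per row
    best_n, best_acc = 0, -1.0
    for n, votes in enumerate(tree_votes, start=1):
        # add the votes of tree n to the running totals
        for counter, vote in zip(counters, votes):
            counter[vote] += 1
        correct = sum(
            counter.most_common(1)[0][0] == label
            for counter, label in zip(counters, y_true)
        )
        acc = correct / len(y_true)
        if acc > best_acc:
            best_n, best_acc = n, acc
    return best_n, best_acc


# Toy validation set: 4 rows, 4 trees.
y_val = [0, 1, 1, 0]
votes = [
    [0, 0, 1, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
]
print(best_tree_count(votes, y_val))
```

Because the vote counts are accumulated incrementally, every prefix length from 1 to the full forest is evaluated in a single pass, which is what makes tuning with 1-tree precision cheap.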

finding maximum depth of random forest given the number of features

How do we find maximum depth of Random Forest if we know the number of features ?
This is needed for regularizing random forest classifier.
I have not thought about this before. In general the trees are non-deterministic. Instead of asking what the maximum depth is, you may want to know what the average depth would be, or what the chance is that a tree has depth 20... Anyway, it is possible to calculate some bounds on the maximum depth. A node either runs out of (a) inbag samples or (b) possible splits.
(a) If the number of inbag samples (N) is the limiting factor, one could imagine a classification tree where all samples except one are forwarded left at each split. Then the maximum depth is N-1. This outcome is highly unlikely, but possible. For the minimal-depth tree, where all child nodes are equally big, the depth would be ~log2(N), e.g. 16, 8, 4, 2, 1. In practice the tree depth will be somewhere between the maximum and the minimum. Settings controlling the minimal node size would reduce the depth.
(b) To check whether the features are limiting tree depth when you know the training set beforehand, count how many training samples are unique (U); copies of the same sample cannot be split apart. Due to bootstrapping, only ~0.63 of the samples will be selected for every tree, so N ~ U * 0.63. Then use the rules from section (a). All unique samples could be selected during bootstrapping, but that is unlikely too.
If you do not know your training set, try to estimate how many levels (L[i]) could possibly be found in each feature (i) out of d features. For categorical features the answer may be given. For numeric features drawn from a real distribution, there would be as many levels as there are samples. The number of possible unique samples would be U = L[1] * L[2] * L[3] ... * L[d].
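The bounds from (a) and (b) can be computed directly. A minimal sketch; the function names are made up for illustration, and capping U at the number of rows is an added assumption (a training set cannot contain more distinct rows than it has rows).

```python
import math


def depth_bounds(n_inbag):
    """(min, max) depth of a tree grown until every leaf holds one sample."""
    if n_inbag < 1:
        raise ValueError("need at least one inbag sample")
    min_depth = math.ceil(math.log2(n_inbag))  # perfectly balanced splits
    max_depth = n_inbag - 1                    # one sample peeled off per split
    return min_depth, max_depth


def expected_inbag_unique(levels_per_feature, n_rows=None, inbag_frac=0.632):
    """Estimate N ~ U * 0.63 from the number of levels L[i] of each feature."""
    u = math.prod(levels_per_feature)  # U = L[1] * L[2] * ... * L[d]
    if n_rows is not None:
        u = min(u, n_rows)             # assumption: U cannot exceed the row count
    return round(u * inbag_frac)


print(depth_bounds(16))                  # balanced: 16, 8, 4, 2, 1
n = expected_inbag_unique([2, 3, 4])     # three categorical features
print(n, depth_bounds(n))
```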

caret: using random forest and include cross-validation

I used the caret package to train a random forest, including repeated cross-validation. I’d like to know whether the OOB, as in the original RF by Breiman, is used or whether this is replaced by the cross-validation. If it is replaced, do I have the same advantages as described in Breiman 2001, like increased accuracy by reducing the correlation between input data? As OOB is drawn with replacement and CV is drawn without replacement, are both procedures comparable? What is the OOB estimate of error rate (based on CV)?
How are the trees grown? Is CART used?
As this is my first thread, please let me know if you need more details. Many thanks in advance.
There are a lot of basic questions here and you would be better served by reading a book on machine learning or predictive modeling. That's probably why you haven't gotten much of a response.
For caret you should also consult the package website where some of these questions are answered.
Here are some notes:
CV and OOB estimation for RF are somewhat different. This post might help explain how. For this application, the OOB rate from random forest is computed while the model is being built, whereas CV uses holdout samples that are predicted after the random forest model is computed.
The original random forest model (used here) uses unpruned CART trees. Again, this is in many text books and papers.
I recently got a little confused with this too, but reading chapter 4 in Applied Predictive Modeling by Max Kuhn helped me to understand the difference.
If you use randomForest in R, you grow a number of decision trees by sampling N cases with replacement (N is the number of cases in the training set). You then sample m variables at each node where m is less than the number of predictors. Each tree is then grown fully and terminal nodes are assigned to a class based on the mode of cases in that node. New cases are classified by sending them down all the trees and then taking a vote; the majority vote wins.
The key points to note here are:
how the trees are grown - sampling WITH replacement (a bootstrap). This means that some cases will be represented many times in your bootstrap sample and others may not be represented at all. The bootstrap sample will be the same size as your training dataset.
The cases that are not selected for building a tree are referred to as the OOB samples; an OOB error estimate is calculated by classifying the cases that weren't selected when building that tree. About 63% of the training data points are represented at least once in the bootstrap sample.
If you use caret in R, you will normally use caret::train(....) and specify method = "rf" and trControl = trainControl(method = "repeatedcv"). You can change the trainControl method to "oob" if you want out-of-bag estimation. The way this works is as follows (I'm going to use a simple example of 10-fold CV repeated 5 times): the training dataset is split into 10 folds of roughly equal size, and a number of trees are built using only 9 of the folds, omitting the 1st fold (which is held out). The held-out fold is predicted by running its cases through the trees and is used to estimate performance measures. The first fold is returned to the training set and the procedure repeats with the 2nd fold held out, and so on, 10 times in total. This whole procedure can be repeated multiple times (in my example, 5 times); for each of the 5 runs, the training dataset will be split into 10 slightly different folds. It should be noted that 50 different held-out samples are used to calculate model efficacy.
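The ~63% figure can be checked with a quick simulation. This sketch uses only the standard library; the function names are illustrative.

```python
import random


def inbag_fraction(n_cases, seed=0):
    """Draw one bootstrap sample of size n_cases and return the fraction
    of distinct training cases that appear in it at least once."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n_cases) for _ in range(n_cases)}  # WITH replacement
    return len(drawn) / n_cases


def expected_inbag_fraction(n_cases):
    """Closed form: 1 - (1 - 1/N)^N, which tends to 1 - 1/e for large N."""
    return 1.0 - (1.0 - 1.0 / n_cases) ** n_cases


print(inbag_fraction(10_000), expected_inbag_fraction(10_000))
```

The remaining ~37% of cases are the OOB samples for that tree.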
The key points to note are:
this involves sampling WITHOUT replacement - you split the training data, build a model on 9 folds and predict the held-out fold (the remaining 1 of the 10), and repeat this process as above
the model is built using a dataset that is smaller than the training dataset; this is different to the bootstrap method discussed above
You are using 2 different resampling techniques which will yield different results therefore they are not comparable. The k fold repeated cv tends to have low bias (for k large); where k is 2 or 3, bias is high and comparable to the bootstrap method. K fold cv tends to have high variance though...
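To make the repeated k-fold procedure described above concrete, here is a minimal index generator in plain Python. It is a sketch of the resampling scheme only, not caret's implementation; the function name is hypothetical.

```python
import random


def repeated_kfold(n_cases, k=10, repeats=5, seed=0):
    """Yield (train_idx, heldout_idx) pairs for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_cases))
        rng.shuffle(order)                       # new random partition on every repeat
        folds = [order[i::k] for i in range(k)]  # k disjoint folds, no replacement
        for held_out in folds:
            held = set(held_out)
            train = [i for i in order if i not in held]
            yield train, held_out


splits = list(repeated_kfold(100, k=10, repeats=5))
print(len(splits))  # 10 folds x 5 repeats
```

Each model is trained on 9/10 of the data and each case is held out exactly once per repeat, in contrast to a bootstrap sample, which has the full size N but contains duplicates.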

How Feature length depend on prediction in SVM classifier

I am currently doing English alphabet classification using the SVM classifier in OpenCV.
I have the following doubts:
How does classification depend on the length of the feature vector?
(What will happen if the feature length increases? My current feature length is 125.)
Does the time taken for prediction depend on the amount of data used for training?
Why do we need to normalize the feature vector (will this improve the accuracy of prediction and the time required to predict the class)?
How do I determine the best method for normalizing the feature vector?
1) The length of the feature vector does not matter per se; what matters is the predictive quality of the features.
2) No, it does not depend on the number of samples, but it does depend on the number of features (prediction is generally very fast).
3) Normalization is required if features have very different ranges of values.
4) There are basically two options: standardization (mean, stdev) and scaling (xmax -> +1, xmin -> -1 or 0). You could try both and see which one works better.
When talking about classification, the data consists of feature vectors with a number of features. In image processing there are also features, which are mapped to classification feature vectors. So your "feature length" is actually the number of features, or the feature vector size.
1) The number of features matters. In principle more features allow better classification, but they also lead to overtraining. To avoid the latter you can add more samples (more feature vectors).
2) Yes, as the prediction time depends on the number of support vectors and the size of the support vectors. But as prediction is very fast, this is not an issue unless you have real-time requirements.
3) While SVM as a maximum margin classifier is quite robust against different feature value ranges, a feature with a bigger value range would have more weight than one with a smaller range. This especially applies to the penalty calculation if the classes are not completely separable.
4) As SVM is quite robust against different value ranges (compared to cluster-oriented algorithms), this is not the biggest issue. Typically the absolute min/max are scaled to -1/+1. If you know the expected range of your data, you could scale that range instead, so that measurement errors in your data would not influence the scaling. A fixed range is also preferable when adding training data in an iterative process.
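The normalization options mentioned in the answers above, applied to a single feature column. A minimal standard-library sketch; the function names are illustrative, and in practice the statistics (mean, stdev, min, max) should be computed on the training data and reused for test data.

```python
from statistics import mean, pstdev


def standardize(values):
    """Standardization: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature carries no information
    return [(v - mu) / sigma for v in values]


def scale_to_range(values, lo=None, hi=None):
    """Linear scaling that maps lo -> -1 and hi -> +1.

    lo/hi default to the observed min/max; pass the expected range of the
    measurement instead to get a fixed scaling that outliers cannot shift.
    """
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    if hi == lo:
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


column = [0, 5, 10]
print(standardize(column))
print(scale_to_range(column))                # observed min/max
print(scale_to_range(column, lo=0, hi=20))   # fixed expected range
```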

What is the relation between the number of Support Vectors and training data and classifiers performance? [closed]

I am using LibSVM to classify some documents. The documents seem to be a bit difficult to classify, as the final results show. However, I have noticed something while training my models: if my training set has, for example, 1000 documents, around 800 of them are selected as support vectors.
I have looked everywhere to find out whether this is a good thing or a bad thing. I mean, is there a relation between the number of support vectors and the classifier's performance?
I have read this previous post but I am performing a parameter selection and also I am sure that the attributes in the feature vectors are all ordered.
I just need to know the relation.
p.s: I use a linear kernel.
Support Vector Machines are an optimization problem. They are attempting to find a hyperplane that divides the two classes with the largest margin. The support vectors are the points which fall within this margin. It's easiest to understand if you build it up from simple to more complex.
Hard Margin Linear SVM
In a training set where the data is linearly separable, and you are using a hard margin (no slack allowed), the support vectors are the points which lie along the supporting hyperplanes (the hyperplanes parallel to the dividing hyperplane at the edges of the margin)
All of the support vectors lie exactly on the margin. Regardless of the number of dimensions or the size of the data set, the number of support vectors could be as few as 2.
Soft-Margin Linear SVM
But what if our dataset isn't linearly separable? We introduce the soft-margin SVM. We no longer require that all data points lie outside the margin; we allow some of them to stray over the line into the margin. We use the slack parameter C (nu in nu-SVM) to control this. This gives us a wider margin and greater error on the training dataset, but improves generalization and/or allows us to find a linear separation of data that is not linearly separable.
Now, the number of support vectors depends on how much slack we allow and on the distribution of the data. If we allow a large amount of slack, we will have a large number of support vectors. If we allow very little slack, we will have very few support vectors. The accuracy depends on finding the right level of slack for the data being analyzed. For some data it will not be possible to get a high level of accuracy; we must simply find the best fit we can.
Non-Linear SVM
This brings us to non-linear SVM. We are still trying to linearly divide the data, but we are now trying to do it in a higher dimensional space. This is done via a kernel function, which of course has its own set of parameters. When we translate this back to the original feature space, the result is non-linear:
Now, the number of support vectors still depends on how much slack we allow, but it also depends on the complexity of our model. Each twist and turn in the final model in our input space requires one or more support vectors to define. Ultimately, the output of an SVM is the support vectors and an alpha, which in essence is defining how much influence that specific support vector has on the final decision.
Here, accuracy depends on the trade-off between a high-complexity model which may over-fit the data and a large-margin which will incorrectly classify some of the training data in the interest of better generalization. The number of support vectors can range from very few to every single data point if you completely over-fit your data. This tradeoff is controlled via C and through the choice of kernel and kernel parameters.
I assume when you said performance you were referring to accuracy, but I thought I would also speak to performance in terms of computational complexity. In order to test a data point using an SVM model, you need to compute the dot product of each support vector with the test point. Therefore the computational complexity of the model is linear in the number of support vectors. Fewer support vectors means faster classification of test points.
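The cost argument is easy to see in code. The sketch below evaluates the standard kernel SVM decision function, f(x) = sum_i coef_i * K(sv_i, x) + b, where coef_i is the product of alpha and the class label of support vector i. The names, the RBF kernel and the gamma value are illustrative assumptions, not LibSVM's API.

```python
import math


def rbf_kernel(a, b, gamma=0.5):
    """Gaussian kernel exp(-gamma * ||a - b||^2)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)


def decision_value(x, support_vectors, dual_coefs, bias, kernel=rbf_kernel):
    """One kernel evaluation per support vector, so cost grows linearly
    with the number of support vectors."""
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, dual_coefs)) + bias


svs = [[0.0, 0.0], [1.0, 0.0]]   # toy model with two support vectors
coefs = [1.0, -1.0]              # alpha_i * y_i
value = decision_value([0.0, 0.0], svs, coefs, bias=0.0)
print("class +1" if value > 0 else "class -1")
```

With 800 support vectors instead of 2, the same call would do 800 kernel evaluations per test point.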
A good resource:
A Tutorial on Support Vector Machines for Pattern Recognition
800 out of 1000 basically tells you that the SVM needs to use almost every single training sample to encode the training set. That basically tells you that there isn't much regularity in your data.
Sounds like you have major issues with not enough training data. Also, maybe think about some specific features that separate this data better.
Both the number of samples and the number of attributes may influence the number of support vectors, making the model more complex. I believe you use words or even n-grams as attributes, so there are quite a lot of them, and natural language models are very complex themselves. So 800 support vectors out of 1000 samples seems to be OK. (Also pay attention to @karenu's comments about the C/nu parameters, which also have a large effect on the number of SVs.)
To get some intuition about this, recall the main idea of SVM. SVM works in a multidimensional feature space and tries to find a hyperplane that separates all given samples. If you have a lot of samples and only 2 features (2 dimensions), the data and hyperplane may look like this:
Here there are only 3 support vectors, all the others are behind them and thus don't play any role. Note, that these support vectors are defined by only 2 coordinates.
Now imagine that you have 3 dimensional space and thus support vectors are defined by 3 coordinates.
This means that there's one more parameter (coordinate) to be adjusted, and this adjustment may need more samples to find the optimal hyperplane. In other words, in the worst case SVM finds only 1 hyperplane coordinate per sample.
When the data is well structured (i.e. holds patterns quite well), only a few support vectors may be needed; all the others will stay behind them. But text is very, very badly structured data. SVM does its best, trying to fit the samples as well as possible, and thus takes more samples as support vectors than it drops. With an increasing number of samples this "anomaly" is reduced (more insignificant samples appear), but the absolute number of support vectors stays very high.
SVM classification is linear in the number of support vectors (SVs). The number of SVs is in the worst case equal to the number of training samples, so 800/1000 is not yet the worst case, but it's still pretty bad.
Then again, 1000 training documents is a small training set. You should check what happens when you scale up to 10000s or more documents. If things don't improve, consider using linear SVMs, trained with LibLinear, for document classification; those scale up much better (model size and classification time are linear in the number of features and independent of the number of training samples).
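Why a linear model does not depend on the number of training samples at prediction time: with a linear kernel the weighted sum over support vectors can be folded into a single weight vector w = sum_i coef_i * sv_i. The sketch below only illustrates that identity with made-up numbers; LibLinear itself learns w directly rather than collapsing support vectors.

```python
def collapse_linear_svm(support_vectors, dual_coefs):
    """Fold all support vectors into one weight vector (linear kernel only)."""
    n_features = len(support_vectors[0])
    w = [0.0] * n_features
    for sv, coef in zip(support_vectors, dual_coefs):
        for j, value in enumerate(sv):
            w[j] += coef * value
    return w


def linear_decision(x, w, bias):
    """Cost is one dot product: linear in the number of features only."""
    return sum(wj * xj for wj, xj in zip(w, x)) + bias


svs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
coefs = [0.5, -0.5, 1.0]          # alpha_i * y_i, hypothetical values
w = collapse_linear_svm(svs, coefs)
print(w, linear_decision([2.0, 2.0], w, bias=-1.0))
```

Whether the model was trained on 1,000 or 1,000,000 documents, the stored model is just w and the bias.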
There is some confusion between sources. In the textbook ISLR 6th Ed, for instance, C is described as a "boundary violation budget", from which it follows that a higher C will allow for more boundary violations and more support vectors.
But in SVM implementations in R and Python the parameter C is implemented as a "violation penalty", which is the opposite, and then you will observe that higher values of C give fewer support vectors.
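Written out, the two conventions look as follows. This is the usual textbook notation, not taken from the posts above: M is the margin width, the epsilon and xi terms are slack variables, and phi is the (possibly implicit) feature map.

```latex
% "Budget" form (ISLR): C caps the total amount of slack.
\max_{\beta_0,\beta,\epsilon,M} \; M
\quad \text{s.t.} \quad
\sum_{j} \beta_j^2 = 1, \qquad
y_i(\beta_0 + \beta^\top x_i) \ge M(1-\epsilon_i), \qquad
\epsilon_i \ge 0, \qquad
\sum_{i} \epsilon_i \le C

% "Penalty" form (C-SVC as in LibSVM): C prices each unit of slack.
\min_{w,b,\xi} \; \tfrac{1}{2}\lVert w\rVert^2 + C \sum_{i} \xi_i
\quad \text{s.t.} \quad
y_i\bigl(w^\top \phi(x_i) + b\bigr) \ge 1 - \xi_i, \qquad
\xi_i \ge 0
```

In the first form a larger C loosens the constraint, so more points may violate the margin; in the second a larger C makes violations more expensive, so fewer points end up inside the margin.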
