finding maximum depth of random forest given the number of features - machine-learning

How do we find maximum depth of Random Forest if we know the number of features ?
This is needed for regularizing random forest classifier.

I have not thought about this before. In general the trees are non-deterministic. Instead of asking what is the maximum depth? You may want to know what would be the average depth, or what is the chance of a tree has depth 20... Anyways it is possible to calculate some bounds of the maximum depth. So either a node runs out of (a)inbag samples or (b)possible splits.
(a) If inbag samples(N) is the limiting part, one could imagine a classification tree, where all samples except one are forwarded left for each split. Then the maximum depth is N-1. This outcome is highly unlikely, but possible. The minimal depth tree, where all child nodes are equally big, then the minimal depth would be ~log2(N), e.g. 16,8,4,2,1. In practice the tree depth will be somewhere in between maximal in minimal. Settings controlling minimal node size, would reduce the depth.
(b) To check if features are limiting tree depth and you on before hand know the training set, then count how many training samples are unique. Unique samples (U) cannot be split. Do to boostrapping only ~0.63 of samples will be selected for every tree. N ~ U * 0.63. Use the rules from section (a). All unique samples could be selected during bootstrapping, but that is unlikely too.
If you do not know your training set, try to estimate how many levels (L[i]) possible could be found in each feature (i) out of d features. For categorical features the answer may given. For numeric features drawn from a real distribution, there would be as many levels as there are samples. Possible unique samples would be U = L[1] * L[2] * L[3] ... * L[d].

Related

Maximum number of feature dimensions

I have a classification problem and my current feature vector does not seem to hold enough information.
My training set has 10k entries and I am using a SVM as classifier (scikit-learn).
What is the maximum reasonable feature vector size (how many dimension)?
(Training and evaluation using Labtop CPU)
100? 1k? 10k? 100k? 1M?
The thing is not how many features should it be for a certain number of cases (i.e. entries) but rather the opposite:
It’s not who has the best algorithm that wins. It’s who has the most data. (Banko and Brill, 2001)
Banko and Brill in 2001 made a comparison among 4 different algorithms, they kept increasing the Training Set Size to millions and came up with the above-quoted conclusion.
Moreover, Prof. Andrew Ng clearly covered this topic, and I’m quoting here:
If a learning algorithm is suffering from high variance, getting more training data is likely to help.
If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much
So as a rule of thumb, your data cases must be greater than the number of features in your dataset taking into account that all features should be informative as much as possible (i.e. the features are not highly collinear (i.e. redundant)).
I read once in more than one place and somewhere in Scikit-Learn Documentation, that the number of inputs (i.e. samples) must be at least the square size of the number of features (i.e. n_samples > n_features ** 2 ).
Nevertheless, for SVM in particular, the number of features n v.s number of entries m is an important factor to specify the type of kernel to use initially, as a second rule of thumb for SVM in particular (also according to Prof. Andrew Ng):
If thr number of features is much greater than number of entries (i.e. n is up to 10K and m is up to 1K) --> use SVM without a kernel (i.e. "linear kernel") or use Logistic Regression.
If the number of features is small and if the number of entries is intermediate (i.e. n is up to 1K and m is up to 10K) --> use SVM with Gaussian kernel.
If the number of feature is small and if the number of entries is much larger (i.e. n is up to 1K and m > 50K) --> Create/add more features, then use SVM without a kernel or use Logistic Regression.

Which machine learning algorithm to use for high dimensional matching?

Let say, I can define a person by 1000 different way, so i have 1,000 features for a given person.
PROBLEM: How can I run machine learning algorithm to determine the best possible match, or closest/most similar person, given the 1,000 features?
I have attempted Kmeans but this appears to be more for 2 features, rather than high dimensions.
You basically after some kind of K Nearest Neighbors Algorithm.
Since your data has high dimension you should explore the following:
Dimensionality Reduction - You may have 1000 features but probably some of them are better than others. So it would be a wise move to apply some kind of Dimensionality Reduction. Easiest and teh first point o start with would be Principal Component Analysis (PCA) which preserves ~90% of the data (Namely use enough Eigen Vectors which match 90% o the energy with their matching Eigen Values). I would assume you'll see a significant reduction from this.
Accelerated K Nearest Neighbors - There are many methods out there to accelerate the search of K-NN in high dimensional case. The K D Tree Algorithm would be a good start for that.
Distance metrics
You can try to apply a distance metric (e.g. cosine similarity) directly.
Supervised
If you know how similar the people are, you can try the following:
Neural networks, Approach #1
Input: 2x the person feature vector (hence 2000 features)
Output: 1 float (similarity of the two people)
Scalability: Linear with the number of people
See neuralnetworksanddeeplearning.com for a nice introduction and Keras for a simple framework
Neural networks, Approach #2
A more advanced approach is called metric learning.
Input: the person feature vector (hence 2000 features)
Output: k floats (you choose k, but it should be lower than 1000)
For training, you have to give the network first on person, store the result, then the second person, store the result, apply a distance metric of your choice (e.g. Euclidean distance) of the two results and then backpropagate the error.

random forest tuning - tree depth and number of trees

I have basic question about tuning a random forest classifier. Is there any relation between the number of trees and the tree depth? Is it necessary that the tree depth should be smaller than the number of trees?
For most practical concerns, I agree with Tim.
Yet, other parameters do affect when the ensemble error converges as a function of added trees. I guess limiting the tree depth typically would make the ensemble converge a little earlier. I would rarely fiddle with tree depth, as though computing time is lowered, it does not give any other bonus. Lowering bootstrap sample size both gives lower run time and lower tree correlation, thus often a better model performance at comparable run-time.
A not so mentioned trick: When RF model explained variance is lower than 40%(seemingly noisy data), one can lower samplesize to ~10-50% and increase trees to e.g. 5000(usually unnecessary many). The ensemble error will converge later as a function of trees. But, due to lower tree correlation, the model becomes more robust and will reach a lower OOB error level converge plateau.
You see below samplesize gives the best long run convergence, whereas maxnodes starts from a lower point but converges less. For this noisy data, limiting maxnodes still better than default RF. For low noise data, the decrease in variance by lowering maxnodes or sample size does not make the increase in bias due to lack-of-fit.
For many practical situations, you would simply give up, if you only could explain 10% of variance. Thus is default RF typically fine. If your a quant, who can bet on hundreds or thousands of positions, 5-10% explained variance is awesome.
the green curve is maxnodes which kinda tree depth but not exactly.
library(randomForest)
X = data.frame(replicate(6,(runif(1000)-.5)*3))
ySignal = with(X, X1^2 + sin(X2) + X3 + X4)
yNoise = rnorm(1000,sd=sd(ySignal)*2)
y = ySignal + yNoise
plot(y,ySignal,main=paste("cor="),cor(ySignal,y))
#std RF
rf1 = randomForest(X,y,ntree=5000)
print(rf1)
plot(rf1,log="x",main="black default, red samplesize, green tree depth")
#reduced sample size
rf2 = randomForest(X,y,sampsize=.1*length(y),ntree=5000)
print(rf2)
points(1:5000,rf2$mse,col="red",type="l")
#limiting tree depth (not exact )
rf3 = randomForest(X,y,maxnodes=24,ntree=5000)
print(rf2)
points(1:5000,rf3$mse,col="darkgreen",type="l")
It is true that generally more trees will result in better accuracy. However, more trees also mean more computational cost and after a certain number of trees, the improvement is negligible. An article from Oshiro et al. (2012) pointed out that, based on their test with 29 data sets, after 128 of trees there is no significant improvement(which is inline with the graph from Soren).
Regarding the tree depth, standard random forest algorithm grow the full decision tree without pruning. A single decision tree do need pruning in order to overcome over-fitting issue. However, in random forest, this issue is eliminated by random selecting the variables and the OOB action.
Reference:
Oshiro, T.M., Perez, P.S. and Baranauskas, J.A., 2012, July. How many trees in a random forest?. In MLDM (pp. 154-168).
I agree with Tim that there is no thumb ratio between the number of trees and tree depth. Generally you want as many trees as will improve your model. More trees also mean more computational cost and after a certain number of trees, the improvement is negligible. As you can see in figure below, after sometime there is no significant improvement in error rate even if we are increasing no of tree.
The depth of the tree meaning length of tree you desire. Larger tree helps you to convey more info whereas smaller tree gives less precise info.So depth should large enough to split each node to your desired number of observations.
Below is example of short tree(leaf node=3) and long tree(leaf node=6) for Iris dataset: Short tree(leaf node=3) gives less precise info compared to long tree(leaf node=6).
Short tree(leaf node=3):
Long tree(leaf node=6):
It all depends on your data set.
I have an example where I was building the Random Forest classifier on Adult Income dataset and reducing the depth of trees (from 42 to 6) improved the performance of the model. The side effect of reducing the depth of trees was How can I reduce the long feature vector which is a list of double values? model size (in RAM and disk space after save)
Regarding the number of trees, I was doing the experiment on 72 classification tasks from OpenML-CC18 benchmark and I found that:
the more rows in the data, the more trees are needed,
the best performance is obtained by tuning the number of trees with 1 tree precision. Train large Random Forest (for example with 1000 trees) and then use validation data to find optimal number of trees.

word2vec: negative sampling (in layman term)?

I'm reading the paper below and I have some trouble , understanding the concept of negative sampling.
http://arxiv.org/pdf/1402.3722v1.pdf
Can anyone help , please?
The idea of word2vec is to maximise the similarity (dot product) between the vectors for words which appear close together (in the context of each other) in text, and minimise the similarity of words that do not. In equation (3) of the paper you link to, ignore the exponentiation for a moment. You have
v_c . v_w
-------------------
sum_i(v_ci . v_w)
The numerator is basically the similarity between words c (the context) and w (the target) word. The denominator computes the similarity of all other contexts ci and the target word w. Maximising this ratio ensures words that appear closer together in text have more similar vectors than words that do not. However, computing this can be very slow, because there are many contexts ci. Negative sampling is one of the ways of addressing this problem- just select a couple of contexts ci at random. The end result is that if cat appears in the context of food, then the vector of food is more similar to the vector of cat (as measures by their dot product) than the vectors of several other randomly chosen words (e.g. democracy, greed, Freddy), instead of all other words in language. This makes word2vec much much faster to train.
Computing Softmax (Function to determine which words are similar to the current target word) is expensive since requires summing over all words in V (denominator), which is generally very large.
What can be done?
Different strategies have been proposed to approximate the softmax. These approaches can be grouped into softmax-based and sampling-based approaches. Softmax-based approaches are methods that keep the softmax layer intact, but modify its architecture to improve its efficiency (e.g hierarchical softmax). Sampling-based approaches on the other hand completely do away with the softmax layer and instead optimise some other loss function that approximates the softmax (They do this by approximating the normalization in the denominator of the softmax with some other loss that is cheap to compute like negative sampling).
The loss function in Word2vec is something like:
Which logarithm can decompose into:
With some mathematic and gradient formula (See more details at 6) it converted to:
As you see it converted to binary classification task (y=1 positive class, y=0 negative class). As we need labels to perform our binary classification task, we designate all context words c as true labels (y=1, positive sample), and k randomly selected from corpora as false labels (y=0, negative sample).
Look at the following paragraph. Assume our target word is "Word2vec". With window of 3, our context words are: The, widely, popular, algorithm, was, developed. These context words consider as positive labels. We also need some negative labels. We randomly pick some words from corpus (produce, software, Collobert, margin-based, probabilistic) and consider them as negative samples. This technique that we picked some randomly example from corpus is called negative sampling.
Reference :
(1) C. Dyer, "Notes on Noise Contrastive Estimation and Negative Sampling", 2014
(2) http://sebastianruder.com/word-embeddings-softmax/
I wrote an tutorial article about negative sampling here.
Why do we use negative sampling? -> to reduce computational cost
The cost function for vanilla Skip-Gram (SG) and Skip-Gram negative sampling (SGNS) looks like this:
Note that T is the number of all vocabs. It is equivalent to V. In the other words, T = V.
The probability distribution p(w_t+j|w_t) in SG is computed for all V vocabs in the corpus with:
V can easily exceed tens of thousand when training Skip-Gram model. The probability needs to be computed V times, making it computationally expensive. Furthermore, the normalization factor in the denominator requires extra V computations.
On the other hand, the probability distribution in SGNS is computed with:
c_pos is a word vector for positive word, and W_neg is word vectors for all K negative samples in the output weight matrix. With SGNS, the probability needs to be computed only K + 1 times, where K is typically between 5 ~ 20. Furthermore, no extra iterations are necessary to compute the normalization factor in the denominator.
With SGNS, only a fraction of weights are updated for each training sample, whereas SG updates all millions of weights for each training sample.
How does SGNS achieve this? -> by transforming multi-classification task into binary classification task.
With SGNS, word vectors are no longer learned by predicting context words of a center word. It learns to differentiate the actual context words (positive) from randomly drawn words (negative) from the noise distribution.
In real life, you don't usually observe regression with random words like Gangnam-Style, or pimples. The idea is that if the model can distinguish between the likely (positive) pairs vs unlikely (negative) pairs, good word vectors will be learned.
In the above figure, current positive word-context pair is (drilling, engineer). K=5 negative samples are randomly drawn from the noise distribution: minimized, primary, concerns, led, page. As the model iterates through the training samples, weights are optimized so that the probability for positive pair will output p(D=1|w,c_pos)≈1, and probability for negative pairs will output p(D=1|w,c_neg)≈0.

Maximum Support Vector

I studying on SVM and Support Vector recently. for example if I select Hard Linear SVM in a two dimensional classification problem with n Data, then result consist k=2 Support Vector. if I add another labeled data in previous data and retrain SVM. what's the maximum Number of SV?
I think N+1. but I need some proof. anyone help?
There is no maximum bound on the number of support vectors, and the whole data set can be selected as support vectors. The proof is fairly simple (i.e.: left as an exercise for the reader).
Assuming that N is the size of your training set, and giving the fact that all of them can be selected as support vectors then N is the maximum amount of SVs for that particular scenario.
You might also want to take a look on this: What is the relation between the number of Support Vectors and training data and classifiers performance?

Resources