Random Forests Decorrelation - random-forest

In Random Forests, each split considers only a random subset of m features rather than the complete feature set. This is said to de-correlate the predictors. I understand this intuitively, but is there a statistical argument for when the predictors will be de-correlated, and how can we prove that this is the case?

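Random forests try to improve on bagging by de-correlating the trees, which reduces the variance of the averaged prediction. At each split, the algorithm randomly selects m of the p predictors as split candidates and builds the tree from those.
Typical choices are m = sqrt(p) for classification or m = log2(p), where p is the number of features. You do not see the random selection itself, since the algorithm takes care of it internally.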
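On the statistics: a standard argument is that if B trees each have variance sigma^2 and pairwise correlation rho, their average has variance rho*sigma^2 + (1 - rho)/B * sigma^2. Adding trees only shrinks the second term, so lowering rho is what lowers the floor. This does not prove that a smaller m reduces rho; it only shows why a smaller rho helps. A minimal sketch that checks the formula by simulation (the shared-component model of the "trees" and the function names are assumptions for illustration, not part of any random forest library):

```python
import random
import statistics


def variance_of_average(rho, sigma2, B):
    """Variance of the mean of B identically distributed predictors
    with variance sigma2 and pairwise correlation rho."""
    return rho * sigma2 + (1.0 - rho) / B * sigma2


def simulate(rho, B, n_draws=5000, seed=0):
    """Empirical variance of the average of B unit-variance predictors.

    Assumed toy model: each predictor is sqrt(rho) * shared noise
    plus sqrt(1 - rho) * its own noise, so any two predictors have
    correlation rho.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_draws):
        shared = rng.gauss(0.0, 1.0)
        trees = [
            (rho ** 0.5) * shared + ((1.0 - rho) ** 0.5) * rng.gauss(0.0, 1.0)
            for _ in range(B)
        ]
        means.append(sum(trees) / B)
    return statistics.pvariance(means)


if __name__ == "__main__":
    for rho in (0.1, 0.5, 0.9):
        print(rho, variance_of_average(rho, 1.0, 10), simulate(rho, 10))
```

The simulated value should land close to the formula for each rho, and both fall as rho falls.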

Related

Which machine learning algorithm to use for high dimensional matching?

Let's say I can describe a person in 1,000 different ways, so I have 1,000 features for a given person.
PROBLEM: How can I run a machine learning algorithm to determine the best possible match, or the closest/most similar person, given the 1,000 features?
I have attempted K-means, but this appears to be meant more for 2 features than for high dimensions.
You are basically after some kind of K Nearest Neighbors algorithm.
Since your data is high-dimensional, you should explore the following:
Dimensionality Reduction - You may have 1000 features, but probably some of them are better than others, so it would be wise to apply some kind of dimensionality reduction. The easiest starting point would be Principal Component Analysis (PCA), keeping ~90% of the variance of the data (namely, use enough eigenvectors that their matching eigenvalues account for 90% of the energy). I would assume you'll see a significant reduction from this.
Accelerated K Nearest Neighbors - There are many methods for accelerating K-NN search in the high-dimensional case. The k-d tree algorithm would be a good start for that.
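The "90% of the energy" rule for PCA can be sketched as follows. This assumes you already have the eigenvalues of the covariance matrix from whatever PCA routine you use; the function name and the example values are made up for illustration:

```python
def components_for_energy(eigenvalues, target=0.90):
    """Smallest number of leading components whose eigenvalues
    sum to at least `target` of the total."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return k
    return len(ordered)


if __name__ == "__main__":
    # Hypothetical eigenvalues, not taken from real data.
    print(components_for_energy([4.0, 3.0, 2.5, 0.5]))
```

You would then project the data onto that many leading eigenvectors before running the nearest-neighbor search.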
Distance metrics
You can try to apply a distance metric (e.g. cosine similarity) directly.
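A minimal sketch of the direct approach, assuming each person is a plain list of numbers; the names and the tiny 3-feature vectors are hypothetical stand-ins for the 1,000-feature case:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def most_similar(query, people):
    """Return (name, similarity) of the entry in `people` closest to `query`."""
    return max(
        ((name, cosine_similarity(query, vec)) for name, vec in people.items()),
        key=lambda pair: pair[1],
    )


if __name__ == "__main__":
    people = {
        "bob": [0.0, 3.0, 0.0],
        "carol": [2.0, 0.0, 4.0],
        "dave": [1.0, 1.0, 1.0],
    }
    print(most_similar([1.0, 0.0, 2.0], people))
```

This is a brute-force scan, so it is linear in the number of people per query.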
Supervised
If you know how similar the people are, you can try the following:
Neural networks, Approach #1
Input: 2x the person feature vector (hence 2000 features)
Output: 1 float (similarity of the two people)
Scalability: Linear with the number of people
See neuralnetworksanddeeplearning.com for a nice introduction and Keras for a simple framework
Neural networks, Approach #2
A more advanced approach is called metric learning.
Input: the person feature vector (hence 1000 features)
Output: k floats (you choose k, but it should be lower than 1000)
For training, you feed the network the first person and store the result, then the second person and store the result, apply a distance metric of your choice (e.g. Euclidean distance) to the two results, and then backpropagate the error.

binary classification with sparse binary matrix

My crime classification dataset has indicator features, such as has_rifle.
The job is to train a model and predict whether data points are criminals or not. The metric is weighted mean absolute error: if the person is a criminal and the model predicts that he/she is not, the weight is as large as 5. If the person is not a criminal and the model predicts that he/she is, the weight is 1. Otherwise the model predicts correctly, and the weight is 0.
I've used the classif:multinom method in mlr in R and tuned the threshold to 1/6. The result is not that good. Adaboost is slightly better, though neither is perfect.
I'm wondering which method is typically used in this kind of binary classification problem with a sparse {0,1} matrix? And how to improve the performance measured by the weighted mean absolute error metric?
Dealing with sparse data is not a trivial task. Lack of information makes it difficult to capture features such as variance. I would suggest that you look into subspace clustering methods or, to be more specific, soft subspace clustering. The latter usually identifies relevant and irrelevant data dimensions. It is a good approach when you want to improve classification accuracy.
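Whatever model you end up with, it helps to have the metric from the question written down explicitly. A minimal sketch, assuming labels are 1 for criminal and 0 otherwise and that the weighted error is averaged over all data points (the question does not say how it is normalized). It also shows where a threshold of 1/6 comes from: predicting "criminal" costs (1 - p) * 1 in expectation and predicting "not criminal" costs p * 5, so the first is cheaper once p > 1/6.

```python
def weighted_mae(y_true, y_pred, fn_weight=5.0, fp_weight=1.0):
    """Asymmetric error: missed criminals cost fn_weight,
    false alarms cost fp_weight, correct predictions cost 0."""
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 0:
            total += fn_weight
        elif truth == 0 and pred == 1:
            total += fp_weight
    return total / len(y_true)  # assumed normalization: mean over all points


def cost_threshold(fn_weight=5.0, fp_weight=1.0):
    """Probability above which predicting 1 has the lower expected cost."""
    return fp_weight / (fp_weight + fn_weight)


def predict(probabilities, threshold):
    """Turn predicted probabilities of class 1 into hard 0/1 labels."""
    return [1 if p > threshold else 0 for p in probabilities]


if __name__ == "__main__":
    probs = [0.05, 0.20, 0.60, 0.10]  # hypothetical model outputs
    labels = predict(probs, cost_threshold())
    print(labels, weighted_mae([0, 1, 1, 0], labels))
```

The 1/6 threshold is only optimal if the predicted probabilities are reasonably well calibrated, which may be part of why tuning it alone did not help much.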

How to get scores along with Machine Learning results?

I am new to machine learning and I am currently working on a classification problem. I am able to train the model and predict on test data sets. I want to know whether there is some way to get scores along with the predictions. By scores, I mean proximity scores that accompany each prediction. For example, in the standard age-salary-buy classification problem (will the customer buy the product or not, based on age and salary), I want a score out of 100 for how likely he is to buy the product, in addition to the prediction of whether he will buy it or not.
Currently, I am using the LibSVM algorithm. Is there some algorithm that provides this data?
Thanks.
What you are looking for is the support behind the decision. Many classifiers choose the class of x from the label set Y as:
cl(x) = arg max_{y \in Y} p(y|x)
where p(y|x) is their internal estimation of "x having label y". And such classifiers include:
neural networks (with sigmoid output)
logistic regression
naive bayes
voting ensembles (such as RF)
...
These methods can be easily converted to your 0-100 scale, as probability is in 0-1 scale.
Others, such as SVM, use a confidence measure that grows with the probability but is unbounded. You can get this value (often called the decision function), but you cannot convert it directly to a 0-100 score, as there is no "maximum" value. This is a big drawback, so some modifications have been proposed. In particular, for SVM there is Platt scaling, which fits a logistic regression on top of the SVM outputs so that you get a probability estimate. In libSVM you can set -b 1 to get probability estimates.
From the libsvm website:
-b probability_estimates: whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
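To make the idea behind Platt scaling concrete, here is a minimal sketch of the sigmoid that maps an unbounded decision value f to a probability, P(y=1 | f) = 1 / (1 + exp(a*f + b)). The parameters a and b are normally fitted on held-out data; the values used here are made up for illustration, and this is not libSVM's own code:

```python
import math


def platt_probability(decision_value, a, b):
    """Platt's sigmoid: maps an SVM decision value to a probability.

    a and b are assumed to have been fitted beforehand; a is
    typically negative so that larger decision values give
    larger probabilities.
    """
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


def score_out_of_100(probability):
    """Rescale a probability in [0, 1] to an integer 0-100 score."""
    return round(100.0 * probability)


if __name__ == "__main__":
    a, b = -2.0, 0.0  # hypothetical fitted parameters
    for f in (-1.5, 0.0, 1.5):
        print(f, score_out_of_100(platt_probability(f, a, b)))
```

With -b 1, libSVM fits the corresponding parameters for you during training and returns the probabilities at prediction time.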

learning, validation, and testing classifier [closed]

I'm working on sentiment analysis for text classification, and I want to classify tweets from Twitter into 3 categories: positive, negative, or neutral. I have 210 training examples, and I'm using Naive Bayes as the classifier. I'm implementing it in PHP, with MySQL as the database for the training data.
What I've done, in sequence:
1) I split my data using 10-fold cross-validation into 189 training examples and 21 testing examples.
2) I insert the training data into the database, so my classifier can classify based on it.
3) I classify the testing data using my classifier, which gives 21 prediction results.
4) I repeat steps 2 and 3 10 times, following 10-fold cross-validation.
5) I evaluate the accuracy of the classifier for each fold, which gives 10 accuracy results. Then I take the average of the results.
What I want to know is:
Which is the learning process? What is the input, process, and output?
Which is the validation process? What is the input, process, and output?
Which is the testing process? What is the input, process, and output?
I just want to make sure that my understanding of these 3 processes (learning, validation, and testing) is correct.
In your example, I don't think there is a meaningful distinction between validation and testing.
Learning is when you train the model, which means that your outputs are, in general, parameters, such as coefficients in a regression model or weights for connections in a neural network. In your case, the outputs are estimates of the probability of seeing a word w in a tweet given that the tweet is positive, P(w|+), given that it is negative, P(w|-), and given that it is neutral, P(w|*), as well as the probabilities of not seeing the word given positive, negative, or neutral: P(~w|+), etc. The inputs are the training data, and the process is simply estimating probabilities by measuring how frequently words occur (or don't occur) in each of your classes, i.e. just counting!
Testing is where you see how well your trained model does on data you haven't seen before. Training tends to produce outputs that overfit the training data, i.e. the coefficients or probabilities are "tuned" to noise in the training data, so you need to see how well your model does on data it hasn't been trained on. In your case, the inputs are the test examples, the process is applying Bayes theorem, and the outputs are classifications for the test examples (you classify based on which probability is highest).
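The 10-fold procedure from the question (210 examples, 189 for training and 21 for testing in each round) can be sketched as below. The `train` and `classify` arguments are hypothetical callbacks standing in for your Naive Bayes code, and the folds are contiguous blocks, so in practice you would shuffle the examples first:

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validate(samples, labels, k, train, classify):
    """Average test accuracy over k folds.

    train(samples, labels) -> model and classify(model, sample) -> label
    are placeholders for the learner being evaluated.
    """
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        model = train([samples[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(classify(model, samples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


if __name__ == "__main__":
    for train_idx, test_idx in k_fold_indices(210, 10):
        print(len(train_idx), len(test_idx))
```

Each call to `train` is the learning step and each batch of `classify` calls is the testing step; every example is used for testing exactly once.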
I have come across cross-validation -- in addition to testing -- in situations where you don't know what model to use (or where there are additional, "extrinsic", parameters to estimate that can't be done in the training phase). You split the data into 3 sets.
So, for example, in linear regression you might want to fit a straight line model, i.e. estimate p and c in y = px + c, or you might want to fit a quadratic model, i.e. estimate p, c, and q in y = px + qx^2 + c. What you do here is split your data into three. You train the straight line and quadratic models using part 1 of the data (the training examples). Then you see which model is better by using part 2 of the data (the cross-validation examples). Finally, once you've chosen your model, you use part 3 of the data (the test set) to determine how good your model is. Regression is a nice example because a quadratic model will always fit the training data at least as well as the straight line model, so you can't just look at the errors on the training data alone to decide what to do.
In the case of Naive Bayes, it might make sense to explore different prior probabilities, i.e. P(+), P(-), P(*), using a cross-validation set, and then use the test set to see how well you've done with the priors chosen using cross-validation and the conditional probabilities estimated using the training data.
As an example of how to calculate the conditional probabilities, consider 4 tweets, which have been classified as "+" or "-" by a human
T1, -, contains "hate", "anger"
T2, +, contains "don't", "hate"
T3, +, contains "love", "friend"
T4, -, contains "anger"
So for P(hate|-) you count the negative tweets in which hate appears and divide by the number of negative tweets. It appears in T1 but not in T4, so P(hate|-) = 1/2. For P(~hate|-) you do the opposite: hate doesn't appear in 1 out of 2 of the negative tweets, so P(~hate|-) = 1/2.
Similar calculations give P(anger|-) = 1, and P(love|+) = 1/2.
A fly in the ointment is that any probability that is 0 will mess things up in the calculation phase, so instead of using a zero probability you use a very low number, like 1/n or 1/n^2, where n is the number of training examples. So you might put P(~anger|-) = 1/4 or 1/16.
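The counting for the four example tweets can be sketched like this. The function names are made up for illustration, and the zero-probability fix uses the 1/n variant mentioned above:

```python
from fractions import Fraction

# The four hand-labelled tweets from the example: (label, words present).
tweets = [
    ("-", {"hate", "anger"}),
    ("+", {"don't", "hate"}),
    ("+", {"love", "friend"}),
    ("-", {"anger"}),
]


def p_word_given_class(word, label, data):
    """Fraction of tweets with this label that contain the word."""
    in_class = [words for lab, words in data if lab == label]
    return Fraction(sum(word in words for words in in_class), len(in_class))


def p_absent_given_class(word, label, data):
    """Fraction of tweets with this label that do not contain the word."""
    return 1 - p_word_given_class(word, label, data)


def floor_zero(prob, n):
    """Replace a zero probability with 1/n, n = number of training examples."""
    return prob if prob > 0 else Fraction(1, n)


if __name__ == "__main__":
    n = len(tweets)
    print(p_word_given_class("hate", "-", tweets))
    print(p_word_given_class("anger", "-", tweets))
    print(floor_zero(p_absent_given_class("anger", "-", tweets), n))
```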
(The maths of the calculation I put in this answer).

importance of PCA or SVD in machine learning

For a long time (especially during the Netflix contest), I have kept coming across blog posts and leaderboard forum threads that mention how applying a simple SVD step to the data helped reduce its sparsity or, more generally, improved the performance of the algorithm at hand.
I have been trying to work out why this is so for a long time, but I cannot.
In general, the data I get is very noisy (which is also the fun part of big data), and I do know some basic feature scaling techniques such as log transformation and mean normalization.
But how does something like SVD help?
Let's say I have a huge matrix of users rating movies, and on this matrix I implement some version of a recommendation system (say collaborative filtering):
1) Without SVD
2) With SVD
How does it help?
SVD is not used to normalize the data, but to get rid of redundant data, that is, for dimensionality reduction. For example, if you have two variables, one the humidity index and the other the probability of rain, then their correlation is so high that the second one does not contribute any additional information useful for a classification or regression task. The singular values in SVD help you determine which directions (linear combinations of variables) are most informative, and which ones you can do without.
The way it works is simple. You perform SVD on your training data (call it matrix A) to obtain U, S and V*. Then you set to zero all values of S below a chosen threshold (e.g. 0.1) and call the new matrix S'. A' = US'V* is then a low-rank approximation of A with the weak directions removed. To actually end up with fewer features, keep only the k columns of U and rows of V* that correspond to the non-zero entries of S', and use US' (restricted to those k columns) as your new training data, sometimes without any performance penalty (depending on your data and the threshold chosen). Keeping only the k largest singular values this way is called k-truncated SVD.
SVD doesn't help you with sparsity though, only helps you when features are redundant. Two features can be both sparse and informative (relevant) for a prediction task, so you can't remove either one.
Using SVD, you go from n features to k features, where each one will be a linear combination of the original n. It's a dimensionality reduction step, just like feature selection is. When redundant features are present, though, a feature selection algorithm may lead to better classification performance than SVD depending on your data set (for example, maximum entropy feature selection). Weka comes with a bunch of them.
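A minimal sketch of going from n features to k with NumPy, using the humidity/rain example: the second column is an exact multiple of the first, so one singular value is numerically zero and one dimension can be dropped without losing anything. The data values are made up, and the columns are not centered here (PCA would center them first):

```python
import numpy as np


def reduce_features(A, k):
    """Project the rows of A onto the top-k right singular vectors.

    Returns the (n_samples, k) reduced data and all singular values.
    Each new feature is a linear combination of the original columns.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return A @ Vt[:k].T, s


# Hypothetical data: humidity, rain probability (redundant), wind speed.
humidity = np.array([0.2, 0.5, 0.9, 0.4, 0.7])
wind = np.array([3.0, 1.0, 2.0, 5.0, 4.0])
A = np.column_stack([humidity, 0.9 * humidity, wind])

A_reduced, singular_values = reduce_features(A, k=2)

if __name__ == "__main__":
    print(singular_values)
    print(A_reduced.shape)
```

The same projection has to be applied to any test data before it is fed to the model.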
See: http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Dimensionality_Reduction/Singular_Value_Decomposition
https://stats.stackexchange.com/questions/33142/what-happens-when-you-apply-svd-to-a-collaborative-filtering-problem-what-is-th
The Singular Value Decomposition is often used to approximate a matrix X by a low rank matrix X_lr:
Compute the SVD X = U D V^T.
Form the matrix D' by keeping the k largest singular values and setting the others to zero.
Form the matrix X_lr by X_lr = U D' V^T.
The matrix X_lr is then the best rank-k approximation of X in the Frobenius norm (the equivalent of the l2-norm for matrices). This representation is also cheap to store: if X is n by n and k << n, the low-rank approximation needs only (2n + 1)k coefficients (the first k columns of U and V plus the k retained singular values).
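The three steps above can be sketched with NumPy as follows. The random 6 by 6 matrix is just a stand-in for real data, and the helper names are made up. The check at the end uses the known fact that the Frobenius error of the best rank-k approximation equals the square root of the sum of the squared discarded singular values:

```python
import numpy as np


def low_rank_approximation(X, k):
    """Best rank-k approximation of X in the Frobenius norm.

    Returns the approximation and the full vector of singular values.
    """
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    X_lr = (U[:, :k] * d[:k]) @ Vt[:k, :]
    return X_lr, d


def stored_coefficients(n, k):
    """Numbers needed to store a rank-k factorization of an n-by-n matrix:
    k columns of U, k columns of V, and k singular values."""
    return (2 * n + 1) * k


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 6))  # stand-in data
X_lr, d = low_rank_approximation(X, k=2)

frobenius_error = np.linalg.norm(X - X_lr, "fro")
expected_error = np.sqrt(np.sum(d[2:] ** 2))

if __name__ == "__main__":
    print(frobenius_error, expected_error)
    print(stored_coefficients(6, 2), "coefficients instead of", X.size)
```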
This was often used in matrix completion problems (such as collaborative filtering) because the true matrix of user ratings is assumed to be low rank (or well approximated by a low rank matrix). So, you wish to recover the true matrix by computing the best low rank approximation of your data matrix. However, there are now better ways to recover low rank matrices from noisy and missing observations, namely nuclear norm minimization. See for example the paper The power of convex relaxation: Near-optimal matrix completion by E. Candes and T. Tao.
(Note: the algorithms derived from this technique also store the SVD of the estimated matrix, but it is computed differently).
PCA or SVD, when used for dimensionality reduction, reduce the number of inputs. Besides saving the computational cost of learning and/or predicting, this can sometimes produce more robust models that are not optimal in a statistical sense but perform better in noisy conditions.
Mathematically, simpler models have lower variance, i.e. they are less prone to overfitting. Underfitting, of course, can be a problem too. This is known as the bias-variance dilemma. Or, in the plain words often attributed to Einstein: things should be made as simple as possible, but not simpler.
