I am trying to determine the optimal group of variables for a classification task. Sometimes, instead of a group of variables, only a single variable should be selected (although the data looked pretty weak when considering each variable on its own).
I used several classifiers (Random Forest, logistic regression, SVM) and have a small problem understanding the results (the best results were achieved with RF).
Can someone with a deeper conceptual understanding of random forests than me please explain what a random forest using only one variable is doing? Since there is only one variable, it is hard for me to see how the random forest can achieve better sensitivity/specificity than that single variable could ever achieve alone (which it does). Is the RF, in this case, simply a decision tree? I suspected it might be, and after testing I observed that all the scores (accuracy, F1, precision, recall) were identical for the two.
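For reference, here is a minimal sketch of the kind of comparison I ran (synthetic data and a single made-up predictor; my real features and preprocessing differ):

```python
# Sketch: a random forest restricted to one variable vs. a single decision tree
# on that same variable, on hypothetical synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))                                    # a single predictor
y = (X[:, 0] + rng.normal(scale=1.0, size=500) > 0).astype(int)  # noisy target

rf = RandomForestClassifier(n_estimators=500, random_state=0)
dt = DecisionTreeClassifier(random_state=0)

print("RF   accuracy:", cross_val_score(rf, X, y, cv=5).mean())
print("Tree accuracy:", cross_val_score(dt, X, y, cv=5).mean())
```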
Thanks for the help.
I have a question and need your support. I have a data set that I am analyzing, and I need to predict a target. To do this I did some data cleaning; among other things, I dropped highly linearly correlated features.
After preparing my data I applied a random forest regressor (it is a regression problem). I am a bit stuck, since I really cannot grasp the meaning of max_features, and therefore what value to use for it.
I found the following answer, where it is written:
max_features=n_features for regression is a mistake on scikit's part. The original paper for RF gave max_features = n_features/3 for regression
I do get different results if I use max_features=sqrt(n) or max_features=n_features
Can anyone give me a good explanation of how to approach this parameter?
That would be really great.
max_features is a parameter that needs to be tuned. Values such as sqrt(n_features) or n_features/3 are defaults that usually perform decently, but the parameter should be optimized for every dataset, since the best value depends on the features you have, their correlations, and their importances.
Therefore, I suggest training the model many times with a grid of values for max_features, trying every possible value from 2 to the total number of your features. Train your RandomForestRegressor with oob_score=True and use oob_score_ to assess the performance of the Forest. Once you have looped over all possible values of max_features, keep the one that obtained the highest oob_score.
For safety, keep n_estimators on the high end.
PS: this procedure is basically a grid-search optimization over a single parameter, and it is usually done via cross-validation. Since RFs give you OOB scores, you can use these instead of CV scores, as they are quicker to compute.
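A minimal sketch of this OOB sweep, assuming a scikit-learn setup and made-up data (replace X and y with your own prepared matrix and target):

```python
# Sketch: sweep every candidate value of max_features and keep the one with the
# highest out-of-bag score (hypothetical synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))
y = X[:, 0] - 2 * X[:, 3] + rng.normal(size=300)

best_m, best_oob = None, -np.inf
for m in range(2, X.shape[1] + 1):          # every candidate value of max_features
    rf = RandomForestRegressor(
        n_estimators=500,                   # keep n_estimators on the high end
        max_features=m,
        oob_score=True,
        random_state=0,
    ).fit(X, y)
    if rf.oob_score_ > best_oob:            # OOB R^2 as the selection criterion
        best_m, best_oob = m, rf.oob_score_

print("best max_features:", best_m, "with OOB score:", best_oob)
```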
I am working on optimizing a manufacturing dataset that consists of a huge number of controllable parameters. The goal is to find the best run settings for these parameters.
I familiarized myself with several predictive algorithms while doing my research. If I use, say, Random Forest to predict my dependent variable and to understand how important each independent variable is, is there a way to extract the final equation/relationship the algorithm uses?
I'm not sure if my question was clear enough, please let me know if there's anything else I can add here.
There is no general way to get an interpretable equation from a random forest that explains how your covariates affect the dependent variable. For that you can use a different, more suitable model, e.g., linear regression (perhaps with kernel functions) or a single decision tree. Note that you can use one model for prediction and another for descriptive analysis - there is no inherent reason to stick with a single model.
use Random Forest to predict my dependent variable to understand how important each independent variable is
Understanding how important each independent variable is does not necessarily require what you ask for in the title, namely extracting the actual relationship. Most random forest packages have a method that quantifies how much each covariate affected the model over the training set.
There are a number of methods to estimate feature importance from a trained model. For random forests, the most famous are MDI (Mean Decrease in Impurity) and MDA (Mean Decrease in Accuracy). Many popular ML libraries support feature importance estimation for random forests out of the box.
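For instance, both flavours are readily available in scikit-learn (sketch with made-up data; feature_importances_ gives the impurity-based MDI ranking, while permutation_importance is a permutation-based analogue of MDA):

```python
# Sketch: impurity-based (MDI) and permutation-based (MDA-style) importances
# from a fitted random forest, on hypothetical synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

print("MDI importances:", rf.feature_importances_)

perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print("Permutation importances:", perm.importances_mean)
```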
I read this line today:
Every regression gets better with the addition of more features or variables... But adding more features increases complexity and reduces interpretability of the model as well.
I am unable to understand what interpretability is (I searched for it on Google but still did not get it).
Please help, thank you.
I would say that interpretability in a regression problem is when you can explain the result of your model to non-statisticians / domain experts.
For example: you try to predict people's height based on many variables, including sex. If you use linear regression, you will be able to say that the model adds 20 cm (again, for example) to the predicted height if the person is a man (compared to a woman). The domain expert will understand the relationship between the explanatory variable and the predicted result without understanding statistics or how a linear regression works.
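To make that concrete, here is a tiny sketch with simulated data where the "being a man adds 20 cm" effect is built in, and the fitted coefficient reads it back directly:

```python
# Sketch: the coefficient of the sex dummy is directly interpretable as the
# average height difference, here simulated to be about 20 cm.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
is_man = rng.integers(0, 2, size=1000)                 # 0 = woman, 1 = man
height = 165 + 20 * is_man + rng.normal(scale=7, size=1000)

model = LinearRegression().fit(is_man.reshape(-1, 1), height)
print("Estimated effect of being a man (cm):", model.coef_[0])  # roughly 20
```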
In addition, I disagree with the claim that adding more features or variables always improves regression results.
What is a "better" regression? An improvement in the chosen metrics? On the training or the test set? "A better regression" doesn't mean anything on its own...
If we assume that a better regression is one that better predicts the target on new data, then more variables do not always improve predictive power, especially when there is no regularization, when an added feature contains information from the future, or in many other cases.
I have a small doubt about variable selection in random forests. I am aware that the algorithm chooses m random variables out of M variables for splitting and keeps the value of m constant throughout.
My question is: why are these m variables not the same at each node? What is the reason behind this? Can someone help with this?
Thanks,
The fact that a different (randomly chosen) set of m features is considered at each split is actually an advantage of RF: it decorrelates the trees, so the final model is more robust and accurate. It also helps in identifying which features contribute most and have the best predictive power.
By the way, that's why it is called a Random Forest after all...
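As a rough sketch of the mechanism (not the actual implementation of any particular library), each node draws a fresh random subset of m feature indices and searches for the best split only among those:

```python
# Sketch of per-split feature subsampling: at each node, a new random subset of
# m features is drawn, and only those are searched for the best split.
import numpy as np

def candidate_features_for_split(n_features: int, m: int, rng: np.random.Generator):
    """Return the feature indices a single node is allowed to split on."""
    return rng.choice(n_features, size=m, replace=False)

rng = np.random.default_rng(0)
for node in range(3):      # three different nodes of the same tree
    print("node", node, "considers features", candidate_features_for_split(10, 3, rng))
```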
I have a regression model that is most suitably solved using elastic net.
It has a very large number of predictors, of which I need to select only a subset. Moreover, there could be correlation between the predictors, so elastic net was the choice.
My question is:
If I have knowledge that a specific subset of the predictors must be included in the model (they shouldn't be penalized), how can this information be added to the elastic net?
Or, if not to elastic net specifically, to whichever regression model is suitable in this case.
I would appreciate advice about papers that propose such solutions, if possible.
I'm using scikit-learn in Python, but I'm more concerned about the algorithm than about how to implement it.
If you're using the glmnet package in R, the penalty.factor argument addresses this.
From ?glmnet:
penalty.factor
Separate penalty factors can be applied to each coefficient. This is a number that multiplies lambda to allow differential shrinkage. Can be 0 for some variables, which implies no shrinkage, and that variable is always included in the model. Default is 1 for all variables (and implicitly infinity for variables listed in exclude). Note: the penalty factors are internally rescaled to sum to nvars, and the lambda sequence will reflect this change.
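scikit-learn's ElasticNet has no direct equivalent of penalty.factor, so if you stay in Python one rough workaround (an approximation I'm suggesting here, not a feature of glmnet or scikit-learn) is to rescale the must-keep columns by a large constant: their fitted coefficients then become numerically tiny, so the penalty on them is negligible and they are effectively always kept.

```python
# Rough workaround (assumption, not a library feature): blowing up the scale of
# the must-keep columns makes their coefficients tiny, so the L1/L2 penalty on
# them is negligible and they behave as (approximately) unpenalized.
# Note: do not standardize these columns afterwards, or the trick is undone.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)

keep = [0]            # hypothetical indices of predictors that must stay in the model
scale = 1e4

X_mod = X.copy()
X_mod[:, keep] *= scale        # huge column scale -> near-zero penalty on its coefficient

model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_mod, y)

coef = model.coef_.copy()
coef[keep] *= scale            # map the coefficients back to the original scale
print(coef)
```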
It depends on the kind of knowledge that you have. Regularization is a way of adding prior knowledge to your model. For example, ridge regression encodes the knowledge that your coefficients should be small, lasso regression encodes the knowledge that not all predictors are important, and elastic net is a more complicated prior that combines both assumptions. There are other regularizers you may check: for example, if you know that your predictors fall into certain groups, you may look at group lasso, and there are variants for predictors that interact in certain ways (e.g., when some predictors are correlated with each other). You may also check Bayesian regression if you need more control over your prior.
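As a small illustration of how these different priors behave on the same problem (made-up data with two correlated predictors): ridge shrinks everything, lasso zeroes some coefficients out, and elastic net sits in between.

```python
# Sketch: Ridge shrinks all coefficients, Lasso zeroes some out entirely, and
# Elastic Net sits in between (hypothetical correlated predictors).
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)       # two correlated predictors
y = 3.0 * X[:, 0] + rng.normal(size=200)

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    coef = model.fit(X, y).coef_
    print(type(model).__name__, np.round(coef, 2))
```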