The CRAN implementation of random forests offers both variable importance measures: the Gini importance as well as the widely used permutation importance defined as
For classification, it is the increase in percent of times a case is OOB and misclassified when the variable is permuted. For regression, it is the average increase in squared OOB residuals when the variable is permuted.
By default h2o.varimp() computes only the former. Is there really no option in H2O to get the alternative measure out of a random forest model?
Thanks!
ML
H2O does not calculate permutation importance. Please see the documentation for the explanation of how variable importance is calculated.
For your convenience, I'll paste it below as well:
How is variable importance calculated for DRF?
Variable importance is determined by calculating the relative influence of each variable: whether that variable was selected during splitting in the tree building process and how much the squared error (over all trees) improved as a result.
A feature request has previously been filed for this; you can follow it here (note that it is still open).
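In the meantime, one workaround is to compute permutation importance by hand on a holdout set: permute one column at a time and measure the drop in a score. A minimal sketch in Python (the function and parameter names here are illustrative, not part of any H2O API; it only assumes a fitted model exposing predict()):

```python
import numpy as np

def permutation_importance_by_hand(model, X, y, score, n_repeats=10, seed=0):
    """Permute one column at a time and record the mean drop in a score
    (higher = better) over n_repeats shuffles. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    baseline = score(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the feature-target link
            drops.append(baseline - score(y, model.predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances
```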
Related
For a particular prediction problem, I observed that a certain variable ranks high in the XGBoost feature importance output (based on gain) while it ranks quite low in the SHAP output.
How should I interpret this? Is the variable highly important for our prediction problem, or not?
Impurity-based importances (such as the sklearn and xgboost built-in routines) summarize the overall usage of a feature by the tree nodes. This naturally gives more weight to high-cardinality features (more feature values yield more possible splits), while gain may be affected by tree structure (node order matters even though the predictions may be the same). There may be many splits with little effect on the prediction, or the other way around (many splits diluting the average importance) - see https://towardsdatascience.com/interpretable-machine-learning-with-xgboost-9ec80d148d27 and https://www.actuaries.digital/2019/06/18/analytics-snippet-feature-importance-and-the-shap-approach-to-machine-learning-models/ for various mismatch examples.
In an oversimplified way:
impurity-based importance explains the feature usage for generalizing on the train set;
permutation importance explains the contribution of a feature to the model accuracy;
SHAP explains how much changing a feature value would affect the prediction (these characterizations are simplifications, not strictly correct). A side-by-side computation of all three is sketched below.
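A minimal sketch, assuming the xgboost, shap, and scikit-learn packages; the toy data and model settings are arbitrary, and the three rankings may or may not disagree on data this simple:

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
model = XGBClassifier(n_estimators=100, max_depth=3).fit(X, y)

# Built-in gain importance (impurity-style summary of node usage)
print("gain:", model.get_booster().get_score(importance_type="gain"))

# Permutation importance (contribution of each feature to model accuracy)
perm = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print("permutation:", perm.importances_mean)

# Mean |SHAP|: average effect of each feature on individual predictions
shap_values = shap.TreeExplainer(model).shap_values(X)
print("mean |SHAP|:", np.abs(shap_values).mean(axis=0))
```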
I have a dataset which has 5 variables and 1 response. The variables are discrete. I want to find the key variable, and the value of it, that leads to a significant increase or decrease in the response.
You will need to perform some statistical tests in order to find which variables are the most significant.
If you are familiar with Python, you could use SelectKBest from scikit-learn. It will give you a score; the higher the score, the stronger the link between the feature and the output.
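A minimal sketch, assuming the data sit in a pandas DataFrame df with a "response" column (hypothetical names); since your features are discrete, a score function such as chi2 or mutual_info_classif may suit better than the f_classif shown here:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# df is assumed to hold the 5 discrete variables plus a "response" column
X = df.drop(columns="response")
y = df["response"]

selector = SelectKBest(score_func=f_classif, k="all").fit(X, y)
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores)  # higher score = stronger link between feature and response
```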
Additionally, you can train an explainable ML model that is strong enough to converge and find the patterns within the data; from that you can compute the feature importance.
For example, you could use DecisionTreeClassifier from scikit-learn. It has a decision_path method that returns the decision path taken by the tree for each sample, and the fitted tree exposes a feature_importances_ attribute that uses the Gini criterion to compute the importance of the features.
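Continuing the sketch above with the same hypothetical X and y:

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(tree.feature_importances_)  # Gini-based importance per feature

# Visualize the splits to see which variable/value drives the response
plot_tree(tree, feature_names=list(X.columns), filled=True)
plt.show()
```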
Last but not least, you can use dimensionality reduction techniques such as PCA, which finds the directions of greatest variance among the variables. From the PCA you obtain new principal components that are linear combinations of the features, and from the most explanatory components you can derive the feature importance. Check this Stack Overflow answer, which explains everything you should know for that.
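A minimal sketch of that PCA-based approach, again with the hypothetical X from above (standardizing first, since PCA is scale-sensitive):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale
pca = PCA().fit(X_scaled)

print(pca.explained_variance_ratio_)  # variance explained per component
print(abs(pca.components_))           # |loading| of each feature on each component
```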
I am a beginner in machine learning, so any help or suggestion would be appreciated.
I have read that putting weights on features when predicting is a very bad idea. But what if a few features need to be weighted?
In a classification problem, suppose it is common knowledge that age is the most influential feature. How do I give weight to this feature? I was thinking of normalizing it but with a variance of 1.5 or 2 (other features with variance 1), so that this feature would carry more weight. Is this fundamentally wrong? If so, is there another method?
Does it affect classification and regression problems differently?
If we are talking specifically about random forests (as you tagged), then you can use the Weighted Subspace Random Forest algorithm (the wsrf package in R). The algorithm determines a weight for each variable and then uses these weights during model building.
The informativeness of a variable with respect to the class is measured by an information gain ratio. The measure is used as the probability of that variable being selected for inclusion in the variable subspace when splitting a specific node during the tree building process. Therefore, variables with higher values by the measure are more likely to be chosen as candidates during variable selection and a stronger tree can be built.
Generally, if a feature is more important than the others and the model is dense enough, with enough training samples, your model will automatically give it more importance: backpropagation computes a partial derivative for each connection, so the optimizer adjusts the weight matrices to account for that feature on its own. If instead of normalizing it you scale it to a larger range, you might have overstated its importance.
In practice, a neural network works best if the inputs are centered and whitened, meaning their covariance is diagonal and their mean is the zero vector. This improves optimization of the neural net, since the hidden activation functions do not saturate as quickly and thus do not give you near-zero gradients early in training.
If you do scale just one feature up by some factor, it may or may not have the desired effect, but the more likely outcome is saturated gradients, so we avoid it. A sketch of the standard preprocessing, alongside your variance idea, follows.
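A minimal sketch; X_train, X_test, and age_idx are hypothetical placeholders, and the variance-inflation step is shown purely to illustrate the questioner's idea, not as a recommendation:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # zero mean, unit variance per feature
X_test_std = scaler.transform(X_test)        # reuse the training statistics

# The questioner's idea, for illustration only: after standardizing,
# inflate one column (a hypothetical age_idx) to standard deviation 1.5
X_train_std[:, age_idx] *= 1.5
X_test_std[:, age_idx] *= 1.5
```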
I am working on optimizing a manufacturing dataset that consists of a huge number of controllable parameters. The goal is to find the best run settings for these parameters.
I familiarized myself with several predictive algorithms while doing my research. If I, say, use Random Forest to predict my dependent variable to understand how important each independent variable is, is there a way to extract the final equation/relationship the algorithm uses?
I'm not sure if my question was clear enough, please let me know if there's anything else I can add here.
There is no general way to get an interpretable equation from a random forest explaining how your covariates affect the dependent variable. For that you can use a different, more suitable model, e.g., linear regression (perhaps with kernel functions) or a decision tree. Note that you can use one model for prediction and another for descriptive analysis - there is no inherent reason to stick with a single model.
use Random Forest to predict my dependent variable to understand how important each independent variable is
Understanding how important each independent variable is does not necessarily require what the title of your question asks for, namely the actual relationship. Most random forest packages have a method quantifying how much each covariate affected the model over the training set.
There are a number of methods to estimate feature importance from a trained model. For Random Forest, the most famous are MDI (Mean Decrease in Impurity) and MDA (Mean Decrease in Accuracy). Many popular ML libraries support feature importance estimation for Random Forest out of the box.
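For example, scikit-learn exposes both: MDI via the fitted model's feature_importances_ attribute, and MDA via permutation_importance. A minimal sketch with placeholder X and y:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

print(rf.feature_importances_)  # MDI: mean decrease in impurity

# MDA: mean drop in score when each feature is permuted on held-out data
mda = permutation_importance(rf, X_test, y_test, n_repeats=20, random_state=0)
print(mda.importances_mean)
```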
Is there a way to calculate Accuracy instead of Error metrics for neural networks when doing regression (prediction of continuous variable) the same way we do when classifying categorical variables?
Though the concept of accuracy belongs to classification, you can still print the predicted values and compare them with the dependent variable.
The problem with a continuous variable is that the probability of reproducing exactly a given value is (practically) zero. For instance, if your neural network produces 2.000001 and the actual value is 2, this counts as a wrong prediction because the two values differ (although they are very close). Error metrics like the root mean squared error therefore measure the average (squared) difference.
However, depending on your application, you could introduce a threshold value ϵ, consider a given output of your neural network correct if the absolute difference between the observed value and the output is smaller than ϵ, and compute the percentage of correct predictions.
In practice such a metric is not minimized directly, because its gradient is zero almost everywhere, but it is still a useful quantity to compute.
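A minimal sketch of such an ϵ-accuracy (the function name and the default threshold are illustrative; a sensible ϵ is application-specific):

```python
import numpy as np

def regression_accuracy(y_true, y_pred, eps=0.1):
    """Fraction of predictions within eps of the observed value."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred) < eps))
```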