I am working on a random forest model of antibiotic sensitivity data. There are 11 variables in total in my dataset. The antimicrobial sensitivity result is categorized as Sensitive (S), Intermediate (I) and Resistant (R). When I run the random forest model on my data, it reports only one variable as important rather than all 11 variables.
Only one variable, i.e. gentamicin, comes out as important, and none of the others do.
[Table of variable importance values from the random forest run not shown.]
Why are the remaining variables not important?
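For context, here is a minimal sketch of how such an importance table can be produced with the randomForest package in R; the data frame name abx and the outcome column sensitivity are placeholders, not the actual code used.

library(randomForest)

# abx: data frame with the S/I/R outcome as a factor plus the other predictor columns
abx$sensitivity <- factor(abx$sensitivity, levels = c("S", "I", "R"))

set.seed(1)
rf <- randomForest(sensitivity ~ ., data = abx, importance = TRUE)

importance(rf)   # MeanDecreaseAccuracy and MeanDecreaseGini for every predictor
varImpPlot(rf)   # visual comparison of importance across all 11 variables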
I’m working with an auto dataset I found on Kaggle. Besides numerical variables like horsepower, car length and car weight, it has multiple categorical variables such as:
Car type (sedan, SUV, hatchback, etc.): cardinality = 5
Car brand (Toyota, Nissan, BMW, etc.): cardinality = 21
Doors (2-door and 4-door): cardinality = 2
Fuel type (gas and diesel): cardinality = 2
I would like to use a random forest classifier to perform feature selection with all these variables as input. I’m aware that the categorical variables need to be encoded before doing so. What is the best approach to handling data with such varying cardinalities?
Can I apply different encoding techniques to different variables? Say, for example, one-hot encoding on fuel type and label encoding on car type?
You can apply different encoding techniques to different variables. However, label encoding introduces a hierarchy/order, which doesn't look appropriate for any of the predictors you mention. For this example it looks like one-hot encoding all the categorical predictors would be better, unless there are some ordinal variables you haven't mentioned.
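As a rough sketch in R (the data frame cars and its column names are made up for illustration; note that R's randomForest can also use factor columns directly, so explicit encoding is only needed when the implementation requires numeric input):

# Treat the nominal variables as plain (unordered) factors:
cars$car_type  <- factor(cars$car_type)    # sedan, suv, hatchback, ...
cars$car_brand <- factor(cars$car_brand)   # 21 levels
cars$doors     <- factor(cars$doors)       # 2-door, 4-door
cars$fuel_type <- factor(cars$fuel_type)   # gas, diesel

# One-hot (dummy) encode all factor columns at once, dropping the intercept:
X <- model.matrix(~ . - 1, data = cars)

# Integer/label encoding would only be appropriate for a genuinely ordinal
# predictor, e.g. a hypothetical trim level with a natural order:
# cars$trim <- as.integer(factor(cars$trim, levels = c("base", "mid", "premium")))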
EDIT: In response to your comment: I would only ever use label encoding if a categorical predictor were ordinal. If it isn't, I would not try to enforce an order, and would use one-hot encoding if the model type couldn't cope with categorical predictors directly.
Whether this causes an issue regarding sparse trees and too many predictors depends entirely on your dataset. If you still have many rows compared to predictors, then it generally isn't a problem. You can run into issues with random forests if you have a lot of predictors that aren't correlated with the target variable at all: because predictors are chosen at random for each split, you can end up with many trees that don't contain any relevant predictors, which adds noise. In that case you could try to remove non-relevant predictors before running the random forest model, or you could try a different type of model, e.g. penalized regression (see the sketch below).
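If you go the penalized-regression route, a rough sketch with glmnet on the one-hot encoded matrix follows; the target column name and the assumption of a binary target are mine, purely for illustration.

library(glmnet)

# X: one-hot encoded predictors; y: the (assumed binary) target column
X <- model.matrix(~ . - 1, data = cars[, setdiff(names(cars), "target")])
y <- cars$target

# Lasso (alpha = 1) with cross-validation; predictors whose coefficients are
# shrunk to exactly zero are candidates for dropping before the random forest.
cv_fit <- cv.glmnet(X, y, family = "binomial", alpha = 1)
coef(cv_fit, s = "lambda.1se")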
I have built a random forest model and saved it. There are a few numeric variables and a few one-hot encoded variables converted to factors.
I have a situation where some of the records in the new data are also part of the training data, and the prediction probabilities differ for those same records.
The saved model object is named Rfmod.
I used the following code to run the predictions:
load("Rfmod.RData")                            # assuming the model was saved with save(Rfmod, file = "Rfmod.RData")
Pred <- predict(Rfmod, newdata, type = "prob")
The probabilities for the records common to both the training data and the new data are not the same. Any thoughts on this? I have also tried passing the newdata option in the predict function, but the difference is still there.
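For reference, a sketch of the comparison being described (traindata is a placeholder for the original training data):

common <- intersect(rownames(traindata), rownames(newdata))

# Score the shared records two ways:
p_oob <- predict(Rfmod, type = "prob")[common, ]            # OOB probabilities for the training rows
p_new <- predict(Rfmod, newdata[common, ], type = "prob")   # probabilities via newdata

# Caveat worth ruling out: in randomForest, predict() without newdata returns
# out-of-bag probabilities, which will generally differ from probabilities
# computed by passing the same rows through newdata.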
What's your approach to solving a machine learning problem with multiple datasets that have different parameters, columns and lengths/widths? Only one of them has a dependent variable; the rest of the files contain supporting data.
Your question is quite generic, and part of it is not really relevant: the number of columns and the lengths/widths of the files are not, by themselves, obstacles to building an ML model. Given that only one of the datasets has a dependent variable, you will need to merge the datasets on keys that are common across them. The process typically followed before modelling is:
Step 0: Identify the dependent variable and decide whether to do regression or classification (assuming you are predicting its value).
Step 1: Clean up the provided data by handling duplicates and spelling mistakes.
Step 2: Scan through the categorical variables to handle any discrepancies.
Step 3: Merge the datasets into a single dataset that contains all the independent variables and the dependent variable to be predicted.
Step 4: Do exploratory data analysis to understand the dependent variable's behaviour with respect to the independent variables.
Step 5: Create a model and refine it based on VIF (variance inflation factor) and p-values (see the sketch below).
Step 6: Iterate, reducing the variables until you get a model in which all remaining variables are significant and the R^2 value is stable. Finalize the model.
Step 7: Apply the trained model to the test dataset and compare the predicted values against the actual values of the dependent variable.
Following these steps at a high level will help you build models.
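A minimal sketch of the merge-and-refine part of this workflow in R; the data frame names, the key column customer_id and the use of the car package for VIF are assumptions for illustration only.

library(car)   # for vif()

# Merge the supporting datasets onto the one containing the dependent variable y,
# using a key assumed to be shared across files:
full <- merge(target_df, support_df1, by = "customer_id", all.x = TRUE)
full <- merge(full, support_df2, by = "customer_id", all.x = TRUE)

# Initial model with all independent variables:
fit <- lm(y ~ . - customer_id, data = full)

summary(fit)   # p-values of the individual variables
vif(fit)       # variance inflation factors

# Iteratively drop variables with high VIF or insignificant p-values, refit,
# and stop once all remaining variables are significant and R^2 is stable.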
I have a (probably stupid) question about predicting a new instance when one or more predictors are missing.
I am given a dataset. Let's say I preprocess and clean the data and, as a result, 10 predictors are left. Then I train my model on the resulting data, so I am ready to use the model to predict.
Now, what should I do if I want to predict a new instance for which 1 or 2 predictors are missing?
There are at least two reasonable solutions.
(1) Average the output over the possible values of the missing variable or variables, conditional on the values of the non-missing variables. That is, compute a weighted average of prediction(missing, non-missing) over each possible value of missing, weighted by the probability of missing given non-missing. This is essentially a variety of what's called "multiple imputation" in the literature.
A simpler first attempt is to weight by the unconditional distribution of the missing variable instead. If even that seems too complicated, a very rough approximation is to substitute the mean value of the missing variable into the prediction.
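A rough sketch of that crude mean-substitution version in R (fit, train and new_x are placeholders for the trained model, the training data and the one-row new instance):

# new_x: one-row data frame for the new instance, NA where a predictor is missing
for (v in names(new_x)) {
  if (is.na(new_x[[v]]) && is.numeric(train[[v]])) {
    new_x[[v]] <- mean(train[[v]], na.rm = TRUE)   # substitute the training mean
  }
}
# (For a categorical predictor you would substitute the modal level instead.)
predict(fit, newdata = new_x)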
(2) Build a model for each combination of variables. If you have n variables, this means building 2^n models. If n = 10, 1024 models is not a big deal these days. Then, if you are missing some variables, just use the model trained on the ones that are present.
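A sketch of that idea in R (train with target column y and a one-row new_x are placeholders; the loop over subsets is what grows as 2^n):

library(randomForest)

predictors <- setdiff(names(train), "y")
models <- list()

# Train one model for every non-empty subset of predictors.
for (k in seq_along(predictors)) {
  for (subset in combn(predictors, k, simplify = FALSE)) {
    key <- paste(sort(subset), collapse = "+")
    models[[key]] <- randomForest(reformulate(subset, response = "y"), data = train)
  }
}

# At prediction time, use the model trained on exactly the predictors
# that are present (non-missing) in the new instance:
present <- names(new_x)[!sapply(new_x, anyNA)]
key <- paste(sort(intersect(present, predictors)), collapse = "+")
predict(models[[key]], newdata = new_x)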
By the way, you might get more interest in this question at stats.stackexchange.com.
I have performed a random forest analysis of 100,000 classification trees on a rather small dataset (i.e. 28 obs. of 11 variables).
I then made a plot of the variable importance measures.
In the resulting plots there is a substantial mismatch between %IncMSE and IncNodePurity for at least one of the important variables: the variable in question appears seventh by importance in the former (in fact its %IncMSE is below zero) but third in the latter.
Could anyone enlighten me on how I should interpret this mismatch?
The variable in question is significantly correlated with one other variable that appears consistently in second place in both graphs. Could this be a clue?
%IncMSE (the first graph) shows by how much the MSE increases when the values of a variable are randomly permuted. The higher the value, the more important the variable.
IncNodePurity, on the other hand, measures the total decrease in node impurity from splits on that variable: for regression trees this is the reduction in the residual sum of squares (RSS), while for classification trees the Gini index plays the same role.
Since the two criteria measure variable importance in different ways, the same variable can be ranked differently by each.
There is no fixed rule for which measure of variable importance is "best"; it depends on the problem you have at hand.
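For what it's worth, both measures can be inspected side by side in R with the randomForest package (mydata and the response y are placeholders; a numeric response gives a regression forest, which is what produces %IncMSE and IncNodePurity):

library(randomForest)

set.seed(1)
rf <- randomForest(y ~ ., data = mydata, ntree = 100000, importance = TRUE)

importance(rf)   # %IncMSE (permutation-based) and IncNodePurity (impurity-based)
varImpPlot(rf)   # the two rankings plotted side by side, where the mismatch shows up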