Query about variable selection in Random Forest - machine-learning

I have a small doubt about variable selection in Random Forest. I am aware that it chooses "m" random variables out of "M" variables for splitting and keeps the value of m constant throughout.
My question is: why are these m variables not the same at each node? What is the reason behind it? Can someone help with this?
Thanks,

The fact that a different (randomly chosen) set of m features is considered at each node is actually an advantage for RF. Re-drawing the candidate features at every split decorrelates the individual trees, so the final model is more robust and accurate. It also helps in identifying which features contribute most and have the best predictive power.
By the way, that's why it is called a Random Forest after all...
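
For what it's worth, in scikit-learn the number of candidate features drawn at each split is controlled by max_features; here is a minimal sketch on made-up data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for a real problem
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# m = sqrt(M) candidate features are re-drawn at every node; the value of m
# stays fixed, the subset itself does not.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
rf.fit(X, y)

# The aggregated splits also give a rough measure of which features matter most
print(rf.feature_importances_[:5])
```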

Related

Predicting house price: is it okay to use a constant (int) to indicate "unknown"?

I have a dataset and am trying to predict house prices. Several variables (#bedrooms, #bathrooms, area, ...) use the constants 0 or -1 to indicate "not known". Is this good practice?
Dropping these rows would result in the loss of too much data. Interpolation does not seem like a good option, especially since there are cases where several of these values are unknown at once and they are fairly strongly correlated with each other.
Substituting the column mean for these values would not work either, seeing as all houses are fundamentally different.
Does anyone have advice on this?
It totally depends on which ML algorithm you want to use. Some can handle null values for missing data and others can't.
Usually, interpolating/predicting these missing values is a reasonable idea. You could run one algorithm first to predict the missing values from the available data, and then run a second algorithm for the house price prediction.
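
As a minimal sketch of that two-stage idea (column names and data are made up, so treat this as an illustration rather than a recipe): turn the 0/-1 sentinels into NaN, let a model-based imputer fill them from the correlated features, then fit the price regressor.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline

# Toy stand-in for the real data; column names are invented for illustration.
df = pd.DataFrame({
    "bedrooms":  [3, 0, 2, 4, -1, 3],
    "bathrooms": [2, 1, -1, 3, 2, 0],
    "area":      [120, 80, 95, 0, 150, 110],
    "price":     [300, 180, 220, 410, 390, 290],
})

features = ["bedrooms", "bathrooms", "area"]
X = df[features].replace({0: np.nan, -1: np.nan})   # sentinels -> missing
y = df["price"]

# IterativeImputer predicts each missing value from the other (correlated)
# features; the regressor is then trained on the completed data.
model = make_pipeline(IterativeImputer(random_state=0),
                      RandomForestRegressor(n_estimators=100, random_state=0))
model.fit(X, y)
```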

Tree vs. regression algorithm: which works better for a model with mostly categorical features?

I'm working on a regression problem to predict the selling price of a product. The features are a 4-level product hierarchy and a proposed price; in summary, there are 4 categorical features and one numerical feature, and about 1000K rows in total.
I think a decision tree or random forest would work better than regression in this scenario, since there is only one numerical feature. I also plan to convert the numerical feature (proposed price) into price buckets, making it another categorical feature.
Does my reasoning make sense? Is there any other algorithm that might be worth trying? Is there any other clever feature engineering worth trying?
Note 1: This is actually a challenge problem (like Kaggle), so the features have been masked and encoded. Looking at the data, I can say for sure that there is a 4-level product hierarchy, but I'm not very sure about the one numerical feature (which I think is the proposed price), because in some cases it differs a lot from the sold price (the y variable). Also, there are a lot of outliers (probably introduced deliberately to confuse) in this column.
I would not recommend binning the proposed price variable, as one would expect that variable to carry most of the information needed to predict the selling price. Binning a variable is advantageous when it is noisy, but it comes at a cost, since you throw away valuable information. You do not have to bin your continuous variable; trees will do it for you (and RFs likewise). If your categorical variables are ordinal you do not have to do anything; if they are not, you may consider encoding them (map the distinct values to, say, one-hot vectors such as [0, 0, 1]) and try other regressors that way, such as SVR from https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html (in this case you may consider scaling the variables to [0, 1]).
Edit: RFs are generally better than single trees; just make sure you know what you're doing, and understand that an RF is many trees ensembled together.
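
A minimal sketch of that encoding/scaling suggestion, with assumed column names since the real features are masked:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline

categorical = ["level1", "level2", "level3", "level4"]   # assumed column names
numerical = ["proposed_price"]                           # assumed column name

preprocess = ColumnTransformer([
    # One-hot encode the hierarchy levels (they are not ordinal)
    ("cats", OneHotEncoder(handle_unknown="ignore"), categorical),
    # Scale the proposed price to [0, 1] for the SVR
    ("num", MinMaxScaler(), numerical),
])

svr = make_pipeline(preprocess, SVR())
# svr.fit(X_train, y_train)   # X_train: DataFrame with the columns above
```

Keep in mind that kernel SVR scales poorly to ~1000K rows, so a tree ensemble on the raw (unbinned) proposed price is probably the more practical baseline.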

Random Forest using one variable

I am trying to determine the optimal group of variables for a classification task. Sometimes, instead of a group of variables, only a single variable ends up being selected (even though the data looked pretty weak for each variable on its own).
I used several classifiers (Random Forest, logistic regression, SVM) and have a small problem in understanding the results (the best results were achieved with RF).
Can someone with a deeper conceptual understanding of random forests than me please explain what a random forest using one variable is doing? Since it is only one variable, it is hard for me to see how the random forest can achieve a better sensitivity/specificity than that single variable can ever achieve alone (which it does). Is the RF in this case just a decision tree? I was thinking that it might be, and after testing I observed that all the scores (accuracy, F1, precision, recall) were the same for the two of them.
Thanks for the help.
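
One way to check this intuition directly is to compare a single tree and a forest on a one-feature dataset: with only one feature, per-node feature sampling does nothing, so any remaining difference comes from the bootstrap sampling of rows for each tree. A rough sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# One informative feature only
X, y = make_classification(n_samples=300, n_features=1, n_informative=1,
                           n_redundant=0, n_clusters_per_class=1, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("single tree:", cross_val_score(tree, X, y, cv=5).mean())
print("forest     :", cross_val_score(forest, X, y, cv=5).mean())
```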

ML method to identify data subset with lower average values

I have a task where I need to bin groups of observations [y,X] by the average value of y, using the feature set X.
A good interpretation is that these are customers and I want to find the most valuable ones.
One idea is decision trees, but I'm not sure this yields the optimal groups for my problem. My intuition is that decision trees are mainly used for prediction, so the focus of those algorithms might be different. Another idea is a brute-force search.
Can someone point me in the right direction with some concepts (maybe just search keywords) for how this problem is solved in ML?
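
If you do want to try the decision-tree idea, one concrete version is to fit a regression tree of y on X and read off its leaves: each leaf is a group defined by thresholds on the features, and its prediction is that group's average y, so sorting leaves by predicted value surfaces the low- (or high-) value subsets. A rough sketch on synthetic data (min_samples_leaf controls the minimum group size):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy stand-in: X = customer features, y = customer value
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = 50 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(scale=5, size=1000)

# Each leaf is a group defined by splits on X; its prediction is the mean of y
tree = DecisionTreeRegressor(max_leaf_nodes=8, min_samples_leaf=50, random_state=0)
tree.fit(X, y)

leaf_id = tree.apply(X)                        # leaf index for each customer
means = {leaf: y[leaf_id == leaf].mean() for leaf in np.unique(leaf_id)}
for leaf, m in sorted(means.items(), key=lambda kv: kv[1]):
    print(f"leaf {leaf}: n={np.sum(leaf_id == leaf)}, mean y={m:.1f}")
```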

Machine learning: do unbalanced non-numeric variable classes matter?

If I have a non-numeric variable in my data set that contains many rows of one class but few of another, does this cause the same issues as when the target classes are unbalanced?
For example, suppose one of my variables is title and the aim is to identify whether a person is obese. The obese class is split 50:50, but there is only one row with the title 'Duke', and this row is in the obese class. Does this mean that an algorithm like logistic regression (after numeric encoding) would start predicting that all Dukes are obese (or give a disproportionate weight to the title 'Duke')? If so, are some algorithms better or worse at handling this case? Is there a way to prevent this issue?
Yes, any vanilla machine learning algorithm will treat categorical data the same way as numerical data in terms of the information a specific feature provides.
Consider this: before applying any machine learning algorithm, you should analyze your input features and identify how much of the variance in the target each one explains. In your case, if the label 'Duke' is always associated with obese, then given that specific dataset it is an extremely high-information feature and will be weighted as such.
I would mitigate this issue by adding a weight to that feature, thus minimizing the impact it has on the target. However, this would be a shame if it is an otherwise very informative feature for other instances.
An algorithm that can easily circumvent this problem is random forest (decision trees): you can eliminate any rule that is based on this feature being 'Duke'.
Be very careful about mapping this feature to numbers, as this will affect the importance attributed to the feature with most algorithms.
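
One practical variant of this, if you decide the 'Duke' level is just noise, is to collapse very rare levels into an "Other" bucket before encoding, so that a single row cannot get its own column and weight. A minimal sketch (the count threshold is an arbitrary choice):

```python
import pandas as pd

def collapse_rare(series: pd.Series, min_count: int = 10) -> pd.Series:
    """Replace categories that appear fewer than min_count times with 'Other'."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), "Other")

titles = pd.Series(["Mr", "Mrs", "Mr", "Duke", "Ms", "Mr", "Mrs"])
# 'Duke' and 'Ms' appear only once, so they are merged into 'Other'
print(pd.get_dummies(collapse_rare(titles, min_count=2)))
```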
