How many features does the RandomForest algorithm select? - machine-learning

I'm working with random forests and I'd like to know how the feature selection works.
I have a set of 423 features and I understand that they are randomly selected using log2(F) + 1, which gives me a subset of 12 or 13 features. What I cannot understand is how random the selection is, and whether those subsets are supposed to be different for each tree, or whether the subset is the same for all the trees and only the combinations differ.
If I have a model with 10 trees, is the feature selection supposed to vary from tree to tree? Thanks for your help.

Each tree in the forest sees a different random selection of features. Decision tree learning is essentially deterministic, so if every tree considered exactly the same features on the same data, they would all learn very similar trees, which defeats the purpose of an ensemble; you want the trees to be decorrelated.
If the algorithm is selecting subsets of 12 features from the original 423, each tree gets its own sample (drawn without replacement) of 12 features. In most implementations (Breiman's original algorithm, Weka, scikit-learn) the random subset is actually re-drawn at every split node rather than once per tree, but either way the selection is independent across trees: with 10 trees you should expect 10 different patterns of feature usage.
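As a concrete illustration (my own sketch, not part of the original answer; the synthetic dataset and the max_features="log2" setting are assumptions made for the example), you can inspect which features each tree of a scikit-learn forest actually ends up splitting on:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic stand-in data; the real problem has 423 features.
    X, y = make_classification(n_samples=1000, n_features=50, n_informative=10,
                               random_state=0)

    forest = RandomForestClassifier(n_estimators=10, max_features="log2",
                                    random_state=0).fit(X, y)

    # For each tree, list the indices of the features it actually split on.
    for i, tree in enumerate(forest.estimators_):
        used = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])
        print(f"tree {i}: splits on {len(used)} distinct features -> {used}")

You should see that the set of used features differs from tree to tree.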

Related

Which predictive models in sklearn are affected by the order of the columns in the training dataframe?

I'm wondering whether any of the estimators that scikit-learn provides are affected by the order of the columns in the dataframe they are trained on. I tried establishing a baseline using ExtraTreesRegressor and got 3 different scores:
.531687 for the regular order
.535309 for the reverse order
.554458 for the regular order
Obviously ExtraTreesRegressor is not a good example here, so I tried LinearRegression, which gave .295898 no matter what the order of the columns was.
What I want to know is whether there are ANY estimators that are affected by the order of the columns, and if there are not, can you point me toward some way, or provide some code, that I can use to check whether the order of the columns matters?
Any algorithm that involves some randomness in selecting features while building the model is expected to be affected by their order; AFAIK, the only cases present in scikit-learn are the Extra Trees and the Random Forest (in both their incarnations as classifiers and regressors), which indeed share some similarities.
The smoking gun for such behavior is the argument max_features; from the RF docs (the description is identical for the Extra Trees as well):
max_features : {“auto”, “sqrt”, “log2”}, int or float, default=”auto”
The number of features to consider when looking for the best split
I am not aware of other algorithms that involve this kind of random feature selection (linear models, decision trees, SVMs, naive Bayes, neural nets, and gradient boosted trees do not), but if you spot something similar in the documentation, you can bet that the respective algorithm is also affected by the order of the features.
Keep in mind that slight discrepancies which should not happen in theory are to be expected in models where randomness enters from many angles. For a similar case with RF in R (slightly different results when asking for importance=TRUE), check my answer in Why does the importance parameter influence performance of Random Forest in R?
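A quick way to see this effect for yourself (a sketch of mine, not from the answer above; the synthetic dataset is an assumption) is to fit the same randomized ensemble on the columns in their original and reversed order, and compare with a deterministic model:

    from sklearn.datasets import make_regression
    from sklearn.ensemble import ExtraTreesRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Synthetic regression data; reversing the columns changes nothing about the
    # information content, only their order.
    X, y = make_regression(n_samples=1000, n_features=30, noise=20.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for name, new_model in [("ExtraTrees", lambda: ExtraTreesRegressor(random_state=0)),
                            ("LinearRegression", lambda: LinearRegression())]:
        fwd = new_model().fit(X_train, y_train).score(X_test, y_test)
        rev = new_model().fit(X_train[:, ::-1], y_train).score(X_test[:, ::-1], y_test)
        print(f"{name}: original order {fwd:.6f}, reversed order {rev:.6f}")

Even with a fixed random_state, the two ExtraTrees scores will typically still differ slightly, while the linear model returns the same score for both orders.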

When is AdaBoost better than XGBoost on some data combinations?

My name is Eslam; I am a master's student in Egypt, and my thesis is in the field of educational data mining. I used the AdaBoost and XGBoost techniques in my predictive model to predict students' success rate based on the Open Learning Analytics data set (OLAD).
The idea behind the analysis is trying various techniques (including ensemble and non-ensemble techniques) on different combinations of features, and some interesting results showed up.
Results: (the accuracy table was posted as an image in the original question)
The question is: why do some techniques perform better than others on specific feature combinations, especially Random Forest, XGBoost, and AdaBoost?
An ML model can achieve different results depending on what kind of space the data lives in and what kind of function you want to approximate. You can expect an SVM to achieve the highest score on data that is naturally embedded in a Hilbert space. On the other hand, if the data does not fit that kind of space (e.g. many categorical, unordered features), you can expect boosted-tree methods to outperform the SVM.
However, if I understood correctly from the picture that 'Decision Tree Accuracy' refers to a single decision tree, I believe your tests were done on small data sets, or your boosting and RF models were incorrectly parameterized.
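If under-parameterization is the suspicion, one sanity check (my own sketch, not from the thread; scikit-learn's GradientBoostingClassifier stands in for XGBoost and the data is synthetic) is to cross-validate all candidates on the same feature combination:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                                  RandomForestClassifier)
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in for one of the feature combinations in the study.
    X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                               random_state=0)

    models = {
        "DecisionTree": DecisionTreeClassifier(random_state=0),
        "RandomForest": RandomForestClassifier(n_estimators=200, random_state=0),
        "AdaBoost": AdaBoostClassifier(n_estimators=200, random_state=0),
        "GradientBoosting": GradientBoostingClassifier(n_estimators=200, random_state=0),
    }

    # Cross-validation averages out the luck of a single train/test split.
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")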

How to derive cluster properties

I have clustered ~40000 points into 79 clusters. Each point is a vector of 18 features. I want to 'derive' the characteristics of each cluster - the prominent features/characteristics of the clusters. Are there machine-learning algorithms to derive this?
If you are confident the clusters are meaningful for your particular needs, you could view it as a classification problem.
One option would be to apply a feature selection algorithm to rank the features. You could use recursive feature elimination to identify a subset of features that is predictive of the cluster labels.
Another good option for interpreting the clusters is to build a decision tree. With decision trees you can see which features are used to best separate the classes (the clusters, in your case). You could also use an ensemble like Random Forest and ask for feature importance scores.
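As a rough sketch of that idea (mine, not the answerer's; the data and cluster labels below are random placeholders for your 40000 x 18 matrix and 79 clusters):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Random placeholders: substitute your real feature matrix and cluster labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 18))
    cluster_labels = rng.integers(0, 79, size=5000)

    # A shallow tree shows which features (and thresholds) separate the clusters.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, cluster_labels)
    print(export_text(tree))

    # A forest gives an overall importance score per feature.
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, cluster_labels)
    print("features ranked by importance:",
          np.argsort(forest.feature_importances_)[::-1])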

Find the best set of features to separate 2 known group of data

I need a point of view on whether what I am doing is right or wrong, or whether there is a better way to do it.
I have 10 000 elements. For each of them I have about 500 features.
I am looking to measure the separability between 2 sets of those elements. (I already know the two groups; I am not trying to find them.)
For now I am using an SVM. I train the SVM on 2000 of those elements, then I look at how good the score is when I test on the other 8000 elements.
Now I would like to know which features maximize this separation.
My first approach was to test each combination of features with the SVM and follow the score it gives. If the score is good, those features are relevant for separating the two sets of data.
But this takes far too much time; the number of possible feature subsets is astronomical (on the order of 2^500).
The second approach was to remove one feature and see how much the score is impacted. If the score changes a lot, that feature is relevant. This is faster, but I am not sure it is right: with 500 features, removing just one doesn't change the final score much.
Is this a correct way to do it?
Have you tried any other method? Maybe you can try a decision tree or a random forest; they would give you the best features based on entropy gain. Can I assume all the features are independent of each other? If not, please remove the dependent ones as well.
Also, for support vector machines, you can check out this paper:
http://axon.cs.byu.edu/Dan/778/papers/Feature%20Selection/guyon2.pdf
But it's based more on linear SVM.
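Both suggestions are easy to prototype in scikit-learn; here is a minimal sketch (mine, on synthetic data standing in for your 10 000 x 500 problem) of ranking features with a random forest and with recursive feature elimination around a linear SVM, which is the idea behind the paper above:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFE
    from sklearn.svm import LinearSVC

    # Synthetic stand-in for the real 500-feature, two-group data.
    X, y = make_classification(n_samples=2000, n_features=500, n_informative=20,
                               random_state=0)

    # Option 1: impurity-based importances from a random forest.
    forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
    print("top 20 by forest importance:",
          np.argsort(forest.feature_importances_)[::-1][:20])

    # Option 2: recursive feature elimination around a linear SVM (SVM-RFE).
    selector = RFE(LinearSVC(C=1.0, max_iter=5000), n_features_to_select=20, step=10)
    selector.fit(X, y)
    print("selected by SVM-RFE:", [i for i, keep in enumerate(selector.support_) if keep])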
You can do statistical analysis on the features to get indications of which terms best separate the data. I like Information Gain, but there are others.
I found this paper (Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No.1, pp.1-47, 2002) to be a good theoretical treatment of text classification, including feature reduction by a variety of methods from the simple (Term Frequency) to the complex (Information-Theoretic).
These functions try to capture the intuition that the best terms for ci are the ones distributed most differently in the sets of positive and negative examples of ci. However, interpretations of this principle vary across different functions. For instance, in the experimental sciences χ2 is used to measure how the results of an observation differ (i.e., are independent) from the results expected according to an initial hypothesis (lower values indicate lower dependence). In DR we measure how independent tk and ci are. The terms tk with the lowest value for χ2(tk, ci) are thus the most independent from ci; since we are interested in the terms which are not, we select the terms for which χ2(tk, ci) is highest.
These techniques help you choose terms that are most useful in separating the training documents into the given classes; the terms with the highest predictive value for your problem. The features with the highest Information Gain are likely to best separate your data.
I've been successful using Information Gain for feature reduction and found this paper (Entropy based feature selection for text categorization Largeron, Christine and Moulin, Christophe and Géry, Mathias - SAC - Pages 924-928 2011) to be a very good practical guide.
Here the authors present a simple formulation of entropy-based feature selection that's useful for implementation in code:
Given a term tj and a category ck, ECCD(tj, ck) can be computed from a contingency table. Let A be the number of documents in the category containing tj; B, the number of documents in the other categories containing tj; C, the number of documents of ck which do not contain tj; and D, the number of documents in the other categories which do not contain tj (with N = A + B + C + D).
Using this contingency table, Information Gain can be estimated by:
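The formula itself was shown as an image in the original answer and is not reproduced here; the following is my own sketch of the standard information-gain estimate computed from the A, B, C, D counts defined above, not the paper's exact notation:

    from math import log2

    def entropy(probabilities):
        return -sum(p * log2(p) for p in probabilities if p > 0)

    def information_gain(A, B, C, D):
        """IG of term t for category c from the contingency counts described above."""
        N = A + B + C + D
        # Entropy of the category before looking at the term.
        h_c = entropy([(A + C) / N, (B + D) / N])
        # Entropy of the category among documents containing / not containing the term.
        h_c_given_t = entropy([A / (A + B), B / (A + B)]) if A + B else 0.0
        h_c_given_not_t = entropy([C / (C + D), D / (C + D)]) if C + D else 0.0
        return h_c - ((A + B) / N) * h_c_given_t - ((C + D) / N) * h_c_given_not_t

    print(information_gain(A=80, B=5, C=20, D=95))   # informative term: high gain
    print(information_gain(A=50, B=50, C=50, D=50))  # uninformative term: 0.0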
This approach is easy to implement and provides very good Information-Theoretic feature reduction.
You needn't use a single technique either; you can combine them. Term-Frequency is simple, but can also be effective. I've combined the Information Gain approach with Term Frequency to do feature selection successfully. You should experiment with your data to see which technique or techniques work most effectively.
If you want a single feature to discriminate your data, use a decision tree, and look at the root node.
SVM by design looks at combinations of all features.
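A tiny sketch of that (mine, with placeholder data): fit a depth-1 tree (a decision stump) and read off the feature chosen at the root:

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    # Placeholder data standing in for the 10 000 x 500 problem.
    X, y = make_classification(n_samples=2000, n_features=500, n_informative=10,
                               random_state=0)

    stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
    print("best single feature:", stump.tree_.feature[0],
          "split threshold:", stump.tree_.threshold[0])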
Have you thought about Linear Discriminant Analysis (LDA)?
LDA aims at discovering a linear combination of features that maximizes separability. The algorithm works by projecting your data into a space where the within-class variance is minimized and the between-class variance is maximized.
You can use it to reduce the number of dimensions required to classify, and also use it as a linear classifier.
However, with this technique you would lose the original features and their meaning, and you may want to avoid that.
If you want more details I found this article to be a good introduction.
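A minimal sketch of the LDA approach (mine; synthetic two-class data stands in for your 10 000 x 500 matrix):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # Synthetic two-class data standing in for the real problem.
    X, y = make_classification(n_samples=10000, n_features=500, n_informative=15,
                               random_state=0)

    lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
    X_1d = lda.transform(X)                      # each element projected onto one axis
    print("accuracy as a linear classifier:", lda.score(X, y))

    # Features with the largest absolute weight contribute most to the separation.
    print("most influential features:", np.argsort(np.abs(lda.coef_[0]))[::-1][:10])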

Regression trees with standard deviation reduction

I have a data set of 1k records, and my job is to build a decision algorithm based on those records.
Here is what I can share:
The target is a continuous value.
Some of the predictors (or attributes) are continuous values, some of them are discrete, and some are arrays of discrete values (there can be more than one option).
My initial thought was to separate the arrays of discrete values and make them individual features (predictors). For the continuous values among the predictors I was thinking of just randomly picking a few decision boundaries and seeing which one reduces the entropy the most. Then I would build a decision tree (or a random forest) that uses standard deviation reduction when creating the tree.
My question is: Am I on the right path? Is there a better way to do that?
I know this probably comes a bit late, but what you are searching for are model trees. Model trees are decision trees with continuous rather than categorical values in the leaves. In general these values are predicted by linear regression models. One of the more prominent model trees, and one that more or less suits your needs, is the M5 model tree introduced by Quinlan. Wang and Witten re-implemented M5 and extended its functionality so that it can handle both continuous and categorical attributes. Their version is called M5'; you can find an implementation e.g. in Weka. The only thing left would be to handle the arrays. However, your description is a bit generic in that respect. From what I gather, your choices are either flattening them or, as you suggested, separating them.
Note that, since Wang and Witten's work, more sophisticated model trees have been introduced. However, M5' is robust and does not need any parameterization in its original formulation, which makes it easy to use.
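If you go the flattening route, a rough scikit-learn sketch (my assumption about the data layout; this is not M5', which scikit-learn does not ship) could look like the following. Note that a regression tree's squared-error criterion picks the split with the largest variance reduction, which is equivalent to standard deviation reduction:

    import numpy as np
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.tree import DecisionTreeRegressor

    # Toy records standing in for the real data: one continuous attribute, one
    # discrete attribute, one array-valued attribute, and a continuous target.
    records = [
        {"size": 3.2, "color": "red",  "tags": ["a", "b"],      "target": 10.5},
        {"size": 1.1, "color": "blue", "tags": ["b"],           "target":  4.2},
        {"size": 2.7, "color": "red",  "tags": ["a", "c", "d"], "target":  8.9},
    ]

    # Multi-hot encode the array-valued attribute (several options per record).
    tags = MultiLabelBinarizer().fit_transform([r["tags"] for r in records])

    # One-hot encode the discrete attribute.
    colors = sorted({r["color"] for r in records})
    color_onehot = np.array([[r["color"] == c for c in colors] for r in records], float)

    X = np.hstack([[[r["size"]] for r in records], color_onehot, tags])
    y = np.array([r["target"] for r in records])

    # The squared-error criterion chooses splits by variance (SD) reduction.
    tree = DecisionTreeRegressor(criterion="squared_error", max_depth=3).fit(X, y)
    print(tree.predict(X))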
