How to make a Spark DecisionTree model use feature subsetting? - random-forest

I am trying to build a random forest model using the pyspark ml library. However, my dataset calls for a special bootstrapping strategy, so my plan is to do the bootstrapping separately, train a bunch of decision tree models, and bag them on my own as a random forest. Here comes the problem: it seems that feature subsetting in the Spark decision tree is reserved for random forests only (this is my understanding of the source code). Is there any way to enable such behavior directly, instead of a work-around like training multiple random forests with numTrees=1?
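A minimal sketch of that work-around, assuming a prepared DataFrame with placeholder features/label columns: each of your own bootstrap samples is fed to a one-tree RandomForestClassifier with feature subsetting enabled, and the resulting single-tree models are bagged manually. Depending on your Spark version, RandomForestClassifier may also expose a bootstrap parameter you can switch off.

```python
# Sketch of the numTrees=1 work-around: one single-tree "forest" per custom
# bootstrap sample, with per-node feature subsetting, bagged manually afterwards.
from pyspark.ml.classification import RandomForestClassifier

def train_single_tree(bootstrap_df, seed):
    rf = RandomForestClassifier(
        featuresCol="features",        # placeholder column names
        labelCol="label",
        numTrees=1,                    # a "forest" of one tree
        featureSubsetStrategy="sqrt",  # per-node feature subsetting
        subsamplingRate=1.0,           # let your own bootstrap do the sampling
        seed=seed,
    )
    return rf.fit(bootstrap_df)

# models = [train_single_tree(df_i, seed=i) for i, df_i in enumerate(bootstrap_samples)]
# Bag by averaging each model.transform(test_df) "probability" column and
# taking the argmax yourself.
```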

Related

Application and Deployment of K-Fold Cross-Validation

K-fold cross-validation is a technique for splitting the data into K folds for training and testing. The goal is to estimate the generalizability of a machine learning model. The model is trained K times, once on each train fold and then tested on the corresponding test fold.
Suppose I want to compare a Decision Tree and a Logistic Regression model on some arbitrary dataset with 10 Folds. Suppose after training each model on each of the 10 folds and obtaining the corresponding test accuracies, Logistic Regression has a higher mean accuracy across the test folds, indicating that it is the better model for the dataset.
Now, for application and deployment. Do I retrain the Logistic Regression model on all the data, or do I create an ensemble from the 10 Logistic Regression models that were trained on the K-Folds?
The main goal of CV is to validate that we did not get the numbers by chance. So, I believe you can just use a single model for deployment.
If you are already satisfied with the hyper-parameters and model performance, one option is to train on all the data you have and deploy that model.
The other, obvious option is to deploy one of the CV models.
About the ensemble option, I believe it should not give significantly better results than a model trained on all the data: each model trains for the same amount of time, with similar parameters and the same architecture, and only the training data differs slightly, so they shouldn't show different performance. In my experience, ensembling helps when the outputs of the models differ because of architecture or input data (like different image sizes).
The models trained during k-fold CV should never be reused. CV is only used for reliably estimating the performance of a model.
As a consequence, the standard approach is to re-train the final model on the full training data after CV.
Note that evaluating different models is akin to hyper-parameter tuning, so in theory the performance of the selected best model should be reevaluated on a fresh test set. But with only two models tested I don't think this is important in your case.
You can find more details about k-fold cross-validation here and there.
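To make the "use CV only for estimation, then retrain on everything" workflow concrete, here is a minimal sklearn sketch; the dataset and the 10-fold setup are placeholders, not something from the question.

```python
# Compare two models with 10-fold CV, then refit the winner on all data.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset

models = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# CV is used only to estimate generalisation performance and pick a model.
scores = {name: cross_val_score(m, X, y, cv=10).mean() for name, m in models.items()}
best_name = max(scores, key=scores.get)

# The deployed model is then retrained on all available data.
final_model = models[best_name].fit(X, y)
print(scores, "-> deploying", best_name)
```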

How does RandomForestClassifier work for classification?

I have learned that Sklearn treats multi-class classification problems as a collection of binary problems. Quoting the Sklearn user guide:
In extending a binary metric to multiclass or multilabel problems, the data is treated as a collection of binary problems, one for each class.
So, binary classification models like LogisticRegression or support vector machines can support multi-class cases by using either One-vs-One or One-vs-Rest strategies. I wanted to know whether that is the case for RandomForestClassifier too. How about other classifiers in Sklearn - are they all used as binary classifiers under the hood when dealing with a multi-class problem?
According to the documentation for Decision Trees, multi-output problems only require a small change to the leaves of each tree in a random forest.
Suppose you have set criterion='gini'. In essence, each node is built by picking a subset of max_features features, calculating the average reduction in the gini impurity for all N classes and choosing the variable-threshold combination that reduces it most.
This means that random forests do not create one model for each class. Instead, it's only one model that simultaneously reduces the criterion metric for all classes in each node of every tree and predicts the most common class at each leaf.
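A small sketch on a placeholder 3-class dataset showing that a single RandomForestClassifier handles the multi-class case directly, with no One-vs-Rest or One-vs-One wrapper involved:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)  # 3 classes
clf = RandomForestClassifier(max_features="sqrt", random_state=0).fit(X, y)

print(clf.n_classes_)                  # 3: one model covers all classes
print(clf.predict_proba(X[:2]).shape)  # (2, 3): per-class probabilities from the same forest
```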

How to use over-sampled data in cross validation?

I have an imbalanced dataset, and I am using SMOTE (Synthetic Minority Oversampling Technique) to perform oversampling. When performing the binary classification, I use 10-fold cross-validation on this oversampled dataset.
However, I recently came across this paper, Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models, which mentions that it is incorrect to use the oversampled dataset during cross-validation as it leads to overoptimistic performance estimates.
What is the correct approach/procedure for using over-sampled data in cross-validation?
To avoid overoptimistic performance estimates from cross-validation in Weka when using a supervised filter, use FilteredClassifier (in the meta category) and configure it with the filter (e.g. SMOTE) and classifier (e.g. Naive Bayes) that you want to use.
For each cross-validation fold Weka will use only that fold's training data to parameterise the filter.
When you do this with SMOTE you won't see a difference in the number of instances in the Weka results window: Weka builds the model on the SMOTE-applied dataset but shows the output of evaluating it on the unfiltered training set, which is what you want for understanding the real performance. Try changing the SMOTE filter settings (e.g. the -P setting, which controls how many additional minority-class instances are generated as a percentage of the number in the dataset) and you should see the performance change, showing that the filter is actually doing something.
The use of FilteredClassifier is illustrated in this video and these slides from the More Data Mining with Weka online course. In this example the filtering operation is supervised discretisation, not SMOTE, but the same principle applies to any supervised filter.
If you have further questions about the SMOTE technique I suggest asking them on Cross Validated and/or the Weka mailing list.
The correct approach would be to first split the data into multiple folds, then apply sampling only to the training data and leave the validation data as is; in other words, oversample the K-1 training folds in each iteration and evaluate on the untouched validation fold.
If you want to achieve this in Python, there is a library for that: https://pypi.org/project/k-fold-imblearn/
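In Python the same fold-wise behaviour can also be obtained with imbalanced-learn's Pipeline (a different library from the one linked above), which applies SMOTE only to the training portion of each fold. A minimal sketch with placeholder toy data:

```python
# SMOTE is applied inside each CV training fold only, never to the validation fold.
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# Imbalanced toy data (roughly 90/10 class split).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# cross_val_score fits the whole pipeline per fold, so the oversampling never
# leaks into the held-out fold used for evaluation.
print(cross_val_score(pipe, X, y, cv=10, scoring="f1").mean())
```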

In stacking for machine learning which order should you train the models in?

I am currently learning to do stacking in a machine learning problem. I am going to get the outputs of the first model and use these outputs as features for the second model.
My question is: does the order matter? I am using a lasso regression model and a boosted tree. In my problem the regression model outperforms the boosted tree. I am therefore thinking that I should use the regression model second and the boosted tree first.
What are the factors I need to think about when making this decision?
Why don't you try feature engineering to create more features?
Don't try to use predictions from one model as features for another model.
You can try using K-means to cluster similar training samples.
For stacking, just use different models and then average the results (assuming that you have a continuous y variable).
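A minimal sketch of that averaging approach, using a lasso and a boosted tree as in the question rather than feeding one model's predictions into the other; the data is a placeholder:

```python
# Average the predictions of two independently trained regressors.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lasso = Lasso(alpha=0.1).fit(X_train, y_train)
gbt = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Order does not matter here: both models see the same features, and the
# ensemble is simply the mean of their predictions.
blend = np.mean([lasso.predict(X_test), gbt.predict(X_test)], axis=0)
```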

How to correctly combine my classifiers?

I have to solve a 2-class classification problem.
I have 2 classifiers that output probabilities. Both of them are neural networks of different architecture.
Those 2 classifiers are trained and saved into 2 files.
Now I want to build a meta-classifier that will take these probabilities as input and learn weights for those 2 classifiers.
So it will automatically decide how much should I "trust" each of my classifiers.
This model is described here:
http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/#stackingclassifier
I plan to use the mlxtend library, but it seems that StackingClassifier refits the models.
I do not want to refit because it takes a huge amount of time.
On the other hand, I understand that refitting is necessary to "coordinate" the work of each classifier and "tune" the whole system.
What should I do in such situation?
I won't talk about mlxtend because I haven't worked with it, but I'll tell you the general idea.
You don't have to refit these models on the full training set, but you do have to refit them on parts of it so you can create out-of-fold predictions.
Specifically, split your training data into a few pieces (usually 3 to 10). Keep one piece (i.e. one fold) as validation data and train both models on the other folds. Then, predict the probabilities for the validation data using both models. Repeat the procedure treating each fold as the validation set. In the end, you should have out-of-fold probabilities for every data point in the training set.
Then, you can train a meta-classifier using these probabilities and the ground truth labels. You can use the trained meta-classifier on your new data.
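A rough sketch of this out-of-fold procedure, assuming the two networks can be wrapped as scikit-learn-style estimators with a predict_proba method; the base-model names are placeholders:

```python
# Build out-of-fold probabilities for each base model, then fit a meta-classifier.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression

def out_of_fold_proba(model, X, y, n_folds=5):
    # Each row is predicted by a copy of the model that never saw it during training.
    return cross_val_predict(model, X, y, cv=n_folds, method="predict_proba")[:, 1]

# base_model_1 and base_model_2 are placeholders for your two networks.
# oof = np.column_stack([
#     out_of_fold_proba(base_model_1, X_train, y_train),
#     out_of_fold_proba(base_model_2, X_train, y_train),
# ])
# meta = LogisticRegression().fit(oof, y_train)
# On new data, feed the two full models' probabilities to the meta-classifier:
# meta.predict(np.column_stack([m.predict_proba(X_new)[:, 1]
#                               for m in (base_model_1, base_model_2)]))
```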
