After analyzing feature importance with SHAP, what is next? - random-forest

I am working on boosting algorithms. In the first part, I build a model, evaluate it (getting a satisfactory R²) and look at feature importance with SHAP. I found that 2 of the 5 features contribute little.
I then reduce my dataset to 3 features (removing the 2 unnecessary ones) and repeat the process.
My question is: should I re-tune my model's hyperparameters, or should I change nothing on the model and evaluate on the 3 features with the same hyperparameters?
I am asking because, in my opinion, SHAP tells me these 2 features are unnecessary for the model I built with some particular hyperparameters, say hyperparameter set A. After removing the 2 features, should I keep the same hyperparameter set A, or use a new set B found by tuning again?
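One way to make the two options in the question concrete is to score both with the same cross-validation splits. The sketch below is only an illustration under stated assumptions: the data is synthetic (`make_regression`), scikit-learn's `GradientBoostingRegressor` stands in for whichever boosting library is actually used, and the small `param_grid` is hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the 5-feature dataset (assumption).
# With shuffle=False the 3 informative features are the first 3 columns.
X, y = make_regression(n_samples=300, n_features=5, n_informative=3,
                       noise=10.0, shuffle=False, random_state=0)

# Hypothetical search space, kept small so the sketch runs quickly.
param_grid = {"max_depth": [2, 3],
              "learning_rate": [0.05, 0.1],
              "n_estimators": [100]}


def tune(features, target):
    """Grid-search the boosting model with 5-fold CV, scored by R^2."""
    search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                          param_grid, cv=5, scoring="r2")
    search.fit(features, target)
    return search


# Hyperparameter set "A": tuned on all 5 features.
params_a = tune(X, y).best_params_

# Drop the 2 features judged unimportant (here: the last 2 columns).
X_reduced = X[:, :3]

# Option 1: reuse set A unchanged on the 3 remaining features.
model_a = GradientBoostingRegressor(random_state=0, **params_a)
score_reuse = cross_val_score(model_a, X_reduced, y, cv=5, scoring="r2").mean()

# Option 2: tune again on the 3 remaining features to get set "B".
search_reduced = tune(X_reduced, y)
params_b = search_reduced.best_params_
score_retune = search_reduced.best_score_

print("A:", params_a, "mean CV R^2 on 3 features:", round(score_reuse, 4))
print("B:", params_b, "mean CV R^2 on 3 features:", round(score_retune, 4))
```

Because set A is itself one of the candidates in the grid, the re-tuned cross-validated score here cannot be lower than the reused one; whether the gap is large enough to matter depends on the data.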

Related

Application and Deployment of K-Fold Cross-Validation

K-Fold Cross Validation is a technique applied for splitting up the data into K number of Folds for testing and training. The goal is to estimate the generalizability of a machine learning model. The model is trained K times, once on each train fold and then tested on the corresponding test fold.
Suppose I want to compare a Decision Tree and a Logistic Regression model on some arbitrary dataset with 10 Folds. Suppose after training each model on each of the 10 folds and obtaining the corresponding test accuracies, Logistic Regression has a higher mean accuracy across the test folds, indicating that it is the better model for the dataset.
Now, for application and deployment. Do I retrain the Logistic Regression model on all the data, or do I create an ensemble from the 10 Logistic Regression models that were trained on the K-Folds?
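The comparison described in this question can be sketched with scikit-learn as follows. The dataset is synthetic (`make_classification`) and the use of a shuffled `StratifiedKFold` is an assumption; the question only specifies 10 folds and the two model types.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Arbitrary synthetic dataset standing in for the real one (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The same 10 folds are used for both models so the scores are comparable.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Mean test-fold accuracy per model.
mean_accuracy = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in models.items()
}

best_name = max(mean_accuracy, key=mean_accuracy.get)
print(mean_accuracy, "->", best_name)
```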
The main goal of CV is to validate that we did not get the numbers by chance. So, I believe you can just use a single model for deployment.
If you are already satisfied with the hyper-parameters and model performance, one option is to train on all the data that you have and deploy that model.
The other, obvious option is to deploy one of the CV models.
About the ensemble option, I believe it should not give significantly better results than a model trained on all the data: each model trains for the same amount of time with similar parameters and has a similar architecture, and only the training data differs slightly. So they shouldn't show different performance. In my experience, ensembling helps when the models' outputs differ because of architecture or input data (like different image sizes).
The models trained during k-fold CV should never be reused. CV is only used for reliably estimating the performance of a model.
As a consequence, the standard approach is to re-train the final model on the full training data after CV.
Note that evaluating different models is akin to hyper-parameter tuning, so in theory the performance of the selected best model should be reevaluated on a fresh test set. But with only two models tested I don't think this is important in your case.
You can find more details about k-fold cross-validation here and there.
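The approach in this last answer (use CV only to estimate performance, re-train on the full training data, then check once on a fresh test set) might look like the sketch below. The synthetic data, the 80/20 split and the variable names are assumptions for illustration.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data and an 80/20 split (assumptions); the test part stays untouched.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

template = LogisticRegression(max_iter=1000)

# Step 1: 10-fold CV on the training data gives the performance estimate.
# cross_val_score fits internal clones and discards them afterwards.
cv_scores = cross_val_score(template, X_train, y_train, cv=10)

# Step 2: re-train one final model on the full training data.
final_model = clone(template).fit(X_train, y_train)

# Step 3: a single evaluation on data never used for training or selection.
test_accuracy = final_model.score(X_test, y_test)

print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))
print("Fresh test accuracy: %.3f" % test_accuracy)
```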

Evaluate CNN model for multiclass image classification

I want to ask what metrics can be used to evaluate my CNN model for multi-class classification. I have 3 classes for now, and I'm just using accuracy and a confusion matrix, and I also plot the model's loss. Are there any other metrics I can use to evaluate my model's performance?
Evaluating the performance of a model is one of the most crucial phases of any machine learning project cycle and must be done effectively. You mentioned that you are using accuracy and a confusion matrix for the evaluation. I would like to add some points for developing a better evaluation strategy:
Suppose you are developing a classifier that classifies an email as SPAM or NON-SPAM (HAM). One possible evaluation criterion is the false positive rate, because it can be really annoying if a non-spam email ends up in the spam category (which means you may miss a valuable email).
So I recommend choosing metrics based on the problem you are targeting. There are many metrics, such as F1 score, recall, and precision, that you can choose from depending on the problem you have.
You can visit: https://medium.com/apprentice-journal/evaluating-multi-class-classifiers-12b2946e755b for better understanding.
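As a small illustration of the metrics named above for a 3-class problem, the sketch below computes them with scikit-learn. The label arrays are made up for the example; in practice `y_pred` would be the argmax of the CNN's softmax output.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Hypothetical true and predicted labels for 3 classes (not real results).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 0, 1, 1, 2, 2, 2, 0])
labels = [0, 1, 2]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Per-class precision, recall and F1, plus the number of true samples per class.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0)

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1.mean()  # unweighted mean, so every class counts equally

print(cm)
for k in labels:
    print(f"class {k}: precision={precision[k]:.2f} "
          f"recall={recall[k]:.2f} f1={f1[k]:.2f} n={support[k]}")
print(f"accuracy={accuracy:.2f} macro-F1={macro_f1:.2f}")
```

Per-class values show which class is being confused with which, something a single accuracy number hides.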

Why does my Random Forest Classifier perform better on test and validation data than on training data?

I'm currently training a random forest on some data I have and I'm finding that the model performs better on the validation set, and even better on the test set, than on the train set. Here are some details of what I'm doing - please let me know if I've missed any important information and I will add it in.
My question
Am I doing anything obviously wrong and do you have any advice for how I should improve my approach because I just can't believe that I'm doing it right when my model predicts significantly better on unseen data than training data!
Data
My underlying data consists of tables of features describing customer behaviour and a binary target (so this is a binary classification problem). Technically I have one such table per month and I tend to use several months of data to train and then a different month to predict (e.g. Train on Apr, May and Predict on Jun)
Generally this means I end up with a training dataset of about 100k rows and 20 features (I've previously looked into feature selection and found a set of 7 features which seem to perform best, so have been using these lately). My prediction set generally has around 50k rows.
My dataset is heavily unbalanced (approximately 2% incidence of target feature), so I'm using oversampling techniques - more on that below.
Method
I've searched around online quite a lot and this has led me to the following approach:
Take scaleable (continuous) features in the training data and standardise them (currently using sklearn StandardScaler)
Take categorical features and encode them into separate binary columns (one-hot) using Pandas get_dummies function
Remove 10% of the training data to form a validation set (I'm currently using a random seed in this process for comparability whilst I vary different things such as hyperparameters in the model)
Take the remaining 90% of training data and perform a grid search across a few parameters of the RandomForestClassifier() (currently min_samples_split, max_depth, n_estimators and max_features)
Within each hyperparameter combination from the grid I perform kfold validation with 5 folds and using a random state
Within each fold I oversample my minority class for training data only (sometimes using imbalanced-learn's RandomOverSampler() and sometimes using SMOTE() from the same package), train the model on the training data and then apply the model to the kth fold and record performance metrics (precision, recall, F1 and AUC)
Once I've been through 5 folds on each hyperparameter combination I find the best F1 score (and best precision if two combinations are tied on F1 score) and retrain a random forest on the entire 90% training data using those hyperparameters. During this step I use the same oversampling technique as I did in the kfold process
I then use this model to make predictions on the 10% of training data that I put aside earlier as a validation set, evaluating the same metrics as above
Finally I have a test set, which is actually based on data from another month, which I apply the already trained model to and evaluate the same metrics
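The inner part of this procedure (oversampling the minority class in the training folds only, then scoring on the untouched fold) can be sketched as below. This is not the asker's code: the data is synthetic with roughly 5% positives, `sklearn.utils.resample` stands in for imbalanced-learn's `RandomOverSampler`, the folds are stratified, and a single fixed hyperparameter combination replaces the grid. All of these are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic imbalanced data standing in for the customer tables (assumption).
X, y = make_classification(n_samples=2000, n_features=7, n_informative=4,
                           weights=[0.95, 0.05], random_state=0)


def oversample_minority(X_part, y_part, seed):
    """Resample minority rows with replacement until both classes are equal in size."""
    minority = y_part == 1
    n_majority = int((~minority).sum())
    X_up, y_up = resample(X_part[minority], y_part[minority],
                          replace=True, n_samples=n_majority, random_state=seed)
    X_bal = np.vstack([X_part[~minority], X_up])
    y_bal = np.concatenate([y_part[~minority], y_up])
    return X_bal, y_bal


cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_f1 = []
for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
    # Oversample the training folds only; the held-out fold keeps its
    # original class balance.
    X_bal, y_bal = oversample_minority(X[train_idx], y[train_idx], seed=fold)
    model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
    model.fit(X_bal, y_bal)
    predictions = model.predict(X[val_idx])
    fold_f1.append(f1_score(y[val_idx], predictions, zero_division=0))

mean_f1 = float(np.mean(fold_f1))
print("F1 per fold:", np.round(fold_f1, 3), "mean:", round(mean_f1, 3))
```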
Outcome
At the moment I'm finding that my training set achieves an F1 score of around 30%, the validation set is consistently slightly higher than this at around 36% (mostly driven by a much better precision than the training data e.g. 60% vs. 30%) and then the testing set is getting an F1 score of between 45% and 50% which is again driven by a better precision (around 65%)
Notes
Please do ask about any details I haven't mentioned; I've had my head stuck in this for weeks and so have doubtless omitted some details
I've had a brief look (not a systematic analysis) of the stability of metrics between folds in the kfold validation and it seems that they aren't varying very much, so I'm fairly happy with the stability of the model here
I'm actually performing the grid search manually rather than using a Python pipeline because try as I might I couldn't get imbalanced-learn's Pipeline function to work with the oversampling functions and so I run a loop with combinations of hyperparameters, but I'm confident that this isn't impacting the results I've talked about above in an adverse way
When I apply the final model to the prediction data (and get an F1 score around 45%) I also apply it back to the training data itself out of interest and get F1 scores around 90% - 100%. I suppose this is to be expected as the model is trained and predicts on almost exactly the same data (except the 10% holdout validation set)

Why is cross validation used in decision tree classification?

I am trying to learn about decision trees (and other models) and I came across cross validation. I first thought that cross validation was used to determine the optimal parameters for the model, for example the optimal max_tree_depth in decision tree classification or the optimal number_of_neighbors in k_nearest_neighbor classification. But looking at some examples, I think this might be wrong.
Is this wrong?
Cross-validation is used to estimate the accuracy of your model more reliably. For example, in n-fold cross validation you divide your data into n partitions, use n-1 parts as the train set and 1 part as the test set, and repeat this for all partitions (each partition gets to be the test set once). Then you average the results to get a better estimate of your model's accuracy.
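The procedure in this answer can be written out by hand in a few lines. The sketch below assumes scikit-learn, the bundled iris dataset, n = 5 and a fixed `max_depth=3`; none of these come from the question.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

n_folds = 5  # assumption: any n >= 2 works the same way
kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)

fold_accuracy = []
test_counts = np.zeros(len(X), dtype=int)  # how often each row was tested

for train_idx, test_idx in kf.split(X):
    # Train on n-1 partitions, test on the remaining one.
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_accuracy.append(model.score(X[test_idx], y[test_idx]))
    test_counts[test_idx] += 1

# Averaging over the folds gives the cross-validated accuracy estimate.
mean_accuracy = float(np.mean(fold_accuracy))
print("per-fold accuracy:", np.round(fold_accuracy, 3))
print("mean accuracy:", round(mean_accuracy, 3))
```

Repeating this loop for several values of `max_depth` and keeping the value with the best mean accuracy is how the same procedure is used for parameter selection.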

Error Correction methodologies Time Series Forecast

Do you have any readings recommendation on correcting forecast bias? For example, I use an ARIMA model to predict a time series. Is there a way based on the backtesting results to correct the bias of the forecast?
How to handle the ever-present Bias / Overfit struggle? Using a tactical methodology:
One principal approach is to systematically tune a Predictor (be it ARIMA or some other) via a two-step approach.
You have to split the available DataSET into two parts, so as to emulate a near "Future": "hide" the second part, say about 20-30% of the observations, from the process of [1] Training, and use it only in step [2], called CrossValidation of predictions.
This methodology allows one to search both the StateSPACE of a Predictor engine's configurations and data-related bias/overfit. Some use only the former part of the minimiser search (lowest error / highest utility function), some only the latter (like Leo Breiman's RandomForest modification of the ensemble-based method), and some use both.
1. Train a pre-configured Predictor on aTrainingSubPartOfAvailableDataSET.
2. Once such a configuration of a Predictor has been trained, cross-validate this configuration's ability to predict against aCrossValidationSubPartOfAvailableDataSET not seen in the process of training (Step 1.) to observe the Bias / Overfit artefacts and proceed towards the lowest Cross-Validation error / best generalisation area of plausible configuration settings.
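The two steps above can be sketched with NumPy alone. To keep the example self-contained, a plain least-squares autoregression stands in for ARIMA, the series is synthetic, and the 25% hold-out and the candidate orders are arbitrary choices; all of these are assumptions rather than part of the answer.

```python
import numpy as np

# Synthetic AR(1) series standing in for the real time series (assumption).
rng = np.random.default_rng(0)
n = 300
series = np.zeros(n)
for t in range(1, n):
    series[t] = 0.7 * series[t - 1] + rng.normal(scale=1.0)

# "Hide" the last 25% of the observations from training.
split = int(n * 0.75)
train, holdout = series[:split], series[split:]


def fit_ar(values, order):
    """Step 1: least-squares fit of an AR(order) model with an intercept."""
    lags = np.array([values[t - order:t][::-1] for t in range(order, len(values))])
    design = np.column_stack([np.ones(len(lags)), lags])
    coef, *_ = np.linalg.lstsq(design, values[order:], rcond=None)
    return coef


def one_step_errors(coef, history, future):
    """Step 2: one-step-ahead errors (forecast - actual) on the hidden part."""
    order = len(coef) - 1
    full = np.concatenate([history, future])
    errors = []
    for t in range(len(history), len(full)):
        forecast = coef[0] + coef[1:] @ full[t - order:t][::-1]
        errors.append(forecast - full[t])
    return np.array(errors)


# Search over configurations; coefficients are never re-fitted on the hold-out.
results = {}
for order in (1, 2, 3, 5):
    err = one_step_errors(fit_ar(train, order), train, holdout)
    results[order] = {"rmse": float(np.sqrt(np.mean(err ** 2))),
                      "bias": float(np.mean(err))}

best_order = min(results, key=lambda k: results[k]["rmse"])
print(results)
print("lowest hold-out error at order", best_order)
```

The `bias` entry is the mean signed error on the hidden part, which is the quantity the original question asks about: a value consistently far from zero across backtests indicates a systematic over- or under-forecast.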

Resources