Stack ensemble model versus stack generalization ensemble model

Is there any difference between a stack ensemble model and a stack generalization ensemble model? I have read different resources, and it seems they should be the same. However, when I summarize their algorithms, they do not seem equal.
Here is what I summarized from the stack ensemble procedure.
The test data will not contribute to model training at all.
In contrast, when I summarize the stack generalization algorithm, there is no test data. The cross-validation (CV) data (here, folds 1 to 10) and the main data contribute to the level 0 models separately, and we build the meta-model (level 1 model) on the predictions obtained from the CV data.
Using the built level 1 model, we then compute predictions for the main data by feeding the level 0 predictions into the level 1 model as new data. The stack generalization algorithm is based on the paper https://doi.org/10.1098/rsif.2017.0520.
RF = random forest, GB = gradient boosting
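For concreteness, here is a minimal sketch of stacked generalization built on out-of-fold predictions, with RF and GB as level 0 models and logistic regression as the level 1 meta-model (binary classification assumed; the names fit_stack and predict_stack are illustrative, not taken from the paper):
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def fit_stack(X_train, y_train, n_folds=10):
    level0 = [RandomForestClassifier(), GradientBoostingClassifier()]
    # Out-of-fold predictions: each training row is predicted by a model that never saw it
    oof = np.column_stack([
        cross_val_predict(m, X_train, y_train, cv=n_folds, method="predict_proba")[:, 1]
        for m in level0
    ])
    meta = LogisticRegression().fit(oof, y_train)  # level 1 model trained on the CV predictions
    for m in level0:                               # refit the level 0 models on all training data
        m.fit(X_train, y_train)
    return level0, meta

def predict_stack(level0, meta, X_new):
    # Feed the level 0 predictions to the level 1 model as new data
    features = np.column_stack([m.predict_proba(X_new)[:, 1] for m in level0])
    return meta.predict(features)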

Related

Application and Deployment of K-Fold Cross-Validation

K-fold cross-validation is a technique for splitting the data into K folds for training and testing. The goal is to estimate the generalizability of a machine learning model: the model is trained K times, once on each training fold, and then tested on the corresponding test fold.
Suppose I want to compare a Decision Tree and a Logistic Regression model on some arbitrary dataset with 10 folds. Suppose that, after training each model on each of the 10 folds and obtaining the corresponding test accuracies, Logistic Regression has a higher mean accuracy across the test folds, indicating that it is the better model for the dataset.
Now, for application and deployment. Do I retrain the Logistic Regression model on all the data, or do I create an ensemble from the 10 Logistic Regression models that were trained on the K-Folds?
The main goal of CV is to validate that we did not get the numbers by chance. So, I believe you can just use a single model for deployment.
If you are already satisfied with the hyperparameters and model performance, one option is to train on all the data you have and deploy that model.
The other, obvious option is to deploy one of the CV models.
About the ensemble option: I believe it should not give significantly better results than a model trained on all the data. Each model is trained for the same amount of time with similar parameters and the same architecture, and only the training data differs slightly, so they shouldn't show very different performance. In my experience, ensembling helps when the outputs of the models differ because of architecture or input data (such as different image sizes).
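For reference, a minimal sketch of that ensemble-of-fold-models option (X and y stand for the full training data; this is only an illustration of the option, not a recommendation):
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Train one logistic regression per fold, keeping all 10 fitted models
fold_models = []
for train_idx, _ in StratifiedKFold(n_splits=10).split(X, y):
    fold_models.append(clone(LogisticRegression(max_iter=1000)).fit(X[train_idx], y[train_idx]))

def predict_ensemble(models, X_new):
    # Average the predicted probabilities of the fold models, then take the majority class
    proba = np.mean([m.predict_proba(X_new) for m in models], axis=0)
    return models[0].classes_[proba.argmax(axis=1)]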
The models trained during k-fold CV should never be reused. CV is only used for reliably estimating the performance of a model.
As a consequence, the standard approach is to re-train the final model on the full training data after CV.
Note that evaluating different models is akin to hyper-parameter tuning, so in theory the performance of the selected best model should be reevaluated on a fresh test set. But with only two models tested I don't think this is important in your case.
You can find more details about k-fold cross-validation here and there.
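As a minimal sketch of that standard approach (X and y stand for the full training data; the two model choices mirror the question):
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# 1) Use 10-fold CV only to estimate and compare performance
lr_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
dt_scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print(f"LR: {lr_scores.mean():.3f}  DT: {dt_scores.mean():.3f}")

# 2) Discard the fold models and refit the winning model on all the data for deployment
final_model = LogisticRegression(max_iter=1000).fit(X, y)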

In KNN, why is the data the model?

In the machine learning lecture slides, it is said that there is no specific model for KNN; the data is the model of KNN.
The previous assignment was NCC (Nearest Centroid Classifier), which had two methods: fit_ncc and predict_ncc. fit_ncc creates a model, and predict_ncc uses this model to make a prediction.
However, for KNN it is written that the data is the model. This statement is not clear to me, and my question is: why is the data the model for KNN?
I think that means that in KNN the fit method just remembers the data, and the predict method calculates a distance matrix for each test sample using some distance metric (Manhattan or Euclidean distance) and predicts the most frequent label among the k nearest samples from the training dataset.
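To make that concrete, here is a minimal sketch of a k-NN classifier in which fit() only stores the data and all the work happens in predict() (the class name SimpleKNN is illustrative):
import numpy as np
from collections import Counter

class SimpleKNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" is just remembering the data: the data is the model
        self.X, self.y = np.asarray(X), np.asarray(y)
        return self

    def predict(self, X_new):
        preds = []
        for x in np.asarray(X_new):
            dists = np.linalg.norm(self.X - x, axis=1)           # Euclidean distances
            nearest = self.y[np.argsort(dists)[: self.k]]        # labels of the k closest training samples
            preds.append(Counter(nearest).most_common(1)[0][0])  # majority vote
        return np.array(preds)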
Here are some links about distance metrics:
Importance of Distance Metrics in Machine Learning Modelling
Types of metrics

Why does my Random Forest Classifier perform better on test and validation data than on training data?

I'm currently training a random forest on some data I have and I'm finding that the model performs better on the validation set, and even better on the test set, than on the train set. Here are some details of what I'm doing - please let me know if I've missed any important information and I will add it in.
My question
Am I doing anything obviously wrong, and do you have any advice for how I should improve my approach? I just can't believe I'm doing it right when my model predicts significantly better on unseen data than on training data!
Data
My underlying data consists of tables of features describing customer behaviour and a binary target (so this is a binary classification problem). Technically I have one such table per month, and I tend to use several months of data to train and then a different month to predict (e.g. train on Apr and May, predict on Jun).
Generally this means I end up with a training dataset of about 100k rows and 20 features (I've previously looked into feature selection and found a set of 7 features which seem to perform best, so have been using these lately). My prediction set generally has around 50k rows.
My dataset is heavily unbalanced (approximately 2% incidence of target feature), so I'm using oversampling techniques - more on that below.
Method
I've searched around online quite a lot and this has led me to the following approach (a rough code sketch of it appears after this list):
Take scaleable (continuous) features in the training data and standardise them (currently using sklearn StandardScaler)
Take categorical features and encode them into separate binary columns (one-hot) using Pandas get_dummies function
Remove 10% of the training data to form a validation set (I'm currently using a random seed in this process for comparability whilst I vary different things such as hyperparameters in the model)
Take the remaining 90% of training data and perform a grid search across a few parameters of the RandomForestClassifier() (currently min_samples_split, max_depth, n_estimators and max_features)
Within each hyperparameter combination from the grid I perform kfold validation with 5 folds and using a random state
Within each fold I oversample my minority class for training data only (sometimes using imbalanced-learn's RandomOverSampler() and sometimes using SMOTE() from the same package), train the model on the training data and then apply the model to the kth fold and record performance metrics (precision, recall, F1 and AUC)
Once I've been through 5 folds on each hyperparameter combination I find the best F1 score (and best precision if two combinations are tied on F1 score) and retrain a random forest on the entire 90% training data using those hyperparameters. During this step I use the same oversampling technique as I did in the kfold process
I then use this model to make predictions on the 10% of training data that I put aside earlier as a validation set, evaluating the same metrics as above
Finally I have a test set, which is actually based on data from another month, which I apply the already trained model to and evaluate the same metrics
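For reference, a minimal sketch of how the grid search with fold-internal oversampling described above is often wired up with imbalanced-learn's Pipeline (the question notes trouble getting this to work; X_train_90 / y_train_90 and all parameter values below are placeholders, not the original code):
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# A sampler inside an imbalanced-learn Pipeline is applied to the training folds only
pipe = Pipeline([
    ("oversample", RandomOverSampler(random_state=42)),
    ("rf", RandomForestClassifier(random_state=42)),
])
param_grid = {
    "rf__n_estimators": [100, 300],
    "rf__max_depth": [5, 10, None],
    "rf__min_samples_split": [2, 10],
    "rf__max_features": ["sqrt", 0.5],
}
search = GridSearchCV(pipe, param_grid, scoring="f1",
                      cv=StratifiedKFold(5, shuffle=True, random_state=42))
search.fit(X_train_90, y_train_90)   # the 90% training split
best_model = search.best_estimator_  # refit on the full 90% with the best hyperparameters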
Outcome
At the moment my training set achieves an F1 score of around 30%, the validation set is consistently slightly higher at around 36% (mostly driven by much better precision than on the training data, e.g. 60% vs. 30%), and the test set gets an F1 score between 45% and 50%, again driven by better precision (around 65%).
Notes
Please do ask about any details I haven't mentioned; I've been stuck in this for weeks and so have doubtless omitted some details.
I've had a brief look (not a systematic analysis) at the stability of metrics between folds in the k-fold validation, and it seems they aren't varying very much, so I'm fairly happy with the stability of the model here.
I'm actually performing the grid search manually rather than using a Python pipeline, because try as I might I couldn't get imbalanced-learn's Pipeline function to work with the oversampling functions, so I run a loop over combinations of hyperparameters. I'm confident this isn't adversely affecting the results described above.
When I apply the final model to the prediction data (and get an F1 score around 45%) I also apply it back to the training data itself out of interest and get F1 scores around 90% - 100%. I suppose this is to be expected as the model is trained and predicts on almost exactly the same data (except the 10% holdout validation set)

Homogeneous vs heterogeneous ensembles

I would like to check with you if my understanding about ensemble learning (homogeneous vs heterogeneous) is correct.
Is the following statement correct?
A homogeneous ensemble is a set of classifiers of the same type built upon different data, such as a random forest, and a heterogeneous ensemble is a set of classifiers of different types built upon the same data.
If it's not correct, could you please clarify this point?
A homogeneous ensemble consists of members with a single type of base learning algorithm. Popular methods like bagging and boosting generate diversity by sampling from, or assigning weights to, the training examples, but generally use a single type of base classifier to build the ensemble.
A heterogeneous ensemble, on the other hand, consists of members with different base learning algorithms, such as SVMs, ANNs and decision trees. A popular heterogeneous ensemble method is stacking, which is similar to boosting.
EDIT:
Homogeneous ensemble methods use the same feature selection method with different training data, distributing the dataset over several nodes, while heterogeneous ensemble methods use different feature selection methods with the same training data.
Heterogeneous ensembles (HEE) use different fine-tuned algorithms. They usually work well if we have a small number of estimators. Note that the number of algorithms should always be odd (3+) in order to avoid ties. For example, we could combine a decision tree, an SVM and a logistic regression using a voting mechanism to improve the results, then use their combined wisdom through a majority vote to classify a given sample. Besides voting, we can also use averaging or stacking to aggregate the results of the models. The data for each model is the same.
Homogeneous ensembles (HOE), such as bagging, work by applying the same algorithm to all the estimators. These algorithms should not be fine-tuned: they should be weak! In contrast to HEE, we use a large number of estimators. Note that the datasets for these models should be sampled separately in order to guarantee independence, and the dataset should be different for each model. This allows us to be more precise when aggregating the results of each model. Bagging reduces variance, as the sampling is truly random. By using the ensemble itself, we reduce the risk of over-fitting and create a robust model. Unfortunately, bagging is computationally expensive.
EDIT: Here is an example in code
Heterogeneous Ensemble Function:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier

# Instantiate the individual models
clf_knn = KNeighborsClassifier(n_neighbors=5)
clf_decision_tree = DecisionTreeClassifier()
clf_logistic_regression = LogisticRegression()

# Create the voting classifier (hard majority vote by default)
clf_voting = VotingClassifier(
    estimators=[
        ('knn', clf_knn),
        ('dt', clf_decision_tree),
        ('lr', clf_logistic_regression),
    ])

# Fit it to the training set and predict on the test set
clf_voting.fit(X_train, y_train)
y_pred = clf_voting.predict(X_test)
Homogeneous Ensemble Function:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Instantiate the base estimator, which is a weak model (we set max depth to 3)
clf_decision_tree = DecisionTreeClassifier(max_depth=3)

# Build the Bagging classifier with 5 estimators (we use 5 decision trees);
# note: newer scikit-learn versions use `estimator=` instead of `base_estimator=`
clf_bag = BaggingClassifier(
    base_estimator=clf_decision_tree,
    n_estimators=5,
)

# Fit the Bagging model to the training set
clf_bag.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf_bag.predict(X_test)
Conclusion: In summary what you say is correct, yes.

Early stopping : neural networks

I'm working on relation classification with the SemEval-2010 Task 8 dataset. The dataset is already split into 8,000 samples for training and 2,717 for testing. In order to be as fair as possible, I use the test set only at the end, to compute my model's performance (F1-score).
In order to tune my convolutional neural network, I keep 6,400 samples for training and 1,600 for validation. I train the model, and after each epoch (roughly 10 minutes of computation) I compute the F1-score of my predictions.
I read the paper http://page.mi.fu-berlin.de/prechelt/Biblio/stop_tricks1997.pdf and stop training when the validation error has increased over the last 3 evaluations (similar to the UP criterion in the paper). In the paper, they return the model corresponding to the best performance seen so far.
My question is: to be as accurate as possible, we need the whole 8,000 samples for training. Is it correct to retrain on all 8,000 samples up to the epoch that had the best performance on the validation set, and then make the predictions? Or should we keep the saved model corresponding to the best validation performance and "waste" the 1,600 validation samples?
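For illustration, a minimal sketch of the early-stopping bookkeeping described above (a PyTorch-style model is assumed; train_one_epoch, evaluate_f1, train_data and validation_data are hypothetical helpers, not from the question):
import copy

max_epochs, patience = 100, 3
best_f1, best_epoch, best_state = 0.0, 0, None
epochs_without_improvement = 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_data)        # 6,400 training samples
    f1 = evaluate_f1(model, validation_data)  # 1,600 validation samples
    if f1 > best_f1:
        best_f1, best_epoch = f1, epoch
        best_state = copy.deepcopy(model.state_dict())  # remember the best model so far
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break

# Option A: restore best_state and predict (the model saw only 6,400 training samples).
# Option B: retrain from scratch on all 8,000 samples for best_epoch + 1 epochs, then predict.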
