Weka Decision Tree getting too big (out of memory) - machine-learning

For classification I used Weka's J48 decision tree to build a model on several nominal attributes. Now there is more data for classification (5 nominal attributes), but each attribute has 3000 distinct values. I used J48 with pruning, but it ran out of memory (4GB allocated). With a smaller dataset I saw in the output that J48 keeps all leaves, even those with no instances associated with them. Why are they kept in the model? Should I switch to another classification algorithm?

Related

Feature selection with Decision Tree

I'm supposed to perform feature selection on my dataset (independent variables: some aspects of a patient; target variable: whether the patient is ill or not) using a decision tree. After that, with the selected features, I have to implement a different ML model.
My doubt is: when implementing the decision tree, is it necessary to have a train and a test set, or can I just fit the model on the whole dataset?
It is necessary to split the dataset into train and test sets, because otherwise you would measure performance on the same data used for training and could end up over-fitting.
Over-fitting is when the training error keeps decreasing while the generalization error increases, where the generalization error means the model's ability to correctly classify new (never seen before) samples.
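For illustration, a minimal sketch of that workflow with scikit-learn might look like the following, assuming the data is already loaded into numpy arrays X and y (hypothetical names): the tree is fit on the training split only, the held-out test split measures generalization, and the tree's feature importances drive the selection.

```python
# Minimal sketch (assumptions: scikit-learn available, X/y are a numpy
# feature matrix and a binary "ill or not" target; names are hypothetical).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the decision tree on the training split only.
tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_train, y_train)

# Check generalization on the held-out split, not on the training data.
print("test accuracy:", tree.score(X_test, y_test))

# Use the tree's feature importances to pick features for the next model.
selected = np.argsort(tree.feature_importances_)[::-1][:10]  # top-10 features
X_train_sel, X_test_sel = X_train[:, selected], X_test[:, selected]
```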

Application and Deployment of K-Fold Cross-Validation

K-fold cross-validation is a technique that splits the data into K folds for training and testing. The goal is to estimate the generalizability of a machine learning model. The model is trained K times, once on each train fold, and then tested on the corresponding test fold.
Suppose I want to compare a Decision Tree and a Logistic Regression model on some arbitrary dataset with 10 folds. Suppose that, after training each model on each of the 10 folds and obtaining the corresponding test accuracies, Logistic Regression has a higher mean accuracy across the test folds, indicating that it is the better model for the dataset.
Now, for application and deployment: do I retrain the Logistic Regression model on all the data, or do I create an ensemble from the 10 Logistic Regression models that were trained on the K folds?
The main goal of CV is to validate that we did not get the numbers by chance. So, I believe you can just use a single model for deployment.
If you are already satisfied with the hyper-parameters and model performance, one option is to train on all the data you have and deploy that model.
The other, obvious option is to deploy one of the CV models.
As for the ensemble option, I believe it should not give significantly better results than a model trained on all the data: each model trains for the same amount of time, with similar parameters and the same architecture, and only the training data differs slightly, so they shouldn't show different performance. In my experience, ensembling helps when the outputs of the models differ due to architecture or input data (like different image sizes).
The models trained during k-fold CV should never be reused. CV is only used for reliably estimating the performance of a model.
As a consequence, the standard approach is to re-train the final model on the full training data after CV.
Note that evaluating different models is akin to hyper-parameter tuning, so in theory the performance of the selected best model should be re-evaluated on a fresh test set. But with only two models tested, I don't think this is important in your case.
You can find more details about k-fold cross-validation here and there.
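As a purely illustrative sketch of the approach described above (assuming scikit-learn; the dataset and model settings are placeholders), the two models are compared with 10-fold CV and the winner is then refit on all of the data before deployment:

```python
# Sketch only: compares two models with 10-fold CV, then retrains the
# winner on all data for deployment. The dataset is a stand-in example.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)
logreg = LogisticRegression(max_iter=5000)

tree_scores = cross_val_score(tree, X, y, cv=10)
logreg_scores = cross_val_score(logreg, X, y, cv=10)
print("tree   mean accuracy:", np.mean(tree_scores))
print("logreg mean accuracy:", np.mean(logreg_scores))

# The 10 fold-models are discarded; the selected model is refit on all data.
final_model = logreg if np.mean(logreg_scores) >= np.mean(tree_scores) else tree
final_model.fit(X, y)
```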

Recall and f1-score are pretty low (~0.55 and 0.65) for unseen instances of custom entity on transfer-learned spacy NER model

I have a dataset annotated with a custom entity. Each data point is a long text (not a single sentence), possibly with multiple entities. The corpus size is around 1200 texts. The corpus is divided into train, validation and test sets as follows:
train set (~60% of the data)
validation set (~20%, containing some instances of the entity that are not present in the training set)
test set (~20%, containing some instances that are not present in either the train or the validation set)
I'm using transfer learning with the pretrained en_core_web_sm model.
I also have a custom function to get precision, recall and F1 scores separately for unseen instances in the dataset (based off get_ner_prf from spaCy).
When I train the model, the precision, recall and F1 values reach 1 for seen instances of the entity in the validation set, but recall on unseen instances is very poor.
When predictions are made on the test set, the model performs very poorly, especially on unseen instances (~0.55 recall and ~0.65 F1).
Are there any recommendations to improve the performance of the model (especially for unseen instances)?
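For context only, the seen/unseen split in such an evaluation could be sketched roughly as below; this is an assumption-laden illustration (hypothetical gold-annotation layout and model path), not spaCy's own get_ner_prf:

```python
# Rough sketch of a seen/unseen recall split. Hypothetical data layout:
# each example is (text, [(start_char, end_char, label), ...]).
# This is an illustration, not spaCy's get_ner_prf.
import spacy

nlp = spacy.load("output/model-best")  # placeholder path to the fine-tuned model

def recall_by_novelty(examples, train_surface_forms):
    counts = {"seen": [0, 0], "unseen": [0, 0]}  # bucket -> [matched, total]
    for text, gold_spans in examples:
        doc = nlp(text)
        predicted = {(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents}
        for start, end, label in gold_spans:
            # An entity is "seen" if its surface form occurred in the train set.
            bucket = "seen" if text[start:end].lower() in train_surface_forms else "unseen"
            counts[bucket][1] += 1
            if (start, end, label) in predicted:
                counts[bucket][0] += 1
    return {k: (m / t if t else 0.0) for k, (m, t) in counts.items()}
```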

Extract trees and weights from trained xgboost model

I have already trained an xgboost model with about X trees. I want to create some replicas of the model with the exact same hyperparameters but a pruned number of trees. For example, I want to create a model with the same weights and parameters but just half the number of trees. Is it possible to do this using the xgboost API?
I tried a naive quick approach of de-serializing a trained xgboost model and resetting booster_params['num_boost_round'] to half of what it was. But this didn't seem to impact the model quality or the prediction scores at all, implying this parameter is not used when scoring/evaluating.
The only option left is to dump a text or PMML file and parse it back with a subset of the trees. I'm wondering if it is possible to do this with the xgboost API itself (like changing a parameter that has the same effect) without converting to a separate representation/format and parsing it myself.
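As a tentative sketch (an assumption to verify, not something stated above): recent xgboost releases expose iteration_range on Booster.predict and booster slicing, which may achieve the desired effect without dumping and re-parsing; availability depends on the installed version.

```python
# Sketch only: limiting a trained booster to its first N trees.
# Assumes a reasonably recent xgboost (iteration_range and model slicing
# were added in the 1.x series); verify against your installed version.
import xgboost as xgb

bst = xgb.Booster()
bst.load_model("model.json")          # placeholder path to the trained model
n_total = bst.num_boosted_rounds()
n_half = n_total // 2

dtest = xgb.DMatrix(X_test)           # X_test is a placeholder feature matrix

# Option 1: keep the full model, but only use the first half of the trees
# at prediction time.
preds_half = bst.predict(dtest, iteration_range=(0, n_half))

# Option 2: materialize a new booster containing only the first half of
# the trees (model slicing), which can then be saved as its own model.
bst_half = bst[0:n_half]
bst_half.save_model("model_half.json")
```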

How to correctly combine my classifiers?

I have to solve a 2-class classification problem.
I have 2 classifiers that output probabilities. Both of them are neural networks with different architectures.
Those 2 classifiers are trained and saved into 2 files.
Now I want to build a meta-classifier that takes the probabilities as input and learns the weights of those 2 classifiers.
So it will automatically decide how much I should "trust" each of my classifiers.
This model is described here:
http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/#stackingclassifier
I plan to use the mlxtend library, but it seems that StackingClassifier refits the models.
I do not want to refit them because it takes a huge amount of time.
On the other hand, I understand that refitting is necessary to "coordinate" the work of each classifier and "tune" the whole system.
What should I do in such a situation?
I won't talk about mlxtend because I haven't worked with it, but I'll tell you the general idea.
You don't have to refit these models on the full training set, but you do have to refit them on parts of it so you can create out-of-fold predictions.
Specifically, split your training data into a few pieces (usually 3 to 10). Keep one piece (i.e. fold) as validation data and train both models on the other folds. Then predict the probabilities for the validation data using both models. Repeat the procedure, treating each fold as the validation set. In the end, you should have probabilities for all data points in the training set.
Then you can train a meta-classifier using these probabilities and the ground-truth labels. You can use the trained meta-classifier on your new data.
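To make that procedure concrete, here is a minimal illustrative sketch that uses scikit-learn's cross_val_predict to build the out-of-fold probabilities and a logistic regression as the meta-classifier; clf_a, clf_b, X_train, y_train and X_new are placeholders for the two networks and the data, and this is not the mlxtend API:

```python
# Sketch only: out-of-fold stacking of two probabilistic classifiers.
# clf_a and clf_b stand in for the two neural networks (placeholders);
# any scikit-learn-compatible estimator with predict_proba would work.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Out-of-fold probabilities: each row is predicted by models that never
# saw that row during their fold's training.
proba_a = cross_val_predict(clf_a, X_train, y_train, cv=5, method="predict_proba")[:, 1]
proba_b = cross_val_predict(clf_b, X_train, y_train, cv=5, method="predict_proba")[:, 1]
meta_features = np.column_stack([proba_a, proba_b])

# The meta-classifier learns how much to "trust" each base model.
meta_clf = LogisticRegression()
meta_clf.fit(meta_features, y_train)

# At prediction time, feed the base models' probabilities on new data
# into the trained meta-classifier.
new_meta = np.column_stack([
    clf_a.predict_proba(X_new)[:, 1],
    clf_b.predict_proba(X_new)[:, 1],
])
final_proba = meta_clf.predict_proba(new_meta)[:, 1]
```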
