Recall and f1-score are pretty low (~0.55 and 0.65) for unseen instances of custom entity on transfer-learned spacy NER model - named-entity-recognition

I have a dataset annotated with custom entity. Each data point is long text (not a single sentence), possibly with multiple entities. The corpus size is around 1200 texts. This corpus divided into train-validation-test set as follows:
train-set(~60% of the data)
validation set(~20% containing some instances which are not present in training set for the entity)
test-set(~20% containing some instances that are not present in either train or validation set for entity).
I'm using transfer learning with pretrained en_core_web_sm model.
I have also custom function to get precision-recall-f1 score separately for unseen instances in the dataset. (based off get_ner_prf from spacy)
When i train model, the precision , recall and f1-score values reach till 1 for seen instances of the entity in the validation set , but it has very poor recall on unseen instances.
When predictions made on the test set, model has very poor performance, especially on unseen instances (~0.55 recall and ~0.65 f1 score).
Are there any recommendations to improve the performance of the model (especially for unseen instances) ?

Related

Why does my Random Forest Classifier perform better on test and validation data than on training data?

I'm currently training a random forest on some data I have and I'm finding that the model performs better on the validation set, and even better on the test set, than on the train set. Here are some details of what I'm doing - please let me know if I've missed any important information and I will add it in.
My question
Am I doing anything obviously wrong and do you have any advice for how I should improve my approach because I just can't believe that I'm doing it right when my model predicts significantly better on unseen data than training data!
Data
My underlying data consists of tables of features describing customer behaviour and a binary target (so this is a binary classification problem). Technically I have one such table per month and I tend to use several months of data to train and then a different month to predict (e.g. Train on Apr, May and Predict on Jun)
Generally this means I end up with a training dataset of about 100k rows and 20 features (I've previously looked into feature selection and found a set of 7 features which seem to perform best, so have been using these lately). My prediction set generally has around 50k rows.
My dataset is heavily unbalanced (approximately 2% incidence of target feature), so I'm using oversampling techniques - more on that below.
Method
I've searched around online quite a lot and this has led me to the following approach:
Take scaleable (continuous) features in the training data and standardise them (currently using sklearn StandardScaler)
Take categorical features and encode them into separate binary columns (one-hot) using Pandas get_dummies function
Remove 10% of the training data to form a validation set (I'm currently using a random seed in this process for comparability whilst I vary different things such as hyperparameters in the model)
Take the remaining 90% of training data and perform a grid search across a few parameters of the RandomForestClassifier() (currently min_samples_split, max_depth, n_estimators and max_features)
Within each hyperparameter combination from the grid I perform kfold validation with 5 folds and using a random state
Within each fold I oversample my minority class for training data only (sometimes using imbalanced-learn's RandomOverSampler() and sometimes using SMOTE() from the same package), train the model on the training data and then apply the model to the kth fold and record performance metrics (precision, recall, F1 and AUC)
Once I've been through 5 folds on each hyperparameter combination I find the best F1 score (and best precision if two combinations are tied on F1 score) and retrain a random forest on the entire 90% training data using those hyperparameters. During this step I use the same oversampling technique as I did in the kfold process
I then use this model to make predictions on the 10% of training data that I put aside earlier as a validation set, evaluating the same metrics as above
Finally I have a test set, which is actually based on data from another month, which I apply the already trained model to and evaluate the same metrics
Outcome
At the moment I'm finding that my training set achieves an F1 score of around 30%, the validation set is consistently slightly higher than this at around 36% (mostly driven by a much better precision than the training data e.g. 60% vs. 30%) and then the testing set is getting an F1 score of between 45% and 50% which is again driven by a better precision (around 65%)
Notes
Please do ask about any details I haven't mentioned; I've had my stuck in this for weeks and so have doubtless omitted some details
I've had a brief look (not a systematic analysis) of the stability of metrics between folds in the kfold validation and it seems that they aren't varying very much, so I'm fairly happy with the stability of the model here
I'm actually performing the grid search manually rather than using a Python pipeline because try as I might I couldn't get imbalanced-learn's Pipeline function to work with the oversampling functions and so I run a loop with combinations of hyperparameters, but I'm confident that this isn't impacting the results I've talked about above in an adverse way
When I apply the final model to the prediction data (and get an F1 score around 45%) I also apply it back to the training data itself out of interest and get F1 scores around 90% - 100%. I suppose this is to be expected as the model is trained and predicts on almost exactly the same data (except the 10% holdout validation set)

How to merge two saved keras model?

Let's say I have a dataset of 2 million. At first, I used only 1 million, trained those and saved the model in h5 format like first.h5. Later I used another 1 million data, trained those using the same algorithm and saved as second.h5. Training requires more than a day , hence I can't use all two million data at once. Is there any way , I can merge those two saved model like first.h5 + second.h5 = merged.h5
There is no way you can do that (merge models). Let me put it in simple terms. You train a child named first using some 1 million data to identify if an image is a cat or a dog. Then you trained a second child named second using the other 1 million data to identify if an image is a cat or a dog. Now what you are asking for is to combine the first and second.
However, assume that the training data is IID (independent and identically distributed) then what you can do is create an ensemble of both the models for making predictions.
The simple way to ensemble two models is are
Max Voting
Averaging
Weighted Averaging
Follow this link on how to the ensemble.
Or a simple strategy is to average the final score of both the models and use the averaged score to make the predictions.
A more powerful strategy is to use the validation set to find the weights for the classes and then use these weights for making the final predictions on unseen data.
You could merge - average weights - but this will not be the same as training with full dataset.
Usually training with more data leads to better results, to better model.
If you don' t want to train with full dataset i would recommend not to average weights but to use both models for inference and average predictions.

How do neural networks learn functions instead of memorize them?

For a class project, I designed a neural network to approximate sin(x), but ended up with a NN that just memorized my function over the data points I gave it. My NN took in x-values with a batch size of 200. Each x-value was multiplied by 200 different weights, mapping to 200 different neurons in my first layer. My first hidden layer contained 200 neurons, each one a linear combination of the x-values in the batch. My second hidden layer also contained 200 neurons, and my loss function was computed between the 200 neurons in my second layer and the 200 values of sin(x) that the input mapped to.
The problem is, my NN perfectly "approximated" sin(x) with 0 loss, but I know it wouldn't generalize to other data points.
What did I do wrong in designing this neural network, and how can I avoid memorization and instead design my NN's to "learn" about the patterns in my data?
It is same with any machine learning algorithm. You have a dataset based on which you try to learn "the" function f(x), which actually generated the data. In real life datasets, it is impossible to get the original function from the data, and therefore we approximate it using something g(x).
The main goal of any machine learning algorithm is to predict unseen data as best as possible using the function g(x).
Given a dataset D you can always train a model, which will perfectly classify all the datapoints (you can use a hashmap to get 0 error on the train set), but which is overfitting or memorization.
To avoid such things, you yourself have to make sure that the model does not memorise and learns the function. There are a few things which can be done. I am trying to write them down in an informal way (with links).
Train, Validation, Test
If you have large enough dataset, use Train, Validation, Test splits. Split the dataset in three parts. Typically 60%, 20% and 20% for Training, Validation and Test, respectively. (These numbers can vary based on need, also in case of imbalanced data, check how to get stratified partitions which preserve the class ratios in every split). Next, forget about the Test partition, keep it somewhere safe, don't touch it. Your model, will be trained using the Training partition. Once you have trained the model, evaluate the performance of the model using the Validation set. Then select another set of hyper-parameter configuration for your model (eg. number of hidden layer, learaning algorithm, other parameters etc.) and then train the model again, and evaluate based on Validation set. Keep on doing this for several such models. Then select the model, which got you the best validation score.
The role of validation set here is to check what the model has learned. If the model has overfit, then the validation scores will be very bad, and therefore in the above process you will discard those overfit models. But keep in mind, although you did not use the Validation set to train the model, directly, but the Validation set was used indirectly to select the model.
Once you have selected a final model based on Validation set. Now take out your Test set, as if you just got new dataset from real life, which no one has ever seen. The prediction of the model on this Test set will be an indication how well your model has "learned" as it is now trying to predict datapoints which it has never seen (directly or indirectly).
It is key to not go back and tune your model based on the Test score. This is because once you do this, the Test set will start contributing to your mode.
Crossvalidation and bootstrap sampling
On the other hand, if your dataset is small. You can use bootstrap sampling, or k-fold cross-validation. These ideas are similar. For example, for k-fold cross-validation, if k=5, then you split the dataset in 5 parts (also be carefull about stratified sampling). Let's name the parts a,b,c,d,e. Use the partitions [a,b,c,d] to train and get the prediction scores on [e] only. Next, use the partitions [a,b,c,e] and use the prediction scores on [d] only, and continue 5 times, where each time, you keep one partition alone and train the model with the other 4. After this, take an average of these scores. This is indicative of that your model might perform if it sees new data. It is also a good practice to do this multiple times and perform an average. For example, for smaller datasets, perform a 10 time 10-folds cross-validation, which will give a pretty stable score (depending on the dataset) which will be indicative of the prediction performance.
Bootstrap sampling is similar, but you need to sample the same number of datapoints (depends) with replacement from the dataset and use this sample to train. This set will have some datapoints repeated (as it was a sample with replacement). Then use the missing datapoins from the training dataset to evaluate the model. Perform this multiple times and average the performance.
Others
Other ways are to incorporate regularisation techniques in the classifier cost function itself. For example in Support Vector Machines, the cost function enforces conditions such that the decision boundary maintains a "margin" or a gap between two class regions. In neural networks one can also do similar things (although it is not same as in SVM).
In neural network you can use early stopping to stop the training. What this does, is train on the Train dataset, but at each epoch, it evaluates the performance on the Validation dataset. If the model starts to overfit from a specific epoch, then the error for Training dataset will keep on decreasing, but the error of the Validation dataset will start increasing, indicating that your model is overfitting. Based on this one can stop training.
A large dataset from real world tends not to overfit too much (citation needed). Also, if you have too many parameters in your model (to many hidden units and layers), and if the model is unnecessarily complex, it will tend to overfit. A model with lesser pameter will never overfit (though can underfit, if parameters are too low).
In the case of you sin function task, the neural net has to overfit, as it is ... the sin function. These tests can really help debug and experiment with your code.
Another important note, if you try to do a Train, Validation, Test, or k-fold crossvalidation on the data generated by the sin function dataset, then splitting it in the "usual" way will not work as in this case we are dealing with a time-series, and for those cases, one can use techniques mentioned here
First of all, I think it's a great project to approximate sin(x). It would be great if you could share the snippet or some additional details so that we could pin point the exact problem.
However, I think that the problem is that you are overfitting the data hence you are not able to generalize well to other data points.
Few tricks that might work,
Get more training points
Go for regularization
Add a test set so that you know whether you are overfitting or not.
Keep in mind that 0 loss or 100% accuracy is mostly not good on training set.

Machine learning: training model from test data

I was wondering if a model trains itself from the test data as well while evaluating it multiple times, leading to a over-fitting scenario. Normally we split the training data into train-test splits and I noticed some people split it into 3 sets of data - train, test and eval. eval is for final evaluation of the model. I might be wrong but my point is that if the above mentioned scenario is not true, then there is no need for an eval data set.
Need some clarification.
The best way to evaluate how well a model will perform in the 'wild' is to evaluate its performance on a data set it has not seen (i.e., been trained on) -- assuming you have the labels in a supervised learning problem.
People split their data into train/test/eval and use the training data to estimate/learn the model parameters and the test set to tune the model (e.g., by trying different hyperparameter combinations). A model is usually selected based on the hyperparameter combination that optimizes a test metric (regression - MSE, R^2, etc.; classification - AUC, accuracy, etc.). Then the selected model is usually retrained on the combined train + test data set. After retraining, the model is evaluated based on its performance on the eval data set (assuming you have some ground truth labels to evaluate your predictions). The eval metric is what you report as the generalization metric -- that is, how well your model performs on novel data.
Does this help?
Consider you have train and test datasets. Train dataset is the one in which you know the output and you train your model on train dataset and you try to predict the output of Test dataset.
Most people split train dataset into train and validation. So first you run your model on train data and evaluate it on validation set. Then again you run the model on test dataset.
Now you are wondering how this will help and of any use?
This helps you to understand your model performance on seen data(validation data) and unseen data(your test data).
Here comes bias-variance trade-off into picture.
https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/
Let's consider a binary classification example where a student's previous semester grades, Sports achievements, Extracurriculars etc are used to predict whether or not he will pass the final semester.
Let's say we have around 10000 samples (data of 10000 students).
Now we split them:
Training set - 6000 samples
Validation set - 2000 samples
Test set - 1000 samples
The training data is generally split into three (training set, validation set, and test set) for the following reasons:
1) Feature Selection: Let's assume you have trained the model using some algorithm. You calculate the training accuracy and validation accuracy. You plot the learning curves and find if the model is overfitting or underfitting and make changes (add or remove features, add more samples etc). Repeat until you have the best validation accuracy. Now test the model with the test set to get your final score.
2) Parameter Selection: When you use algorithms like KNN, And you need to find the best K value which fits the model properly. You can plot the accuracy of different K value and choose the best validation accuracy and use it for your test set. (same applies when you find n_estimators for Random forests etc)
3) Model Selection: Also you can train the model with different algorithms and choose the model which better fits the data by testing out the accuracy using validation set.
So basically the Validation set helps you evaluate your model's performance how you must fine-tune it for best accuracy.
Hope you find this helpful.

How to correctly combine my classifiers?

I have to solve 2 class classification problem.
I have 2 classifiers that output probabilities. Both of them are neural networks of different architecture.
Those 2 classifiers are trained and saved into 2 files.
Now I want to build meta classifier that will take probabilities as input and learn weights of those 2 classifiers.
So it will automatically decide how much should I "trust" each of my classifiers.
This model is described here:
http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/#stackingclassifier
I plan to use mlxtend library, but it seems that StackingClassifier refits models.
I do not want to refit because it takes very huge amount of time.
From the other side I understand that refitting is necessary to "coordinate" work of each classifier and "tune" the whole system.
What should I do in such situation?
I won't talk about mlxtend because I haven't worked with it but I'll tell you the general idea.
You don't have to refit these models to the training set but you have to refit them to parts of it so you can create out-of-fold predictions.
Specifically, split your training data in a few pieces (usually 3 to 10). Keep one piece (i.e. fold) as validation data and train both models on the other folds. Then, predict the probabilities for the validation data using both models. Repeat the procedure treating each fold as a validation set. In the end, you should have the probabilities for all data points in the training set.
Then, you can train a meta-classifier using these probabilities and the ground truth labels. You can use the trained meta-classifier on your new data.

Resources