spacy - 3.1 custom loss function and data augmentation for named entity recognition for imbalanced data - named-entity-recognition

How can I write a custom loss function for named entity recognition on imbalanced data in spaCy v3 and above? The labels in my dataset are imbalanced. For example, label a has 45000 annotations, while label b has only 4000. How can I do augmentation and write a custom loss function in spaCy?

Related

pytorch class weights for multi class classification

I am using class weights for multiclass classification, computed with sklearn's compute_class_weight function, and PyTorch for training the model. To compute the class weights, do we need to use all the data (training, validation and test) or only the training set? Thanks
When training your model you can only assume that the training data is available to you.
Estimating the class_weights is part of the training (it defines your loss function), so compute them from the training set only.
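As a rough illustration of the answer above, the sketch below derives weights from the training labels only, using the usual "balanced" heuristic n_samples / (n_classes * count). The function name balanced_class_weights and the toy labels are made up for this example; the validation and test labels are never touched.

```python
from collections import Counter


def balanced_class_weights(train_labels):
    """Return {label: weight} using n_samples / (n_classes * count).

    Only the training labels are passed in, so validation and test
    data cannot influence the loss function.
    """
    counts = Counter(train_labels)
    n_samples = len(train_labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Toy training labels (hypothetical): class "c" is the rarest.
train_labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
weights = balanced_class_weights(train_labels)

for label in sorted(weights):
    print(label, round(weights[label], 3))
```

Rare classes receive larger weights. In PyTorch, such weights would typically be arranged in class-index order as a tensor and passed as the weight argument of torch.nn.CrossEntropyLoss.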

Recall and f1-score are pretty low (~0.55 and 0.65) for unseen instances of custom entity on transfer-learned spacy NER model

I have a dataset annotated with a custom entity. Each data point is a long text (not a single sentence), possibly with multiple entities. The corpus size is around 1200 texts. The corpus is divided into train, validation and test sets as follows:
train set (~60% of the data)
validation set (~20%, containing some instances of the entity that are not present in the training set)
test set (~20%, containing some instances of the entity that are not present in either the train or the validation set).
I'm using transfer learning with the pretrained en_core_web_sm model.
I also have a custom function to get precision, recall and F1 score separately for unseen instances in the dataset (based on get_ner_prf from spaCy).
When I train the model, precision, recall and F1 score reach 1 for seen instances of the entity in the validation set, but recall on unseen instances is very poor.
When predictions are made on the test set, the model performs very poorly, especially on unseen instances (~0.55 recall and ~0.65 F1 score).
Are there any recommendations for improving the model's performance (especially on unseen instances)?
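For readers unfamiliar with this kind of seen/unseen evaluation, here is one way it could be computed. This is not the asker's function and does not use spaCy's get_ner_prf. It is a minimal sketch that assumes spans are given as (doc_id, start, end, label, text) tuples and that "unseen" means the entity's surface text never appears among the training entities. All names and data are hypothetical.

```python
def prf(gold, pred):
    """Exact-match precision, recall and F1 over two sets of spans."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def unseen_prf(gold_spans, pred_spans, train_entity_texts):
    """Score only spans whose surface text is absent from the training set."""
    seen = {text.lower() for text in train_entity_texts}
    gold_unseen = {s for s in gold_spans if s[4].lower() not in seen}
    pred_unseen = {s for s in pred_spans if s[4].lower() not in seen}
    return prf(gold_unseen, pred_unseen)


# Toy data (hypothetical): "aspirin" was seen in training, the others were not.
train_entity_texts = ["aspirin"]
gold = {
    (0, 0, 7, "DRUG", "aspirin"),
    (0, 20, 29, "DRUG", "ibuprofen"),
    (1, 5, 16, "DRUG", "paracetamol"),
}
pred = {
    (0, 0, 7, "DRUG", "aspirin"),
    (0, 20, 29, "DRUG", "ibuprofen"),
}

precision, recall, f1 = unseen_prf(gold, pred, train_entity_texts)
print(precision, recall, round(f1, 3))
```

Here the model finds one of the two unseen gold entities, so recall on unseen instances is 0.5 even though the seen entity is predicted correctly.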

Clustering models like DBSCAN, OPTICS, KMeans

After clustering with any algorithm, is it possible to segment new data based on what was learned from the previous data?
The issue is that clustering algorithms are unsupervised learning algorithms. They don't need a dependent variable to predict classes. They are used to find structures/similarities in the data points. What you can do is treat the clustered data as your supervised data.
The approach would be clustering and assigning labels in the train data. Treat it as a multi-class classification data, train a new multi-class classification model using your data and validate it on the test data.
Let train and test be the datasets.
clusters <- Clustering(train)
train[y] <- clusters
model <- Classification(train, train[y])
prediction <- model.predict(test)
Interestingly, however, KMeans in sklearn provides fit and predict methods, so with KMeans from sklearn you can predict on new data. DBSCAN, on the other hand, has no predict method, which follows from how it works: it assigns clusters by density within the fitted data rather than by learned centroids.
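To make the fit/predict idea concrete without depending on sklearn, the sketch below is a tiny standard-library stand-in: it fits centroids on the training points and then assigns new points to the nearest centroid, which is the same kind of assignment KMeans performs at prediction time. The names fit_kmeans and nearest, the naive initialisation and the toy points are all assumptions made for this example.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def fit_kmeans(points, k, iterations=20):
    """Very small k-means: returns a list of k centroids."""
    # Naive initialisation: use the first k points as starting centroids.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster becomes empty
                centroids[i] = [sum(c) / len(group) for c in zip(*group)]
    return centroids


# Toy training data (hypothetical): two well-separated blobs.
train = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
centroids = fit_kmeans(train, k=2)

# "Predict" on new data: assign each new point to its nearest centroid.
new_points = [(1.1, 0.9), (9.0, 9.2)]
labels = [nearest(p, centroids) for p in new_points]
print(labels)
```

For algorithms without a predict step, the cluster ids found on the training data can instead serve as labels for a separate classifier, as in the pseudocode above.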
Clustering is an unsupervised mechanism in which the number of clusters and the identity of the segments to be clustered are not known to the system.
Hence, what you can do is take what a model has learned when trained for clustering, classification, identification or verification, and apply that learning to your clustering use case.
If the new data is from the same domain as the training data, you will most probably end up with better clustering accuracy. (You need to choose the clustering method according to the type of data you have; e.g. for voice clustering, dominant sets and hierarchical clustering are the most promising candidates.)
If the new data is from a different domain, the selected model may fail, because the features it learned correspond to the domain of your training data.

About training HMM by using EM

I am new to the EM algorithm and am studying Hidden Markov Models.
While training my HMM with EM (for text processing), I got very confused about how to set up the data.
Please confirm whether my use of EM is okay.
At first, I calculated the statistics for the emission probability matrix from my whole training set, and then ran EM on the same set.
-> The emission probabilities for unseen data converged to zero.
While reading the textbook Speech and Language Processing, I found that exercise 8.3 describes a two-phase training method.
8.3 Extend the HMM tagger you built in Exercise 8.?? by adding the ability to make use of some unlabeled data in addition to your labeled training corpus. First acquire a large unlabeled corpus. Next, implement the forward-backward training algorithm. Now start with the HMM parameters you trained on the training corpus in Exercise 8.??; call this model M0. Run the forward-backward algorithm with these HMM parameters to label the unsupervised corpus. Now you have a new model M1. Test the performance of M1 on some held-out labeled data.
Following this statement, I selected some instances from my training set (1/3 of it) to get the initial statistics.
Then I ran the EM procedure on the whole training set to optimize the parameters.
Is this okay?
The procedure that the exercise refers to is a type of semi-supervised learning known as self-training. The idea is that you use your entire labeled training set to build a model. Then you collect more data that is unlabeled; it is much easier to find new unlabeled data than new labeled data. After that, you label the new data using the model you originally trained. Finally, using the automatically generated labels, you train a new model.
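The self-training loop described above can be sketched as follows. To keep it short, this uses a simple count-based greedy tagger rather than a full HMM with forward-backward, so it illustrates only the M0 -> auto-label -> M1 procedure, not the exercise's exact algorithm. The functions train and tag, the fallback tag "NOUN" and the toy sentences are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Count word->tag (emission) and previous-tag->tag (transition) events."""
    emit = defaultdict(Counter)
    trans = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag_ in sentence:
            emit[word][tag_] += 1
            trans[prev][tag_] += 1
            prev = tag_
    return emit, trans


def tag(model, words):
    """Greedy tagging: known words use emissions, unknown words use transitions."""
    emit, trans = model
    prev, tagged = "<s>", []
    for word in words:
        if word in emit:
            best = emit[word].most_common(1)[0][0]
        elif trans[prev]:
            best = trans[prev].most_common(1)[0][0]
        else:
            best = "NOUN"  # arbitrary fallback tag for this sketch
        tagged.append((word, best))
        prev = best
    return tagged


# Step 1: build M0 from the entire labeled training set.
labeled = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]
m0 = train(labeled)

# Step 2: label new, unlabeled data with M0.
unlabeled = [["the", "bird", "sings"]]
auto_labeled = [tag(m0, sentence) for sentence in unlabeled]

# Step 3: train M1 on the labeled data plus the automatically labeled data.
m1 = train(labeled + auto_labeled)
print(auto_labeled)
```

M1 now has emission counts for words such as "bird" that M0 had never seen. As the exercise says, M1 should then be evaluated on held-out labeled data, since automatically generated labels can also reinforce M0's mistakes.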

Training DeepBelief Network to recognize multiple categories?

The learning example of the DeepBelief framework demonstrates how to train a neural network to recognize one object category. The method used for training, jpcnn_train(), does not have a category label parameter.
However, in the DeepBelief simple example, the given neural network can categorize multiple object categories. Is there a way to do that kind of training through DeepBelief? Or should I look into Caffe and use that instead, as DeepBelief is based on Caffe?
Based on their documentation, in particular the docs for the functions jpcnn_train and jpcnn_predict, it does not appear to support multiclass classification for custom labels out of the box. It does seem to support multiclass classification for ImageNet labels.
However, you can train multiple predictors (here's how to train one), one per custom class, and then choose the class whose predictor outputs the highest value.
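The "one predictor per class, pick the highest output" scheme (often called one-vs-rest) can be sketched in a few lines. The predictors below are hypothetical stand-in functions, not DeepBelief calls; in practice each would wrap a separately trained predictor, and the function name classify_one_vs_rest is made up for this example.

```python
def classify_one_vs_rest(predictors, features):
    """Run every per-class predictor and return the best label plus all scores."""
    scores = {label: predict(features) for label, predict in predictors.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical stand-ins for three independently trained binary predictors.
# Each maps a feature vector to a confidence score for its own class.
predictors = {
    "cat": lambda f: 0.2 * f[0] + 0.1 * f[1],
    "dog": lambda f: 0.7 * f[0] + 0.2 * f[1],
    "bird": lambda f: 0.1 * f[0] + 0.9 * f[1],
}

label, scores = classify_one_vs_rest(predictors, [1.0, 0.0])
print(label, scores)
```

One caveat with this design is that the scores of independently trained predictors are not necessarily calibrated against each other, so comparing them directly is a heuristic.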
