How to create an incremental NER training model (appending to an existing model)?

I am training a customized Named Entity Recognition (NER) model using Stanford NLP, but I want to be able to re-train the model.
Example:
Suppose I trained an xyz model and then tested it on some text. If the model detects something wrong, I (the end user) will correct it and want to re-train (append to) the model on the corrected text.
Stanford doesn't provide a re-training facility, so I shifted to the spaCy library in Python, where I can retrain the model, i.e. append new entities to the existing model. But after re-training with spaCy, the model overrides its existing knowledge (the entities it was originally trained on) and only shows results related to the most recent training.
For example, consider a model trained on a TECHNOLOGY tag using 1000 records. Say I then add one more entity, BOOK_NAME, to the existing trained model. If I test the model after this, spaCy only detects BOOK_NAME in the text.
Please give a suggestion to tackle my problem statement.
Thanks in advance!

This may be a bit late, but the issue you are facing is what is known as the 'catastrophic forgetting' problem. You can get around it by also feeding in examples for the entities the model already knows. spaCy's pretrained models predict well on well-formed text such as a BBC corpus, so you can take such a corpus, run the pretrained spaCy model over it, and turn its predictions into training examples. Mix these examples with your new examples and then train. You should now get better results. This approach has already been discussed in the spaCy issue tracker.
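A rough sketch of that pseudo-rehearsal idea, assuming spaCy v3; the rehearsal texts, the BOOK_NAME label, and the new example with its character offsets are all illustrative placeholders, not part of the original question:

```python
# Pseudo-rehearsal sketch for spaCy v3 (assumed): mix the pretrained model's
# own predictions with the new annotated examples before updating.
import random
import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")       # pretrained model whose knowledge we want to keep
ner = nlp.get_pipe("ner")
ner.add_label("BOOK_NAME")               # hypothetical new entity type

# 1) "Rehearsal" examples: let the pretrained model annotate well-formed text
#    and turn its own predictions into training data.
rehearsal_texts = [
    "Apple is looking at buying a U.K. startup.",
    "London is a big city in the United Kingdom.",
]
rehearsal_examples = []
for text in rehearsal_texts:
    doc = nlp(text)
    ents = [(e.start_char, e.end_char, e.label_) for e in doc.ents]
    rehearsal_examples.append(
        Example.from_dict(nlp.make_doc(text), {"entities": ents})
    )

# 2) New, manually annotated examples for the new label (offsets are for this sentence).
new_examples = [
    Example.from_dict(
        nlp.make_doc("I just finished reading Dune last night."),
        {"entities": [(24, 28, "BOOK_NAME")]},
    )
]

# 3) Mix both sets and update the existing model instead of training from scratch.
optimizer = nlp.resume_training()
train_data = rehearsal_examples + new_examples
for _ in range(20):
    random.shuffle(train_data)
    losses = {}
    nlp.update(train_data, sgd=optimizer, drop=0.35, losses=losses)
```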

Related

Evaluation of generative models like variational autoencoder

I hope everyone is doing well.
I need some help with generative models.
I'm working on a project where the main task is to build a binary classification model. The dataset contains 300,000 samples and 100 features, and there is an imbalance between the two classes: the majority class is much larger than the minority class.
To handle this problem, I'm using a VAE (variational autoencoder).
I started by training the VAE on the minority class, then used the decoder part of the VAE to generate new (fake) samples similar to the minority class, and concatenated this new data with the training set to obtain a new, balanced training set.
My question is: is there any way to evaluate generative models like VAEs, i.e. a way to know whether the generated data is similar to the real data?
I have read that there are metrics for evaluating generated data, such as the Inception Score and the Fréchet Inception Distance, but I saw that they have only been used on image data.
Can I use them on my dataset too?
Thanks in advance.
I believe your data is not image data, since you say there are 100 features. What you can do is check the similarity between the synthesised samples and the original ones (those belonging to the minority class) and keep only the synthesised samples above a certain similarity threshold. Cosine similarity would be useful for this.
It would also be very helpful to look at a scatter plot of the synthesised samples against the original ones to see whether they are close to each other; t-SNE is useful at this point.
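A minimal sketch of those two checks with scikit-learn and matplotlib; X_real and X_fake are placeholder arrays standing in for the minority class and the VAE output, and the 0.8 threshold is an arbitrary assumption:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X_real = rng.normal(size=(500, 100))   # real minority-class samples (placeholder)
X_fake = rng.normal(size=(500, 100))   # VAE-generated samples (placeholder)

# 1) Keep only synthesised samples whose best match in the real data is
#    above a similarity threshold (0.8 is an arbitrary choice).
sim = cosine_similarity(X_fake, X_real)          # shape (n_fake, n_real)
keep = sim.max(axis=1) > 0.8
X_fake_filtered = X_fake[keep]
print(f"kept {keep.sum()} of {len(X_fake)} generated samples")

# 2) t-SNE scatter plot of real vs. generated samples.
X_all = np.vstack([X_real, X_fake])
emb = TSNE(n_components=2, random_state=0).fit_transform(X_all)
plt.scatter(emb[:len(X_real), 0], emb[:len(X_real), 1], s=5, label="real")
plt.scatter(emb[len(X_real):, 0], emb[len(X_real):, 1], s=5, label="generated")
plt.legend()
plt.show()
```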

Using Keras to create a model that can generate new, similar data

I am working with Keras and experimenting with AI and machine learning. I have already built a few projects, and now I'm looking to replicate a dataset. What direction do I go to learn this? What should I be looking up to begin learning about this kind of model? I just need an expert to point me in the right direction.
To clarify: by replicating a dataset I mean I want to take a series of numbers with an easily distinguishable pattern and then have the AI generate new data that is similar.
There are several ways to generate new data similar to a current dataset, but the most prominent way nowadays is to use a Generative Adversarial Network (GAN). This works by pitting two models against one another. The generator model attempts to generate data, and the discriminator model attempts to tell the difference between real data and generated data. There are plenty of tutorials out there on how to do this, though most of them are probably based on image data.
If you want to generate labels as well, make a conditional GAN.
The only other common method for generating data is a Variational Autoencoder (VAE), but the generated data tend to be lower-quality than what a GAN can generate. I don't know if that holds true for non-image data, though.
You can also use a Conditional Variational Autoencoder (CVAE), which produces new data together with labels.
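To make the GAN idea concrete, here is a minimal Keras sketch for simple numeric (non-image) data; the network sizes, hyperparameters, and the synthetic "real" dataset are arbitrary assumptions, not taken from the answer above:

```python
# Minimal GAN sketch in Keras for 1-D numeric data.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

latent_dim, data_dim = 16, 10

# "Real" data: a simple synthetic pattern the generator should imitate.
real_data = np.random.normal(loc=3.0, scale=0.5, size=(2000, data_dim)).astype("float32")

generator = keras.Sequential([
    layers.Dense(32, activation="relu", input_shape=(latent_dim,)),
    layers.Dense(data_dim),
])
discriminator = keras.Sequential([
    layers.Dense(32, activation="relu", input_shape=(data_dim,)),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Combined model: freeze the discriminator while training the generator through it.
discriminator.trainable = False
gan = keras.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

batch = 64
for step in range(500):
    # Train the discriminator on a mix of real and generated samples.
    noise = np.random.normal(size=(batch, latent_dim)).astype("float32")
    fake = generator.predict(noise, verbose=0)
    real = real_data[np.random.randint(0, len(real_data), batch)]
    x = np.vstack([real, fake])
    y = np.vstack([np.ones((batch, 1)), np.zeros((batch, 1))])
    discriminator.train_on_batch(x, y)

    # Train the generator to fool the discriminator (labels flipped to 1).
    noise = np.random.normal(size=(batch, latent_dim)).astype("float32")
    gan.train_on_batch(noise, np.ones((batch, 1)))

# Generate a few new samples similar to the training pattern.
new_samples = generator.predict(np.random.normal(size=(5, latent_dim)), verbose=0)
print(new_samples)
```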

Train Spacy NER model with 'en_core_web_sm' as base model

I am using spaCy to train my NER model with new entities, and I am using the en_core_web_sm model as my base model because I also want to detect the basic entities (ORG, PERSON, DATE, etc.). I ran the en_core_web_sm model over unlabelled sentences and added its annotations to my training set.
Now that that is done, I want to create the training data for the new entities. For example, I want to add a new entity called FRUIT. I have a bunch of sentences (in addition to those annotated with en_core_web_sm earlier) that I am going to annotate. An example sentence is:
"James likes eating apples".
My question is: do I still need to annotate "James" as PERSON as well as annotating "apples" as FRUIT? Or can I skip it because I already have another set of sentences that were annotated with the PERSON entity using the en_core_web_sm model earlier?
Short answer:
Yes, if you want to keep your model precise.
Long answer:
NER is implemented using machine learning algorithms, which classify a token as an entity based on learned distributions and the surrounding tokens.
Therefore, if you provide several samples of annotated text without marking a word (token) as the entity it usually represents, you may hurt your model's precision, because you are giving it samples in which that token is treated as unimportant.
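As a small illustration of the point, using the sentence from the question in spaCy's offset-based training format (FRUIT is the hypothetical new label; offsets are for this exact string):

```python
text = "James likes eating apples"

# Fully annotated: the model keeps seeing PERSON in context while learning FRUIT.
good_annotation = {"entities": [(0, 5, "PERSON"), (19, 25, "FRUIT")]}

# Incomplete: "James" is implicitly treated as a non-entity token,
# which can erode the model's existing PERSON predictions over time.
bad_annotation = {"entities": [(19, 25, "FRUIT")]}
```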

text2vec - Do topics' words update with new data?

I'm currently performing topic modelling using LDA from the text2vec package. I managed to create a document-term matrix (DTM) and then applied LDA and its fit_transform method with n_topics = 50.
While looking at the top words from each topic, a question popped into my mind. I plan to apply the model to new data afterwards, and there is a possibility that new words will occur which the model has not encountered before. Will the model still be able to assign each word to its respective topic? Moreover, will these words also be added to the topic, so that I will be able to locate them using get_top_words?
Thank you for answering!
The idea of statistical learning is that the underlying distributions of the "train" data and "test" data are more or less the same. So if your new documents have a totally different distribution, you can't expect LDA to magically work; this is true for any other model as well.
At inference time the topic-word distribution is fixed (it was learned at the training stage), so get_top_words will always return the same words once the model has been trained.
And of course new words won't be included automatically: the document-term matrix is constructed from a vocabulary (which you learn before constructing the DTM), so new documents will also contain only words from that fixed vocabulary.
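text2vec is an R package, so this is only an analogous sketch in Python with scikit-learn, but it illustrates the same behaviour: the vocabulary and the topic-word distribution are frozen at training time, so unseen words are simply dropped when transforming new documents.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["cats purr and sleep", "dogs bark and run", "cats chase dogs"]
new_docs = ["hamsters nibble seeds"]        # words never seen during training

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)   # the vocabulary is learned here

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X_train)

# New documents are mapped onto the SAME vocabulary: unseen words are dropped.
X_new = vectorizer.transform(new_docs)
print(X_new.toarray())        # all zeros - no training word appears in the new doc

# The topic-word matrix (lda.components_) does not change during transform,
# so the "top words" per topic stay exactly as learned.
print(lda.transform(X_new))   # topic proportions for the new document
```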

Adding new classes to SGDClassifier?

I'm currently using partial_fit with SGDClassifier to fit a model to predict the hashtags on images.
The problem I'm having is that SGDClassifier requires specifying the classes upfront. This is fine for fitting a model offline, but I'd like to add new classes online as I observe new hashtags. Currently, I need to retrain a new model from scratch to accommodate the new classes.
Is there a way to have SGDClassifier accept new classes without having to retrain a new model? Or would I be better off training a separate binary SGDClassifier for each hashtag?
Thanks
Hashtags are usually just tags, so one object can have many of them. In such a setting there is no multiclass scenario; you should simply have a single binary SGD classifier per tag. You could obviously fit more complex models that take relationships between tags into account, but SGD does not do so, so using it in the setting you describe makes no more sense than just having N distinct classifiers.
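A small sketch of that "one binary classifier per tag" setup; the feature vectors, tag names, and helper functions below are illustrative placeholders. Because each classifier is binary, its class list is always [0, 1], so a brand-new hashtag just means creating one more small model rather than retraining everything.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

classifiers = {}          # hashtag -> binary SGDClassifier

def update_tag(tag, X, y):
    """Online update for one hashtag; y is 1 if the image has the tag, else 0."""
    if tag not in classifiers:
        # New hashtag observed: spin up a fresh binary classifier.
        classifiers[tag] = SGDClassifier(loss="log_loss")  # loss="log" on older scikit-learn
    # Binary problem, so the class list is always known upfront.
    classifiers[tag].partial_fit(X, y, classes=np.array([0, 1]))

def predict_tags(x, threshold=0.5):
    """Return every hashtag whose classifier scores the image above the threshold."""
    x = x.reshape(1, -1)
    return [tag for tag, clf in classifiers.items()
            if clf.predict_proba(x)[0, 1] >= threshold]

# Toy usage with random "image features".
rng = np.random.default_rng(0)
X_batch = rng.normal(size=(32, 128))
update_tag("#sunset", X_batch, rng.integers(0, 2, size=32))
update_tag("#food", X_batch, rng.integers(0, 2, size=32))
print(predict_tags(X_batch[0]))
```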
