deeplearning4j: online Word2Vec training

Word2vec is a great tool in deeplearning4j. I managed to create word vectors for a corpus following this tutorial.
The question now is how to update the model with new sentences without having to rebuild it from scratch.
Some thoughts on this: would this method help?
public void trainSentence(List<VocabWord> sentence){}
Would that update the model? If so, how should a sentence be prepared before passing it to this method?

Yes and no. The documentation here mentions:
Weights update after model serialization/deserialization was added.
That is, you can update model state with, say, 200GB of new text by
calling loadFullModel, adding TokenizerFactory and
SentenceIterator to it, and calling fit() on the restored model.
This means that the model weights can be retrained and updated with a new corpus, but no new words will be added to the vocabulary.
Check code and Javadoc here.
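For reference, a minimal sketch of that workflow in Java, based on the uptraining example in the deeplearning4j documentation quoted above. The file paths are placeholders, and the exact serializer method names (loadFullModel/saveFullModel) may differ between dl4j versions, so treat this as an outline rather than version-exact code.

import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class Word2VecUptraining {
    public static void main(String[] args) throws Exception {
        // Restore the full model (weights, vocab, Huffman tree) saved earlier.
        Word2Vec vec = WordVectorSerializer.loadFullModel("/path/to/model.txt");

        // The restored model has no iterator or tokenizer attached,
        // so both must be set again before further training.
        SentenceIterator iterator = new BasicLineIterator("/path/to/new_corpus.txt");
        TokenizerFactory tokenizer = new DefaultTokenizerFactory();
        tokenizer.setTokenPreProcessor(new CommonPreprocessor());

        vec.setTokenizerFactory(tokenizer);
        vec.setSentenceIterator(iterator);

        // Continue training: weights of words already in the vocab are updated,
        // but no new words are added to the vocabulary.
        vec.fit();

        // Persist the updated model for the next round of uptraining.
        WordVectorSerializer.saveFullModel(vec, "/path/to/model_updated.txt");
    }
}

Newer dl4j releases also ship a readWord2VecModel(...) loader; loadFullModel is used here because it is the method named in the quoted documentation.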

Related

Re-Training CNN with new Data

I built a CNN to classify 10 different classes. It performs well on most of the classes, giving approximately 80-85% accuracy per class. Currently each class has 10000+ images.
But in the future there is a possibility that I might get more data for each class.
For instance, if I get more data for some class, how should I re-train the model?
Should I retrain the entire thing, i.e. with both the old and the new data?
Or should I retrain the model with only the new data? Here I fear that, since the model would see new data for a single class only, it could forget what it has already learned or the accuracy of the other classes could suffer.
If anyone has worked on this problem before, please help.

Should I retrain the model with the whole dataset after using a train-test split to find the best hyperparameters?

I split my dataset into training and testing. At the end, after finding the best hyperparameters on the training dataset, should I fit the model again using all the data? The goal is to reach the highest possible score on new data.
Yes, that would help your model generalize, as more data generally means better generalization.
I don't think so. If you do that, you will no longer have a valid test set. What happens when you come back to improve the model later? You would then need a new test set for each model improvement, which means more labeling, and you won't be able to compare experiments across model versions because the test sets won't be identical.
If you consider this model finished forever, then it is fine.

text2vec - Do topics' words update with new data?

I'm currently performing topic modelling using LDA from the text2vec package. I managed to create a document-term matrix (DTM) and then apply LDA and its fit_transform method with n_topics = 50.
While looking at the top words from each topic, a question popped into my mind. I plan to apply the model to new data afterwards, and there is a possibility that new words will occur which the model has not encountered before. Will the model still be able to assign each word to its respective topic? Moreover, will these words also be added to the topic, so that I will be able to locate them using get_top_words?
Thank you for answering!
The idea of statistical learning is that the underlying distributions of the "train" and "test" data are more or less the same. So if your new documents follow a totally different distribution, you can't expect LDA to magically work. This is true for any other model as well.
At inference time the topic-word distribution is fixed (it was learned at the training stage), so get_top_words will always return the same words once the model is trained.
And of course new words won't be included automatically: the document-term matrix is constructed from a vocabulary (which you learn before constructing the DTM), so new documents will also contain only words from that fixed vocabulary.
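text2vec itself is an R package, so the snippet below is only an illustration of the point above (written in Java to match the rest of this page, with a made-up vocabulary and document): a document-term matrix row is built against the vocabulary fixed at training time, and any unseen word is simply dropped.

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FixedVocabularyDtm {
    public static void main(String[] args) {
        // Vocabulary learned from the training corpus (word -> column index).
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (String w : Arrays.asList("economy", "market", "growth", "election")) {
            vocab.put(w, vocab.size());
        }

        // A "new" document containing a word the model has never seen.
        List<String> newDoc = Arrays.asList("market", "growth", "cryptocurrency");

        // DTM row for the new document: count only in-vocabulary terms;
        // the out-of-vocabulary word ("cryptocurrency") is silently dropped.
        int[] row = new int[vocab.size()];
        for (String token : newDoc) {
            Integer column = vocab.get(token);
            if (column != null) {
                row[column]++;
            }
        }
        System.out.println(Arrays.toString(row)); // prints [0, 1, 1, 0]
    }
}

Because the unseen word never enters the matrix, it can neither be assigned to a topic nor appear in get_top_words.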

How to create an incremental NER training model (appending to an existing model)?

I am training a customized Named Entity Recognition (NER) model using Stanford NLP, but the thing is I want to re-train the model.
Example:
Suppose I trained an xyz model and then test it on some text. If the model gets something wrong, I (the end user) will correct it and want to re-train (append to) the model on the corrected text.
Stanford doesn't provide a re-training facility, so that's why I shifted to the spaCy library in Python, where I can retrain the model, meaning I can append new entities to the existing model. But after re-training the model using spaCy, it overrides the existing knowledge (i.e. the training data already in it) and only returns results related to the most recent training.
Consider: I trained a model on the TECHNOLOGY tag using 1000 records. After that, let's say I added one more entity, BOOK_NAME, to the existing trained model. If I test the model after this, the spaCy model only detects BOOK_NAME in the text.
Please give a suggestion to tackle my problem.
Thanks in advance!
I think it is a bit late to address this here. The issue you are facing is what is also called the 'catastrophic forgetting' problem. You can get around it by also sending in examples for the existing entities. spaCy can predict well on well-formed text such as a BBC corpus, so you can take such a corpus, run the pretrained spaCy model on it, and turn the predictions into training examples. Mix these examples with your new examples and then train. You should now get better results. This was already mentioned in the spaCy issues.

Is it possible to retrain Google's Inception model with one class?

I would like to train this beautiful model to recognize only one type of image. To be clear: in the end I want the model to be able to tell whether a new image belongs to that class or not. Thank you very much for your help.
Something you should keep in mind is that when you want to recognize a "dog", for example, you also need to know what is NOT a "dog". So your classification problem is a two-class problem, not a one-class one. Your two classes will be "My Type" and "Not My Type".
About retraining your model: yes, it is possible. I assume you use a model pretrained on the ImageNet dataset. There are two cases. If the classification problem is close (for example, if your "type" is a class from ImageNet), you can just replace the last layer (replace the fully connected 1x1000 layer with an FC 1x2 layer) and retrain only that layer. If the problem is not the same, you may want to retrain more layers.
It also depends on the number of samples you have for retraining.
I hope it helps or clarifies your question.
Is it possible to retrain Google's Inception model with one class?
Yes. Just remove the last layer, add a new layer with one (or two) nodes, and train it on your new problem. This way you keep the general features learned on the (probably much bigger) ImageNet dataset.
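Since the rest of this page's code is Java/deeplearning4j, here is roughly what "replace the last layer and retrain only that layer" looks like with dl4j's TransferLearning API. This is only a sketch: it uses the VGG16 zoo model as a stand-in for Inception, and the layer names ("fc2", "predictions"), updater, and nIn/nOut values are assumptions carried over from the standard dl4j transfer-learning example, not from the TensorFlow setup in the question.

import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.graph.ComputationGraph;
import org.deeplearning4j.nn.transferlearning.FineTuneConfiguration;
import org.deeplearning4j.nn.transferlearning.TransferLearning;
import org.deeplearning4j.nn.weights.WeightInit;
import org.deeplearning4j.zoo.ZooModel;
import org.deeplearning4j.zoo.model.VGG16;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.learning.config.Nesterovs;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class TwoClassTransferLearning {
    public static void main(String[] args) throws Exception {
        // Load a model pretrained on ImageNet (VGG16 used here as a stand-in for Inception).
        ZooModel zooModel = VGG16.builder().build();
        ComputationGraph pretrained = (ComputationGraph) zooModel.initPretrained();

        // Small learning rate, since only the new layer will be trained.
        FineTuneConfiguration fineTuneConf = new FineTuneConfiguration.Builder()
                .updater(new Nesterovs(5e-5))
                .seed(123)
                .build();

        // Freeze everything up to the last fully connected layer ("fc2"),
        // drop the original 1000-way "predictions" layer and replace it
        // with a 2-way output: "My Type" vs "Not My Type".
        ComputationGraph twoClassModel = new TransferLearning.GraphBuilder(pretrained)
                .fineTuneConfiguration(fineTuneConf)
                .setFeatureExtractor("fc2")
                .removeVertexKeepConnections("predictions")
                .addLayer("predictions",
                        new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                                .nIn(4096).nOut(2)
                                .weightInit(WeightInit.XAVIER)
                                .activation(Activation.SOFTMAX)
                                .build(),
                        "fc2")
                .build();

        // twoClassModel.fit(trainingIterator) would then update only the new layer.
    }
}

If the new problem is further from ImageNet, the answer's second case applies: move the setFeatureExtractor boundary earlier so that more layers are unfrozen and retrained.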

Resources