Re-training a TensorFlow model - machine-learning

I am training a customized Named Entity Recognition (NER) model using NeuroNER, which is written in TensorFlow. I am able to train a model and it performs well, but when I re-train it on new observations for which it gave incorrect results, it corrects those while forgetting some previous observations it used to get right.
I want online re-training. I have tried Stanford NLP, spaCy, and now TensorFlow. Please suggest a better way to achieve this.
Thanks

I think there is a misunderstanding behind this question. When you train a model you adjust a set of parameters, sometimes millions of them, and the model learns to fit the training data.
The thing with neural networks is that they may forget. It sounds bad, but it is actually what makes them strong: they learn to forget what is useless.
That is, if you retrain you should probably:
- run just a few epochs, otherwise the model will overfit the new dataset and thus forget everything else;
- learn on a bigger dataset, i.e. past + new data, which ensures that nothing is forgotten (see the sketch below);
- maybe use a larger setup (in terms of hidden layer sizes, or number of layers), since you cannot indefinitely hope to learn more with the same capacity.
I'm no expert in online training, but it is not something you get without effort. It is in fact quite hard to do in practice, and far from being the default behavior when you "just" continue training.
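A minimal sketch of the past + new data option in TensorFlow/Keras (the toy model, data shapes, and learning rates are illustrative stand-ins, not NeuroNER itself):

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins for the original and the newly collected observations
# (in practice these would be your NER features and labels).
rng = np.random.default_rng(0)
old_x, old_y = rng.normal(size=(1000, 20)), rng.integers(0, 2, 1000)
new_x, new_y = rng.normal(size=(50, 20)), rng.integers(0, 2, 50)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy")
model.fit(old_x, old_y, epochs=10, verbose=0)  # initial training

# Retrain on past + new data, with a lower learning rate and only a
# few epochs, so the new observations are learned without wiping out
# what the model already knows.
x = np.concatenate([old_x, new_x])
y = np.concatenate([old_y, new_y])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy")
model.fit(x, y, epochs=3, verbose=0)
```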
Hope it helps.

Related

Model overfits when you don't have much varied data

I am trying to understand why a model overfits when you have little data to train with.
I get the typical intuitive idea behind it, whereby the model essentially "memorizes" whatever little data (or, to be specific, the few variations) you've given it.
But is there a more rigorous reason for this?
Couldn't you, for example, with a small dataset (or a large one with very little variation), just force it not to overfit by constraining the model or adding some form of regularization?
P.S. I have seen an explanation detailing how failing to introduce the variance that exists in the population can definitely lead the model to generalize less and less. But is this just a quick way to rationalize it, or is there, again as I mentioned above, a way to eliminate this lack of variance in the data?
Yes, you can add regularization, batch normalization, or even dropout to reduce overfitting. A model overfits when you have too little data compared to the number of parameters in the model, such as the weights in a neural network.
Also, you can update the model on batches rather than individual samples; that way your model is less likely to overfit the data.
You can also add noise to the data to reduce overfitting, as in the sketch below.
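A minimal Keras sketch combining these ideas (the architecture, noise level, and regularization strengths are illustrative assumptions, not tuned values):

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# Hypothetical small binary classifier showing the ideas above: input
# noise, L2 weight regularization, batch normalization, and dropout.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(30,)),
    layers.GaussianNoise(0.1),           # noise added during training only
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.BatchNormalization(),
    layers.Dropout(0.5),                 # randomly zeroes 50% of activations
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# Note: model.fit(x, y, batch_size=32) already updates the weights on
# mini-batches rather than on individual samples.
```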

Should I retrain the model with the whole dataset after using a train-test split to find the best hyperparameters?

I split my dataset into training and testing sets. After finding the best hyperparameters on the training set, should I fit the model again using all the data? The goal is to reach the highest possible score on new data.
Yes, that would help your model to generalize, as more data generally means better generalization.
I don't think so. If you do that, you will no longer have a valid test set. What happens when you come back to improve the model later? You would need a new test set for each model improvement, which means more labeling, and you won't be able to compare experiments across model versions because the test sets won't be identical.
If you consider this model finished forever, then it's OK.
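A sketch of the whole workflow in scikit-learn terms (the estimator and the parameter grid are illustrative; both answers above are reflected in the comments):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Tune hyperparameters on the training split only.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      {"max_depth": [3, 5, None]}, cv=5)
search.fit(X_train, y_train)
print("held-out score:", search.score(X_test, y_test))

# Optional final step: refit the best configuration on ALL the data.
# After this there is no untouched test set left (the caveat above),
# so only do it once the model is considered final.
final_model = RandomForestClassifier(random_state=0,
                                     **search.best_params_).fit(X, y)
```

Whether to run the final refit is exactly the trade-off described above: a slightly better final model versus losing the held-out estimate for future comparisons.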

Deep learning classification with no labels

I am participating in a research project on a deep learning application for classification. I have a huge dataset containing over 35000 features; these are all good values, taken in the laboratory.
The idea is that I should create a classifier that, given a new input, tells whether the data looks good or not. I must use deep learning with Keras and TensorFlow.
The problem is that the data is not labeled. I will add a new column with 1 for good and 0 for bad. But how can I find out whether an entry is bad, given that the whole training set is good?
I have thought about generating some garbage data, but I don't know if that is a good idea; I don't even know how to generate it. Do you have any tips?
I would start with anomaly detection. You can first reduce the features, e.g. with a (stacked) autoencoder, and then use Local Outlier Factor (LOF) from scikit-learn: https://scikit-learn.org/stable/modules/outlier_detection.html
The reason you need to reduce the features first is that LOF will be much more stable in the lower-dimensional space.
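A minimal sketch of that pipeline (toy sizes stand in for the real ~35000 features so the example runs quickly; the bottleneck width, epochs, and n_neighbors are illustrative assumptions):

```python
import numpy as np
import tensorflow as tf
from sklearn.neighbors import LocalOutlierFactor

# Toy stand-in for the all-"good" laboratory data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1000)).astype("float32")

# Plain (non-stacked) autoencoder; the bottleneck layer is the
# reduced feature space.
inp = tf.keras.Input(shape=(1000,))
code = tf.keras.layers.Dense(32, activation="relu")(inp)
out = tf.keras.layers.Dense(1000)(code)
autoencoder = tf.keras.Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)

# Fit LOF on the encoded "good" data; novelty=True lets it score
# unseen inputs: predict() returns +1 (looks good) or -1 (anomaly).
encoder = tf.keras.Model(inp, code)
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(
    encoder.predict(X, verbose=0))

new_sample = rng.normal(size=(1, 1000)).astype("float32")
print(lof.predict(encoder.predict(new_sample, verbose=0)))
```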

Is a decision tree with a perfect attribute considered overfit?

I have a 6-dimensional training dataset in which one numeric attribute perfectly separates the training examples: if TIME < 200 the example belongs to class1, and if TIME >= 200 it belongs to class2. J48 creates a tree with only one level and this attribute as its only node.
However, the test dataset does not follow this hypothesis and all the examples are misclassified. I'm having trouble figuring out whether this case counts as overfitting. I would say it does not, since the tree is so simple, but as far as I understand the definition of overfitting, it implies fitting the training data very closely, and that is what I have. Any help?
Usually a great training score and a bad testing score means overfitting. But this assumes the data are IID, and you are clearly violating that assumption: your training data is completely different from the testing data (there is a clear rule in the training data that has no meaning for the testing data). In other words, your train/test split is incorrect, or your whole problem does not follow the basic assumptions of statistical ML.
Of course, we often fit models without valid assumptions about the data. In your case, the most natural approach is to drop the feature that violates the assumption the most: the one used to construct the node. This kind of "expert decision" should be made before building any classifier. You have to think about what is different in the test scenario compared to the training one, and remove the things that show this difference; otherwise you have a heavy skew in your data collection, and statistical methods will fail.
Yes, it is overfitting. The first rule in creating a training set is to make it look as much like any other set as possible. Your training set is clearly different from any other: it has the answer embedded within it, while your test set doesn't. Any learning algorithm will likely find the correlation to the answer, use it, and, just like the J48 algorithm, regard the other variables as noise. The software equivalent of Clever Hans.
You can overcome this either by removing the variable or by training on a set drawn randomly from the entire available data. However, since you know there is a subset with a major embedded hint, you should remove the hint.
You're lucky. Sometimes these hints are quite subtle, and you won't discover them until you start applying the model to future data.
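A sketch of the "remove the hint" fix in scikit-learn terms (only the TIME attribute comes from the question; the toy data, column names, and classifier choice are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data mimicking the leak: TIME perfectly separates the classes
# in the training set, while the other attributes carry whatever
# weaker signal actually generalizes.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 5)),
                  columns=[f"attr{i}" for i in range(5)])
df["TIME"] = np.where(rng.random(100) < 0.5, 150.0, 250.0)
df["label"] = (df["TIME"] >= 200).astype(int)

# Remove the hint before building the classifier, forcing the tree
# to learn from the remaining attributes.
X = df.drop(columns=["TIME", "label"])
tree = DecisionTreeClassifier().fit(X, df["label"])
```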

Predicting Classifications with Naive Bayes and dealing with Features/Words not in the training set

Consider the text classification problem of spam or not spam with the Naive Bayes algorithm.
The question is the following:
How do you make predictions about a document W if, in that set of words, you see a new word wordX that was not seen at all by your model (so you do not even have a Laplace-smoothed probability estimate for it)?
Is the usual thing to do just to ignore wordX, even though it appears in the current text, because it has no probability associated with it? I.e., I know Laplace smoothing is sometimes used to try to solve this problem, but what if that word is genuinely new?
Some of the solutions that I've thought of:
1) Just ignore that word when estimating a classification (simplest, but sometimes wrong...?). However, if the training set is large enough, this is probably the best thing to do, as it seems reasonable to assume the features were selected well if you have, say, 1M or 20M documents.
2) Add that word to your model and change the model completely, since the vocabulary changed and the probabilities have to change everywhere. (This has a problem, though: it could mean you have to update the model frequently, especially if you analyze, say, 1M documents.)
I've done some research on this, read some of Dan Jurafsky's NLP and Naive Bayes slides, watched some videos on Coursera, and looked through some research papers, but I was not able to find anything useful. It feels to me this problem is not new at all and there should be something (a heuristic?) out there. If there isn't, it would be great to know that too!
I hope this is a useful post for the community. Thanks in advance.
P.S. To make the issue a little more explicit with one of the solutions I've seen: say we see an unknown new word wordX in a spam document; then for that word we can use 1 / (count(spam) + |Vocabulary| + 1). The issue I have with doing something like that is: does it mean we change the size of the vocabulary, so that every new document we classify introduces a new feature and vocabulary word? This video seems to attempt to solve that issue, but I'm not sure whether (1) that's a good thing to do, or (2) I have misunderstood it:
https://class.coursera.org/nlp/lecture/26
From a practical perspective (keeping in mind this is not all you're asking), I'd suggest the following framework:
- Train a model on an initial train set, and start using it for classification.
- Whenever a new word (with respect to your current model) appears, use some smoothing method to account for it; e.g. Laplace smoothing, as suggested in the question, might be a good start.
- Periodically retrain your model using new data (usually in addition to the original train set) to account for changes in the problem domain, e.g. new terms. This can be done at preset intervals (e.g. once a month), after some number of unknown words has been encountered, or in an online manner, i.e. after each input document.
This retraining step can be done manually (e.g. collect all documents containing unknown terms, label them, and retrain) or with semi-supervised learning methods, e.g. automatically adding the highest-scored spam/non-spam documents to the respective models.
This will ensure your model stays up to date and accounts for new terms, both by adding them to the model from time to time and by accounting for them even before that (simply ignoring them is usually not a good idea). A sketch of the smoothing step follows below.
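As a concrete sketch of the smoothing idea (a hand-rolled toy, not any particular library's API; the word counts and the single reserved unknown-word slot are illustrative assumptions):

```python
import math
from collections import Counter

def log_likelihood(doc_words, class_counts, vocab_size):
    """Laplace-smoothed log P(doc | class).

    An unseen word has count 0, so its smoothed probability is
    1 / (total + vocab_size) instead of zero: it is accounted
    for rather than silently dropped.
    """
    total = sum(class_counts.values())
    return sum(
        math.log((class_counts.get(w, 0) + 1) / (total + vocab_size))
        for w in doc_words
    )

# Tiny hand-rolled word counts per class (illustrative only).
spam = Counter({"win": 4, "prize": 3})
ham = Counter({"meeting": 5, "report": 2})
# One extra vocabulary slot reserved for unknown words, mirroring
# the |Vocabulary| + 1 idea from the question.
vocab_size = len(set(spam) | set(ham)) + 1

doc = ["win", "brandnewword"]
print("spam:", log_likelihood(doc, spam, vocab_size))
print("ham: ", log_likelihood(doc, ham, vocab_size))
```

Reserving one vocabulary slot for unknown words keeps the model size fixed between retrains, which sidesteps the question's concern about the vocabulary growing with every new document.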
