In a post "The "Cross-Validation - Train/Predict" misunderstanding" by Patrick Schratz
https://mlr-org.com/docs/cv-vs-predict/
mentioned that:
(a) CV is done to get an estimate of a model’s performance.
(b) Train/predict is done to create the final predictions (which your boss might use to make some decisions on).
Does this mean that in mlr3, if we are in academia and need to publish papers, we should use CV because we intend to compare the performance of different algorithms? And if, in industry, our plan is to train a model and then use it again and again on industry data to make predictions, we should use the train/predict methods provided by mlr3?
Or have I understood this completely wrong?
Thank you
You always need a CV if you want to make a statement about a model's performance.
If you want to use the model to make predictions on unknown data, do a single fit and then predict.
So in practice, you need both: CV + "train+predict".
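A minimal sketch of this two-step workflow, shown here with scikit-learn rather than mlr3 (same idea, different library; the dataset and learner are placeholders):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)   # placeholder dataset
    model = RandomForestClassifier(random_state=0)

    # Step 1: cross-validation gives the performance estimate you would report
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print("Estimated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

    # Step 2: a single fit on all available data gives the final model
    model.fit(X, y)
    # final_predictions = model.predict(X_new)   # X_new = future, unlabeled data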
PS: Your post does not really fit Stack Overflow since it is not related to a coding problem. For statistical questions, please see https://stats.stackexchange.com/.
PS2: If you talk about a post, please include the link. I am the author of the post in this case but most other people might not know what you are talking about ;)
I am training a customized Named Entity Recognition (NER) model using NeuroNER, which is written in TensorFlow. I am able to train a model and it performs well, but when I retrain it on new observations for which it gave incorrect results, it corrects those while forgetting some previous observations that it used to get right.
I want online retraining. I have tried Stanford NLP, spaCy, and now TensorFlow. Please suggest a better way to achieve this.
Thanks
I think there is a misunderstanding behind this question. When you train a model you adjust a set of parameters, sometimes millions of them. Your model will then learn to fit this data.
The thing with neural networks is that they may forget. That sounds bad, but it is actually what makes them really strong: they learn to forget what is useless.
That is, if you retrain you should probably:
- run just a few epochs, otherwise the model will overfit the new dataset and thus forget everything else
- learn on a bigger dataset, i.e. past + new data, which would ensure that nothing is forgotten
- maybe use a larger setup (in terms of hidden layer size or number of layers), since you cannot indefinitely hope to learn more with the same capacity.
I'm no expert in online training, but it is not something you get without effort. It is in fact quite hard to do in practice, and far from being the default behavior when you "just" continue training.
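As a rough, hedged illustration of the "retrain on past + new data for only a few epochs" idea, here is a minimal Keras sketch (the model and arrays are placeholders, not NeuroNER specifics):

    import numpy as np
    import tensorflow as tf

    # Placeholder data: X_old/y_old is the original training set, X_new/y_new the new observations
    X_old, y_old = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)
    X_new, y_new = np.random.rand(50, 20), np.random.randint(0, 2, 50)

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X_old, y_old, epochs=10, verbose=0)      # initial training

    # Retrain on past + new data, and only for a few epochs, to limit forgetting
    X_all, y_all = np.concatenate([X_old, X_new]), np.concatenate([y_old, y_new])
    model.fit(X_all, y_all, epochs=2, verbose=0)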
Hope it helps.
I have used an ML approach in my research using Python scikit-learn. I found that SVM and logistic regression classifiers work best (e.g. 85% accuracy), decision trees work markedly worse (65%), and Naive Bayes works worse still (40%).
I will write up the conclusion stating the obvious, that some ML classifiers worked better than others by a large margin, but what else can I say about my learning task or data structure based on these observations?
Edit:
The data set has 500,000 rows and 15 features, but some of the features are various combinations of substrings of certain text, so it naturally expands to tens of thousands of columns as a sparse matrix. I am using people's names to predict a binary class (e.g. gender), though I feature-engineer a lot from the name, such as its length, its substrings, etc.
I recommend visiting this awesome map on choosing the right estimator by the scikit-learn team: http://scikit-learn.org/stable/tutorial/machine_learning_map
Since describing the specifics of your own case would be an enormous task (I totally understand you didn't!), I encourage you to ask yourself several questions, and I think the map on 'choosing the right estimator' is a good start.
Literally, go to the 'start' node in the map and follow the path:
is my number of samples > 50?
And so on. Eventually you will end up at some node; check whether your results match the map's recommendations (i.e. did I end up at an SVM, which indeed gives me better results?). If so, go deeper into the documentation and ask yourself why that classifier performs better on text data, or whatever insight you get.
As I said, we don't know the specifics of your data, but you should be able to ask such questions: what type of data do I have (text, binary, ...), how many samples, how many classes to predict, ... Ideally your data will give you some hints about the context of your problem, and therefore about why some estimators perform better than others.
But yes, your question is too broad to grasp in a single answer (especially without knowing the type of problem you are dealing with). You could also check whether any of those approaches is more inclined to overfit, for example.
The list of recommendations could be endless, which is why I encourage you to start by defining the type of problem and the data you are dealing with (beyond the number of samples: is it normalized? Is it sparse? Are you representing text as a sparse matrix? Are your inputs floats from 0.11 to 0.99?).
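For example, here is a hedged sketch of such a comparison in scikit-learn, using character n-grams of names as sparse features (the names and labels below are invented purely for illustration):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Invented toy data: name -> binary label (e.g. gender), for illustration only
    names = ["maria", "john", "anna", "peter", "sofia", "james", "laura", "david"]
    labels = [1, 0, 1, 0, 1, 0, 1, 0]

    for clf in [LogisticRegression(max_iter=1000), LinearSVC(), MultinomialNB()]:
        # Character 2-4-grams produce a sparse matrix much like substring features
        pipe = make_pipeline(CountVectorizer(analyzer="char", ngram_range=(2, 4)), clf)
        scores = cross_val_score(pipe, names, labels, cv=2)
        print(type(clf).__name__, scores.mean())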
Anyway, if you want to share some specifics on your data we might be able to answer more precisely. Hope this helped a little bit, though ;)
In order to compare the results of my research in labeled text classification, I need a baseline to compare against. One of my colleagues told me one solution would be to build the simplest, dumbest classifier possible, one that makes its decision based on the frequency of a particular label.
This means that when my dataset has a total of 100 samples and the classifier knows 80% of these samples have label A, it will classify a sample as 'A' 80% of the time. Since my entire research uses the Weka API, I have looked into the documentation but unfortunately haven't found anything about this.
So my question is: is it possible to implement such a classifier in Weka, and if yes, could someone point out how? This question is purely informative, since I have already looked into this but did not find anything; I hope to find an answer here.
That classifier is already implemented in Weka, it is called ZeroR and simply predicts the most frequent class (in the case of nominal class attributes) or the mean (in the case of numeric class attributes). If you want to know how to implement such a classifier yourself, look at the ZeroR source code.
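If it helps for intuition outside Weka, scikit-learn's DummyClassifier offers the same kind of baseline: strategy='most_frequent' behaves like ZeroR, while strategy='stratified' matches the "predict A about 80% of the time" behavior described in the question (the data below is made up):

    import numpy as np
    from sklearn.dummy import DummyClassifier

    # Made-up data: 80 samples of class 'A', 20 of class 'B'; features are ignored
    X = np.zeros((100, 1))
    y = np.array(["A"] * 80 + ["B"] * 20)

    zeror_like = DummyClassifier(strategy="most_frequent").fit(X, y)    # always predicts 'A'
    frequency_based = DummyClassifier(strategy="stratified").fit(X, y)  # 'A' roughly 80% of the time

    print(zeror_like.predict(X[:5]))
    print(frequency_based.predict(X[:5]))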
I'm new to topic models, classification, etc. I have been working on a project for a while now and have read a lot of research papers. My dataset consists of short, human-labeled messages. This is what I have come up with so far:
Since my documents are short, I read about Latent Dirichlet Allocation (and all its variants), which is useful for detecting latent topics in a document.
Based on this I found a Java implementation, JGibbLDA (http://jgibblda.sourceforge.net), and since my data is labeled, there is an improved version of it called JGibbLabeledLDA (https://github.com/myleott/JGibbLabeledLDA).
In most of the research papers I read good reviews of Weka, so I experimented with it on my dataset.
However, again, my dataset is labeled, and therefore I found an extension of Weka called Meka (http://sourceforge.net/projects/meka/) that has implementations for multi-labeled data.
Reading about multi-labeled data, I know the most-used approaches, such as one-vs-all and classifier chains...
Now, the reason I am here is that I hope to get answers to the following questions:
Is LDA a good approach for my problem?
Should LDA be used together with a classifier (NB, SVM, Binary Relevance, Logistic Regression, …) or is LDA 'enough' to function as a classifier/estimator for new, unseen data?
How do I interpret the output from JGibbLDA / JGibbLabeledLDA? How do I get from these files to something that tells me which words/labels are assigned to the WHOLE message (not just to each word)?
How can I use Weka/Meka to get what I want in the previous question (in case LDA is not what I'm looking for)?
I hope someone, or more than one person, can help me figure out how to do this. The general idea is not the issue here; I just don't know how to go from literature to practice. Most of the papers either don't describe how they perform their experiments in enough detail OR are too technical for my background on these topics.
Thanks!
Consider the text classification problem of spam or not spam with the Naive Bayes algorithm.
The question is the following:
How do you make predictions about a document W if, in its set of words, you see a new word wordX that was never seen by your model (so you do not even have a Laplace-smoothed probability estimated for it)?
Is the usual thing to do to just ignore wordX, even though it appears in the current text, because it has no probability associated with it? I know Laplace smoothing is sometimes used to try to solve this problem, but what if that word is genuinely new?
Some of the solutions that I've thought of:
1) Just ignore that word when estimating a classification (the simplest option, but sometimes wrong...? However, if the training set is large enough, this is probably the best thing to do, as it seems reasonable to assume your features were selected well enough if you have, say, 1M or 20M documents).
2) Add that word to your model and change the model completely, because the vocabulary changed, so probabilities have to change everywhere (this has a problem, though, since it could mean you have to update the model frequently, especially if you analyze, say, 1M documents).
I've done some research on this: I read some of Dan Jurafsky's NLP and Naive Bayes slides, watched some videos on Coursera, and looked through some research papers, but I was not able to find anything useful. It feels to me that this problem is not new at all and there should be something (a heuristic?) out there. If there isn't, it would be awesome to know that too!
I hope this is a useful post for the community. Thanks in advance.
PS: to make the issue a little more explicit, one of the solutions I've seen is: say we see an unknown new word wordX in a spam document; then for that word we can use 1 / (count(spams) + |Vocabulary| + 1). The issue I have with doing something like that is: does it mean we change the size of the vocabulary, so that every new document we classify has a new feature and vocabulary word? This video seems to attempt to solve that issue, but I'm not sure whether (1) that's a good thing to do, or (2) maybe I have misunderstood it:
https://class.coursera.org/nlp/lecture/26
From a practical perspective (keeping in mind this is not all you're asking), I'd suggest the following framework:
Train a model using an initial train set, and start using it for classification.
Whenever a new word (with respect to your current model) appears, use some smoothing method to account for it; e.g. Laplace smoothing, as suggested in the question, might be a good start.
Periodically retrain your model using new data (usually in addition to the original train set) to account for changes in the problem domain, e.g. new terms. This can be done at preset intervals, e.g. once a month; after a certain number of unknown words has been encountered; or in an online manner, i.e. after each input document.
This retraining step can be done manually (e.g. collect all documents containing unknown terms, manually label them, and retrain) or using semi-supervised learning methods (e.g. automatically add the highest-scoring spam/non-spam documents to the respective classes).
This will ensure your model stays updated and accounts for new terms - by adding them to the model from time to time, and by accounting for them even before that (simply ignoring them is usually not a good idea).
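A minimal sketch of this framework, assuming a scikit-learn multinomial Naive Bayes text pipeline (all documents and labels below are placeholders):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Placeholder initial training set
    train_docs = ["win money now", "cheap pills offer", "meeting at noon", "see you tomorrow"]
    train_labels = ["spam", "spam", "ham", "ham"]

    # alpha=1.0 is Laplace smoothing; at prediction time CountVectorizer simply drops
    # tokens that were never seen in training, so a brand-new wordX does not break anything.
    model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
    model.fit(train_docs, train_labels)

    print(model.predict(["win free crypto now"]))   # 'crypto' is unknown to the model

    # Periodic retraining: fold newly labeled documents into the training set so that
    # previously unknown terms enter the vocabulary.
    train_docs.append("win free crypto now")
    train_labels.append("spam")
    model.fit(train_docs, train_labels)

The exact retraining trigger (monthly, after N unknown words, or per document) is a design choice, as described above.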