What should I do when the training set contains some erroneous data in supervised classification? - machine-learning

I am working on a project that performs automatic text classification. I have a large data set like the one below:
Text | CategoryName
xxxxx... | AA
yyyyy... | BB
zzzzz... | AA
I will use this data set to train a classifier so that, when new text arrives, the classifier can label it with the correct CategoryName.
(The text is natural language, with a size between 10 and 10000.)
Now, the problem is that the original data set contains some incorrect data (e.g. a text that should be labeled as Category AA is accidentally labeled as Category BB), because the data were classified manually. I don't know which labels are wrong or what percentage is wrong, because I can't review all the data manually.
So my question is, what should I do?
Can I find the wrong labels in some automatic way?
How can I increase precision and recall when new data arrives?
How can I evaluate the impact of the wrong data (since I don't know what percentage of the data is wrong)?
Any other suggestions?

Obviously, there is no easy way to solve your problem. After all, why build a classifier if you already have a system that can detect wrong classifications?
Do you know how much the erroneous classifications affect your learning? If there is only a small percentage of them, they should not hurt the performance much. (Edit: Ah, apparently you don't. Anyway, I suggest you try it out, at least if you can identify a false result when you see one.)
Of course, you could always first train your system and then have it suggest classifications for the training data. This might help you identify (and correct) your faulty training data. This obviously depends on how much training data you have, and if it is sufficiently broad to allow your system to learn correct classification despite the faulty data.
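A minimal sketch of that idea, assuming whitespace tokens and a tiny hand-rolled Naive Bayes model (both are stand-ins for whatever classifier you actually use; the function names are made up for illustration). Each record is predicted by a model that did not see it during training, and records whose given label disagrees with the prediction are flagged for manual review. A flag is only a hint, not proof that the label is wrong.

```python
import math
from collections import Counter, defaultdict

def train_nb(texts, labels):
    """Fit a tiny multinomial Naive Bayes model on whitespace tokens."""
    word_counts = defaultdict(Counter)   # label -> word -> count
    doc_counts = Counter(labels)         # label -> number of documents
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, doc_counts, vocab

def predict_nb(model, text):
    """Return the most likely label for text, using add-one smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for token in text.lower().split():
            if token in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][token] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def flag_suspect_labels(texts, labels, k=5):
    """Return indices whose given label disagrees with an out-of-fold prediction."""
    suspects = []
    for fold in range(k):
        train_idx = [i for i in range(len(texts)) if i % k != fold]
        test_idx = [i for i in range(len(texts)) if i % k == fold]
        model = train_nb([texts[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        for i in test_idx:
            if predict_nb(model, texts[i]) != labels[i]:
                suspects.append(i)
    return sorted(suspects)
```

The flagged indices can then be reviewed by hand, which is far cheaper than reviewing the whole data set.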

Can you review any of the data manually to find some mislabeled examples? If so, you might be able to train a second classifier to identify mislabeled data, assuming there is some kind of pattern to the mislabeling. It would be useful for you to know if mislabeling is a purely random process (it is just noise in the training data) or if mislabeling correlates with particular features of the data.
You can't evaluate the impact of mislabeled data on your specific data set if you have no estimate regarding what fraction of your training set is actually mislabeled. You mention in a comment that you have ~5M records. If you can correctly manually label a few hundred, you could train your classifier on that data set, then see how the classifier performs after introducing random mislabeling. You could do this multiple times with varying percentages of mislabeled data to see the impact on your classifier.
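The noise experiment described above could be sketched roughly like this. `flip_labels` and `noise_sweep` are illustrative names, and `train_and_score` is a hypothetical callback that you would write yourself: it should train your classifier on the noisy labels and return its score on a clean, hand-checked test set.

```python
import random

def flip_labels(labels, rate, classes, seed=0):
    """Return a copy of labels with a `rate` fraction replaced by a different class."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = round(rate * len(noisy))
    for i in rng.sample(range(len(noisy)), n_flip):
        # Pick any class other than the current one, uniformly at random.
        noisy[i] = rng.choice([c for c in classes if c != noisy[i]])
    return noisy

def noise_sweep(clean_labels, classes, train_and_score,
                rates=(0.0, 0.05, 0.1, 0.2, 0.4)):
    """Call train_and_score(noisy_labels) for each noise rate and collect the scores."""
    return {rate: train_and_score(flip_labels(clean_labels, rate, classes))
            for rate in rates}
```

Note that this only simulates purely random mislabeling; if the real errors correlate with features of the text, the measured impact may be too optimistic.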
Qualitatively, having a significant quantity of mislabeled samples will increase the impact of overfitting, so it is even more important that you do not overfit your classifier to the data set. If you have a test data set (assuming it also suffers from mislabeling), then you might consider training your classifier to less-than-maximal classification accuracy on the test data set.

People usually deal with the problem you are describing by having multiple annotators and computing their agreement (e.g. Fleiss' kappa). This is often seen as the upper bound on the performance of any classifier. If three people give you three different answers, you know the task is quite hard and your classifier stands no chance.
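For reference, Fleiss' kappa is short enough to compute by hand. This is a sketch of the standard formula; the input layout (one row per item, one column per category, each cell counting the annotators who chose that category) is an assumption about how you store the annotations.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for ratings[i][j] = number of annotators who put item i in category j.

    Every item must have been rated by the same number of annotators (at least 2).
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Proportion of all assignments that went to each category.
    p_cat = [sum(row[j] for row in ratings) / (n_items * n_raters)
             for j in range(n_cats)]
    # Per-item agreement: fraction of annotator pairs that agree on the item.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in ratings]
    p_bar = sum(p_item) / n_items        # observed agreement
    p_exp = sum(p * p for p in p_cat)    # agreement expected by chance
    return (p_bar - p_exp) / (1 - p_exp)
```

A value near 1 means the annotators agree almost perfectly; a value near 0 means they agree about as often as chance would predict. The function is undefined when every annotator always picks the same single category, because chance agreement is then 1.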
As a side note:
If you do not know how many of your records have been labelled incorrectly, you do not understand one of the key properties of the problem. Select 1000 records at random and spend the day reviewing their labels to get an idea. It really is time well spent. For example, I found I can easily review 500 labelled tweets per hour. Health warning: it is very tedious, but a morning spent reviewing gives me a good idea of how distracted my annotators were. If 5% of the records are incorrect, it is not such a problem. If 50% are incorrect, you should go back to your boss and tell them it can't be done.
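Once you have reviewed a random sample, you can put error bars on the estimated error rate. The sketch below uses the Wilson score interval, which is one common choice and not something prescribed above; `wilson_interval` is just an illustrative name.

```python
import math

def wilson_interval(n_wrong, n_reviewed, z=1.96):
    """Approximate 95% confidence interval (for z=1.96) for the true mislabeling rate."""
    p = n_wrong / n_reviewed
    denom = 1 + z * z / n_reviewed
    center = (p + z * z / (2 * n_reviewed)) / denom
    half = z * math.sqrt(p * (1 - p) / n_reviewed
                         + z * z / (4 * n_reviewed ** 2)) / denom
    return center - half, center + half

# Example: 50 wrong labels found among 1000 randomly selected records.
low, high = wilson_interval(50, 1000)
print(f"estimated error rate between {low:.1%} and {high:.1%}")
```

The interval is only meaningful if the reviewed records really were selected at random from the full data set.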
As another side note:
Someone mentioned active learning. I think it is worth looking into options from the literature, keeping in mind that labels might have to change. You said that is hard.

Related

Deep learning classification with no labels

I have to take part in a research project on a deep learning application for classification. I have a huge dataset with over 35000 features; these are good values, taken from a laboratory.
The idea is that I should create a classifier that tells, given a new input, whether the data seems to be good or not. I must use deep learning with Keras and TensorFlow.
The problem is that the data is not labeled. I will add a new column with 1 for good and 0 for bad. But how can I find out whether an entry is bad, given that the whole training set is good?
I have thought about generating some garbage data, but I don't know if this is a good idea, and I don't even know how to generate it. Do you have any tips?
I would start with anomaly detection. You can first reduce the features with, e.g., a (stacked) autoencoder and then use the local outlier factor (LOF) from sklearn: https://scikit-learn.org/stable/modules/outlier_detection.html
The reason you need to reduce the features first is that your LOF will be much more stable.

Is a decision tree with a perfect attribute considered overfit?

I have a 6-dimensional training dataset where there is a perfect numeric attribute which separates all the training examples this way: if TIME<200 then the example belongs to class1, if TIME>=200 then example belongs to class2. J48 creates a tree with only 1 level and this attribute as the only node.
However, the test dataset does not follow this hypothesis and all the examples are misclassified. I'm having trouble figuring out whether this case is considered overfitting or not. I would say it is not, as the dataset is that simple, but as far as I understood the definition of overfitting, it implies a high degree of fitting to the training data, and this is what I have. Any help?
Usually a great training score and a bad testing score mean overfitting. But this assumes that the data are IID, and you are clearly violating this assumption: your training data is completely different from the testing data (there is a clear rule for the training data which has no meaning for the testing data). In other words, your train/test split is incorrect, or your whole problem does not follow the basic assumptions under which statistical ML applies. Of course, we often fit models without valid assumptions about the data; in your case, the most natural approach is to drop the feature which violates the assumption the most, i.e. the one used to construct the node. This kind of "expert decision" should be made prior to building any classifier: you have to think about what is different in the test scenario compared to the training one and remove the things that show this difference. Otherwise you have a heavy skew in your data collection, and statistical methods will fail.
Yes, it is an overfit. The first rule in creating a training set is to make it look as much like any other set as possible. Your training set is clearly different than any other. It has the answer embedded within it while your test set doesn't. Any learning algorithm will likely find the correlation to the answer and use it and, just like the J48 algorithm, will regard the other variables as noise. The software equivalent of Clever Hans.
You can overcome this by either removing the variable or by training on a set drawn randomly from the entire available set. However, since you know that there is a subset with an embedded major hint, you should remove the hint.
You're lucky. At times these hints can be quite subtle, and you won't discover them until you start applying the model to future data.
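Both remedies mentioned in this answer are easy to sketch. The code below assumes each example is a dict of feature name to value (an assumption about your data layout, and the helper names are made up): one helper drops the leaking attribute, the other pools all available rows and splits them again at random.

```python
import random

def drop_feature(rows, name):
    """Return copies of the row dicts without the given feature (e.g. "TIME")."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]

def pooled_split(train_rows, test_rows, test_fraction=0.3, seed=0):
    """Shuffle all available rows together and split them again at random."""
    pooled = list(train_rows) + list(test_rows)
    random.Random(seed).shuffle(pooled)
    n_test = round(test_fraction * len(pooled))
    return pooled[n_test:], pooled[:n_test]   # (new_train, new_test)
```

Re-splitting only helps if the pooled data is representative of what the model will see later; if the hint exists only in historical data, dropping the feature is the safer choice.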

Clustering or other mechanisms for implementing generic spam detection

Earlier, I tried naive Bayes and a linear SVM to classify comments on a particular page, where I had access to training data that was manually labelled as spam or ham.
Now I have been asked to check whether there is any way to classify comments as spam when we don't have training data, something like getting two clusters, marked as spam and ham, for any given data.
I would like to know some ways to approach this problem and what a good way to implement it would be.
I am still learning and experimenting. Any help will be appreciated.
Are the new comments very different from the old comments in terms of vocabulary? Words are almost everything the classifiers for this task look at.
You can always try using your old training data and applying the classifier to the new domain. You would have to label a few examples from your new domain in order to measure performance (or better, let others do the labeling in order to get more reliable results).
If this doesn't work well, you could try domain adaptation or look for datasets more similar to your new domain, using Google or looking at existing spam/ham corpora.
Finally, there may be some regularity or pattern in your new setting, e.g. downvotes for a comment, which may indicate spam/ham. In such cases, you could compile training data yourself. This would then be called distant supervision (you can search for papers using this keyword).
The best I could find was this research work, which mentions active learning. What I came up with is this: I first performed k-means clustering and took the largest clusters (assuming 5 clusters, I took the 3 largest, ordered by size) and took 1000 messages from each. I would then give these to the user to label. The next step would be to train a logistic regression on the labelled data and get the predicted probabilities for the unlabelled data. If a probability is close to 0.5, i.e. in the range 0.4 to 0.6, the prediction is uncertain, so I would send that message to be labelled, and then the process would continue.
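The selection step of that loop (picking the messages the model is least sure about) might look like the sketch below. It assumes you already have a list of predicted spam probabilities for the unlabelled messages; `select_uncertain` and its parameters are illustrative names, with the 0.4 to 0.6 band taken from the description above.

```python
def select_uncertain(probabilities, low=0.4, high=0.6, limit=None):
    """Return indices of unlabelled items whose predicted spam probability
    lies in [low, high], most uncertain (closest to 0.5) first."""
    picked = [i for i, p in enumerate(probabilities) if low <= p <= high]
    picked.sort(key=lambda i: abs(probabilities[i] - 0.5))
    return picked if limit is None else picked[:limit]
```

The returned indices are the messages to send to the user for labelling; after they are labelled, retrain and repeat until few messages fall into the uncertain band or the labelling budget runs out.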

Input selection for neural networks

I am going to use an ANN for my work, in which I have a large dataset, say input [600x40] and output [600x6]. As one can see, the number of inputs (40) is too high for an ANN, and it may get trapped in a local minimum and/or increase the CPU time dramatically. Is there any way to select the most informative inputs?
As a first try, I used the following code in Matlab to find the cross-correlation between each pair of inputs:
[rho, ~] = corr(inputs, 'rows','pairwise')
However, I think this simple correlation cannot identify some hidden complex relation between the inputs.
Any ideas?
First of all, 40 inputs is a very small space and it should not be reduced. A large number of inputs is 100,000, not 40. Also, 600x40 is not a big dataset, nor one that "increases the CPU time dramatically". If it learns slowly, then check your code, because that appears to be the problem, not your data.
Furthermore, feature selection is not a good way to go; you should use it only when gathering features is actually expensive. In any other scenario you are looking for dimensionality reduction, such as PCA, LDA, etc. However, as said before, your data should not be reduced; rather, you should consider getting more of it (new samples/new features).
Disclaimer: I'm with lejlot on this - you should get more data and more features instead of trying to remove features. Still, that doesn't answer your question, so here we go.
Try the most basic greedy approach: try removing each feature in turn, retrain your ANN (several times, of course) and see if your results got better or worse. Choose the removal that gave the biggest improvement. Repeat until removing features no longer gives any improvement. This will take a lot of time, so you may want to try doing it on some subset of your data (for example, on 3 folds of a dataset split into 10 folds).
It's ugly, but sometimes it works.
I repeat what I've said in the disclaimer - this is not the way to go.
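For completeness, the greedy procedure can be sketched as follows. The `score` argument is a hypothetical callback you would supply yourself: given a list of feature names, it should retrain the ANN (ideally several times, on your chosen folds) and return an averaged validation score where higher is better.

```python
def backward_elimination(features, score):
    """Greedily drop one feature at a time while the score keeps improving.

    Returns the surviving feature list and its score.
    """
    current = list(features)
    best = score(current)
    while len(current) > 1:
        # Try removing each remaining feature and keep the best candidate.
        trials = [(score([f for f in current if f != drop]), drop)
                  for drop in current]
        trial_score, drop = max(trials, key=lambda t: t[0])
        if trial_score <= best:
            break  # no single removal improves the score any further
        current.remove(drop)
        best = trial_score
    return current, best
```

With 40 inputs this can call `score` several hundred times, which is why running it on a subset of the folds, as suggested above, is worth considering.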

Predicting Classifications with Naive Bayes and dealing with Features/Words not in the training set

Consider the text classification problem of spam or not spam with the Naive Bayes algorithm.
The question is the following:
How do you make predictions about a document W (a set of words) if, in that set of words, you see a new word wordX that was not seen at all by your model (so you do not even have a Laplace-smoothed probability estimated for it)?
Is the usual thing to do just to ignore wordX, even though it was seen in the current text, because it has no probability associated with it? I know that Laplace smoothing is sometimes used to try to solve this problem, but what if the word is definitely new?
Some of the solutions that I've thought of:
1) Just ignore that word when estimating a classification (the simplest option, but sometimes wrong...? However, if the training set is large enough, this is probably the best thing to do, as I think it's reasonable to assume your features were selected well enough if you have, say, 1M or 20M data points).
2) Add that word to your model and change your model completely, because the vocabulary changed, so probabilities have to change everywhere (this does have a problem, though, since it could mean that you have to update the model frequently, especially if you analyse 1M documents, say).
I've done some research on this, read some of the Dan Jurafsky NLP and NB slides, watched some videos on Coursera and looked through some research papers, but I was not able to find anything useful. It feels to me that this problem is not new at all and there should be something (a heuristic?) out there. If there isn't, it would be awesome to know that too!
I hope this is a useful post for the community. Thanks in advance.
PS: To make the issue a little more explicit, here is one of the solutions I've seen. Say we see an unknown new word wordX in a spam; then for that word we can use 1 / (count(spams) + |Vocabulary| + 1). The issue I have with doing something like that is: does that mean we change the size of the vocabulary, so that every new document we classify has a new feature and vocabulary word? This video seems to attempt to solve that issue, but I'm not sure whether (1) that's a good thing to do, or (2) I have misunderstood it:
https://class.coursera.org/nlp/lecture/26
From a practical perspective (keeping in mind this is not all you're asking), I'd suggest the following framework:
1) Train a model using an initial training set, and start using it for classification.
2) Whenever a new word (with respect to your current model) appears, use some smoothing method to account for it. Laplace smoothing, as suggested in the question, might be a good start.
3) Periodically retrain your model using new data (usually in addition to the original training set), to account for changes in the problem domain, e.g. new terms. This can be done at preset intervals, e.g. once a month; after some number of unknown words has been encountered; or in an online manner, i.e. after each input document.
This retraining step can be done manually, e.g. collect all documents containing unknown terms, manually label them, and retrain; or using semi-supervised learning methods, e.g. automatically add the highest-scored spam/non-spam documents to the respective models.
This will ensure your model stays updated and accounts for new terms - by adding them to the model from time to time, and by accounting for them even before that (simply ignoring them is usually not a good idea).
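A minimal sketch of such a framework, assuming whitespace tokens and a hand-rolled model (the class and method names are made up for illustration). Word probabilities use add-one smoothing with one extra vocabulary slot reserved for "any unknown word", so a new word gets 1 / (count(c) + |V| + 1) without the vocabulary itself changing; `update` can be called later to fold newly labelled documents into the model.

```python
import math
from collections import Counter

class SmoothedNB:
    """Naive Bayes with Laplace smoothing and a reserved slot for unknown words."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word occurrences
        self.doc_counts = Counter()  # label -> number of documents
        self.vocab = set()

    def update(self, text, label):
        """Add one labelled document; call again later to retrain incrementally."""
        tokens = text.lower().split()
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def word_prob(self, word, label):
        """P(word | label); an unseen word gets 1 / (count(label) + |V| + 1)."""
        counts = self.word_counts[label]
        total = sum(counts.values())
        return (counts[word] + 1) / (total + len(self.vocab) + 1)

    def predict(self, text):
        """Return the label with the highest log posterior for the text."""
        total_docs = sum(self.doc_counts.values())
        def log_posterior(label):
            score = math.log(self.doc_counts[label] / total_docs)
            for token in text.lower().split():
                score += math.log(self.word_prob(token, label))
            return score
        return max(self.doc_counts, key=log_posterior)
```

Note that an unknown word is not completely neutral here: its probability is slightly higher for the class with fewer total word occurrences. With classes of similar size the effect is small, and it disappears for that word once it has been added through `update`.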

Resources