Multiclass classification problem with loose (not quite accurate) labeling. How to approach? - machine-learning

I'm looking for alt. ways to to build multiclass classification problem described below. The problem with using regular classification algorithms like Decision Tree or Logistic Regression is that they consider one value only to be True Positive, while in my data some classification are loosely done, and by that I mean instead of one single right label, 2 or more could be kindof "right", some of them just a bit more accurate than the others, but only one is always chosen. Here is example to demonstrate what I meant:
Feature
Label
1. "My credentials do not work"
"Can't login"
2. "I do not get email with confirmation code"
"Two Factor Verification Issue"
3. "Do not get a receipt"
"Billing issue"
Feature #2 is classified as "Two Factor Verification Issue" in the training set, but could be classified as "Can't login", which is Ok too. The classification was done by human in the past, and it was up to human how to label the feature, and someone did really classified "2-factor verification" issues as just "can't login". The dataset is huge and re-labeling is not possible. Due to such labeling ambiguity training score comes out pretty lame.
I think about ways to better measure quality of a training: What if algorithm offers 2-3 options of classification outcome, and if at least one of them matches defined label, we count it as a successful guess. However a gap in my knowledge of machine learning algorithms do not let me figure out how to do that. Any suggestions? Thanks

Firstly, it seems that the classes of your labels are not mutually exclusive. There is a overlap among classes and there are genuine cases where more then one label are right. This is usually not a good idea and can happen when you are looking at classes at too high level.
The right way would be to redefine the classes to maybe something more clear and low level such as: "email loss", "credential mismatch" etc. Higher level classes, such as "login issue", can be a combination of such low level classes in application but model should only look at low level classes.
However since you already have the data and can't re-label as you mentioned I would recommend doing 2 things.
Define the problem as a multi-label problem instead of a multi-classification problem. In multi label you are trying to give a separate score to each class of label and multiple classes can get high score as it's not a one-vs-all.
Your dataset only has one label for each datapoint when in reality multiple are possible. So you should augment your dataset. Below is one method I can think of
First create a similarity measure for the input items. Idea is you should be able to score 2 items based on how similar they are. It could be based on jaccard similarity of words, doc2vec embedding or some other similarity measure that works for the data.
Then for each input item, you should get a list of new additional labels based on labels of top-n most similar item to itself.
This way you will get much better list of labels and multi label problem should work well

Related

Computing a similarity score for a set of sentences

My team does a lot of chatbot training, and I'm trying to come up with some tools to improve the quality of our work. In chatbot training, it is really important to train intents with diverse utterances that phrase the same intent in very different ways. Ideally, there would be very little similarity in the syntax of the utterances in the set.
Here's an example for an intent inquiring about medical insurance coverage
Bad set of utterances
Is my daughter covered by insurance?
Is my son covered by medical insurance?
Will my son be covered by insurance?
Decent set of utterances
How can I look up whether we have insurance coverage for the whole family?
Seeking details on eligibility for medical coverage
Is there a document that details who is protected under our medical insurance policy?
I want to be able to take all of the utterances associated to an intent and analyze them for similarity. I would expect my set of bad utterances to have a high similarity score and my set of decent utterances to have a low similarity score.
I've tried playing around with a few doc2vec tutorials, but I feel like I'm missing something. I keep seeing stuff like this:
Train a set of data and then measure the similarity of a new sentence to your set of data
Measure the similarity between two sentences
I need to have an array of sentences and understand how similar they are to each other.
Any advice on achieving this?
Answering some questions:
What makes the bad utterances bad?The utterances themselves are not bad, it is the lack of variety between them. If most of the training had been like the “bad” set, then real user utterances of greater variety will not be recognized correctly.
Are you trying to discover new intents? No, this is for prerelease training, trying to improve the effectiveness of it.
Why do bad utterances have high similarity scores and decent utterances have low similarity scores? This is a hypothesis. I know how varied real user utterances are, and I have found my trainers fall into ruts when training, asking things the same way, and not seeing good accuracy results. Improving the variety in the utterances tends to result in better accuracy.
What will I do with this info? I’ll use it to assess the training quality of an intent, to determine if more training is likely necessary. In the future we might build real time tools as utterances are being added to let trainers know if they’re being too repetitive.
Most applications of text vectors benefit from the vectors capturing the "essential meaning" of a text, **without* regard to variances in word choice.
That is, it's considered a feature, not a flaw, if two completely different wordings with similar meaning have nearly the same vector. (Or, if some similarity-measure indicates they are totally similar.)
For example, to contrive an example similar to yours, consider the two phrasings:
"health coverage for brother"
"male sibling medical insurance"
There's no reuse of words, but the likely intended meaning is the same – so a good text-vectorization for typical purposes would create very similar vectors. And a similarity-measure using those vectors, or otherwise using the words/word-vectors as input, would indicate very high similarity.
But from your clarifying answers, it seems you actually want a more superficial "similarity" measure. You'd like a measure that reveals when certain phrasings show variety/contrast in their wording. (And specifically, you already know form other factors, like how they were hand-crafted, that groups of these phrasings are semantically related.)
What you want this similarity measure to show is actually a behavior that many projects using text-vectors would consider a failure of the vectors. So semantic methods like those in Word2Vec, Paragraph Vectors (aka "Doc2Vec"), etc are likely the wrong tool for your goal.
You could probably do well with a simpler measure based just on the words, or perhaps character-n-grams, of the texts.
For example, for two texts A and B, you could just tally the number of shared words (that appear in both A and B), and divide by the total number of unique words in both A and B, to get a 0.0 to 1.0 "word choice similarity" number.
And, when considering a new text against a set of prior texts, if its average similarity to the prior texts is low, it'd be "good" for your purposes.
Rather than just words, you could also use all n-character substrings ("n-grams") of your texts – which might help better highlight differences in word-forms, or common typos, which may also be useful variances for your purposes.
In general, I'd look at the scikit-learn text-vectorization functionality for ideas:
https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

multi-label text classification with zero or more labels

I need to classify website text with zero or more categories/labels (5 labels such as finance, tech, etc). My problem is handling text that isn't one of these labels.
I tried ML libraries (maxent, naive bayes), but they match "other" text incorrectly with one of the labels. How do I train a model to handle the "other" text? The "other" label is so broad and it's not possible to pick a representative sample.
Since I have no ML background and don't have much time to build a good training set, I'd prefer a simpler approach like a term frequency count, using a predefined list of terms to match for each label. But with the counts, how do I determine a relevancy score, i.e. if the text is actually that label? I don't have a corpus and can't use tf-idf, etc.
Another idea , is to user neural networks with softmax output function, softmax will give you a probability for every class, when the network is very confident about a class, will give it a high probability, and lower probabilities to the other classes, but if its insecure, the differences between probabilities will be low and none of them will be very high, what if you define a treshold like : if the probability for every class is less than 70% , predict "other"
Whew! Classic ML algorithms don't combine both multi-classification and "in/out" at the same time. Perhaps what you could do would be to train five models, one for each class, with a one-against-the-world training. Then use an uber-model to look for any of those five claiming the input; if none claim it, it's "other".
Another possibility is to reverse the order of evaluation: train one model as a binary classifier on your entire data set. Train a second one as a 5-class SVM (for instance) within those five. The first model finds "other"; everything else gets passed to the second.
What about creating histograms? You could use a bag of words approach using significant indicators of for e.g. Tech and Finance. So, you could try to identify such indicators by analyzing the certain website's tags and articles or just browse the web for such inidicators:
http://finance.yahoo.com/news/most-common-words-tech-finance-205911943.html
Let's say your input vactor X has n dimensions where n represents the number of indicators. For example Xi then holds the count for the occurence of the word "asset" and Xi+k the count of the word "big data" in the current article.
Instead of defining 5 labels, define 6. Your last category would be something like a "catch-all" category. That's actually your zero-match category.
If you must match the zero or more category, train a model which returns probability scores (such as a neural net as Luis Leal suggested) per label/class. You could than rate your output by that score and say that every class with a score higher than some threshold t is a matching category.
Try this NBayes implementation.
For identifying "Other" categories, dont bother much. Just train on your required categories which clearly identifies them, and introduce a threshold in the classifier.
If the values for a label does not cross a threshold, then the classifier adds the "Other" label.
It's all in the training data.
AWS Elasticsearch percolate would be ideal, but we can't use it due to the HTTP overhead of percolating documents individually.
Classify4J appears to be the best solution for our needs because the model looks easy to train and it doesn't require training of non-matches.
http://classifier4j.sourceforge.net/usage.html

Clustering or other mechanisms for implementing generic spam detection

In normal case I had tried out naive bayes and linear SVM earlier to classify data related to certain specific type of comments related to some page where I had access to training data manually labelled and classified as spam or ham.
Now I am being told to check if there are any ways to classify comments as spam where we don't have a training data. Something like getting two clusters for data which will be marked as spam or ham given any data.
I need to know certain ways to approach this problem and what would be a good way to implement this.
I am still learning and experimenting . Any help will be appreciated
Are the new comments very different from the old comments in terms of vocabulary? Because words is almost everything the classifiers for this task look at.
You always can try using your old training data and apply the classifier to the new domain. You would have to label a few examples from your new domain in order to measure performance (or better, let others do the labeling in order to get more reliable results).
If this doesn't work well, you could try domain adaptation or look for some datasets more similar to your new domain, using Google or looking at this spam/ham corpora.
Finally, there may be some regularity or pattern in your new setting, e.g. downvotes for a comment, which may indicate spam/ham. In such cases, you could compile training data yourself. This would them be called distant supervision (you can search for papers using this keyword).
The best I could get to was this research work which mentions about active learning. So what I came up with is that I first performed Kmeans clustering and got the central clusters (assuming 5 clusters I took 3 clusters descending ordered by length) and took 1000 msgs from each. Then I would assign it to be labelled by the user. The next process would be training using logistic regression on the labelled data and getting the probabilities of unlabelled data and then if I have probability close to 0.5 or in range of 0.4 to 0.6 which means it is uncertain I would assign it to be labelled and then the process would continue.

Predictive features with high presence in one class

I am doing a logistic regression to predict the outcome of a binary variable, say whether a journal paper gets accepted or not. The dependent variable or predictors are all the phrases used in these papers - (unigrams, bigrams, trigrams). One of these phrases has a skewed presence in the 'accepted' class. Including this phrase gives me a classifier with a very high accuracy (more than 90%), while removing this phrase results in accuracy dropping to about 70%.
My more general (naive) machine learning question is:
Is it advisable to remove such skewed features when doing classification?
Is there a method to check skewed presence for every feature and then decide whether to keep it in the model or not?
If I understand correctly you ask whether some feature should be removed because it is a good predictor (it makes your classifier works better). So the answer is short and simple - do not remove it in fact, the whole concept is to find exactly such features.
The only reason to remove such feature would be that this phenomena only occurs in the training set, and not in real data. But in such case you have wrong data - which does not represnt the underlying data density and you should gather better data or "clean" the current one so it has analogous characteristics as the "real ones".
Based on your comments, it sounds like the feature in your documents that's highly predictive of the class is a near-tautology: "paper accepted on" correlates with accepted papers because at least some of the papers in your database were scraped from already-accepted papers and have been annotated by the authors as such.
To me, this sounds like a useless feature for trying to predict whether a paper will be accepted, because (I'd imagine) you're trying to predict paper acceptance before the actual acceptance has been issued ! In such a case, none of the papers you'd like to test your algorithm with will be annotated with "paper accepted on." So, I'd remove it.
You also asked about how to determine whether a feature correlates strongly with one class. There are three things that come to mind for this problem.
First, you could just compute a basic frequency count for each feature in your dataset and compare those values across classes. This is probably not super informative, but it's easy.
Second, since you're using a log-linear model, you can train your model on your training dataset, and then rank each feature in your model by its weight in the logistic regression parameter vector. Features with high positive weight are indicative of one class, while features with large negative weight are strongly indicative of the other.
Finally, just for the sake of completeness, I'll point out that you might also want to look into feature selection. There are many ways of selecting relevant features for a machine learning algorithm, but I think one of the most intuitive from your perspective might be greedy feature elimination. In such an approach, you train a classifier using all N features in your model, and measure the accuracy on some held-out validation set. Then, train N new models, each with N-1 features, such that each model eliminates one of the N features, and measure the resulting drop in accuracy. The feature with the biggest drop was probably strongly predictive of the class, while features that have no measurable difference can probably be omitted from your final model. As larsmans points out correctly in the comments below, this doesn't scale well at all, but it can be a useful method sometimes.

How to set intercept_scaling in scikit-learn LogisticRegression

I am using scikit-learn's LogisticRegression object for regularized binary classification. I've read the documentation on intercept_scaling but I don't understand how to choose this value intelligently.
The datasets look like this:
10-20 features, 300-500 replicates
Highly non-Gaussian, in fact most observations are zeros
The output classes are not necessarily equally likely. In some cases they are almost 50/50, in other cases they are more like 90/10.
Typically C=0.001 gives good cross-validated results.
The documentation contains warnings that the intercept itself is subject to regularization, like every other feature, and that intercept_scaling can be used to address this. But how should I choose this value? One simple answer is to explore many possible combinations of C and intercept_scaling and choose the parameters that give the best performance. But this parameter search will take quite a while and I'd like to avoid that if possible.
Ideally, I would like to use the intercept to control the distribution of output predictions. That is, I would like to ensure that the probability that the classifier predicts "class 1" on the training set is equal to the proportion of "class 1" data in the training set. I know that this is the case under certain circumstances, but this is not the case in my data. I don't know if it's due to the regularization or to the non-Gaussian nature of the input data.
Thanks for any suggestions!
While you tried oversampling the positive class by setting class_weight="auto"? That effectively oversamples the underrepresented classes and undersamples the majority class.
(The current stable docs are a bit confusing since they seem to have been copy-pasted from SVC and not edited for LR; that's just changed in the bleeding edge version.)

Resources