Multiclass text classification imbalance, dealing with class "other" - machine-learning

I'm looking for a way to use machine learning to correctly classify FAQs that don't fit any of the pre-defined classes and should instead be lumped into an "other" class.
The problem: the training dataset contains about 1,500 FAQs, with "other" being the largest class (about 250 questions). These are typically odd-ball questions that are asked very infrequently. However, when I train a model, the "other" class becomes the model's favourite, simply because of its size and variance compared to the other classes. If I then use this model to classify unlabelled FAQs, a decent number get lumped into "other" where they shouldn't be.
What I want: a model that tries the specific classes first and only lumps a question into "other" when it can't find a good match among them.
What I've tried: undersampling the "other" class. This works OK-ish, but I think there should be a better solution.
I'll try to use the number of times an FAQ is asked as a second predictor (not sure how yet), but I'm looking for any out-of-the-box solutions or pointers. Thanks!

I can suggest two strategies for this classification (although it is more accurate to call it clustering, since it is unsupervised learning):
First method: use NLP (nltk, for example) to discover the n most frequent words in the questions and treat them as the class labels. To do so, create a corpus by concatenating all the questions; clean the text by removing punctuation, stopwords, digits, mentions, hashtags and so on; then tokenise and lemmatise the text and find the most common tokens. I believe it is better to keep only nouns and take the most common ones. Alternatively, you can compute tf–idf and decide based on that instead.
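As a minimal stand-alone sketch of that counting step using only the standard library (the example questions and the tiny stopword list are made up; in practice nltk's tokeniser, stopword corpus, POS tagger and lemmatiser would do this properly):

```python
import re
from collections import Counter

# Hypothetical FAQ texts standing in for the real 1,500-question dataset.
questions = [
    "How do I reset my password?",
    "I forgot my password, what now?",
    "Where can I change my password?",
    "What are your opening hours?",
]

# A toy stopword list; nltk.corpus.stopwords is the real thing.
STOPWORDS = {"how", "do", "i", "my", "what", "now", "where", "can", "are", "your"}

tokens = []
for q in questions:
    # Lowercase, keep alphabetic runs only (drops punctuation and digits),
    # then filter out stopwords.
    words = re.findall(r"[a-z]+", q.lower())
    tokens.extend(w for w in words if w not in STOPWORDS)

# The most common remaining tokens become candidate class labels.
most_common = Counter(tokens).most_common(3)
```

Here "password" would dominate, suggesting a password-related class; with POS filtering you would additionally restrict `tokens` to nouns before counting.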
Second method: use fuzzy techniques to compute the similarities between texts. The fuzzywuzzy library contains several functions for this; fuzzywuzzy.fuzz.token_set_ratio() would be the right choice for your case, since you are comparing two sentences. However, with 1500 questions you have (1500 × 1499) / 2 = 1,124,250 pairs to compute similarity for, which is a lot. To keep this manageable, I suggest generating the pairs with itertools.
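A sketch of that pairwise loop, using a simple Jaccard similarity on token sets as a pure-Python stand-in for fuzzywuzzy.fuzz.token_set_ratio (the questions are invented):

```python
from itertools import combinations

def token_set_similarity(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two strings (0..1)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

questions = [
    "how do i reset my password",
    "how can i reset my password",
    "what are your opening hours",
]

# combinations() yields every unordered pair exactly once: n*(n-1)/2 scores.
pairs = {
    (a, b): token_set_similarity(a, b)
    for a, b in combinations(questions, 2)
}

# Pairs above some threshold could then be grouped into the same cluster.
similar = [(a, b) for (a, b), score in pairs.items() if score >= 0.7]
```

With the real library you would swap `token_set_similarity(a, b)` for `fuzz.token_set_ratio(a, b)` (which returns 0–100 rather than 0–1) and pick a threshold empirically.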
Hope these help!

Related

How to evaluate word2vec built on specific context files

Using gensim word2vec, I built a CBOW model on a collection of litigation files to represent words as vectors for a named-entity-recognition problem, but I want to know how to evaluate my word representations. If I use other datasets like WordSim-353 (NLTK) or other online datasets from Google, it doesn't work, because I built the model specifically for my own domain's files. How do I evaluate my word2vec's word-vector representations? I want words belonging to similar contexts to be closer in vector space. How do I ensure that the built model is doing this?
I started by using a technique called odd-one-out. E.g.:
model.wv.doesnt_match("breakfast cereal dinner lunch".split()) --> 'cereal'
I created my own validation dataset using words from the word2vec training data. I started evaluating by taking three words from a similar context and one odd word out of context. But the accuracy of my model is only 30%.
Will the above method really help in evaluating my w2v model? Or is there a better way?
I would like to use a word-similarity measure, but I need a human-assessed reference score to evaluate my model. Or is there another technique? Please suggest any ideas or approaches.
Ultimately this depends on the purpose you intend for the word-vectors – your evaluation should mimic the final use as much as possible.
The "odd one out" approach may be reasonable. It's often done with just 2 words that are somehow, via external knowledge/categorization, known to be related (in the aspects that are important for your end use), then a 3rd word picked at random.
If you think your hand-crafted evaluation set is of high-quality for your purposes, but your word-vectors aren't doing well, it may just be that there are other problems with your training: too little data, errors in preprocessing, poorly-chosen metaparameters, etc.
You'd have to look at individual failure cases in more detail to pick what to improve next. For example, even when it fails at one of your odd-one-out tests, do the lists of most-similar words, for each of the words included, still make superficial sense in an eyeball-test? Does using more data or more training iterations significantly improve the evaluation scoring?
A common mistake during both training and evaluation/deployment is to retain too many rare words, on the (mistaken) intuition that "more info must be better". In fact, words with only a few occurrences can't get very high-quality vectors. (Compared to more-frequent words, their end vectors are more heavily influenced by the random original initialization, and by the idiosyncrasies of their few occurrences rather than their most-general meaning.) Furthermore, their presence tends to interfere with the improvement of other nearby, more-frequent words. Then, if you include the 'long tail' of weaker vectors in your evaluations, they tend to somewhat arbitrarily intrude in rankings ahead of common words with strong vectors, hiding the 'right' answers to your evaluation questions.
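To make that pruning step concrete, here is a toy illustration (gensim's Word2Vec does the equivalent internally via its `min_count` parameter, which defaults to 5; the sentences here are invented):

```python
from collections import Counter

# Invented, already-tokenised "litigation" sentences.
sentences = [
    ["court", "ruled", "against", "defendant"],
    ["court", "ruled", "for", "plaintiff"],
    ["defendant", "appealed", "the", "ruling"],
]

# Count every token across the corpus.
counts = Counter(w for s in sentences for w in s)

# Discard words seen fewer than MIN_COUNT times -- their vectors would be
# dominated by random initialisation and one-off contexts anyway.
MIN_COUNT = 2
kept = {w for w, c in counts.items() if c >= MIN_COUNT}
pruned = [[w for w in s if w in kept] for s in sentences]
```

On a real corpus you would pass `min_count` directly to `gensim.models.Word2Vec` rather than pre-filtering by hand; the point is that the long tail of once-seen words is dropped before training.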
Also, note that the absolute value of an evaluation score may not be that important, because you're just looking for something that points your other optimizations in the right direction for your true end-goal. Word-vectors that are just slightly-better at precise evaluation questions might still work well-enough in other fuzzier information-retrieval contexts.

Different performance by different ML classifiers, what can I deduce?

I have used an ML approach in my research using Python scikit-learn. I found that SVM and logistic regression classifiers work best (e.g. 85% accuracy), decision trees work markedly worse (65%), and naive Bayes works markedly worse still (40%).
I will write up the conclusion stating the obvious, that some ML classifiers worked better than others by a large margin; but what else can I say about my learning task or data structure based on these observations?
Edit:
The dataset involves 500,000 rows and 15 features, but some of the features are various combinations of substrings of certain text, so the data naturally expands to tens of thousands of columns as a sparse matrix. I am using people's names to predict a binary class (e.g. gender), and I do a lot of feature engineering on the name entity, such as the length of the name, substrings of the name, etc.
I recommend visiting this excellent map on choosing the right estimator by the scikit-learn team: http://scikit-learn.org/stable/tutorial/machine_learning_map
As describing the specifics of your own case would be an enormous task (I totally understand why you didn't!), I encourage you to ask yourself several questions, and the 'choosing the right estimator' map is a good start.
Literally, go to the 'start' node in the map and follow the path:
Is my number of samples > 50?
And so on. In the end you will land at some node; see whether your results match the map's recommendation (i.e. did I end up at an SVM, which gives me better results?). If so, dig deeper into the documentation and ask yourself why that classifier performs better on text data, or whatever insight you get.
As I said, we don't know the specifics of your data, but you should be able to ask questions like: what type of data do I have (text, binary, ...)? How many samples? How many classes to predict? Ideally your data will give you some hints about the context of your problem, and therefore why some estimators perform better than others.
But yes, your question is too broad to cover in a single answer (especially without knowing the type of problem you are dealing with). You could also check whether any of those approaches is more inclined to overfit, for example.
The list of recommendations could be endless, which is why I encourage you to start by defining the type of problem and your data (beyond the number of samples: is it normalised? Is it sparse? Are you representing text as a sparse matrix? Are your inputs floats from 0.11 to 0.99?).
Anyway, if you share some specifics about your data, we may be able to answer more precisely. Hope this helped a little bit ;)
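As a concrete starting point, a comparison loop along these lines can be sketched with scikit-learn (the toy names and labels, and the character-n-gram features, are assumptions standing in for the asker's real name/gender data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Invented names with a made-up binary label (1 vs 0), repeated to give
# cross-validation something to work with.
names = ["anna", "maria", "julia", "sofia", "john", "peter", "mark", "david"] * 10
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 10

# Character 2-3-grams play the role of the "substrings of the name" features,
# producing a sparse matrix just like the one described in the question.
X = CountVectorizer(analyzer="char", ngram_range=(2, 3)).fit_transform(names)

# Same four classifier families as in the question, scored side by side.
scores = {}
for clf in (LinearSVC(), LogisticRegression(), DecisionTreeClassifier(), MultinomialNB()):
    scores[type(clf).__name__] = cross_val_score(clf, X, labels, cv=5).mean()
```

Running the same loop on your real sparse matrix would give you the accuracy gaps to reason about; inspecting per-classifier failures (e.g. with a confusion matrix) is then the natural next step.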

Can TF-IDF take classes into account

Using a classification algorithm (for example naive Bayes or SVM) and StringToWordVector,
would it be possible to use TF-IDF and count term frequencies across the whole current class instead of just within a single document?
Let me explain, I would like the computation to give high score to words that are very frequent for a given class (not just for a given document) but not very frequent in the whole corpus.
Is it possible out of the box or does this need some extra developments?
Thanks :)
I would like the computation to give high score to words that are very frequent for a given class (not just for a given document) but not very frequent in the whole corpus.
You seem to want supervised term weighting. I'm not aware of any off-the-shelf implementation of that, but there's a host of literature about it. E.g. the weighting scheme tf-χ² replaces idf with the result of a χ² independence test, so terms that statistically depend on certain classes get boosted, and there are several others.
Tf-idf itself is by its very nature unsupervised.
I think you're confusing yourself here: what you're asking for is essentially the feature weight on that term for documents of that class. This is what the learning algorithm is intended to optimise. Just worry about a useful representation of documents, which must necessarily be invariant to the class to which they belong (since you won't know the class for unseen test documents).
A modified idf may help in some scenarios.
You can use an idf defined as:
log(1 + p(term in this class) / p(term in other classes))
Disadvantage: each class has a different idf, which can be interpreted as each term contributing differently to distinguishing each category.
Application: by adding this idf to naive Bayes, I got an improvement in query-keyword classification, and it also performs well for keyword extraction.
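A toy implementation of this class-conditional idf (the smoothing constant and the example documents are assumptions; the answer only gives the formula):

```python
import math

# Invented two-class corpus of short documents.
docs_by_class = {
    "sport": ["football score tonight", "tennis score update"],
    "family": ["mother and father", "children playing football"],
}

def term_prob(term: str, cls: str, eps: float = 1e-6) -> float:
    """Smoothed fraction of documents in `cls` containing `term`."""
    docs = docs_by_class[cls]
    hits = sum(term in d.split() for d in docs)
    return hits / len(docs) + eps  # eps avoids division by zero below

def class_idf(term: str, cls: str) -> float:
    """log(1 + p(term in this class) / p(term in other class))."""
    other = next(c for c in docs_by_class if c != cls)
    return math.log(1 + term_prob(term, cls) / term_prob(term, other))
```

With these documents, "score" (in every sport document, in no family document) gets a much larger sport-idf than "football" (which appears in both classes), matching the intended behaviour.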

Feature selection for machine learning

I'm classifying websites. One of the tasks is to filter out porn. I'm using a binary SVM classifier with bag-of-words. I have a question about the words I should include in the BoW: should it be just porn-related words (words commonly found on porn websites), or should it also include words that are rarely found on porn websites but frequently found on other websites (for example, "mathematics", "engineering", "guitar", "birth", etc.)?
The problem I'm encountering is false positives on medicine and family related sites. If I only look for porn-related words, then the vectors for such sites end up very sparse. Words like "sex" appear fairly often, but in a completely innocent context.
Should I include the non-porn words as well? Or should I look at other ways of resolving the false positives? Suggestions are most welcome.
Another possible approach would be to build a language model specifically for porn sites. If you have n-grams (e.g. 3-grams), it should be easier to identify whether a particular occurrence of the word "sex" relates to porn or to another domain.
A theoretical guess: if you had such a language model, you might not even need a classifier (the perplexity/likelihood of the n-grams should be enough to decide).
Topic modelling (try Latent Dirichlet Allocation http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) would be able to handle this well.
Feeding the document topics as features to the classifier would help to avoid the problems you're encountering.
You should include as many words as possible; ideally an entire dictionary. The classifier is able to identify websites by determining how similar they are to the classes you define. You need to give it the means to identify both classes, not just one of them. Think of being asked to identify cats in pictures, but only being shown cats to train. While for any particular picture you might be able to say that it doesn't look a lot like a cat (or rather any cat you've seen), you have no way of determining whether there's enough cat-ness for it still to be a cat.
Include all of the words and let the SVM decide which are useful - the classifier needs to be able to distinguish between the positives and negatives, and negatives can also be characterized with words that are not in your target domain (porn), thus making the split between the examples potentially clearer.
Preferably, use not only single words, but also n-grams (e.g., 2 or 3-grams above a certain frequency) as additional features (this should help with your problem with medicine false positives). N-grams will also fit right in with your approach if you are using TF-IDF weighting.

Topic Detection by Clustering Keywords

I want to do text classification based on the keywords that appear in the text, because I do not have sample data to train naive Bayes for text classification.
Example:
my document contains a few words such as "family, mother, father, children, ...", so the category of the document is family; or "football, tennis, score, ...", so the category is sport.
What is the best algorithm in this case? And is there a Java API for this problem?
What you have are feature labels, i.e., labels on features rather than instances. There are a few methods for exploiting these, but usually it is assumed that one has instance labels (i.e., labels on documents) in addition to feature labels. This paradigm is referred to as dual-supervision.
Anyway, I know of at least two ways to learn from labeled features alone. The first is Generalized Expectation Criteria, which penalizes model parameters for diverging from a priori beliefs (e.g., that "mother" ought usually to correlate with "family"). This method has the disadvantage of being somewhat complex, but the advantage of having a nicely packaged, open-source Java implementation in the Mallet toolkit (see here, specifically).
A second option would basically be to use naive Bayes and give large priors to the known word/class associations -- e.g., P("family"|"mother") = .8, or whatever. All unlabeled words would be assigned some prior, presumably reflecting the class distribution. You would then effectively be making decisions based only on the prevalence of classes and the labeled term information. Settles proposed a model like this recently, and there is a web-tool available.
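A very rough sketch of decision-making from seeded word/class priors alone (the 0.8 figure comes from the answer above; the seed words and the uniform default prior are made up, and this is far simpler than Settles' actual model):

```python
# Hand-seeded word/class associations -- the only "supervision" we have.
seed_priors = {
    "mother": {"family": 0.8, "sport": 0.2},
    "father": {"family": 0.8, "sport": 0.2},
    "football": {"family": 0.2, "sport": 0.8},
    "tennis": {"family": 0.2, "sport": 0.8},
}
# Unlabeled words are uninformative: a uniform prior over the classes.
DEFAULT = {"family": 0.5, "sport": 0.5}

def classify(text: str) -> str:
    """Naive-Bayes-style product of per-word class priors."""
    scores = {"family": 1.0, "sport": 1.0}
    for word in text.lower().split():
        priors = seed_priors.get(word, DEFAULT)
        for cls in scores:
            scores[cls] *= priors[cls]
    return max(scores, key=scores.get)
```

Only the seeded words move the score away from uniform, so documents are classified purely by the labeled-term evidence, which is the essence of the approach described.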
You will likely need an auxiliary dataset for this. You cannot rely on your dataset alone to convey that "dad", "father" and "husband" have similar meanings.
You can try mining co-occurrences to detect near-synonyms, but this is not very reliable.
WordNet and similar resources are probably a good place to disambiguate such words.
You can download the freebase topic collection: http://wiki.freebase.com/wiki/Topic_API.
