Using a classification algorithm (for example Naive Bayes or an SVM) with StringToWordVector, would it be possible to use TF/IDF but count term frequency over the whole current class instead of just within a single document?
Let me explain: I would like the computation to give a high score to words that are very frequent in a given class (not just in a given document) but not very frequent in the whole corpus.
Is this possible out of the box, or does it need some extra development?
Thanks :)
I would like the computation to give a high score to words that are very frequent in a given class (not just in a given document) but not very frequent in the whole corpus.
You seem to want supervised term weighting. I'm not aware of any off-the-shelf implementation, but there's a host of literature about it. For example, the weighting scheme tf-χ² replaces idf with the result of a χ² independence test, so terms that statistically depend on certain classes get boosted; there are several other such schemes.
Tf-idf itself is by its very nature unsupervised.
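For a concrete feel for supervised weighting, here is a minimal sketch of tf-χ² using scikit-learn; the toy documents, labels, and the exact way the χ² scores are folded back into the matrix are my own illustrative assumptions, not a standard recipe:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import chi2

    docs = ["the cat sat on the mat", "dogs bark at cats",
            "stocks fell on monday", "the market rallied today"]
    labels = [0, 0, 1, 1]  # 0 = pets, 1 = finance (toy data)

    vec = CountVectorizer()
    tf = vec.fit_transform(docs)        # raw term frequencies

    scores, _ = chi2(tf, labels)        # one chi2 statistic per vocabulary term

    # tf-chi2: scale each term's frequency by its chi2 dependence on the class
    # labels, so class-indicative terms get boosted rather than globally rare ones
    tf_chi2 = tf.multiply(scores)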
I think you're confusing yourself here: what you're asking for is essentially the feature weight on that term for documents of that class. This is what the learning algorithm is intended to optimise. Just worry about a useful representation of documents, which must necessarily be invariant to the class to which they belong (since you won't know what the class is for unseen test documents).
A modified idf may help in some situations.
You can use an idf defined as:

log(1 + p(term in this class) / p(term in other classes))
Disadvantage: each class has a different idf, which can be interpreted as the same term contributing differently to distinguishing different categories.
Application: by adding this idf to Naive Bayes, I got an improvement in query keyword classification, and it performs well for keyword extraction.
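A minimal sketch of that class-conditional idf (the toy corpus and the smoothing constant are assumptions; documents are pre-tokenised):

    import math

    docs = [(["mother", "father", "children"], "family"),
            (["family", "mother", "holiday"], "family"),
            (["football", "score", "tennis"], "sport"),
            (["score", "match", "football"], "sport")]

    def class_idf(term, cls, docs, smooth=1e-9):
        # p(term in this class): fraction of documents of cls containing term
        in_cls = [set(toks) for toks, c in docs if c == cls]
        out_cls = [set(toks) for toks, c in docs if c != cls]
        p_in = sum(term in d for d in in_cls) / max(len(in_cls), 1)
        p_out = sum(term in d for d in out_cls) / max(len(out_cls), 1)
        return math.log(1 + p_in / (p_out + smooth))

    print(class_idf("football", "sport", docs))   # high: frequent only in sport
    print(class_idf("score", "family", docs))     # zero: absent from family docs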
I'm looking for a way to use machine learning to classify FAQs into pre-defined classes, where questions that don't fit any of them should get lumped into an "other" class.
The problem: the training dataset contains about 1,500 FAQs, with "other" being the largest class (about 250 questions are lumped into it). These are typically odd-ball questions that are asked very infrequently. However, when I train a model, the "other" class becomes a model favourite, just because of its size and variance compared to the other classes. If I now use this model to classify the FAQs without a class, a decent number will be lumped into "other" where they shouldn't be.
What I want: a model that tries the specific classes first, and only lumps a question into "other" when it can't find a good match among them.
What I've tried: undersampling the "other" class. This works OK-ish, but I think there should be a better solution.
I'll try using the number of times a FAQ is asked as a second predictor (not sure how yet), but I'm looking for any out-of-the-box solutions or pointers. Thanks!
I can suggest two strategies for this classification (although it would be more accurate to call it clustering, since both are unsupervised):
First method: use NLP (nltk, for example) to discover the n most frequent words in the questions and treat them as the class labels. To do so, create a corpus by concatenating all the questions; clean the text by removing punctuation, stopwords, digits, mentions, hashtags and so on; then tokenise and lemmatise the text and find the most common tokens. I believe it is better to keep only nouns and take the most common of those. Alternatively, you can compute tf-idf and decide based on that instead.
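A minimal sketch of this first method with nltk (the example questions are made up, and newer nltk versions may need extra resources such as punkt_tab):

    from collections import Counter

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    for pkg in ("punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"):
        nltk.download(pkg, quiet=True)   # one-time downloads

    questions = ["How do I reset my password?",
                 "Where can I change my password settings?",
                 "How do I cancel my subscription?"]

    lemmatizer = WordNetLemmatizer()
    stop = set(stopwords.words("english"))

    tokens = []
    for q in questions:
        for word, tag in nltk.pos_tag(nltk.word_tokenize(q.lower())):
            # keep only alphabetic nouns, as suggested above
            if tag.startswith("NN") and word.isalpha() and word not in stop:
                tokens.append(lemmatizer.lemmatize(word))

    print(Counter(tokens).most_common(5))   # candidate class labels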
Second method: use fuzzy techniques to compute similarities between texts. The fuzzywuzzy library contains several functions for this; for your case fuzzywuzzy.fuzz.token_set_ratio() would be the right choice, since you are comparing whole sentences. However, with 1,500 questions you have n * (n - 1) / 2 = 1,124,250 pairs to score, which is a lot. To keep that manageable, I suggest generating the pairs with itertools.combinations.
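A minimal sketch of the second method (the FAQ strings and the similarity threshold are illustrative):

    from itertools import combinations

    from fuzzywuzzy import fuzz

    faqs = ["How do I reset my password?",
            "How can my password be reset?",
            "How do I cancel my subscription?"]

    # score every unordered pair exactly once: n * (n - 1) / 2 comparisons
    for a, b in combinations(faqs, 2):
        score = fuzz.token_set_ratio(a, b)   # 0-100, word-order-insensitive
        if score >= 80:                      # illustrative threshold
            print(score, "|", a, "<->", b)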
Hope these help!
I have trained Gensim's Word2Vec on a text corpus, converted it to Doc2Vec, and then used cosine similarity to find the similarity between documents. I need to suggest similar documents. Now suppose that, among the top 5 suggestions for a particular document, we manually find that 3 of them are not similar. Can this feedback be incorporated in retraining the model?
It's not quite clear what you mean by "converted [a Word2Vec model] to Doc2Vec". The gensim Doc2Vec class doesn't use or require a Word2Vec model as input.
But if you have many hand-curated "this is a good suggestion" / "this is a bad suggestion" pairs for your corpus, you can score each candidate model against all of them, train many variant models (with different parameter values like size, window, min_count, sample, etc.), and pick the one that scores best on your tests.
That sort of automated parameter search is the most straightforward way to use performance on real evaluation data to tune an unsupervised model like Word2Vec.
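For illustration, a minimal sketch of such a search with gensim (the corpus, the curated pairs, the averaged-word-vector document representation, and the scoring function are all assumptions; parameter names follow gensim 4.x, where size became vector_size):

    from itertools import product

    import numpy as np
    from gensim.models import Word2Vec

    corpus = [["cat", "sat", "on", "mat"],
              ["dog", "chased", "the", "cat"],
              ["stocks", "fell", "on", "monday"],
              ["the", "market", "rallied"]]
    good_pairs = [(0, 1)]   # hand-curated: these documents should be similar
    bad_pairs = [(0, 2)]    # hand-curated: these should not be

    def doc_vec(model, tokens):
        vecs = [model.wv[t] for t in tokens if t in model.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    def score(model):
        dv = [doc_vec(model, d) for d in corpus]
        good = np.mean([cos(dv[i], dv[j]) for i, j in good_pairs])
        bad = np.mean([cos(dv[i], dv[j]) for i, j in bad_pairs])
        return good - bad   # higher = agrees more with the human judgements

    variants = [Word2Vec(corpus, vector_size=20, window=w, min_count=1,
                         sample=s, seed=0)
                for w, s in product([2, 5], [0, 1e-3])]
    best = max(variants, key=score)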
(Depending on the specifics of your data and problem-domain, you might also start to notice patterns in where the model is better or worse, that help you hand-tune parts of the data preprocessing. For example, a different handling of capitalization or tokenization might be suggested by error cases.)
I need some outside perspective on whether what I am doing is right or wrong, or whether there is a better way to do it.
I have 10,000 elements, and for each of them I have about 500 features.
I am looking to measure the separability between two sets of those elements. (I already know the two groups; I am not trying to find them.)
For now I am using an SVM. I train the SVM on 2,000 of those elements, then I look at how good the score is when I test on the other 8,000.
Now I would like to know which features maximise this separation.
My first approach was to test every combination of features with the SVM and track the score it gives. If the score is good, those features are relevant for separating the two sets of data.
But this takes far too much time: with 500 features there are 2^500 - 1 possible subsets to test.
The second approach was to remove one feature and see how much the score is impacted. If the score changes a lot, that feature is relevant. This is faster, but I am not sure it is right: with 500 features, removing just one rarely changes the final score much.
Is this a correct way to do it?
Have you tried any other methods? Maybe you can try a decision tree or a random forest; they rank your best features based on entropy gain. Can I assume all the features are independent of each other? If not, please remove the dependent ones as well.
Also, for support vector machines, you can check out this paper:
http://axon.cs.byu.edu/Dan/778/papers/Feature%20Selection/guyon2.pdf
though it is geared more toward linear SVMs.
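As a rough sketch of both suggestions in scikit-learn (random-data stand-in for the 10,000 x 500 elements; the forest size and the number of features to keep are arbitrary choices):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFE
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=1000, n_features=50,
                               n_informative=5, random_state=0)

    # random forest: impurity-based importances rank the features directly
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    print("forest's top features:", np.argsort(rf.feature_importances_)[::-1][:10])

    # SVM-RFE (the Guyon et al. idea above): recursively drop the features
    # with the smallest linear-SVM weights
    rfe = RFE(LinearSVC(dual=False), n_features_to_select=10).fit(X, y)
    print("RFE-selected features:", np.where(rfe.support_)[0])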
You can do statistical analysis on the features to get indications of which terms best separate the data. I like Information Gain, but there are others.
I found this paper (Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No.1, pp.1-47, 2002) to be a good theoretical treatment of text classification, including feature reduction by a variety of methods from the simple (Term Frequency) to the complex (Information-Theoretic).
These functions try to capture the intuition that the best terms for ci are the ones distributed most differently in the sets of positive and negative examples of ci. However, interpretations of this principle vary across different functions. For instance, in the experimental sciences χ² is used to measure how the results of an observation differ (i.e., are independent) from the results expected according to an initial hypothesis (lower values indicate lower dependence). In DR we measure how independent tk and ci are. The terms tk with the lowest value for χ²(tk, ci) are thus the most independent from ci; since we are interested in the terms which are not, we select the terms for which χ²(tk, ci) is highest.
These techniques help you choose terms that are most useful in separating the training documents into the given classes; the terms with the highest predictive value for your problem. The features with the highest Information Gain are likely to best separate your data.
I've been successful using Information Gain for feature reduction and found this paper (Christine Largeron, Christophe Moulin, and Mathias Géry, "Entropy based feature selection for text categorization", SAC 2011, pp. 924-928) to be a very good practical guide.
Here the authors present a simple formulation of entropy-based feature selection that's useful for implementation in code:
Given a term tj and a category ck, ECCD(tj, ck) can be computed from a contingency table. Let A be the number of documents in the category containing tj; B, the number of documents in the other categories containing tj; C, the number of documents of ck which do not contain tj; and D, the number of documents in the other categories which do not contain tj (with N = A + B + C + D):

                             in ck    in other categories
    contains tj                A              B
    does not contain tj        C              D
Using this contingency table, Information Gain can be estimated by the standard contingency-table form of the mutual information between term and category:

IG(tj, ck) = (A/N) log(A·N / ((A+B)(A+C))) + (B/N) log(B·N / ((A+B)(B+D))) + (C/N) log(C·N / ((C+D)(A+C))) + (D/N) log(D·N / ((C+D)(B+D)))
This approach is easy to implement and provides very good Information-Theoretic feature reduction.
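For instance, a minimal sketch of that estimate in Python (the example counts are made up):

    import math

    def information_gain(A, B, C, D):
        # mutual information between term presence and category membership,
        # estimated from the A/B/C/D contingency counts defined above
        N = A + B + C + D
        ig = 0.0
        for joint, row, col in ((A, A + B, A + C), (B, A + B, B + D),
                                (C, C + D, A + C), (D, C + D, B + D)):
            if joint:
                ig += (joint / N) * math.log((joint * N) / (row * col))
        return ig

    # e.g. a term in 40 of the 50 class documents but only 5 of the 950 others
    print(information_gain(A=40, B=5, C=10, D=945))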
You needn't use a single technique either; you can combine them. Term-Frequency is simple, but can also be effective. I've combined the Information Gain approach with Term Frequency to do feature selection successfully. You should experiment with your data to see which technique or techniques work most effectively.
If you want a single feature to discriminate your data, use a decision tree, and look at the root node.
SVM by design looks at combinations of all features.
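A minimal sketch of reading off that root-node feature with scikit-learn (random stand-in data; a depth-1 "stump" makes the root split explicit):

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=50,
                               n_informative=5, random_state=0)

    stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
    print("most discriminative single feature:", stump.tree_.feature[0])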
Have you thought about Linear Discriminant Analysis (LDA)?
LDA aims at discovering a linear combination of features that maximises separability. It works by projecting your data into a space where the within-class variance is minimised and the between-class variance is maximised.
You can use it to reduce the number of dimensions required to classify, and also use it directly as a linear classifier.
However, with this technique you lose the original features and their meaning, which you may want to avoid.
If you want more details I found this article to be a good introduction.
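A minimal sketch with scikit-learn's implementation (random stand-in data; with two classes LDA yields at most one discriminant axis):

    from sklearn.datasets import make_classification
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = make_classification(n_samples=1000, n_features=50,
                               n_informative=5, random_state=0)

    lda = LinearDiscriminantAnalysis(n_components=1)
    X_proj = lda.fit_transform(X, y)          # maximally separating projection
    print("accuracy as a classifier:", lda.score(X, y))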
I am trying to do document classification, but I am really confused about feature selection versus tf-idf. Are they the same thing, or two different ways of doing classification?
I hope somebody can tell me. I am not really sure my question will make sense to you guys.
Yes, you are confusing a lot of things.
Feature selection is the general term for choosing features (a 0-or-1 decision: keep or discard). Stopword removal can be seen as feature selection.
TF is one method of extracting features from text: counting words.
IDF is one method of assigning weights to features.
Neither of them is classification... they are popular for text classification, but they are even more popular for information retrieval, which is not classification...
However, many classifiers work on numeric data, so the common process is:
1. Extract features (e.g. TF: count words).
2. Select features (e.g. remove stopwords).
3. Weight features (e.g. IDF).
4. Train a classifier on the resulting numeric vectors.
5. Predict the classes of new/unlabeled documents.
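A minimal sketch of that five-step process with scikit-learn (the toy corpus and the choice of Naive Bayes are illustrative):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    train_docs = ["the cat sat on the mat", "dogs chase cats",
                  "stocks fell sharply", "the market rallied"]
    train_labels = ["pets", "pets", "finance", "finance"]

    clf = Pipeline([
        ("tf", CountVectorizer(stop_words="english")),  # steps 1-2: extract + select
        ("idf", TfidfTransformer()),                    # step 3: weight
        ("nb", MultinomialNB()),                        # step 4: train
    ])
    clf.fit(train_docs, train_labels)
    print(clf.predict(["my cat chased a dog"]))         # step 5: predict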
Taking a look at this explanation may help a lot when it comes to understanding text classifiers.
TF-IDF is a good way to find a document that answers a given query, but it does not necessarily assign classes to documents.
Examples that may be helpful:
1) You have a bunch of documents with subjects ranging from politics and economics to computer science and the arts. The documents belonging to each subject are separated into the appropriate directory for each subject (you have a labeled dataset). Now you receive a new document whose subject you do not know. In which directory should it be stored? A classifier can answer this question using the documents that are already labeled.
2) Now, you received a query regarding computer science. For instance, you received the query "Good methods for finding textual similarity". Which document in the directory of computer science can provide the best response to that query? TF-IDF would be a good approach to figure that out.
So, when you are classifying documents, you are trying to make a decision about whether a document is a member of a particular class (like, say, 'about birds' or 'not about birds').
Classifiers predict the value of the class given a set of features. A good set of features will be highly discriminative - they will tell you a lot about whether the document is of one class or another.
Tf-idf (term frequency-inverse document frequency) is a particular feature that tends to be discriminative for document classification tasks. There are others, like raw word counts (tf, term frequency), or whether a regexp matches the text, or what have you.
Feature selection is the task of selecting good (discriminative) features. Tfidf is probably a good feature to select.
I want to do text classification based on the keywords that appear in the text, because I do not have labeled sample data to train Naive Bayes for text classification.
Example:
my document has a few words such as "family, mother, father, children...", so the document's category is family; or "football, tennis, score...", so the category is sport.
What is the best algorithm in this case? And is there any Java API for this problem?
What you have are feature labels, i.e., labels on features rather than instances. There are a few methods for exploiting these, but usually it is assumed that one has instance labels (i.e., labels on documents) in addition to feature labels. This paradigm is referred to as dual-supervision.
Anyway, I know of at least two ways to learn from labeled features alone. The first is Generalized Expectation Criteria, which penalizes model parameters for diverging from a priori beliefs (e.g., that "mother" ought usually to correlate with "family"). This method has the disadvantage of being somewhat complex, but the advantage of having a nicely packaged, open-source Java implementation in the Mallet toolkit (see here, specifically).
A second option would basically be to use Naive Bayes and give large priors to the known word/class associations, e.g. P("family"|"mother") = 0.8, or whatever. All unlabeled words would be assigned some prior, presumably reflecting the class distribution. You would then effectively be making decisions based only on the prevalence of classes and the labeled term information. Settles proposed a model like this recently, and there is a web tool available.
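A crude sketch of that seeded-prior idea (this is not Settles' actual model; the seed sets and the 0.8/0.2 probabilities are illustrative, and unlabeled words are simply ignored):

    import math

    seeds = {"family": {"family", "mother", "father", "children"},
             "sport": {"football", "tennis", "score"}}

    P_SEED, P_OTHER = 0.8, 0.2   # assumed prior strength for labeled words

    def classify(tokens):
        labeled = set().union(*seeds.values())
        scores = {}
        for cls, words in seeds.items():
            # sum log-probabilities over the tokens that carry a feature label
            scores[cls] = sum(math.log(P_SEED if t in words else P_OTHER)
                              for t in tokens if t in labeled)
        return max(scores, key=scores.get)

    print(classify("my mother and father take the children out".split()))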
You will likely need an auxiliary data set for this. You cannot rely on your data set to convey the information that "dad", "father" and "husband" have similar meanings.
You can try to mine for co-occurrences to detect near-synonyms, but this is not very reliable.
WordNet and similar resources are probably a good place to disambiguate such words.
You can download the freebase topic collection: http://wiki.freebase.com/wiki/Topic_API.