Simple question again: is it better to use n-grams (unigrams, bigrams, etc.) as simple binary features, or rather to use their TF-IDF scores, in ML models such as Support Vector Machines for NLP tasks such as sentiment analysis or text categorization/classification?
As Steve mentioned in the comment, the best answer (and the ML-style way) is to try!
That being said, I'd start with binary features. The goal of an ML model like an SVM is to determine the "weight" of these features, so if it does that well, you don't have to set these weights in advance (with TF-IDF or anything else).
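To make the "try both" advice concrete, here is a minimal sketch (assuming scikit-learn; the toy texts and labels are made-up placeholders) of how the two representations can be compared with cross-validation:

```python
# Compare binary n-gram features with TF-IDF features under the same SVM.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Placeholder data; swap in your own labeled corpus.
texts = ["great movie", "terrible plot", "loved it", "waste of time"] * 25
labels = [1, 0, 1, 0] * 25

binary_svm = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), binary=True),  # 1/0 presence features
    LinearSVC())
tfidf_svm = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),               # TF-IDF weighted features
    LinearSVC())

for name, model in [("binary", binary_svm), ("tfidf", tfidf_svm)]:
    scores = cross_val_score(model, texts, labels, cv=5)
    print(name, scores.mean())
```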
Related
I am new to machine learning. I am working on a project where machine learning concepts need to be applied.
Problem Statement:
I have a large number (say 3000) of keywords. These need to be classified into seven fixed categories. Each category has training data (sample keywords). I need to come up with an algorithm such that, when a new keyword is passed to it, it predicts which category the keyword belongs to.
I am not aware of which text classification technique needs to be applied for this. Are there any tools that can be used?
Please help.
Thanks in advance.
This comes under linear classification. You can use a Naive Bayes classifier for this. Most ML frameworks have an implementation of Naive Bayes, e.g. Mahout.
Yes, I would also suggest using Naive Bayes, which is more or less the baseline classification algorithm here. On the other hand, there are obviously many other algorithms; random forests and Support Vector Machines come to mind. See http://machinelearningmastery.com/use-random-forest-testing-179-classifiers-121-datasets/ If you use a standard toolkit such as Weka, RapidMiner, etc., these algorithms should be available. There is also OpenNLP for Java, which comes with a maximum entropy classifier.
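To make the baseline concrete, here is a minimal sketch of a Naive Bayes keyword classifier using scikit-learn (as an alternative to Mahout/Weka); the keywords and categories are made-up placeholders:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder training data: keywords already assigned to categories.
train_keywords = ["cheap flights", "hotel booking", "python tutorial", "java course"]
train_categories = ["travel", "travel", "education", "education"]

# Character n-grams can help with short keywords, which carry little word context.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    MultinomialNB())
model.fit(train_keywords, train_categories)

print(model.predict(["weekend getaway deals"]))  # -> e.g. ['travel']
```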
You could use the Word2Vec cosine distance between a description of each of your categories and the keywords in the dataset, and then simply match each keyword to the category with the closest distance.
Alternatively, you could create a training dataset from keywords already matched to categories and use any ML classifier, for example one based on artificial neural networks, using the vector of a keyword's cosine distances to each category as the input to your model. But this could require a large quantity of training data to reach good accuracy. For example, the MNIST dataset contains 70,000 samples, which allowed me to reach 99.62% cross-validation accuracy with a simple CNN; for another dataset with only 2,000 samples I reached only about 90% accuracy.
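For the first (distance-matching) idea, here is a minimal sketch with gensim; the pretrained model name and the category word lists are assumptions you would replace with your own vectors and category descriptions:

```python
import gensim.downloader as api
import numpy as np

# Assumed small pretrained vectors from gensim-data; any KeyedVectors work.
kv = api.load("glove-wiki-gigaword-50")

# Placeholder category descriptions.
categories = {
    "travel": ["flight", "hotel", "vacation"],
    "education": ["course", "tutorial", "school"],
}

def mean_vector(words):
    # Average the vectors of in-vocabulary words.
    vecs = [kv[w] for w in words if w in kv]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def classify(keyword):
    kw_vec = mean_vector(keyword.lower().split())
    return max(categories,
               key=lambda c: cosine(kw_vec, mean_vector(categories[c])))

print(classify("cheap flights"))  # -> likely 'travel'
```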
There are many classification algorithms. Your example looks to be a text classification problem; some good classifiers to try out would be SVM and Naive Bayes. For SVM, the liblinear and libshorttext classifiers are good options (and have been used in many industrial applications):
liblinear: https://www.csie.ntu.edu.tw/~cjlin/liblinear/
libshorttext: https://www.csie.ntu.edu.tw/~cjlin/libshorttext/
Similar classifiers are also included in ML tools such as scikit-learn and WEKA.
Even with a classifier in hand, it still takes some work to build and validate a practically useful model. One of the challenges is to mix discrete (boolean and enumerable) and continuous ('numeric') predictive variables seamlessly; some algorithmic preprocessing is generally necessary, as in the sketch below.
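A minimal sketch of that preprocessing, assuming scikit-learn and made-up column names: one-hot encode the discrete variables and scale the continuous one before the classifier sees them.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Placeholder data mixing variable types.
df = pd.DataFrame({
    "color":  ["red", "blue", "red", "green"],   # enumerable
    "is_new": [True, False, True, True],         # boolean
    "price":  [9.99, 103.50, 12.00, 47.25],      # continuous
})
y = [0, 1, 0, 1]

preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(), ["color", "is_new"]),  # discrete -> 0/1 columns
    ("scale", StandardScaler(), ["price"]),            # continuous -> standardized
])
model = make_pipeline(preprocess, LinearSVC())
model.fit(df, y)
```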
Neural networks do offer the possibility of using both types of variables. However, they require skilled data scientists to get good results. A straightforward option is to use an online classifier web service like Insight Classifiers to build and validate a classifier in one go; N-fold cross-validation is used there.
You can represent the presence or absence of each word as a separate column. The outcome variable is the desired category.
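For illustration, a minimal sketch of that presence/absence representation with scikit-learn's CountVectorizer (toy documents):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog barked"]
vec = CountVectorizer(binary=True)   # 1 = word present, 0 = absent
X = vec.fit_transform(docs)

print(vec.get_feature_names_out())   # each word becomes a column
print(X.toarray())                   # rows are documents, entries are 0/1
```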
Given a dataset, normally an m-instances-by-n-features matrix, how do I choose the classifier that is most appropriate for it?
This is like asking which algorithm to use to test whether a number is prime. Not every algorithm solves every problem; each problem has a finite set of algorithms suited to it. In machine learning, you can apply several different algorithms to the same type of problem.
If the matrix contains real-numbered features, then the KNN algorithm can be used. If the matrix has words as features, then you can use a Naive Bayes classifier, which is one of the best for text classification. Machine learning has tons of algorithms; you can read about them and apply whichever fits your problem best. I hope this makes sense.
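For the real-valued case, a minimal KNN sketch assuming scikit-learn, with purely illustrative random data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # m=100 instances, n=5 real-valued features
y = (X[:, 0] > 0).astype(int)      # toy labels for illustration

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
print(knn.predict(X[:3]))          # classify by majority vote of 5 neighbors
```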
An interesting but much more general map I found:
http://scikit-learn.org/stable/tutorial/machine_learning_map/
If you have Weka, you can use the Experimenter and run different algorithms on the same dataset to evaluate different models.
This project compares many different classifiers on different typical datasets.
If you have no idea, you could use the simple tool Auto-WEKA, which will test all the different classifiers you select within different constraints. Before using Auto-WEKA, you may need to convert your data to ARFF using Weka or manually (there are many tutorials on YouTube).
The best classifier depends on your data (binary/string/real/tags, patterns, distribution...), on the kind of output to predict (binary class / multi-class / evolving classes / a regression value?), and on the expected performance (time, memory, accuracy). It also depends on whether you want to update your model frequently (i.e., if the data is a stream, better to use an online classifier).
Please note that the best classifier may not be a single classifier but an ensemble of different classifiers.
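As one concrete form of that idea, here is a minimal voting-ensemble sketch with scikit-learn; the particular member models are illustrative, not a recommendation for your data:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Synthetic placeholder data.
X, y = make_classification(n_samples=200, random_state=0)

# Combine three different classifiers by averaging their predicted probabilities.
ensemble = VotingClassifier([
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier()),
    ("nb", GaussianNB()),
], voting="soft")
ensemble.fit(X, y)
print(ensemble.score(X, y))
```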
I am using the Explorer feature for classification. My .arff data file has 10 features with numeric and binary values (only the instance ID is nominal). I have about 16 instances. The class to predict is Yes/No. I have used Naive Bayes, but I cannot interpret the results. Does anyone know how to interpret the results of a Naive Bayes classification?
Naive Bayes doesn't select any important features. As you mentioned, the result of training a Naive Bayes classifier is a mean and a variance for every feature. A new sample is classified as 'Yes' or 'No' according to whether its feature values match the trained means and variances of 'Yes' or of 'No' more closely.
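To see those per-feature means and variances directly, here is a minimal sketch using scikit-learn's GaussianNB rather than WEKA, with random placeholder data (note the attribute was named sigma_ instead of var_ in scikit-learn versions before 1.0):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 10))          # 16 instances, 10 numeric features
y = np.array([0] * 8 + [1] * 8)        # Yes/No encoded as 1/0

nb = GaussianNB().fit(X, y)
print(nb.classes_)   # the two classes
print(nb.theta_)     # per-class mean of each feature
print(nb.var_)       # per-class variance of each feature
```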
You could use other algorithms to find the most informative attributes. In that case you might want to use a decision tree classifier, e.g. J48 in WEKA (the open-source implementation of the C4.5 decision tree algorithm). The first node in the resulting decision tree tells you which feature has the most predictive power.
Even better (as stated by Rushdi Shams in the other post), Weka's Explorer offers purpose-built options to find the most useful attributes in a dataset. These options can be found under the Select attributes tab.
As Sicco said, NB cannot give you the best features. A decision tree is a good choice because its branching can sometimes tell you which feature is important, but not always. To handle anything from simple to complex feature sets, you can use WEKA's Select attributes tab. There you can find search methods and attribute evaluators; depending on your task, you can choose the one that best suits you. They will provide a ranking of the features (either from the training data or from k-fold cross-validation). Personally, I believe decision trees perform poorly if your model is overfitting the dataset; in that case, a ranking of features is the standard way to select the best ones. Most of the time I use InfoGain with the Ranker algorithm. When you see your attributes ranked from 1 to k, it is really easy to figure out which features are required and which are unnecessary.
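Outside WEKA, a minimal sketch of the same InfoGain-plus-Ranker idea, using scikit-learn's mutual_info_classif as a rough analogue (synthetic placeholder data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=3, random_state=0)

# Score each feature by its mutual information with the class labels.
scores = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(scores)[::-1]     # best feature first
for idx in ranking:
    print(f"feature {idx}: {scores[idx]:.3f}")
```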
This is in the context of doing sentiment analysis using the LingPipe machine learning tool. I have to classify whether a sentence in a big paragraph has positive or negative sentiment. I know of the following approach in LingPipe:
Classify the complete paragraph based on its polarity: negative or positive.
Here, I don't yet know the polarity at the sentence level; we are still at the paragraph level. How do I determine the polarity at the sentence level of a paragraph, i.e., whether a sentence in a paragraph is positive or negative? I know that LingPipe is capable of classifying whether a sentence is subjective or objective. So, using this approach, should I:
First train LingPipe on a large set of sentences that are subjective/objective.
Use the trained model to extract all subjective sentences out of a test paragraph.
Train a LingPipe classifier based on the extracted subjective sentences for polarity by manually labeling them as positive/negative.
Now use the trained polarity model: feed it a test subjective sentence (obtained by passing a sentence through the trained subjective/objective model) and then determine whether the sentence is positive or negative?
Does the above approach work? In the approach proposed above, we know that LingPipe is capable of accepting a large piece of text (a paragraph) for polarity classification. Will it do a good job if we just pass it a single subjective sentence for polarity classification? I am confused!
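For the control flow alone, here is a minimal sketch of that two-stage idea, using scikit-learn in place of LingPipe (all training sentences are tiny placeholders; real models need far more labeled data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stage 1: subjective vs. objective sentence filter.
subj_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
subj_clf.fit(["i love this", "it runs at 5 ghz"], ["subjective", "objective"])

# Stage 2: polarity classifier trained on manually labeled subjective sentences.
pol_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
pol_clf.fit(["i love this", "i hate this"], ["positive", "negative"])

# Apply stage 1 to filter, then stage 2 on the surviving sentences.
paragraph = ["i love this", "it runs at 5 ghz", "i hate this"]
for sentence in paragraph:
    if subj_clf.predict([sentence])[0] == "subjective":
        print(sentence, "->", pol_clf.predict([sentence])[0])
```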
You might want to take a look at the multi-level analysis approaches in the literature, e.g.
Li, S., et al. (2010). "Exploiting Combined Multi-level Model for Document Sentiment Analysis," 2010 International Conference on Pattern Recognition.
Yessenalina, A., et al. (2010). "Multi-level Structured Models for Document-level Sentiment Classification," Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1046–1056, MIT, Massachusetts, USA, 9-11 October 2010.
Multi-level analysis approaches are quite common in information retrieval, as in content indexing for vector space similarity search.
Environments such as LingPipe are a good way to get started, but eventually you need to employ lower-level, finer-grained tools such as yura suggested.
Most machine learning libraries, including LingPipe, are row-based (an object with flat features). So if you want to do hierarchical classification with them, you should denormalize your data; for example, you can have features of the paragraph and of the sentence in the same feature set. If you use word-only classification, you can create features such as PARAGRAPH_WORDX=true and SENTENCE_WORDX=true.
Some other toolkits allow you to express your model without denormalization; these are the so-called graphical models, examples being CRFs, ACRFs, Markov models, etc. Implementations can be found in MALLET and FACTORIE.
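To illustrate the denormalization, a minimal sketch using scikit-learn's DictVectorizer; the feature names mirror the PARAGRAPH_/SENTENCE_ convention above and are purely illustrative:

```python
from sklearn.feature_extraction import DictVectorizer

# One row per sentence, carrying both its own words and its paragraph's words.
rows = [
    {"SENTENCE_great": True, "PARAGRAPH_great": True, "PARAGRAPH_boring": True},
    {"SENTENCE_boring": True, "PARAGRAPH_great": True, "PARAGRAPH_boring": True},
]

vec = DictVectorizer()
X = vec.fit_transform(rows)
print(vec.get_feature_names_out())   # flattened paragraph + sentence features
print(X.toarray())
```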
I would like to classify text documents into four categories. I also have a lot of samples that are already classified and can be used for training. I would like the algorithm to learn on the fly; please suggest an optimal algorithm for this requirement.
If by "on the fly" you mean online learning (where training and classification can be interleaved), I suggest the k-nearest neighbor algorithm. It's available in Weka and in the package TiMBL.
A perceptron will also be able to do this.
"Optimal" isn't a well-defined term in this context.
There are several algorithms that can learn on the fly, for example k-nearest neighbors, naive Bayes, and neural networks. You can try out how appropriate each of these methods is on a sample corpus.
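As a concrete example of interleaving training and classification, a minimal sketch with scikit-learn's SGDClassifier and partial_fit (the category names and texts are placeholders):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer keeps the feature space fixed across incoming batches.
vec = HashingVectorizer(n_features=2**16)
clf = SGDClassifier()
classes = ["cat1", "cat2", "cat3", "cat4"]   # all four categories, known upfront

# Train incrementally as new labeled batches arrive.
batch_texts = ["stock prices fell", "new vaccine trial"]
batch_labels = ["cat1", "cat2"]
clf.partial_fit(vec.transform(batch_texts), batch_labels, classes=classes)

# Classify at any point, then keep calling partial_fit on later batches.
print(clf.predict(vec.transform(["market rally today"])))
```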
Since you have unlabeled data, you might want to use a model where this helps. The first thing that comes to my mind is nonlinear NCA: "Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure" (Salakhutdinov & Hinton).
Well... I have to say that document classification is somewhat different from what the rest of you are describing.
Typically, in document classification, even after preprocessing, the test data is extremely large, for example O(N^2)... Therefore it might be too computationally expensive.
Another typical classifier that comes to mind is the discriminative classifier, which doesn't need a generative model of your dataset. After training, all you have to do is feed a single entry to the algorithm and it will be classified.
Good luck with this. For example, you can check E. Alpaydin's book, Introduction to Machine Learning.